Web scraping links from Google search based on when they were posted, using Selenium and BeautifulSoup

Neeharika Kusampudi
4 min read · Aug 15, 2020

Hello!

Now, what's Selenium? All I ever heard of was Aluminium! That was a failed joke! Hahaha…. Jokes aside, let me tell you, this project took my BLOOD, my SWEAT, and my TEARS!! So, I thought, why don't I share it with all of you? Next time you are looking for something like this, you won't have to go through as much trouble as I did.

Photo by Marvin Meyer on Unsplash

What is Selenium? It is a web-scraping tool with a cool feature: it lets you automate the process! It allows you to perform quick, short operations like running a search, clicking the next page, hovering your mouse, etc.

Refer to the documentation for in-depth knowledge.

The goal of this project: I want it to search the term "Selenium" on Google, filter the results by the past 24 hours/week/month, etc., and then extract the links of the result pages, automatically navigating to the next page. Great! Now we know what we want. Let's get to it.

First, we start by installing a web driver onto your computer. I chose to work with ChromeDriver. To install ChromeDriver, you need to identify the Chrome version on your computer and install the respective driver version (if you are wondering how that is done, click on the three dots, then Help, then About Google Chrome).

Identifying the Chrome version.

Next, we need to install the package using the command `pip install selenium`.

Installing the package (I have already installed it on mine).

If you are facing trouble with the installation like I did, it could be that Python is not added to your PATH in the environment variables under System Properties. Here's a YouTube link to help you out with it. Amazing! Your package has been installed! You are ready to code! Python is amazing: it offers you libraries so you don't have to do the dirty work. So, we will be importing what we need from Selenium.

Libraries to import from Selenium.
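A typical set of imports for this project looks like the following sketch; By and Keys are the Selenium helpers used later for locating elements and sending keystrokes:

```python
from selenium import webdriver                   # drives the browser
from selenium.webdriver.common.by import By      # element locator strategies
from selenium.webdriver.common.keys import Keys  # special keys like Enter
import time                                      # pauses between page loads
```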

Okay, it's very crucial to locate your driver for Selenium to work. Remember where you installed your driver, because that will be the path used to access it. I installed mine on the C drive, in the folder called Program Files (x86).

The PATH of the ChromeDriver looks like this:
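A minimal setup sketch, assuming the driver executable sits in Program Files (x86); adjust the path to wherever you saved yours:

```python
# Path to the ChromeDriver executable; adjust to your install location.
PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)  # launches a Chrome window Selenium controls
```

(Newer Selenium 4 releases prefer passing a Service object instead of a bare path, so adjust if your version complains.)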

Perfect, the path is all sorted and the driver is ready! Next, the driver needs to open the link, type the word to be searched, and press the search button when done.

Searching the word "Selenium" on Google.
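A sketch of that step: open Google, find the search box (its name attribute is "q"), type the query, and hit Enter:

```python
driver.get("https://www.google.com")

search_box = driver.find_element(By.NAME, "q")  # Google's search input
search_box.send_keys("Selenium")                # type the search term
search_box.send_keys(Keys.RETURN)               # press Enter to search
time.sleep(2)                                   # let the results load
```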

As you can see below, I have customized the search to results from the past week. You can search for day-old or month-old posts as well. What I am trying to achieve here is automating the filtering process on Google.

Filtering the results to the past week.
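A sketch of the filtering clicks; Google's markup changes often, so these text-based locators are assumptions that may need updating:

```python
# Open the "Tools" bar, then pick "Past week" from the time dropdown.
# NOTE: these locators are assumptions; Google's markup changes often.
driver.find_element(By.XPATH, "//div[text()='Tools']").click()
time.sleep(1)
driver.find_element(By.XPATH, "//div[text()='Any time']").click()
time.sleep(1)
driver.find_element(By.LINK_TEXT, "Past week").click()
time.sleep(2)
```

(A sturdier alternative is to append the tbs=qdr:w parameter to the search URL, which asks Google for past-week results directly.)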

I have used BeautifulSoup to extract the links while Selenium automates the process. Here the driver fetches the page source, and then BeautifulSoup parses the HTML.

We are at the final phase: extracting the links. Google creates a <td> tag for every page number in the pagination bar. Therefore, the total number of <td> tags gives us the total number of result pages.
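In code, that looks roughly like this:

```python
from bs4 import BeautifulSoup

# Hand the rendered HTML from Selenium over to BeautifulSoup.
soup = BeautifulSoup(driver.page_source, "html.parser")

# One <td> per entry in the pagination bar -> total number of pages.
num_pages = len(soup.find_all("td"))
```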

Iterative process to extract links from all the pages.
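A sketch of that loop; the class name 'r' comes from Google's result markup at the time, and the "Next" link text is an assumption that may change:

```python
links = []
for page in range(num_pages):
    # Re-parse the page source for the page we are currently on.
    soup = BeautifulSoup(driver.page_source, "html.parser")

    # Inner loop: every result link on this page.
    for div in soup.find_all("div", class_="r"):
        for a in div.find_all("a"):
            links.append(a.get("href"))

    # Outer loop: move on to the next results page, if there is one.
    try:
        driver.find_element(By.LINK_TEXT, "Next").click()
        time.sleep(2)  # give the next page a moment to load
    except Exception:
        break  # last page reached; no "Next" link
```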

The links of each website are in <a> tags located under a <div> of class 'r'. In the above code, the inner loop extracts all the links on a page and the outer loop iterates over all the pages.

Extracted links list.

Based on the output, I could see that it requires data cleaning: lots of unwanted links and '#'s. This happens because a single result has many <a href> tags. For the cleaning part, I have decided to put the links into a DataFrame, which is easier to clean.

Converting it into a DataFrame.
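A minimal sketch of that conversion using pandas:

```python
import pandas as pd

# One column of raw links; DataFrame methods make the cleanup easy.
df = pd.DataFrame(links, columns=["link"])
df = df.dropna()  # drop entries from <a> tags that had no href at all
```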

For data cleaning, I could see that the unwanted links contained certain recurring terms. Therefore, I eliminated the rows whose links contain any part of those strings.
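A sketch of that filter; the terms in the list below are examples, so swap in whatever noise shows up in your own output:

```python
# Keep only rows whose link does NOT contain any unwanted substring.
unwanted = ["google", "#"]  # example terms; adjust to your own output
for term in unwanted:
    df = df[~df["link"].str.contains(term, regex=False, na=False)]
```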

Finally, I exported the output into a CSV file.

Exporting DataFrame into CSV file.
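One line does it; the filename here is just an example:

```python
df.to_csv("selenium_links.csv", index=False)  # index=False drops row numbers
```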

Reach out to me at any time on LinkedIn and check out my GitHub for the full code. And if you liked this article, give it a few claps. I will sincerely appreciate it.
