Web scraping links from a Google search based on when they were posted, using Selenium and BeautifulSoup
Hello!
Now, what's Selenium? All I ever heard of was Aluminium! That was a failed joke! Hahaha… Jokes aside, let me tell you, this project took my BLOOD, my SWEAT, and my TEARS!! So I thought, why don't I share it with all of you? Next time you are looking for something like this, you won't have to go through as much trouble as I did.
What is Selenium? It's a browser-automation tool that doubles as a web-scraping workhorse, and that automation is its coolest feature. It lets you script quick operations like performing a search, clicking through to the next page, hovering your mouse, etc.
The goal of this project: search the term “Selenium” on Google, filter the results by when they were posted (past 24 hours, week, month, etc.), and extract the links from the result pages, automatically navigating from page to page. Great! Now we know what we want. Let's get to it.
First, we start by installing a web driver onto your computer. I chose to work with ChromeDriver. To install ChromeDriver, you need to identify the Chrome version on your computer and download the matching driver version. (If you are wondering how that's done: click the three dots, then Help, then About Google Chrome.)
Next, we need to install the Selenium package, using the command pip install selenium.
If you are facing trouble with the installation like I did, it could be that Python is not added to your PATH environment variable (under System Properties). There are plenty of YouTube tutorials that walk you through fixing it. Amazing! Your package has been installed! You are ready to code! Python is amazing; it offers you libraries that do the dirty work for you. So we will be importing the Selenium libraries.
Okay, it's crucial to tell Selenium where your driver is located. Remember where you installed the driver, because that will be the path used to access it. I installed mine on the C drive, in the folder called Program Files (x86).
Perfect, the path is all sorted and the driver is ready! Next, the driver needs to open the link, type the word to be searched, and submit the search.
I have customized the search to the past week; you can just as easily ask for day-old or month-old posts. What I am doing here is automating Google's date-filtering step.
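That filter corresponds to the tbs=qdr parameter Google adds to the search URL when you pick Tools → Any time, so instead of clicking through the menu you can build the filtered URL directly. The helper below is my own sketch:

```python
from urllib.parse import quote_plus

def search_url(query, recency="w"):
    """Build a Google search URL restricted by recency:
    'h' = past hour, 'd' = past 24 hours, 'w' = past week,
    'm' = past month, 'y' = past year."""
    return ("https://www.google.com/search?q=" + quote_plus(query)
            + "&tbs=qdr:" + recency)

# driver.get(search_url("Selenium", "w"))  # jump straight to the past week's results
```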
I have used BeautifulSoup to extract the links while Selenium automates the browsing. The driver fetches the page source, then BeautifulSoup parses the HTML.
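The hand-off between the two libraries is essentially a one-liner: Selenium exposes the rendered page as driver.page_source, and BeautifulSoup turns that string into a searchable tree. A sketch, using Python's built-in html.parser:

```python
from bs4 import BeautifulSoup

def parse_page(driver):
    # driver.page_source is the full HTML of whatever page the driver is on
    return BeautifulSoup(driver.page_source, "html.parser")
```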
We are at the final phase: extracting the links. Google renders its pagination bar as a table, with a td tag for every page, so counting the td tags gives you the total number of pages.
The link to each result website sits in an <a> tag located under a <div> of class ‘r’. The inner loop extracts all the links on a page, and the outer loop iterates over the pages.
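Putting those two observations together (td tags for the page count, a tags under div class 'r' for the links), the extraction can be sketched as below. Note that Google changes its markup often, so the class name 'r' is only what it was when this was written:

```python
from bs4 import BeautifulSoup

def page_count(soup):
    # Google's pagination bar is a table with one <td> per entry
    return len(soup.find_all("td"))

def extract_links(soup):
    # Each organic result: <div class="r"><a href="..."> ... </a></div>
    links = []
    for div in soup.find_all("div", class_="r"):
        for a in div.find_all("a", href=True):  # inner loop: links on this page
            links.append(a["href"])
    return links

# Outer loop over pages (the "Next" navigation call is illustrative):
# for _ in range(page_count(first_page_soup)):
#     all_links.extend(extract_links(BeautifulSoup(driver.page_source, "html.parser")))
#     driver.find_element_by_link_text("Next").click()
```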
Looking at the output, I could see it required data cleaning: lots of unwanted links, plus bare ‘#’ entries. This happens because a single result can carry many <a href> tags. For the cleaning part, I decided to put everything into a DataFrame, which is easier to clean.
For the data cleaning, I noticed a few links containing certain unwanted terms, so I eliminated every row whose link contains one of those substrings.
Finally, I exported the output into a CSV file.
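In pandas, the whole clean-up-and-export pipeline fits in a few lines. The sample links and the substrings I filter on here are purely illustrative; swap in whatever junk patterns show up in your own output:

```python
import re

import pandas as pd

# Illustrative sample of raw scraper output
links = [
    "https://www.selenium.dev/documentation/",
    "#",
    "https://webcache.googleusercontent.com/search?q=cache:xyz",
    "https://medium.com/some-article",
    "#",
]

df = pd.DataFrame({"link": links})
df = df[df["link"] != "#"]                    # drop the bare '#' anchors
unwanted = ["webcache", "/search?"]           # substrings marking junk rows
pattern = "|".join(map(re.escape, unwanted))  # escape so '?' isn't treated as regex
df = df[~df["link"].str.contains(pattern)]    # keep rows without those terms
df.to_csv("links.csv", index=False)           # final export
```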