Quick guide: How to create a Python-based scraper
The use of web scraping is growing rapidly, especially among large e-commerce companies, which rely on it to gather data for competitive analysis and product research. Web scraping is a method of extracting information from websites. In this tutorial you will learn how to create a Python-based scraper: dive into the code and see how it works.
In the modern world of big data, it's hard to keep track of everything that is going on, and it gets even harder for businesses that need a lot of information to succeed. But first they have to gather this data somehow, which means processing thousands of resources.
There are two ways of gathering data. You can use the API that media websites offer – it is the best way to get all the news, and an API is very easy to use. Unfortunately, not every website offers this service, which leaves us with the second approach – web scraping.
What is web scraping?
This is a method of extracting information from websites. An HTML page is nothing more than a collection of nested tags. The tags form a tree rooted in the <html> tag and break the page into logical pieces. Each tag can have its own descendants (children) and parents.
For example, an HTML page tree can look like this:
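To make the tree concrete, here is a small sketch using only the standard library's html.parser; the page string is an invented example:

```python
from html.parser import HTMLParser

# a tiny made-up page: <html> is the root, everything else nests inside it
page = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

class TreePrinter(HTMLParser):
    """Record each opening tag, indented by its depth in the tree."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

printer = TreePrinter()
printer.feed(page)
print("\n".join(printer.lines))
```

The output shows <head> and <body> as children of <html>, with <title> and <p> nested one level deeper.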
To process this HTML you can work with the text or with the tree. Traversing this tree is what web scraping is: we find just the nodes we need among all this diversity and take the information from them. This approach transforms unstructured HTML data into easy-to-consume information structured in a database or spreadsheet. Data scraping requires a bot that gathers the information and a connection to the website over HTTP or via a web browser. In this guide, we will use Python to create a scraper.
What do we need to do:
- Get URL of a page we want to scrape data from
- Copy or download HTML content of this page
- Process this HTML content and get the required data
This sequence lets us open the required URL, obtain the HTML, and then process it to extract the data we need. But sometimes we need to log in to the website first and then navigate to a specific address to reach the data. In that case we have to add one more step – signing in to the website.
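The three basic steps can be sketched with nothing but the standard library; scrape_title and extract_title are hypothetical helper names, and the <title> tag stands in for whatever data you actually need:

```python
from urllib.request import urlopen

def extract_title(html):
    # step 3: process the HTML content and pull out the required data
    # (here, naively, the text between <title> and </title>)
    start = html.find("<title>") + len("<title>")
    return html[start:html.find("</title>")]

def scrape_title(url):
    # step 1: the URL of the page we want data from is passed in
    # step 2: download the HTML content of that page
    html = urlopen(url).read().decode("utf-8")
    return extract_title(html)
```

The rest of this guide replaces the naive string search with Beautiful Soup, which handles real-world HTML far more robustly.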
We will use the Beautiful Soup library to analyze the HTML content and get all the required data. It is a wonderful Python package for scraping HTML and XML documents.
The Selenium library will help the scraper enter the website and navigate to the required URL within one session. Selenium handles actions such as clicking a button, entering content, and so on.
Let’s dive into the code
First of all, let’s import the libraries we’re going to use.
# import libraries
from selenium import webdriver
from bs4 import BeautifulSoup
Then we need to point Selenium to the browser driver so it can launch the web browser (we will use Google Chrome here). If we don't want the bot to display the browser's graphical interface, we add the "headless" option in Selenium.
Headless web browsers (browsers without a graphical interface) offer automated control of a web page in an environment very similar to the popular web browsers, but all activity happens through a command-line interface or over network communication.
# path to chrome driver
chromedriver = '/usr/local/bin/chromedriver'
options = webdriver.ChromeOptions()
options.add_argument('--headless')
# open a headless browser
browser = webdriver.Chrome(executable_path=chromedriver, options=options)
After we have set up the browser, installed the libraries, and created the environment, we can start working with the HTML. Let's go to the login page and find the identifiers, classes, or names of the fields where the user enters their email address and password.
# go to login page
browser.get('http://playsports365.com/default.aspx')
# search tags by name
email = browser.find_element_by_name('ctl00$MainContent$ctlLogin$_UserName')
password = browser.find_element_by_name('ctl00$MainContent$ctlLogin$_Password')
login = browser.find_element_by_name('ctl00$MainContent$ctlLogin$BtnSubmit')
Then we send the login credentials into these HTML fields and click the submit button to send the data to the server.
# add login credentials
email.send_keys('********')
password.send_keys('*******')
# click on submit button
login.click()
Once we’ve entered the system successfully, we will go to the required page and gather the HTML-content.
# after successful login, go to the "OpenBets" page
browser.get('http://playsports365.com/wager/OpenBets.aspx')
# get HTML content
requiredHtml = browser.page_source
Now that we have the HTML content, the only thing left is to process the data. We will do it with the help of the Beautiful Soup and html5lib libraries.
html5lib is a Python package that implements the HTML5 parsing algorithm, which is heavily influenced by how modern web browsers parse pages. Once we have the normalized structure of the content, we can search for data in any child element of an HTML tag. The information we are looking for lives in a table tag, so that is what we search for.
soup = BeautifulSoup(requiredHtml, 'html5lib')
tables = soup.findChildren('table')
# findChildren returns a list; take the first table found
my_table = tables[0]
We find the parent tag once, then recursively walk through its children and print out the values.
# receiving tags and printing values
rows = my_table.findChildren(['th', 'tr'])
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.text
        print(value)
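If, instead of printing, you want the "database or sheets" output mentioned earlier, the standard csv module is enough. The rows and column names below are made-up placeholders for whatever cell values the loop above collects:

```python
import csv

# hypothetical cell values collected from the table, one list per row
rows = [
    ["Team A", "2", "1.85"],
    ["Team B", "1", "2.10"],
]

with open("open_bets.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "bets", "odds"])  # assumed column names
    writer.writerows(rows)
```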
To execute this program you will need to install Selenium, Beautiful Soup, and html5lib using pip. Once the libraries are installed, running python <program name> will print the values to the console.
That’s how you scrape any website.
If you scrape a website that updates its content frequently – for example, a live sports results table – you should set up a cron job to launch the program at specific intervals.
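A crontab entry is the usual tool for this; if you prefer to stay in Python, a simple loop can play the same role. run_periodically below is a hypothetical helper, written to take a callable so it is easy to test:

```python
import time

def run_periodically(task, interval_seconds, max_runs=None):
    # call `task`, then sleep, then repeat -- a poor man's cron;
    # in production a crontab line such as
    #   */15 * * * * python scraper.py
    # achieves the same thing more reliably
    runs = 0
    while max_runs is None or runs < max_runs:
        task()
        runs += 1
        time.sleep(interval_seconds)
```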
Great! Everything works, the content is getting scraped, the data is filled in and everything would be fine except for one thing – the number of requests that we have to make to get data.
Sometimes the server gets tired of the same client making a flood of requests and bans it. Unfortunately, servers have limited patience too.
In this case, you have to disguise yourself. A ban usually shows up as a 403 error returned for requests sent from a blocked IP. A 403 error means the server is available and able to process the request but refuses to do so. The first problem is easy to solve: we can pretend to be a human by generating a fake user agent – for example with the fake-useragent package – and passing a random combination of OS and browser in our request headers. In most cases, this works fine for collecting the information you are interested in.
But sometimes it's not enough to put time.sleep() in the right places and fill out the request headers – you need more powerful ways to change your IP. To scrape a large amount of data you can:
– Develop your own infrastructure of IP addresses;
– Use Tor – a topic that could fill several large articles of its own, and indeed already has;
– Use a commercial proxy network.
For beginners in web scraping, the best option is to contact a proxy provider – such as Infatica and the like – who can help you set up proxies and take care of all the difficulties of proxy server management. Scraping large amounts of data requires a lot of resources, so there is no need to "reinvent the wheel" by developing your own internal proxy-management infrastructure. Even many of the largest e-commerce companies outsource proxy management to proxy network services, since the #1 priority for most companies is the data, not proxy management.
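Whichever route you choose, the mechanics of switching identities are the same. Here is a minimal sketch with the standard library, assuming a hypothetical pool of proxy endpoints (a provider would supply real ones) and a couple of example browser user-agent strings:

```python
import random
import urllib.request

# placeholder proxy endpoints -- substitute the ones your provider supplies
PROXIES = ["http://203.0.113.1:8080", "http://203.0.113.2:8080"]

# example desktop browser user-agent strings
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
]

def build_opener():
    # route this request through a randomly chosen proxy
    proxy = random.choice(PROXIES)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    # disguise the request with a randomly chosen user agent
    opener.addheaders = [("User-Agent", random.choice(USER_AGENTS))]
    return opener

# usage (commented out -- the proxies above are not real):
# opener = build_opener()
# html = opener.open("http://playsports365.com/").read()
```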
The use of web scraping is growing rapidly, especially among large e-commerce companies. It helps them compete, analyze rivals, track price trends, and research new products. Today data collecting is fashionable and genuinely interesting: you can get data sets that no one has processed before and build something new from them. However, do not forget about the restrictions servers impose, including bans. They exist for a reason – to protect the site from unfriendly requests and DDoS attacks. Treat other people's work with respect: even if a server has no protection, that is not a reason to send unlimited requests, especially if doing so could bring it down – in some jurisdictions that can even carry criminal liability.
We hope this guide was helpful to you. If you are interested in further studying scraping, we recommend this book by Ryan Mitchell.
Have a successful scraping!