Python Web Scraping To Database



Scraping data from a website is one of the most underestimated technical and business moves I can think of. Scraping data online is something every business owner can do to build a copy of a competitor's publicly available data and analyze it to get the most out of it. It can also be used to analyze a specific market and find potential customers. The best thing is that it is all free of charge; it only needs some technical skills, which many people have these days. In this post I am going to do a step by step tutorial on how to scrape data from a website and save it to a database. I will be using BeautifulSoup, a Python library designed to facilitate screen scraping, and MySQL as my database software. You can probably use the same code with some minor changes to work with your desired database software.
Please note that scraping data from the internet can be a violation of the terms of service for some websites. Please do appropriate research before scraping data from the web and/or publishing the gathered data.

GitHub Page for This Project

I have also created a GitHub project for this blog post. I hope you will be able to use the tutorial and code and customize them according to your needs.


Requirements

  1. Basic knowledge of Python. (I will be using Python 3 but Python 2 would probably be sufficient too.)
  2. A machine with a database software installed. (I have MySQL installed on my machine)
  3. An active internet connection.

The following installation instructions are very basic. The installation process for Beautiful Soup, Python, etc. can be a bit more involved, but since it differs across platforms and individual machines, it does not fit into the main objective of this post: scraping data from a website and saving it to a database.

Installing Beautiful Soup

In order to install Beautiful Soup, you need to open a terminal and execute one of the following commands, according to your desired version of Python. Please note that the commands may need to be prefixed with "sudo" in order to run them with administrative privileges.

For Python 2 (if pip does not work, try pip2):
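```
pip install beautifulsoup4
```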

For Python 3:
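```
pip3 install beautifulsoup4
```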

If the above did not work, you can also use the following commands.

For Python 2:
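```
python -m pip install beautifulsoup4   # invoking pip through the interpreter is one common alternative
```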

For Python 3:
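```
python3 -m pip install beautifulsoup4
```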

If you were not able to install beautiful soup, just Google the term “How to install Beautiful Soup” and you will find plenty of tutorials.

Installing MySQLdb

In order to be able to connect to your MySQL databases through Python, you will have to install the MySQLdb library for your Python installation. Since MySQLdb does not support Python 3 at the time of this writing, you will need to use a different library for Python 3. It is called mysqlclient, which is basically a fork of MySQLdb with added support for Python 3 and some other improvements, so using it is practically identical to using the native MySQLdb on Python 2.

To install MySQLdb on Python 2, open terminal and execute the following command:
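```
pip install MySQL-python
```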

To install mysqlclient on Python 3, open terminal and execute the following command:
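```
pip3 install mysqlclient
```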

Installing requests

requests is a Python library used to load HTML data from a URL. In order to install requests on your machine, follow the instructions below.

To install requests on Python 2, open terminal and execute the following command:
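```
pip install requests
```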

To install requests on Python 3, open terminal and execute the following command:
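```
pip3 install requests
```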

Now that we have everything installed and running, let’s get started.

Step by Step Guide on Scraping Data from a Single Web Page

I have created a page with some sample data which we will be scraping data from. Feel free to use this url and test your code. The page we will be scraping in the course of this article is https://howpcrules.com/sample-page-for-web-scraping/.

1. First of all we have to create a file called “scraping_single_web_page.py”.

2. Now, we will start by importing the libraries requests, MySQLdb and BeautifulSoup:
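```python
import requests
import MySQLdb
from bs4 import BeautifulSoup
```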

3. Let us create some variables where we will save our database connection data. To do so, add the lines below to your "scraping_single_web_page.py" file.
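The exact variable names are up to you; in the sketches below I will use the following (adjust the values to your own MySQL setup):

```python
db_host = "localhost"
db_user = "scraping_sample_user"
db_password = "your_password_here"
db_name = "scraping_sample"
```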

4. We also need a variable to store the URL we want to scrape. After that, we will use the imported library "requests" to load the web page's HTML plain text into the variable "plain_html_text". In the next line, we will use BeautifulSoup to parse that text into a nested, searchable structure, "soup", which will be a big help to us in reading out the web page's content efficiently.

Your whole code should look like this so far:
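Here is a sketch of the script up to this point (the fully working version is in the GitHub project):

```python
import requests
import MySQLdb
from bs4 import BeautifulSoup

# Database connection data (step 3)
db_host = "localhost"
db_user = "scraping_sample_user"
db_password = "your_password_here"
db_name = "scraping_sample"

# Load the page and parse it (step 4)
url_to_scrape = "https://howpcrules.com/sample-page-for-web-scraping/"
plain_html_text = requests.get(url_to_scrape).text
soup = BeautifulSoup(plain_html_text, "html.parser")
```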

These few lines of code were enough to load the data from the web and parse it. Now we will start the task of finding the specific elements we are searching for. To do so, we have to take a look at the page's HTML and find the elements we want to save. All major web browsers offer an option to view the HTML source. If you cannot view the HTML source in your browser, you can also add the following line to the end of your "scraping_single_web_page.py" file to see the loaded HTML data in your terminal window.
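```python
print(soup.prettify())
```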

To execute the code, open a terminal, navigate to the folder containing your "scraping_single_web_page.py" file and run it with "python scraping_single_web_page.py" for Python 2 or "python3 scraping_single_web_page.py" for Python 3. You will see the HTML data printed out in your terminal window.

5. Scroll down until you find my HTML comment "<!-- Start Sample Data to be Scraped -->". This is where the actual data we need starts. As you can see, the name of the class, "Exercise: Data Structures and Algorithms", is written inside an <h3> tag. Since this is the only <h3> tag on the whole page, we can use "soup.h3" to get this specific tag and its contents. We will now use the following line to get the tag the name of the class is written in and save its content into the variable "name_of_class". We will also use Python's strip() function to remove any whitespace to the left and right of the text. (Please note that the line print(soup.prettify()) was only there to print out the HTML data and can be deleted now.)
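```python
name_of_class = soup.h3.text.strip()
```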

6. If you scroll down a little further, you will see that the table with the basic information about the class is identified by summary="Basic data for the event" inside its <table> tag. So we will save the parsed table in a variable called "basic_data_table". If you take a closer look at the tags inside the table, you will realize that the data values themselves (as opposed to their labels) are stored inside <td> tags. These <td> tags appear in the following order from top to bottom:

According to the above, all text inside the <td> tags is relevant and needs to be stored in appropriate variables. To do so, we first have to parse all <td>s inside our table.
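A sketch of this step, using the summary attribute mentioned above:

```python
basic_data_table = soup.find("table", {"summary": "Basic data for the event"})
basic_data_cells = basic_data_table.find_all("td")
```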

7. Now that we have all <td>s stored in the variable "basic_data_cells", we have to go through it and save the data accordingly. (Please note that array indices start from zero, so the numbers in the above picture are shifted by one.)
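A sketch of the assignments; apart from the language (index 12, which reappears in step 12), the field names below are placeholders that should follow the order of the <td> cells on the page:

```python
name_of_event = basic_data_cells[0].text.strip()  # placeholder field name
# ... one variable per <td> cell, following the order shown above ...
language = basic_data_cells[12].text.strip()
```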

8. Let's continue with the course dates. As with the previous table, we have to parse the tables where the dates are written. The only difference is that for the dates there is not just one table to be scraped but several. We will use the following to collect all of these tables into one variable:
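A sketch, under the assumption that every table on the page apart from the basic data table holds course dates; depending on the markup you may prefer to match the date tables by their own summary attribute instead:

```python
dates_tables = soup.find_all("table")[1:]  # skip the basic data table parsed above
```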

9. We now have to go through all of the tables and save the data accordingly. This means we have to create a for loop to iterate through the tables. In each table there is always one row (a <tr> tag) acting as the header, with <th> cells inside (no <td>s). After the header there can be one to several rows with the data that interests us. So inside our for loop over the tables, we also have to iterate through the individual rows (<tr>s) and only process those with at least one <td> cell, in order to exclude the header row. Then we only have to save the contents of each cell into appropriate variables.

All of this translates into code as follows:
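A sketch of the loop; the variable cells and the index 9 for max_participants are taken from step 13 below, while the remaining column names are placeholders:

```python
for dates_table in dates_tables:
    for row in dates_table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) > 0:  # skip the header row, which only contains <th> cells
            day = cells[0].text.strip()  # placeholder column name
            # ... one variable per <td> cell ...
            max_participants = cells[9].text.strip()
```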

Please note that the above code reads the data and overwrites the same variables on every iteration, so we have to create a database connection and save the data within each iteration.

Saving Scraped Data into a Database

Now, we are all set to create our tables and save the scraped data. To do so please follow the steps below.

10. Open your MySQL client (phpMyAdmin, Sequel Pro, etc.) on your machine and create a database with the name "scraping_sample". You also have to create a user named "scraping_sample_user". Do not forget to grant at least write privileges on the database "scraping_sample" to the user "scraping_sample_user".

11. After you have created the database, navigate to the "scraping_sample" database and execute the following commands in your MySQL command line.
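A minimal sketch of the two tables; the column lists are placeholders and should mirror the fields you scrape in steps 7 and 9:

```sql
CREATE TABLE classes (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255),
    language VARCHAR(100)
    -- ... one column per scraped field ...
);

CREATE TABLE events (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    class_id INT NOT NULL,
    day VARCHAR(100),
    max_participants VARCHAR(100),
    -- ... one column per scraped field ...
    FOREIGN KEY (class_id) REFERENCES classes(id) ON DELETE CASCADE
);
```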

Now you have two tables, with the classes in the first one and the corresponding events in the second. We have also created a foreign key from the events table to the classes table, and added a constraint so that the events associated with a class are deleted if the class is removed (ON DELETE CASCADE).

We can go back to our code “scraping_single_web_page.py” and start with the process of saving data to the database.

12. In your code, navigate to the end of step 7, the line “language = basic_data_cells[12].text.strip()” and add the following below that to be able to save the class data:
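A sketch, reusing the connection variables from step 3; the column list in the INSERT is a placeholder and has to match your classes table:

```python
db = MySQLdb.connect(host=db_host, user=db_user, passwd=db_password, db=db_name)
cursor = db.cursor()
cursor.execute(
    "INSERT INTO classes (name, language) VALUES (%s, %s)",
    (name_of_class, language)
)
db.commit()

# Fetch the id of the class we just inserted
cursor.execute("SELECT LAST_INSERT_ID()")
class_id = cursor.fetchone()[0]
```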

Here, we use the MySQLdb library to establish a connection to the MySQL server and insert the data into the table "classes". After that, we execute a query to get back the id of the class we just inserted and save the value in the "class_id" variable. We will use this id to add the corresponding events to the events table.

13. We will now save each and every event into the database. To do so, navigate to the end of step 9, the line "max_participants = cells[9].text.strip()" and add the following below it. Please note that the code has to go directly below that line and stay inside the last if statement.
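Again a sketch; adjust the column list to match your events table:

```python
# This goes inside the inner for loop and the if statement from step 9
cursor.execute(
    "INSERT INTO events (class_id, day, max_participants) VALUES (%s, %s, %s)",
    (class_id, day, max_participants)
)
db.commit()
```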

Here, we are using the variable "class_id" first mentioned in step 12 to link the events to the class we just added.

Scraping Data from Multiple Similar Web Pages

This is the easiest part of all. The code will work just fine if you have different but similar web pages you would like to scrape data from. Just put the whole code, excluding steps 1-3, in a for loop in which the "url_to_scrape" variable is dynamically generated. I have created a sample script where the same page is scraped a few times over to illustrate this process. To check out the script and a fully working version of the example above, navigate to my "Python-Scraping-to-Database-Sample" GitHub page.
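The idea, as a sketch:

```python
urls_to_scrape = [
    "https://howpcrules.com/sample-page-for-web-scraping/",
    # ... more pages with the same structure ...
]

for url_to_scrape in urls_to_scrape:
    plain_html_text = requests.get(url_to_scrape).text
    soup = BeautifulSoup(plain_html_text, "html.parser")
    # ... steps 5 to 13 from above ...
```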

Web Scraping Reference: A Simple Cheat Sheet for Web Scraping with Python

Once you’ve put together enough web scrapers, you start to feel like you can do it in your sleep. I’ve probably built hundreds of scrapers over the years for my own projects, as well as for clients and students in my web scraping course.

Occasionally though, I find myself referencing documentation or re-reading old code looking for snippets I can reuse. One of the students in my course suggested I put together a “cheat sheet” of commonly used code snippets and patterns for easy reference.

I decided to publish it publicly as well – as an organized set of easy-to-reference notes – in case they’re helpful to others.

While it’s written primarily for people who are new to programming, I also hope that it’ll be helpful to those who already have a background in software or python, but who are looking to learn some web scraping fundamentals and concepts.

Table of Contents:

  1. Extracting Content from HTML
  2. Storing Your Data
  3. More Advanced Topics

Useful Libraries

For the most part, a scraping program deals with making HTTP requests and parsing HTML responses.

I always make sure I have requests and BeautifulSoup installed before I begin a new scraping project. From the command line:
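```
pip install requests beautifulsoup4
```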

Then, at the top of your .py file, make sure you’ve imported these libraries correctly.
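```python
import requests
from bs4 import BeautifulSoup
```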

Making Simple Requests

Make a simple GET request (just fetching a page)
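For example, with a placeholder URL:

```python
response = requests.get("https://example.com/some-page")
```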

Make a POST requests (usually used when sending information to the server like submitting a form)
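With a placeholder URL and form fields:

```python
response = requests.post("https://example.com/submit", data={"name": "value"})
```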

Pass query arguments aka URL parameters (usually used when making a search query or paging through results)
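With placeholder parameters:

```python
response = requests.get(
    "https://example.com/search",
    params={"q": "web scraping", "page": 2},
)
```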

Inspecting the Response

See what response code the server sent back (useful for detecting 4XX or 5XX errors)
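```python
print(response.status_code)
```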

Access the full response as text (get the HTML of the page in a big string)
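```python
html_text = response.text
```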

Look for a specific substring of text within the response
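For example (the substring is just an illustration):

```python
if "Out of stock" in response.text:
    print("Item is not available")
```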


Check the response’s Content Type (see if you got back HTML, JSON, XML, etc)
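```python
print(response.headers.get("content-type"))
```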

Extracting Content from HTML

Now that you’ve made your HTTP request and gotten some HTML content, it’s time to parse it so that you can extract the values you’re looking for.

Using Regular Expressions

Using Regular Expressions to look for HTML patterns is famously NOT recommended at all.

However, regular expressions are still useful for finding specific string patterns like prices, email addresses or phone numbers.

Run a regular expression on the response text to look for specific string patterns:
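For example, looking for price-like strings (the pattern is only an illustration):

```python
import re

prices = re.findall(r"\$\d+\.\d{2}", response.text)
```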

Using BeautifulSoup

BeautifulSoup is widely used due to its simple API and its powerful extraction capabilities. It has many different parser options that allow it to understand even the most poorly written HTML pages – and the default one works great.

Compared to libraries that offer similar functionality, it’s a pleasure to use. To get started, you’ll have to turn the HTML text that you got in the response into a nested, DOM-like structure that you can traverse and search.
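```python
soup = BeautifulSoup(response.text, "html.parser")
```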

Look for all anchor tags on the page (useful if you’re building a crawler and need to find the next pages to visit)
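```python
links = soup.find_all("a")
```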

Look for all tags with a specific class attribute (eg <li>..</li>)
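With a made-up class name:

```python
results = soup.find_all("li", class_="search-result")
```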

Look for the tag with a specific ID attribute (eg: <div>..</div>)
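With a made-up id:

```python
sidebar = soup.find("div", id="sidebar")
```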

Look for nested patterns of tags (useful for finding generic elements, but only within a specific section of the page)
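With made-up element names:

```python
links_in_results = soup.find("div", id="search-results").find_all("a")
```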

Look for all tags matching CSS selectors (similar query to the last one, but might be easier to write for someone who knows CSS)
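Again with made-up names:

```python
links_in_results = soup.select("div#search-results a")
```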

Get a list of strings representing the inner contents of a tag (this includes both the text nodes as well as the text representation of any other nested HTML tags within)
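One way to do this (the id is made up):

```python
inner_contents = [str(child) for child in soup.find("div", id="description").contents]
```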

Return only the text contents within this tag, but ignore the text representation of other HTML tags (useful for stripping out pesky <span>, <strong>, <i>, or other inline tags that might show up sometimes)
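With the same made-up id:

```python
text_only = soup.find("div", id="description").get_text()
```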


Convert the text that you’re extracting from Unicode to ASCII if you’re having issues printing it to the console or writing it to files
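In Python 3, something like:

```python
ascii_text = text_only.encode("ascii", "ignore").decode("ascii")
```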

Get the attribute of a tag (useful for grabbing the src attribute of an <img> tag or the href attribute of an <a> tag)
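```python
image_url = soup.find("img")["src"]
link_target = soup.find("a").get("href")
```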

Putting several of these concepts together, here’s a common idiom: iterating over a bunch of container tags and pulling out content from each of them
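A sketch with made-up class and tag names:

```python
for product in soup.find_all("div", class_="product"):
    name = product.find("h2").get_text(strip=True)
    price = product.find("span", class_="price").get_text(strip=True)
    print(name, price)
```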



Using XPath Selectors

BeautifulSoup doesn’t currently support XPath selectors, and I’ve found them to be really terse and more of a pain than they’re worth. I haven’t found a pattern I couldn’t parse using the above methods.

If you’re really dedicated to using them for some reason, you can use the lxml library instead of BeautifulSoup, as described here.
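A minimal sketch with lxml, in case you do want XPath:

```python
from lxml import html

tree = html.fromstring(response.text)
link_urls = tree.xpath("//a/@href")
```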

Storing Your Data

Now that you’ve extracted your data from the page, it’s time to save it somewhere.

Note: The implication in these examples is that the scraper went out and collected all of the items, and then waited until the very end to iterate over all of them and write them to a spreadsheet or database.

I did this to simplify the code examples. In practice, you’d want to store the values you extract from each page as you go, so that you don’t lose all of your progress if you hit an exception towards the end of your scrape and have to go back and re-scrape every page.

Writing to a CSV

Probably the most basic thing you can do is write your extracted items to a CSV file. By default, each row that is passed to the csv.writer object to be written has to be a python list.

In order for the spreadsheet to make sense and have consistent columns, you need to make sure all of the items that you’ve extracted have their properties in the same order. This isn’t usually a problem if the lists are created consistently.
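A sketch, assuming the items were collected into a list of lists called scraped_items:

```python
import csv

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "price"])  # header row
    for item in scraped_items:          # each item is a list like ["Widget", "9.99"]
        writer.writerow(item)
```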

If you’re extracting lots of properties about each item, sometimes it’s more useful to store the item as a python dict instead of having to remember the order of columns within a row. The csv module has a handy DictWriter that keeps track of which column is for writing which dict key.
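A sketch, assuming each item in scraped_items is a dict:

```python
import csv

fieldnames = ["name", "price", "url"]
with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for item in scraped_items:  # each item is a dict with the keys listed above
        writer.writerow(item)
```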

Writing to a SQLite Database

You can also use a simple SQL insert if you’d prefer to store your data in a database for later querying and retrieval.
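A sketch using the sqlite3 module from the standard library, with made-up columns:

```python
import sqlite3

conn = sqlite3.connect("scraped_data.db")
conn.execute("CREATE TABLE IF NOT EXISTS items (name TEXT, price TEXT, url TEXT)")
for item in scraped_items:
    conn.execute(
        "INSERT INTO items (name, price, url) VALUES (?, ?, ?)",
        (item["name"], item["price"], item["url"]),
    )
conn.commit()
conn.close()
```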

More Advanced Topics

These aren’t really things you’ll need if you’re building a simple, small scale scraper for 90% of websites. But they’re useful tricks to keep up your sleeve.

Javascript Heavy Websites

Contrary to popular belief, you do not need any special tools to scrape websites that load their content via Javascript. In order for the information to get from their server and show up on a page in your browser, that information had to have been returned in an HTTP response somewhere.

It usually means that you won’t be making an HTTP request to the page’s URL that you see at the top of your browser window, but instead you’ll need to find the URL of the AJAX request that’s going on in the background to fetch the data from the server and load it into the page.

There’s not really an easy code snippet I can show here, but if you open the Chrome or Firefox Developer Tools, you can load the page, go to the “Network” tab and then look through all of the requests that are being sent in the background to find the one that’s returning the data you’re looking for. Start by filtering the requests to only XHR or JS to make this easier.

Once you find the AJAX request that returns the data you’re hoping to scrape, then you can make your scraper send requests to this URL, instead of to the parent page’s URL. If you’re lucky, the response will be encoded with JSON which is even easier to parse than HTML.
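For example, once you have found the background request (the URL below is made up):

```python
response = requests.get("https://example.com/api/products?page=1")
data = response.json()  # a plain Python dict or list, no HTML parsing needed
```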

Content Inside Iframes

This is another topic that causes a lot of hand wringing for no reason. Sometimes the page you’re trying to scrape doesn’t actually contain the data in its HTML, but instead it loads the data inside an iframe.

Again, it’s just a matter of making the request to the right URL to get the data back that you want. Make a request to the outer page, find the iframe, and then make another HTTP request to the iframe’s src attribute.
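A sketch with a placeholder URL; urljoin handles the case where the iframe's src is relative:

```python
from urllib.parse import urljoin

outer = requests.get("https://example.com/page-with-iframe")
soup = BeautifulSoup(outer.text, "html.parser")
iframe_url = urljoin(outer.url, soup.find("iframe")["src"])
inner = requests.get(iframe_url)
```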

Sessions and Cookies

While HTTP is stateless, sometimes you want to use cookies to identify yourself consistently across requests to the site you’re scraping.

The most common example of this is needing to login to a site in order to access protected pages. Without the correct cookies sent, a request to the URL will likely be redirected to a login form or presented with an error response.

However, once you successfully login, a session cookie is set that identifies who you are to the website. As long as future requests send this cookie along, the site knows who you are and what you have access to.
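A sketch with a made-up login URL and form field names:

```python
session = requests.Session()
session.post(
    "https://example.com/login",
    data={"username": "me", "password": "secret"},
)
# The session keeps the cookies, so this request is authenticated
profile_page = session.get("https://example.com/account")
```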

Delays and Backing Off

If you want to be polite and not overwhelm the target site you’re scraping, you can introduce an intentional delay or lag in your scraper to slow it down.


Some also recommend adding a backoff that’s proportional to how long the site took to respond to your request. That way if the site gets overwhelmed and starts to slow down, your code will automatically back off.
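A sketch of both approaches, assuming a list of URLs called urls_to_scrape:

```python
import time

for url in urls_to_scrape:
    start = time.time()
    response = requests.get(url)
    response_time = time.time() - start

    # Either sleep for a fixed amount between requests...
    time.sleep(2)
    # ...or back off proportionally to how long the site took to respond:
    # time.sleep(10 * response_time)
```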


Spoofing the User Agent

By default, the requests library sets the User-Agent header on each request to something like “python-requests/2.12.4”. You might want to change it to identify your web scraper, perhaps providing a contact email address so that an admin from the target website can reach out if they see you in their logs.

More commonly, this is used to make it appear that the request is coming from a normal web browser, and not a web scraping program.
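For example (the User-Agent strings are only illustrations):

```python
headers = {
    # Either identify your scraper and give a contact address...
    "User-Agent": "my-scraper/1.0 (contact: admin@example.com)",
    # ...or copy the User-Agent string of a normal web browser.
}
response = requests.get("https://example.com/some-page", headers=headers)
```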

Using Proxy Servers

Even if you spoof your User Agent, the site you are scraping can still see your IP address, since they have to know where to send the response.

If you’d like to obfuscate where the request is coming from, you can use a proxy server in between you and the target site. The scraped site will see the request coming from that server instead of your actual scraping machine.

If you’d like to make your requests appear to be spread out across many IP addresses, then you’ll need access to many different proxy servers. You can keep track of them in a list and then have your scraping program simply go down the list, picking off the next one for each new request, so that the proxy servers get even rotation.
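A sketch with made-up proxy addresses, rotating through them with itertools.cycle:

```python
import itertools

proxy_pool = itertools.cycle([
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
])

for url in urls_to_scrape:
    proxy = next(proxy_pool)
    response = requests.get(url, proxies={"http": proxy, "https": proxy})
```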

Setting Timeouts

If you’re experiencing slow connections and would prefer that your scraper moved on to something else, you can specify a timeout on your requests.
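For example:

```python
try:
    response = requests.get("https://example.com/slow-page", timeout=10)  # seconds
except requests.exceptions.Timeout:
    print("Request timed out, moving on")
```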

Handling Network Errors

Just as you should never trust user input in web applications, you shouldn’t trust the network to behave well on large web scraping projects. Eventually you’ll hit closed connections, SSL errors or other intermittent failures.
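A sketch that also treats 4XX and 5XX responses as failures:

```python
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise on 4XX/5XX status codes
except requests.exceptions.RequestException as e:
    # Covers connection errors, SSL errors, timeouts and bad status codes
    print("Request failed:", e)
```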


Learn More

If you’d like to learn more about web scraping, I currently have an ebook and online course that I offer, as well as a free sandbox website that’s designed to be easy for beginners to scrape.


You can also subscribe to my blog to get emailed when I release new articles.