Crawling the web with Python 3.x


These days, almost everyone is familiar with the concept of crawling the web: a piece of software that systematically reads web pages and the pages they link to, traversing the world-wide web. It’s what Google does, and countless tech firms crawl web pages for tasks ranging from search to archiving content to statistical analysis. Web crawling has been automated by developers in every programming language around, many times over; a search for web crawling source code yields well over a million hits.

So when I recently came across a need to crawl some web pages for a project I’ve been working on, I figured I could just go find some source code online and hack it into what I need. (Quick aside: the project is a Python library for managing EXIF metadata on digital photos. More on that in a future blog post.)

But I spent a couple of hours searching and playing with the samples I found, and didn’t get anywhere. Mostly because I’m working in Python version 3, and the most popular Python web crawling framework, Scrapy, is only available for Python 2. I found a few Python 3 samples, but they all seemed either too trivial (not avoiding re-scanning the same page, for example) or needlessly complex. So I decided to write my own Python 3.x web crawler, as a fun little learning exercise and also because I need one.

In this blog post I’ll go over how I approached it and explain some of the code, which I posted on GitHub so that others can use it as well. NOTE: the code on GitHub now includes some improvements over the initial version described below.

Getting Started

Python development on Windows has become more common in recent years, and the available tools are rapidly improving. If you’re an IDE type, the popular choices are PyCharm and Python Tools for Visual Studio. PTVS is a great option that gives you the full power of Visual Studio’s debugger as well as some great features such as adaptive code completion (it watches how you use your functions and adjusts its tooltip suggestions accordingly), and I use it often. But frankly, at heart I’m an old-school text-editor type. So I’d been using Sublime as my primary Python dev tool for the last few years, and then this year I moved to Visual Studio Code, the open source editor from Microsoft built on Electron, the same framework that powers Atom. VS Code is off to a very strong start, and it’s evolving quickly; the more I use it, the more I like it.

If you use VS Code for Python development, you can configure a task runner to execute your code when you press Ctrl-Shift-B (for “build”). To set this up, go to the Command Palette (Ctrl-Shift-P) and then select Tasks: Configure Task Runner. That brings up the tasks.json file for the current project/folder, where you can configure it to use the Python interpreter to run your program; a rough example is shown below.
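For reference, a minimal tasks.json of that era looked roughly like this (the task runner schema has changed in newer VS Code releases, so treat this as an illustrative sketch rather than a current reference):

{
    // run the file currently open in the editor with the Python interpreter
    "version": "0.1.0",
    "command": "python",
    "isShellCommand": true,
    "args": ["${file}"],
    "showOutput": "always"
}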

That assumes you have Python installed, of course. I have Python 3.4.3 installed, and for this web crawler project I also installed two popular Python modules:

  • The requests module, which makes it trivially easy to load web pages via HTTP.
  • Beautiful Soup, which automates the scraping of content from web pages and does a great job of hiding the messy details caused by the chaotically inconsistent HTML practices across the world-wide web.

With those tools installed (all free, and all quick downloads with simple installs), I was ready to write my web crawler starter project.

Building Crawlerino

My goal was to create a simple program that provides a framework for handling the repetitive details of web crawling such as loading pages, finding the links, keeping track of what’s been crawled, and so on. Then I can plug custom processing code into this framework for whatever I actually want to do with the pages that have been crawled. Here’s a high-level diagram of the structure of my program:

(Flowchart: high-level structure of the crawler, showing the numbered steps described below.)

And here are a few notes on the numbered steps in the diagram:

1 – creating the queue. I could have used Python’s built-in list type for this, but lists don’t perform well when you repeatedly pull items off the front (every remaining item has to be shifted each time). So I used a deque (double-ended queue) from the collections module, which is designed for this scenario and provides fast, predictable performance.
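To make the difference concrete, here’s a tiny illustration (not code from the repo) of the deque operations the crawler relies on: new links go on the back, and the next page to crawl comes off the front in constant time.

from collections import deque

pagequeue = deque(['http://www.microsoft.com'])     # seed the queue with a starting URL
pagequeue.append('http://www.microsoft.com/en-us')  # newly discovered links go on the back
url = pagequeue.popleft()                           # next page to crawl comes off the front, in O(1) time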

2 – loading the page and finding the links. This is where requests and Beautiful Soup come into play, and they make the code extremely simple compared to the alternatives. Here’s the code:

response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, "html.parser")
links = [a.attrs.get('href') for a in soup.select('a[href]')]

Those three lines read the web page, create a DOM of the page, and extract a list of the targets of all links on the page. It can’t get much simpler than that!

3 – doing something with the crawled pages. This is just a comment in the source code, to be filled in with the specific functionality needed for each use of the crawler. In the specific case that motivated me to write this, I’ll be adding code here to find all the images on each page and analyze their use of EXIF metadata. Beautiful Soup will make that very easy to do.
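To give a feel for what that plug-in step might look like, here’s a hypothetical sketch of the image-gathering part of the EXIF work (the variable names follow the snippets in this post; none of this is in the crawler yet):

from urllib.parse import urljoin, urlparse

# hypothetical: collect the URL of every image on the crawled page for later EXIF analysis
images = [img.attrs.get('src') for img in soup.select('img[src]')]
# resolve relative image paths against the page URL, just as the link handling below does
images = [img if bool(urlparse(img).netloc) else urljoin(url, img) for img in images]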

4 – adding the page’s links to the queue. I’ve implemented a few things here to deal with relative URLs, avoid re-crawling the same page multiple times, and so on. Python’s list comprehensions make it easy to modify the list of links; here’s the code:

from urllib.parse import urldefrag, urljoin, urlparse

# remove fragment identifiers
links = [urldefrag(link)[0] for link in links]
# remove any empty strings
links = list(filter(None, links))
# if it's a relative link, change to absolute
links = [link if bool(urlparse(link).netloc) else urljoin(url, link) for link in links]
# if singledomain=True, remove links to other domains
if singledomain:
    links = [link for link in links if urlparse(link).netloc == domain]
 
# add these links to the queue (except if already crawled)
for link in links:
    if link not in crawled and link not in pagequeue:
        pagequeue.append(link)
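To show how the numbered steps fit together, here’s a simplified sketch of the whole loop. The function name, parameters, and defaults are illustrative assumptions; the actual program on GitHub differs in the details.

from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

import bs4
import requests

def crawl(startpage, maxpages=30, singledomain=True):
    """Crawl pages reachable from startpage, visiting at most maxpages pages.
    Illustrative sketch only; names and defaults are assumptions, not the GitHub code."""
    pagequeue = deque([startpage])          # step 1: queue of pages waiting to be crawled
    crawled = []                            # pages that have already been crawled
    domain = urlparse(startpage).netloc     # domain of the starting page

    while pagequeue and len(crawled) < maxpages:
        url = pagequeue.popleft()           # take the next page off the front of the queue

        # step 2: load the page and find its links
        response = requests.get(url)
        soup = bs4.BeautifulSoup(response.text, "html.parser")
        links = [a.attrs.get('href') for a in soup.select('a[href]')]
        crawled.append(url)

        # step 3: do something with the crawled page (custom processing goes here)

        # step 4: clean up the links and add them to the queue
        links = [urldefrag(link)[0] for link in links]
        links = list(filter(None, links))
        links = [link if bool(urlparse(link).netloc) else urljoin(url, link) for link in links]
        if singledomain:
            links = [link for link in links if urlparse(link).netloc == domain]
        for link in links:
            if link not in crawled and link not in pagequeue:
                pagequeue.append(link)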

And that’s all there is to it. I now have a working Python 3 crawler I can use whenever I want to read web pages for any reason. The entire source code is about 60 lines, and you can download it from GitHub. Here’s an example of it running on the first 30 pages found at Microsoft.com:

(Screenshot: output from a test run of the crawler.)

It’s not super fast, but web crawling is mostly constrained by network bandwidth so I don’t care much about that. And for those who might be tempted to say that a dynamic scripted language like Python isn’t fast enough for these sorts of tasks, I’ll leave you with this little inconvenient truth: Google’s web crawler was originally written in Python (a very old 1.x version) and they didn’t re-write it in highly optimized C++ until they had already become the most dominant search engine in history. A great example of not falling for the allure of premature optimization based on pedantic theories. :)

And now it’s time to get back to my EXIF project and put this crawler to work!

 


4 Comments

  1. Hey,

Interesting article. But what if we need to crawl the entire web to find some “interesting stuff” … The queue would then become too big and we would probably need a database. Imagine we follow every link on each crawled web page (internal and external).

How do you manage such a big task?

    Thanks for your time.

  2. Yes, this is just a simple starting point and you’d need a database if you’re crawling too many pages to fit into an in-memory queue. Generally speaking, I’d use existing search engines for crawling all (or most) of the web, and I use these sorts of tools for specialized tasks that only need to crawl a few thousand pages or less.

  3. Chandresh Kumar Maurya

    @DMAHUGH, is there a way to know how many links there are in total on a website, so that I can be sure I’ve crawled all the links?

    • Generally speaking, I’m not sure everyone would agree on the definition of “all the links on a web site.” For example, if the home page links to pages A and B but not to page C, then should links from page C be included in the total? You can assume that the expression soup.select(‘a[href]’) will return all of the links on a single page, so if your code crawls all of those (as this example does), and then you also crawl the list of links on each of the destination pages, you will have all of the links that are accessible from the starting page. But from the example I gave, we’re not crawling page C, because there is no way to know that page exists if you’re starting from pages that don’t ever link to it.
