These days, most everyone is familiar with the concept of crawling the web: a piece of software that systematically reads web pages and the pages they link to, traversing the world-wide web. It’s what Google does, and countless tech firms crawl web pages for tasks ranging from search to archiving content to statistical analysis. Web crawling has been automated many times over, in just about every programming language around; a search for web crawling source code yields well over a million hits.
So when I recently came across a need to crawl some web pages for a project I’ve been working on, I figured I could just go find some source code online and hack it into what I need. (Quick aside: the project is a Python library for managing EXIF metadata on digital photos. More on that in a future blog post.)
But I spent a couple of hours searching and playing with the samples I found, and didn’t get anywhere, mostly because I’m working in Python 3 and the most popular Python web crawling framework, Scrapy, is only available for Python 2. I found a few Python 3 samples, but they all seemed either too trivial (not avoiding re-scanning the same page, for example) or needlessly complex. So I decided to write my own Python 3 web crawler, as a fun little learning exercise and also because I need one.
In this blog post I’ll go over how I approached it and explain some of the code, which I posted on GitHub so that others can use it as well. NOTE: the code on GitHub now includes some improvements over the initial version described below.
Python development on Windows has become more common in recent years, and the available tools are rapidly improving. If you’re an IDE type, the popular choices are PyCharm and Python Tools for Visual Studio. PTVS is a great option that gives you the full power of Visual Studio’s debugger as well as some great features such as adaptive code completion (it watches how you use your functions and adjusts its tooltip suggestions accordingly), and I use it often. But frankly, at heart I’m an old-school text-editor type. So I used Sublime as my primary Python dev tool for the last few years, and then this year I moved to Visual Studio Code, the open source text editor from Microsoft built on Electron, the same shell that powers Atom. VS Code is off to a very strong start, and it’s evolving quickly; the more I use it, the more I like it.
If you use VS Code for Python development, you can configure a task runner to execute your code when you press Ctrl-Shift-B (for “build”). To set this up, open the Command Palette (Ctrl-Shift-P) and select Tasks: Configure Task Runner. That brings up the tasks.json file for the current project/folder, where you can configure the task runner to use the Python interpreter to run your program.
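For reference, a tasks.json along these lines worked in early versions of VS Code; the exact schema has changed between releases, so treat this as a sketch rather than a current recipe:

```json
{
    "version": "0.1.0",
    "command": "python",
    "isShellCommand": true,
    "args": ["${file}"],
    "showOutput": "always"
}
```

With this in place, Ctrl-Shift-B runs the currently open file through the Python interpreter and shows its output in the editor.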
That assumes you have Python installed, of course. I have Python 3.4.3 installed, and for this web crawler project I also installed two popular Python modules:
- The requests module, which makes it trivially easy to load web pages via HTTP.
- Beautiful Soup, which automates the scraping of content from web pages and does a great job of hiding the messy details caused by the chaotically inconsistent HTML practices across the world-wide web.
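Both modules come from PyPI; note that Beautiful Soup’s package name is beautifulsoup4, even though you import it as bs4:

```shell
pip install requests beautifulsoup4
```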
With those tools installed (all free, and all quick downloads with simple installs), I was ready to write my web crawler starter project.
My goal was to create a simple program that provides a framework for handling the repetitive details of web crawling such as loading pages, finding the links, keeping track of what’s been crawled, and so on. Then I can plug custom processing code into this framework for whatever I actually want to do with the pages that have been crawled. Here’s a high-level diagram of the structure of my program:
And here are a few notes on the numbered steps in the diagram:
1 – creating the queue. I could have used Python’s built-in list data type for this, but lists perform poorly if you’re repeatedly pulling items off the front of them (because every element after the removed one has to be shifted down in memory each time, making each removal O(n)). So I used a deque (double-ended queue) from the collections module, which is designed for exactly this scenario and provides fast, predictable appends and pops at both ends.
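In miniature, the queue handling looks like this (the URLs are hypothetical placeholders):

```python
from collections import deque

# Seed the queue with a starting URL, then add more as they're discovered.
pagequeue = deque(['https://example.com/'])
pagequeue.append('https://example.com/about')

# popleft() removes from the front in O(1) time, unlike list.pop(0).
next_url = pagequeue.popleft()
# next_url is 'https://example.com/'
```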
2 – loading the page and finding the links. This is where requests and Beautiful Soup come into play, and they make the code extremely simple compared to the alternatives. Here’s the code:
```python
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, "html.parser")
links = [a.attrs.get('href') for a in soup.select('a[href]')]
```
Those three lines read the web page, create a DOM of the page, and extract a list of the targets of all links on the page. It can’t get much simpler than that!
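To see the parsing half of that in action without a network request, you can feed Beautiful Soup a static HTML string as a stand-in for response.text:

```python
import bs4

# A hypothetical stand-in for response.text from requests.get(url):
html = ('<p><a href="/about">About</a> '
        '<a href="https://example.com/">Home</a> '
        '<a name="anchor">no href</a></p>')
soup = bs4.BeautifulSoup(html, "html.parser")

# select('a[href]') matches only anchors that actually carry an href attribute.
links = [a.attrs.get('href') for a in soup.select('a[href]')]
# links is now ['/about', 'https://example.com/']
```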
3 – doing something with the crawled pages. This is just a comment in the source code, to be filled in with the specific functionality needed for each use of the crawler. In the specific case that motivated me to write this, I’ll be adding code here to find all the images on each page and analyze their use of EXIF metadata. Beautiful Soup will make that very easy to do.
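As a rough sketch of what that plug-in code might look like for my use case, the page’s images can be collected with the same Beautiful Soup pattern used for links (the EXIF analysis itself would come later and is out of scope here):

```python
import bs4

# Hypothetical page content standing in for a fetched page:
html = '<div><img src="photo1.jpg" alt="sunset"><img src="photo2.jpg"></div>'
soup = bs4.BeautifulSoup(html, "html.parser")

# Collect the URL of every image on the page for later EXIF analysis.
images = [img.attrs.get('src') for img in soup.select('img[src]')]
# images is now ['photo1.jpg', 'photo2.jpg']
```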
4 – adding the page’s links to the queue. I’ve implemented a few things here to deal with relative URLs, avoid re-crawling the same page multiple times, and so on. Python’s list comprehensions make it easy to modify the list of links; here’s the code:
```python
# remove fragment identifiers (urldefrag returns a (url, fragment) pair)
links = [urldefrag(link)[0] for link in links]
# remove any empty strings
links = list(filter(None, links))
# if it's a relative link, change to absolute
links = [link if bool(urlparse(link).netloc) else urljoin(url, link)
         for link in links]
# if singledomain=True, remove links to other domains
if singledomain:
    links = [link for link in links if urlparse(link).netloc == domain]
# add these links to the queue (except if already crawled)
for link in links:
    if link not in crawled and link not in pagequeue:
        pagequeue.append(link)
```
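The URL cleanup steps are all standard-library calls from urllib.parse, so they’re easy to try on their own; here’s a self-contained run with made-up links:

```python
from urllib.parse import urldefrag, urljoin, urlparse

url = 'https://example.com/photos/index.html'   # page being crawled (hypothetical)
domain = urlparse(url).netloc                   # 'example.com'
links = ['gallery.html#top', 'https://other.org/page', '', '/contact']

links = [urldefrag(link)[0] for link in links]  # strip #fragments
links = list(filter(None, links))               # drop empty strings
links = [link if urlparse(link).netloc else urljoin(url, link)
         for link in links]                     # make relative links absolute
links = [link for link in links if urlparse(link).netloc == domain]
# links is now ['https://example.com/photos/gallery.html', 'https://example.com/contact']
```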
And that’s all there is to it. I now have a working Python 3 crawler I can use whenever I want to read web pages for any reason. The entire source code is about 60 lines, and you can download it from GitHub. Here’s an example of it running on the first 30 pages found at Microsoft.com:
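Condensed down, the control flow of the crawler amounts to a loop like the one below. The function names and the toy link graph are my own sketch; in the real crawler, get_links would fetch the page with requests and extract links with Beautiful Soup, and would apply the cleanup from step 4:

```python
from collections import deque

def crawl(start_url, get_links, maxpages=10):
    """Minimal crawl loop: get_links(url) returns that page's outbound links."""
    pagequeue = deque([start_url])
    crawled = []
    while pagequeue and len(crawled) < maxpages:
        url = pagequeue.popleft()      # step 1: next page from the queue
        crawled.append(url)            # steps 2-3: fetch/process would go here
        for link in get_links(url):    # step 4: queue new, unseen links
            if link not in crawled and link not in pagequeue:
                pagequeue.append(link)
    return crawled

# A toy link graph standing in for real pages:
graph = {'/a': ['/b', '/c'], '/b': ['/a', '/c'], '/c': []}
order = crawl('/a', lambda u: graph.get(u, []))
# order == ['/a', '/b', '/c']
```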
It’s not super fast, but web crawling is mostly constrained by network bandwidth so I don’t care much about that. And for those who might be tempted to say that a dynamic scripted language like Python isn’t fast enough for these sorts of tasks, I’ll leave you with this little inconvenient truth: Google’s web crawler was originally written in Python (a very old 1.x version) and they didn’t re-write it in highly optimized C++ until they had already become the most dominant search engine in history. A great example of not falling for the allure of premature optimization based on pedantic theories.
And now it’s time to get back to my EXIF project and put this crawler to work!