A Python toolbox to turn the internet into data
“Python?”, you say? Python is a programming language. You don’t need to know much about Python to follow along, but you do need to know how to install packages — and have Python installed. To do that, see our “Software Guides” page.
The internet is full of useful information. Some of it is distributed across 20 different pages, some of it is nestled in complex tables that test the limits of copy-and-paste, but crucially it is not where you want it to be: in a neat file on your computer.
Python can solve this problem.
Fetching things with “Requests”
The first step is downloading things — pictures, PDFs, webpages, or data files. In the Bad Old Days, the tool for the job was urllib2. As we near the end of the world here in 2012, however, we should all be using Requests (HTTP for Humans).
Using it (after installing it) looks like this:
>>> import requests
>>> r = requests.get('http://www.isthatcherdeadyet.co.uk/')
>>> print 'Not yet' in r.text
True
The above lines could be saved in a script to notify us of important world events. We can also use Requests to download files:
>>> r = requests.get('http://ur1.ca/afhvw')
>>> f = open('nyan.gif', 'wb')  # 'wb': r.content is raw bytes, not text
>>> f.write(r.content)
>>> f.close()
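If you find yourself downloading lots of files, the pattern above generalizes to a small helper. A sketch, assuming `requests` is installed; `save_url` is our own name here, not something the library provides:

```python
import requests

def save_url(url, path):
    """Fetch url and write the raw bytes to path."""
    r = requests.get(url)
    r.raise_for_status()  # raise an error on 404s and friends
    with open(path, 'wb') as f:  # 'wb': r.content is raw bytes
        f.write(r.content)
```

The `raise_for_status()` call is worth the extra line: without it, a "404 Not Found" error page would be silently saved to disk as if it were your file.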
Finally, we can download a whole HTML file using Requests, and use another Python package to parse the page and process our data.
Getting data from webpages with “BeautifulSoup”
If you ask a programmer friend how to get data out of a webpage, they might tell you to use “regular expressions”. Unfortunately, most of the websites on the internet are not written in anything approaching a “regular” way — we would need “highly-irregular nonsense expressions”. This kind of bad HTML is politely called “tag soup”, which is why the tool to parse it is called BeautifulSoup instead. Note that a new version was released recently, so some guides out there may be out of date. This one isn’t. (Yet.)
BeautifulSoup provides a natural, “Pythonic” (don’t ask me, or indeed anyone else) way of searching for particular data in an HTML page. Let’s list all the programming languages Wikipedia knows about:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> r = requests.get('https://en.wikipedia.org/wiki/List_of_programming_languages')
>>> soup = BeautifulSoup(r.content)
>>> links = soup.select('div#mw-content-text table.multicol ul li a')
>>> print [i.text for i in links]
[u'A# .NET', u'A# (Axiom)', u'A-0 System', u'A+', u'A++', ...
Here we specify which parts of the web page we’re interested in using “CSS selector syntax”, the same syntax we’d use when styling the page in the first place, or when writing jQuery code. Yay for reusable skills.
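To see how those selectors behave without fetching anything, here’s a sketch against a made-up scrap of HTML (the ids and class names below are invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div id="menu">
  <ul>
    <li><a href="/python">Python</a></li>
    <li><a href="/ruby">Ruby</a></li>
  </ul>
</div>
<p class="footer"><a href="/about">About</a></p>
"""
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly

# 'div#menu li a': <a> tags inside <li>s inside the div with id "menu"
menu_links = [a.text for a in soup.select('div#menu li a')]

# 'p.footer': paragraphs carrying the class "footer"
footer_text = soup.select('p.footer')[0].text

print(menu_links)   # ['Python', 'Ruby']
print(footer_text)  # About
```

Naming the parser (`'html.parser'`) also silences BeautifulSoup’s “no parser was explicitly specified” grumbling and keeps results consistent across machines.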
Now that some data is sitting on your computer, it is yours to hack at. For instance, you could count the entries, filter them, or cross-reference them with another dataset.
For the moment, though, we’ll just cover storing it as a file. Adopt a chunk of the internet today!
Saving it for later
If you’re downloading PDFs, images or other data files, you can probably just use the open-and-write approach from the nyan.gif example above. For structured data like our list of languages, though, Python’s built-in csv package will save it in a spreadsheet-friendly format:
>>> import csv
>>> f = open('languages.csv', 'w')
>>> writer = csv.writer(f)
>>> writer.writerows([[l.text.encode('utf-8')] for l in links])
>>> f.close()
OK, so that got a bit gnarly towards the end (exposing some limitations of the built-in Python csv package), but we’re done; “languages.csv” is saved to disk. Impress your friends by printing it out on index cards and mounting tastefully above your TV…
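It’s worth reading the file back to check the round trip survives. A quick sketch as a plain script rather than an interpreter session, with a few stand-in rows instead of our scraped links:

```python
import csv
import os
import tempfile

rows = [['A# .NET'], ['A-0 System'], ['Zsh']]  # stand-ins for the scraped data

path = os.path.join(tempfile.mkdtemp(), 'languages.csv')
with open(path, 'w', newline='') as f:   # newline='': let csv manage line endings
    csv.writer(f).writerows(rows)

with open(path, newline='') as f:
    back = list(csv.reader(f))

print(back == rows)  # True: the round trip preserves the data
```

(In Python 3 the csv module handles Unicode natively, so the `.encode('utf-8')` dance above goes away.)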
And so on
A corollary of Murphy’s law is that the most useful information on the internet is trapped in the most unimaginably byzantine and ugly websites. From trash collection schedules to election results to Twitter, there’s a world of “almost-there” data to be re-processed, analysed and made more useful.
It’s great that some systems have been officially opening up under the banner of “Open Data”, but with techniques like the above we’re not limited by others’ inertia, or their limited budgets. If you can see something on a screen — with very few exceptions — there’s already a way of grabbing it in a useful format.