A Python toolbox to turn the internet into data

“Python?”, you say? Python is a programming language. You don’t need to know much about Python to follow along, but you do need to know how to install packages — and have Python installed. To do that, see our “Software Guides” page.

The internet is full of useful information. Some of it is distributed across 20 different pages, some of it is nestled in complex tables that test the limits of copy-and-paste, but crucially it is not where you want it to be: in a neat file on your computer.

Python can solve this problem.

Fetching things with “Requests”

The first step is downloading things — pictures, PDFs, webpages, or data files. In the Bad Old Days, the tool for the job was urllib2. As we near the end of the world here in 2012, however, we should all be using Requests (HTTP for Humans).

Using it (after installing it) looks like this:

>>> import requests
>>> r = requests.get('http://www.isthatcherdeadyet.co.uk/')
>>> print 'Not yet' in r.text

The above lines could be saved in a script to notify us of important world events. We can also use Requests to download files:

>>> r = requests.get('http://ur1.ca/afhvw')
>>> f = open('nyan.gif', 'wb')
>>> f.write(r.content)
>>> f.close()
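If you're doing this a lot, it's worth checking that the server actually said “OK” before saving anything. As a sketch (the `download` function name is mine, not part of Requests), wrapping the steps above in a reusable function might look like this:

```python
import requests

def download(url, filename):
    """Fetch a URL and save its raw bytes to a file."""
    r = requests.get(url)
    r.raise_for_status()              # blow up on a 404, 500, etc.
    with open(filename, 'wb') as f:   # 'wb': r.content is raw bytes
        f.write(r.content)

# Usage -- same shortlink as above:
#   download('http://ur1.ca/afhvw', 'nyan.gif')
```

The `with` block also guarantees the file gets closed, even if the write fails halfway.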

Finally, we can download a whole HTML file using Requests, and use another Python package to parse the page and process our data.

Getting data from webpages with “BeautifulSoup”

If you ask a programmer friend how to get data out of a webpage, they might tell you to use “regular expressions”. Unfortunately, most of the websites on the internet are not written in anything approaching a “regular” way — we would need “highly-irregular nonsense expressions”. This kind of bad HTML is politely called “tag soup”, which is why the tool for parsing it is called BeautifulSoup instead. Note that a new major version (bs4) was released recently, so some guides may be out of date. This one isn’t. (Yet.)
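To see what tag soup looks like in practice, here’s a little sketch using a made-up fragment of broken HTML — note the unclosed tags that would make a regular expression weep:

```python
from bs4 import BeautifulSoup

# A made-up fragment of tag soup: unclosed <p> tags and a dangling <a>.
tag_soup = '<p>Broken page<a href="/one">first</a><p>more<a href="/two">second'

soup = BeautifulSoup(tag_soup, 'html.parser')
print([a['href'] for a in soup.find_all('a')])  # ['/one', '/two']
```

BeautifulSoup quietly repairs the structure, so both links are found despite the mess.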

BeautifulSoup provides a natural, “Pythonic” (don’t ask me, or indeed anyone else) way of searching for particular data in an HTML page. Let’s list all the programming languages Wikipedia knows about:

>>> from bs4 import BeautifulSoup
>>> import requests 

>>> r = requests.get('https://en.wikipedia.org/wiki/List_of_programming_languages')
>>> soup = BeautifulSoup(r.content, 'html.parser')

>>> links = soup.select('div#mw-content-text table.multicol ul li a')
>>> print [i.text for i in links]
[u'A# .NET', u'A# (Axiom)', u'A-0 System', u'A+', u'A++', ...

Here we use the same method for specifying which parts of the web page we’re interested in — “CSS selector syntax” — as we would when styling a webpage in the first place, or writing jQuery code. Yay for reusable skills.
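If you’d like to play with selectors without hammering Wikipedia, here’s a sketch against a tiny made-up page (the ids and classes are copied from the example above; the page itself is invented):

```python
from bs4 import BeautifulSoup

# A made-up page mimicking the structure of the Wikipedia list.
html = """
<div id="mw-content-text">
  <table class="multicol"><tr><td>
    <ul>
      <li><a href="/wiki/Python">Python</a></li>
      <li><a href="/wiki/Ruby">Ruby</a></li>
    </ul>
  </td></tr></table>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
names = [a.text for a in soup.select('div#mw-content-text table.multicol ul li a')]
print(names)  # ['Python', 'Ruby']
```

The selector reads right to left as “links, inside list items, inside a list, inside a table of class `multicol`, inside the div with id `mw-content-text`” — exactly how you’d target the same elements in a stylesheet.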

Now that some data is sitting on your computer, it is yours to hack at. For the moment, though, we’ll just cover storing it as a file. Adopt a chunk of the internet today!

Saving it for later

If you’re downloading PDFs, images or other data files, you can probably just use f.write(r.content) as above. For data you’ve parsed from webpages, though, you’ll need to generate the output file yourself. A file format called CSV (comma-separated values) is helpful here — it’s just a simple table, where each line is a row and columns are separated by commas (hence the name). CSV files can be uploaded to Google Docs, read back into a script (Python, or most other programming languages) and analysed in tools like Google Refine or Excel. Making CSV files in Python is very simple. Here’s how we could save Wikipedia’s list of programming languages to a CSV file:

>>> import csv
>>> f = open('languages.csv', 'w')
>>> writer = csv.writer(f)
>>> writer.writerows([[l.text.encode('utf-8')] for l in links])
>>> f.close()

OK, so that got a bit gnarly towards the end (exposing a limitation of the built-in Python csv package, which in Python 2 wants byte strings rather than unicode — hence the encode), but we’re done; “languages.csv” is saved to disk. Impress your friends by printing it out on index cards and mounting them tastefully above your TV…
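As a quick sanity check, the same csv module reads the file straight back in. Sketched here with a made-up three-language list standing in for the scraped one:

```python
import csv

# A made-up stand-in for the scraped list of languages.
languages = [['Python'], ['Ruby'], ['Brainfuck']]

with open('languages.csv', 'w') as f:
    csv.writer(f).writerows(languages)

with open('languages.csv') as f:
    rows = list(csv.reader(f))

print(rows)  # [['Python'], ['Ruby'], ['Brainfuck']]
```

Round-tripping your data like this is a cheap way to catch encoding or quoting surprises before you ship the file off to Google Docs or Excel.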

And so on

A corollary of Murphy’s law is that the most useful information on the internet is trapped in the most unimaginably byzantine and ugly websites. From trash collection schedules to election results to Twitter, there’s a world of “almost-there” data to be re-processed, analysed and made more useful.


It’s great that some systems have been officially opening up under the banner of “Open Data”, but with techniques like the above we’re not limited by others’ inertia, or their limited budgets. If you can see something on a screen — with very few exceptions — there’s already a way of grabbing it in a useful format.
