Narratives of Occupy Day 2

After the morning tutorials, we picked up where we left off yesterday. The plan:

  • Build up our list of data sources for people’s narratives of Occupy (and mainstream media comparators),
  • Fetch the stories and metadata, clean up the resulting dataset and store in an accessible format,
  • Make some overall observations based on samples,
  • Develop a method of characterising each entry based on natural language processing,
  • Visualise interesting findings.

We wrote a script to download text content and metadata from the “We Are the 99%” Tumblr, and used built-in TextMate features to remove HTML tags and fix other data issues. The final CSV had 3413 posts, of which 2522 had text (many of the posts have not yet been transcribed). From this, we were able to draw a quick frequency chart showing posts to the site over time:

There was a lot of activity around September and October last year, with only sporadic updates now. No great surprises, but the overall chronology of the data set is good to bear in mind while we search for themes across it.
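The cleaning and counting steps described above could be sketched roughly as follows. This is an illustration only: we did the tag removal in TextMate rather than in code, and the field names (`date`, `body`) and sample rows here are hypothetical, not the real CSV columns.

```python
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the fragment's text with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.parts).split())


def posts_per_month(posts):
    """Count posts by 'YYYY-MM', assuming each post has an ISO-style 'date'."""
    return Counter(post["date"][:7] for post in posts)


# Hypothetical sample rows, not taken from the real dataset.
sample = [
    {"date": "2011-10-02", "body": "<p>I have <b>no</b> health insurance.</p>"},
    {"date": "2011-10-15", "body": ""},  # image-only post, not yet transcribed
    {"date": "2012-03-01", "body": "<p>Student loans &amp; rent.</p>"},
]

cleaned = [dict(post, body=strip_html(post["body"])) for post in sample]
with_text = [post for post in cleaned if post["body"]]
monthly = posts_per_month(cleaned)
```

The `monthly` counter is the kind of table a frequency chart can be drawn from, and `with_text` corresponds to the subset of posts that have a transcription.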

We used IBM ManyEyes to generate some initial visualisations:

Obviously, a lot of the issues described in the Tumblr are financial (rent, bills, work, food, job, money, debt) and personal (family, life, depression, friends). Health is also a big one, and many of the stories mention the cost, unfairness and inefficacy of the health system. Education also appears frequently (college, school, degree), and a word net showing “* is not *” reveals a recurring theme that the education was not worth the debt.

Next, we wanted to categorise the stories somehow, and get some overall data on the issues raised. Using latent Dirichlet allocation (LDA), we wrote a script to detect topics automatically from word usage. Asking the script for four topics produced the groups below, shown with the names we assigned them (top row) and the ten most common words in each:

Students | Ideals     | Health Care | Jobs/Economy
---------+------------+-------------+-------------
job      | people     | job         | business
college  | country    | insurance   | work
debt     | world      | work        | company
school   | make       | afford      | home
student  | american   | pay         | worked
work     | live       | health      | paid
pay      | hard       | time        | money
time     | government | live        | bank
loans    | future     | food        | made
degree   | change     | mother      | credit
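For readers unfamiliar with LDA, here is a minimal sketch of one common way to fit it, collapsed Gibbs sampling, using only the Python standard library. It is not the script we ran at the hackathon: the function names, the hyperparameters (`alpha`, `beta`) and the toy corpus are all assumptions made for illustration, and a real run would want a proper library, stop-word removal and many more iterations.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenised documents by collapsed Gibbs sampling.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for word in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                # Resample in proportion to (topic share in this document)
                # times (word share in that topic), both smoothed.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=10):
    """The n most frequent words currently assigned to each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


# Hypothetical toy corpus, far too small to give meaningful topics.
toy_docs = [
    ["college", "debt", "loans", "degree"],
    ["insurance", "health", "afford", "pay"],
    ["college", "loans", "school", "debt"],
]
toy_doc_topic, toy_topic_word = lda_gibbs(toy_docs, n_topics=2, n_iter=50)
```

The topic names in the table are not produced by the algorithm. LDA only returns numbered word groups, and labels such as “Students” or “Ideals” are a human reading of the top words.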

One quick pie chart later, and we can see which topics were most common (based on treating each post as belonging to its single most-likely category):
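The “single most-likely category” rule amounts to taking, for each post, the topic with the largest weight and then tallying the results. A small sketch, assuming each post comes with one weight per topic (the weights below are made up for illustration and are not our results):

```python
from collections import Counter

TOPIC_NAMES = ["Students", "Ideals", "Health Care", "Jobs/Economy"]


def dominant_topic(weights):
    """Index of the largest weight (the first one wins on ties)."""
    return max(range(len(weights)), key=weights.__getitem__)


def topic_shares(doc_topic_weights, names):
    """Fraction of posts whose most likely topic is each named topic."""
    counts = Counter(names[dominant_topic(w)] for w in doc_topic_weights)
    total = sum(counts.values())
    return {name: counts[name] / total for name in names}


# Hypothetical per-post topic weights.
example_weights = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
shares = topic_shares(example_weights, TOPIC_NAMES)
```

Note that this rule discards information: a post split almost evenly between two topics is counted entirely under one of them, so the pie chart overstates how cleanly the stories separate.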

By the end of the hackathon, we had collected the above visualisations into a preliminary infographic overview of the Tumblr stories, which we presented to the rest of the group. Our (slightly messy!) working notes from the day are online.

Next steps

  • We were hoping to compare this data in two ways: over time, and against mainstream media portrayals of the motivations behind Occupy. The former only needs further processing of the existing dataset. For the latter we already have some data (pulled from “Media Cloud” using a script we wrote), which still needs to be cleaned and made sense of.
  • Right at the end of the day, we began to investigate methods of extracting key phrases from the stories, such as “I want …”, and visualising those separately. Extending this could confirm or debunk the mainstream media narrative that the early days of Occupy were characterised by unclear and conflicting demands.
  • Extracting demographic data is a hard problem, but we did come up with some ideas (including algorithms that guess age and gender based on writing style). This would allow us to see how representative the stories are of the general population, and of the demographics of Occupy as determined by the Occupy Research General Survey.
  • About a quarter of the posts pulled from Tumblr (891 of 3413) had no text content, because the characteristic images (people holding up a piece of paper with a hand-written story) had not been transcribed. We could use crowd-sourcing or optical character recognition to fill in the gaps, potentially contributing the transcriptions back to the Tumblr site.
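As a starting point for the key-phrase idea in the list above, a regular expression can pull out whatever follows “I want” in each story. This is a rough sketch rather than what we built on the day: the pattern, the function name and the sample sentences are all assumptions, and it simply stops at the next sentence-ending punctuation mark.

```python
import re
from collections import Counter

# Capture the rest of the clause after "I want", up to ., !, ?, ; or a newline.
WANT_PATTERN = re.compile(r"\bI want\b\s+([^.!?;\n]+)", re.IGNORECASE)


def want_phrases(text):
    """Return the lower-cased phrases that follow 'I want' in the text."""
    return [m.group(1).strip().lower() for m in WANT_PATTERN.finditer(text)]


def count_wants(stories):
    """Tally 'I want ...' phrases across a collection of stories."""
    counts = Counter()
    for story in stories:
        counts.update(want_phrases(story))
    return counts


# Hypothetical sentences, not quotes from the Tumblr.
example_stories = [
    "I want a job. I am tired.",
    "All I want is health care!",
    "I wanted more, but now I want a job.",
]
wants = count_wants(example_stories)
```

Exact-match counting like this will fragment quickly on real stories, so grouping similar phrases (for example by their first noun) would probably be needed before any visualisation.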

Conclusions

The prevalence of healthcare- and education-related posts is interesting. The mainstream understanding of Occupy (particularly in terms of party politics) has been centred around income inequality, but folk narratives so far paint a slightly different picture. It appears that many people identify with “The 99%” based on everyday struggles. Although money is important, it could perhaps be seen as an expression of an underlying demand for a minimum standard of living, and a responsible collective solution to shared personal crises. Our preliminary findings from “We are the 99%” suggest that messaging and outreach based on this concept of mutual support may be more effective than continuing to try to rally around abstract issues.

3 Comments

  1. Hi! You saw this analysis of the We Are the 99 percent Tumblr, posted last October, right? It’s really good :) http://rortybomb.wordpress.com/2011/10/09/parsing-the-data-and-ideology-of-the-we-are-99-tumblr/

    • supervacuo

      Yep! Thanks for the reminder, though.

      We mentioned it in our (brief) post at the end of the first day of the hackathon.
