ThemaTweets – Visualizing the French Elections Buzz on Cloud Foundry

February 3, 2012 Cloud Foundry

As the platform matures, we see many cool applications being built on top of Cloud Foundry. This is the first in a series of guest blog posts by application developers explaining what their application does, how it is architected, what they like in Cloud Foundry and what needs to be improved.

Guest blog post by Eric Bottard

ThemaTweets2012 allows you to visualize what people say about the french elections candidates on Twitter, how these candidates relate to key themes of this 2012 event and analyze information under several angles.


The Story

In late 2011, Google France launched a data visualization challenge about the upcoming 2012 presidential elections. The rules of the challenge were pretty straightforward : provide a web application that leverages data from Google and/or Twitter and shed a new light on the french elections. The winners would eventually win a trip to Mountain View, or a nice Android tablet.

My team and I started hacking pretty late and although we wanted to use that opportunity to experiment with new technologies and great ideas (after all, the main reason we engaged in the challenge was for the fun of it), we focused ourselves on one simple thing : detecting the topics that mattered the most to people, and how candidates to the elections related to those. Now, you’ll admit that 140 characters is not much to convey structured information. Add to this the fact that people don’t always put a lot of effort into writing elaborate sentences on Twitter, so we decided to keep things simple : we would only count occurrences of candidates and topics, without trying to analyze the tone of the tweet.

Being a dataviz oriented application, ThemaTweets allows the user to navigate between different views of the same raw data. The main screen shows an overall breakdown of candidates and then the topics mentioned in tweets (eg. Nicolas Sarkozy is found in 43% of the tweets we captured and amongst those, 28% cite the topic “Taxes”). Click on a button and you’ll invert the view (eg. 54% of tweets captured deal with “Ecology”, of which 33% are linked to François Hollande).

One can also view the evolution of these figures over time. This is already useful and interesting, but we wanted to go further than that and help some hidden information surface out of those simple datapoints. So the application also has two other views :

  1. one allows us to see how topics “belong” to a political side. By assigning some kind of left-right alignment to each of the candidates, we can average the relative positions of topics relative to each other.
  2. the other selects the topics that are the most often encountered for each candidate and then displays the “overlap” that one can find between two (or more) candidates. For example, if N. Sarkozy is linked to “Euro crisis” in 34% of the tweets and Dominique de Villepin is credited with 21% of tweets for the same topic, we’d link them together with an arc. The thickness of the arc at both ends reflects the percentages.

Lastly, we added a view that highlights the terms we captured in the tweets, so that people can witness were the data is coming from.


Of course, users can view the evolution of topics or candidates share over time

How the application actually works

Under the hood, ThemaTweets is a simple web application, built on top of Spring and Hibernate. The data model is rather simple, with an added twist whereby we aggregate data hourly to avoid recomputing means and sums every time a view is needed. It leverages the twitter4j library to connect to twitter, making use of the streaming API. There are fixed keyword terms made of the candidates names (and their political parties, collaborators and so on). This allows the twitter API to pre-filter results. Then, for each tweet that is received, the app tries to match it against keywords related to topics. Contrary to the candidate filter, there is an opportunity here to apply Lucene related technology where we tokenize and stem the words (so that we would match “écologie”, “écolo”, “écologiste” as well as variants without the accented “é”). While we originally went for a trained bayesian filter approach, the ratio of useful words to garbage was really huge. So we decided to keep the algorithm, but using only handcrafted positive weighs for the topic words. This worked really well.

Terms that are captured in tweets are highlighted and linked to views in the application, so that users understand where the data is coming from

Being proficient with the technologies used helped us prototype quickly, with an added productivity bonus coming from Spring Data JPA. The application UI is really client centric, with data coming in JSON from Spring MVC controllers.

Interacting with Cloud Foundry

We chose from the start to experiment with some cloud infrastructure, just for the fun of it. Moreover, we had no idea of the exposure the app would get once the challengers would be revealed. Sadly, we did not have time to lose trying to bend our minds around limitations in JPA queries that actually operate on a NoSQL backend, or trying to have maven and a cloud related eclipse plugin play nicely together. So we finally went for Cloud Foundry after one or two days.

The setup was the most impressive experience I guess. Imagine adding 3 lines of maven plugin configuration to your pom.xmland being able to push the application code directly to the cloud ! The interaction with the vmc command line tool was also very pleasant, and provisioning services by simply following a wizard felt refreshing.

I think the most astonishing feature was the fact that, ThemaTweets being a rather simple app with only one datasource, it got reconfigured automagically to use the cloud mysql instance, as detailed here. So the very same code that ran locally could be deployed in the cloud with no modification.

Also, having access to a genuine java virtual machine without restrictions also was decisive in our choice. Indeed, the Twitter streaming API requires a long lived thread to be connected to the twitter servers, which ruled out some of the alternatives.


Those three lines of XML configuration are all that is needed to deploy to the cloud with maven

Of course, there were some difficulties nevertheless. The biggest drawback comes from the fact that when you’re in the cloud, you’re pretty blind to what happens with your app. At first, there was no way to know what was getting stored in our database, so we had to roll out our own back office application. Luckily, the tunnel solution was discovered soon after, although we had a very hard time making it work on a mac (something that should go more smoothly in the future if I understand correctly).

A side effect of this blindness is that importing / exporting data is pretty complicated, even with the tunnel working. Because of timeouts that have to be enforced on the mysql services, exporting our data had to be done in chunks, which was a bit tedious. Nevertheless, I believe the tunneling approach is very smart and could allow other usages in the future.

We also witnessed some strange memory usage from time to time, whereas the app was behaving very nicely locally.

Some additions to the platform could also be considered as “nice to have”, even though we made it without them. For starters, code versionning was not really an issue for us, as the app was not live until the challenge was over. But because Cloud Foundry allows services to be bound to multiple applications, I guess you can get away with having another app alias running the new code and being bound to the same services.

We also needed our Twitter connector to be a singleton among the whole cluster. While we ended up starting it from an http thread, having a cloud-aware singleton spring scope supported by the platform would be nice (remember that you can’t know beforehand the IP addresses of the machines you’ll be running on, etc.)

I also recently realized that although Cloud Foundry almost gives access to a vanilla Tomcat server, there is no way to add custom jars to the lib/ folder (eg. to allow Spring load time weaving). The platform does add the mysql/postgresql jdbc connector jar for you though.

Finally, I guess what the platform today lacks compared to the competition is a nice looking web UI, although the vmc command line tool (and the REST API behind it) offer a lot information.

Happy End

As you can guess, we were really impressed by the technology. We did not win the first prize in the challenge, but made it to the finals nevertheless. For the purpose of being shown on this youtube channel, the app had to be moved and operated by Google. Because Cloud Foundry did not force us to modify our plain Spring MVC app, the very same war file is now hosted on another platform and it works like a charm.

Signup for Cloud Foundry today, and start building your own cool app!

About the Author

Biography

Previous
New in Pivotal Tracker: Improved Stories!
New in Pivotal Tracker: Improved Stories!

Stories in Pivotal Tracker have been given a serious upgrade. For the most part it’s all pretty self explan...

Next
Pivotal Tracker Maintenance this Saturday (Feb. 4) at 9:00am PST (17:00 UTC)
Pivotal Tracker Maintenance this Saturday (Feb. 4) at 9:00am PST (17:00 UTC)

We've got a major update of Pivotal Tracker planned for this weekend, which requires downtime while we run ...