It’s a romantic myth that the intangible feelings evoked by a song defy analysis. How can a person or algorithm predict what song will inspire a distinct feeling—the freedom of a bygone summer, the sensation of first love? Of course, selling records has always relied on prediction, requiring an army of A&R representatives capable of intuiting the winning formula of melody, sentiment, and image, to find the next big hit.
The problem is, that’s a notoriously hit-and-miss process, and given the current music economy, there’s less money than ever to leave hitmaking to intuition and chance. So can a sufficiently large set of qualitative data collected from music listeners yield an algorithm that can predict hits?
The notion isn’t that crazy. Services like Pandora and Spotify are data-driven enterprises, slicing the sentiments evoked by a particular piece of music into myriad descriptors as specific as “ebullient” and “late-night melancholy.” This is how they’re able to deliver, with uncanny accuracy, individualized playlists that millions of users enjoy.
Visualization by Marek Naborczyk.
The EMI Music Data Science Hackathon held last Saturday, July 21st, in London aimed to determine whether such sentiment analysis could yield a predictive algorithm capable of identifying whether a song would be a hit. Sponsored by Data Science London, the competition gave 175 data scientists access to 118,000 responses from the EMI One Million Interview Dataset, a mine of qualitative data generated from 20-minute interviews of a million music listeners in 23 countries, from a wide range of demographic backgrounds.
For many of the attendees, this was a unique dataset to work with, says Brendan Moran, a data scientist at Greenplum, a division of EMC, who attended the event. “It was quite different,” he says, “because it was a qualitative dataset, and also because they went after very different demographics than you’d get from Spotify; your granny isn’t going to be on Spotify.”
Using the competition platform Kaggle, players competed in teams to develop an algorithm that could combine respondents’ demographics, their artist and track ratings, answers to questions about music, and words used to describe EMI artists. Competitors were given access to Greenplum’s Unified Analytics Platform, comprising the Greenplum MPP Database, Hadoop infrastructure, and a collaboration environment called Chorus, which allowed them to crunch the big datasets and collaborate over the course of 24 hours.
Though data drawn from one million interviews might seem exhaustive, competitors soon found it too limited. “The more astute members of the community said that the dataset was too small,” Moran says. To augment the EMI dataset, competitors tapped publicly available data from sources such as Stanford’s Learning Word Vectors for Sentiment Analysis research.
For Moran, this illuminated that developing useful predictive algorithms requires more than a single set of internal data. “You can’t solve a lot of these problems just by yourself or with your own data,” he says. “Sometimes you have to use publicly available datasets and pull from the community, and then you have to offer that back to the community, so you build a corpus of knowledge.”
Visualization by Kaggle user Kevin.
While competitors did not have to use the Greenplum Unified Analytics Platform (UAP), the event demonstrated the value of a broad collaboration platform.
The teams using UAP had an advantage, benefiting from the platform’s speed, flexibility, and collaborative tools—an asset given that the hackathon lasted 24 hours, and some team members had to steal away for a break or nap during the marathon contest. “A number of teams who were trying to crunch data on their laptops suddenly found themselves unable to complete the running of models in the dying moments of the competition,” notes Moran.
The winning algorithm in the London competition, developed by the Innovations team, used a random forest technique, an approach commonly taken in Kaggle competitions, as Kaggle’s President and Chief Scientist Jeremy Howard noted during a Big Data for the Public Good talk at Code for America in May. Moran notes that this technique highlights the value of the Greenplum platform. “What’s interesting about a random forest is that it’s a horsepower-based approach to gaining insight,” he says. “If you’re going to run a random forest model, you want to run it at depth, and whereas a laptop rapidly runs out of capability, our MPP database doesn’t.”
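For readers unfamiliar with the technique, a random forest trains many decision trees, each on a random resample of the data and a random subset of features, then averages their predictions. The sketch below is a minimal illustration in plain Python, not the winning team's code: the data is synthetic, and every name in it (`make_toy_data`, `fit_forest`, the three listener features) is an assumption made for the example. It also hints at why the approach is compute-hungry, since the work grows with the number of trees, their depth, and the size of the data.

```python
import random
from statistics import mean


def make_toy_data(n=200, seed=1):
    """Synthetic listeners (NOT EMI data): (age, likes_artist, hours) -> rating."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.randint(16, 70)
        likes_artist = rng.randint(0, 1)
        hours = rng.uniform(0, 8)
        rating = 30 + 50 * likes_artist + 2 * hours + rng.gauss(0, 3)
        rows.append(((age, likes_artist, hours), rating))
    return rows


def sse(rows):
    """Sum of squared errors of the targets around their mean."""
    m = mean([t for _, t in rows])
    return sum((t - m) ** 2 for _, t in rows)


def build_tree(rows, depth, max_depth, n_features, rng):
    """Grow one regression tree; a leaf is a float, a split is a dict."""
    targets = [t for _, t in rows]
    if depth >= max_depth or len(rows) < 4 or len(set(targets)) == 1:
        return mean(targets)
    best = None
    # Each split only looks at a random subset of the features.
    for f in rng.sample(range(len(rows[0][0])), n_features):
        for threshold in sorted({x[f] for x, _ in rows}):
            left = [r for r in rows if r[0][f] <= threshold]
            right = [r for r in rows if r[0][f] > threshold]
            if not left or not right:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, f, threshold, left, right)
    if best is None:
        return mean(targets)
    _, f, threshold, left, right = best
    return {
        "feature": f,
        "threshold": threshold,
        "left": build_tree(left, depth + 1, max_depth, n_features, rng),
        "right": build_tree(right, depth + 1, max_depth, n_features, rng),
    }


def predict_tree(node, x):
    """Walk from the root to a leaf and return the leaf's value."""
    while isinstance(node, dict):
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node


def fit_forest(rows, n_trees=25, max_depth=4, n_features=2, seed=0):
    """Train each tree on a bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]
        forest.append(build_tree(sample, 0, max_depth, n_features, rng))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return mean([predict_tree(tree, x) for tree in forest])


if __name__ == "__main__":
    forest = fit_forest(make_toy_data())
    print("fan:", predict_forest(forest, (30, 1, 4.0)))
    print("non-fan:", predict_forest(forest, (30, 0, 4.0)))
```

Even in this toy version, every tree rescans its sample for each candidate split, so raising `n_trees` or `max_depth` on a large dataset quickly becomes a question of available computing power.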
Predictive algorithms may not be as romantic as the image of a Simon Cowell figure who can predict the next big star through experience and intuition, but in time they’re likely to be more accurate, and less mean-spirited. An algorithm that can beat Cowell will require more development and far bigger datasets than competitors had access to at the hackathon, but the winning team made a compelling case for the promise and potential of predictive analytics and collaboration platforms.
The winning infographic, from the EMI One Million Songs data, by Gregory Mead.
About the author: Paul M. Davis