Television executives and media companies are beginning to embrace the value of data science when it comes to understanding viewership. By combining unstructured data (e.g., text and video) with traditional data sources, data scientists are using machine learning to identify how production decisions impact ratings and which decisions matter most.
In a recent engagement with a global media conglomerate, the Pivotal Data Labs team investigated what makes viewers tune in to, and tune out of, specific television shows. The challenge at the outset was that large, diverse datasets for broadcast shows are not generally available, since most data is still collected manually. Thus, we had to be creative about what additional datasets could feed a predictive model.
Existing efforts, which used manually collected metadata, had reached a ceiling in predictive performance. Although these models were sophisticated, they were all fed with features based on structured data. To improve upon them, we explored augmenting the dataset with unstructured sources such as video, audio, transcripts, and social media. Ultimately, we decided to use transcript data in our modeling efforts since it was the most readily available. By doing so, we were able to improve upon the existing models and provide actionable insights that could be taken directly to TV show producers.
In this blog, we will describe our approach, the tools that we used, and some lessons learned.
Background: Adding More Data (Science) to Traditional Ratings
Historically, media companies have been limited in their understanding of viewers, using only third-party data sources, such as Nielsen television ratings, to track and analyze audience size and composition. Nielsen collects data from both diaries and television-connected devices to measure viewing habits across demographics such as age, gender, race, economic class, and geographic area. However, for a TV show producer, this data does not give specific feedback about how to improve an individual broadcast or episode.
Unlike in the digital, social world, using data to drive decisions is uncommon in television and is considered quite innovative. The only companies doing something similar are newer media companies like Amazon and Netflix, who have been analyzing viewing data at scale to determine which shows are likely to be successful, such as House of Cards with Kevin Spacey. For example, online-focused companies apply meta-tags to as many as 30 million plays per day to determine what will be a hit, what viewers like, and what keeps them watching.
Goals: Bringing Pivotal Data Science into the Picture
In order to help our customer improve their understanding of viewer behavior, we delivered an end-to-end solution—this encompassed a framework to ingest and manipulate the unstructured transcripts, predictive models, and a means to interact with the data and models.
While many commercial solutions are specialized and proprietary, we were able to build an open solution using the Pivotal platform which sets the foundation for future advanced analytics work. Additionally, this solution was built to scale both in terms of number of programs (i.e., every show in their network) as well as broadcasts (i.e., every show that has ever aired).
The project deliverables included:
- A text analytics framework—ingesting, transforming and modeling transcript data in a scalable way
- In-database machine learning models—using predictive toolsets, like MADlib or Python libraries via PL/Python
- An application—incorporating the data and models into a lightweight application to explore the data and provide what-if simulations
Data, Platform, and Approach
Multiple sources were made available for the project: Nielsen ratings data, manually collected metadata, and show transcripts.
Each data source differed in format and quality. The Nielsen data was provided in report form and required minimal effort to load into the Pivotal platform for analysis. The manually collected data was also in report form; as with most manually collected data, it contained many entry errors and was unreliable for modeling purposes. Finally, the show transcripts were in text format and held little to no consistent structure from one broadcast to another.
The final model was deployed on a Pivotal Hadoop/HAWQ instance exposed to Pivotal Cloud Foundry as a service for production usage. A prototype Node.js application was pushed to the same Cloud Foundry instance, giving end users access to analytical insights and letting them interact with model results.
As with most data science projects, and text analytics in particular, the majority of effort was spent cleaning and manipulating data. Part of this effort was developing a framework that would take the inconsistently formatted transcripts and prepare them so that one could apply any number of sophisticated NLP approaches and algorithms. In this project, we used a topic model to generate features for the overall model. This framework could also have been used to derive additional features based on tone, language complexity, and more.
The text framework included the following steps:
- Data Clean Up: Matching up spoken text with speakers in non-standardized text
- In-database Text Transformation: Parsing, Tokenization, Lemmatization, and TF-IDF
- Corpus Reduction: Defining the dictionary of interest
- Text Modeling: LDA modeling to identify the underlying topics within the transcripts
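The steps above can be sketched in a few lines. The project ran these transformations in-database; the scikit-learn pipeline below is an illustrative assumption, not the engagement's actual implementation, and the toy transcripts are invented for demonstration.

```python
# Minimal sketch of the text framework: tokenization, TF-IDF, corpus
# reduction, and LDA topic modeling. Library choices are assumptions;
# the real pipeline ran in-database on the Pivotal platform.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

transcripts = [
    "host welcomes the guest to the studio and discusses the weather",
    "panel debates the election results with two analysts on screen",
    "chef demonstrates a recipe while the host samples the dish",
]

# TF-IDF features (one candidate representation of each broadcast).
tfidf = TfidfVectorizer(stop_words="english").fit_transform(transcripts)

# Corpus reduction: drop stopwords and overly common terms, a stand-in
# for the project's dictionary-of-interest step. LDA conventionally
# consumes raw term counts rather than TF-IDF weights.
counts = CountVectorizer(stop_words="english", max_df=0.9)
dtm = counts.fit_transform(transcripts)

# LDA recovers latent topics; per-document topic weights become
# features for the downstream supervised models.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_features = lda.fit_transform(dtm)  # shape: (n_docs, n_topics)
print(topic_features.shape)  # (3, 2)
```

Each row of `topic_features` is a topic distribution for one broadcast, which is what gets joined with the structured metadata downstream.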
The output from the text framework was combined with other features and fed into a series of supervised models built for each viewer population.
The modeling stage started with narrowing down the tens of thousands of features generated to those found to be most predictive of viewership metrics. Using MADlib’s parallelized implementation of linear regression, a regression was run for every feature to calculate its specific influence on ratings. The most relevant features were then filtered further for multicollinearity. Several algorithms were then compared to identify the most performant model for the data, with elastic net regression yielding the highest predictive accuracy.
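The screen-then-model flow can be sketched as follows. The engagement used MADlib's in-database regressions; this scikit-learn version on synthetic data is an assumed stand-in showing the same shape of workflow.

```python
# Sketch: score every feature with an independent regression, keep the
# strongest, then fit an elastic net. Synthetic data; in the actual
# project these steps ran in-database via MADlib.
import numpy as np
from sklearn.feature_selection import f_regression
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # e.g. topic weights plus metadata features
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)  # ratings proxy

# Step 1: one univariate regression per feature (F-test score),
# analogous to running a per-feature linear regression in parallel.
f_scores, p_values = f_regression(X, y)
keep = np.argsort(f_scores)[-10:]  # retain the 10 strongest features

# Step 2: elastic net on the surviving features; its mix of L1 and L2
# penalties also tempers any remaining multicollinearity.
model = ElasticNetCV(cv=5, random_state=0).fit(X[:, keep], y)
print(model.coef_.shape)  # (10,)
```

In practice the multicollinearity filter was a separate step; the elastic net's L2 component provides a similar safeguard, which is one reason that model family performed well here.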
The Insights and Results
It is a commonly held belief that show format (and specifically commercial breaks) has the highest measurable impact on viewership. However, we found that viewership is truly driven by a mix of a show's format, content, and people. These factors also differ across the demographics of the audience.
Unexpected important variables included:
- Speaker characteristics
- Number of people shown on screen at a time
- Broadcast topics
Although we included thousands of features based on the manually collected metadata, the vast majority of them fell out of the final model because they had no predictive power. Instead, the most relevant variables were derived from the transcript data. This analysis delivered a clear perspective on the drivers of show viewership and popularity changes over time, adding new and significant value to decision-making.
In about eight weeks, the project was delivered, demonstrating the power of leveraging unstructured data and the extensibility of the Pivotal platform. Armed with the code, the platform, and training via knowledge transfer, the company has taken the next steps toward becoming a data-driven enterprise: building an application that leverages a wide set of data and data science to provide actionable insights directly to TV broadcast decision makers.
About the Author: Jarrod Vawdrey