Earlier this year, mobile video usage overtook video usage on personal computers and research continues to show how important mobile video is to consumers. With these trends, mobile carriers and media companies are trying to learn more about usage to improve user experience and advertising revenues, especially with big advertising money makers like sports.
Earlier this year, a major sports network and mobile carrier approached Pivotal to create a system for measuring cellular data and live video usage through mobile applications. The approach used Spring XD, Pivotal HD, Redis, and Spring Boot to quickly build a big data platform and set of analytical dashboards. The resulting analytics application is helping business executives from both companies understand how live video is used and what peak data usage looks like. The whole project only took a few resources and a few months.
Mobile/Media Big Data Architecture Overview
At a high level, data moves through a factory as depicted in the diagram below. Raw data is captured from mobile phones in JSON format by a Spring XD cluster where several processes are performed. The data is then stored in an HDFS cluster where Spring XD batch jobs use SQL via the HAWQ interface to HDFS and store the calculated reports in Redis. Spring Boot is then used with Angular.js and D3.js to show analytics to end users.
Capturing Raw Data from Mobile Devices in JSON
To produce the raw data, a mobile device is instrumented to capture the number of bytes used while viewing a video. The mobile application sends a “heartbeat” message that is formatted in JSON and contains the number of bytes consumed since the last report. Generally, the mobile app sends one of these reports every minute. Since there can be a large number of viewers during a live game, the system receives millions of heartbeat messages per hour. When designing the system, we decided to keep all of the heartbeat data over time, as with the concept of keeping all data inside a data lake. This way, teams could use the data to produce additional analysis in the future. In addition, the customers had a quick timeline. They needed the system in place prior to the start of the next sports season.
Ingesting and Processing Data with Spring XD and HDFS
To handle the data ingestion, the Pivotal team utilized Spring XD to receive and transform the raw data. The processed data was stored in an Apache Hadoop® File System (HDFS) cluster as part of Pivotal HD. Spring XD and HDFS provided a flexible, scalable solution to handle the massive amount of incoming JSON heartbeat data—as much as a half terabyte per season of sports.
After the data was stored in HDFS, we used Pivotal HAWQ and SQL to generate the necessary reports from the data stored in HDFS. Spring XD’s batch job capability was leveraged to execute the reports after a game was done. The batch jobs produced JSON reports and stored them in Redis, providing immediate access to a web-based dashboard. Spring XD and Pivotal HD provided clustering and failover support to ensure business continuity in the event of a server failure.
Spring XD turned out to be a perfect solution because it was designed with a distributed and extensible pipeline for ingesting data, processing it, performing real-time analytics on the received data, and then exporting it. Using Spring XD, we quickly implemented a system to receive the raw heartbeat data from the mobile applications. Spring XD provides several “sources” and “sinks” out of the box, and we were able to modify the existing HTTP source and HDFS sink to receive and store the raw heartbeat data. We also created a custom “processor” module that enabled us to filter and validate the raw data, ensuring that it was formatted as expected and had valid data in the expected JSON fields. Spring XD also provided a mechanism to “tap” data from one stream to another. Using a tap, it was simple for us to configure a stream to count the number of heartbeat packets received. This allowed us to ensure the system worked as expected—we could see the amount of live data going through the system in real-time. Since Spring XD supports a distributed runtime deployment, it also ensured that inbound data would continue to process in the event of any server failure.
Creating Analytical Reports with Spring XD, HAWQ, Redis, and Spring Boot
With HDFS storing the raw data, we used the Pivotal Xtension Framework (PXF) and HAWQ to perform SQL queries directly against the JSON data. PXF maps the JSON to columns in a database, enabling HAWQ to execute SQL queries directly against the JSON data in HDFS. Using SQL, the product manager could then define the report queries.
For the next step, we turned to Spring XD’s batch capabilities. Custom “jobs” were configured with the appropriate SQL queries to execute. The jobs are scheduled to execute early in the morning, after a game has occurred, and the report results are stored in Redis. Redis stores all the reports so that users can view any of them without needing to spend the time computing them on demand. Finally, we implemented a simple REST API, using Spring Boot, to serve the JSON reports to the user interface.
Together, Spring XD, Pivotal HD and HAWQ, Redis, and Spring Boot were used to quickly create a big data solution for massive ingesting, analyzing, and reporting on cellular data use for live, game-day videos. Now, the client can identify trends and optimize media revenue.
- Spring XD: Project and Documentation | Blog Articles | Using XD for Real Time Twitter
- Pivotal HD (SQL on HDFS via HAWQ): Product | Documentation | Blog Articles
- Redis: Website | Blog Articles
- Spring Boot: Project and Documentation | Blog Articles
- Pivotal Open Source: Products and Projects | Blog Articles
Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
About the AuthorMore Content by Allan Baril