Exploratory Data Science: When To Use An MPP Database, SQL on Hadoop or Map Reduce

July 17, 2014 Ian Huston

The members of the Pivotal Data Labs team are often asked what tools and platforms we use to analyze large datasets and build cutting-edge predictive models. Some of our colleagues recently presented a talk on the open source software used by our team; in this post, we consider the importance of choosing the right platform, with a focus on exploratory data science. We always want to use the right tool for the right job, which means understanding what data processing is needed, what performance is required, and what budgetary limitations apply.

In a recent blog post, “When To Use Apache Hadoop® vs In-Memory vs MPP”, Catherine Johnson outlined how to choose among these technologies in your system. If you need access to data on sub-millisecond time scales, in-memory systems are preferable. However, there is a difference between real-time analytics and real-time access. For analytics, you need to query all of the relevant stored data in a timely fashion, and when that data grows beyond the capacity of in-memory systems, persistent storage platforms such as the Pivotal Big Data Suite become necessary.

So how do you choose between Pivotal GPDB, our massively parallel processing (MPP) database, and Pivotal HD with HAWQ, our Apache Hadoop® distribution and SQL query engine, when performing data science work? In this post, we compare these platforms across four key categories: data pre-processing requirements, functionality, speed, and cost.

Pre-processing Requirements

In many ways, your choice of platform is determined by the data you want to analyze. Instead of thinking in terms of structured, semi-structured, or unstructured data, consider the amount of pre-processing needed to develop an effective predictive model.

Predictive models and machine learning algorithms need specific and well-formatted inputs. The steps to transform the raw data into something usable for modeling depend on the source and type of data.

The following chart compares the parts of the Pivotal Big Data Suite for different data pre-processing requirements. The choice of tools reflects our team’s collective experience of what is well suited or best suited (easiest to use) for each data type across many domains. For your organization, the choice may also depend on the exact data specifications and the expertise of your data scientists. The Pivotal HD (PHD) category here covers MapReduce on Apache Hadoop® and other tools in the Apache Hadoop® ecosystem such as Apache Spark, with HAWQ treated as a separate category.

[Chart: Pivotal Big Data Suite components mapped to data pre-processing requirements]

What part of the Pivotal Big Data Suite is appropriate for analyzing different data sources?

  • Transactional data and traditional customer information records are best suited for Pivotal GPDB, as they require little to no pre-processing.
  • Geospatial data often requires relatively complex geometric calculations, which PostGIS on Pivotal GPDB can handle.
  • Raw log files, XML or JSON files, and typical social media data are semi-structured, and this is a situation where HAWQ combined with the Pivotal Extension Framework (PXF) is ideal. Users write SQL against files in HDFS, enabling quick insights without writing Pig or MapReduce jobs (see the sketch after this list). Depending on the log files, Pivotal GPDB can also be used to parse semi-structured logs efficiently.
  • For text, it is easy to include open-source Natural Language Processing toolkits in your processing pipeline within GPDB & HAWQ using procedural languages such as PL/Python.
  • Video or image data demands extensive pre-processing. Pivotal HD is the best choice, especially when combined with Spring XD and other tools in the Apache Hadoop® ecosystem like GraphLab, OpenMPI for Apache Hadoop®, and Apache Spark. For large-scale image processing tasks, GPDB & HAWQ may also be suitable. Our colleagues will expand on this in a future post.
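
As a sketch of the HAWQ and PXF approach described above, the external table below exposes raw, tab-delimited log files sitting in HDFS so they can be explored with ordinary SQL. The table name, columns, HDFS path, and PXF profile are hypothetical, and the exact LOCATION parameters depend on your PXF version:

    CREATE EXTERNAL TABLE raw_clickstream_logs (
        log_time text,
        user_id  text,
        url      text,
        referrer text
    )
    LOCATION ('pxf://namenode-host:50070/data/clickstream/*.log?PROFILE=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER E'\t');

    -- Explore the semi-structured logs with plain SQL, no Pig or MapReduce jobs required
    SELECT url, count(*) AS hits
    FROM raw_clickstream_logs
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 20;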

Functionality

When dealing with very large datasets, it is essential that you apply your predictive model to the data in place, cutting down on costly data transfers.

In-database analytics has been a core feature of Pivotal GPDB since 2009, enabling everything from standard statistical tests to complex machine learning algorithms with MADlib. This open source machine learning library helps our data science team to build predictive models on billions of rows of data and has interfaces in Python and R. With the release of Pivotal HD 2.0, HAWQ can now leverage the full power of MADlib.
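
As a minimal sketch of this in-database approach, a MADlib model can be trained with a single SQL call; the table and column names below are hypothetical, and the exact function signature depends on your MADlib version:

    -- Train a logistic regression model in place on a (hypothetical) transactions table
    SELECT madlib.logregr_train(
        'transactions',      -- source table
        'churn_model',       -- output table for the fitted model
        'churned',           -- dependent variable (boolean)
        'ARRAY[1, num_orders, days_since_last_order]'  -- independent variables
    );

    -- Inspect the fitted coefficients without moving any data out of the database
    SELECT coef, log_likelihood FROM churn_model;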

HAWQ adds SQL’s expressive power to Apache Hadoop®, so we can use familiar syntax and write complex queries that leverage features such as window functions. The wide variety of tools in the Apache Hadoop® ecosystem also remains available, as mentioned above.
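
For illustration, a window function makes it easy to compute a per-user running total directly in HAWQ or Pivotal GPDB; the events table here is hypothetical:

    SELECT user_id,
           event_time,
           amount,
           sum(amount) OVER (PARTITION BY user_id
                             ORDER BY event_time) AS running_total
    FROM events;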

Pivotal GPDB provides multi-level partitioning of your data, which is essential for accelerating analytics over large data volumes. Additionally, because HDFS is append-only, Pivotal GPDB should be used in situations where you need to update or delete existing records.
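
A minimal sketch of multi-level partitioning in Pivotal GPDB, assuming a hypothetical sales table partitioned by month and sub-partitioned by region:

    CREATE TABLE sales (
        trans_id   int,
        trans_date date,
        region     text,
        amount     numeric
    )
    DISTRIBUTED BY (trans_id)
    PARTITION BY RANGE (trans_date)
        SUBPARTITION BY LIST (region)
        SUBPARTITION TEMPLATE (
            SUBPARTITION americas VALUES ('americas'),
            SUBPARTITION emea     VALUES ('emea'),
            DEFAULT SUBPARTITION other_regions
        )
    (
        START (date '2014-01-01') INCLUSIVE
        END   (date '2015-01-01') EXCLUSIVE
        EVERY (INTERVAL '1 month')
    );

Queries that filter on trans_date and region then scan only the matching partitions, which is what makes interactive analytics over large volumes practical.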

Speed

The key to agile data science is to get answers to your analytics questions fast enough that you can ask the next question immediately.

Pivotal GPDB excels when speed and high throughput are essential. Parallel loading directly onto segment hosts dramatically reduces the time to insert large volumes of data.
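
In practice, this usually means pointing an external table at one or more gpfdist file servers so that every segment pulls data in parallel; the hosts, ports, and file names below are hypothetical:

    -- On the ETL host, start a gpfdist server, e.g.: gpfdist -d /data/staging -p 8081
    CREATE EXTERNAL TABLE ext_events (
        user_id    int,
        event_time timestamp,
        amount     numeric
    )
    LOCATION ('gpfdist://etl-host:8081/events_*.csv')
    FORMAT 'CSV' (HEADER);

    -- Every segment reads from gpfdist in parallel rather than through a single master
    INSERT INTO events SELECT * FROM ext_events;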

The new ORCA query optimiser provides both Pivotal GPDB and HAWQ with exceptional query performance. By using a cost-based optimiser, ORCA dramatically improves performance compared to the rule-based systems used by others. For interactive queries typically used in exploratory data science, Pivotal GPDB has always provided the response times needed for rapid iteration. Using ORCA, HAWQ now reaches similar levels of performance for many of these queries.
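
On releases where ORCA ships alongside the legacy planner, it can typically be toggled per session and its plans compared with EXPLAIN; the exact setting name may differ by version, and the query below is only an illustration against the hypothetical sales table above:

    SET optimizer = on;   -- use the ORCA cost-based optimiser for this session
    EXPLAIN
    SELECT region, sum(amount)
    FROM sales
    GROUP BY region;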

Cost

In the past, the argument for Apache Hadoop® over traditional RDBMSs was driven by cost. The tradeoff was that Apache Hadoop® was much cheaper, but could not connect as easily to well-established business tools and did not have a familiar SQL interface.

The release of the Pivotal Big Data Suite removes these concerns. The Pivotal Big Data Suite allows you the flexibility to choose between HAWQ and Pivotal GPDB with a single licensing structure based on processor cores used, not data size or data growth. Pivotal also offers an unlimited subscription to Pivotal HD so that you don’t need to fear license and support costs as your data grows.

Summary

Whether you need to have high scalability for an online trading system, perform object recognition over thousands of hours of video footage, or detect insider security threats on 300,000 corporate computing devices, the first step is to make your data ready for exploration and analysis.

The strengths of the different products in the Pivotal Big Data Suite are complementary and enable our data science team to achieve innovative and exciting results with our customers. To learn more about the considerations to take into account when choosing the right Pivotal Big Data Suite products for your use case, be sure to read our colleague Catherine Johnson’s blog post, “When To Use Apache Hadoop® vs In-Memory vs MPP” .

About the Author: Noelle Sio is a Principal Data Scientist at Pivotal, with a background in mathematics, statistics, and data mining with an emphasis on digital media. Her work has mainly focused on helping companies across multiple industry verticals extend their analytical capabilities by exploring and modeling digital data, specifically to create an underlying analytics framework to optimize a consumer’s experience. Noelle holds an M.S. in Applied Mathematics and an A.B. in Applied Mathematics and Physical Anthropology.
