Demonstrating the High-Performance Future of Hadoop at Strata

February 27, 2013 Paul M. Davis

The ten- and hundred-fold gains in productivity, speed, and accessibility that Pivotal HD brings to Hadoop took center stage at an afternoon breakout session at the Strata Conference on Wednesday, February 27th. Josh Klahr, Greenplum’s Vice President of Product Management, and Gavin Sherry, Senior Director of Engineering, offered a deep dive into the high-performance future of the platform. Emphasizing Greenplum’s belief that Hadoop represents a fundamental paradigm shift, Klahr said, "relational databases changed the data fabric decades ago, and we believe Hadoop will change the data fabric in a similar fashion."

The strengths of the platform are well known within the enterprise: Hadoop is flexible, scalable, inexpensive, and fault-tolerant. As a testament to that fault tolerance, Klahr shared that despite a disk failure during the Pivotal HD product launch on Monday morning, the demo went off without a hitch. Yet for all its advantages, Hadoop presents numerous productivity challenges in the enterprise. There’s a fundamental skills gap within businesses using Hadoop, which often find it challenging to integrate the platform into their existing data environment. Klahr observed that in many enterprises, "people skilled in traditional BI or visualization tools have difficulty working with Hadoop for interactive analysis." He cited a case study of a large retailer that boasts 2,000 servers, 32,000 CPU cores, 12,000 disk drives, and 128 TB of memory, yet uses only 20% of these prodigious resources.

Pivotal HD Architecture

Enter HAWQ, which Klahr called "The Crown Jewels of Greenplum," the result of 10 years of work on the Greenplum database. HAWQ, a scalable SQL database engine running on top of the Hadoop cluster, addresses the challenges of integrating Hadoop into an enterprise data environment. It combines a SQL-compliant interface and query optimizer with horizontal scalability, robust data management, support for common Hadoop formats, interactive queries, and deep analytics. Moreover, it can eliminate the need to move large datasets between a Hadoop cluster running on commodity hardware and a second environment for running SQL queries. That familiar process slows productivity, increases the risk of error, and forces analysts to pick and choose what data to pull out of the Hadoop cluster. "Our experience is that it’s a game-changer for customers who have been working with two different clusters," Klahr said.

HAWQ

Gavin Sherry shared deep insight into the architectural decisions made by Greenplum’s engineering team. The HAWQ Master Host runs directly on top of HDFS, parsing and optimizing queries and managing data. This model offers cost-based optimization, robust join support, enhanced data security, multiple table formats, and other advantages, and has been "tested on a 1000 node best-of-breed cluster with academics and significant corporate partners," Sherry said.

Dynamic Pipelining

Unlike other efforts to bring SQL queries to Hadoop, HAWQ is lightning fast and stable, thanks to Dynamic Pipelining, a parallel data flow framework that streams data back through the cluster. It includes a runtime execution environment as well as a runtime resource management layer, which ensures that even very demanding queries requiring heavy cluster utilization run to completion. When necessary, the query engine buffers data and spills intermediate results to local disk instead of HDFS, Sherry explained. Presenting a demo to the audience, Sherry marveled at its speed, stating "it blows my mind that we’ve been able to so seamlessly integrate with Hadoop and deliver these results."
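
To make the buffering idea concrete, here is a minimal Python sketch of a buffer that keeps rows in memory and spills them to a local temporary file once a limit is reached. It is not HAWQ's implementation: the `SpillingBuffer` class, its row-count limit, and the use of `pickle` are all assumptions made purely for illustration of the general technique.

```python
import pickle
import tempfile
from typing import Any, Iterator, List


class SpillingBuffer:
    """Hypothetical buffer: rows stay in memory until a limit is hit,
    then they are written to a temp file on local disk (not HDFS)."""

    def __init__(self, max_rows_in_memory: int) -> None:
        self.max_rows_in_memory = max_rows_in_memory
        self._memory: List[Any] = []
        self._spill_file = None  # created lazily on the first spill
        self.spilled_rows = 0

    def append(self, row: Any) -> None:
        # Make room first so memory never holds more than the limit.
        if len(self._memory) >= self.max_rows_in_memory:
            self._spill()
        self._memory.append(row)

    def _spill(self) -> None:
        if self._spill_file is None:
            self._spill_file = tempfile.TemporaryFile()
        self._spill_file.seek(0, 2)  # always write at the end of the file
        for row in self._memory:
            pickle.dump(row, self._spill_file)
        self.spilled_rows += len(self._memory)
        self._memory.clear()

    def __iter__(self) -> Iterator[Any]:
        # Replay spilled rows first (oldest), then the rows still in memory.
        if self._spill_file is not None:
            self._spill_file.seek(0)
            for _ in range(self.spilled_rows):
                yield pickle.load(self._spill_file)
        yield from self._memory


buf = SpillingBuffer(max_rows_in_memory=3)
for i in range(10):
    buf.append(i)
print(list(buf), buf.spilled_rows)
```

Rows come back in insertion order whether they were held in memory or spilled, so a downstream consumer does not need to know where a given row was stored.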

Klahr returned to demo HAWQ from a more traditional BI perspective. He used his data visualization tool of choice, Tableau, to run an interactive query on a billion rows of data, delivered in real time from a Hadoop cluster in Las Vegas. He noted that he was using an off-the-shelf version of Tableau, and that the software sees the cluster as a traditional SQL database, with no modifications required.

HAWQ Benchmarks vs. Hive and Impala

Closing out the session, Klahr and Sherry presented a number of impressive benchmark results. Greenplum compared a set of queries run on Hive, Impala, and HAWQ, using an industry-standard dataset and two identical clusters, one running Pivotal HD and one running CDH. "A query you could do in 24 hours in Hive, we’re seeing them being done in 3 minutes," Klahr said. In horizontal scale testing, he added, "with HAWQ, as we doubled the number of nodes, we cut the query processing time in half." The increases in speed, productivity, and usability are transformational, Klahr emphasized. He closed the session by reiterating "The Importance of 10x," citing the observation of Wired’s Steven Levy: "Thousand-percent improvement requires rethinking problems entirely, exploring the edges of what’s technically possible, and having a lot more fun in the process."
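
For a sense of scale, the quoted figures can be worked through with some back-of-the-envelope arithmetic. The Python sketch below computes the speedup implied by "24 hours" versus "3 minutes" and models the "double the nodes, halve the time" remark as ideal linear scaling. The function names and the 8-node, 180-second starting point are hypothetical, chosen only for illustration; they are not benchmark data from the session.

```python
def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Return how many times faster the new run is than the baseline."""
    return baseline_seconds / new_seconds


def ideal_query_time(base_seconds: float, base_nodes: int, nodes: int) -> float:
    """Query time under ideal linear scaling: doubling nodes halves the time."""
    return base_seconds * base_nodes / nodes


hive_seconds = 24 * 60 * 60  # "24 hours in Hive"
hawq_seconds = 3 * 60        # "3 minutes"
print(speedup(hive_seconds, hawq_seconds))  # 480.0

# Hypothetical starting point: a 180-second query on 8 nodes.
for nodes in (8, 16, 32):
    print(nodes, ideal_query_time(180, 8, nodes))
```

Taken at face value, the quoted example is a 480-fold improvement, well beyond the 10x threshold Klahr highlighted.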