Performance Benchmark: Pivotal HAWQ Beats Impala & Apache Hive—Part 1

July 21, 2015 Dan Baskette

In the world of big data and analytics, query speed makes a huge difference.

But, speed isn’t the whole picture. The real experts in the industry are now rightly expanding their focus to consider the “quality and completeness” of SQL on Hadoop implementations, and this is why we decided to revisit recent benchmarks for a few key players in the market.

“Quality” means looking at speed along with overall query effectiveness, which has a major impact on a company’s ability to maximize its benefits from big data analytics. That, in turn, can directly affect profit, scale, complexity, developer productivity, and operational costs. Poor technical decisions in this area ultimately waste time and money. They can even put revenue at risk, as explained at the end of the technical benchmark below.

One thing was clear as a macro observation after these new tests: hands down, HAWQ beats Cloudera Impala and shows many advantages over Apache Hive. This short bullet list highlights the outcomes of the technical details explained further below:

- HAWQ runs the set of TPC-DS queries in roughly half the wall clock time that Impala can. By design, HAWQ also finishes most queries in a fraction of the time that Apache Hive needs.
- Overall, HAWQ beats Impala by 454% in performance, measured by comparing the geometric mean across the TPC-DS queries.
- Compared to Hive, HAWQ provides an additional 344% performance improvement on complex queries.
- Importantly, Impala and Apache Hive™ do not support all 99 of the standard TPC-DS queries. Impala effectively finished 62 out of 99 queries, while Hive was able to complete 60 queries. These queries represent the minimum market requirements, and HAWQ runs 100% of them natively.
- Impala and Apache Hive™ also lack key performance-related features, making work harder and approaches less flexible for data scientists and analysts.
  - In more detail, HAWQ showed much better performance than Impala on queries with large joins, rollups, or complex sub-selects. These queries are often used by BI tools or analysts.
- SQL-related workarounds were not necessary with HAWQ at all, so test and development iterations were much quicker and had fewer issues.
- HAWQ also provides more robust query partitioning (supporting range and multi-level partitions) and performs significantly better on certain classes of workloads (BI roll-ups, predictive analytics, and machine learning), which are the most critical use cases for enterprise reporting and big data analytics.
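
To make the geometric-mean comparison in the list above concrete, here is a minimal Python sketch of how such a figure might be computed. It is not the harness used for this benchmark: the function names, the engine labels, and every runtime below are hypothetical, and the choice to compare only the queries that both engines completed is an assumption.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed through logarithms."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("all values must be positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def relative_performance(baseline_runtimes, candidate_runtimes):
    """Ratio of geometric-mean runtimes (baseline / candidate).

    Only queries present in both dicts are compared (an assumption of
    this sketch). A result above 1.0 means the candidate was faster.
    """
    common = sorted(set(baseline_runtimes) & set(candidate_runtimes))
    if not common:
        raise ValueError("no queries in common")
    base = geometric_mean(baseline_runtimes[q] for q in common)
    cand = geometric_mean(candidate_runtimes[q] for q in common)
    return base / cand


# Hypothetical per-query runtimes in seconds; not measured results.
engine_a = {"q3": 40.0, "q7": 90.0, "q19": 10.0}
engine_b = {"q3": 10.0, "q7": 30.0, "q19": 5.0, "q27": 12.0}

ratio = relative_performance(engine_a, engine_b)
print(f"engine_b is {ratio:.2f}x faster by geometric mean")
```

The geometric mean is a common choice for this kind of summary because a single very long query does not dominate the result the way it would in an arithmetic mean.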

The rest of the post below covers the technical aspects of Impala and Apache Hive’s™ incomplete support for TPC-DS and how HAWQ adds value on top of Apache Hive. We get into very specific examples of subquery limitations, cover efficient partitioning approaches, outline the performance test process, and show diagrams of the results between HAWQ, Apache Hive™, and Impala. It’s been about a year since our previous testing, in which we published a SIGMOD paper, so we were excited to check in on how things had changed.

HAWQ, Hive, and Impala Performance Environment Configuration

My first test was with a well-known tool in the market, Cloudera Impala. Because of Pivotal’s recent partnership announcements with Hortonworks, I also decided to run with Hive 0.14 on Apache Tez™ to develop an understanding of where Pivotal HAWQ 1.3 provides value above and beyond today’s Hive, which is one of the most prominently used tools in Hadoop and has shown great progress.

The configuration tested was a 15-node cluster with 3 master nodes and 12 data nodes. Each node was configured with 16 CPU cores, 64 GB RAM, 10Gb networking, and 22 x 900 GB SAS disk drives. This configuration is somewhat light on memory, but in return it delivers an abundance of disk I/O. The combination also had some interesting effects that we will dive into below, but I believe it adequately represents what we see in the user community, and it also demonstrates how hardware choice can lead to specific benefits and drawbacks.
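
As a rough illustration of why this hardware is light on memory but rich in disk, the short Python sketch below adds up the per-node figures quoted above. The names `NodeSpec` and `cluster_totals` are hypothetical, and the sketch assumes decimal units (1 TB = 1000 GB), counts only the 12 data nodes, and reports raw capacity before any replication or filesystem overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Per-node hardware as described in the post."""
    cpu_cores: int
    ram_gb: int
    disks: int
    disk_gb: int


def cluster_totals(node, data_nodes):
    """Aggregate raw resources across the data nodes only (an assumption)."""
    return {
        "cores": node.cpu_cores * data_nodes,
        "ram_gb": node.ram_gb * data_nodes,
        "spindles": node.disks * data_nodes,
        # Decimal units assumed: 1 TB = 1000 GB.
        "raw_disk_tb": node.disks * node.disk_gb * data_nodes / 1000,
    }


node = NodeSpec(cpu_cores=16, ram_gb=64, disks=22, disk_gb=900)
totals = cluster_totals(node, data_nodes=12)

dataset_tb = 30  # TPC-DS dataset size used in the test
dataset_to_ram = dataset_tb * 1000 / totals["ram_gb"]

print(totals)
print(f"dataset is about {dataset_to_ram:.0f}x the aggregate data-node RAM")
```

Under these assumptions the data nodes hold 768 GB of RAM in total against 264 disk spindles, so a dataset of this size cannot be served from memory and disk throughput matters a great deal.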

The test itself was run using a 30TB TPC-DS dataset. This is not an official TPC-DS benchmark run for a variety of reasons, but primarily I was looking for a common set of queries that all platforms supported natively, and that set was a subset of the TPC-DS query set.
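
Selecting a common set of queries amounts to intersecting the sets of queries each engine can run. The Python sketch below shows one way that selection might look. The function names and the query IDs listed for Impala and Hive are hypothetical placeholders, not the actual supported queries; only the completion counts (62 and 60 out of 99) and HAWQ running all 99 come from this post.

```python
def common_queries(supported_by_engine):
    """Return the sorted query IDs that every engine supports."""
    sets = [set(ids) for ids in supported_by_engine.values()]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


def coverage(completed, total=99):
    """Fraction of the TPC-DS query set that an engine completed."""
    return completed / total


# Hypothetical support lists, for illustration only.
supported = {
    "hawq": range(1, 100),         # all 99 queries, as stated in the post
    "impala": {3, 7, 19, 27, 42},  # placeholder IDs
    "hive": {3, 7, 19, 34, 42},    # placeholder IDs
}

shared = common_queries(supported)
print("queries to compare:", shared)

# Completion counts reported in this post.
print(f"Impala coverage: {coverage(62):.1%}")
print(f"Hive coverage:   {coverage(60):.1%}")
```

Restricting the comparison to the shared subset keeps the timing results like for like, at the cost of leaving out the queries that only some engines can run.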

So, we’ve covered the high-level observations, the testing environment, and the requirements. In my next blog post, we’ll take a deeper dive into those findings and the actual results behind them.

Learn More

Find out more about Pivotal HAWQ Product Info, Documentation, and Downloads
Read more Pivotal HAWQ and Big Data Suite Blog Articles
Check out more information on the Pivotal Query Optimizer
Read our newsletters on big data, cloud native platforms (PaaS), and data science

Editor’s Note: ©2015 Pivotal Software, Inc. All rights reserved. Pivotal HAWQ is a trademark and/or registered trademark of Pivotal Software, Inc. in the United States and/or other countries. Apache, Apache Hive, and Apache Tez are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

About the Author

Dan is Director of Technical Marketing at Pivotal with over 20 years of experience in various pre-sales and engineering roles with Sun Microsystems, EMC Corporation, and Pivotal Software. In addition to his technical marketing duties, Dan is frequently called upon to roll up his sleeves for various "Will this work?" type projects. Dan is an avid collector of Marvel Comics gear and you can usually find him wearing a Marvel shirt. In his spare time, Dan enjoys playing tennis and hiking in the Smoky Mountains.