In the world of big data and analytics, query speed makes a huge difference.
But speed isn’t the whole picture. Industry experts are now rightly expanding their focus to the quality and completeness of SQL on Hadoop implementations, and this is why we decided to revisit recent benchmarks for a few key players in the market.
“Quality” means looking at speed alongside overall query effectiveness, which has a major impact on a company’s ability to get the most from big data analytics. That, in turn, directly affects profit, scale, complexity, developer productivity, and operational costs. Poor technical decisions in this area ultimately waste time and money. They can even put revenue at risk, as explained at the end of the technical benchmark below.
One thing was clear as a macro observation after these new tests: hands down, HAWQ beats Cloudera Impala and shows many advantages over Apache Hive. This short bullet list highlights the outcomes; the technical details are explained further below:
- HAWQ runs the set of TPC-DS queries in roughly half the wall clock time that Impala does. By design, HAWQ also finishes most queries in a fraction of the time required by Apache Hive.
- Measured by the geometric mean across the TPC-DS queries, HAWQ outperforms Impala by 454% overall.
- Compared to Hive, HAWQ delivers a 344% performance improvement on complex queries.
- Importantly, Impala and Apache Hive™ do not support all 99 of the standard TPC-DS queries. Impala effectively finished 62 of the 99 queries, while Hive was able to complete 60. These queries represent the minimum market requirements, and HAWQ runs 100% of them natively.
- Impala and Apache Hive™ also lack key performance-related features, making work harder and less flexible for data scientists and analysts.
- In more detail, HAWQ showed much better performance than Impala on queries with large joins, rollups, or complex sub-selects, the kinds of queries often used by BI tools and analysts.
- SQL-related workarounds were not necessary with HAWQ whatsoever, so test and development iterations were much quicker and had fewer issues.
- HAWQ also provides more robust query partitioning (supporting range and multi-level partitions) and performs significantly better on certain classes of workloads—BI roll-ups, predictive analytics, and machine learning—the most critical use cases for enterprise reporting and big data analytics.
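To make the geometric-mean comparison in the bullets above concrete, here is a minimal Python sketch of how such a figure is derived. The per-query runtimes below are made-up illustrative numbers, not our measured results:

```python
from math import prod

def geometric_mean(times):
    """Geometric mean of per-query runtimes (seconds)."""
    return prod(times) ** (1.0 / len(times))

# Hypothetical per-query wall-clock times for two engines (seconds).
engine_a = [12.0, 30.0, 8.0, 45.0]     # faster engine
engine_b = [50.0, 160.0, 40.0, 210.0]  # slower engine

gm_a = geometric_mean(engine_a)
gm_b = geometric_mean(engine_b)

# Ratio of geometric means, expressed as a percentage advantage.
advantage = (gm_b / gm_a - 1.0) * 100
print(f"geometric means: {gm_a:.1f}s vs {gm_b:.1f}s, advantage {advantage:.0f}%")
```

The geometric mean is preferred over the arithmetic mean for benchmark suites like TPC-DS because it keeps one or two very long-running queries from dominating the overall score.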
The rest of the post below covers the technical aspects of Impala’s and Apache Hive’s™ incomplete support for TPC-DS and how HAWQ adds value on top of Apache Hive. We get into very specific examples of subquery limitations, cover efficient partitioning approaches, outline the performance test process, and show diagrams of the results for HAWQ, Apache Hive™, and Impala. It’s been about a year since our previous testing, in which we published a SIGMOD paper, so we were excited to check in on how things had changed.
HAWQ, Hive, and Impala Performance Environment Configuration
My first test was with a well-known tool in the market, Cloudera Impala. Because of Pivotal’s recent partnership announcements with Hortonworks, I also decided to run Hive 0.14 with Apache Tez™ to develop an understanding of where Pivotal HAWQ 1.3 provides value above and beyond today’s contemporary Hive, which is one of the most prominently used tools in Hadoop and has shown great progress.
The configuration tested was a 15 node cluster with 3 master nodes and 12 data nodes. Each node was configured with 16 CPU cores, 64 GB RAM, 10Gb networking, and 22 x 900 GB SAS disk drives. This configuration is somewhat light on memory, but in return delivered an abundance of disk IO. The combination also had some interesting effects that we will dive into below, but I believe it adequately represents what we see in the user community, and it also demonstrates how hardware choice can lead to specific benefits and drawbacks.
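As a back-of-envelope check on why this configuration is disk-heavy relative to memory, the node counts and sizes above work out as follows (a quick sketch, using only the numbers stated for the 12 data nodes):

```python
# Cluster sizing from the configuration described above (data nodes only).
data_nodes = 12
cores_per_node = 16
ram_gb_per_node = 64
disks_per_node = 22
disk_gb = 900

total_cores = data_nodes * cores_per_node                          # 192 cores
total_ram_gb = data_nodes * ram_gb_per_node                        # 768 GB RAM
total_raw_disk_tb = data_nodes * disks_per_node * disk_gb / 1000   # 237.6 TB raw

# Only 4 GB of RAM per core, but ~19.8 TB of raw disk per node:
# light on memory, rich in disk IO.
print(total_cores, total_ram_gb, round(total_raw_disk_tb, 1))
```

At roughly 237 TB of raw disk across the data nodes, the cluster comfortably holds the 30TB dataset described below even after replication and working space, while the 4 GB of RAM per core explains the memory pressure noted above.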
The test itself was run using a 30TB TPC-DS dataset size. This is not an official TPC-DS benchmark for a variety of reasons; primarily, I was looking for a common set of queries that all platforms supported natively, and that set was a subset of the full TPC-DS query set.
So, we’ve covered the high-level observations, the testing environment, and the requirements. In my next blog post, we’ll take a deeper dive into those findings and the actual results behind them.
- Find out more about Pivotal HAWQ Product Info, Documentation, and Downloads
- Read more Pivotal HAWQ and Big Data Suite Blog Articles
- Check out more information on the Pivotal Query Optimizer
- Read our newsletters on big data, cloud native platforms (PaaS), and data science
Editor’s Note: ©2015 Pivotal Software, Inc. All rights reserved. Pivotal HAWQ is a trademark and/or registered trademark of Pivotal Software, Inc. in the United States and/or other countries. Apache, Apache Hive, and Apache Tez are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
About the Author
Dan is Director of Technical Marketing for Data and Analytics at Pivotal, with over 20 years of experience in various pre-sales and engineering roles at Sun Microsystems, EMC Corporation, and Pivotal Software. In addition to his technical marketing duties, Dan is frequently called upon to roll up his sleeves for various "Will this work?" type projects. Dan is an avid collector of Marvel Comics gear, and you can usually find him wearing his Marvel Vans. In his spare time, Dan enjoys playing tennis and hiking in the Smoky Mountains.