Introducing Pivotal HD

February 25, 2013 Donald Miner

I’m very excited to finally be able to talk in public about something we’ve been working on for quite some time. In 2011 we announced our first version of Greenplum Hadoop, GPHD. This week, we’ll have a number of events to introduce Pivotal HD, our new Hadoop distribution. Pivotal HD isn’t just about making Hadoop better, it’s significantly expanding Hadoop capabilities as a data platform. In this first post, I’ll explore the core functionality enabled by the major new component of Pivotal HD: HAWQ , the first, fully functional, high-performance, all-encompassing relational database that runs in Hadoop.

Pivotal HD’s core Hadoop distribution includes all the usual suspects, except that we added truncate capability to HDFS to support database constructs. Pivotal HD has HDFS, MapReduce, Pig, Hive, and Mahout, as well as tools we’ve added to the ecosystem, which include:

Command Center – manages and monitors your HDFS, MapReduce, and HAWQ.
Virtualization extensions and Isilon support – The ability to deploy Hadoop in a virtualized environment and/or on Isilon for particularly challenging enterprise deployments.
ICM (Installation/Configuration/Management) – Our subsystem for managing and administering several Pivotal HD Hadoop clusters through many different interfaces. ICM offers many hooks to integrate Hadoop clusters into your existing IT management infrastructure without getting in the way of your job.
Spring Hadoop – Helps developers by integrating Hadoop into the Spring framework. This also includes Spring Batch, which can ease the job management and execution tasks on Pivotal HD.

These features will be familiar to users of the Greenplum HD 1.2 release, where they’ve been available for the past few months.

Pivotal HD Command Center

COMMAND CENTER

What takes the Pivotal HD Hadoop distribution to the next level is a major new component that we are very proud of: HAWQ, a relational database that runs atop of HDFS. In short, we gutted the code where our solid, mature, fast MPP Greenplum Database was writing to disk — instead, it writes directly to HDFS.

HAWQ is a relational database that runs atop of HDFS. It has its own execution engine, separate from MapReduce, and manages its own data, which is stored on HDFS. HAWQ bridges the gap, a SQL interface layer on top of HDFS that also organizes data. It also boasts a core feature called GPXF (Greenplum Extension Framework) that allows HAWQ to read data from flat files in HDFS that are stored in just about any common format (delimited text, sequence files, protobuf and avro.) In addition, it has native support for HBase, supporting HBase predicate pushdown, hive connectivity, and offering a ton of intelligent features to retrieve HBase data.

Here is just a quick look at the features of HAWQ, many of which have been in the wild, in our Greenplum Database product:

HAWQ draws from the 10 years of development on the Greenplum Database product. Features include:

Columnar or row-oriented storage to provide benefits based on different workloads. This is transparent to the user and is specified when the table is created. HAWQ figures out how to shard, distribute, and store your data.
Seamless partitioning allows you to separate your tables on a partition key, enabling super-fast scans of subsets of your data by pruning off portions of your data that you don’t need in a query. Common partition schemes are on dates, regions, or anything you commonly filter on.
Our parallel query optimizer and planner take SQL queries that look like any other you’ve written before, then intelligently looks at table stats to figure out the best way to return data. For example, it determines what kind of join to use on the fly based on the size of the data sets.
Table-by-table specification of distribution keys allow you to design your table schemas to take advantage of node-local joins and group bys, though HAWQ will perform will even without this.
Fully compliant and robust SQL92 and SQL99 support. We also support the SQL 2003 OLAP extensions. We are 100% compatible with PostgreSQL 8.2, and have added significant amounts of functionality and functions past that.
The full suite of administration tools from our Greenplum Database makes the system easy to maintain, install, and use.
Full ODBC and JDBC support allow you to plug in your existing BI tools (Tableau, QlikView, etc.) or applications to hook into HAWQ. You can also plug in a BI tool through our ODBC or JDBC connector, to access HBase data or HDFS data via GPXF.

All of this, inside your own Hadoop cluster! These features and functionalities stretch what your Hadoop cluster can do, further into the analytical and data service realm, where Hadoop struggles to do well due to its design as a batch data processing system primarily.

Hadoop complements our database technology very well. It handles unstructured data, data archive and storage, and schema-less process. Together, Hadoop and HAWQ cover all the bases, empowering you to use the right tool for the right job within your same cluster.

So, how does HAWQ work? We have what we call “segment servers” manage a shard of each table. Several segment servers run on each data node of your cluster. This shard of data, however, is completely stored within HDFS. We have a “master” node that has the job of storing the top-level metadata, as well as building the query plan and pushing the node-local queries down to the segment servers.

When a query starts up, the data is loaded out of HDFS and into the HAWQ execution engine. HAWQ follows MPP architecture, streaming data through stages in a pipeline, instead of spilling and check pointing to disk (like MapReduce). Also, the segment servers are always running, so there is no spin-up time.

In our initial tests, we find that HAWQ is hundreds of times faster than Hive. We have also done some initial testing against some of our competing SQL-on-Hadoop solutions and are orders of magnitude faster for some queries (especially group by and joins, which are rather important).

What results is a near real time analytical SQL database that runs on Hadoop. You can run queries with sub-second response time, while at the same time running over much larger datasets and processing with the full expressiveness of SQL, in the same engine. Meanwhile, you don’t have to sacrifice or compromise anything from your Hadoop/MapReduce side of the house. They work together to get the job (whatever it is) done.

To learn more about Pivotal HD:

About the Author

Biography

How Hadoop Can Disrupt the Database Industry

Hardly any book has attracted more attention among software companies than Clayton Christensen’s “The Innov...

How to Use Mobile to Drive Conversions

According to Google’s thinkmobile presentation, almost four out of five U.S. shoppers use their smartphone ...

Introducing Pivotal HD

About the Author

Previous

Next

Introducing Pivotal HD

About the Author

Previous

Next

Related content in this Stream

This 7-part blog series provides a roadmap for architecting a data science platform using VMware Tanzu. We'll delve into the building blocks of a successful platform that drives data-driven insights.

How VMware Tanzu CloudHealth helps customers uncover spiraling AWS Extended Support charges.

VMware Tanzu enhances Spring development with simplified operations, accelerated innovation, seamless microservices transition, increased security, and effortless scaling.

This 7-part blog series provides a roadmap for architecting a data science platform using VMware Tanzu. We'll delve into the building blocks of a successful platform that drives data-driven insights.

This 7-part blog series provides a roadmap for architecting a data science platform using VMware Tanzu. We'll delve into the building blocks of a successful platform that drives data-driven insights.

This 7-part blog series provides a roadmap for architecting a data science platform using VMware Tanzu. We'll delve into the building blocks of a successful platform that drives data-driven insights.

Bitnami-packaged open source software is loved by developers for its ease of use, which enables developers to directly pull a Bitnami package and seamlessly start using it with little effort.

VMware Tanzu announces the General Availability of AWS Commitment Discount Recommendations, which provides recommendations for all reservable services in AWS through VMware Tanzu CloudHealth.

Introducing VMWare Tanzu Data Hub, a self-managed Database as a Service (DBaaS) Platform, providing enterprises a way to host their internal DBaaS offering for internal business users.

In the cloud-native landscape, MCAs drive seamless compliance integration. Their expertise ensures proactive security measures align with regulatory standards for sustained innovation & collaboration.

Tanzu Application Platform brings innovation faster with more frequent feature updates. With 1.9, take advantage of enhanced DORA metrics visibility and improved compliance options for companies.

We’re excited to share some great news! Spring Academy Pro content is now free. It will be available to everyone who registers a work, vocational, or educational email address.

March 28, 2024, marks the official minor release date of Spring Cloud Gateway for K8s version 2.2, and it's set to optimize how developers protect access to their GraphQL services.

We are excited to announce that VMware Tanzu Application Service 6.0 is now generally available!

Get a clear picture of your OSS supply chain, and the risks you face from your open source software dependencies, using the all-new Tanzu OSS Health Assessment.

Trivy can now utilize CSAF VEX data to filter out false positives in CVE reports, maximizing the value of VEX documents in VMware Tanzu Application Catalog.

Bitnami-packaged open source software container images available in DockerHub are now signed by Notation, an implementation of the Notary Project specifications and a CNCF-incubating project.