Pivotal Greenplum 6, Now GA, Uses PostgreSQL to Reimagine Modern Analytics at Scale

Ivan Novick

Pivotal Greenplum 6 is now generally available. Check out the docs, then download it from Pivotal Network.

Over the past 16 years, Greenplum has helped enterprises analyze data more effectively. These firms use Greenplum to increase revenue, decrease cost, and add across-the-board efficiencies to their operations. That’s the power of a scale-out high-performance analytics data warehouse! 

PostgreSQL, the wildly popular open-source database, is at the heart of Pivotal Greenplum. Greenplum scales out to large clusters using a specialized interconnect between the Postgres instances and a cluster-aware SQL optimizer. That’s how Greenplum can tackle traditional data warehouse use cases, such as star schemas and enterprise data warehouses, as well as mixed-modal analytics with integrated advanced analytics. Whatever the shape of your data or workload (geospatial, text, natural language, structured, semi-structured, or graph analytics), Greenplum is up to the challenge. Want to embrace machine learning, deep learning and artificial intelligence? Greenplum has you covered.

Customers tell us that Pivotal Greenplum is ideal for workloads of 100 terabytes to 50+ petabytes of data. Enterprises use Greenplum to run high volumes of analytical and mixed workload SQL queries.

“My journey with Pivotal began in 2014 at Morgan Stanley, where I am the global head of database engineering. We wanted to address two challenges: 1) The ever-increasing volume and velocity of data that needed to be acquired, processed, and stored for long periods of time (more than seven years, in some cases) 2) The need to satisfy the growing ad hoc query requirements of our business users.”

 

Howard Goldberg, Executive Director, Global Head of Database Engineering, Morgan Stanley, from the Foreword to “Data Warehousing with Greenplum, 2nd Edition”

And the product keeps getting better, thanks to the open-source community. Thousands of end users worked with hundreds of engineers around the world to deliver Greenplum 6. The open community is a big reason why Greenplum is widely hailed by industry analysts.

Enterprises also enjoy the convenience of running Pivotal Greenplum on a range of affordable hardware options. Run it in your data center, or in the public cloud. Greenplum runs great on bare metal, virtual machines, and containers orchestrated by Kubernetes. We've also launched a new infrastructure option that's proving quite popular: Greenplum Building Blocks. It's an engineered system optimized for advanced analytics workloads, built on Dell EMC hardware.

Thanks to the community’s hard work, we can help folks like Howard from Morgan Stanley  solve their most pressing data warehousing and analytics challenges.

So what’s new in Pivotal Greenplum 6? Glad you asked! Peruse the 13 highlights below, then download and install the open-source version.

New Capabilities via Postgres 9.4

Greenplum manages and processes petabytes of data with parallel query execution. How? By networking hundreds of instances of Postgres via a data pipeline and UDP interconnect. It’s all orchestrated by a parallel query optimizer that’s built for massive scale (GPORCA). Greenplum 6 bundles an updated version of Postgres, advancing from v8.3 to v9.4. Consequently, Greenplum users enjoy Postgres features such as the JSONB and hstore data types, which enable optimized storage of semi-structured documents that can be searched, analyzed and queried. The updated database kernel also handles core data processing more efficiently, so queries execute faster.

Massive Performance Improvements For OLTP and Short Query Workloads

Real-world workloads for data warehouses and enterprise analytical databases require a combination of large query, short query, and online transaction processing (OLTP) for a true hybrid transaction and analytical system. That’s why Pivotal Greenplum has always had OLTP and transaction semantics with ACID-safe properties. In version 6, we improved the performance of these workloads. You’ll like the results! We’ve achieved up to 70x performance gains for these workloads. (Check out the Greenplum Blog for the specifics.) Transaction speeds ranging from 4,300 transactions per second to 220,000 transactions per second are possible, depending on the workload and hardware environment!

Replicated Tables Improve Performance by Avoiding Data Movement

Replicated tables improve the performance of traditional data warehouse workloads that use data models such as star schema. Now, Greenplum 6 pre-broadcasts replicated tables to all database segments in the cluster. This means that dimension tables can be joined locally with fact tables. This avoids data movement across the cluster, resulting in speedier performance. More information on data distribution in version 6 is here.
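To make the data-movement point concrete, here is a minimal Python sketch. It is a toy model, not Greenplum code: the four-segment cluster, the modulo placement rule, and the table contents are all assumptions made for illustration. It counts how many fact rows would have to leave their home segment to meet a hash-distributed dimension row, and shows that the count is zero when every segment holds its own copy of the dimension. (A real planner might broadcast the dimension instead of moving fact rows; either way, something crosses the network.)

```python
NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key: int) -> int:
    """Toy placement rule (an assumption): key modulo segment count."""
    return key % NUM_SEGMENTS


def join_with_distributed_dim(facts, dim):
    """Dimension is hash-distributed on product_id.

    A fact row stored on a different segment than its matching
    dimension row has to be moved before the join can happen.
    """
    moved = 0
    results = []
    for order_id, product_id, amount in facts:
        if segment_for(order_id) != segment_for(product_id):
            moved += 1  # row crosses the interconnect
        results.append((order_id, dim[product_id], amount))
    return results, moved


def join_with_replicated_dim(facts, dim):
    """Every segment holds a full copy of the dimension, so each
    fact row is joined on the segment where it already lives."""
    copies = {seg: dict(dim) for seg in range(NUM_SEGMENTS)}
    moved = 0  # never incremented: all lookups are local
    results = []
    for order_id, product_id, amount in facts:
        home = segment_for(order_id)
        results.append((order_id, copies[home][product_id], amount))
    return results, moved


# Hypothetical star-schema data: 12 orders over 3 products.
dim = {0: "widget", 1: "gadget", 2: "gizmo"}
facts = [(order_id, order_id % 3, 10 * order_id) for order_id in range(12)]

rows_a, moved_distributed = join_with_distributed_dim(facts, dim)
rows_b, moved_replicated = join_with_replicated_dim(facts, dim)
```

Both joins return the same rows; only the number of rows that had to move differs.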

Cluster Expansion Completes Faster

Greenplum 6 clusters can now utilize newly added hardware much more quickly. Data hashing in v6 has been updated to use a new algorithm that minimizes data movement when the number of servers in the cluster changes. Cluster size changes no longer require shuffling all of the data in the cluster. Instead, movement is limited to the amount of data needed to fill the newly added hardware.
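This post does not name the new hashing algorithm. As an illustration of the general idea only, the sketch below compares plain modulo placement with jump consistent hashing (Lamping and Veach), one well-known technique with the "move only what the new hardware needs" property. The key range and the segment counts are assumptions chosen for the demo.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF


def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Python port of jump consistent hash (Lamping and Veach)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    key &= MASK_64
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step, as in the published algorithm
        key = (key * 2862933555777941757 + 1) & MASK_64
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


def modulo_placement(key: int, num_buckets: int) -> int:
    """Naive placement: most keys change bucket when the count changes."""
    return key % num_buckets


def fraction_moved(place, keys, old_n: int, new_n: int) -> float:
    """Share of keys whose bucket differs between the two cluster sizes."""
    moved = sum(1 for k in keys if place(k, old_n) != place(k, new_n))
    return moved / len(keys)


keys = list(range(10_000))  # hypothetical distribution-key values
moved_modulo = fraction_moved(modulo_placement, keys, 8, 9)
moved_jump = fraction_moved(jump_consistent_hash, keys, 8, 9)
```

With jump consistent hashing, a key either stays where it was or moves to the newly added bucket, so growing from 8 to 9 buckets relocates roughly one ninth of the keys. Modulo placement relocates most of them.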

Admins Now Have Greater Control Over the Allocation of Disk Space

Administrators can now set quotas for disk space usage at both the schema and user role level. This gives DBAs greater control of system utilization in multi-tenant environments. And it prevents specific users or workloads from starving other users and workloads of disk space.  The DBA can rest comfortably knowing limits have been set for users in the system and will be enforced by Greenplum.

Faster External Tables Performance on Amazon S3

Amazon Web Services provides S3 Select, designed to return only the data you need from a storage object. Removing unnecessary data movement can dramatically improve performance and reduce bandwidth costs. The Platform Extension Framework (PXF) in Greenplum 6 has been enhanced to access the S3 Select API directly, providing higher performance for selective queries on CSV and Parquet data in Amazon S3.
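The benefit of filtering at the storage layer can be sketched without touching AWS at all. The Python below is a local simulation, not the S3 Select API or PXF: the CSV layout, the column names, and the helper functions are hypothetical. It compares the bytes a client receives when it fetches a whole object with the bytes it receives when the row filter and column projection run where the data lives.

```python
import csv
import io
from typing import Sequence


def make_object(num_rows: int) -> bytes:
    """Build a hypothetical CSV 'object'; one row in ten is region=emea."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "region", "amount"])
    for i in range(num_rows):
        writer.writerow([i, "emea" if i % 10 == 0 else "amer", i * 2])
    return buf.getvalue().encode("utf-8")


def fetch_whole_object(obj: bytes) -> bytes:
    """Baseline: ship every byte, then filter on the client."""
    return obj


def select_at_source(obj: bytes, column: str, value: str,
                     keep: Sequence[str]) -> bytes:
    """Stand-in for storage-side filtering: return only the matching
    rows and only the requested columns."""
    reader = csv.DictReader(io.StringIO(obj.decode("utf-8")))
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(keep)
    for row in reader:
        if row[column] == value:
            writer.writerow([row[c] for c in keep])
    return out.getvalue().encode("utf-8")


obj = make_object(1000)
full = fetch_whole_object(obj)
filtered = select_at_source(obj, "region", "emea", ["id", "amount"])
bytes_saved = len(full) - len(filtered)
```

The more selective the query, the larger the gap between the two byte counts, which is why selective queries benefit most.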

We’ve also done quite a bit of performance work with our friends on AWS across-the-board for Greenplum. Here’s how you can get started with Pivotal Greenplum in the public cloud.

Write Ahead Logs Boost Stability (and Peace of Mind)

Greenplum 6 now uses database Write Ahead Logs to capture all data and metadata changes on disk. (The feature builds on the write-ahead logging in Postgres 9.4.)  These changes are replicated synchronously to other nodes in a cluster, to help ensure that the database system is always online and has redundant copies of data storage for fault tolerance.

Better Data Storage Compression Trims Your Infrastructure Cost

Greenplum 6 compresses data faster. Consequently, you can store more data in the same hardware footprint, lowering costs. This new capability is based on Zstandard compression, an open-source compression algorithm developed by Facebook.

Improved Data Access Control Protects PII

Administrators can now control data access at the column level, in addition to table, schema and database granularity. These controls are based on user roles and permissions.  User queries attempting to access unauthorized columns will be denied.

This is especially helpful when working with Personally Identifiable Information (PII) such as credit card numbers or national identification numbers, stored in columns. The feature gives you the best of both worlds: access control for unauthorized users, and the convenience of keeping the sensitive information in the same table as other related (but less sensitive) data. 

Easier to Install Open Source Distribution 

The Greenplum community website now hosts pre-compiled binaries and installers for multiple Linux distributions: RedHat, CentOS, Debian, and Ubuntu. Users can pick their favorite option, then easily download and install Greenplum from pre-packaged binary installers, with no need to compile from source.

Greenplum Command Center Boosts Observability for Admins

Pivotal Greenplum includes a new version of Greenplum Command Center (GPCC). DBAs and data architects use GPCC 6 to monitor the database system, inspect workloads, system utilization, locking, and query progress, and perform historical analysis.

MADlib 1.16 Offers New Possibilities for Deep Learning

Apache MADlib 1.16, with its support for highly parallel, GPU-accelerated deep learning model training, is part of Greenplum 6. Greenplum 6 users can take advantage of GPUs embedded in cluster hardware, achieving performance two or more orders of magnitude faster than CPU-only processing. Give these features a spin if you have predictive analytics use cases or need to perform image recognition!

Pivotal Greenplum Streaming Server Brings Kafka to Your Analytics Deployment

Why use Apache Kafka? It’s perfect for guaranteed, resumable once-only loading. It’s an attractive alternative to the traditional ETL model. Now Greenplum Streaming Server (GPSS) enables real-time, continuous updating of data sets in Pivotal Greenplum.  This method has been used successfully in IoT and financial trading use cases. Here, Kafka continuously streams data into Pivotal Greenplum. This combination works in harmony to ensure that data is available for analytics and reporting in real time!
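What makes loading "resumable" and "once-only" can be sketched in a few lines of Python. This is a toy model and not how GPSS is implemented: the in-memory topic, the TargetTable class, and the simulated crash are assumptions for illustration. The idea shown is that the loaded rows and the next read offset are committed together, so a restarted loader picks up exactly where the last committed batch ended.

```python
class TargetTable:
    """Toy stand-in for a database table plus a stored stream offset."""

    def __init__(self):
        self.rows = []
        self.next_offset = 0  # first message not yet loaded

    def commit(self, batch, next_offset):
        # Models one transaction: the rows and the offset advance
        # together, or not at all.
        self.rows.extend(batch)
        self.next_offset = next_offset


def load(topic, table, batch_size, crash_before_batch=None):
    """Load messages from the committed offset onward, in batches.

    crash_before_batch (hypothetical test hook) simulates a failure
    after a batch was read but before it was committed.
    """
    batch_number = 0
    while table.next_offset < len(topic):
        start = table.next_offset
        end = min(start + batch_size, len(topic))
        batch = topic[start:end]
        if crash_before_batch == batch_number:
            raise RuntimeError("simulated crash before commit")
        table.commit(batch, end)
        batch_number += 1


topic = [f"event-{i}" for i in range(10)]  # hypothetical message stream
table = TargetTable()

try:
    load(topic, table, batch_size=3, crash_before_batch=2)
except RuntimeError:
    pass  # the loader died; the uncommitted batch left no trace

rows_after_crash = list(table.rows)
load(topic, table, batch_size=3)  # restart resumes from the stored offset
```

After the restart, every message appears in the table exactly once, with none lost and none duplicated.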

Try Greenplum Today!

The Greenplum community continues to grow and thrive. Adoption continues apace. It’s time to take the next step, and see how Greenplum can work for you!

SAFE HARBOR STATEMENT

This blog also contains statements which are intended to outline the general direction of certain of Pivotal's offerings. It is intended for information purposes only and may not be incorporated into any contract. Any information regarding the pre-release of Pivotal offerings, future updates or other planned modifications is subject to ongoing evaluation by Pivotal and is subject to change. All software releases are on an “if and when available” basis and are subject to change. This information is provided without warranty of any kind, express or implied, and is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions regarding Pivotal's offerings. Any purchasing decisions should only be based on features currently available. The development, release, and timing of any features or functionality described for Pivotal's offerings in this blog remain at the sole discretion of Pivotal. Pivotal has no obligation to update forward-looking information in this blog.

This blog contains statements relating to Pivotal’s expectations, projections, beliefs, and prospects which are "forward-looking statements" and by their nature are uncertain. Words such as "believe," "may," "will," "estimate," "continue," "anticipate," "intend," "expect," "plans," and similar expressions are intended to identify forward-looking statements. Such forward-looking statements are not guarantees of future performance, and you are cautioned not to place undue reliance on these forward-looking statements. Actual results could differ materially from those projected in the forward-looking statements as a result of many factors. All information set forth in this blog is current as of the date of this blog. These forward-looking statements are based on current expectations and are subject to uncertainties, risks, assumptions, and changes in condition, significance, value and effect as well as other risks disclosed previously and from time to time by us. Additional information we disclose could cause actual results to vary from expectations. Pivotal disclaims any obligation to, and does not currently intend to, update any such forward-looking statements, whether written or oral, that may be made from time to time except as required by law.

PIVOTAL, GREENPLUM are either registered trademarks or trademarks of Pivotal Software, Inc. in the United States and/or other countries. REDHAT and CentOS are registered trademarks of RedHat, Inc.  DEBIAN is a registered United States trademark of Software in the Public Interest, Inc. Amazon Web Services is a registered trademark of Amazon.com, Inc. or its affiliates in the United States and/or other countries. UBUNTU is a registered trademark of Canonical, Inc. Apache MADLIB is a trademark of The Apache Software Foundation. Other names may be trademarks of their respective owners.

About the Author

Ivan Novick

Ivan has been working on big data, databases, and enterprise systems for over a decade. He spent 5 years in the financial industry building trading systems; worked at Yahoo on the data warehouse system before Hadoop was created; hacked on a MySQL storage engine for a year; and has spent the last 10 years in various capacities working on the Pivotal Greenplum product. Ivan's passion is building next-generation data platforms. In his free time, he has also been a beginning yoga student for the last 10 years. Born and raised in NYC, Ivan is now enjoying the California lifestyle, where he has resided since 2006.
