Introducing Pivotal Greenplum 5.3

December 13, 2017 Cesar Rojas

Agile software development is at the very core of Pivotal, and it is a key driver of innovation for Pivotal Greenplum. Following our recent announcement of version 5, today we’re excited to announce the Pivotal Greenplum 5.3 release.

With the release arriving so close to the holiday season, we like to think of it as a gift from our engineering team to the Greenplum community. Let's review what makes Pivotal Greenplum 5.3 such a great version:

 

Greenplum Containerization

Greenplum 5.3 is a foundational release that delivers early containerization features as we move toward future integration with Pivotal Container Service (PKS).

A fully containerized Greenplum will be unique in the analytical database world, as many traditional data analytics platforms are monolithic and difficult to abstract. A containerized Greenplum will be able to scale to more users and more workloads with fewer noisy-neighbor effects. It will also give the database administrator (DBA) ultimate control to manage the system and balance different user query requests.

Greenplum 5.3 delivers foundational components that enhance resource isolation and elasticity by allowing query interfaces (e.g. ANSI-compliant SQL, Python, and R) to be containerized within the platform.

Query Containerization

  • Powered by the new Greenplum 5.3 Resource Groups feature.
  • This new capability further enhances the stability and manageability of Greenplum while at the same time allowing for richer resource isolation for multi-tenant and mixed workloads.

  • It provides OS-level grouping of CPU and memory resources, along with a limit on concurrent transactions, to ensure each group is guaranteed a predetermined amount.

  • Resource group CPU management is built on top of Linux Control Groups (cgroups), which give every group well-isolated CPU resources that can burst automatically.

  • Memory allocation for each resource group is pre-configured at both the group and query levels.

  • Resource groups implement transaction-based concurrency management. This lets the DBA manage the level of concurrency and creates an orderly queue for queries waiting to enter the system.
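
The transaction-based concurrency management described above can be pictured with a small toy model. This is only a sketch of the idea, not Greenplum code: the class `ResourceGroupModel`, its methods, and the group name `reporting` are hypothetical names chosen for illustration. It shows a group admitting transactions up to its concurrency limit and holding the rest in a first-in, first-out queue.

```python
from collections import deque


class ResourceGroupModel:
    """Toy model of a resource group's concurrency limit (hypothetical, not Greenplum code)."""

    def __init__(self, name, concurrency):
        self.name = name
        self.concurrency = concurrency  # maximum transactions running at once
        self.running = set()            # transactions currently admitted
        self.waiting = deque()          # FIFO queue of transactions held back

    def begin(self, txn_id):
        """Admit the transaction if a slot is free, otherwise queue it."""
        if len(self.running) < self.concurrency:
            self.running.add(txn_id)
            return "running"
        self.waiting.append(txn_id)
        return "queued"

    def finish(self, txn_id):
        """Release a slot and admit the longest-waiting transaction, if any."""
        self.running.remove(txn_id)
        if self.waiting:
            admitted = self.waiting.popleft()
            self.running.add(admitted)
            return admitted
        return None


# Four transactions arrive at a group that allows two at a time.
group = ResourceGroupModel("reporting", concurrency=2)
states = [group.begin(t) for t in ("t1", "t2", "t3", "t4")]
next_admitted = group.finish("t1")

print(states)         # ['running', 'running', 'queued', 'queued']
print(next_admitted)  # t3
```

When `t1` finishes, the slot goes to `t3`, the transaction that has waited longest, which is what keeps the queue orderly.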

Support for Trusted Languages (R/Python) Containerization on Greenplum

  • Powered by the new Greenplum 5 PL/Container (preview feature).

  • This is an implementation of a trusted language execution engine capable of bringing up Docker containers to isolate executors from the host OS, which allows sandboxing.

  • PL/Container runs Python and R code inside a Docker container. The server-side code running inside Greenplum communicates with the container using an RPC protocol.

  • Containers come pre-configured with Pivotal Greenplum for data science workloads and can also be customized or built from scratch for different end-user workloads. Multiple containers can be deployed to accommodate development teams with different requirements.

 

Greenplum Data Ecosystem Extensibility

Greenplum 5.3 significantly improves the existing level of integration with the Apache Hadoop and Apache Spark frameworks.

Improved integration with the Hadoop ecosystem

  • Apache Hadoop is a popular distributed processing framework that has been primarily deployed as large data repositories (or “data lakes”). Enterprises are looking for hybrid approaches that combine the best elements of the data lake with the query performance of an MPP engine, like Pivotal Greenplum, for advanced analytics.  For those use cases, Pivotal Greenplum 5.3 offers the Platform eXtension Framework (PXF), a REST API abstraction layer that allows Pivotal Greenplum to query Hadoop data in a highly parallel way.

  • The new PXF integrates functionality from Pivotal HDB (a feature known as “Pivotal Extension Framework”) to provide feature parity with Pivotal HDB and data integration to a broader Hadoop ecosystem.

  • With PXF, Pivotal Greenplum users can federate queries across both data within the platform and external Hadoop sources. This symbiotic relationship combines the cost and storage advantages of the data lake with the performance of the Pivotal Greenplum MPP query engine.

  • PXF includes built-in plugins for accessing data inside HDFS files, Hive tables, and HBase tables. PXF is designed to be extended: users can create custom extensions to access other parallel data stores, processing engines, or file and storage formats.

Pivotal Greenplum and Apache Spark integration

  • Apache Spark is an extremely popular and fast in-memory engine for big data processing. It provides built-in modules for streaming, SQL, machine learning, and graph processing. Spark users, such as data scientists and data engineers, want to run fast in-memory analytics, exploratory analytics, and ETL processing while using data that is persisted on Pivotal Greenplum. Users will be able to use the Spark JDBC driver to load and unload data from Greenplum.

  • The Pivotal Greenplum Spark Connector provides high speed, parallelized data transfer between the Greenplum Database and Apache Spark clusters.

 

Greenplum Open Source Improvements

Greenplum 5.3 builds on the open source support by adding Greenplum Database open source binaries for the Ubuntu Linux operating system.

Greenplum Database Open Source Binaries on Ubuntu

  • Prior to Greenplum Database 5.3, the open source distribution was available only as source code on GitHub; 5.3 changes this with pre-packaged binaries.

  • A binary open source option will provide the Greenplum community with an easier, faster, and more consistent installation.

  • We expect this will significantly increase the mindshare and adoption of Greenplum (both open source and commercial).

  • Ubuntu users can leverage native apt-get commands to install Greenplum with ease from the Personal Package Archive that contains the compiled releases.

 

Other Capabilities

Finally, Pivotal Greenplum 5.3 adds a number of new capabilities, including a new backup & restore utility, a module for case-insensitive text searches, and new enterprise support for SUSE Linux Enterprise Server (SLES) 12.

New Version of Greenplum Backup & Restore (preview feature)

  • The new Greenplum Backup & Restore provides higher performance, reduced lock contention for online backups, progress monitoring & reporting, and additional configurability options.

  • The new Greenplum Backup & Restore utility is included in the Greenplum 5.3 release. Based on extensive feedback from Greenplum customers, we have implemented many of their suggestions specific to performance and usability for a brand new backup and restore experience.

  • Improved Performance

    • Multiple concurrent backups resulting in 50% faster run times.

    • 6x performance increase in metadata backups.

    • Improved compression efficiency, decreasing run times by up to 3x.

  • User Experience

    • Decreased catalog locking, resulting in less contention with ETL processes.

    • Improved levels of monitoring and logging.

    • Additional levels of object filtering for selective backup & restores.

    • Multiple output file formats to aid in migrations from previous versions of Greenplum.

Case Insensitive Text (citext) Module

  • This is a new feature backported from PostgreSQL that allows the execution of case-insensitive text searches. For example, a search for ‘cesar rojas’ also matches ‘Cesar Rojas’, ‘CESAR ROJAS’, and any other capitalization of the same text.

  • This is an important feature for customers migrating from databases like Teradata to Pivotal Greenplum, and it is a key element of our Greenplum text processing strategy.
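
As a rough illustration of the comparison rule, the sketch below folds both values to lower case before comparing them, which is essentially how PostgreSQL's citext type is documented to behave. The function `citext_equal` and the sample list are hypothetical names used only for this example, and Python's `str.lower()` stands in for the database's locale-aware lower-casing.

```python
def citext_equal(left: str, right: str) -> bool:
    """Case-insensitive equality: fold both sides to lower case, then compare.

    Simplified stand-in for the database behavior; locale handling is ignored.
    """
    return left.lower() == right.lower()


# Hypothetical stored values; the last one differs by more than case.
stored_names = ["Cesar Rojas", "CESAR ROJAS", "cesar rojas", "Cesar Rojo"]
matches = [name for name in stored_names if citext_equal(name, "cesar rojas")]

print(matches)  # ['Cesar Rojas', 'CESAR ROJAS', 'cesar rojas']
```

Only values that differ in capitalization match; `Cesar Rojo` is excluded because its letters differ.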

SUSE Linux Enterprise Server (SLES) 12 Support

  • Pivotal now provides official Pivotal Greenplum support for SLES 12. With this addition, Pivotal Greenplum offers full support for the enterprise distributions of Red Hat and SUSE.

 

About the Author

Cesar Rojas

Cesar Rojas serves as the Head of Product Marketing for Pivotal Greenplum, responsible for setting the messaging and go-to-market strategy for Greenplum. Prior to joining Pivotal, Mr. Rojas was Director of Product Marketing for the Teradata Portfolio for Hadoop and Teradata Aster offerings. Mr. Rojas is an advanced analytics and data management veteran with 15 years of experience working for the largest data analytics vendors as well as successful data startups. Mr. Rojas has an MBA with an emphasis in eBusiness from Notre Dame de Namur University, as well as a bachelor's in Computer Engineering.
