The Evolution of a Data Platform: A Journey With Greenplum and Kubernetes

June 13, 2018 Dan Baskette

Pivotal Greenplum has undergone myriad changes since its initial release more than a decade ago, but the pace of development has increased significantly in the last few years. This is largely due to the efforts of the Greenplum development team, which adopted practices like pairing and continuous integration to simultaneously maintain high-quality code and get new features into the hands of users faster than ever.

Cloudy with a Chance of Machine Learning

During this period of rapid product development, a trend emerged: customers began asking about running Greenplum in the public cloud. While the public cloud providers have their own analytical database and data warehouse offerings, they don’t provide the enterprise-grade capabilities that our customers require and are accustomed to with Greenplum. We’re talking about capabilities like distributed data loading (possible with Greenplum because of its massively parallel processing engine) and built-in machine learning (possible thanks to Pivotal’s commitment to Apache MADlib and other open source projects that run in-database within Greenplum).

So, Pivotal started the effort to embrace the public cloud as a home for the company’s data products. We initially built AWS CloudFormation scripts to deploy Pivotal Greenplum on AWS. This work has since been replicated on Microsoft Azure, with other IaaS platforms on the horizon. Additionally, support for cloud-specific object storage was added to allow customers to offload or archive data out of the database and into long-term, less performant, but cost-effective storage.


The “State” of Cloud Native Computing

This “push-button” multi-platform deployment methodology definitely attracted some customer attention. The feedback from those customers was: “This functionality is great, and we’d love to have it within our own on-premises Greenplum deployment as well.”

Meanwhile, the state of infrastructure and platform technology continued to evolve. Specifically, Kubernetes and its first-class support of stateful workloads have gained enormous popularity. K8s provides support for these workloads via techniques such as StatefulSets, which let a unique identifier and storage connections follow a pod regardless of where it’s started. StatefulSets and related technologies have jump-started a new market for databases in K8s.
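
To make the “unique identifier and storage connections follow a pod” idea concrete, here is a minimal Python sketch of the naming pattern Kubernetes uses for StatefulSet pods and their persistent volume claims. The set name "segment" and the claim template name "data" are hypothetical, chosen only for illustration; they are not taken from any Greenplum deployment.

```python
def stateful_identities(set_name, claim_template, replicas):
    """Return (pod name, volume claim name) pairs for a StatefulSet.

    Pods are named <set name>-<ordinal>, and each pod's claim is named
    <claim template name>-<pod name>, so a rescheduled pod reattaches to
    the same storage no matter which node it lands on.
    """
    identities = []
    for ordinal in range(replicas):
        pod_name = f"{set_name}-{ordinal}"
        claim_name = f"{claim_template}-{pod_name}"
        identities.append((pod_name, claim_name))
    return identities


if __name__ == "__main__":
    # "segment" and "data" are illustrative names only.
    for pod, claim in stateful_identities("segment", "data", 3):
        print(pod, "->", claim)
```

Because both names are derived from the ordinal rather than from the node, the pairing between a database instance and its data survives a restart elsewhere in the cluster.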

Greenplum’s “push-button” mentality also aligns well with Pivotal Container Service (PKS). PKS combines the operational power of BOSH with Kubernetes to simplify how enterprises deploy, manage and run Kubernetes clusters. PKS addresses Day 1 and Day 2 operations of Kubernetes clusters, and also includes VMware NSX-T to provide a software-defined network for easier configuration of networking with multiple tenants across multiple K8s clusters. All of this is available for self-provisioning by application teams to provide that “push-button” experience.

Additionally, some of our largest customers expressed interest in deploying the entire Pivotal Greenplum stack. What does that mean? At the time, Greenplum customers relied on a separate vendor for operating system support. Many customers told us they were interested in removing this variable from their implementations by leveraging an embedded OS like the one they get when using Pivotal Cloud Foundry. This could potentially be a cluster running PKS to provide a Kubernetes installation, and then Greenplum running in Kubernetes leveraging an embedded Ubuntu OS.

Hello Road, Meet Rubber.

At PostgresConf 2018 in New Jersey, Pivotal held the first Greenplum Summit. At the event, our own Goutam Tadi presented “Greenplum Kontained: Coordinating Many PostgreSQL Instances on Kubernetes: Cloud-Native Greenplum.” (You can watch Goutam’s presentation here and check out his slides here.)

It was a very early peek into the work the team had been doing in containerizing Pivotal Greenplum and deploying container-based clusters in a Kubernetes cluster. This early work deployed a Kubernetes cluster and then used a Helm chart to install a Greenplum cluster within that K8s cluster. Helm charts provide some lifecycle management hooks that allow timed call-outs, such as pre-install or post-install, to handle some configuration tasks associated with the software being installed. While these are called Lifecycle Hooks, they don’t address regular operation of the software as part of the lifecycle and instead are focused only on the install, upgrade, or deletion of the software. This early version did not include automation of any of the day-to-day operations of the database, but these are areas that are currently being developed by the Pivotal team.

Applying Learnings to Deliver Value

Based on this early version of the software, the Greenplum on Kubernetes team has grown and is now hard at work building a production-ready Kubernetes Operator for Greenplum. An operator builds on Kubernetes custom resources and custom controllers by coding domain-specific knowledge into a Kubernetes API extension. These custom controllers have access to the Kubernetes API. This domain-specific knowledge allows the operator to monitor the application and perform application-specific tasks in addition to Kubernetes tasks based on the state of the monitored application.

For example, if a node fails, a Greenplum operator could spin up a new Kubernetes node, start the Greenplum containers, join them to the database cluster and initiate a resync, if required. This is a powerful addition to the Kubernetes ecosystem and enables automation of many of the Day 2 tasks associated with running a stateful application. This is where we start to obtain true value from running these workloads in Kubernetes: value-add functionality above and beyond that of a standard, bare metal deployment.
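
The recovery flow described above follows the general pattern of an operator's reconcile loop: compare the desired state with the observed state, then act on the difference. The Python sketch below only simulates that pattern in memory. It does not use the Kubernetes API, and the `ClusterState` class, the segment numbering, and the action strings are all hypothetical; it is not the Pivotal team's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    """Hypothetical, simplified view of a Greenplum cluster (not a real API)."""
    desired_segments: int
    healthy_segments: set = field(default_factory=set)


def reconcile(state):
    """Bring observed state in line with desired state; return actions taken."""
    actions = []
    for segment_id in range(state.desired_segments):
        if segment_id in state.healthy_segments:
            continue  # Nothing to do for segments that are already healthy.
        # Hypothetical recovery steps mirroring the prose above.
        actions.append(f"start containers for segment {segment_id}")
        actions.append(f"join segment {segment_id} to cluster")
        actions.append(f"resync segment {segment_id}")
        state.healthy_segments.add(segment_id)
    return actions


if __name__ == "__main__":
    # Segment 1 is missing, as if its node had failed.
    state = ClusterState(desired_segments=3, healthy_segments={0, 2})
    for action in reconcile(state):
        print(action)
```

Running `reconcile` a second time on the same state returns no actions, which is the property that makes this style of automation safe to repeat on a schedule or on every cluster event.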

The team has also done a lot of testing with various storage models within Kubernetes. An exciting development for stateful workloads, such as Greenplum, is the progression of local persistent volumes. Also, the recent addition of local raw persistent volumes opens up even more deployment possibilities that weren’t available with remote volumes or network-based storage. Both of these models allow for an increased performance profile at the expense of K8s pod portability.

The engineering team is early in the development cycle and the current plan is to build and release this functionality in multiple phases. The first phase will be an operator that installs and configures clusters on demand, and the follow-on phases will address more Day 2 functionality around running the cluster, such as node failure, master failover, and scaling of the Greenplum cluster.

This is an exciting time in the history of Pivotal Greenplum. It’s exciting to see the Pivotal Greenplum team embrace open-source and cloud architectures to tackle traditional data challenges in a modern way. So, stay tuned!

About the Author

Dan is Director of Technical Marketing for Data and Analytics at Pivotal with over 20 years of experience in various pre-sales and engineering roles with Sun Microsystems, EMC Corporation, and Pivotal Software. In addition to his technical marketing duties, Dan is frequently called upon to roll up his sleeves for various "Will this work?" type projects. Dan is an avid collector of Marvel Comics gear and you can usually find him wearing his Marvel Vans. In his spare time, Dan enjoys playing tennis and hiking in the Smoky Mountains.
