Pivotal Greenplum: Open-Source, Massively Parallel Data Platform for Advanced Analytics

December 2, 2016

Greenplum Accelerates Your Digital Transformation

Data is at the center of digital transformation, driving how transformation happens. But data is messy, and it’s everywhere: in the cloud and on-premises, and in different types and formats.

Even several years after the introduction of data lake solutions, enterprises continue to struggle with applying analytics to disjointed data types and silos. Integrating structured with unstructured data is a major issue, and traditional enterprise data architectures fail to support real-time insights. Data volume growth puts pressure on infrastructure and resources.

Data professionals can meet these challenges with Pivotal Greenplum, the only open-source, shared-nothing, massively parallel processing (MPP) data platform designed for advanced data analytics at petabyte scale. Greenplum makes data science approachable: familiar ANSI SQL and common procedural languages let users harness Greenplum’s MPP architecture for faster modeling.

Each server node in Greenplum owns and manages a distinct portion of the overall data. The system automatically distributes data and parallelizes query workloads across all available hardware.
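
The shared-nothing idea can be illustrated with a small Python sketch. It is not Greenplum code: the segment count, the CRC32 hash, and the names segment_for, distribute, and parallel_total are assumptions made for illustration. Each row is routed to exactly one segment by hashing a distribution key, each segment computes a partial sum over only its own rows, and the partial results are then combined.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

NUM_SEGMENTS = 4  # hypothetical cluster size, chosen for illustration


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution key to a segment index with a stable hash."""
    return crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_field, num_segments=NUM_SEGMENTS):
    """Route every row to exactly one segment based on its key."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_field], num_segments)].append(row)
    return segments


def local_sum(segment_rows):
    """Work done independently on one segment's own rows."""
    return sum(row["amount"] for row in segment_rows)


def parallel_total(segments):
    """Run the per-segment work concurrently, then combine the partials."""
    with ThreadPoolExecutor(max_workers=max(len(segments), 1)) as pool:
        partials = list(pool.map(local_sum, segments.values()))
    return sum(partials)


# Made-up sample data: 100 rows keyed by a customer identifier.
rows = [{"customer": f"c{i}", "amount": i} for i in range(100)]
segments = distribute(rows, "customer")
total = parallel_total(segments)
```

Because every row lives on exactly one segment, the combined result matches what a single-node sum over all rows would produce.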

A 2017 Gartner survey suggests it takes an average of 52 days to build a predictive model. Speed of model development is therefore a top concern in choosing a data platform. By embedding machine learning in an MPP platform, Pivotal Greenplum can help analysts and data scientists run more models in less time.

Extend SQL with graph analytics and machine learning
Greenplum supports Apache MADlib, an open-source library of distributed, in-database analytical methods. These are implemented as user-defined functions that can be invoked with standard SQL. Nearly 60 graph, statistical, and machine-learning functions are supported.

Add geospatial and text data for complex use cases
Greenplum also supports PostGIS, a spatial database extension for PostgreSQL that allows geographic information system (GIS) objects to be stored and processed in the database. Pivotal GPText, based on Apache SolrCloud, enables the processing of raw text data (including email and social media feeds) with an easy-to-use SQL interface.

Support for Python and R analytical libraries through procedural language extensions (PL/X)
Greenplum allows users to write user-defined functions (UDFs) in a wide range of languages including SQL, Perl, Python, R, C, and Java, and supports distributed execution of UDFs. Furthermore, Greenplum users can leverage functions from any of the add-on packages of these languages (e.g., TensorFlow for Python, rstan for R) in their UDFs. Greenplum 5 also provides easy-to-use installers for the most popular add-on libraries for Python and R.

Support your Apache Spark users
Apache Spark is an extremely fast, in-memory data-processing engine. The Pivotal Greenplum Spark Connector provides high-speed, bi-directional, parallelized data transfer between Greenplum and Apache Spark clusters. It enables users to run fast in-memory analytics, exploratory analytics, and ETL processing with data persisted on Greenplum.

Handle traditional BI workloads with ease
Greenplum offers comprehensive SQL-92 and SQL-99 language support with SQL 2003 OLAP extensions, including window functions, rollups, cubes, and a wide range of other expressive functionality. All queries are executed in parallel across the entire cluster. Standard database interfaces (including PostgreSQL, SQL, ODBC, JDBC, OLEDB, etc.) are fully supported and certified with a wide range of business intelligence (BI) and extract/transform/load (ETL) tools.
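
As a rough illustration of the window functions mentioned above, the following Python sketch computes a per-region running total. It uses Python's built-in sqlite3 module as a local stand-in rather than Greenplum, and it assumes the bundled SQLite is version 3.25 or later (the first release with window functions). The sales table and its values are made up for the example.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", 1, 100),
        ("east", 2, 150),
        ("west", 1, 80),
        ("west", 2, 120),
    ],
)

# SUM(...) OVER (...) is a window function: it keeps one output row per
# input row and adds a running total computed within each region.
query = """
SELECT region, month, amount,
       SUM(amount) OVER (PARTITION BY region ORDER BY month) AS running_total
FROM sales
ORDER BY region, month
"""
running_totals = conn.execute(query).fetchall()
conn.close()
```

The same OVER (PARTITION BY ... ORDER BY ...) clause is standard SQL, so a BI tool connected over ODBC or JDBC could issue a query of this shape.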

MULTI-CLOUD, INFRASTRUCTURE-AGNOSTIC DEPLOYMENT MINIMIZES LOCK-IN

Run your analytics anywhere you need them. Pivotal Greenplum is a portable, 100% infrastructure-agnostic software solution. Deploy on bare-metal servers, on private cloud (both OpenStack and VMware vSphere are supported), and on public IaaS (AWS, Azure, and now on the Google Cloud Platform). Ubuntu users can use native commands to install Greenplum with ease from the Personal Package Archive that contains the compiled releases.

CONNECT TO HADOOP AND PUBLIC CLOUD REPOSITORIES

Using external tables, Pivotal Greenplum can query data that is natively stored in AWS S3, along with data stored in the Greenplum cluster. This means that a single analytical query can be segmented and distributed to several environments.

For users who have (or are considering) a data lake, the Platform eXtension Framework (PXF) combines the cost and storage advantages of the data lake with the performance of the Greenplum MPP query engine. With PXF, Greenplum users can federate queries across internal tables and external Hadoop sources, such as HDFS, HBase, and Hive. PXF is a REST API abstraction layer that enables Pivotal Greenplum to query Hadoop data in a highly parallel way. It also includes a plugin for JSON files, and users can create custom connectors to access other data stores, processing engines, or file and storage formats via framework APIs.

STABILITY AND SCALABILITY WITH NEW CONTAINERIZATION FEATURES

To provide enhanced resource isolation and elasticity for multitenant and mixed loads, Greenplum now provides containerization features for SQL and trusted languages.

SQL containerization
Greenplum Resource Groups provide resource isolation for query multi-tenancy and mixed workloads. SQL containerization groups together CPU and memory resources, along with concurrent transactions, to ensure each is guaranteed a predetermined amount. Resource groups implement transaction-based concurrency management. This lets the DBA manage the level of concurrency, and it creates an orderly queue for queries waiting to enter the system.
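
The admission behavior described above can be sketched as a small Python model. This is an illustrative simulation only, not Greenplum's implementation: the class ResourceGroupSketch, its methods, and the transaction names are hypothetical. A group admits transactions up to a configured concurrency limit and places the rest in a first-in, first-out queue until a slot is released.

```python
from collections import deque


class ResourceGroupSketch:
    """Toy model of transaction-based concurrency management."""

    def __init__(self, concurrency):
        self.concurrency = concurrency  # limit chosen by the administrator
        self.running = set()            # transactions currently admitted
        self.waiting = deque()          # orderly FIFO queue of waiters

    def submit(self, txn_id):
        """Admit the transaction if a slot is free, otherwise queue it."""
        if len(self.running) < self.concurrency:
            self.running.add(txn_id)
            return True
        self.waiting.append(txn_id)
        return False

    def finish(self, txn_id):
        """Release a slot and admit the longest-waiting transaction, if any."""
        self.running.remove(txn_id)
        if self.waiting:
            promoted = self.waiting.popleft()
            self.running.add(promoted)
            return promoted
        return None


group = ResourceGroupSketch(concurrency=2)
admitted = [group.submit(t) for t in ("t1", "t2", "t3", "t4")]
promoted = group.finish("t1")
```

With a limit of two, the first two transactions run immediately, and the third is admitted only when one of them finishes.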

Trusted language containerization
PL/Container is an implementation of a trusted language execution engine capable of bringing up Docker containers to isolate the execution of PL/R and PL/Python from a Greenplum database host. The server-side code running inside Greenplum communicates with the container using an RPC protocol.

SUMMARY
Greenplum is an open-source data analytics platform that provides powerful and rapid analytics on very large volumes of data. Uniquely geared toward machine learning and advanced data science, Greenplum delivers unmatched analytical query performance on large data volumes and tight integration with leading analytical libraries and software stacks. Additional details on Greenplum can be found in the product and documentation pages. An open-source version of Greenplum (Greenplum Database) is also available for download at greenplum.org.
