Pivotal Greenplum: The First Open Source Massively Parallel Data Warehouse

December 2, 2016


Static IT budgets, exploding data volumes, and an ever-evolving competitive landscape have catalyzed new ways of thinking about effective systems for data analytics in enterprises. Legacy data management solutions have not been able to scale to these data volumes or deliver the advanced analytical capabilities needed to address this new market reality. At the same time, proven massively parallel processing data warehouses have led to new approaches for effective data exploration and business insights.

Pivotal Greenplum is the only open source, shared-nothing, massively parallel processing (MPP) data warehouse designed for business intelligence processing and advanced data analytics. This enterprise-grade analytical database provides powerful and rapid analytics on very large volumes of data.


Key Architectural Tenets

Each server node in a Greenplum cluster owns and manages a distinct portion of the overall data. The system automatically distributes data and parallelizes query workloads across all available hardware, moving the processing closer to the data and its users. As a result, it delivers maximum resource utilization and expressiveness.

Pivotal Greenplum is the best platform on the market for mission-critical analytics. The shared-nothing MPP architecture enables massive data storage, loading, and processing with linear scalability. Adaptive services provide enterprises with high availability and workload management. Key product features include petabyte-scale loading, polymorphic storage, comprehensive language support, and advanced machine learning support. In addition, all major third-party analytic and administration tools are supported through standard client interfaces.

Greenplum is regarded as the most scalable mission-critical analytical database and is in use by a large number of leading enterprises worldwide.

Core Capabilities Deliver a Fully Featured Analytical Data Warehouse

Greenplum incorporates several core capabilities that deliver extremely high query performance and throughput, reliable query completeness and correctness, and strong support for complex queries over petabyte-scale data volumes with mixed workloads.

Proven Open Source Technology: After a decade of software hardening, Pivotal made Greenplum available as an open source data warehouse called the “Greenplum Database”. The Greenplum Database project is licensed under the Apache License v2.0 and is openly available to all contributors at greenplum.org.

Massively Parallel Processing Architecture: The Pivotal Greenplum architecture provides automatic parallelization of data and queries. All data is automatically partitioned across all nodes of the system, and queries are planned and executed using all nodes working together in a highly coordinated fashion.
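
As a rough illustration of the shared-nothing idea described above, the following Python sketch hash-distributes rows across a few in-memory "segments", lets each segment compute a partial aggregate over only its own rows, and then combines the partial results in a coordinator step. The segment count, the function names (segment_for, distribute, parallel_sum), and the use of CRC32 as the hash are assumptions made for this example; they are not Greenplum's actual distribution function or executor.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # hypothetical cluster size, chosen only for this example


def segment_for(key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Map a distribution key to a segment with a stable hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_field, num_segments: int = NUM_SEGMENTS):
    """Assign every row to exactly one segment based on its distribution key."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(str(row[key_field]), num_segments)].append(row)
    return segments


def parallel_sum(segments, value_field):
    """Each segment sums its own rows; the coordinator adds the partial sums."""
    partials = {
        seg: sum(row[value_field] for row in seg_rows)
        for seg, seg_rows in segments.items()
    }
    return sum(partials.values()), partials


# Illustrative data: 100 orders with made-up amounts.
rows = [{"customer_id": i, "amount": i * 10} for i in range(1, 101)]
segments = distribute(rows, "customer_id")
total, partials = parallel_sum(segments, "amount")

print("rows per segment:", {seg: len(r) for seg, r in sorted(segments.items())})
print("total:", total)
```

Because every row lives on exactly one segment, the partial sums can be computed independently and the coordinator only has to combine a handful of numbers rather than scan the full data set.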

Petabyte-Scale Loading: High-performance loading uses MPP Scatter/Gather Streaming technology. Loading speeds scale with each additional node to greater than 10 terabytes per hour, per rack. Continuous streams are loaded using trickle micro-batching at extremely high data ingest rates.

Polymorphic Data Storage and Execution: The table or partition storage, execution, and compression settings can be configured to suit the way data is accessed. Customers have the choice of row- or column-oriented storage and processing for any table or partition. Columnar storage is ideal for accessing a limited number of attributes over an extended record set, such as for historical analytics of specific attributes. Row storage is ideal for accessing the complete attribute set for a limited set of records, such as for obtaining all the information about a recent transaction.
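
The access-pattern trade-off described above can be sketched with two in-memory layouts of the same data. This is a simplified Python illustration, not Greenplum's storage format; the table contents, column names, and helper functions are hypothetical.

```python
# Column names for a hypothetical orders table.
COLUMNS = ("order_id", "order_date", "region", "amount")

# Row-oriented layout: one tuple per record, all attributes stored together.
row_store = [
    (1, "2016-11-01", "EMEA", 120.0),
    (2, "2016-11-01", "APAC", 75.5),
    (3, "2016-11-02", "EMEA", 40.0),
    (4, "2016-11-03", "AMER", 210.0),
]


def to_columnar(rows, columns):
    """Pivot row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


def lookup_order(rows, order_id):
    """Row layout: fetching every attribute of one record reads a single tuple."""
    for row in rows:
        if row[0] == order_id:
            return dict(zip(COLUMNS, row))
    return None


column_store = to_columnar(row_store, COLUMNS)

# Columnar layout: aggregating one attribute touches only that column's list.
total_amount = sum(column_store["amount"])

# Row layout: retrieving a complete recent transaction touches one record.
order = lookup_order(row_store, 4)

print("total amount:", total_amount)
print("order 4:", order)
```

In the columnar layout the aggregate never reads order_date or region, while in the row layout the single-record lookup never has to stitch values together from several column lists.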

Pivotal Query Optimizer: Pivotal Query Optimizer (PQO) is the industry’s first cost-based query optimizer for big data workloads. PQO can scale interactive and batch mode analytics to large data sets in the petabytes without degrading query performance and throughput, a task that is prohibitively expensive for traditional EDWs and existing alternatives. PQO is also capable of handling a wide range of complex queries with concurrent and mixed workloads. This enables large teams to work in parallel on multiple analytics use cases with advanced analytics and diverse workloads.

In-Database Compression: In-database compression uses industry-leading compression technology to increase performance and dramatically reduce the space required to store data. Customers can expect to see up to 30x disk space reduction with a corresponding increase in effective I/O performance.
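
To give a feel for why compression can pay off on analytical data, the sketch below compresses a repetitive, low-cardinality column with Python's standard zlib module and reports the resulting ratio. zlib is only a stand-in here, and the column contents are made up; the ratio obtained depends entirely on the data and the codec, so this does not reproduce the "up to 30x" figure quoted above.

```python
import zlib

# Hypothetical low-cardinality column: 10,000 region codes.
column = ["EMEA", "APAC", "AMER", "EMEA"] * 2500

raw = "\n".join(column).encode("utf-8")
compressed = zlib.compress(raw, 9)  # 9 = highest zlib compression level

ratio = len(raw) / len(compressed)
print(f"raw bytes: {len(raw)}, compressed bytes: {len(compressed)}")
print(f"compression ratio: {ratio:.1f}x")

# Decompression restores the original bytes exactly (lossless).
restored = zlib.decompress(compressed)
```

Fewer bytes on disk also means fewer bytes to read per scan, which is the sense in which compression can improve effective I/O performance.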

Multi-level Partitioning: Flexible partitioning of tables is based on date, range, or value. Partitioning is specified using Data Definition Language (DDL) statements and can be nested to an arbitrary number of levels. The query optimizer automatically prunes unneeded partitions from the query plan.
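
Partition pruning can be illustrated with a small Python sketch: given monthly range partitions and a query's date range, only the partitions whose ranges overlap the query need to be scanned. The partition naming scheme and the month_partitions and prune helpers are hypothetical and do not reflect how the Greenplum optimizer is implemented.

```python
from datetime import date


def month_partitions(year):
    """Build 12 monthly range partitions, each covering [start, end)."""
    parts = []
    for month in range(1, 13):
        start = date(year, month, 1)
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        parts.append({"name": f"sales_{year}_{month:02d}", "start": start, "end": end})
    return parts


def prune(partitions, lo, hi):
    """Keep only partitions whose [start, end) range overlaps the query range [lo, hi)."""
    return [p for p in partitions if p["start"] < hi and p["end"] > lo]


partitions = month_partitions(2016)

# Query predicate: order_date >= 2016-03-15 AND order_date < 2016-05-01
needed = prune(partitions, date(2016, 3, 15), date(2016, 5, 1))
needed_names = [p["name"] for p in needed]

print(needed_names)  # ['sales_2016_03', 'sales_2016_04']
print(f"scanned {len(needed)} of {len(partitions)} partitions")
```

Here ten of the twelve partitions are skipped before any data is read, which is the effect pruning aims for on large date-partitioned tables.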

Solving Business Problems with Proven Analytics

Comprehensive SQL support: Greenplum offers comprehensive SQL-92 and SQL-99 language support with SQL 2003 OLAP extensions, including window functions, rollup, cube, and a wide range of other expressive functionality. All queries are parallelized and executed across the entire system. Standard database interfaces (PostgreSQL, SQL, ODBC, JDBC, OLEDB, etc.) are fully supported and certified with a wide range of business intelligence (BI) and extract/transform/load (ETL) tools. This enables existing analytic tools and applications that use standard SQL constructs and interfaces to work with Greenplum with minimal reintegration effort. It also prevents vendor lock-in for the enterprise and fosters innovation while containing business risk.
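
For readers less familiar with window functions, the Python sketch below computes what a query such as SUM(amount) OVER (PARTITION BY region ORDER BY order_date) would return: a running total that restarts for each region. The sample rows and the running_total helper are made up for illustration; in the database the same result comes from the SQL expression itself.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sales rows.
sales = [
    {"region": "EMEA", "order_date": "2016-11-01", "amount": 100},
    {"region": "APAC", "order_date": "2016-11-01", "amount": 50},
    {"region": "EMEA", "order_date": "2016-11-02", "amount": 40},
    {"region": "APAC", "order_date": "2016-11-03", "amount": 70},
]


def running_total(rows, partition_by, order_by, value):
    """Emulate SUM(value) OVER (PARTITION BY partition_by ORDER BY order_by)."""
    ordered = sorted(rows, key=itemgetter(partition_by, order_by))
    out = []
    for _, group in groupby(ordered, key=itemgetter(partition_by)):
        total = 0  # the running sum restarts for every partition
        for row in group:
            total += row[value]
            out.append({**row, "running_total": total})
    return out


result = running_total(sales, "region", "order_date", "amount")
for row in result:
    print(row["region"], row["order_date"], row["amount"], row["running_total"])
```

Unlike a GROUP BY aggregate, every input row is kept in the output, with the windowed value attached alongside it.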

Advanced machine learning: Greenplum has some of the most advanced machine learning support among analytical databases in the industry. These capabilities are provided through Apache MADlib (incubating), an open source library for scalable in-database analytics extending the SQL capabilities on Greenplum through user-defined functions. This enables normal analytics workloads to embed advanced machine learning constructs to implement powerful, large scale analytical use cases.

Support for PL/* programmable analytics: Greenplum enables users to implement functions in PL/Python, PL/Java, PL/R, PL/pgSQL, PL/Perl, and other procedural languages that are executed in massively parallel mode. This enables powerful programmatic analytics capabilities to be executed natively at massive scale as the use cases require.

Data Federation using GPHDFS: Greenplum supports data federation with all the major Hadoop distributions, enabling external tables to be created and updated directly on HDFS and thereby minimizing data movement.

PostGIS support: Greenplum has extensive support for PostGIS, a spatial database extension for PostgreSQL that allows GIS (Geographic Information Systems) objects to be stored and processed in the database. The Greenplum PostGIS extension includes support for spatial indexes and functions for analysis and processing of GIS objects.

Security and Business Continuity

Data Security: Security is a key consideration for ensuring enterprise policy and regulatory compliance for the data managed in analytical databases. Security can be categorized as authentication, authorization, audit, and data encryption. Greenplum supports numerous authentication mechanisms, including Kerberos, LDAP, and RADIUS. Authorization is performed using roles and privileges. Roles can be defined at user, group, or super-user levels, and privileges at the database operator level on specific database objects. Greenplum is capable of logging and auditing a variety of events and SQL statements at multiple levels of detail. Encryption is supported for data in motion using SSL and for data at rest using the pgcrypto package, which is compliant with the US Federal Information Processing Standards (FIPS) and supports numerous column-level encryption functions.

Fault-tolerance and data availability: Fault tolerance and data availability are achieved through a series of mechanisms, including hardware-level RAID, software-level mirroring, dual-cluster mechanisms (for active-standby and active-active operation), and backup and restore. Several backup targets are supported, including EMC Data Domain appliances, Symantec NetBackup, and parallel NFS mounts. Both incremental and full backups are supported. These mechanisms ensure business continuity and high availability in the face of hardware, software, and network-level failures, significantly minimizing business risk for the enterprise.

Simplified Management and Flexible Deployment

Greenplum Command Center and Package Manager: Greenplum Command Center monitors system performance metrics, analyzes system health, and allows administrators to perform management tasks such as start, stop, and recovery. It has a built-in interactive graphical web application that enables users to view and interact with the collected Greenplum system data. Greenplum Package Manager automates install, uninstall, update, and query of analytics extensions and supports package migration during upgrade, segment recovery, expansion, and standby initialization.

Together, the two tools are designed to significantly simplify the configuration and management of Greenplum, resulting in overall reduction in operational costs of the system for the enterprise.

Flexible Deployment Model: Greenplum is available as part of the Pivotal Big Data Suite and supports multiple deployment models:

  • Software: Packaged software distribution for integration with user-provided commodity hardware running Linux OS.
  • Appliance: EMC Data Computing Appliance (DCA), a fully integrated hardware and software solution, available in sizes ranging from a quarter rack with 4 nodes to hundreds of nodes.
  • Virtualized IaaS: Deployment in a virtualized compute and storage environment.

The flexibility in deployment models caters to multiple enterprise considerations around cost, performance, control, security, regulatory requirements, etc.


Greenplum is an open source data warehouse that provides powerful and rapid analytics on very large volumes of data. Uniquely geared toward machine learning and advanced data science, Greenplum is powered by the world’s most advanced cost-based query optimizer and delivers unmatched analytical query performance on large data volumes, along with flexibility, a complete set of features, and tight integration with leading analytical libraries and software stacks.
