Run Anywhere. Persist Anywhere. Open Everywhere.
By Derek Comingore and Cesar Rojas
The Logical Data Warehouse Then and Now
Back in 2009, Gartner analyst Mark Beyer understood that the traditional Enterprise Data Warehouse (EDW) could not support the emerging data revolution. He accurately predicted the type of analytical environment needed to support both the explosion of interactive and sensor-generated data and the more traditional transactional data. He called this environment the Logical Data Warehouse, or LDW.
It is 2017, and few LDWs are deployed in the wild as of yet. The LDW hasn't proliferated as expected, largely because deploying one with legacy data warehousing products' approaches is extremely complex and labor intensive. For the majority of enterprises, the benefits of the LDW remain frustratingly out of reach. You have to wonder: haven't we learned anything from the past eight "big data" years? Can a new reference architecture, based on mature open source MPP technology, emerge and rescue the LDW from the predatory business practices of the same old tyrants that ruled the EDW market?
Data warehousing products have historically required that data be persisted within their own unique environments and formats to enable rich analytics (analytics lock-in). With the LDW architecture, enterprises can produce insights from any data asset, regardless of location or format, thus removing analytics lock-in. Here at Pivotal, we are actively tearing down old walls so that enterprises can fully and easily realize the LDW concept and reap its associated benefits.
A Data-Driven World & Open Source Software are Changing the Game
Gartner estimates there are 8.4 billion connected things in use worldwide today, rising to 20.4 billion connected things by 2020. And with this connectivity comes data and its associated insights. If you ever had any doubt as to the relevance of data in our modern world, simply look around and you will “see” data everywhere. Common examples include personal fitness, home energy, and social media. We are actively living in a data-driven world!
Modern enterprises competing in this data-driven world require a diverse mixture of data processing and persistence capabilities across a variety of computing landscapes. The data-driven enterprise is not only using analytics to make traditional business decisions but is also developing new revenue streams and business models with data by offering customers "smart" experiences. In parallel, open source software has drastically risen in popularity over the past ten years to become the de facto standard for enterprise software. Proprietary software is our generation's new legacy.
The New Logical Data Warehouse: Architectural Attributes
Based on feedback from many of our customers and the community, the market is looking for a modern, mature LDW with the following intrinsic characteristics.
Infrastructure Agnostic: Run Anywhere
Cloud-native, multi-cloud support is critical. Modern LDWs must be able to run natively in any cloud/IaaS environment users desire. Whether the organization chooses a private cloud or one of the popular public clouds, the LDW must run, and persist data, natively within it.
Data Virtualization: Persist Anywhere
A modern LDW must be able to virtualize data across a variety of storage technologies. The data warehouse user community does not know (or care) where the underlying data resides behind logical tables. The data may reside in a local cluster, a traditional Hadoop-based data lake, or newer cloud technologies such as Amazon Web Services (AWS) S3. Wherever the enterprise persists its data assets, the modern logical data warehouse must be able to access them and expose them to users for reporting, querying, and advanced analytics.
Open Source: Open Everywhere
To avoid vendor lock-in, a modern LDW must be open. An open source LDW architecture benefits from the innovation created by the larger community and the additional engineering momentum that is required in a compute and storage agnostic world. Enterprises can access source code and even contribute new features themselves.
Advanced Workload Management
As the modern-day LDW exposes more of the enterprise's data assets, workload management becomes critical. A mature LDW must be able to dynamically govern system resources in real time across all manner of compute and storage. This capability ensures SLAs can be met regardless of data location.
Support for ETL 2.0: High-Throughput, Low-Latency Data Ingestion
Traditional ETL (ETL 1.0) methodologies and tooling are no longer sufficient for next-generation use cases. LDWs benefit from new approaches to data ingestion in our data-driven world. New LDWs must be able to accommodate not only traditional batch and micro-batch ingest but also real-time data streams. Real-time data streaming provides the LDW with fresher data for the larger user community to run batch or even real-time analytics against.
Modern enterprises require a new ETL environment (ETL 2.0) that can support these diverse data pipelines across a dynamic number of consuming entities. This is a trend not likely to stop anytime soon. With the rise in Internet of Things (IoT) workloads, ETL 2.0 will continue to increase in importance.
Pivotal Greenplum: Defining the Modern Logical Data Warehouse
Pivotal Greenplum, based on the Greenplum Database open source project, was born a pure software data warehouse (i.e., not tied directly to any hardware platform) over 10 years ago. Unlike other Massively Parallel Processing (MPP) database systems on the market, Pivotal Greenplum is a 100% software solution that runs wherever basic Linux operating systems are supported. This gives Pivotal Greenplum the ability to run on public and private clouds in addition to on-premise bare metal configurations. Pivotal continues to invest in Pivotal Greenplum, recently reaching General Availability on both the AWS and Azure marketplaces.
Furthermore, the engineering teams behind Pivotal Greenplum recognized early on the need for data access across a variety of data stores. The data access mechanisms found in Pivotal Greenplum support both read and write operations, in parallel. Pivotal Greenplum supports an external table mechanism that allows it to access data across many different data stores, including all major Hadoop distributions, AWS S3, and NFS mounts, as well as dynamic data sources such as the output of calling external processes.
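As a concrete illustration, the external table mechanism described above is driven by plain DDL. The following is a minimal Python sketch that composes such statements as strings (the table names, bucket paths, and config file locations are hypothetical examples, not taken from any real deployment); the resulting SQL could then be submitted to Greenplum with any PostgreSQL-compatible client library.

```python
# Minimal sketch: composing Greenplum external-table DDL as strings.
# All table names, URLs, and paths below are hypothetical examples.

def s3_external_table(name: str, columns: str, bucket_url: str, config: str) -> str:
    """Readable external table over files in AWS S3 (Greenplum s3 protocol)."""
    return (
        f"CREATE EXTERNAL TABLE {name} ({columns})\n"
        f"LOCATION ('{bucket_url} config={config}')\n"
        f"FORMAT 'CSV';"
    )

def gpfdist_external_table(name: str, columns: str, location: str) -> str:
    """Readable external table served by a gpfdist file server."""
    return (
        f"CREATE EXTERNAL TABLE {name} ({columns})\n"
        f"LOCATION ('{location}')\n"
        f"FORMAT 'TEXT' (DELIMITER '|');"
    )

ddl = s3_external_table(
    "ext_clickstream",
    "event_time timestamp, user_id bigint, url text",
    "s3://s3-us-east-1.amazonaws.com/acme-logs/clicks/",
    "/home/gpadmin/s3.conf",
)
print(ddl)
```

Once defined, such a table can be queried with ordinary SQL alongside locally persisted tables, which is what makes the "persist anywhere" model transparent to end users.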
In order to process data quickly and efficiently across a variety of data stores, Pivotal introduced GPORCA, also known as the Pivotal Query Optimizer. GPORCA is an open source, next-generation query optimizer engineered for processing big data regardless of where the source data resides. GPORCA provides significant performance improvements over the original Pivotal Greenplum Query Optimizer (Planner) for virtualized data assets.
Pivotal has made extensive investments in Pivotal Greenplum's workload management subsystems. There are two workload management subsystems found within Pivotal Greenplum: Resource Queues and GP-WLM. Resource Queues provide a proactive and flexible workload management configuration whereby the database administrator can carve out system resources effectively based upon forecasted workloads. GP-WLM, conversely, provides a real-time and dynamic workload management capability, where the database administrator can dictate what actions to invoke based upon real-time cluster events.
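To make the Resource Queue side of this concrete, the sketch below composes the relevant DDL in Python (queue names, roles, and limit values are hypothetical examples): a queue caps concurrency, memory, and priority for a class of workloads, and roles are then routed through it.

```python
# Minimal sketch: composing Greenplum Resource Queue DDL as strings.
# Queue, role, and limit values below are hypothetical examples.

def create_resource_queue(name: str, active_statements: int,
                          memory_limit: str, priority: str = "MEDIUM") -> str:
    """Proactively carve out resources for a class of workloads."""
    return (
        f"CREATE RESOURCE QUEUE {name} WITH ("
        f"ACTIVE_STATEMENTS={active_statements}, "
        f"MEMORY_LIMIT='{memory_limit}', "
        f"PRIORITY={priority});"
    )

def assign_role_to_queue(role: str, queue: str) -> str:
    """Route all of a role's queries through the given queue."""
    return f"ALTER ROLE {role} RESOURCE QUEUE {queue};"

stmts = [
    create_resource_queue("adhoc_q", active_statements=5,
                          memory_limit="2GB", priority="LOW"),
    assign_role_to_queue("analyst", "adhoc_q"),
]
for s in stmts:
    print(s)
```

This is the "proactive" half of the story; GP-WLM's event-driven rules complement it at runtime and are configured through its own tooling rather than SQL DDL.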
For data streaming scenarios, Pivotal Greenplum supports a variety of ingest mechanisms. Data pipeline frameworks such as Spring Cloud Stream provide the ability to stream or micro-batch records for ingestion into Pivotal Greenplum. Pivotal has also engineered a GemFire-Greenplum Connector for bidirectional data movement between GemFire regions and Pivotal Greenplum tables. The connector enables Pivotal Greenplum, when coupled with Pivotal GemFire, to provide true high-performance Hybrid Transactional/Analytical Processing (HTAP) database capabilities, ingesting data into memory before it is written to disk via Pivotal Greenplum for downstream analytics. The connector is being further extended to support Apache Spark as well. These capabilities make Pivotal Greenplum a strong candidate for IoT analytic workloads.
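The micro-batch pattern underlying these pipelines is simple to sketch independently of any particular framework. The Python below (not the GemFire-Greenplum Connector or Spring Cloud Stream, which are JVM-based; all names here are illustrative) buffers streaming records and flushes them in batches to a sink callback, which in a real pipeline would perform a bulk INSERT or gpfdist-backed load into Greenplum.

```python
# Minimal sketch of micro-batch ingestion: buffer streaming records and
# flush them in batches to a sink callback. In a real pipeline the sink
# would bulk-load into Greenplum; here a stub simply collects batches.

from typing import Callable, List, Tuple

Record = Tuple  # e.g. (event_time, user_id, url)

class MicroBatcher:
    def __init__(self, flush: Callable[[List[Record]], None], batch_size: int = 1000):
        self.flush = flush            # callback performing the bulk load
        self.batch_size = batch_size
        self.buffer: List[Record] = []

    def add(self, record: Record) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.close()              # flush a full batch

    def close(self) -> None:
        if self.buffer:               # flush any remaining records
            self.flush(self.buffer)
            self.buffer = []

# Usage with a stub sink that just collects batches:
batches: List[List[Record]] = []
b = MicroBatcher(batches.append, batch_size=3)
for i in range(7):
    b.add((i, f"user{i}"))
b.close()                             # flush the trailing partial batch
print(len(batches))                   # prints 3 (batch sizes 3, 3, 1)
```

Tuning the batch size trades ingest latency against per-batch load overhead, which is exactly the dial that distinguishes micro-batch from true record-at-a-time streaming.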
As mentioned, Pivotal Greenplum is based on the open source software project Greenplum Database, whose source code can be found on GitHub. Pivotal is fully committed to open source innovation and communities. It open sourced all of its data fabrics in 2015 as part of the company's commitment to the value and promise of open source software. While Pivotal continues to be a primary code contributor, there are broad community contributions to Greenplum Database in true open source fashion.
Pivotal LDW Approach: Active Collaboration within the Community
When it comes to the LDW, Pivotal recognizes no one vendor can do it all. We are currently in the process of evaluating alliances with other open source vendors to potentially launch a community initiative around the LDW for the benefit of the enterprise. Stay tuned for more soon.
About the Author
Derek Comingore is a Technical Lead at Pivotal Data. Derek has been advising customers on the implementation of modern data architectures and back-end systems for more than a decade.