The Path to the Modern Data Warehouse Is a Stream

June 27, 2019 Bob Glithero

Streaming data is becoming entrenched as a key part of modern application frameworks, but until recently hasn’t integrated well with the traditional database world. Streaming data is event-centric, distributed, variable, unbounded, and unordered, while relational data is batch-centric, centralized, structured, bounded, and ordered. 

In her talk “ETL is Dead, Long Live Streams,” Neha Narkhede, CTO and co-founder of Confluent, describes how the traditional extract, transform, and load (ETL) process is being disrupted by the needs of stream processing, as ETL infrastructure strains under a constantly growing volume and velocity of data from distributed sources. The big question now is whether the data warehouse can evolve to integrate those two worlds and get fresh data into the hands of analysts and data scientists quickly. Is there a path to unbounding the data warehouse so it can better meet the challenges of streaming data?

ETL is typically snapshot-driven: historically, data warehouses were loaded on a predictable schedule (e.g., daily batches), with each batch built from the state of the source database at a point in time. But in the interval between snapshots, the state of the database is unknown and can’t be analyzed, so the business typically has to live with approximate results until the next load.

Driven by a desire to act faster with fresher insights, some companies began breaking the ETL process into smaller and smaller intervals—from daily, down to minutes. But as intervals decreased, the increased overhead of job tracking and maintenance diverted expert time and attention away from the business. 

Streaming data, in particular, exposes the limitations of traditional ETL. Streaming is event-centric, and knowing when an event actually occurred is central to most streaming analytics. ETL, on the other hand, is process-centric: batch systems have no concept of an event or the time it occurred until it is loaded into the target system. This creates a potential mismatch between event time (when an event actually occurs) and processing time (when an event becomes known to the data warehouse via a batch load). You can think of batches as windows of events, such as a series of customer browsing sessions on an online site. It can be tricky to associate events with the correct session in order to analyze customer behavior.
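
To make the distinction concrete, here is a minimal, hypothetical sketch in Python (not taken from any Pivotal tooling): the same three clicks are grouped first by the batch in which they arrived (processing time) and then by when they actually happened (event time). The pages, timestamps, batch numbers, and 20-minute session gap are all illustrative assumptions.

```python
from datetime import datetime, timedelta

# Each click records when it happened (event time) and which ETL batch delivered it
# (processing time). The second click arrived late, in the next batch.
clicks = [
    {"page": "/home",     "event_time": datetime(2019, 6, 27, 9, 0),  "batch": 1},
    {"page": "/pricing",  "event_time": datetime(2019, 6, 27, 9, 2),  "batch": 2},  # late arrival
    {"page": "/checkout", "event_time": datetime(2019, 6, 27, 9, 30), "batch": 2},
]

SESSION_GAP = timedelta(minutes=20)  # assume a new session starts after 20 idle minutes

def sessions_by_event_time(events):
    """Assign events to sessions based on when they actually occurred."""
    ordered = sorted(events, key=lambda e: e["event_time"])
    sessions, current = [], [ordered[0]]
    for event in ordered[1:]:
        if event["event_time"] - current[-1]["event_time"] > SESSION_GAP:
            sessions.append(current)
            current = [event]
        else:
            current.append(event)
    sessions.append(current)
    return sessions

# Grouping by batch (processing time) splits /home and /pricing into different
# "sessions", even though they happened only two minutes apart.
by_batch = {}
for event in clicks:
    by_batch.setdefault(event["batch"], []).append(event["page"])

print("processing-time view:", by_batch)
# {1: ['/home'], 2: ['/pricing', '/checkout']}
print("event-time view:", [[e["page"] for e in s] for s in sessions_by_event_time(clicks)])
# [['/home', '/pricing'], ['/checkout']]
```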

If we care about event time in the analysis and we’re using batch processing, events have to arrive in order for time to be handled correctly. However, real-world data, especially from distributed systems, usually arrives unordered. We see examples of this in applications like monitoring or billing, or in geo-distributed applications where there’s significant latency in gathering events from all sources.
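
As a small illustration of the same point (the sources and timestamps below are made up): three geo-distributed sources each deliver their own events in order, but the combined feed is unordered overall, so a global event-time order has to be re-established before any event-time analysis runs.

```python
import heapq

# (source, event_time) pairs; each source emits in order, but eu-west arrives late.
us_east  = [("us-east", 1), ("us-east", 4), ("us-east", 9)]
eu_west  = [("eu-west", 2), ("eu-west", 3), ("eu-west", 8)]
ap_south = [("ap-south", 5), ("ap-south", 6), ("ap-south", 7)]

# Re-establish a single event-time order across all sources before analysis.
ordered = list(heapq.merge(us_east, eu_west, ap_south, key=lambda e: e[1]))
print([t for _, t in ordered])  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```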

The Pivotal Greenplum Kafka Integration

It’s these streaming capabilities that inspired us to build a Kafka integration for our Pivotal Greenplum Database, so it can handle the demands of data analysts today and in the future. Greenplum is a massively parallel processing (MPP) database, specifically designed to manage large-scale machine learning and business intelligence workloads. The Pivotal Greenplum Kafka Connector uses the log-based nature of Kafka to map events from any number of topics into Greenplum tables with high-speed, parallel data transfer for faster, more fault-tolerant ingestion than traditional ETL.

Kafka helps Greenplum manage the timeliness of events by ensuring they arrive in a time-ordered manner before ingestion. This is helpful, for example, in use cases where an interruption at the source (e.g., the system goes offline) causes a large latency in processing. Because of Kafka’s log-oriented nature, Greenplum can simply reprocess the log to ingest events from the point of interruption forward, all in the proper order.
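
The general idea is easy to sketch with the standard confluent-kafka Python client (this is a hedged illustration of the pattern, not the connector’s actual implementation; the broker address, topic, consumer group, and load step are all hypothetical). Because a Kafka topic is a durable, replayable log, a consumer that commits its offset only after a successful load can crash at any point and pick up exactly where it left off.

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka-broker:9092",  # hypothetical broker address
    "group.id": "greenplum-loader",            # hypothetical consumer group
    "enable.auto.commit": False,               # commit offsets only after a successful load
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["clickstream"])            # hypothetical topic

def load_into_warehouse(payload: bytes) -> None:
    """Placeholder for the real load step (e.g., an INSERT into a staging table)."""
    pass

try:
    while True:
        msg = consumer.poll(timeout=1.0)       # wait up to 1 second for the next event
        if msg is None or msg.error():
            continue                           # nothing new, or a transient broker error
        load_into_warehouse(msg.value())
        # The offset advances only after the event is safely stored, so a crash
        # before this point simply replays the event from the log on restart.
        consumer.commit(message=msg, asynchronous=False)
finally:
    consumer.close()
```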

Figure 1: Greenplum-Kafka Integration

The integration is a consumer that uses Greenplum external tables to transform and insert the data into a target Greenplum table. It provides high-speed, parallel data transfer from Apache Kafka to Pivotal Greenplum (Confluent’s commercial distribution is also supported).
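
For orientation, the underlying Greenplum external-table pattern looks roughly like the sketch below. The connector manages this plumbing itself; the connection details, table definitions, and gpfdist endpoint here are purely illustrative. Greenplum speaks the PostgreSQL wire protocol, so the standard psycopg2 driver works.

```python
import psycopg2

# Hypothetical connection details; Greenplum is PostgreSQL-compatible on the wire.
conn = psycopg2.connect(host="gpmaster", port=5432, dbname="analytics",
                        user="gpadmin", password="changeme")
conn.autocommit = True
cur = conn.cursor()

# A readable external table exposes an external, parallel data source as if it were a table.
cur.execute("""
    CREATE READABLE EXTERNAL TABLE ext_clicks (
        user_id    bigint,
        page       text,
        event_time timestamp
    )
    LOCATION ('gpfdist://etl-host:8081/clicks*.csv')  -- hypothetical gpfdist endpoint
    FORMAT 'CSV'
""")

# Transform and insert in parallel across all Greenplum segments.
cur.execute("""
    INSERT INTO clicks (user_id, page, event_time)
    SELECT user_id, lower(page), event_time
    FROM ext_clicks
""")

cur.close()
conn.close()
```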

The consumer can read streaming events from a variety of sources, for a wide range of uses. For example, in data integration scenarios (and migrations from legacy data warehouses) based on change data capture, database rows are streamed as individual events. In network security analytics, packet capture events can be streamed to the data warehouse for offline analysis.

Enterprises can use Kafka for continual data loading with light transformations, and Pivotal Greenplum for deep analytics that require extensive aggregation and analysis. The consumer uses a configuration file that defines load parameters such as the Greenplum connection options, the Kafka broker and topic, and the target Greenplum table. 

First, the consumer formats the data. It then loads, transforms, filters, and maps the data into Greenplum tables (one table per topic), where it can be refined for machine learning. Compared with traditional ETL, the consumer requires less manual effort for data loading, delivers lower latency from event to query, and provides easier recovery from interruptions and unexpected events.

The consumer’s features are designed to speed loading of data from Kafka topics into Greenplum tables with strong guarantees of correctness:

  • Continual, stateful data loading

  • Non-blocking data ingestion

  • Resumable with strong consistency guarantees -- if loading is interrupted, Greenplum keeps track of the loading state and needs only to reprocess the topic log from the point of interruption.

  • Automated aggregation and maintenance jobs

  • Optional automatic execution of SQL statements and user-defined functions (UDFs) on commit

 

The Greenplum-Kafka Integration can also provide exactly-once delivery assurance with Kafka version 0.11 or later. Exactly-once semantics means that each incoming event updates the database once -- no duplicates or missing events -- to ensure that the database always holds the correct values.
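
How can a sink deliver exactly-once behavior on top of a replayable log? One common technique, shown here as a hedged sketch rather than a description of the connector’s internals, is to write each batch of events and the Kafka offsets it covers in a single database transaction, so a restart resumes from the recorded offset without applying anything twice. Table and column names are hypothetical.

```python
import psycopg2

def load_batch(conn, rows, topic, partition, last_offset):
    """Apply one batch of events and record its high-water Kafka offset atomically."""
    with conn:  # a single transaction: all statements commit together or not at all
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO clicks (user_id, page, event_time) VALUES (%s, %s, %s)",
                rows,
            )
            cur.execute(
                "UPDATE kafka_offsets SET last_offset = %s "
                "WHERE topic = %s AND kafka_partition = %s",
                (last_offset, topic, partition),
            )
            if cur.rowcount == 0:  # first batch seen for this topic/partition
                cur.execute(
                    "INSERT INTO kafka_offsets (topic, kafka_partition, last_offset) "
                    "VALUES (%s, %s, %s)",
                    (topic, partition, last_offset),
                )

def resume_offset(conn, topic, partition):
    """On restart, resume reading just past the last offset stored with its data."""
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT last_offset FROM kafka_offsets "
            "WHERE topic = %s AND kafka_partition = %s",
            (topic, partition),
        )
        row = cur.fetchone()
        return row[0] + 1 if row else 0
```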

In addition to data integration and packet capture analysis, we are seeing some emerging use cases in areas such as:

  • Real-time fraud detection, where streams of tens of millions of transaction messages per second are analyzed by Apache Flink for event detection and aggregation, and then loaded into Greenplum for historical analysis.

  • Massive ingestion of signaling data for network management in mobile networks.

  • Aggregation of system and device logs.


Find out how Greenplum disrupts ETL and bridges the worlds of streaming and structured data to create a modern data warehouse. Greenplum is free to download, with binaries for Ubuntu at Greenplum.org. For more about real-time data loading with Kafka, read an overview of the Greenplum-Kafka Integration. And to see how Greenplum leverages Kafka for data integration, learn about our data pipeline partners.

Bob Glithero

Bob is Senior Manager, Product Marketing for VMware Tanzu Data Services.
