Pivotal Greenplum 5.10 Introduces Greenplum-Kafka Connector for Real-Time Data Loading

July 26, 2018 Ivan Novick

Pivotal is well known for its agile development processes, and the Pivotal Greenplum product is no exception.  Pivotal Greenplum is built on an agile development cadence: version 5.10, the tenth release in the 5.x line, ships in July 2018, just 10 months after the initial release of 5.0.

The headline feature of the 5.10 release of Pivotal Greenplum is Apache Kafka® integration, provided by the Pivotal Greenplum-Kafka Connector.  Apache Kafka has become an industry-standard technology for stream processing, data ingestion, and enterprise-bus use cases.  In the world of big data, the velocity and volume of incoming data are ever increasing, and a system is needed to capture that data as it arrives in the enterprise.

Key criteria for efficient data ingestion provided by Apache Kafka include:

  • Non-blocking data ingestion, so data can be consumed as fast as it is generated

  • Scalability to petabyte scale

  • Scalability of data readers, so that incoming data can be processed by a growing number of consumer processes throughout the enterprise

  • A clear, automated data retention policy, to counter the natural tendency to keep all data indefinitely

  • Support for in-place analytics and processing on data streams as needed
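The retention policy mentioned above is ordinarily enforced per topic in Kafka.  As a sketch (the topic name, partition count, and window length are illustrative; `retention.ms` is a standard Kafka topic-level configuration, and `kafka-topics.sh` ships with Apache Kafka), a topic with a bounded seven-day retention window might be created like this:

```
# Create a topic that Kafka automatically prunes after 7 days
# (604800000 ms), so data never accumulates indefinitely.
kafka-topics.sh --create \
  --zookeeper localhost:2181 \
  --topic trades \
  --partitions 12 \
  --replication-factor 3 \
  --config retention.ms=604800000
```

Once the retention window elapses, Kafka deletes the oldest log segments on its own, which is exactly the automated retention behavior listed above.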

Apache Kafka and Greenplum: Better Together

Apache Kafka complements the relational database model rather than replacing it.  A relational database like Pivotal Greenplum can ingest and store data over long time periods and perform aggregation, grouping, and summary-style business reporting, as well as advanced analytics that require historical data.  This kind of full-table scanning and aggregation is a perfect fit for an RDBMS or data warehouse, and does not fit the real-time streaming world.

Users want real-time access to the data for in-line processing in Apache Kafka, and they also want the data delivered reliably from Apache Kafka into the RDBMS or data warehouse for SQL analysis.

A Stock Exchange Use Case

Let's take a hypothetical use case in which a stock exchange wants to store the last 10 years of its trades in Pivotal Greenplum and run analytics and reporting on them with SQL and advanced machine-learning and analytics libraries.  The exchange also wants the latency from a trade happening on the exchange to its being ingested into Pivotal Greenplum for analysis to be just a few seconds.  This is possible with the new Pivotal Greenplum-Kafka Connector.

Using the connector, DBAs and application developers create a YAML configuration file that maps incoming data from Apache Kafka topics to Pivotal Greenplum database tables, columns, and rows.  Below is a sample YAML configuration file from the documentation:
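A configuration along these lines follows the load-configuration format described in the Greenplum 5.10 documentation; the database, host, broker, topic, column, and table names here are illustrative assumptions for the stock-exchange scenario:

```yaml
DATABASE: trades_db
USER: gpadmin
HOST: mdw
PORT: 5432
KAFKA:
  INPUT:
    SOURCE:
      BROKER: kafkahost:9092
      TOPIC: trades
    COLUMNS:
      - NAME: trade_id
        TYPE: bigint
      - NAME: symbol
        TYPE: varchar(8)
      - NAME: price
        TYPE: decimal(12,4)
    FORMAT: csv
    ERROR_LIMIT: 25
  OUTPUT:
    TABLE: trades_history
```

Per the documentation, the load job is then started by pointing the connector's `gpkafka load` utility at this file, for example `gpkafka load ./trades_load.yaml`.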

The Greenplum-Kafka Connector can then be started with the YAML configuration file, and all data published to the relevant Apache Kafka topics will be captured and loaded using Pivotal Greenplum's high-speed, direct-to-segment data loading architecture.  Returning to our hypothetical stock exchange use case: once this process is started, every trade published to the queue is loaded into the Pivotal Greenplum database and available for query and analysis within a few seconds.

One reason Apache Kafka is so elegant and successful is the scalability of its readers, which comes from the fact that Apache Kafka servers do not track their readers.  The entire burden of tracking read progress falls on the consumer, in this case the Greenplum-Kafka Connector.  The connector can use Pivotal Greenplum's atomic ACID transactions to store the state of its loading progress and resume from the offset where it left off in the Kafka topic whenever needed.  Data can even be truncated from Pivotal Greenplum and the ETL re-run from an earlier point in time in the Kafka topic, simply by modifying the stored offsets.
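The checkpointing pattern described above can be sketched in a few lines.  This is an illustration of the general technique, not the connector's actual implementation: an in-memory SQLite database stands in for Greenplum, and the key point is that the loaded rows and the advanced offset commit in the same transaction, so a crash can never leave them out of sync.

```python
import sqlite3

# In-memory database standing in for Greenplum (illustration only).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE trades (msg_offset INTEGER, payload TEXT)")
db.execute("CREATE TABLE load_progress (topic TEXT PRIMARY KEY, next_offset INTEGER)")
db.execute("INSERT INTO load_progress VALUES ('trades', 0)")

def load_batch(messages):
    """Atomically insert a batch of (offset, payload) messages and
    advance the stored offset in the same transaction.  If the process
    crashes before commit, neither the rows nor the new offset persist,
    so the batch is simply re-read from Kafka on restart."""
    with db:  # one transaction: rows + offset commit together
        for msg_offset, payload in messages:
            db.execute("INSERT INTO trades VALUES (?, ?)", (msg_offset, payload))
        db.execute("UPDATE load_progress SET next_offset = ? WHERE topic = 'trades'",
                   (messages[-1][0] + 1,))

def resume_offset():
    """Offset to request from Kafka when the loader restarts."""
    return db.execute(
        "SELECT next_offset FROM load_progress WHERE topic = 'trades'"
    ).fetchone()[0]

load_batch([(0, "AAPL 100@150.00"), (1, "MSFT 50@99.10")])
print(resume_offset())  # prints 2 -- the next offset to read
```

Rewinding the ETL, as described above, amounts to truncating the target table and lowering `next_offset` to an earlier point in the topic.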

All of these characteristics point toward a future of continual, real-time data loading, where in-place transformations and processing happen in Apache Kafka, and Kafka topic data is then reliably and atomically transported into Pivotal Greenplum for deep analytics requiring multi-row aggregation and analysis.

Welcome to the future today!

 

For more information about the Greenplum-Kafka Connector, please read the Greenplum documentation.

About the Author

Ivan Novick

Ivan has been working on big data, databases, and enterprise systems for over a decade. He spent 5 years in the financial industry building trading systems; worked at Yahoo on the data warehouse system before Hadoop was created; hacked on a MySQL storage engine for a year; and has spent the last 7 years in various capacities working on the Pivotal Greenplum product. Ivan's passion is building next-generation data platforms. In his free time, he has also been a beginning yoga student for the last 10 years. Born and raised in NYC, Ivan now enjoys the California lifestyle, having resided there since 2006.
