How to Build Modern Data Pipelines with Pivotal GemFire and Spring Cloud Data Flow

July 25, 2018 Greg Green

Data management is a hard problem. Compounding the situation: siloed legacy environments. Enterprises often have dozens of systems that store data. For example, one system might manage a catalog of products. Another hosts inventory information. A third tracks customer feedback and incidents.

How do you correlate all the data in these systems? It’s a hard problem. Of course, the health of the business depends on your ability to extract material trends from this data!

So engineers often create custom apps to gather up data from all these silos. They aim to build an interface that makes sense of all these disparate systems. The goal of this software? To offer up data that can answer the questions crucial to a company’s health, like:

  • What products are selling?

  • Who is buying them? How are they using them?

  • Where are they being used?

  • How efficient is our customer acquisition?

  • How can we increase product value?

In practice, that rarely works out. We’ll explain why this approach is often problematic. Then, we’ll offer up examples of modern data pipelines that can improve data analysis workflows.

Diagnosing the Legacy Approach

A crude way to solve data analysis challenges is to query all your source systems in real time. Unfortunately, performance quickly degrades, so it’s not a great option for most scenarios.

There are a few other problems you’ll often run into:

Capacity planning is hard. Capacity planning for legacy systems is usually done in a waterfall fashion. That means it’s difficult to dynamically adjust capacity as datasets grow. This leads to single points of failure. And there’s no easy way around this: in many cases, significantly increasing capacity is simply not an option.

Data warehouses and data marts aren’t practical when you need real-time data. Decisions need to get made faster, based on real-time data. That’s often where warehouses and data marts fall short.

And once again, scaling is problematic. As more and more apps need access to the underlying data repositories, you’ll often wind up buying lots of expensive hardware. Even worse, large-scale migration to new hardware can take months.

So this can result in a sticky situation. The future of your business depends on access to fast, accurate data in real time. You’ve got the data, but no good way to surface it to users. Fortunately, there’s a solution.

Build Modern Data Pipelines!

A modern approach to solving the data processing problem is to use integration pipelines that aggregate data into cached data sets. These pipelines are flexible enough to evolve over time using simple scripting conventions.

How do you get started?

For our part at Pivotal, we’ve seen enterprises use two products in tandem, Spring Cloud Data Flow (SCDF) and Pivotal GemFire, to deliver more business agility in these scenarios.

SCDF can be used to stream data from source systems to a cache in Pivotal GemFire based on configurable pipeline flows.

Here’s a logical view of a common architecture:

Build Flexible Data Pipelines with Spring Cloud Data Flow

SCDF is a data integration platform that supports real-time data pipeline processing. This allows users to deploy long-lived or short-lived data processes.

SCDF supports a pipeline definition language that is similar to UNIX commands. Consider the following example.

In UNIX, piping the output of “grep” into “awk” extracts values from files based on a provided search criterion.

grep ... | awk ...

With SCDF you can connect out-of-box and custom modules to form flexible processing pipelines in a similar fashion to UNIX piped commands.
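To make the piping analogy concrete, here is a minimal Python sketch of the idea. It is not SCDF itself: the stage names (grep, field), the pipe helper and the sample order lines are all hypothetical, chosen only to show how small steps chain together so that each one consumes the output of the previous one.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator

# A stage takes a stream of lines and returns a new stream of lines.
Stage = Callable[[Iterable[str]], Iterator[str]]


def grep(pattern: str) -> Stage:
    """Keep only lines containing the pattern (a stand-in for a filter step)."""
    def stage(lines: Iterable[str]) -> Iterator[str]:
        return (line for line in lines if pattern in line)
    return stage


def field(index: int) -> Stage:
    """Emit one whitespace-separated column per line, like awk '{print $N}'."""
    def stage(lines: Iterable[str]) -> Iterator[str]:
        return (line.split()[index] for line in lines)
    return stage


def pipe(*stages: Stage) -> Stage:
    """Chain stages left to right so each consumes the previous one's output."""
    def run(lines: Iterable[str]) -> Iterator[str]:
        return reduce(lambda data, stage: stage(data), stages, lines)
    return run


# Hypothetical sample data: product, location, quantity.
orders = ["widget NY 3", "gadget NJ 1", "widget NJ 7"]
widget_quantities = list(pipe(grep("widget"), field(2))(orders))
print(widget_quantities)  # ['3', '7']
```

Swapping a stage in or out changes the pipeline without touching the other stages, which is the property the UNIX comparison is pointing at.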

SCDF supports a REST API to allow users to define, start and monitor pipeline flows. It features a shell interface you may use to manage flows through an interactive command line prompt.

SCDF also provides a GUI dashboard for a user-friendly way of managing and monitoring data pipelines.

Each pipeline in a data flow consists of Spring Boot apps built with the Spring Cloud Stream or Spring Cloud Task microservice frameworks. Use Spring Cloud Stream for long-lived flows. Spring Cloud Task is best for short-lived process flows. Streams are backed by messaging frameworks like RabbitMQ or Apache Kafka that offer integration, durability and scalability. Spring Boot can also wrap calls to scripting languages (Python, Ruby, Groovy) or other REST services. The flows can be defined and maintained directly by technical users or administrators.

Pivotal GemFire: High-Performance Caching

GemFire is an In-Memory Data Grid (IMDG) based on the Apache Geode™ open source project. You can use GemFire as a data cache for the data your pipelines output.

GemFire's two main components are a locator and a cache server (a.k.a. data node).

Here’s how GemFire works:

  • The client connects to a locator to access data. All data is stored on the cache server.

  • Cache servers/data nodes register with locators to be discovered by clients or other data nodes.

  • The knowledge of data nodes and the data location is abstracted from the client.

  • The number of data nodes can be scaled up to handle increased data or clients.

GemFire even supports instances where data is replicated across multiple data centers over a wide area network in near real time. You can have active-active or active-passive cross data center deployments. For example, you can have redundancy in New York and London data centers. This way, you can reduce latency by hosting data closer to applications and users.

GemFire “Regions” Boost Scalability and Availability

Data is managed in a region. Here, a region is similar to a table in a traditional relational database. Each region can have a different data policy.

A replicated region data policy stores a copy of each entry on every data node. This is normally used to store smaller data sets such as reference data. For example, “locations” could be a replicated region that holds location object details, with NY, Charlotte and NJ as location instances stored in the region on each data node.

A partitioned region data policy stores the primary copy of each entry on only one data node. Each data node therefore holds only a portion of the primary entries, plus a configured number of backup copies of other entries to increase fault tolerance. The partitioned region data policy lets you store larger datasets in the cache. For example, Joe, Imani and Nyla could be user instances stored in a balanced, sharded manner across the data nodes.
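The difference between the two policies can be sketched in a few lines of Python. This is an illustration only: the node names are hypothetical, and the placement rule (a CRC32 hash of the key, with backups on the next nodes in the list) is an assumption made for the sketch, not a description of how GemFire actually assigns entries.

```python
import zlib
from typing import List

# Hypothetical data nodes in the cluster.
NODES = ["node-1", "node-2", "node-3"]


def replicated_owners(key: str) -> List[str]:
    """Replicated policy: every data node holds a copy of every entry."""
    return list(NODES)


def partitioned_owners(key: str, redundant_copies: int = 1) -> List[str]:
    """Partitioned policy: one primary node plus backups on other nodes.

    The first node returned is the primary; the rest hold backup copies.
    """
    primary = zlib.crc32(key.encode("utf-8")) % len(NODES)
    # Never ask for more copies than there are nodes to hold them.
    count = min(1 + redundant_copies, len(NODES))
    return [NODES[(primary + offset) % len(NODES)] for offset in range(count)]


for user in ("Joe", "Imani", "Nyla"):
    print(user, "->", partitioned_owners(user))
print("NY ->", replicated_owners("NY"))
```

With one redundant copy, each user entry lives on two of the three nodes, while the reference entry NY lives on all three.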

Data Access

GemFire supports NoSQL operations to get region entry objects very quickly, through the use of a key. Regions are based on key/value pairs.

Users can store an entry in a region using a "put" operation.

Users can retrieve an entry object directly from a region using the "get" operation by providing the key identifier.

GemFire also supports SQL-like queries through its Object Query Language (OQL). With OQL you can select objects by a particular attribute in a “where” clause.

OQL supports simple and complex queries (like nested queries). Together, these data access options give users a complete view of the information they need, combining traditional access conventions like queries with newer object-based NoSQL caching approaches.
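The two access styles can be compared with a small Python sketch. A plain dict stands in for a region here, which is an assumption made for illustration and not the GemFire client API. The User class, the keys and the city assigned to each user are also hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    city: str


# A region modeled as a dict of key -> value (illustration only).
users = {}

# "put": store an entry under a key.
users["u1"] = User("Joe", "NY")
users["u2"] = User("Imani", "Charlotte")
users["u3"] = User("Nyla", "NY")

# "get": fetch one entry directly by its key.
joe = users.get("u1")

# Attribute-based selection, comparable in spirit to an OQL statement such as
#   SELECT * FROM /users u WHERE u.city = 'NY'
ny_users = [user.name for user in users.values() if user.city == "NY"]

print(joe)
print(ny_users)
```

The key lookup touches a single entry, while the attribute query has to consider every value in the region, which is why key-based access is the fast path.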

Recall the business questions from our introduction. You can see how valuable this flexibility really is!

Other Highlights

In general, GemFire use cases tend to be based on FAST data access patterns with sub-second response times.

GemFire also supports:

  • Event listeners (similar to database triggers)

  • Functions (similar to stored procedures)

  • Transactions

  • Full-text searches

  • And more
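As a rough illustration of the event listener idea from the list above, the following Python sketch calls registered callbacks whenever an entry is written, much as a database trigger fires on a row change. The ObservableRegion class, its method names and the sample product data are hypothetical and do not reflect GemFire's own listener interfaces.

```python
class ObservableRegion:
    """A toy key/value region that notifies listeners after each put."""

    def __init__(self):
        self._entries = {}
        self._listeners = []

    def add_listener(self, callback):
        """Register a callable that receives (event, key, value)."""
        self._listeners.append(callback)

    def put(self, key, value):
        # Decide the event type before the write changes the region.
        event = "update" if key in self._entries else "create"
        self._entries[key] = value
        for callback in self._listeners:
            callback(event, key, value)


audit_log = []
products = ObservableRegion()
products.add_listener(lambda event, key, value: audit_log.append((event, key, value)))

products.put("sku-1", {"price": 10})
products.put("sku-1", {"price": 12})
print(audit_log)
```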

With so many possibilities, users can aggregate data knowledge over time without being limited in their ability to scale.

SCDF and GemFire: Better Together

Imagine you have siloed product and availability source systems. You can have an SCDF pipeline that starts by reading incremental changes from a product source system. The change data capture records can be cached into GemFire as needed.

productSource | gemfire

Another pipeline can read availability information from a source system. The information can be piped into GemFire to merge the datasets.

availabilitySource | gemfire

This approach would allow you to marry many different, but related, datasets. SCDF can allow you to pipe together various data sets from previously siloed systems just like a savvy UNIX user would pipe together commands like “grep” and “awk”.
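The effect of the two pipelines feeding one cache can be sketched in Python. Everything here is hypothetical: the record fields, the sku key and the cache_sink helper are stand-ins, and merging fields into a shared entry by key is an assumption about how the datasets are combined rather than the documented behavior of any SCDF sink.

```python
def product_source():
    """Stand-in for change records read from the product system."""
    yield {"sku": "sku-1", "name": "Widget"}
    yield {"sku": "sku-2", "name": "Gadget"}


def availability_source():
    """Stand-in for records read from the availability system."""
    yield {"sku": "sku-1", "in_stock": 12}
    yield {"sku": "sku-2", "in_stock": 0}


def cache_sink(records, region, key_field="sku"):
    """Merge each record into the cached entry that shares its key."""
    for record in records:
        entry = region.setdefault(record[key_field], {})
        entry.update(record)


# A dict stands in for the cache that both pipelines write to.
catalog = {}
cache_sink(product_source(), catalog)       # productSource | gemfire
cache_sink(availability_source(), catalog)  # availabilitySource | gemfire

print(catalog["sku-1"])  # {'sku': 'sku-1', 'name': 'Widget', 'in_stock': 12}
```

An application reading the cache now sees product and availability details together, without querying either source system.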

With SCDF you can easily cache datasets to GemFire using simple scripting conventions. GemFire can store different data types. Objects, JSON, XML and binary can all be stored on the grid.

How These Modern Patterns Deliver Better Business Outcomes

At Pivotal, we help our customers deliver better outcomes, as defined by Speed, Stability, Scalability, Security and Savings. Let’s see how our proposed architecture measures up.

Speed

The cached data now supports fast access. Getting faster answers to important questions increases overall productivity. In many cases, you can realize significant savings with responses in milliseconds, instead of the seconds/minutes/hours that characterize legacy approaches.

Stability

Multiple levels of redundancy are the key to a reliable data platform. Eliminating single points of failure is essential. The data pipeline must have reliable integration messaging. The data must be both guaranteed to be delivered and available whenever needed.

With GemFire’s built-in reliability and SCDF’s messaging durability support, there are no more single points of failure. If the data pipelines can feed the systems of record, in addition to the cache, then there would be no more stale data. Cleaner integrations across multiple applications would be less of a burden on operation support teams. In our examples, multiple applications get their data from the grid.

Scalability

Users can continue to cache more information to answer a potentially infinite number of questions.

The data pipelines should be easy to evolve over time. The cached views minimize the impact on source systems by offloading some of the access performance needs.

Scalability of both the pipelines and data cache is a key feature that’s built in from the start. Horizontal scaling allows the solution to be more agile. GemFire allows you to dynamically add data nodes to handle increased data sets and user volumes. SCDF allows you to dynamically increase the number of Spring Boot app instances to handle large data volumes faster in the pipeline. It is hard to predict future data capacity needs, so workloads must allow for dynamic scaling on demand.

Security

SCDF and GemFire have been designed to satisfy common enterprise needs. Connections between pipelines, cache and applications can be easily secured. Both SCDF and GemFire support encrypted TCP-based network communication using SSL/TLS. Both support configurable application or user level access for authentication and authorization security.

GemFire supports fine-grained data access control to regions and/or particular data sets. You can grant different levels of access (administrator, read/write, or read-only) to different users to prevent unauthorized access.

For example, you can allow a reporting user to have read-only access to a specific set of regions. You can give a manager read/write permission to one or more entries in a single region based on the keys. This level of security control is often needed as more users get access to the grid over time.
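The reporting user and manager example can be modeled with a short Python sketch. It is a simplified illustration of region-level and key-level grants, not GemFire's security API. The Grant class, the user names, the region names and the is_allowed helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Grant:
    operation: str             # "read" or "write"
    region: str
    key: Optional[str] = None  # None means every key in the region


# Hypothetical grants: read-only regions for reporting, key-level write for a manager.
GRANTS = {
    "reporting_user": [Grant("read", "products"), Grant("read", "locations")],
    "manager": [Grant("read", "users"), Grant("write", "users", key="Joe")],
}


def is_allowed(user: str, operation: str, region: str, key: Optional[str] = None) -> bool:
    """Return True if any of the user's grants covers this operation."""
    for grant in GRANTS.get(user, []):
        if grant.operation != operation or grant.region != region:
            continue
        # A region-wide grant covers every key; otherwise keys must match.
        if grant.key is None or grant.key == key:
            return True
    return False


print(is_allowed("reporting_user", "read", "products", "sku-1"))   # True
print(is_allowed("reporting_user", "write", "products", "sku-1"))  # False
print(is_allowed("manager", "write", "users", "Joe"))              # True
print(is_allowed("manager", "write", "users", "Nyla"))             # False
```

Denying by default, as the final return does, means a user with no matching grant gets no access.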

Savings

Our modern design helps you eliminate reliance on expensive legacy systems over time. This can be a huge cost savings, especially for systems that may be hard to manage. Both SCDF and GemFire allow you to run your deployments on inexpensive commodity hardware or virtualized instances. Evolve data and pipelines over time by emphasizing configuration over customization, and you’ll realize additional savings.

Learning More

We’ve shown you how siloed legacy systems and apps can be modernized in ways that are very flexible using SCDF and GemFire.

We have seen companies move to these types of architectures. The decision to build this type of architecture generally increases the overall agility to get the needed data. Using SCDF pipelines with GemFire has been the chosen approach for many modern data solutions.

Visit us at Pivotal.io to learn more about using Pivotal GemFire and Spring Cloud Data Flow for modern data pipelines.

About the Author

Greg Green

Gregory Green is a Data Engineer at Pivotal. He has over 23 years of diverse software engineering experience in various industries such as pharmaceutical, financial, telecommunication and others. Gregory specializes in GemFire, SCDF, Spring and PCF based solutions. He holds bachelor's and master's degrees in Computer Science.
