Every enterprise has data silos. They exist as a consequence of systems and processes that have diverged over time. The respective silos for application developers, data scientists, and data engineers become an acute problem when attempting to monetize enterprise data in new, digital ways.
Those Data Silos Have Got to Go
Pivotal customers such as Home Depot, CoreLogic, Ford, and many others have recently described this issue to our engineering teams. A structured platform such as Pivotal Cloud Foundry enables the rapid development of cloud-native apps. Now, these organizations (and many others like them) want to remove their data silos in a similarly opinionated way. From there, they can integrate data sources with their cloud-native apps in new and exciting ways.
Here’s a look at Spring Cloud Data Flow and its ecosystem - more on these projects in a moment.
Spring Cloud Data Flow Ecosystem
Cloud-Native Architecture for Data Scientists & Data Engineers
Spring Cloud Data Flow (SCDF) aims to solve the data silos problem with a cloud-native architecture. Spring Boot users will feel at home immediately: SCDF provides Spring Boot-based microservice frameworks and developer toolkits for data-centric use cases.
SCDF gives data scientists and data engineers a microservices-friendly way to handle these scenarios:
Simple Extract, Transform, Load (ETL), a tried-and-true way to ingest, transform, and load data into relational and NoSQL systems.
Event-driven stream processing, required to work with high-throughput, context-specific data. This is common in IoT workloads, like processing the moving average of the temperature associated with sensor devices.
Closed-loop style predictive analytics, such as an algorithm that predicts a customer’s propensity to churn, based on a model derived from historical data.
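To make the IoT scenario above concrete, here is a minimal sketch of a moving average over sensor temperatures in plain Java. It does not use Spring Cloud Stream; the class name `MovingAverage`, the window size, and the sample readings are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sliding-window moving average over the most recent N temperature readings. */
final class MovingAverage {
    private final int windowSize;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum;

    MovingAverage(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
    }

    /** Adds one reading and returns the average of the readings currently in the window. */
    double add(double reading) {
        window.addLast(reading);
        sum += reading;
        if (window.size() > windowSize) {
            sum -= window.removeFirst(); // evict the oldest reading
        }
        return sum / window.size();
    }

    public static void main(String[] args) {
        // Hypothetical readings from a single sensor, in degrees Celsius.
        MovingAverage average = new MovingAverage(3);
        double[] readings = {20.0, 22.0, 24.0, 26.0};
        for (double reading : readings) {
            System.out.printf("reading=%.1f average=%.2f%n", reading, average.add(reading));
        }
    }
}
```

In a real pipeline, logic like this would sit inside a processor application, with the framework delivering each reading as an event.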
What’s the benefit for data teams?
Data scientists, database administrators, and other engineers can now take advantage of modern software engineering practices to deliver data-driven business solutions to the market continuously.
With Spring Cloud Data Flow:
- The unit of work is more focused.
- Development proceeds in rapid, incremental iterations.
- Spring Boot becomes a consistent model across different departments in the organization. Developers and data teams can rally around a common pattern. They can use the same toolkits, processes, and automation.
Spring Cloud Data Flow 1.1 Features
We announced the general availability of Spring Cloud Data Flow last July. We’ve been busy since then!
Since that initial release, the Spring Cloud Data Flow project ecosystem has shipped 92 minor releases. During this time, we received excellent feedback and validation from the community and customers (including many great conversations at SpringOne Platform). These interactions helped us prioritize a few key themes:
Improvements to Spring Cloud Data Flow's orchestration service. This makes it easier to operationalize microservice-based data pipelines on modern runtime platforms.
Broader support for event-driven stream processing with Spring Cloud Stream.
Support for short-lived processing of finite data sets with Spring Cloud Task.
Let’s review how Spring Cloud Stream and Spring Cloud Task combine to address integration between data silos. We’ll also examine a few other key enhancements in Spring Cloud Data Flow 1.1.
Spring Cloud Stream
Spring Cloud Stream provides an event-driven microservice framework to quickly build message-based applications that can connect to external systems such as Cassandra, Apache Kafka, RDBMS, Hadoop, and so on. These integrations are done via binders, including these new implementations:
Apache Kafka. Building on the standalone Spring for Apache Kafka project, the Spring Cloud Stream binder for Apache Kafka has been completely rewritten. This production-ready binder implementation now supports the latest Apache Kafka 0.9.x/0.10.x releases.
Google Pub/Sub. Here's an experimental Google Pub/Sub binder implementation. With Pivotal Cloud Foundry on Google Cloud Platform (GCP), the Google Pub/Sub binder implementation provides a tightly integrated solution through GCP service brokers to facilitate high-throughput stream processing.
JMS. A flexible jms-binder implementation that supports ActiveMQ, Solace, and IBM MQ as JMS vendors is now available. We plan to release a more full-featured binder implementation in the future.
Spring Cloud Stream now includes a "schema registry server" and a general-purpose "schema registry client." Event-driven applications can now include metadata about their “payload schema.” Apps can evolve independently as their payload data structure changes, even while the upstream and downstream data processing apps remain unchanged.
This release of Spring Cloud Stream also supports Confluent's Schema Registry. To learn more, see this talk by Vinicius Carvalho from SpringOne Platform.
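The idea behind independent evolution can be sketched in a few lines of plain Java. This is only an illustration of default-based schema resolution, loosely modeled on how Avro-style readers treat missing and unknown fields; it does not use the schema registry server or client, and the `SchemaResolver` class and the field names in the usage note are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Resolves a record written under a different schema version against the fields this reader expects. */
final class SchemaResolver {
    // Hypothetical reader schema: field name -> default used when the writer omitted the field.
    private final Map<String, Object> readerDefaults;

    SchemaResolver(Map<String, Object> readerDefaults) {
        this.readerDefaults = new LinkedHashMap<>(readerDefaults);
    }

    Map<String, Object> resolve(Map<String, Object> written) {
        Map<String, Object> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerDefaults.entrySet()) {
            // Take the written value when present; otherwise fall back to the reader's default.
            resolved.put(field.getKey(), written.getOrDefault(field.getKey(), field.getValue()));
        }
        return resolved; // fields the reader does not know about are ignored
    }
}
```

With a reader that expects `sensorId` and `unit` (default `"C"`), a record written by an older producer without `unit` still resolves cleanly, and an extra field added by a newer producer is simply dropped. Neither side has to be redeployed in lockstep.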
Reactive Stream Processing
The reactive programming model is flexible: use it for simple time-windowing and moving-average scenarios, as well as more complex event processing requirements. Check out this time-windowing sample to get an idea.
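To give a feel for time-windowing, here is a minimal sketch that groups timestamped readings into fixed, non-overlapping windows and averages each one. It is plain Java over arrays and does not use any reactive library; the class name `TumblingWindow` and its method signature are assumptions for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Groups timestamped readings into fixed, non-overlapping (tumbling) time windows. */
final class TumblingWindow {
    /** Returns window start (ms) -> average of the readings whose timestamp falls in that window. */
    static Map<Long, Double> averages(long[] timestampsMs, double[] values, long windowMs) {
        if (timestampsMs.length != values.length) {
            throw new IllegalArgumentException("timestamps and values must have the same length");
        }
        if (windowMs <= 0) {
            throw new IllegalArgumentException("windowMs must be positive");
        }
        Map<Long, double[]> sums = new TreeMap<>(); // window start -> {sum, count}
        for (int i = 0; i < values.length; i++) {
            // Round the timestamp down to the start of its window.
            long start = Math.floorDiv(timestampsMs[i], windowMs) * windowMs;
            double[] acc = sums.computeIfAbsent(start, k -> new double[2]);
            acc[0] += values[i];
            acc[1] += 1;
        }
        Map<Long, Double> result = new TreeMap<>();
        for (Map.Entry<Long, double[]> entry : sums.entrySet()) {
            result.put(entry.getKey(), entry.getValue()[0] / entry.getValue()[1]);
        }
        return result;
    }
}
```

A reactive implementation would express the same grouping with windowing operators on a stream of events, rather than over arrays that are already in memory.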
The beauty of Spring Cloud Stream? It expedites the creation and testing of event-driven data processing business logic. Data engineers no longer have to re-code, re-configure, and repeat mundane tasks for each requirement. This approach yields a good core architecture for integrating data silos, a foundation that simplifies the fulfillment of individual requirements. Further, Spring Cloud Stream helps avoid rewrites and lowers technical debt over time by separating custom code from the boilerplate infrastructure needed for stream processing, messaging, and integration.
Spring Cloud Stream also takes care of many important (but tedious!) things for the developer. With Spring Cloud Stream, the engineer doesn’t have to worry about the plumbing of event data. And the engineer no longer has to think about building or maintaining things like:
How the application discovers the underlying messaging middleware infrastructure
The integration layer between application and the middleware
“Last mile” adapters to integrate with common systems and technologies
Persistent publish/subscribe semantics, consumer groups, and partitions
All of this heavy lifting is automatically handled by Spring Cloud Stream! Abstractions like the pub/sub semantics, consumer groups, and partitions are portable across popular messaging middleware. This reduces the refactoring caused by messaging technology changes, and also shortens the learning curve for developers in general.
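As a rough illustration of what the partition and consumer-group abstractions do, the sketch below maps a message key to a partition and a partition to one instance within a group. It is not the framework's implementation; the `Partitioning` class, its method names, and the modulo-based assignment are assumptions chosen for simplicity.

```java
/** Key-based partition selection and partition ownership within a consumer group. */
final class Partitioning {
    /** Maps a key to a partition index in [0, partitionCount); equal keys always land together. */
    static int partitionFor(String key, int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    /** Picks the single group instance that consumes a partition (assumed round-robin by index). */
    static int ownerOf(int partition, int instanceCount) {
        if (partition < 0 || instanceCount <= 0) {
            throw new IllegalArgumentException("partition must be >= 0 and instanceCount positive");
        }
        return partition % instanceCount;
    }

    public static void main(String[] args) {
        // Hypothetical sensor key, 4 partitions, 2 instances in the consumer group.
        int partition = partitionFor("sensor-42", 4);
        System.out.println("partition=" + partition + " owner=" + ownerOf(partition, 2));
    }
}
```

Because every event for a given key reaches the same instance, stateful logic such as a per-sensor moving average stays correct when the application is scaled out.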
Further, the programming model helps engineers focus on their data processing code, along with test fixtures for incremental validations. The final application artifact, the output of all this custom code, is portable across a variety of modern runtime platforms - as Marius Bogoevici demonstrated at Devoxx Belgium recently.
Spring Cloud Task
The Spring Cloud Task project provides a short-lived microservice framework to quickly build applications for finite data processing workloads, such as ETL/ELT batch jobs or a predictive model-training algorithm that runs for a limited time.
Here are a few noteworthy additions to this project.
Flexible System Configuration
Task Repository: The repository records the results of each task execution, including stack traces. Thanks to the community, DB2 is now supported as part of the available options.
Task Execution Identifiers: As part of keeping track of the task execution and its lifecycle events, there is now an option to supply externally generated execution identifiers. There is also support for accepting task identifiers as provided by the runtime platforms where the task actually runs.
Partitioned Batch Jobs
Running batch jobs in parallel is often required when dealing with large volumes of data. Spring Cloud Task has a simple configuration option to support this use case. Define the number of parallel workers, and data processing is parallelized in a variety of cloud settings. Cloud Foundry, Apache Mesos, Kubernetes, and Apache YARN are all supported. The screencast below demonstrates this scenario with Spring Cloud Data Flow.
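The core idea of partitioning, splitting the work into ranges and handing each range to a worker, can be sketched in plain Java. On a real platform the workers are launched as separate application instances; this sketch uses threads in a single JVM instead, and the `PartitionedJob` class and its methods are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Splits a batch of items across a fixed number of parallel workers. */
final class PartitionedJob {
    /** Splits [0, itemCount) into at most `workers` contiguous {start, end} ranges of near-equal size. */
    static List<int[]> partition(int itemCount, int workers) {
        if (itemCount < 0 || workers <= 0) {
            throw new IllegalArgumentException("itemCount must be >= 0 and workers positive");
        }
        List<int[]> ranges = new ArrayList<>();
        int base = itemCount / workers;
        int remainder = itemCount % workers;
        int start = 0;
        for (int w = 0; w < workers; w++) {
            int size = base + (w < remainder ? 1 : 0);
            if (size == 0) {
                break; // more workers than items
            }
            ranges.add(new int[] {start, start + size});
            start += size;
        }
        return ranges;
    }

    /** Sums the data in parallel, one task per range, then combines the partial results. */
    static long parallelSum(int[] data, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (int[] range : partition(data.length, workers)) {
                partials.add(pool.submit(() -> {
                    long sum = 0;
                    for (int i = range[0]; i < range[1]; i++) {
                        sum += data[i];
                    }
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
            }
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Changing the worker count changes only how the ranges are cut; the per-range processing logic stays the same, which is what makes a single configuration option sufficient.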
Rather than spending time on a custom, “snowflake” solution, data engineers can focus on solving business problems directly, with the help of tools for resilient, short-lived data processing.
We’ve brought Spring Cloud Stream and Spring Cloud Task together. As a result of this unification, Spring Cloud Data Flow consolidates the way data pipelines are created, tested and operationalized in production!
This release adds capabilities to directly address our market research and community feedback:
Nested Pipelines. For pipelines involving multiple processing steps and nested TAP'd streams, there's now a Flo graph to visualize the entire topology.
Bulk Definition Uploads. Using the Flo canvas for "bulk definitions", you can now import a file or copy/paste the task definition DSLs in bulk. This functionality includes incremental validations, error highlights, and snappier navigation between the errors for quicker Task DSL corrections. Watch Andy Clement’s demonstration of this feature below.
Security. LDAP(S), File-based, and Basic authentication options are now included.
Spring Boot Compatibility. Both the streaming and task applications are now compatible with Spring Boot 1.3.x and 1.4.x releases. The property whitelisting feature is supported in both release lines as well.
Stream and Task Applications. The out-of-the-box utility applications have gone through several updates, including the addition of apps that launch tasks via stream events. There's a new Task application to migrate data from an RDBMS to HDFS. Several Spring Integration improvements and bugfixes were applied to all the stream applications as well.
Decoupling Releases & Drinking Our Own Champagne
Even with all this progress, we are of course still learning and improving the release process. To this end, we have decomposed the monolithic release of stream and task app-starters.
Since app-starters are standalone Spring Boot utility apps, it’s not necessary to release them together - especially when only a handful of them change frequently.
We have migrated each of the standalone applications to its own GitHub repo. We also maintain them as separate apps, fixing bugs and updating dependencies independently. This model allows us to release updates whenever we want. No tight coupling, no more monolithic releases!
As an open-source project, we sincerely thank all the contributors for their time and efforts to improve the Spring Cloud Data Flow ecosystem. We look forward to continued collaboration with you!
This is great validation for our product direction - it shows how easy it is to apply Spring Cloud Data Flow’s cloud-native patterns on different runtime platforms. Here’s what contributor Donovan Muller had to say, in an email to the project team:
“It's not often you get that 'wow' feeling when reading about a new framework but that is exactly what happened when I first clapped eyes on Spring Cloud Data Flow. Since following the project from the 1.0.0.M1 days I have been consistently impressed and excited at the flexibility and innovation this project lends to various requirements, from traditional data pipelines, orchestrating event based microservices to general integration patterns, the overall design lends itself to an ever increasing list of use cases. Born from this enthusiasm, I have of late also enjoyed contributing towards the project and really appreciate the support from the team in reviewing and promoting contributions. I am honoured and proud to be part of such an awesome project all round.” - Donovan Muller
Several exciting features are in the pipeline:
A Pivotal Cloud Foundry tile for Spring Cloud Data Flow is in development. This will simplify provisioning and deliver many powerful security capabilities.
Apache Kafka is great for high-throughput, low-latency use cases; we plan to support it as a first-class citizen. To this end, we’re developing a native integration for Kafka Streams (KStreams). This will help you build standalone event-driven applications with the KStreams APIs directly in the programming model.
To improve the project’s security posture, we are exploring ways to implement RBAC/ACL for streams and tasks.
Metrics and monitoring within the pipeline, at the application level and in aggregate, are under consideration.
A new concept, “composed tasks,” is planned. It would allow a batch job to be defined as a “directed graph of other batch jobs.”
Dashboard/Flo has always been a priority; a visual representation of “composed tasks” is coming in the near future.
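Since composed tasks are still on the roadmap, there is no API to show yet. The underlying idea of running a directed graph of jobs can be sketched in plain Java: order the jobs so that each one starts only after everything it depends on has finished. The `TaskGraph` class below is hypothetical and is not the planned feature's design.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** A directed graph of batch jobs with a dependency-respecting execution order. */
final class TaskGraph {
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    private final Map<String, Integer> upstreamCount = new LinkedHashMap<>();

    void addTask(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        upstreamCount.putIfAbsent(name, 0);
    }

    /** Declares that `after` may start only once `before` has finished. */
    void addDependency(String before, String after) {
        addTask(before);
        addTask(after);
        downstream.get(before).add(after);
        upstreamCount.merge(after, 1, Integer::sum);
    }

    /** Returns an order that respects every dependency (Kahn's algorithm). */
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(upstreamCount);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : remaining.entrySet()) {
            if (entry.getValue() == 0) {
                ready.addLast(entry.getKey()); // no upstream jobs, can start immediately
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String task = ready.removeFirst();
            order.add(task);
            for (String next : downstream.get(task)) {
                // A job becomes ready once its last upstream job has been scheduled.
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.addLast(next);
                }
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("cycle detected: the jobs do not form a directed acyclic graph");
        }
        return order;
    }
}
```

For example, with an `extract` job feeding two transform jobs that both feed a `load` job (names invented here), the order starts with `extract` and ends with `load`, and the two transforms in between could run in parallel.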
We are excited to bring improvements to the ecosystem. The community is crucial to our progress, and we look forward to your questions, comments, and feature requests! Please reach out to us on StackOverflow or GitHub.
About the Author
Sabby Anandan is a Product Manager on the Spring team at Pivotal. He focuses on building products that address the challenges faced with iterative development and operationalization of data-intensive applications at scale. Before joining Pivotal, Sabby worked in engineering and management consulting positions. He holds a Bachelor's degree in Electrical and Electronics from the University of Madras and a Master's degree in Information Technology and Management from Carnegie Mellon University.