Spring Cloud Data Flow 2.0 at a Multi-platform Enterprise

March 6, 2019 Sabby Anandan

We are pleased to announce the general availability of Spring Cloud Data Flow (SCDF) 2.0!

You can download the Spring Cloud Data Flow release from the Spring repository right now.

 

Why Spring Cloud Data Flow?

Spring Cloud Data Flow (SCDF) is a toolkit for building data integration, batch, and real-time data processing pipelines to support IoT and other data-intensive business applications.

Alright, so... what about it? The following visual unpacks the value proposition for developers and operations teams.

 

What’s new in Spring Cloud Data Flow v2.0?

This major release is packed with feature improvements, including the flexibility to configure multiple platform backends so that streams and tasks can be orchestrated across Cloud Foundry and Kubernetes. To monitor streaming data pipelines, a comprehensive solution built on Prometheus, InfluxDB, and Grafana is now available. On the security front, SCDF and the related components in the architecture now default to OAuth2 and OpenID Connect as the standard. Further, SCDF v2.0 builds upon the stable foundations of Spring Boot 2.1.x and Spring Cloud 2.1.x to bring Java 11 compatibility.

Before we dive into the individual features, let’s briefly review the improvements in the SCDF ecosystem.

Ecosystem Update

Spring Cloud Task

As a framework to build and run batch jobs as short-lived microservice applications, Spring Cloud Task has reached a feature-complete state. The primary contribution in the latest Spring Cloud Task v2.1 release is Java 11 compatibility.
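To make the “short-lived microservice” idea concrete, here is a minimal sketch of a Spring Cloud Task application: the `@EnableTask` annotation records each run’s start, end, and exit code in the task repository. The class and bean names are hypothetical.

```java
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.task.configuration.EnableTask;
import org.springframework.context.annotation.Bean;

// A minimal Spring Cloud Task sketch: the app starts, does its work
// once, records its lifecycle in the task repository, and exits.
@SpringBootApplication
@EnableTask
public class ShortLivedTaskApplication {

    public static void main(String[] args) {
        SpringApplication.run(ShortLivedTaskApplication.class, args);
    }

    // Placeholder for the task's business logic; runs once per launch.
    @Bean
    public CommandLineRunner worker() {
        return args -> System.out.println("Short-lived batch work done.");
    }
}
```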

Spring Cloud Task and SCDF continue to drive the adoption of batch data processing in the cloud. Check out the 24x7 ETL reference architecture to get an idea of the complexity of such an architecture in the cloud and how it comes together end-to-end with the help of SCDF.

Spring Cloud Stream

Thanks to Spring Cloud Stream, you can build and test event-driven streaming applications in isolation, then take them to the cloud and run them without any extra work. The latest v2.1 release of Spring Cloud Stream brings a number of improvements to the framework. You can read about them in the Spring release blog, but let’s review an important theme here.

Composable Streaming Functions

A new programming model based on Spring Cloud Function is available in v2.1. With it, you can leverage “function composition”, a method to compose a series of business operations from a set of predefined functions. You pick and chain them so that your data passes through each of the business functions while the framework handles the data journey and the data transformation. SCDF users can compose a series of app-starters into a single functional chain. The developer experience is discussed, with an example, in the Composed Function in SCDF blog.
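As a minimal sketch (the class and function names are hypothetical), two small business functions can be declared as beans in a Spring Cloud Stream 2.1 application and then chained at deployment time with the function-definition property, for example `spring.cloud.stream.function.definition=toUpperCase|reverse`:

```java
import java.util.function.Function;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

// Two composable business functions; the framework pipes the data
// through each function in the order given by the composition property.
@SpringBootApplication
public class CompositionSketchApplication {

    public static void main(String[] args) {
        SpringApplication.run(CompositionSketchApplication.class, args);
    }

    // First business function: normalize the incoming payload.
    @Bean
    public Function<String, String> toUpperCase() {
        return String::toUpperCase;
    }

    // Second business function: reverse the normalized payload.
    @Bean
    public Function<String, String> reverse() {
        return payload -> new StringBuilder(payload).reverse().toString();
    }
}
```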

Spring Cloud Skipper

Skipper continues to be an integral component of SCDF, facilitating and promoting CI/CD for the business applications in streaming data pipelines. You can watch the webinar to learn how to apply CI/CD to data pipelines.

The primary driver for Skipper v2.0 is alignment with SCDF v2.0 and the integrated layers: the common Spring Boot and Spring Cloud foundation, a consistent security experience, database schema upgrades, and data migration support. We do this so that you, as a user, won’t run into migration issues when switching to the new release.

Applications

The latest Einstein and Elston release trains continue to improve the stability of the out-of-the-box utility applications, a reusable catalog that addresses common use-cases. A variety of samples showcase these applications in action. The releases build upon the Spring Cloud Stream 2.1.x and Spring Cloud Task 2.1.x improvements to provide a consistent experience with the latest release of Spring Boot.

That’s a roundup of the ecosystem. Now let’s dive into the new feature improvements in SCDF v2.0.

 

Multi-platform Stream/Task Orchestration

Spring Cloud Data Flow supports the design, choreography, deployment, and monitoring of data pipelines composed of Spring Cloud Stream or Spring Cloud Task microservices. In v2.0, however, we are opening up the capability to deploy streaming and batch applications to multiple platform backends.

What is the business value?

For Cloud Foundry users who are exploring options to run stateful stream-processing pipelines on Kubernetes with systems like Apache Kafka, or to run compute-intensive batch jobs there, this new capability in SCDF v2.0 provides an easier getting-started experience.

Let’s walk through a use-case to illustrate the new feature.

Requirement: As an SCDF user, I’d like to create and deploy a streaming data pipeline with Apache Kafka as the message broker running on Kubernetes. Since I’m a Cloud Foundry user and stream processing performance is key, I want to orchestrate a deployment model in which the message broker and the streaming applications are collocated on the same hardware and network infrastructure.

Solution: In v2.0, it is possible to set up multiple deployment platform backends in a single instance of SCDF. Here, that means supplying the Kubernetes cluster information (i.e., the master URL, security credentials, and the namespace) as configuration to an SCDF instance running on Cloud Foundry. When deploying a stream, the user can then select the configured Kubernetes platform from the dropdown, as shown in the screenshot below. Upon successful deployment, the streaming applications and the Kafka topics sit closer to one another, and the stream-processing I/O latency overhead is reduced.
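As a rough sketch of such a setup (the account name, namespace, and values are all placeholders, and the property prefix follows the Skipper server’s platform-account configuration, so treat the details as an assumption to verify against the reference documentation), the Kubernetes backend could be registered like this:

```yaml
# Hypothetical Skipper server configuration registering a Kubernetes
# platform account named "kafka-k8s"; all values are placeholders.
# Cluster credentials (master URL, tokens) are supplied through the
# standard Kubernetes client configuration of the environment.
spring:
  cloud:
    skipper:
      server:
        platform:
          kubernetes:
            accounts:
              kafka-k8s:
                namespace: streaming   # namespace the stream apps deploy into
                limits:
                  memory: 1024Mi       # per-app container memory hint
```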

Related to the previous use-case, imagine there’s a downstream need to train a neural network on the incoming data. Since GPU-based computing has become an obvious choice to dramatically speed up neural-network training, a deployment pattern could be set up in SCDF where the real-time streaming pipeline continues to run close to Kafka, while a batch job trains the neural-network predictive model against GPU resources. Once the model is trained and ready, the streaming apps can react to the newly trained model automatically. This type of continuous learning and feedback loop is often described as “closed-loop analytics”.

 

Real-time Data Analytics

A growing number of enterprise customers are moving towards an event-driven and streaming architecture.

Doing real-time analytics on streaming data can be challenging and time-consuming to develop and maintain. We have run some experiments to come up with options that make it easier for you. A few new applications join the collection as part of the latest Einstein.GA release train: `counter-processor` and `counter-sink`. With these applications, you can compute anything from a relatively simple calculation to a complex state aggregation to investigate and derive business outcomes in real time. Further, you can use the application in both the processor and sink positions in the GUI/DSL, opening up interesting new opportunities for stream-aggregation use-cases, as sketched below.
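As a hypothetical illustration from the SCDF shell (the stream name is made up, and the exact app names and properties should be checked against the Einstein release notes), a counter could sit in the processor position of a stream:

```
stream create --name order-analytics --definition "http | counter | log"
stream deploy --name order-analytics
```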

A screencast from Christian Tzolov walks through the new analytics improvements.

 

Monitoring Data Pipelines

Metrics and monitoring come up as a critical requirement for every enterprise customer.

Given SCDF’s centralized orchestration abilities, there’s a misconception that SCDF and the streaming applications in the data pipelines communicate directly; they don’t. The applications run independently and connect to one another through a message broker. SCDF fetches the status of the individual applications to reconstitute the overall health of the streaming data pipeline, and this happens on demand when the user requests the current status from the Shell or Dashboard.

For monitoring in particular, v2.0 relies on Micrometer as the gateway to a variety of APM monitoring systems. Building on the Micrometer foundation, it is now possible to deploy data pipelines from SCDF in which the individual applications either push metrics to the monitoring system or let APM tools like Prometheus scrape metrics from the apps autonomously. Since the monitoring systems persist the metrics/alerts and can be queried against that persistence, it is now possible to interact with and drill into application statistics at runtime without impacting application behavior.
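For example, with Prometheus in scrape mode, the per-app metrics endpoint can be switched on at deploy time through standard Spring Boot actuator properties. This is only a sketch: it assumes the Micrometer Prometheus registry is on each app’s classpath, and it uses SCDF’s `app.*.` prefix to target every application in the stream.

```properties
# Deployment-time application properties (sketch): expose a Prometheus
# scrape endpoint on every application in the stream.
app.*.management.metrics.export.prometheus.enabled=true
app.*.management.endpoints.web.exposure.include=prometheus
```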

Given the popularity of Prometheus and InfluxDB, we have built a few handy Grafana dashboards. The SCDF v2.0 Dashboard includes native integration to launch the stream- or application-specific Grafana dashboards for Prometheus and InfluxDB. These two monitoring options are just a showcase; for the other Micrometer backends (e.g., Datadog), users can experiment with the tooling that best fits their business needs. In a future release, we plan to extend this to task and batch data pipelines for a consistent monitoring experience.

Screencast: a walkthrough of data pipeline monitoring in SCDF using InfluxDB.

 

New Security Design

In a microservices architecture, token-based authentication and authorization with OAuth2 and OpenID Connect is the standard practice. In v2.0, we took the opportunity to redesign the security infrastructure to extend these best practices to the microservices composed in the data pipelines.

Both the internal and the user-facing client tools, including the Shell, the RESTful endpoints, the DataFlowTemplate, the Composed Task Runner, Skipper, and the Dashboard, can all be set up for single sign-on with UAA as the backend.

Here are some of the benefits.

  1. Token-based authentication and authorization as the standard. This brings the flexibility to centrally manage users and their roles, including the ability to set up auto-renewal, revocation, or expiration attributes for tokens.

  2. If the requirement is a simple username/password setup, that’s possible through OAuth’s password grant type; see the sketch after this list.

  3. For advanced use-cases with LDAP, a chained authentication model is available in UAA, which authenticates users against UAA first and then against LDAP.

  4. For PAS and PKS users, all of this works natively, since both platforms ship UAA by default as an internal component; there’s nothing extra to set up for SCDF.

  5. New granular roles are available. You can govern user-facing operations such as create, delete, update, schedule, view, and manage more explicitly. These roles can be mapped to OAuth2 scopes or to Active Directory groups in LDAP.
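As an illustration of the password grant mentioned in item 2, here is a hedged sketch of requesting a token from UAA’s `/oauth/token` endpoint; the client ID, secret, user credentials, and URL are all hypothetical placeholders for your own environment.

```
# Request an access token from UAA using the OAuth2 password grant.
# "dataflow-client", its secret, the user credentials, and the UAA
# hostname are placeholders.
curl -u dataflow-client:dataflow-secret \
     -d "grant_type=password" \
     -d "username=alice" \
     -d "password=secret" \
     https://uaa.example.com/oauth/token
```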

Check out the screencast from Gunnar Hillert to review the security improvements.

 

Deeper Integration with Pivotal Cloud Foundry and Kubernetes

The commercial version of SCDF for Pivotal Cloud Foundry will build upon the v2.0 GA release in the coming weeks. Likewise, the SCDF Helm chart for Kubernetes will be updated to the latest release shortly. Stay tuned.

 

Join the Community!

We can’t wait for you to try out Spring Cloud Data Flow 2.0. If you want to ask questions or give us feedback, please reach out to us on Gitter, StackOverflow, or GitHub.

About the Author

Sabby Anandan

Sabby Anandan is a Product Manager on the Spring Team at VMware. He focuses on building products that address the challenges faced with iterative development and operationalization of data-intensive applications at scale. Before joining VMware, Sabby worked in engineering and management consulting positions. He holds a Bachelor’s degree in Electrical and Electronics from the University of Madras and a Master’s in Information Technology and Management from Carnegie Mellon University.
