Announcing Spring Cloud Data Flow 1.2: Unified Service for Event-driven, Streaming, and Batch Data Pipelines

May 16, 2017 Sabby Anandan

Driverless cars. Personalized digital assistants. Smart devices. Virtual reality. What's the common denominator of all of these?

They all start with Data.

Spring Cloud Data Flow and the related ecosystem of projects tackle the toughest problems of data collection, curation, and aggregation. Enterprise developers can focus on ways in which data can be put to use to deliver a better customer experience, without being burdened by the tedious task of solving these problems over and over again.

Spring Cloud Data Flow 1.2.0.RELEASE is now generally available. Here's how the ecosystem of projects further simplifies the data journey within the enterprise.

Spring Cloud Stream's primary focus is to enable developers to build event-driven streaming applications quickly.

Spring Cloud Task aims to solve the assembly challenges around executing finite workloads in a production environment.

Spring Cloud Data Flow orchestrates the composition of Spring Cloud Stream and Spring Cloud Task microservices into coherent data pipelines. For the curious, the table below goes into the characteristics of streaming and task/batch workloads.

 

| | Streaming: Long-running Pipelines | Task/Batch: Short-lived Pipelines |
| --- | --- | --- |
| Execution Model | Long-lived: "always running" and reacting to real-time events | On-demand execution and efficient resource utilization |
| Data Processing Model | Continuous streams of data-processing events | Scheduled or on-demand data processing |
| Use-case Applicability | High-throughput and low-latency SLA expectations; immediate insights and real-time feedback | Large finite datasets with resilient error handling; ad-hoc data processing with downstream responsibilities upon completion |

Let's review the top-level projects in the ecosystem, their evolution, and their newly added feature capabilities.

Stream Processing

Developers who embrace event-driven architectures, whether to replace legacy request/reply systems or to create high-throughput, low-latency streaming applications, don't have to repeat the boilerplate of message-broker discovery, content negotiation, and type conversion.

Spring Cloud Stream's lightweight binding libraries can do that for you for several messaging platforms, and the number is growing!

Dynamic Event Handlers

Ready to dive into the world of event-driven architectures? In this recent release, Spring Cloud Stream introduces support for message dispatching to multiple downstream event-handlers. Having this natively supported in the programming model provides the ability to construct event-driven microservice applications with dynamic routing and segregation. This is handy for event-sourcing and CQRS use-cases. A recent blog by David Turanski on this topic walks through the goals and the overall benefits.

@EnableBinding(Sink.class)
@EnableAutoConfiguration
public class AccountSink {

    @StreamListener(target = Sink.INPUT, condition = "headers['eventType']=='create'")
    public void handleAccountCreateEvent(@Payload AccountEvent accountEvent) {
        // handle account creates
    }

    @StreamListener(target = Sink.INPUT, condition = "headers['eventType']=='delete'")
    public void handleAccountDeleteEvent(@Payload AccountEvent accountEvent) {
        // handle account deletes
    }
}

(Fig: An example of dynamic routing based on `eventType` embedded in the header)

Reactive and Nonreactive Event Processors

With the generalization of reactive and nonreactive primitives, we now have a consistent approach to how streaming applications can be built: they can interact either directly with core message channels or with asynchronous sequences such as RxJava Observables or Project Reactor Flux event streams.

@StreamListener(Processor.INPUT)
@SendTo(Processor.OUTPUT)
public Flux<WordCount> count(Flux<String> flux) {
  return flux.window(ofSeconds(5), ofSeconds(1))
    .flatMap(window ->
      window.groupBy(word -> word)
        .flatMap(group -> group.reduce(0, (count,word) -> count + 1)
          .map(count -> new WordCount(group.key(), count))));
}

(Fig: A word count example using Spring Cloud Stream and Project Reactor Flux)

Binders

As we started adding new binder implementations (e.g., JMS, Google PubSub), we came to realize that the mechanics behind the binder lifecycle and bootstrapping were repetitive. One of the goals for this release was to simplify and reduce redundancy at this layer. The push to separate the concerns of infrastructure from the core messaging constructs is now complete.

With the introduction of the Service Provider Interface binder “provisioning” abstraction, we have a cleaner, more consistent model on how the binder implementation can be discovered, provisioned, and prepared for the messaging components to come together at runtime.

In response to feedback from the community and customers, the RabbitMQ and Apache Kafka binder implementations went through significant stability improvements, new enhancements, and bug fixes.

Evolving beyond Spring Integration-based binder implementations, Kafka Streams support is now in incubation. Watch the following screencast, in which Spring Cloud Stream's project lead, Marius Bogoevici, demonstrates this integration.

For more details, please review Spring Cloud Stream’s Chelsea.RELEASE blog.

Operationalizing Finite Workloads

Though demand for high-throughput, low-latency streaming continues to grow, there's still a critical need in the enterprise for scheduled or offline batch data processing.

This is where Spring Cloud Task, a framework for short-lived microservices, shines.
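As a minimal sketch (class and bean names here are illustrative, assuming the standard `@EnableTask` bootstrap), a Spring Cloud Task application is just a Spring Boot app that runs a finite workload and records its start, end, and exit status in the task repository:

```java
@SpringBootApplication
@EnableTask
public class SimpleTaskApplication {

    // Executes once; the task lifecycle (start/end/exit code) is
    // captured automatically in the task repository, then the app exits.
    @Bean
    public CommandLineRunner run() {
        return args -> System.out.println("Finite workload executed");
    }

    public static void main(String[] args) {
        SpringApplication.run(SimpleTaskApplication.class, args);
    }
}
```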

Parent-child Tasks

A sophisticated data processing use case may require scaling tasks by launching additional tasks dynamically. Spring Batch has always supported this form of scaling natively via both remote partitioning and remote chunking. Building upon that foundation, Spring Cloud Task supplemented the story with remote-partitioned batch jobs, launching the steps of a batch job as individual "worker tasks" for cloud-scale elasticity. Now the framework itself provides parent-child associations, which are useful for both batch-job and raw-task workloads. The benefits include the ability to track overall topology progress as well as improved serviceability.

External Execution Tracking

Tasks are executed on a platform like Cloud Foundry or Kubernetes, each of which has its own native methods of log management and container orchestration. This new feature makes it possible to map the execution of a Spring Cloud Task application to the platform’s runtime execution.
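For example (the value shown is a hypothetical placeholder), a launcher can pass the platform's own execution identifier so that a task execution can be correlated with, say, the platform's task GUID:

```properties
# Associates this task execution with an externally generated id,
# e.g. the Cloud Foundry task GUID; the value below is illustrative.
spring.cloud.task.external-execution-id=cf-task-guid-12345
```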

For more details, please review Spring Cloud Task’s 1.2.0.RELEASE blog.

Production Grade Orchestration

Here are the highlights of Spring Cloud Data Flow's new feature capabilities built on top of Spring Cloud Stream and Spring Cloud Task.

Composed Tasks

Batch-job pipelines typically include multiple steps for data ingestion, transformation, filtering, and persistence. Composed tasks are helpful for such use cases. The chain of operations may include sequential execution, parallel execution, conditional transitions, or a combination of the above.

Spring Cloud Data Flow provides an easy-to-use Composed Task DSL and a graphical interface for defining and reliably orchestrating the entire graph. Error handling, stopping, and restarting of steps are all part of the orchestration semantics.
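For illustration (the task names below are hypothetical), the DSL composes previously registered task definitions: `&&` chains tasks sequentially, `<... || ...>` runs them in parallel, and exit-status transitions express conditional flows:

```
dataflow:> task create ingest-and-load --definition "ingest && <transform || enrich> && load"
dataflow:> task create with-recovery --definition "ingest 'FAILED' -> recover 'COMPLETED' -> load"
```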

Andy Clement demonstrates the graphical interface experience in the following screencast.

Metrics and Monitoring

“If you can't measure it, you can't improve it” - Peter Drucker.

Defining, building, and orchestrating data pipelines in Spring Cloud Data Flow now gets a companion feature: an operations view! Real-time streaming metrics are natively embedded in the dashboard for monitoring and alerting, and we plan to significantly expand these capabilities in upcoming releases. The following screencast demonstrates this feature capability.

Docker as the First-class Citizen

While Maven provides robust application lifecycle management for streaming and task/batch applications, we now also support the ability to orchestrate data pipelines made of Docker images in Cloud Foundry.

We also continue to support Docker artifacts in Kubernetes and Apache Mesos.
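As a sketch (the image coordinates below are illustrative), an app packaged as a Docker image is registered from the Data Flow shell just like a Maven artifact, using the `docker:` URI scheme:

```
dataflow:> app register --name http --type source --uri docker:springcloudstream/http-source-rabbit:latest
```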

Companion Artifact

We are introducing an optional companion artifact that describes the main properties your stream or task app supports, and is usable from the Shell or the Dashboard.

Docker and Maven resolution strategies in Spring Cloud Data Flow both benefit from this optimization.
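For example (all coordinates here are illustrative), the companion metadata artifact can be supplied at registration time so the Shell and Dashboard can display the app's supported properties without resolving the full image or jar:

```
dataflow:> app register --name http --type source \
    --uri docker:springcloudstream/http-source-rabbit:latest \
    --metadata-uri maven://org.springframework.cloud.stream.app:http-source-rabbit:jar:metadata:1.2.0.RELEASE
```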

Security

As we continue to improve the security features, we have added support for role-based access controls. Users can now be assigned granular access privileges, preventing unrestricted system access.

We have made significant improvements to OAuth authentication. Users directly invoking Spring Cloud Data Flow's RESTful APIs can now retrieve an OAuth2 access token from their preferred OAuth2 provider and supply it in the HTTP header, instead of providing a `username:password` combination via basic authentication.
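As a hedged example (the token is a placeholder; the host assumes a local Data Flow server on its default port), the access token is passed as a standard bearer token:

```
$ curl -H "Authorization: Bearer <access-token>" http://localhost:9393/streams/definitions
```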

Applications

Spring Cloud Stream App Starters and Spring Cloud Task App Starters have evolved since the last release.

Along with version upgrades, enhancements, and bug fixes, the latest Bacon release train of Spring Cloud Stream App Starters includes the newly added pgcopy-sink, mongodb-source, aggregator-processor, and header-enricher-processor applications.

Likewise, the latest Belmont release train of Spring Cloud Task App Starters includes a new composed-task-runner application to supplement the larger composed-task feature discussed earlier.

Christian Tzolov contributed a tensorflow-processor, which can be used in a streaming pipeline to compute real-time predictions such as Twitter sentiment analysis or image pattern recognition.

(Fig: Twitter Sentiment Analysis using TensorFlow Processor)

What's next?

The Spring Cloud Data Flow team is determined to improve the tools we provide for building and operating production-grade data pipelines.

Our mission has always been to enable developers to focus on what matters most: be it the developer experience, programming model, testability, or serviceability.

Areas under consideration for the next major release include continuous integration, continuous deployment, usability, and the composition of pipelines using functions.

About the Author

Sabby Anandan

Sabby Anandan, Product Manager at Pivotal. Sabby is focused on building products that eliminate the barriers between application development, cloud, and big data. Sabby has a Bachelor's in Electrical and Electronics Engineering from the University of Madras and a Master's in Information Technology and Management from Carnegie Mellon University.
