Spring Cloud Data Flow is out of incubation and generally available as of today.
Application developers building web or mobile applications tend to follow practices that align with modern software development patterns. Data-heavy personas such as data engineers and data scientists, by contrast, often work with more traditional methods because they lack the tools and techniques that enable continuous delivery. Spring Cloud Data Flow bridges this gap by providing a development and operational model that reimagines data-centric use cases as compositions of loosely coupled microservices, each free to evolve in isolation. That paves the way for continuous delivery, a key step in the enterprise data transformation journey.
Backstory: Monolith-first to Microservices
Spring XD had a successful tenure, with good community and customer traction. When we looked at the data landscape in most enterprises, we saw that most big data solutions were really data integration issues. Spring XD set out to address those issues in a way that was familiar to enterprise developers. By building on the well-tested foundation of the Spring portfolio, Spring XD offered a familiar way to develop applications that could handle massive scale robustly.
While Spring XD provided an easy-to-use experience and boosted developer productivity, it was limited in its ability to embrace modern software engineering methods such as Test Driven Development (TDD) and Continuous Delivery (CD). Likewise, Spring XD’s monolithic architecture began to show early signs of anti-patterns, and its growing list of tightly coupled dependencies delayed time to market.
In short, Spring XD needed to be reimagined to accelerate development for data workers and to be cloud native.
The genesis of what was to become Spring Cloud Data Flow dates to early March 2015, when Mark Fisher and Dave Syer met in Boston for Spring Cloud engineering meetings. What started as a sideline discussion grew, after a few brainstorming sessions, into new ideas. They began exploring “messaging microservice applications” and quickly produced an early prototype of Spring Boot applications communicating via RabbitMQ.
Although it is easy to run these Spring Boot applications independently, there was still a need for a higher-level orchestration layer that composes many of them into a coherent streaming pipeline, while the applications continue to run as loosely coupled, independent components that communicate through a binding to messaging middleware such as Apache Kafka or RabbitMQ.
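To make the idea of independent applications connected only by a binding more concrete, here is a minimal plain-Java sketch. It is not how Spring Cloud Stream is implemented: in-memory queues stand in for the messaging middleware, threads stand in for separately deployed applications, and every name in it (`PipelineSketch`, `processor`, `END`) is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Illustrative only: in-memory queues stand in for a RabbitMQ or Kafka binding.
class PipelineSketch {

    // Hypothetical end-of-stream marker so this finite demo can terminate.
    static final String END = "__END__";

    // A "processor" application: it knows only its input and output channels.
    static Thread processor(BlockingQueue<String> in, BlockingQueue<String> out,
                            Function<String, String> logic) {
        return new Thread(() -> {
            try {
                for (String msg = in.take(); !msg.equals(END); msg = in.take()) {
                    out.put(logic.apply(msg));
                }
                out.put(END); // propagate the marker downstream
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    // Wires source -> processor -> sink and returns what the sink received.
    static List<String> run(List<String> payloads) {
        BlockingQueue<String> sourceToProcessor = new LinkedBlockingQueue<>();
        BlockingQueue<String> processorToSink = new LinkedBlockingQueue<>();
        List<String> received = new ArrayList<>();

        Thread transform = processor(sourceToProcessor, processorToSink, String::toUpperCase);
        Thread sink = new Thread(() -> {
            try {
                for (String msg = processorToSink.take(); !msg.equals(END);
                     msg = processorToSink.take()) {
                    received.add(msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        transform.start();
        sink.start();

        try {
            // The "source" application publishes to its output channel.
            for (String payload : payloads) {
                sourceToProcessor.put(payload);
            }
            sourceToProcessor.put(END);
            transform.join();
            sink.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("hello", "world"))); // prints [HELLO, WORLD]
    }
}
```

No stage holds a reference to another stage, only to a channel. That is the property that lets each application be built, tested, and replaced on its own.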
Extending this line of thought, Mark explored the idea of orchestrating Spring Boot applications as composite units of deployment running in Cloud Foundry’s Lattice (now deprecated). With Lattice as the runtime, Mark designed the initial version of an application deployer Service Provider Interface (SPI) to serve as the gateway for deploying Spring Boot applications to the Diego runtime.
With this fundamental shift in the core design, the architecture turned into a modular, loosely coupled set of components. The prototype was powerful enough to convince leads Mark Fisher and Mark Pollack that this was the way forward for data delivery in cloud native architectures.
Evolution of the Ecosystem
Unlike Spring XD, where the streaming and batch modules were bundled in the (monolith) distribution itself, this modern ecosystem of projects embraces the microservices approach with loosely coupled services. As a result, one can run Spring Cloud Stream and Spring Cloud Task applications as standalone artifacts with `java -jar`, `cf push`, or `docker run`, or use Spring Cloud Data Flow to orchestrate and deploy applications built with these frameworks to a variety of runtime platforms.
Let’s take a look at three of the projects that evolved while building the new Spring Cloud Data Flow.
Building Blocks of Spring Cloud Data Flow Ecosystem
Spring Cloud Stream
Building further on the “messaging microservice applications” journey, the very first project to spin out of this exercise was Spring Cloud Stream. Its core premise is that Spring Integration meets Spring Boot, and together they evolve into a lightweight event-driven microservices framework, so developers can quickly develop and productionize event-driven microservice applications that connect to external systems. Incepting, developing, testing, and operationalizing a Spring Cloud Stream application is no different from doing so for any other Spring Boot application. The developer experience remains consistent throughout the lifecycle, enabling consistent development practices across application and data teams in the enterprise. For convenience, we have created a few sample use cases.
Spring Cloud Data Flow
While we were developing strong constructs and vetting the programming model of Spring Cloud Stream, investment in the deployer SPI continued in parallel, with two concrete runtime platform implementations:
- Cloud Foundry. As the de facto platform for cloud native applications, Cloud Foundry was an immediate candidate for an SPI implementation. The adaptation was seamless because the proof of concept had originally been built on top of Lattice, and the two are conceptually similar.
- Apache YARN. Spring XD’s production installations were largely bare-metal, and most of them were tied to Apache Hadoop® workloads. Building an Apache YARN deployer implementation was therefore a natural next step, giving existing customers a migration path.
Soon enough, we received community contributions with newer deployer implementations for the Kubernetes and Apache Mesos runtime platforms. Note that at this point the project was still commonly referred to as Spring XD 2.0.
Shortly after that, the scale of the refactoring and the fundamental changes to the project design had moved so far past the previous incarnation that Spring XD was rebranded as Spring Cloud Data Flow. Today, this new project provides an orchestration service for composable data microservices on modern runtimes such as Cloud Foundry (including PCFDev), Apache YARN, Apache Mesos, and Kubernetes. Again, for convenience, we have created a few sample use cases.
Spring Cloud Task
Most data architectures require ETL/ELT pipelines. The same holds for machine learning use cases, where model training typically runs on offline data pipelines because it can take hours to days to complete. This is where Spring Batch comes to the rescue, with a programming model that delivers high performance through optimization and partitioning techniques. In the context of microservices, there is also a need for applications that perform a finite amount of data processing and then terminate to free up resources. In effect, these are short-lived applications. To fill this gap, the Spring Cloud Task project was founded. Its core premise is that Spring Batch meets Spring Boot, and together they evolve into a short-lived microservices framework, so developers can quickly develop and productionize task applications, including Spring Batch jobs. Spring Cloud Task provides patterns to develop a directed graph of “multi-step” workflows such as data ingestion, filtering, processing, notifications, and polyglot persistence. Again, here are a few sample use cases to illustrate these in action.
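The "process a finite amount of data, then terminate" shape can be sketched in plain Java. This is only an illustration of the short-lived pattern, not the Spring Cloud Task API. The `TaskSketch` class, its `TaskExecution` record of start time, end time, and exit code, and the `runTask` helper are all hypothetical names invented for this example.

```java
import java.time.Instant;
import java.util.List;
import java.util.function.Consumer;

// Illustrative only: a short-lived unit of work that runs once, records its outcome, and ends.
class TaskSketch {

    // Hypothetical record of a single execution of a task.
    static final class TaskExecution {
        public final String name;
        public final Instant startTime;
        public final Instant endTime;
        public final int exitCode;
        public final int itemsProcessed;

        TaskExecution(String name, Instant startTime, Instant endTime,
                      int exitCode, int itemsProcessed) {
            this.name = name;
            this.startTime = startTime;
            this.endTime = endTime;
            this.exitCode = exitCode;
            this.itemsProcessed = itemsProcessed;
        }
    }

    // Applies the step to every item of a finite input, then returns the execution record.
    static <T> TaskExecution runTask(String name, List<T> items, Consumer<T> step) {
        Instant start = Instant.now();
        int processed = 0;
        int exitCode = 0;
        try {
            for (T item : items) {
                step.accept(item);
                processed++;
            }
        } catch (RuntimeException e) {
            exitCode = 1; // a failed step ends the task with a non-zero exit code
        }
        return new TaskExecution(name, start, Instant.now(), exitCode, processed);
    }

    public static void main(String[] args) {
        TaskExecution execution =
                runTask("load-customers", List.of("a.csv", "b.csv"), file -> { /* ingest */ });
        System.out.println(execution.name + " exited with " + execution.exitCode
                + " after " + execution.itemsProcessed + " items");
    }
}
```

Unlike a streaming application, nothing here waits for more input. Once the list is exhausted, the only thing left is the execution record, and the process can release its resources.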
The New Architecture
Spring Cloud Data Flow simplifies the development and deployment of applications focused on data processing use cases. The major constructs of the architecture are Applications, the Data Flow Server, and the target runtime.
Reference Architecture: Consuming data from an `http` endpoint and writing to Cassandra
- Applications come in two flavors:
  - Spring Cloud Stream-based long-lived applications, where an unbounded amount of data is consumed or produced via messaging middleware (RabbitMQ or Apache Kafka)
  - Spring Cloud Task-based short-lived applications that process a finite set of data and then terminate
- The Data Flow server is a Spring Boot application that provides the orchestration mechanics and the foundational constructs to serve the DSL shell, Dashboard, Flo, and REST APIs
- The runtime is the place where applications execute. The Data Flow server delegates to the Spring Cloud Deployer SPI implementation to deploy the applications to each target runtime. The target runtimes for applications are platforms that you may already be using for other application deployments
Architecture: Spring Cloud Data Flow
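The division of labor described above, where the server interprets a definition and delegates each application to a runtime-specific deployer, can be sketched as follows. It assumes a pipe-delimited definition such as `http | cassandra` and handles nothing beyond splitting on the pipe. The `RuntimeDeployer` interface is a hypothetical stand-in and not the actual Spring Cloud Deployer SPI, and the deployment id format is made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: a toy "server" that delegates deployment to a pluggable runtime.
class DataFlowSketch {

    // Hypothetical stand-in for a deployer SPI; one implementation per target runtime.
    interface RuntimeDeployer {
        String deploy(String appName); // returns a deployment id
    }

    // Splits a pipe-delimited definition into application names.
    static List<String> parse(String definition) {
        return Arrays.stream(definition.split("\\|"))
                .map(String::trim)
                .filter(name -> !name.isEmpty())
                .collect(Collectors.toList());
    }

    // The server never deploys anything itself; it hands each app to the deployer.
    static List<String> deployStream(String streamName, String definition,
                                     RuntimeDeployer deployer) {
        List<String> deploymentIds = new ArrayList<>();
        for (String app : parse(definition)) {
            deploymentIds.add(deployer.deploy(streamName + "-" + app));
        }
        return deploymentIds;
    }

    public static void main(String[] args) {
        // Swapping the runtime means swapping this one object, not the server logic.
        RuntimeDeployer local = app -> "local:" + app;
        System.out.println(deployStream("ingest", "http | cassandra", local));
        // prints [local:ingest-http, local:ingest-cassandra]
    }
}
```

Because `deployStream` depends only on the interface, supporting another runtime in this sketch is a matter of supplying another `RuntimeDeployer`. This mirrors how new deployer implementations could be contributed without changes to the server.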
The Journey Ahead
The modular, microservice-based architecture provides several benefits. One of the most important is the ability to move faster and to evolve core components independently, in isolation. Each project in the ecosystem has its own backlog and release cadence, with the goal of releasing often and rapidly delivering new capabilities to our users.
Several new capabilities are planned for upcoming releases, including new binder implementations, newer out-of-the-box streaming and task applications, UI/Flo usability improvements, Canary Deployments, Netflix’s Spinnaker integration, and a Cloud Foundry tile in the marketplace to easily provision Spring Cloud Data Flow with all the necessary peripherals.
For a summary of feature capabilities and the download links, please review the engineering release blog.
About the Author
Sabby Anandan is a Product Manager on the Spring Team at Pivotal. He focuses on building products that address the challenges of iterative development and operationalization of data-intensive applications at scale. Before joining Pivotal, Sabby worked in engineering and management consulting positions. He holds a Bachelor's degree in Electrical and Electronics from the University of Madras and a Master's degree in Information Technology and Management from Carnegie Mellon University.