HOWTO: Spring XD for Real Time Analytics With Twitter Example Code

May 14, 2014 David Turanski

Spring XD is a unified, distributed, and extensible system for data ingestion, real time analytics, batch processing, and data export. The project’s goal is to simplify the development of big data applications. This post introduces some of Spring XD’s basic concepts by walking through one of our most popular demonstration applications, the Analytics Dashboard.

The Analytics Dashboard demonstrates some of Spring XD’s out-of-the-box features in the area of real time analytics. Spring XD provides support for the real-time evaluation of various machine learning scoring algorithms as well as simple real-time data analytics using various types of counters and gauges. This demonstration illustrates the use of some of Spring XD’s counters.

Streams and Taps

Spring XD allows you to build streams declaratively using a familiar pipes and filters syntax based on the UNIX model. The canonical example is ingesting data from a Twitter stream and storing it in HDFS for later analysis. In Spring XD, this stream is defined simply as

twitterstream | hdfs

A stream consists of a source and a sink and, optionally, intermediate processing steps. In the example above, the source is twitterstream and the sink is hdfs. Each of these is generically referred to as a module. Spring XD provides these and a number of additional modules commonly used for constructing streams. This means you can build and deploy a stream, which is actually a distributed application, often without any additional coding. The pipe itself, represented by “|”, is typically backed by a distributed transport. Currently Spring XD supports RabbitMQ and Redis for remote data transport. Hence, each module actually runs in a different process, communicating over the network via messaging middleware. Spring XD also provides a single node runtime, which runs everything in a single process and uses “local” transport (direct memory access) by default. The single node runtime makes Spring XD very simple to set up for demonstrations, proofs of concept, and testing.
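
To make the source, processor, and sink roles concrete, here is a minimal in-process analogy using Python generators. It is only a sketch of the pipes and filters idea, not how Spring XD implements modules or transports; the function names (source, only_lang, collect_sink) and the sample messages are hypothetical.

```python
import json
from typing import Iterable, Iterator, List


def source(lines: Iterable[str]) -> Iterator[dict]:
    """Source: emits one parsed JSON message per input line."""
    for line in lines:
        yield json.loads(line)


def only_lang(messages: Iterable[dict], lang: str) -> Iterator[dict]:
    """Optional processor: passes through messages in one language."""
    return (m for m in messages if m.get("lang") == lang)


def collect_sink(messages: Iterable[dict], store: List[dict]) -> None:
    """Sink: consumes messages (appends to a list, standing in for HDFS)."""
    for message in messages:
        store.append(message)


# Hypothetical sample input standing in for a live Twitter feed.
raw = ['{"lang": "en", "text": "hello"}', '{"lang": "es", "text": "hola"}']
stored: List[dict] = []

# Roughly "source | only_lang | collect_sink", composed as function calls.
collect_sink(only_lang(source(raw), "en"), stored)
print(stored)
```

In this analogy each stage only sees an iterator of messages, which is what lets stages be recombined freely; in Spring XD the hand-off between modules goes through the configured transport instead.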

The above stream, once created and deployed to the Spring XD runtime, will ingest incoming tweets and store them in HDFS. Note: this assumes that you have already installed HDFS. Spring XD supports all major Hadoop distributions, including Pivotal HD.

It is also possible to create a tap on any stream. A tap works like an ordinary stream except that it uses an existing stream as its source. A tap is an example of the Wire Tap pattern described in Enterprise Integration Patterns, on which Spring Integration is based. Taps are extremely useful for real time analytics, as we shall see.
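
The wire tap idea can be sketched in a few lines of Python. This is an illustration of the pattern only, under the assumption that a tap receives its own copy of every message the primary stream delivers; the Stream class and its methods are hypothetical and are not Spring XD or Spring Integration APIs.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class Stream:
    """Toy stream: delivers each message to its sink and to every tap."""

    def __init__(self, sink: Handler) -> None:
        self.sink = sink
        self.taps: List[Handler] = []

    def tap(self, handler: Handler) -> None:
        """Attach a secondary consumer without changing the main flow."""
        self.taps.append(handler)

    def send(self, message: dict) -> None:
        self.sink(message)
        for handler in self.taps:
            handler(dict(message))  # each tap gets its own shallow copy


logged: List[dict] = []
counts: Dict[str, int] = {"n": 0}


def count(_message: dict) -> None:
    """Tap consumer: a simple message counter."""
    counts["n"] += 1


tweets = Stream(sink=logged.append)  # primary stream with a log-like sink
tweets.tap(count)                    # analytics tap on the same stream
for text in ("a", "b", "c"):
    tweets.send({"text": text})

print(len(logged), counts["n"])
```

The primary sink is unaware of the tap, which is the property that makes taps convenient for adding analytics to a stream that already exists.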

Counters and Gauges

Spring XD provides a number of modules, including counters and gauges, that produce stream metrics and store them in Spring XD’s analytics repository. Spring XD provides both Redis and in-memory repository implementations; the latter is useful for demonstration and testing in conjunction with the single node runtime. The counters and gauges included in the Spring XD distribution are:

* Counter – a simple count of messages flowing through a stream
* Field Value Counter – a count of occurrences of unique values for a specific field in a POJO or JSON payload
* Aggregate Counter – keeps a total count, but also retains the total count values for each minute, hour, day, and month of the period for which it is run. May be queried for a given date range and resolution.
* Gauge – similar to a counter, holds a single long value which is application defined and bound to a unique name
* Rich Gauge – an application defined double value that also keeps a running average, along with the minimum and maximum values and the sample count.
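
As a rough illustration of what two of these metrics track, the following Python sketch models an aggregate counter and a rich gauge in memory. It is not Spring XD's implementation: the class names mirror the list above, but the method names, the choice of only minute and hour buckets, and the use of a plain arithmetic mean for the running average are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


class AggregateCounter:
    """Keeps a total plus per-minute and per-hour buckets (assumed layout)."""

    def __init__(self) -> None:
        self.total = 0
        self.by_minute: Counter = Counter()
        self.by_hour: Counter = Counter()

    def increment(self, when: datetime, amount: int = 1) -> None:
        self.total += amount
        # Truncate the timestamp to find the bucket each event falls into.
        self.by_minute[when.replace(second=0, microsecond=0)] += amount
        self.by_hour[when.replace(minute=0, second=0, microsecond=0)] += amount


@dataclass
class RichGauge:
    """Latest value plus running average, min, max, and sample count."""

    value: float = 0.0
    count: int = 0
    average: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def set(self, value: float) -> None:
        self.count += 1
        self.value = value
        # Incremental arithmetic mean; avoids storing every sample.
        self.average += (value - self.average) / self.count
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)


agg = AggregateCounter()
for ts in (
    datetime(2014, 5, 14, 10, 0, 5),
    datetime(2014, 5, 14, 10, 0, 40),
    datetime(2014, 5, 14, 10, 1, 2),
):
    agg.increment(ts)

gauge = RichGauge()
for sample in (2.0, 4.0, 9.0):
    gauge.set(sample)

print(agg.total, agg.by_minute[datetime(2014, 5, 14, 10, 0)])
print(gauge.value, gauge.average, gauge.minimum, gauge.maximum)
```

Bucketing by truncated timestamp is what allows a query for a date range at a chosen resolution without replaying the individual events.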

The Analytics Dashboard

This sample application demonstrates some of Spring XD’s capabilities described above. We create a primary stream to ingest data from Twitter and then create a few taps on the primary stream to demonstrate the aggregate counter and the field value counter. Spring XD provides all of this out-of-the-box, requiring no coding. The dashboard itself is a separate application, written in JavaScript and HTML, that accesses Spring XD metrics via REST endpoints exposed by the Spring XD Admin process. In the single node runtime, the Admin is embedded along with its HTTP server, exposing REST endpoints. The dashboard runs in a browser, backed by its own HTTP server that hosts the static pages, which pull the counter and gauge values from Spring XD. The data is updated in real time and displayed using charts provided by d3.js.

The Spring XD Shell

The Spring XD distribution includes a CLI application called the XD Shell, used to execute commands and queries to Spring XD. For example, the shell is used to create and deploy streams. The shell is also a REST client to the Spring XD Admin. The shell commands used to create and deploy the streams needed for the Analytics Dashboard are:

xd:> stream create tweets --definition "twitterstream | log"

xd:> stream create tweetlang --definition "tap:stream:tweets > field-value-counter --fieldName=lang" --deploy

xd:> stream create tweetcount --definition "tap:stream:tweets > aggregate-counter" --deploy

xd:> stream create tagcount --definition "tap:stream:tweets > field-value-counter --fieldName=entities.hashtags.text --name=hashtags" --deploy

xd:> stream deploy tweets

The stream create command is followed by the name of the stream, which must be unique within Spring XD, and the definition, which describes the stream using Spring XD’s DSL, based on the UNIX pipes and filters syntax. For the demo, we can keep things as simple as possible by having the primary stream simply dump its output to the console log rather than HDFS. For this we use the built-in log sink in place of hdfs. Upon deploying the stream, you should see tweets rendered as JSON in the terminal session running the Spring XD single node application.

Stream Deployment

Note that create and deploy are separate commands; create simply validates and saves the stream definition, ensuring that all the referenced modules exist and that the module options (e.g., --fieldName) are valid. In general, deployment to a Spring XD cluster consisting of multiple Container nodes may require further deployment specifications, which are a separate concern from the stream definition itself. For example, Spring XD provides the ability to horizontally scale individual modules by specifying a number of instances. In addition, various strategies for targeting an individual Container instance or a group of Containers are supported. As the examples above show, the create command accepts a --deploy option, combining these steps for convenience. Here we want to defer deploying the primary stream until all its taps are in place; otherwise we would lose any sample data processed before the metrics are active.
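
The reason for deploying the taps first can be shown with a small simulation. This is a toy model only, assuming that a tap sees nothing that flowed before it was attached; PrimaryStream and run are hypothetical names, not part of Spring XD.

```python
from typing import Callable, List


class PrimaryStream:
    """Toy stand-in for a deployed stream that copies messages to its taps."""

    def __init__(self) -> None:
        self.taps: List[Callable[[str], None]] = []

    def send(self, message: str) -> None:
        for tap in self.taps:
            tap(message)


def run(tap_first: bool) -> int:
    """Send five messages and return how many the tap observed."""
    seen: List[str] = []
    stream = PrimaryStream()
    if tap_first:
        stream.taps.append(seen.append)      # tap in place before data flows
    for i in range(5):
        if not tap_first and i == 2:
            stream.taps.append(seen.append)  # tap attached after two messages
        stream.send(f"tweet-{i}")
    return len(seen)


print(run(tap_first=True))   # prints 5
print(run(tap_first=False))  # prints 3
```

With the tap attached late, the first two messages are never counted, which mirrors the sample data that would be lost if the tweets stream were deployed before its taps.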

The demo uses two different field value counters and one aggregate counter. Let’s look at one field value counter in detail:

xd:> stream create tweetlang --definition "tap:stream:tweets > field-value-counter --fieldName=lang" --deploy

Here we are tapping the primary stream named tweets. The tapped stream must exist, meaning it has already been created. The field-value-counter is a built-in module that acts as a sink for the tap. The tweetlang stream consumes a copy of the messages originating from the tweets source, twitterstream in this case. The source emits tweets as JSON. The JSON is evaluated and the value of the top level lang field is extracted in order to count the occurrences of each language represented in the Twitter stream. Compare this to the tagcount definition above. The tagcount tap counts the occurrences of individual hashtags, a common way to monitor what is trending on Twitter. The only difference is the value of the fieldName parameter. This illustrates how the same module may be configured for use in different streams. Also notice that any nested JSON node may be evaluated. Additionally, each tweet may contain multiple hashtags, so there is some projection magic going on to evaluate the field expression entities.hashtags.text, courtesy of the Spring Expression Language (SpEL).
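
To show what counting by lang and by entities.hashtags.text amounts to, here is a plain Python approximation. It does not use SpEL and is not how the field-value-counter module is implemented; the helper names (field_values, count_field) and the sample tweets are hypothetical, and the dotted-path handling assumes that a list anywhere along the path fans out over its elements.

```python
import json
from collections import Counter
from typing import Any, Iterable, Iterator


def field_values(node: Any, path: str) -> Iterator[Any]:
    """Yield every value reached by a dotted path, fanning out over lists."""
    if isinstance(node, list):
        # Projection: apply the remaining path to each element.
        for item in node:
            yield from field_values(item, path)
    elif not path:
        yield node
    elif isinstance(node, dict):
        head, _, rest = path.partition(".")
        if head in node:
            yield from field_values(node[head], rest)


def count_field(messages: Iterable[str], path: str) -> Counter:
    """Count occurrences of each distinct value found at the given path."""
    counts: Counter = Counter()
    for raw in messages:
        counts.update(field_values(json.loads(raw), path))
    return counts


# Hypothetical, heavily trimmed tweets; real payloads carry many more fields.
tweets = [
    '{"lang": "en", "entities": {"hashtags": [{"text": "spring"}, {"text": "bigdata"}]}}',
    '{"lang": "en", "entities": {"hashtags": [{"text": "spring"}]}}',
    '{"lang": "es", "entities": {"hashtags": []}}',
]

lang_counts = count_field(tweets, "lang")
tag_counts = count_field(tweets, "entities.hashtags.text")
print(lang_counts.most_common())
print(tag_counts.most_common())
```

The same counting function serves both cases and only the path changes, which parallels how tweetlang and tagcount differ only in their fieldName value.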

Running the Demo

Running the demo requires the following steps:

* Download and install the latest Spring XD Distribution from the link provided on the project page
* Clone the spring-xd-samples repository on GitHub
* Follow the instructions included with the analytics dashboard project
* Note: the twitterstream source requires Twitter API credentials, which you must obtain from Twitter

SpringOne 2GX 2014 is Around the Corner!

Book your place soon at SpringOne 2GX in Dallas, Texas, Sept 8-11. It’s simply the best opportunity to find out first hand all that’s going on and to provide direct feedback. Expect a number of significant new announcements this year. We anticipate that several in-depth Spring XD sessions will be presented.

About the Author

David Turanski is a Spring Advisory Architect at Pivotal. He has been a core committer on Spring Integration, Spring Cloud Data Flow, Spring Cloud Stream, Spring Cloud Task, and Spring Data GemFire. Prior to joining the Spring team in 2010, David held various positions as consultant, enterprise architect, and software engineer, delivering distributed mission critical systems in various industries including transportation and logistics, aerospace, finance, pharmaceutical, manufacturing, and healthcare. Currently, David works directly with customers to build solutions with Spring and Pivotal Cloud Foundry, while continuing to contribute to Spring products.