An Open Source Reference Architecture For Real-Time Stock Prediction

December 1, 2015 William Markito

There is a certain myth: that it is possible to apply AI and machine learning algorithms on a server farm, move to Hawaii, and let the machines trade all day while you enjoy life on the beach. The problem is that trading markets continually change; economic forces, new products, competition, world events, regulations, and even tweets are all factors. While there is no free lunch, companies can still get a “better, more healthy, and cheaper meal” with the help of open source machine learning algorithms and data analysis platforms. In the case of the stock market, it’s a common practice to check historical stock prices and try to predict the future using different models.

While this post does not cover the details of stock analysis, it does propose a way to solve the hard problem of real-time data analysis at scale, using open source tools in a highly scalable and extensible reference architecture. The architecture below is focused on financial trading, but it also applies to real-time use cases across virtually every industry. More information on the architecture covered in this article is also available online via The Linux Foundation, Slideshare, YouTube, and Pivotal Open Source Hub, where the components in this architecture can be downloaded.

The Architecture: Stock Prediction And Machine Learning

At the highest level, the stock prediction and machine learning architecture, as shown in the diagram below, supports an optimization process that is driven by predictive models, and there are three basic components. First, incoming, real-time trading data must be captured and stored, becoming historical data. Second, the system must be able to learn from historical trends in the data and recognize patterns and probabilities to inform decisions. Third, the system needs to do a real-time comparison of new, incoming trading data with the learned patterns and probabilities based on historical data. Then, it predicts an outcome and determines an action to take.
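
To make the three components concrete, here is a minimal Python sketch of the capture, learn, and predict loop. It is an illustration only, not the code from the reference architecture: the class name, the least-squares trend line, and the tolerance-based action rule are all assumptions chosen for brevity.

```python
from collections import deque
from statistics import mean


class TrendPredictor:
    """Toy capture / learn / predict loop. All names here are hypothetical."""

    def __init__(self, window=5):
        # Component 1: incoming prices are captured and become history.
        self.history = deque(maxlen=window)
        self.slope = 0.0
        self.intercept = 0.0

    def capture(self, price):
        self.history.append(price)

    def learn(self):
        # Component 2: fit a least-squares line through the stored prices.
        n = len(self.history)
        if n < 2:
            return
        xs = range(n)
        x_bar, y_bar = mean(xs), mean(self.history)
        sxx = sum((x - x_bar) ** 2 for x in xs)
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, self.history))
        self.slope = sxy / sxx
        self.intercept = y_bar - self.slope * x_bar

    def predict_next(self):
        # Extrapolate the fitted line one step past the stored window.
        return self.intercept + self.slope * len(self.history)

    def decide(self, price, tolerance=0.5):
        # Component 3: compare a new price with the learned expectation.
        # The buy/sell rule and the tolerance are arbitrary placeholders.
        expected = self.predict_next()
        if price > expected + tolerance:
            return "sell"
        if price < expected - tolerance:
            return "buy"
        return "hold"


predictor = TrendPredictor(window=5)
for p in [10.0, 11.0, 12.0, 13.0]:
    predictor.capture(p)
predictor.learn()
print(predictor.predict_next())  # 14.0
print(predictor.decide(12.0))    # buy
```

In the full architecture each of these methods maps to a separate, independently scalable system rather than one in-process object.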

[Diagram: high-level stock prediction and machine learning architecture]

While the above diagram is simplified, this type of architecture has several fundamental considerations to address as the scope of the system increases. The most important are the amount of data and system integration. Many sources and types of data are used to predict outcomes, and a variety of sinks process that data. In an environment of 20 data sources and 20 processing sinks, real-time functions must still operate with very low latencies. This presents a scaling issue on two fronts. First, data processing applications need to scale horizontally by adding more nodes while maintaining extremely fast, real-time responses. Second, the system will store more data over time. Beyond the growth of historical data sets, different analytical workloads are also executed and used to improve prediction models.

Using Open Source Components In The Architecture

As each of the high-level components expands into further detail, open source products can be applied across the entire architecture in various capacities. These include Spring XD (now Spring Cloud Data Flow), Apache Geode (incubating), Spark MLlib, Apache HAWQ, and Apache Hadoop™.

[Diagram: data flow through the open source components of the architecture]

The data flow and pipeline generally follow five steps, as shown in the diagram above and further explained in the outline below. Importantly, each component is loosely coupled and horizontally scalable:

  1. Live data, from the Yahoo! Finance web services API, is read and processed by Spring XD, which greatly simplifies data flow orchestration, provides built-in connectors for systems integration, is Java-based, and can perform all types of transformations. The data is then stored in memory with a fast, consistent, resilient, and linearly scalable system—Apache Geode (incubating) in this case, which can also provide event distribution.
  2. Using the live, hot data from Apache Geode, a Spark MLlib application creates and trains a model, comparing new data to historical patterns. The models could also be supported by other toolsets, such as Apache MADlib or R.
  3. Results of the machine learning model are pushed to other interested applications and also updated within Apache Geode for real-time prediction and decisioning.
  4. As data ages and starts to become cool, it is moved from Apache Geode to Apache HAWQ and eventually lands in Apache Hadoop™. Apache HAWQ allows for SQL-based analysis on petabyte-scale data sets and allows data scientists to iterate on and improve models.
  5. Another process is triggered to periodically retrain and update the machine learning model based on the whole historical data set. This closes the loop and creates ongoing updates and improvements when historical patterns change or as new models emerge.
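
Steps 4 and 5 above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the `Tick` and `TieredStore` names are hypothetical, the `hot` list stands in for the in-memory tier (Apache Geode in the architecture), and the `cold` list stands in for the long-term tier (Apache HAWQ and Apache Hadoop™). None of it uses the real APIs of those systems.

```python
from dataclasses import dataclass, field


@dataclass
class Tick:
    symbol: str
    price: float
    timestamp: float  # seconds since an arbitrary starting point


@dataclass
class TieredStore:
    """Hypothetical stand-in: `hot` is the in-memory tier, `cold` the archive."""

    max_hot_age: float
    hot: list = field(default_factory=list)
    cold: list = field(default_factory=list)

    def put(self, tick):
        # New data always lands in the hot tier first.
        self.hot.append(tick)

    def age_out(self, now):
        # Step 4: move ticks older than max_hot_age from hot to cold.
        still_hot = []
        for tick in self.hot:
            if now - tick.timestamp > self.max_hot_age:
                self.cold.append(tick)
            else:
                still_hot.append(tick)
        self.hot = still_hot

    def full_history(self):
        # Step 5 retrains on the whole data set, so merge both tiers in time order.
        return sorted(self.cold + self.hot, key=lambda t: t.timestamp)


store = TieredStore(max_hot_age=60.0)
for ts, price in [(0.0, 10.0), (30.0, 10.5), (90.0, 11.0)]:
    store.put(Tick("XYZ", price, ts))

store.age_out(now=100.0)
print(len(store.hot), len(store.cold))          # 1 2
print([t.price for t in store.full_history()])  # [10.0, 10.5, 11.0]
```

The point of the split is that real-time prediction only touches the small hot tier, while periodic retraining can afford to scan everything.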

Running A Simplified Architecture On A Laptop

To allow the system to run on a common laptop, it needs to be simplified. Here is an approach that Fred Melo and I presented at Apache Big Data 2015, hosted by the Linux Foundation in Budapest, Hungary. The approach basically removes the long-term data storage components, Apache HAWQ and Apache Hadoop™:

[Diagram: simplified architecture without the long-term data storage components]

Every component of the solution has a well-defined responsibility and scales on premises or on most cloud topologies. For ease of deployment, maintenance, and support, the open source components can be wired with Pivotal Cloud Foundry for application runtimes, Pivotal GemFire instead of Apache Geode, or other components of the Pivotal Big Data Suite for large, historical data sets.

For the GitHub-hosted bits that support this architecture, there is also a JavaFX example application that acts as the client and presents a scrolling graph, updated in real-time, based on the events being pushed by Apache Geode servers as new data gets pulled from Yahoo! Finance by Spring XD. There is also a stock information simulator that can be used if no internet connectivity is available to collect information or for development purposes.


As the image shows, information is collected and used to generate a few common indicators for a given symbol, such as last close price, moving average, and predicted moving average.
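
For readers unfamiliar with these indicators, here is a rough Python sketch of how they could be computed from a list of closing prices. The function names are hypothetical, and the "predicted" value is a naive extrapolation that assumes the moving average keeps changing at its most recent rate. It is not the trained model used in the architecture.

```python
def moving_average(prices, window):
    """Simple moving average over the last `window` prices."""
    if window <= 0 or len(prices) < window:
        raise ValueError("need at least `window` prices and window > 0")
    return sum(prices[-window:]) / window


def indicators(prices, window=3):
    """Return the three indicators for one symbol (illustrative only)."""
    current = moving_average(prices, window)
    previous = moving_average(prices[:-1], window)
    return {
        "last_close": prices[-1],
        "moving_average": current,
        # Naive assumption: the average moves by the same amount again.
        "predicted_moving_average": current + (current - previous),
    }


closes = [10.0, 11.0, 12.0, 13.0]
print(indicators(closes, window=3))
# {'last_close': 13.0, 'moving_average': 12.0, 'predicted_moving_average': 13.0}
```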

Learning More

The source code and instructions for setting up this architecture are available on Pivotal Open Source Hub. The downloads also include a Vagrant box with everything installed for the laptop model.

Editor’s Note: ©2015 Pivotal Software, Inc. All rights reserved. Pivotal, Pivotal Greenplum, Pivotal GemFire and Pivotal Cloud Foundry are trademarks and/or registered trademarks of Pivotal Software, Inc. in the United States and/or other countries. Apache, Apache Hadoop, Hadoop, Apache Geode, Apache MADlib, Apache HAWQ, and Apache Spark are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

