Nearly 1.2 million people die in road crashes each year (WHO, 2015), with millions more injured or disabled. Road traffic injuries are predicted to become the fifth leading cause of death by 2030. Many in the automotive industry agree that most traffic accidents are both predictable and preventable, and they are betting they can help by combining the power of data science and the Internet of Things (IoT).
Naturally, as a company at the forefront of the IoT, Pivotal’s data science team is working with several manufacturers on different approaches to help humans drive more safely. In this article, I will demonstrate how we are helping one of the largest German car manufacturers to build a scalable IoT platform that uses weather data and data from other cars to warn drivers of dangerous conditions. Aside from the technology, I will also explain the data science involved in this project.
When we started this project, our client came to us with the idea of predicting road conditions in Germany from weather data and data from other cars. They wanted this information in the cars, in real time, to make their drivers some of the safest in the world.
It is a great idea, both to protect and to attract customers. However, they were starting from scratch, had a tight budget, and had only 10 weeks until the next important board meeting, where they wanted to show the idea’s value.
For some, it would be scary to take on such an important project with these kinds of time and budget constraints. However, our team was set up for exactly this kind of challenge, so we accepted.
Figure 1: Cloud-Native Lambda Architecture.
First, we eliminated time spent on infrastructure by putting everything on the cloud—specifically AWS. Figure 1 shows the technology stack and architecture that we used for this project.
We decided to go with the lambda architecture because training a model to predict road conditions from weather data and other car data is not trivial, even if it is only for Germany. Lambda architecture patterns leverage batch training, and we felt this was more appropriate than online learning: training a deep learning model is complex and can take some time.
The Stream Component
Figure 2: The streaming component.
In the first step, we had to ingest data (weather data and car data) to serve both the real-time and the batch layer. In our case, we used Spring XD to write our data ingestion pipeline. Figure 2 shows that we are streaming both weather and car data to Amazon S3 and Redis. We chose Spring XD mainly for its easy-to-set-up pipelines based on a domain-specific language, its capability to invoke shell and Python scripts, and its ability to scale. The ability to invoke shell and Python scripts was especially handy because we had to decode and transform the data in real time (the data was in a special binary format and also unstructured). In the near future, we would like to upgrade Spring XD to its successor, Spring Cloud Data Flow, which was still too new for our client when we started this project.
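To make the decode-and-transform step more concrete, here is a minimal sketch of the kind of Python script a stream could invoke. The record layout (timestamp, latitude, longitude, measured value) and the names `RECORD`, `decode_records` and `to_csv` are assumptions for illustration only; the client's actual binary format is not described in this article.

```python
import csv
import io
import struct

# Hypothetical fixed-size record layout, NOT the client's real format:
# little-endian uint32 unix timestamp, float64 latitude,
# float64 longitude, float32 measured value.
RECORD = struct.Struct("<Iddf")


def decode_records(payload: bytes):
    """Yield one dict per fixed-size record found in a binary payload."""
    if len(payload) % RECORD.size != 0:
        raise ValueError("payload length is not a multiple of the record size")
    for ts, lat, lon, value in RECORD.iter_unpack(payload):
        yield {"timestamp": ts, "lat": lat, "lon": lon, "value": value}


def to_csv(records) -> str:
    """Serialize decoded records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=["timestamp", "lat", "lon", "value"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

The CSV text produced this way is the sort of output that could then be written to S3 for the batch layer, while the decoded dictionaries could be pushed to Redis for the real-time layer.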
The Batch Layer
Figure 3: The batch layer.
In the batch layer, we stored the data that came from the stream in comma-separated value (CSV) format in AWS’s storage service, S3. Then we spun up an EMR cluster with Apache Spark to process the data. In particular, we used PySpark to reduce the number of features in the dataset. This was necessary because we were using a deep neural network to predict the road conditions, and feeding a lot of features into the network can cause problems in terms of computational complexity, to the point where results could be delayed for weeks. To shorten the computation time, we reduced the dimensions of the data by using a fixed grid space around the point that we want to predict.
After reducing the features, we used AWS’s GPU instances to train the neural network. For building the model we turned to Keras, a deep learning library written in Python and capable of running on top of either TensorFlow or Theano. The key advantage of Keras is that you can easily create models and train them on either CPU or GPU instances seamlessly. Moreover, it supports both feed-forward and recurrent neural networks. Once training and evaluation of the model were complete, we stored the model on Redis to use in the real-time layer.
Finally, we used Luigi, a Python module that helps you build complex pipelines of batch jobs, to create our analytical pipeline. This is important because we want to retrain our model on a regular basis without any manual work.
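As a rough illustration of the grid-based reduction, the sketch below keeps only the cells in a fixed window around the target point and discards the rest. How the grid was actually defined in the project is not stated here, so the function name `grid_window`, the `radius` parameter and the padding behaviour at the edges are all assumptions.

```python
def grid_window(grid, row, col, radius=1, fill=0.0):
    """Return the (2 * radius + 1) ** 2 values around (row, col), row by row.

    `grid` is a list of equally long rows of measured values.
    Cells that fall outside the grid are replaced by `fill`
    (an assumed edge policy, chosen only for this sketch).
    """
    n_rows, n_cols = len(grid), len(grid[0])
    features = []
    for r in range(row - radius, row + radius + 1):
        for c in range(col - radius, col + radius + 1):
            inside = 0 <= r < n_rows and 0 <= c < n_cols
            features.append(grid[r][c] if inside else fill)
    return features
```

With `radius=1`, a grid of any size is reduced to nine features per prediction point, so the input size of the network no longer grows with the area covered by the data.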
The Real-Time Layer
Figure 4: The real-time layer.
In the real-time layer, we stream data to Redis continuously, as the data occurs. We then create a queue in Redis that holds a fixed time interval of data, and use the model already cached by the batch layer to return predictions of the road conditions. This is done by a Predictive API service that we created specifically for this project. The predictions it returns are then enriched by the Enricher service, which adds further metadata such as GPS locations, the measured class, etc.
At Pivotal Labs, we know choices like these are important, if not inevitable. This is why we use an API-first approach that wraps the data science model in an API as early as possible, so that software or data engineers can collaborate early in the project, even before we have built or evaluated the model. This is a critical step in building data-driven applications.
Of course, Pivotal Cloud Foundry (PCF) has additional advantages, such as self-healing capacity, dynamic routing, and many more. A full list of advantages can be found here.
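The fixed-interval queue can be pictured with a small in-memory stand-in. This is only a sketch under assumptions: the real queue lives in Redis, the class name `TimeWindowQueue` and the function `predict_from_window` are hypothetical, and measurements are assumed to arrive in timestamp order.

```python
from collections import deque


class TimeWindowQueue:
    """Keep only the measurements from the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._items = deque()  # (timestamp, value) pairs, oldest first

    def push(self, timestamp, value):
        """Add a measurement and drop everything older than the window."""
        self._items.append((timestamp, value))
        cutoff = timestamp - self.window_seconds
        while self._items and self._items[0][0] <= cutoff:
            self._items.popleft()

    def values(self):
        """Return the measured values currently inside the window."""
        return [value for _, value in self._items]


def predict_from_window(queue, model):
    """Apply a model (any callable taking a list of values) to the window."""
    return model(queue.values())
```

Each new measurement both extends the window and evicts stale entries, so the model always sees a sequence covering the same time span.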
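One way to picture the API-first idea is a service whose request and response format is fixed on day one, with a placeholder standing in for the model until a trained one exists. The sketch below is an assumption-laden illustration, not the project's Predictive API: `PredictionService`, `placeholder_model` and the JSON field names are all hypothetical.

```python
import json


def placeholder_model(features):
    """Stand-in used before a trained model exists: always scores 0.0."""
    return 0.0


class PredictionService:
    """Wrap any callable model behind a stable JSON request/response contract."""

    def __init__(self, model=placeholder_model, threshold=0.5):
        self.model = model
        self.threshold = threshold

    def handle(self, request_body: str) -> str:
        """Take a JSON request with a `features` list, return a JSON prediction."""
        payload = json.loads(request_body)
        score = float(self.model(payload["features"]))
        return json.dumps(
            {"wet_probability": score, "wet": score >= self.threshold}
        )
```

Engineers can integrate against `handle` right away; swapping `placeholder_model` for a trained model later changes the scores but not the contract.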
Figure 5: Pivotal Cloud Foundry PaaS.
With the technology stack in place, let’s dive into the rationale behind the data science approach we used to predict road conditions.
Introduction to Deep Learning
Predicting road conditions with weather data and data from other cars can be very complex since it involves three dimensions: the measured value, the location, and time. To solve such a problem, we used deep learning, a class of machine learning techniques that employs algorithms loosely modeled on the human brain to solve multi-dimensional problems such as image classification, handwriting recognition, and many more.
Recurrent Neural Network
We opted to use a Recurrent Neural Network (RNN) rather than a Feed-Forward Neural Network. In an RNN, the input that you feed into the network does not only flow in one direction; it can also loop back, so earlier time steps influence later ones. This is ideal for us because, as already mentioned, our input has a time component. We also considered an Autoregressive/Moving Average (AR/MA) model, but ultimately rejected it because we wanted to process the raw data as little as possible.
In terms of RNNs, there are many ways to construct the network (see Figure 6), such as one-to-one, one-to-many, and more. In our case, we used a many-to-one relationship to handle sequences of data (i.e., measured values for each time step per location) with a binary output class. For example, our target variable may be either wet road condition or not wet road condition.
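To show what many-to-one means in practice, here is a minimal forward pass of a simple RNN written in plain Python. It is a sketch for intuition only: the project used Keras, and the function name `rnn_many_to_one`, the weight names and the tanh/sigmoid choices are assumptions rather than the trained network.

```python
import math


def rnn_many_to_one(sequence, W_xh, W_hh, b_h, w_hy, b_y):
    """Read a whole sequence and emit a single probability.

    sequence: list of input vectors, one per time step
    W_xh: hidden_size x input_size input-to-hidden weights
    W_hh: hidden_size x hidden_size hidden-to-hidden weights
    b_h:  hidden bias (length hidden_size)
    w_hy: hidden-to-output weights (length hidden_size)
    b_y:  scalar output bias
    """
    hidden = [0.0] * len(b_h)
    for x in sequence:
        # The new hidden state depends on the current input AND on the
        # previous hidden state, which is what makes the network recurrent.
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(W_xh[j], x))
                + sum(w * hi for w, hi in zip(W_hh[j], hidden))
                + b_h[j]
            )
            for j in range(len(b_h))
        ]
    # Only the final hidden state is used: many inputs, one output.
    logit = sum(w * h for w, h in zip(w_hy, hidden)) + b_y
    return 1.0 / (1.0 + math.exp(-logit))
```

The returned value can be read as the probability of the positive class (for example, a wet road) given the full sequence of measurements.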
Figure 6: Various options to construct recurrent neural networks.
Concerning our target variable, we were faced with imbalanced classes, which arise when one class occurs far less frequently than the others. In this case, wet or icy roads were less frequent than roads that were neither. To compensate, we relied on evaluation metrics such as precision and recall instead of accuracy to evaluate our model. Moreover, we used over- and undersampling techniques to balance the target class.
When we started with the modelling, we first experimented with different variants of RNNs, such as Long Short-Term Memory (LSTM) models and Gated Recurrent Unit (GRU) models. In the end, we settled on simple RNNs, which gave us better results in terms of both computational complexity and model evaluation (higher precision and recall). Moreover, we used GPUs instead of CPUs, which proved to be 10x faster when training an RNN. Another challenge was finding the optimal network. We spent a lot of time finding the right architecture (number of hidden layers, activation function, dropout, etc.) and tuning the parameters (number of epochs, early stopping, seed).
Finally, one major problem was that a lot of data was needed before the network really learned something. Sometimes it got stuck in a local minimum and training did not converge as expected.
At the end of the 10 weeks, we had successfully built a scalable IoT platform that runs fully on the cloud. Our core principle, API first, helped us to bring our data science model quickly into production so that our client could test it and give early feedback. There is more work to do, such as expanding the model and operationalizing insights in dashboard applications. However, by choosing PCF and cloud infrastructure, we were able to prove out an exceptional amount of work in just 10 weeks.
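The following sketch shows why accuracy is misleading on imbalanced classes and what a basic oversampling step can look like. It is illustrative only: the function names are hypothetical, random oversampling is just one of several possible techniques, and the article does not say which variant was used in the project.

```python
import random


def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the positive (rare) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def random_oversample(samples, labels, positive=1, seed=0):
    """Duplicate random minority samples until both classes are equally frequent.

    Assumes the positive class is the minority class.
    """
    rng = random.Random(seed)
    minority = [(s, l) for s, l in zip(samples, labels) if l == positive]
    majority = [(s, l) for s, l in zip(samples, labels) if l != positive]
    if not minority or len(minority) >= len(majority):
        return list(samples), list(labels)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    combined = majority + minority + extra
    rng.shuffle(combined)
    return [s for s, _ in combined], [l for _, l in combined]
```

For example, with 95 "not wet" and 5 "wet" samples, a model that always predicts "not wet" reaches 95% accuracy, yet `precision_recall` returns zero for both metrics because it never finds a single wet road. Oversampling should be applied to the training split only, so that evaluation still reflects the real class distribution.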
If you are interested in learning more, or would like Pivotal Labs and our data science team to help you jump start your project, please contact us. Alternatively, you can always find me on Twitter @datitran.
Watch the full talk from the Cloud Foundry Summit Europe 2016:
- API First For Data Science
- Blog article on class imbalances
- Another article around connected cars or watch our connected car demo
- Read other articles from Pivotal Data Scientists
- Check out the product info, downloads, and documentation for Pivotal Cloud Foundry
About the Author
Dat works as a Senior Data Scientist at Pivotal. His focus is helping clients understand their data and how it can be used to add value. To do so, he employs a wide range of machine learning algorithms, statistics, and open source tools to help solve his clients’ problems. He is a regular speaker and has presented at PyData and Cloud Foundry Summit. His background is in operations research and econometrics. Dat received his MSc in economics from Humboldt University of Berlin.