Jointly authored by Gautam Muralidhar, Esther Vasiete, and Srivatsan Ramanujam
In this article, we explain exactly how we use data science to develop connected car applications for automobile companies like Ford, BMW, Mercedes, and Volkswagen, including pointers to sample code. As a very recent sign of our worldwide leadership in this space, Ford just invested $182 million in Pivotal alongside Microsoft and prior investors.
We have a strong pedigree here. Before the connected car sector started showing a 45% CAGR through 2020, we were one of the first companies to begin developing advanced, data science-driven apps in the space. We were also one of the first to share an automotive reference architecture for real-time, data science-driven apps. Pivotal brings together all the required components (streaming data services, cloud native architectures, big data platforms, massively parallel analytics, in-memory data grids, and data science toolkits) to turn innovative, data-driven ideas into real-time customer experiences. With the addition of our open source heritage and agile development best practices, we are a worldwide leader in applied data science.
While this article focuses on predictive maintenance, our field engineering, data science, and Pivotal Labs teams have examined many connected car use cases: driver behavior analysis, improved safety, greater fuel efficiency, remote access, in-car infotainment, e-commerce, fleet management, vehicle-to-vehicle crash avoidance, remote diagnostics, insurance adjustments, traffic avoidance, and more. Below, we explain the specific data science challenges, data sources, processing workloads, feature creation approaches, and machine learning algorithms applied to predictive maintenance. Then, we cover how to operationalize these insights and make use of them.
The Predictive Maintenance Problem
The classic predictive maintenance problem is illustrated in Figure 1. Given data streamed from a vehicle, such as diagnostic trouble codes (DTCs) and other vehicle parameters recorded when the trouble codes occur (e.g., odometer reading, vehicle speed, engine temperature, and torque), can we predict an ensuing repair or maintenance job on the vehicle?
For example, suppose a car is driven for 5 months and emits several DTCs, along with other vehicle parameters, throughout that period, starting in the first month. At the end of the fifth month, the car has an unscheduled transmission job at the dealership’s service center. Could this unscheduled transmission job have been predicted from the DTCs and the other vehicle parameters emitted beforehand?
DTCs are alphanumeric codes that are emitted by the on-board diagnostic systems in the car, and they typically signal when a vehicle sensor reports values outside the normal or accepted range. Predictive maintenance problems are challenging because DTC signals are not always symptomatic of an ensuing repair. For instance, a DTC might indicate or report a sensor fault, but it might be caused by the systems being monitored or other faulty sensors which are overcompensating.
To complicate the problem, multiple unscheduled repairs or jobs could be performed on the same day. Further, some DTCs might prompt the ‘Check Engine Light,’ but the actual repair might not be immediately apparent from the DTC. In these cases, a trained mechanic would determine the best action to take based on one or more DTCs. However, a comprehensive, granular rule-based approach to diagnosis is hard to construct. There is a reasonable alternative hypothesis: given a sufficiently large number of past repairs performed by humans and the DTCs leading up to those repairs, a data-driven, machine learning-based approach could infer relationships between DTCs and repairs.
Figure 1: The predictive maintenance problem
Why Solve the Predictive Maintenance Problem?
Solving the predictive maintenance problem benefits multiple parties. Automakers gain an avenue to monitor trending problems with their vehicles and potentially prevent NHTSA recalls. A solution also serves as a powerful quality assurance tool, potentially identifying faulty or unreliable parts by analyzing the correlations and causal relationships between the sensor data and the repair jobs. For dealers, it provides a tool to monitor vehicles for potential problems, which they can bring to the notice of the parent auto company or the customer. This type of proactive service has a high likelihood of improving customer satisfaction ratings.
A solution could also serve as a digital assistant to mechanics working on the car. It gives fleet management companies insight into assets and operations: they can identify which vehicles are likely to be off the road for repairs, allowing them to better manage their fleet and inventory. Finally, the end customer might avoid unexpected or out-of-schedule vehicle repairs and could get immediate, predictive alerts for potential critical failures.
Data Sources for Predictive Maintenance
Broadly speaking, there are two main data sources needed to infer relationships between DTCs and repairs: 1) vehicle sensor data including DTCs and other vehicle parameters, and 2) data on repairs and repair diagnostics from the dealership or auto mechanics.
For the first data set, automobile manufacturers approach sampling and data transmission differently. Some auto companies only record the vehicle parameters at the time of DTC occurrence while others sample the vehicle parameters at high frequencies (e.g., once every second). Naturally, from a data science and machine learning perspective—the more data the better—high frequency sampling is desirable. Importantly, high frequency data enables the development of predictive models that do not overly rely on the actual DTCs because noisy vehicle parameters can incorrectly generate DTCs.
Then, there is the repair data. Once a sequence of DTCs is emitted and observed over a certain period of time, we need to label the sequence. Ideally, any time window comprising a set of DTCs would be labelled by the type of repair job(s) performed at the end of the time window. To infer relationships between DTCs and repairs, a sufficiently large volume of historical repair data is desirable. Fortunately, most automakers are really good at collecting warranty claims and repair data from their dealership network. The quality of the repair data is a critical factor in building DTC-based models and predicting repairs.
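As a rough illustration of the labelling step, the sketch below assigns each DTC event to the next repair on the same vehicle, provided the repair falls within a fixed look-ahead window. The record layout, the sample VIN and codes, and the 60-day window are assumptions made for the example, not the schema or settings used in our engagement.

```python
from datetime import date, timedelta

# Hypothetical DTC events: (vin, date observed, DTC code).
dtc_events = [
    ("VIN001", date(2016, 1, 5), "P0700"),
    ("VIN001", date(2016, 2, 10), "P0730"),
    ("VIN001", date(2016, 6, 1), "P0300"),
]

# Hypothetical repair orders: (vin, repair date, major system repaired).
repairs = [
    ("VIN001", date(2016, 3, 1), "transmission"),
]


def label_events(events, repair_orders, window_days=60):
    """Label each DTC event with the first repair on the same vehicle that
    occurs on or after the event and within `window_days`; otherwise None."""
    window = timedelta(days=window_days)
    labelled = []
    for vin, seen_on, code in events:
        candidates = [
            (fixed_on, job)
            for r_vin, fixed_on, job in repair_orders
            if r_vin == vin and seen_on <= fixed_on <= seen_on + window
        ]
        # min() picks the earliest qualifying repair date.
        label = min(candidates)[1] if candidates else None
        labelled.append((vin, seen_on, code, label))
    return labelled


labelled_events = label_events(dtc_events, repairs)
```

Events that receive no label (here, the one observed after the repair) can either be dropped or kept as negative examples, depending on how the classification problem is framed.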
There is also a trade-off. As repair descriptions get more granular (assuming data cleanliness and quality are maintained), the modeling problem becomes harder. This is because the number of classes for repair classification expands and places severe constraints on the number of training examples available for each repair type. For example, if we restrict the problem to identifying whether a car is going to have an unscheduled transmission, engine, or suspension repair, then the problem is a lot more manageable, because the whole data set can be used to train the model. If we try to identify the specifics of the transmission, engine, or suspension repairs (e.g., at the level of the subsystems or parts that need to be worked upon), we must partition the available data by repair type and consequently train on a smaller set for each class.
In our work with an automobile company, we focused on hierarchically predicting the major system and subsystem jobs. In other words, we first predicted which of the major systems (transmission, engine, suspension, etc.) most likely needed to be worked upon. Based on the predictions made at the first level, we then predicted which of the subsystems specific to that system most likely needed repair.
There is also the consideration of latency requirements, and solutions differ on a case by case basis. For automakers and dealers, monitoring vehicle quality on a batch basis could still be done. If the intent is to alert customers to pull over based on a DTC or other set of parameters or to adjust the car’s system in real-time, then the latency requirements would dictate real time/near real time processing.
Data Workloads for Predictive Maintenance
Data sources for the predictive maintenance problem are a combination of structured data (e.g., vehicle data comprising fields such as year, make, and model, as well as warranty parts and claims) and unstructured data (repair order narratives, and time series of DTCs and vehicle parameters such as odometer reading, speed, engine temperature, engine torque, and acceleration). To be able to build machine learning models for these data sources, it is necessary to ingest these diverse sources of data into a backend data system that enables unified processing.
The Pivotal technology stack provides the right tools for this. For example, our connected car blog article discusses how data ingestion can be achieved through stream or batch processing via systems such as Spring Cloud Data Flow and messaging systems such as RabbitMQ, both open source technologies we contribute to. These systems enable easy ingestion of data from the car to an Apache Hadoop™ backend, a natural landing zone for sensor data. Once in Hadoop, the sensor data can be loaded into SQL tables using Pivotal HDB, Pivotal’s Hadoop distribution and Hadoop native SQL engine powered by Apache HAWQ (incubating). Figure 2 illustrates an example data set for a table comprising DTCs and vehicle parameters recorded at DTC occurrence. Relational models are also possible, for example with the VIN serving as a foreign key to a repairs, vehicle, or parts table. Structured data sources such as vehicle data and warranty parts can also be landed in Hadoop from existing relational databases using technologies such as Apache Sqoop and then queried with SQL by using Pivotal HDB.
Figure 2: Example data model for a table comprising DTCs and vehicle parameters
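To make the relational layout concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a SQL-on-Hadoop engine. The table names, column names, and sample rows are hypothetical; they are not the columns shown in Figure 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vehicle (
    vin        TEXT PRIMARY KEY,
    model_year INTEGER,
    make       TEXT,
    model      TEXT
);
CREATE TABLE dtc_event (
    vin           TEXT REFERENCES vehicle(vin),
    event_ts      TEXT,     -- ISO-8601 timestamp of the DTC occurrence
    dtc           TEXT,
    odometer_km   REAL,
    speed_kmh     REAL,
    engine_temp_c REAL
);
CREATE TABLE repair (
    vin          TEXT REFERENCES vehicle(vin),
    repair_date  TEXT,
    major_system TEXT
);
""")
conn.execute(
    "INSERT INTO vehicle VALUES ('VIN001', 2015, 'ExampleMake', 'ExampleModel')"
)
conn.executemany(
    "INSERT INTO dtc_event VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("VIN001", "2016-01-05T08:30:00", "P0700", 41250.0, 62.0, 94.5),
        ("VIN001", "2016-02-10T17:05:00", "P0730", 43810.0, 48.0, 97.0),
    ],
)
conn.execute("INSERT INTO repair VALUES ('VIN001', '2016-03-01', 'transmission')")

# Join DTC events to later repairs on the same vehicle through the VIN key.
rows = conn.execute("""
    SELECT d.dtc, r.major_system
    FROM dtc_event AS d
    JOIN repair AS r
      ON r.vin = d.vin AND r.repair_date >= date(d.event_ts)
    ORDER BY d.event_ts
""").fetchall()
```

The same join shape carries over to a parts or warranty table keyed on the VIN.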
Data Processing and Feature Creation for Predictive Maintenance
Once vehicle sensor data and warranty repairs have been ingested, data processing, feature creation, and machine learning can be carried out at scale. We do this by leveraging HAWQ’s massively parallel processing (MPP) architecture. For example, signal processing operations used to filter noisy sensor data can run in a high performance environment with petabyte-sized data sets. More complex Bayesian filtering, such as Kalman filtering, can be carried out in parallel across HAWQ and HDFS nodes. These Jupyter notebooks illustrate examples for some of these signal processing operations. Feature creation, likewise, can be carried out at scale on HAWQ.
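As an illustration of the kind of filtering mentioned above, the following is a minimal one-dimensional Kalman filter for a slowly varying signal such as engine temperature. It is a single-process Python sketch, not the parallel HAWQ implementation; the random-walk state model, the noise variances, and the sample readings are assumptions chosen for the example.

```python
def kalman_smooth(readings, process_var=1e-3, measurement_var=4.0):
    """Filter a noisy scalar signal with a random-walk Kalman filter.

    Returns the filtered state estimate after each reading."""
    estimate = readings[0]        # initialise the state from the first reading
    error_var = measurement_var   # initial uncertainty of that estimate
    filtered = [estimate]
    for z in readings[1:]:
        # Predict: the state is assumed constant, so only uncertainty grows.
        error_var += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= (1.0 - gain)
        filtered.append(estimate)
    return filtered


# Hypothetical engine temperature samples (deg C) with sensor noise.
noisy_temps = [90.2, 93.1, 88.7, 91.5, 95.0, 89.4, 90.8, 92.3]
smoothed_temps = kalman_smooth(noisy_temps)
```

Because each vehicle's time series is filtered independently, a function like this is a natural candidate for running per vehicle in parallel.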
For the predictive maintenance problem we worked on, we vectorized the DTCs along with other variables such as previous job/repair type, number of days elapsed since the same job, odometer readings, and more. These went into a single, high-dimensional but sparse feature vector. The feature vectors were created over a window of time defined between two successive repairs of the same kind for a given car. We experimented with the window end date by setting it to 1, 15, 30, and 60 days before the ensuing repair and aggregating information about DTCs and other variables observed during the corresponding windows. The intent behind this experimentation was to assess whether ensuing repairs could be caught early. Our analysis revealed that, while there was some loss in the predictive power of the model going back 1-2 months prior to the repair date, the loss wasn’t dramatic, which implies that the key DTCs show up early.
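A minimal sketch of the windowed vectorization described above follows. It counts DTC occurrences between the previous repair and a cutoff set a given number of days before the ensuing repair, then appends two of the extra variables. The DTC vocabulary, field names, and sample values are hypothetical, the odometer value is held fixed across cutoffs for simplicity, and a real pipeline would store the vector in a sparse format.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical fixed vocabulary of DTCs; one vector position per code.
DTC_VOCAB = ["P0300", "P0700", "P0730", "U0100"]


def build_features(events, prev_repair, next_repair, days_before, odometer_km):
    """Vectorize DTC counts in the window (prev_repair, next_repair - days_before].

    `events` is a list of (date, dtc) tuples for one vehicle."""
    window_end = next_repair - timedelta(days=days_before)
    counts = Counter(
        dtc for seen_on, dtc in events if prev_repair < seen_on <= window_end
    )
    dtc_part = [counts.get(code, 0) for code in DTC_VOCAB]
    days_since_prev_repair = (window_end - prev_repair).days
    return dtc_part + [days_since_prev_repair, odometer_km]


events = [
    (date(2016, 1, 5), "P0700"),
    (date(2016, 2, 10), "P0730"),
    (date(2016, 2, 25), "P0730"),
]
prev_repair, next_repair = date(2015, 12, 1), date(2016, 3, 1)

# One feature vector per cutoff, mirroring the 1/15/30/60-day experiment.
features_by_offset = {
    d: build_features(events, prev_repair, next_repair, d, 43810.0)
    for d in (1, 15, 30, 60)
}
```

Moving the cutoff further back removes the most recent DTCs from the vector, which is what lets the experiment measure how early a repair can be anticipated.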
Machine Learning for Predictive Maintenance
To begin the machine learning design process, we are faced with a multi-class/multi-label problem where the number of classes could be very large depending on the job/repair granularity (potentially in the thousands). As mentioned above, this type of granularity means there are very few training examples per class label, making it a challenge to learn meaningful class conditional distributions. The problem is analogous to multi-tagging of a large corpus of documents. To address it, we adopted a hierarchical approach. We first built a classifier to predict the major system level repairs. Here, a class was a system such as transmission, engine, suspension, etc. To train this system-level classifier, we aggregated training examples for all subsystems under a given system and assigned the same class label to them. Then, we predicted the subsystem repairs for each system.
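The two-level setup can be sketched as a data-preparation step: the first-level training set collapses every subsystem label to its parent system, and each second-level training set keeps only the examples of one system. The label names and feature vectors below are hypothetical, and the sketch stops short of fitting any model.

```python
from collections import defaultdict

# Hypothetical training examples: (feature vector, system, subsystem).
examples = [
    ([1, 0, 2], "transmission", "clutch"),
    ([0, 1, 0], "transmission", "torque_converter"),
    ([3, 0, 0], "engine", "fuel_injection"),
    ([0, 2, 1], "engine", "cooling"),
    ([0, 0, 4], "suspension", "shock_absorber"),
]


def split_hierarchy(rows):
    """Return (level-1 training set, {system: level-2 training set})."""
    # Level 1: every subsystem example is relabelled with its parent system.
    level1 = [(x, system) for x, system, _ in rows]
    # Level 2: one training set per system, labelled by subsystem.
    level2 = defaultdict(list)
    for x, system, subsystem in rows:
        level2[system].append((x, subsystem))
    return level1, dict(level2)


system_train, subsystem_train = split_hierarchy(examples)
```

At prediction time, the level-1 classifier picks the system, and the level-2 classifier trained for that system then picks the subsystem.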
For multi-class classification, we built a parallel one-vs-rest multiclass classifier using Python scikit-learn models such as logistic regression and random forests via PL/Python user defined functions on HAWQ. Figure 3 illustrates the parallelization of a one-vs-rest multiclass classifier on HAWQ: there is one binary classifier per class label, and each is built and runs on a separate segment host or node. For readers who are interested, we have packaged the one-vs-rest multiclass data preparation for parallel model building as a utility function in PDLTools, a publicly available library of common data science functions developed by the Pivotal Data Science team for the Pivotal MPP platforms. The base model can also be built using Apache MADlib.
Figure 3: Parallel one-vs-rest multiclass classifier
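For readers who want to see the one-vs-rest idea outside the database, the sketch below trains one binary scikit-learn LogisticRegression per class in a plain Python loop and predicts the class whose model gives the highest probability. It is not the PL/Python or PDLTools implementation: on HAWQ each per-class model would be fitted on its own segment, whereas here the loop runs serially, and the toy data is invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy, well-separated data: three classes, two hypothetical features.
rng = np.random.default_rng(0)
centers = {"engine": (0.0, 0.0), "suspension": (6.0, 0.0), "transmission": (0.0, 6.0)}
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in centers.values()])
y = np.repeat(list(centers), 30)


def fit_one_vs_rest(X, y):
    """Fit one independent binary classifier per class label."""
    models = {}
    for label in np.unique(y):
        # Each fit depends only on (X, y == label), so the fits can run in parallel.
        models[label] = LogisticRegression().fit(X, (y == label).astype(int))
    return models


def predict_one_vs_rest(models, X):
    """Pick, for each row, the class whose binary model is most confident."""
    labels = list(models)
    scores = np.column_stack(
        [models[label].predict_proba(X)[:, 1] for label in labels]
    )
    return np.array(labels)[scores.argmax(axis=1)]


ovr_models = fit_one_vs_rest(X, y)
predictions = predict_one_vs_rest(ovr_models, X)
```

Because no per-class fit shares state with another, the loop body is the unit of work that can be handed to separate nodes.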
Operationalizing the Predictive Maintenance Solution
Once the models are built, the only way they will produce a valuable impact is if they are put into use by automakers, dealers, and fleet management teams who must improve their customer experiences. We see our customers integrate the machine learning models with web applications hosted on Pivotal Cloud Foundry. For example, an app might constantly score the data streaming off a car and notify the driver, dealer, and manufacturer of the issue. The Pivotal Data Science team has set up a boilerplate for building web apps with Flask, a Python-based web framework; the boilerplate app consumes predictions and insights from models built on Pivotal HDB and Pivotal Greenplum. Interested readers can find the boilerplate code here.
To operationalize our models, we need an application that evaluates incoming information and takes action on it, like starting a workflow if a DTC event fires. For this real-time event evaluation, applications can draw on in-memory data, streaming functions, event notifications, or model definitions such as PMML to evaluate an aspect of the incoming data. Pivotal GemFire (i.e., Apache Geode) is an ultra low latency in-memory data grid, which can serve as a caching layer for models built in HAWQ or Greenplum via PMML. Figure 4 illustrates an example architecture for real-time/near real-time model operationalization using Pivotal’s stack.
Figure 4: Real-time Model Operationalization for predictive maintenance using Pivotal’s Technology Stack
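A minimal sketch of the event-evaluation step follows. A plain dictionary stands in for the in-memory data grid, and a hand-written logistic model stands in for a model exported from the analytics platform; the model key, weights, threshold, and DTC names are all hypothetical, and no GemFire, Geode, or PMML API is used.

```python
import math

# Stand-in for an in-memory cache of model definitions.
model_cache = {
    "transmission_repair": {
        "bias": -3.0,
        "weights": {"P0700": 2.5, "P0730": 2.0},
        "threshold": 0.5,
    }
}


def score_event(dtc_counts, model):
    """Return the predicted repair probability for one vehicle's DTC counts."""
    z = model["bias"] + sum(
        w * dtc_counts.get(code, 0) for code, w in model["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def handle_event(vin, dtc_counts, cache=model_cache):
    """Score an incoming event against every cached model and collect alerts."""
    alerts = []
    for name, model in cache.items():
        prob = score_event(dtc_counts, model)
        if prob >= model["threshold"]:
            alerts.append((vin, name, round(prob, 3)))
    return alerts


alerts = handle_event("VIN001", {"P0700": 1, "P0730": 1})
```

In a deployed application, the returned alerts would feed whatever notification or workflow step the app triggers for the driver, dealer, or manufacturer.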
In the big picture of automotive engineering lifecycles, this real-world approach to connected car applications is just a first step. Companies often look at an initial program like this as a proof of concept, followed by early market test programs. This invariably leads to future architecture discussions that address data granularity, sampling frequency, in-car versus out-of-car processing, alternative algorithms, and more.