Data Science How-To: Text Analytics-as-a-Service

September 20, 2016 Chris Rawles


At Pivotal, we’ve written about the untapped business value in unstructured data and how we utilize natural language processing (NLP) to help our customers. We continue building upon this work by demonstrating how—after the data exploration, feature engineering, and model building stages—to deploy and operationalize a text analytics model. In this example, we show an approach to deploying a scalable trained sentiment classifier that can also conveniently be used for additional text analytics and other data science tasks.

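Specifically, in this post we’ll demonstrate how to deploy a trained sentiment classifier to Cloud Foundry as a microservice, query it with a POST request, and scale it with cf scale.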

If you’d like to jump straight to the code, the GitHub repository is available here.

Data Scientists And End Users: Completing The Analytics Loop

We’ve written about the business impact of deploying machine learning models as a service using a microservice-based approach and API-first data science. The full value of data science is realized by operationalizing the data science workflow and exposing model predictions and insights to the end user.

Successful data science models are developed to serve and bring value to an end user—whether the end user is a customer, analyst, or domain expert. End users consume and interact with models in different contexts such as via a web application or a command line API request.

Ultimately, however, no model is perfect and a successful model is an evolving one. The fastest cycle for improving a model is an iterative process of continually gaining user feedback and new data, re-training, and re-deploying over and over again to continuously hone the model. This is something that should be done often and programmatically, as shortening this analytics loop results in better models and more business impact.

Cloud Foundry helps tighten the analytics loop by providing a scalable platform for deploying and managing analytical models. A key benefit of Cloud Foundry is that it lets you bring a model to life without worrying about routing and domain configuration, load balancing, environment installation, and so on. As a result, data scientists spend more time building models and writing code, which makes data scientists happy and end users even happier.

Deploying The Model

We demonstrate the process of model operationalization on Cloud Foundry by deploying this sentiment classifier. It was trained on 1.6 million Tweets in a distributed computing environment, using PL/Python in Greenplum Database, with distant supervision for automatic labeling and a logistic regression model for sentiment analysis. The following example uses Jupyter Notebook (via Jupyter Kernel Gateway) for model deployment. We also built a Flask implementation of this example.
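Distant supervision replaces hand labeling with a noisy heuristic. The sketch below assumes one common heuristic for Tweets, treating emoticons as sentiment labels; the post does not say which signal was actually used, and the emoticon sets and the `distant_label` helper are hypothetical.

```python
# Hypothetical emoticon sets; the labeling signal used for the real
# classifier is not stated in the post.
POSITIVE = {":)", ":-)", ":D", "=)"}
NEGATIVE = {":(", ":-(", "=("}
EMOTICONS = POSITIVE | NEGATIVE


def distant_label(tweet):
    """Return (text_without_emoticons, label), or None if no clear label.

    label is 1 for positive and 0 for negative.
    """
    tokens = tweet.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos == has_neg:
        # Either no emoticon at all or conflicting emoticons.
        return None
    # Strip the emoticons so the model cannot memorize the label source.
    text = " ".join(t for t in tokens if t not in EMOTICONS)
    return text, 1 if has_pos else 0


tweets = [
    "loving the new release :)",
    "my flight got cancelled :(",
    "meeting at 3pm",
    "great :) but also terrible :(",
]
labeled = [pair for pair in map(distant_label, tweets) if pair is not None]
print(labeled)
```

Tweets without a usable signal are dropped rather than guessed at, which trades training-set size for cleaner labels.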

The example consists of 3 files:

  1. text-analytics-service-pcf.ipynb – the Jupyter Python notebook applying the model
  2. manifest.yml – instructs Cloud Foundry how to deploy our application
  3. environment.yml – defines the required environment for our app

The Jupyter notebook text-analytics-service-pcf.ipynb reads a trained, serialized Python scikit-learn model and exposes it through an HTTP POST endpoint.
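In Jupyter Kernel Gateway's notebook-http mode, a cell becomes an endpoint when its first line is a comment naming the HTTP verb and path; the gateway supplies the request as a JSON string named `REQUEST`, and whatever the cell prints becomes the response body. The sketch below imitates such a cell outside the gateway. The `/polarity` path, the `{"data": [...]}` payload shape, and the stub `score` function (standing in for the deserialized scikit-learn model) are assumptions, not the notebook's actual contents.

```python
import json


def score(text):
    """Stand-in for the deserialized model (hypothetical scoring rule)."""
    words = text.lower().split()
    positive = sum(w in {"good", "great", "love"} for w in words)
    negative = sum(w in {"bad", "awful", "hate"} for w in words)
    # Smoothed share of positive cue words, always strictly between 0 and 1.
    return (positive + 1) / (positive + negative + 2)


# Inside the gateway, REQUEST is set for us; here we build it by hand.
REQUEST = json.dumps({"body": {"data": ["I love this", "awful service"]}})


# In the notebook, the annotation below would be the first line of its own cell:
# POST /polarity
def handle(request_json):
    request = json.loads(request_json)
    texts = request["body"]["data"]
    return json.dumps({"polarity": [score(t) for t in texts]})


print(handle(REQUEST))
```

Keeping the handler a plain function of the request string makes it easy to test in the notebook before deploying.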

Next, we write the manifest.yml file, which instructs Cloud Foundry to call the jupyter-kernelgateway command, exposing our model as a RESTful microservice.
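The start command in such a manifest typically launches the gateway in notebook-http mode on the port Cloud Foundry assigns through the `PORT` environment variable. A minimal sketch of assembling that command follows; the `gateway_command` helper and the option values are assumptions, and the `api` value in particular differs between Kernel Gateway versions.

```python
import os
import shlex


def gateway_command(notebook, port):
    """Build a kernel gateway command line that serves `notebook` over HTTP."""
    return [
        "jupyter", "kernelgateway",
        "--KernelGatewayApp.api=kernel_gateway.notebook_http",
        f"--KernelGatewayApp.seed_uri={notebook}",
        "--KernelGatewayApp.ip=0.0.0.0",
        f"--KernelGatewayApp.port={port}",
    ]


# Cloud Foundry exposes the listening port as $PORT; 8080 is a local fallback.
port = int(os.environ.get("PORT", "8080"))
command = gateway_command("text-analytics-service-pcf.ipynb", port)
print(shlex.join(command))
```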

The manifest file specifies instructions and metadata (name, memory usage, disk usage, buildpack, etc.) for pushing an app to Cloud Foundry. The buildpack provides the framework for installing the necessary Python packages using the package managers conda and pip. We list the specific required packages in the environment.yml file.

That’s it! With these 3 files, we can now run cf push to deploy our app to Cloud Foundry.

We can access the classifier with a POST request, which returns a score from 0 to 1: values near 0 indicate more negative sentiment and values near 1 indicate more positive sentiment.
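A client call might look like the following sketch, which uses only the Python standard library. The URL, the `{"data": [...]}` payload shape, and the 0.5 threshold are assumptions; substitute the route reported by cf push and the payload format the deployed notebook expects.

```python
import json
import urllib.request


def build_request(url, texts):
    """Create a JSON POST request carrying the texts to score."""
    payload = json.dumps({"data": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def interpret(score, threshold=0.5):
    """Map a score in [0, 1] to a coarse label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return "positive" if score >= threshold else "negative"


# Hypothetical route; replace with the URL of the deployed app.
req = build_request("https://sentiment.example.com/polarity", ["what a great day"])
print(req.get_method(), req.full_url)

# Sending is commented out because the URL above is a placeholder:
# with urllib.request.urlopen(req) as resp:
#     scores = json.loads(resp.read())

print(interpret(0.91), interpret(0.12))
```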

Finally, we can easily scale our classifier by using cf scale to spin up new instances in response to changes in demand.
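Scaling decisions can also be scripted. The sketch below computes an instance count from observed load and builds the matching `cf scale APP_NAME -i INSTANCES` argument list; the app name, the traffic figures, and both helper functions are hypothetical.

```python
import math


def instances_needed(requests_per_sec, capacity_per_instance, minimum=1):
    """Smallest instance count that covers the observed load."""
    return max(minimum, math.ceil(requests_per_sec / capacity_per_instance))


def scale_command(app_name, instances):
    """Argument list for `cf scale APP_NAME -i INSTANCES`."""
    return ["cf", "scale", app_name, "-i", str(instances)]


# Hypothetical numbers: 45 requests/s observed, about 10 requests/s per instance.
n = instances_needed(45, 10)
print(" ".join(scale_command("sentiment-service", n)))
```

The argument list can be passed to `subprocess.run` on a machine where the cf CLI is installed and logged in.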

Taking A Model Into Production

Our model is now served as a scalable, autonomous microservice. Because the model is decoupled, different users can consume it in different contexts through our API, whether that user is a developer integrating the model into a web application or a business analyst calling it from the command line. Decoupling the model from the surrounding systems also reduces the complexity of our modeling architecture, allowing us to deploy and update the model in isolation. Our autonomous model can also be easily integrated into a data processing framework such as Spring Cloud Data Flow.

Model Persistence

Prior to operationalization, the data science workflow (data exploration, feature engineering, and model building) is frequently performed in a distributed architecture optimized for machine learning, such as Greenplum Database, Apache HAWQ (incubating), or Apache Spark™. Models developed in such environments can be persisted and deployed using Predictive Model Markup Language (PMML). Alternatively, models developed in Greenplum and HAWQ using PL/Python, for example, can be persisted through serialization or other markup languages and then deployed on Cloud Foundry.
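The serialization route can be illustrated with Python's pickle module. In this sketch a plain dictionary of logistic regression parameters stands in for a fitted scikit-learn estimator; the weights and the `predict_proba` helper are hypothetical.

```python
import math
import pickle

# Hypothetical fitted parameters of a bag-of-words logistic regression.
model = {
    "bias": -0.1,
    "weights": {"great": 1.8, "love": 2.1, "awful": -2.4, "hate": -2.0},
}

# Serialize in the training environment ...
blob = pickle.dumps(model)

# ... and deserialize inside the deployed service.
restored = pickle.loads(blob)


def predict_proba(params, text):
    """Probability of positive sentiment under the logistic model."""
    z = params["bias"] + sum(
        params["weights"].get(word, 0.0) for word in text.lower().split()
    )
    return 1.0 / (1.0 + math.exp(-z))


print(round(predict_proba(restored, "I love this great phone"), 3))
```

Pickle can execute arbitrary code on load, so a service should only deserialize model files that come from a trusted source.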

Example of a data science workflow. Model training occurs in Greenplum Database and operationalization occurs in Pivotal Cloud Foundry. The model is accessible via an API request accessed from a Spring Cloud Data Flow data processing pipeline.

Jupyter Kernel Gateway

Jupyter Notebook is an essential tool in the data scientist’s toolkit. Deploying a notebook as a microservice offers the advantage of enabling a data scientist to operationalize her code while staying within the Jupyter environment—a setup often ideal for testing and prototyping.

Deploying a model to production requires crucial steps such as incorporating a security layer using API authentication, embedding data validation checks, supporting exception handling, etc.

Web frameworks such as Flask or Django include authentication support and many other components essential for bringing a model to production and building RESTful APIs.
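The production concerns listed above (authentication, data validation, and exception handling) can be sketched independently of any web framework. The token value, the batch limit, the payload shape, and the placeholder `score` function below are all assumptions made for illustration.

```python
import hmac
import json

API_TOKEN = "change-me"  # hypothetical; load from the environment in practice
MAX_TEXTS = 100          # hypothetical batch limit


def score(text):
    """Placeholder for the real model call."""
    return 0.5


def handle(token, raw_body):
    """Return (status_code, response_dict) for one scoring request."""
    # Authentication: constant-time comparison avoids timing leaks.
    if not hmac.compare_digest(token.encode(), API_TOKEN.encode()):
        return 401, {"error": "invalid token"}
    # Validation: the body must be JSON of the form {"data": [str, ...]}.
    try:
        texts = json.loads(raw_body)["data"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected a JSON object with a 'data' list"}
    if (not isinstance(texts, list)
            or not 0 < len(texts) <= MAX_TEXTS
            or not all(isinstance(t, str) for t in texts)):
        return 400, {"error": "'data' must be a non-empty list of strings"}
    # Exception handling: report failure without leaking a stack trace.
    try:
        return 200, {"polarity": [score(t) for t in texts]}
    except Exception:
        return 500, {"error": "scoring failed"}


print(handle("change-me", '{"data": ["hello"]}'))
print(handle("wrong-token", '{"data": ["hello"]}'))
```

A framework such as Flask would supply the routing and HTTP plumbing around a function like `handle`.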

Next Steps

We are continuing this work by incorporating our model into a real-time text analytics application. Check out our GitHub repository for updates.

About the Author

Chris Rawles

Chris Rawles is a senior data scientist at Pivotal in New York, New York, where he works with customers across a variety of domains, building models to derive insight and business value from their data. He holds an MS and BA in geophysics from UW-Madison and UC Berkeley, respectively. During his time as a researcher, Chris focused his efforts on using machine learning to enable research in seismology.
