Simplifying Data Science Workflows With Pivotal Cloud Foundry

July 2, 2015 Abby Kearns

Ian Huston and Alexander Kagoshima of Pivotal Labs delivered a presentation at the Cloud Foundry Summit 2015 demonstrating how they have used Pivotal Cloud Foundry to deliver data-driven applications to clients. Data scientists synthesize a wide range of skills in their efforts to understand complex data sets and deliver insights, and Pivotal Cloud Foundry enables practitioners to quickly get to work, rather than losing time setting up servers or performing operations tasks. They detailed how Pivotal Cloud Foundry simplifies their work through three common use cases: calculating risk for an insurance company, performing predictive maintenance on industrial machines, and analyzing customer behavior to improve products and services.

Ian listed three benefits that data scientists need from a platform: a place to store data and an easy way to capture that data, the ability to channel that data and store it long-term, and quick access to the stored data so they can perform large-scale computation. Most important is that the data scientists can then deliver results using the platform, whether those take the form of a website, a data API, or an interactive visualization.

Pivotal Cloud Foundry provides these benefits through its support of data services. It provides easy access to services such as Redis, MySQL, or RabbitMQ, as well as simple connections to user-provided services, such as an existing big data infrastructure running Apache Hadoop® or an Apache Spark™ cluster. Simple binding of services also enables easy switching between a test data store and a live production data store.

Alex expanded on the computation side of the process, explaining how Pivotal’s data scientists developed Pivotal Cloud Foundry buildpacks for R and Python. With access to the sophisticated algorithms these high-level languages provide, the data scientists can then extract large data sets from clients’ stores and begin developing models within a distributed big data platform. The challenge, then, is pushing the model and actionable insights back to the client.

Alex also detailed two approaches to using Pivotal Cloud Foundry in a big data context. One is to continue using an existing big data platform, with its robust computation capabilities and libraries, and then deliver insights through visualizations in a web app managed by Pivotal Cloud Foundry. Another approach is to leverage the distributed computation power that Pivotal Cloud Foundry provides, using the big data platform only for storage. To demonstrate how this is done, he showed a prototype prediction API deployed to Cloud Foundry which can be accessed through JSON. They then showed off a couple of demos, including a website that predicts the duration of traffic disruptions on London streets, and a web app that lets users explore an insurance data set.

In closing, Ian asked the community to share their stories on how people are using Pivotal Cloud Foundry for data science. Readers interested in sharing their stories can reply in the comments on this blog, or contact the presenters via Twitter. Ian can be reached @ianhuston and Alex can be reached @akagoshima.

Transcript:

Ian Huston:
Okay I think that’s the last few stragglers coming in now so, maybe we’ll get started. My name is Ian Huston, along with Alex Kagoshima here, and we’re gonna talk about data science on Cloud Foundry. And something Andrew Clay Shafer said in his talk this afternoon really resonated with me about we’re trying to build a community of practice, and I think that’s really what we’re doing here as well so we’re gonna talk a little bit about how we think about doing data science on CF, we’d also really like to hear any input from you, what you’ve done, what you’ve tried, what worked, what hasn’t worked. We’ll talk a little bit about how you can maybe get involved later.

So first of all, who are we? We’re both working as data scientists at Pivotal Labs, which is the agile software development arm of Pivotal. We both actually work in Europe, Alex in Berlin, myself in London. We’ve been using Cloud Foundry for the last few years to deliver data driven applications for our customers. What we really do with our customers is we really try and work with them to get value out of their data. Maybe just have a quick show of hands, who here would identify themselves as a data scientist? … Okay, we’ve got a few. So it’s not maybe as rare as I thought. And, who works with data scientists, or provides services or operations for data scientists? … Okay, so a lot more hands going up. And who has heard the buzz word, but doesn’t really know what a data scientist is, and wonders why I keep putting those words together? Anyone? You all know what Data Science is? Okay that’s great.

Really brief recap then, maybe, is to understand what is a Data Scientist and what is part of their job. So this Venn diagram was famously created by Drew Conway, and it kind of shows the mix of skills you need to have to be a data scientist. So you need programming skills, definitely, hacking and coding skills, but you also need quite a bit of math and statistical knowledge, and then to actually apply that to a problem you need domain knowledge in one area. When you get the intersection of all these three you get data science, and a data scientist.

Maybe a different way of saying it is this quote from Josh Wills, it says that “a data scientist is a person who is better at statistics than a software engineer, and better at software engineering than a statistician.” And the point about this is that we’re not really software engineers, we don’t have computer science backgrounds in the main, like I have a physics research background, and some of us have machine learning backgrounds, but we didn’t really go through a traditional software engineering education. And I think what that means is that a platform like Cloud Foundry is actually really ideal for us, because we are the people who really don’t want to get bogged down in setting up and configuring servers and maintaining and doing operations on them. Because really, we’re trying to get as quickly to business value by understanding data and providing some insights. So, where software developers in the past had to stand up servers themselves and provision and do those kind of things, as a data scientist that is really not my core skill, my core competency, so I want to be out actually doing a data science task. I don’t really want to be doing that. So that’s why Cloud Foundry is kind of interesting for us.

Briefly though, what are the kinds of projects that we actually work on? Well there are a wide variety of them, here’s three sort of straightforward examples. For example you could be an insurance company that wants to understand the risk. You have insurable risks, buildings in different places, and maybe you want to understand how natural disasters like earthquakes or flooding will affect those buildings. So how much money would you lose if a particular country or a particular region flooded, and so we have a client who’s trying to do this and they’re trying to run large scale very computationally intensive tasks. What we’re trying to do is trying to help them to run that in a parallel way, maybe to use it in database systems, and go from being able to run 1 or 10 of these statistical procedures to being able to run a thousand or ten thousand of them. To get a better understanding and a better insight, and in effect reduce the risk that they have.

We’ve heard a lot today about the internet of things or the industrial internet. Predictive Maintenance would come under that sort of heading. This is where we have some mechanical thing, maybe hard drives, or maybe it’s an oil drilling platform, and you’re trying to predict when it will fail because the cost of having that system out of production is very high. I’ve heard people have systems with a cost of hundreds of thousands of dollars if it’s out for one hour, or one day. If we can predict when those outages might happen, we’ll be able to either repair them in advance or send the right spare parts that need to be there. Or maybe take them out of production and put something else in its place in time so that we don’t actually get that downtime. So we do that with a mixture of large scale machine learning processes, understanding the live data feeds that are coming in from those industrial internet applications, and trying to predict and then take action because of that.

And then the third one here is understanding your customer. So lots of enterprises and large companies have siloed data where they understand a little bit about their customer over here, and another little bit over here, but these never talk to each other. So, trying to bring those together, trying to understand your customer from a holistic point of view, and then being able to provide better services, better customer experience because of that. And that’s quite a lot of what we do. But there’s a lot of other things, for example like trying to reduce fraud in banking, or trying to predict the destination of your journey in a car, and we do a lot of these different things and we want to be able to provide the data science services in a quick and easy way and get to those data driven apps.

So what does a data scientist really need out of a platform? Or what sort of infrastructure do they need to do their work? Really I think it boils down to three things. We need somewhere to store data and some easy way to capture that data. So for example in the internet of things there is a wide variety of different types of data coming in from different devices. We need a way to be able to channel that data somewhere and be able to store it long term, and be able to access that easily as well, like not have it in long term storage which is very hard to get at. For example, I’m working with a client at the moment and we tried to do a data extract, a relatively small size of data, like it would fit in my free Dropbox account, but it took over 24 hours to get that extract out. That was 24 hours we couldn’t work on the data. So we need somewhere easy to put data and access it. We need somewhere where we can do large scale intensive computations, so running at scale with distributed computation systems like Apache Spark, or on top of Hadoop, the MapReduce paradigm, that kind of thing.

But finally, and this is where we really get to value, we need to be able to deliver results. Whether that’s purely just as a list of results on a website, or it’s a data API where someone can go access it and get predictions for different things. Alex is going to talk a little bit about that. Or it might be simply an interactive sort of data visualization where you’re able to explore the data and see what the consequences are. So we need all three of these things. I’m going to talk about the first one and Alex is going to talk about the next two.

So I think the first of these is data storage. How do we get data in and how do we keep it somewhere. In Cloud Foundry terms, platform terms, these are data services. We want an easy way to get access to these services without me having to go and download Redis myself and install it and try and tune it. Or an easy way to get a key value store and just push things towards it. I also want to be able to build an application that can actually feed that relatively easily as well. So instead of just getting someone to deliver me a hard drive and I have to load it up somewhere with Internet of Things and online real time streaming data, we’re gonna get these streams of data in, and we’re gonna need to be able to do something with it quickly.

So there’s kind of a natural way of doing this in CF with data services, so you can have your managed service, and there’s lots of examples now and we’ve heard a lot about these today and will tomorrow. But even things like highly-available MySQL or Redis or even RabbitMQ. We want to be able to create them easily, and want to be able to bind our applications to them as well. But you know lots of people have dedicated stand alone big data infrastructure, they might have their own Hadoop installation, something like an Apache Spark cluster or whatever else. User provided services allow you to connect to those really quickly and easily, and enable you to use your existing infrastructure without having to manage it through Cloud Foundry. Now you may want to get to the point where you manage it and provision it using something like BOSH, but using user provided services for now lets you meet that distributed data requirement today if your service isn’t managed by CF at the moment.

And one good way of thinking about this is the ease with which you can switch from a test data store to a real live production data store. You know, a sort of traditional way of doing this in data science, you might have to actually go and edit your files and change the way the data flow happens. Here, we can just bind to a different service so I can have one app pushed to CF that is bound to my test PostgreSQL instance, and then I push another app but I bind that to my production instance, or I switch between the two. So that provides a really easy way of going from one to the other. So that was the data services part. Alex, you’re going to talk a little bit about the computation and the delivery of results.
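As a rough illustration of why rebinding needs no code change: Cloud Foundry passes the details of bound services to an app as JSON in the VCAP_SERVICES environment variable, so the app can look its data store up at run time. The sketch below is not from the talk; the helper name find_service_credentials, the instance name analytics-db, and the connection URI are all hypothetical, and the example environment is built by hand to mimic what the platform would inject.

```python
import json
import os


def find_service_credentials(service_name, environ=None):
    """Return the credentials dict of the bound service instance with this name."""
    environ = os.environ if environ is None else environ
    # Cloud Foundry exposes bound services to the app as JSON in VCAP_SERVICES.
    services = json.loads(environ.get("VCAP_SERVICES", "{}"))
    # Top-level keys are service labels; each maps to a list of bound instances.
    for instances in services.values():
        for instance in instances:
            if instance.get("name") == service_name:
                return instance.get("credentials", {})
    raise LookupError("no bound service named %r" % service_name)


# Hypothetical environment, hand-built to resemble what the platform injects
# for a user-provided service. The name and URI are made up for illustration.
example_env = {
    "VCAP_SERVICES": json.dumps({
        "user-provided": [
            {
                "name": "analytics-db",
                "credentials": {
                    "uri": "postgres://user:secret@test-host:5432/testdb"
                },
            }
        ]
    })
}

credentials = find_service_credentials("analytics-db", example_env)
print(credentials["uri"])
```

In this sketch the connection details live entirely in the environment, so pointing the app at a different instance changes what the lookup returns rather than the code. If the test and production instances carry different names, the name itself could be read from configuration in the same way.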

Alex Kagoshima:
Sure. Thanks again. So I’m going to talk a little bit about the compute part, so on the one hand I’m going to explain a little bit what are the typical challenges when we work on actual customer projects with this, and show the concept of a little prototype we developed. But first of all, as data scientists what we usually do in our work is we implement code. So some people have this image that we stand in front of a white board with a lab coat and then code stuff and see or something like that. That’s not how it is. So what we use mainly is Python and R, so these are two fairly high level languages. The reason we use them is because they have really good library support for a lot of machine learning algorithms, so these are really our favorite tools.

So when Ian and I started out working on Cloud Foundry, the first thing we found is there’s no R buildpack, and the Python buildpack that was there is kind of … let’s say it doesn’t really have a lot of the libraries out of the box that we usually need. So what Ian did is he used the Anaconda Python distribution by Continuum Analytics and built his own buildpack out of it, and if we use that there’s a lot of stuff like scikit-learn for example which is a machine learning library, and we can use that out of the box so that was very handy.

I used a lot of R, especially in university. I’m a big R guy so what I did was I created the buildpack for R, which was kind of challenging but at some point I got it done. So these two things were really helpful and really essential before we could actually do anything on Cloud Foundry, right? So first things first we had the buildpacks which was good. So let’s take a look at our usual work. So Ian already mentioned briefly what we do is we work as kind of consultants for customers of the Pivotal Big Data Suite. And what we do there is we kind of try to get some meaningful, valuable information out of big data sets. So the way this happens in practice, so we work with a lot of enterprise customers so you see these siloed data and siloed systems at the customer, and then what we do is that we get a big data extract and put all of this in some kind of distributed big data platform. Which is nowadays usually HDFS, and then we work on top of it with Spark or something else. It could also be Greenplum, that’s an MPP relational database. Once we have it there we are happy data scientists, we can see all the data with great speed so we don’t need to go through long running extract processes because these already took place.

So we already pushed everything over there, and what we do over there is we develop the actual models. So we think about how we can, let’s say for a specific customer, predict their lifetime value, for example. We use different statistical models, machine learning models that we train there. So we show a lot of data to that particular algorithm and then that algorithm learns how valuable a customer is. So everything happens over there. But the big problem is actually how do we push this model back here, because the business actually needs the prediction here, in their legacy system landscape, right? So that’s actually kind of a big issue that we face in a lot of our customer engagements.

Very often we’ve created a really fancy model, a really great algorithm, but then we show a PowerPoint and the model kind of dies in the PowerPoint is what we say, so not a lot happens. So this is kind of the issue that we have and we were looking at some ideas on how to solve that with Cloud Foundry which leads to roughly two thoughts on how you can actually do data science on Cloud Foundry. So this is just a very rough idea on how we think about this, there’s a lot of different variants to it, but essentially so what you can do … Let’s start here on the right side. What you can do is keep using your big data platform which is good because there’s a lot of libraries there, you can use Spark, and you can do the computation on the data in place, which is very good. You kind of use Cloud Foundry mainly as a visualization thing. So once you have some aggregated results, you’re able to show it to your customer in a web app that you deploy in Cloud Foundry, which is good.

The other approach is that you actually somehow try to leverage the compute power that’s available in Cloud Foundry and use the big data store just for storage. So you don’t do any computation in there. So these are the two different approaches. There’s also some variants to it. Let’s say you don’t want to store the data for some reason, then you can just leave that out and just do some online learning computations up there, so there’s different variants to it, but these are the two rough ideas how you could do it.

So what we did is we created this prototype of a prediction API we call it. So what we want to do with it is basically have a better way of actually interfacing with other software. So this is actually deployed at http://dsoncf.cfapps.io. When you go on there you just get the readme landing page basically, which tells you how you can send JSON there to do stuff with the API. And if you’re in the Pivotal organization on GitHub you can actually get the code here. So what does this do? So basically you have this REST API endpoint and you can send a request that says ‘hey, create me a model’, which then creates a model in the back end. That model then is able to ingest data, so you send the data as a JSON blob as well. And it’s kicking off some periodic re-training, so in machine learning there’s this notion of training. You show the model a lot of data and then the model gets smarter and smarter about the data.

So this framework is actually able to do some periodic re-training, saves everything in Redis for now, which you can bind really easily on Cloud Foundry and then you can also kind of send scoring requests to this API so you let it know about a data point, for example all the transactions of a customer, and then the model gives you a prediction back on how valuable that customer is, for example. So this is kind of the API idea that we have and on how we can actually leverage Cloud Foundry for Data Science.
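The create, ingest, re-train, and score flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the prototype’s actual code: the class names PredictionService and MeanValueModel, the method names, and the JSON shapes are all hypothetical, a plain dict stands in for the Redis store, re-training is triggered by hand instead of periodically, and the "model" is a toy that predicts the mean of the values it has seen.

```python
import json


class MeanValueModel:
    """Toy model (hypothetical): predicts the mean of the values seen in training."""

    def __init__(self):
        self.mean = 0.0

    def train(self, rows):
        # Learn from all stored rows; each row is a dict with a "value" field.
        targets = [row["value"] for row in rows]
        self.mean = sum(targets) / len(targets) if targets else 0.0

    def score(self, row):
        # A real model would use the row's features; this toy ignores them.
        return self.mean

    def get_parameters(self):
        return {"mean": self.mean}


class PredictionService:
    """In-memory stand-in for the API; dicts replace the Redis store."""

    def __init__(self):
        self.models = {}
        self.data = {}

    def create_model(self, name):
        self.models[name] = MeanValueModel()
        self.data[name] = []

    def ingest(self, name, json_blob):
        # Data arrives as a JSON blob containing a list of rows.
        self.data[name].extend(json.loads(json_blob))

    def retrain(self, name):
        # The talk describes periodic re-training; here it is called by hand.
        self.models[name].train(self.data[name])

    def score(self, name, json_blob):
        row = json.loads(json_blob)
        return json.dumps({"prediction": self.models[name].score(row)})


service = PredictionService()
service.create_model("customer_value")
service.ingest("customer_value", json.dumps([{"value": 10.0}, {"value": 30.0}]))
service.retrain("customer_value")
print(service.score("customer_value", json.dumps({"customer_id": 7})))
```

Keeping the model behind a small set of methods means the service code does not need to know which algorithm is in use, so a toy like this one could be swapped for a real learner without touching the request handling.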

We created this kind of interface, which means basically if you want to create a model in that kind of framework you have to implement this class interface, which means you need to have a train function, a score function, and a get parameters function. And they’re all done in Python. And by the way this is using Ian’s Python buildpack, I mentioned previously. So what are some data driven applications we did, what are some examples of our work. So one thing, which is really cool, which Ian created is this Transport for London demo. So what this does is it creates a live feed of all the disruptions on London streets, and then you can see the current disruptions that are happening. But what it also does is it gives you a prediction on how long these disruptions are going to last. And that is based on historical data, so we scrape this data feed, store it, show the live status, put some predictions in there, and the model also gets periodically re-trained on the historical data. You can access it right there.

Ian Huston:
I think it’s fair to say that it’s like the simplest possible way of using Cloud Foundry. It’s just a website.

Alex Kagoshima:
This is basically the right approach that you see here. Another thing that I created with my R buildpack is … we call it insurance demo, so it’s basically an insurance data set, and this app basically allows you to explore the data a little bit, and the goal here is to find valuable new customers. What you can do in this app is try to create some rules manually, but also it lets you just train the model that picks out these customers for you, and then you can compare the performance of your manual rules and the model. And the model is usually a lot better.

Ian Huston:
And that’s an example of the second one where the computation is actually happening in the Cloud Foundry app itself. So it’s not happening on the big data platform.

Alex Kagoshima:
Yes, so it’s possible in this case because the data set is really small. It’s like a megabyte or something like that. Okay so these are two examples of data driven applications we did. With that I’m going to hand it over to Ian again.

Ian Huston:
I think these are two public examples. We’ve done quite a lot of customer work as well where we’ve used these ideas and we’ve gone a bit further in those. But what we really want to hear is about the rest of the community and what they’re doing. I’ve already gone down to the GE booth and heard a little bit about Predix and I’m sure there’s a lot of other examples in the community where people are using Cloud Foundry to not only just display results, but maybe provide data APIs and understand some of the issues we’re talking about. So we’d be really happy to hear anything that anyone has to say about that, and we set up this website as a place where you can like just show examples of how to do these kind of things. You can send us something on our Twitter accounts. But also we’d be happy to hear right now if anyone’s doing any of this, or if you’ve any other questions as well.
