Using Data Science in Health and Life Sciences

July 22, 2015 Simon Elisha

In this episode, we speak with Sarah Aerni, Pivotal’s principal data scientist who leads our Healthcare and Life Sciences vertical. In the podcast, we explore with Sarah some of the work she’s been doing in healthcare and life science, and how that looks from a data science perspective.

TRANSCRIPT

Announcer:
Welcome to the Pivotal Perspectives Podcast. The Podcast at the intersection of Agile, Cloud and Big Data. Stay tuned for regular updates, technical deep dives, architecture discussions and interviews. Now let’s join Pivotal’s Australia and New Zealand CTO Simon Elisha for the Pivotal Perspectives Podcast.

Simon Elisha:
Hello everyone, and welcome back to the Podcast. Great that you could make the time as always. Today I have a very special guest. I love to have a guest on the Podcast because it makes it much more interesting and interactive because we can learn different things. Today’s guest is Sarah Aerni who is Principal Data Scientist and leads our healthcare and life sciences vertical. She works in some pretty cool areas. Welcome to the Podcast, Sarah.

Sarah Aerni:
Thank you. Nice to be here.

Simon Elisha:
Thanks for joining us. I think you’re joining us from San Fran today, is that correct?

Sarah Aerni:
That’s right, yeah.

Simon Elisha:
Yeah, fantastic. It’s very early where I am, and a little later in the day where you are. I know that whenever we talk on the Podcast and we talk about data science we get a lot of feedback. A lot of that feedback is around what do people actually do? It seems like data science has become a bit of a mysterious space. People are like, “Well, you can use data science to do this, and you can use data science to do that.” What people really like to get to hear about is the reality of the situation, the on-the-ground experience. I think, Sarah, maybe let’s talk about some of the work you’ve been doing in healthcare and life science, and how that looks from a data science perspective.

Sarah Aerni:
Absolutely. Of course, data science is in all verticals, but I focus mostly in those, so I can share my experiences there primarily. There are, I think, multiple ways of approaching data science in that space. I think frequently people think of data science as being something around optimizing, maybe advertising dollars or, potentially, how to hold on to your customers. That is also true in healthcare and life sciences. You can think about a hospital itself being concerned with how to make sure that people are interested in coming to that particular hospital, or certainly a payer, like a healthcare payer which we have in the United States, that provides the health insurance, the payment dollars. They would then, of course, be interested in figuring out how to retain or hold onto their particular members in the health plan.

It actually extends far beyond that. Data science in healthcare can be the type of work that we’ve done around figuring out how to treat patients better. How to keep them from returning back to the hospital, meaning that we provide them with better care to try to lower readmissions rates, to try and determine how long a patient might stay in the hospital when they’re admitted. Using their patient record, understanding how many times they’ve come in previously and how healthy the patient is from a data-driven approach. Again, looking at their record to predict how long they’re going to be in the hospital this time around. That’s on the healthcare side. Of course, we have a lot around patient monitoring.

Other things like accounts payable at a hospital. Is there any fraud or waste that can be detected? Then the side on life science, which includes the pharmaceutical industry where we’ve done things like try to predict drug targets using, again, a lot of different data sources to figure out if there is a signal to figuring out how to bring a new drug to market, or what to bring it to market for. Optimizing manufacturing, so trying to predict at the end of an eight month manufacturing process for a vaccine, one project that we did there, and whether or not it’s going to be a high-quality product. Again, using the data that’s coming off of the machines during manufacturing. Can we optimize that pipeline? There are a lot of examples across the board there.

Simon Elisha:
There’s lots of dimensions as you say. One of the things that’s interesting about data science in general, this domain in particular as well, is that often when we come into a problem domain we’re trying to prove-out a theory or a hypothesis. We’re trying to think that we can find some sort of improvement. When you do things like, for example, you talked about the patient admissions stuff and understanding how long people stay, etc., were there insights that were found, or do you find that often you explore but don’t find any genuine cause or relationships? What’s the experience in this particular domain?

Sarah Aerni:
Yeah, that definitely is an interesting question. I would say we always find insights. That’s one of the biggest things that comes out of any project that we do with our customers on their data sets. The insights actually come even before building any sort of model. Sometimes it’s just having the experience of knowing how to poke around data sets. In the example of length of stay, actually, we found the types of insights you would expect, which is, for example, someone who’s in the hospital, and not only whether or not they leave, at what point they leave, but trying to predict where they leave to. If you leave a hospital you might leave for home. You might, unfortunately, not leave alive, or you might leave for a nursing facility.

Interesting that some of the things, the decision that patients make, which seemed obvious, like do not resuscitate, would determine whether or not you end up, for example, in a skilled nursing facility, as opposed to dying. Which, of course, which of those outcomes is better is up to the person to decide, but it’s an obvious thing, potentially. There were other things around a nurse’s shift change. When that actual shift change occurs, actually driving length of stay because of operational inefficiencies that are introduced by doing it during an hour when patients might be leaving the hospital. Those are insights that seem trivial, but they need to be surfaced, and often through data-driven approaches because how would they even know to look at that?

Simon Elisha:
Yeah. It’s real interesting. It’s almost taking a manufacturing operational view, to some extent, of the work that goes on in a healthcare facility and optimizing it as much as possible. When you consider how expensive healthcare is, how short on resources most facilities are, if you can improve outcomes by just adjusting, as you say, shift changes or destinations of customers that’s got to be a huge benefit to many of these providers, and to patients as well, for patient care.

Sarah Aerni:
Absolutely. Beyond that, of course, I don’t want to trivialize the fact that a lot of the patient histories themselves are informative. Being able to assess how healthy or sick a patient is, and not doing that in a very top-down, knowledge-driven way of saying, “I know this patient has these diagnoses, therefore is on the Charlson index at some point.” Instead you’re saying, “No, well, actually these indicators together, using more complex algorithms that allow you to recognize, potentially, interaction in terms of presence of one disease and another.” Not necessarily from a very physician’s point of view of being able to say, “I think this patient is very sick,” but instead of a data-driven view where it says, “The data support that this is actually a bad combination.”

Simon Elisha:
Yeah. In terms of that data, because I think we’re all familiar with the classic poor handwriting of the doctor on the chart sort of thing, what’s the reality in your experience in the healthcare sector in terms of the types of data sets that are available? Are you using mainly structured data that is well-codified, or are you using a lot of un-structured data or ancillary data as well? What’s the approach you take to data set collection before you do your analysis?

Sarah Aerni:
For us we like to have access to all the data sets that are available. In general, in the healthcare space these are largely structured or semi-structured data sets at the moment. We generally don’t get access too frequently to notes. Certainly when you think about what you’re describing with the handwritten challenges, a lot of healthcare providers don’t necessarily have that data available for us to work with, at least initially. You always have to bring in the perspective of at what point is it inefficient or efficient to bring in that data set, how much value is added, and then, of course, whether or not it’s actionable.

At times, yes, for patient history those notes might be very critical for capturing some undiagnosed disease that might not have made it into the patient record in a structured format. At the same time if it’s something where as it’s entered we want to drive an action immediately, then although the notes would be helpful for assessing a patient in the past, sometimes there are ways of approaching it or inferring it to allow you to drive action in a different way. It varies.

Simon Elisha:
It’s the classic “It depends” answer, I guess.

Sarah Aerni:
Yeah.

Simon Elisha:
It sounds to me like data science is reasonably well-established in this space. How long have people been using data science or variants thereof in healthcare and life sciences?

Sarah Aerni:
Certainly data science, I think, is in healthcare and a lot of sciences often re-branded. We do meet a lot of people that have extensive experience. If you go into actuarial sciences you might argue that that is a very early data science field that is now evolving. That dates very far back.

Simon Elisha:
Yeah.

Sarah Aerni:
Certainly any decision support and those types of things have been around for decades, and also are the foundation of data science. I think in the life sciences space, the most rapid growth certainly around drug development and figuring out diseases and pathways, that space has always been around in terms of just being science. I think the unique pieces around all of these different areas is the availability of data, is what has changed everything.

With data the evolution of new techniques, where you don’t have to take that what you’re describing as a hypothesis-driven approach of, “I have this idea. Let me go and collect the data that can either support that this is true, or disprove it so I can go after something else.” Instead we have all of this data and access to all of this data that now allows us to explore many hypotheses or potentially just explore the data and see what’s there at all.

Simon Elisha:
Fantastic. You talk about some of the new opportunities, there being more data than ever before and more opportunities than ever before. What are your thoughts around some of the biggest hurdles? We’re still not maybe where we want to be. What do you run into from time to time that can cause frustration or things that are obstacles to getting to where you want to go?

Sarah Aerni:
I think there are a couple of hurdles that I see most obviously in this space. The first one is just the fact that in this space, access to data is a challenge frequently. That has to do with the fact that different components have come online at different times and systems have evolved. For example, in healthcare there are different systems that collect different parts of a patient’s record, and they tend to sit in silos even if they are part of a greater record. Trying to access those all, for example, if you want access to a raw image, that’ll live in one system that although it can communicate across to the main medical record, you’re not going to necessarily get access to the raw image itself.

Imagine we’re looking at a patient that has cancer, and of course we’re interested in trying to predict an outcome for this patient. That would be important to make a decision on how to treat the patient. Which type of therapy are we going to go after, what are the best outcomes for these patients? We have an ever-increasing amount of data for them. That could be their record, their history, but also could be the raw images themselves, so anything the pathologists could access about the state of the patient.

If you look at histology, for example, look at the raw images from a tumor and try and access. A pathologist would say, “Well, this is what the stage of the cancer might be,” and give a prognosis. Actually, there’s evidence if you look at the raw data and try and build models on the images themselves you might do a better job. Getting access to that data in conjunction with all of the other data sets is one of the first challenges, I would say.

Simon Elisha:
Do you have a challenge within that? Because it’s interesting, obviously, in healthcare there’s clearly a strong degree of privacy and data handling consideration. How does that play into the work that you do for these organizations?

Sarah Aerni:
That’s absolutely true that that is a challenge. However, a lot of the organizations, for example healthcare providers, would get around that, and certainly the pharmaceutical industry as well, they would get around those challenges by keeping all of this inside. Although our company talks about cloud enablement, there is the private and public cloud component. There are certainly ways of mitigating that risk around staying compliant, and then certainly around deciding which patients allow what type of data to be accessed. There is that component in the basic research side around just dealing with maybe cell lines rather than actual cancer patient data.

Even in that space, though, the fact that the image processing experts and maybe the chemical structure experts and the genomics field, the ever-evolving field of DNA and RNA, that space, we are talking about people with very deep expertise in separate areas. It’s interesting the way technology itself has generally been modeled to perform well with certain types of data. Classically, physicists have engineered very large supercomputers that are good at something which happened to work very well for molecular dynamics, but now that we’re dealing with large data volumes and we have moved down this, I guess, road of Hadoop and how do you distribute data and compute in a distributed fashion? The space of, I would say, bio-medicine, which evolved before large data sets existed is in this mindset, like a one-track mind of sticking to these old supercomputers, and having to make that transition over has been relatively slow.

I think another one of those challenges that I see is where people in pharma and healthcare have to make the decision to make an actual monetary investment to have a seat at the table to now get to control and drive direction a little bit. Technologies evolve to, basically, make money by addressing the needs of communities that are going to use them. Pharma and the healthcare space in general get to stake their claim and say, “We’re willing to pay, and as a result we can now drive the technologies to cater to our needs.” That’s another challenge I see that they’re facing.

Simon Elisha:
That’s really interesting because, I guess, in the past these fields have always been associated with high-tech and investment in data, etc. It’s almost like the mainstream handling of big data and high-volume data, etc., has leapfrogged where some of those organizations are. Do you think that the introduction of a lot more instrumentation of medical machinery … We use the phrase “Internet of Things”, essentially, the fact that there is so much more technology in the facility associated with customers. You’ve now got a lot of day-to-day tracking with things like FitBits and Apple watches and those types of things as well. There’s more data points than ever before, certainly in healthcare. Do you think that’s going to change the view of these organizations and the way they process things?

Sarah Aerni:
Yeah. I can already see that there are just lots of companies that are partnering with these big tech companies, Google. Of course, Google within itself has organizations that are organizations. Google X is going after some of this space. What’s nice is that, just like you’re saying, these companies that have evolved to feed consumers’ need, maybe, to quantify themselves and to work to figure themselves out, well, those things are now enabling in healthcare and life sciences, absolutely.

There are still, I think, challenges with the more traditional types of research medium that we’re seeing. The fact that the genomics space in particular has changed a lot, and there’s been a lot of an evolution. I think initially at the baseline when you really think what has happened, even when you look at some of the tools that have been created, what’s interesting is they reinvented the wheel in a lot of ways, MPP databases, because that was something that was evolving at the time when that was already an existent … The human genome project is decades old, and definitely pre-dates MPP databases, but the meat, actually, was there already.

Hadoop could have greatly enabled any faster compute, but when the products out there that everyone is using are based on some sort of other system, then there has to be a major paradigm shift that takes place. I think, yes, absolutely with the quantified self and sensors and that will help a lot in the medical device space. I think the other areas are still undergoing some sort of transformation to understand how to agree as a community that this is the best approach, and so we should all switch over to it.

Simon Elisha:
I think it’s interesting in any industry vertical it’s always challenging to alter the status quo and to move from one accepted approach or technology to another. Particularly, what happens is if one person is a little bit of a pioneer or an experimenter or an early adopter and gets great results, then the others follow pretty quick. It’ll be interesting to see how that evolves over time.

Sarah Aerni:
Yes. Particularly in an industry where pioneering is probably not a top-of-mind for any pharma company. They go with large blockbuster drugs, for example, which is one sort of approach. There needs to be, maybe, a shift on other things, and we’ve seen this now. Some pharma companies are really making these massive shifts and taking more risks, and it’s really great. I think we’ll see immense progress.

Simon Elisha:
Fascinating. Now Sarah, one of the things we always do, particularly when we have data scientist guests on the show, is to ask, “How did you get here?” Because people often say, “What’s the road to becoming a data scientist?” Share a little bit about your journey with us, please.

Sarah Aerni:
Okay. My road to data science is all founded in the fact that I played tennis.

Simon Elisha:
An obvious starting point there.

Sarah Aerni:
Yeah. No, but sincerely what happened was I was always interested in biology, and as a result of accidentally tennis, I needed to take a computer science course in high school because typing just wasn’t offered. I found out I was interested in computer science as well. When I started my undergrad, the same year they opened up a specialization in what’s called bioinformatics which is using computers to study biological problems and systems. I just continued on that path, did some research. As I said earlier, it’s a re-branded version of data science. It’s using computers and machines to study data coming out of the biological or bio-medical domain. I think mine is probably a little bit less strange, but it is the tennis.

Simon Elisha:
It’s definitely the tennis is the key. It’s interesting, so far I have to say I’ve been fortunate to meet quite a few data scientists in various countries, and almost without exception they have an unusual or interesting story as to how they got to what they do. Again, maybe we need a data science engagement to figure out if there’s some sort of link there in unusual, unconventional backgrounds or cross-disciplinary backgrounds. Certainly, no one has the same story, which is, I think, an interesting factor in that whole domain.

Sarah Aerni:
Yeah, that is actually interesting. That’s natural in something that didn’t exist at the time when we all began the process toward this particular field.

Simon Elisha:
For sure. Fantastic. Sarah, it’s been really enlightening speaking with you. I’m sure the listeners have got a great deal of insight into the space, and more specifically around the health and life sciences space. Thank you so much for joining us today.

Sarah Aerni:
Yeah. Thanks very much for having me.

Simon Elisha:
Fantastic. Thanks everyone for listening as always. We do love to get your feedback. You can send us any suggestions you have, podcast@pivotal.io. Until then, keep on building.

Announcer:
Thanks for listening to the Pivotal Perspectives Podcast with Simon Elisha. We trust that you’ve enjoyed it and ask that you share it with other people who may also be interested. We’d love to hear your feedback, so please send any comments or suggestions to podcast@pivotal.io. We look forward to having you join us next time on the Pivotal Perspectives Podcast.

About the Author

Simon Elisha is CTO & Senior Manager of Field Engineering for Australia & New Zealand at Pivotal. With over 24 years of industry experience in everything from Mainframes to the latest Cloud architectures, Simon brings a refreshing and insightful view of the business value of IT. Passionate about technology, he is a pragmatist who looks for the best solution to the task at hand. He has held roles at EDS, PricewaterhouseCoopers, VERITAS Software, Hitachi Data Systems, Cisco Systems and Amazon Web Services.
