Data Science Case Study: A Healthcare Company's Journey To Big Data

November 24, 2014 Hulya Emir-Farinas

Joint work performed by Hulya Emir-Farinas and Sarah Aerni with help from Noah Zimmerman, Emily Kawaler and Ailey Crow.

Adopting a new technology is never a trivial task, and introducing a brand new tool into a data scientist’s toolset is no different. The resistance to change is especially high in companies that employ tens or hundreds of statisticians. Understandably, analysts have learned to love their tools and live with any shortcomings. The effort required to learn a more efficient tool often seems too great, even if such a transition would lead to long-term time savings. This is where Pivotal Data Labs (PDL) comes into the picture, using a team of highly skilled data scientists and engineers to prove results to our customers, such as:

Shorter time to insight and to market
Better utilization of all captured data (both structured and unstructured)
Improved model quality and better decision-making
Minimized data movement and need to create multiple copies

In this blog post, we describe an example journey to technology adoption, executed through a series of data science engagements that solved real problems for our customer, a major healthcare provider. This customer has a large research division and, as a trailblazer in preventive healthcare, employs many accomplished clinicians and biostatisticians who are limited by the analytics tools that they use. The journey they took shows how analytics can be done faster and better through a series of five projects (Figure 1). Each project answered different questions, proving the need for and utility of new tools in advancing their data science practices, improving their business, and ultimately leading to the decision to adopt new technology.

PHASE 1: Prove Better Technology Speeds Up Discovery

Their journey started with a hackathon. They invited four vendors to a 24-hour event and provided medication order history and environmental sensor data. In 24 hours, we accomplished the following:

Showed that there is a correlation between a measured environmental factor and the prevalence of a chronic respiratory disease.
Predicted who is most likely to have an admission related to a chronic respiratory disease in the next three months.
Demonstrated that patients who do not pick up their medication from the pharmacy are more likely to have expensive hospitalizations.
Built a population management tool for physicians and a mobile app for patients where both apps were powered by predictive models built during the hackathon.

This hackathon served as a proof point that the platform could ingest, analyze, and visualize large-scale, previously unanalyzed data in a very short period of time. The customer was convinced that Pivotal’s big data technology platform enabled more rapid discovery and insights. However, they also wanted to see whether that power could be used to improve the quality of existing models.

PHASE 2: Prove Better Technology Can Improve Model Quality

The customer presented PDL with a model that researchers and statisticians had worked on extensively. Their model for predicting the length of stay of patients admitted to the hospital for Acute Myocardial Infarction (AMI) was state-of-the-art and was the most accurate model published in the academic literature. Our goal was to demonstrate that, by using new technology, they would be able to leverage more of their data and, using data-driven (rather than hypothesis-driven) approaches, improve the model quality.

In 3 weeks, we were able to engineer over 300 rich features, experiment with many different model forms, and build an ensemble model that doubled the accuracy of their baseline model. Some of the insights from this effort were very interesting to our customer:

We proved that length of stay (LOS) couldn’t be explained by biology alone: operations, nurse schedules, and the hospital’s experience in cardiology also played a big role in explaining the variations in LOS.
Model fit for LOS for AMI is influenced largely by the most recent information available on a patient, namely data from the current hospitalization and recent laboratory test values, as shown in Figure 2. By removing various groups of features and seeing the effect on model fit in the test set, we were able to assess the value of each group of features.

Figure 2: (Left) General categories of the more than 300 features used in the modeling exercise, color coded by the groups given in the chart. (Right) Decrease in model accuracy on test data when each feature group is excluded from the model, compared against the baseline (yellow) with all features included.
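The feature-group comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the model or code used in the engagement: the group names, the synthetic data, the ordinary least squares fit, and the R-squared score are all hypothetical stand-ins chosen so that the example is self-contained.

```python
import random


def ablation_study(feature_groups, fit_and_score):
    """Return the drop in test-set score when each feature group is left out."""
    all_features = [f for cols in feature_groups.values() for f in cols]
    baseline = fit_and_score(all_features)
    drops = {}
    for group, cols in feature_groups.items():
        kept = [f for f in all_features if f not in cols]
        drops[group] = baseline - fit_and_score(kept)
    return drops


def ols_fit(X, y):
    """Least squares with an intercept, solved by Gauss-Jordan elimination."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]


def r_squared(coef, X, y):
    preds = [coef[0] + sum(c * v for c, v in zip(coef[1:], row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, preds))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1 - ss_res / ss_tot


# Synthetic data: the target depends strongly on a1, weakly on b1, not on c1.
rng = random.Random(0)
names = ["a1", "b1", "c1"]
groups = {"group_a": ["a1"], "group_b": ["b1"], "group_c": ["c1"]}
data = [[rng.gauss(0, 1) for _ in names] for _ in range(300)]
target = [3.0 * r[0] + 0.5 * r[1] + rng.gauss(0, 0.5) for r in data]
train, test = slice(0, 200), slice(200, 300)


def fit_and_score(features):
    """Fit on the training rows and score on the held-out rows."""
    idx = [names.index(f) for f in features]
    X = [[r[i] for i in idx] for r in data]
    coef = ols_fit(X[train], target[train])
    return r_squared(coef, X[test], target[test])


drops = ablation_study(groups, fit_and_score)
for group, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
    print(f"{group}: test R^2 drops by {drop:.3f} when excluded")
```

Because `ablation_study` only needs a callable that fits and scores a feature subset, the same loop would work with any model form, including an ensemble.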

PHASE 3: Prove Better Technology Can Be Accessible To Non-technical Experts

With new technology, being able to take advantage of existing talent is critical for two reasons: to drive overall adoption and to benefit from the full potential of an organization’s data. In this proof point, the PDL team collaborated with Pivotal Labs to create an application that allowed clinicians and data scientists to quickly generate rich features on hundreds of millions of patient records stored in HDFS without writing any code.

The valuable knowledge physicians possess about patients can contribute greatly to modeling exercises, for example, in identifying valuable features of patient readmissions or adherence to smoking cessation programs. However, without the ability to write code, it is often difficult for them to translate their clinical knowledge and explore different hypotheses by processing and visualizing the data. This also applies to patient diagnoses. The coding systems in use (ICD-9 in this case) are highly specific. By grouping codes and describing a patient’s comorbidities at a less granular level, for example using CCS codes or the Charlson index, we find the analysis to be far more informative. Furthermore, depending on the particular application, a physician may only be interested in capturing newly diagnosed conditions (incidence) prior to a particular procedure and only within fixed windows of time.
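To make the idea of grouped codes and windowed incidence concrete, here is a minimal Python sketch. The grouping table, the group labels, and the function name are hypothetical and purely illustrative; they are not real CCS categories or the logic of the application described in this post.

```python
from datetime import date, timedelta

# Toy grouping of ICD-9 codes into coarser labels. The labels are
# illustrative only and do not correspond to actual CCS categories.
CODE_TO_GROUP = {
    "410.01": "acute_mi",
    "410.11": "acute_mi",
    "250.00": "diabetes",
    "250.02": "diabetes",
}


def new_conditions_before(diagnoses, procedure_date, window_days):
    """Return the groups first diagnosed within `window_days` before a procedure.

    `diagnoses` is an iterable of (icd9_code, diagnosis_date) for one patient.
    A group counts as newly diagnosed (incident) only if its earliest
    occurrence falls inside the window.
    """
    window_start = procedure_date - timedelta(days=window_days)
    first_seen = {}
    for code, when in diagnoses:
        group = CODE_TO_GROUP.get(code)
        if group is None or when >= procedure_date:
            continue  # unmapped code, or recorded on/after the procedure
        if group not in first_seen or when < first_seen[group]:
            first_seen[group] = when
    return {g for g, first in first_seen.items() if first >= window_start}


history = [
    ("250.00", date(2012, 3, 1)),   # long-standing condition
    ("250.02", date(2014, 5, 20)),  # same group, so not a new diagnosis
    ("410.01", date(2014, 6, 1)),   # first occurrence inside the window
]
incident = new_conditions_before(history, date(2014, 6, 10), window_days=30)
print(incident)  # {'acute_mi'}
```

Grouping before checking the window matters: the second diabetes code falls inside the 30-day window, but the group was first seen years earlier, so it is not reported as incident.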

To enable physicians to profile their patient population, we created a web-based application capable of processing hundreds of millions of patient records in seconds to generate profiles of the patient population (Figure 3). The application was flexible enough to allow the physicians to

choose various levels of granularity (CCS code levels)
filter by treatment location (e.g., hospital, skilled nursing facility)
use different time windows of interest

It generated a visualization of the breakdown of diagnoses of the requested patient population, even allowing interactive drill-downs.

Processing these large volumes of data in real time and on demand is no trivial task. We used a compressed representation of the data that allowed us to use in-database bitwise operations, making the process extremely fast and efficient. This application was so successful that, together with the LOS project, it won the 2014 innovation fund for technology award for our sponsors within the company.

Figure 3: Screenshot of the web-based application where physicians can select various ways to group and aggregate patient histories for selected populations.
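The post does not describe the compressed representation in detail, so the following Python sketch only illustrates the general idea of answering population queries with bitwise operations. It assumes one bitmap per attribute, with bit i set when patient i has that attribute; the attribute names and helper functions are hypothetical, and the example runs in memory rather than in-database.

```python
def build_bitmaps(records):
    """Map each attribute to an integer whose bit i is set when patient i has it."""
    bitmaps = {}
    for patient_idx, attributes in enumerate(records):
        for attr in attributes:
            bitmaps[attr] = bitmaps.get(attr, 0) | (1 << patient_idx)
    return bitmaps


def count_patients(bitmaps, required):
    """Count patients that have every attribute in `required` (bitwise AND).

    An empty `required` list returns 0.
    """
    matching = None
    for attr in required:
        bits = bitmaps.get(attr, 0)
        matching = bits if matching is None else matching & bits
    return bin(matching or 0).count("1")  # population count of the result


# Hypothetical patient attributes: diagnosis groups and treatment locations.
records = [
    {"dx:diabetes", "loc:hospital"},
    {"dx:diabetes", "loc:snf"},
    {"dx:acute_mi", "loc:hospital"},
    {"dx:diabetes", "dx:acute_mi", "loc:hospital"},
]
bitmaps = build_bitmaps(records)
print(count_patients(bitmaps, ["dx:diabetes", "loc:hospital"]))  # 2
print(count_patients(bitmaps, ["dx:acute_mi"]))                  # 2
```

Each filter the physician selects becomes one more AND over precomputed bitmaps, so the cost of a query grows with the number of filters rather than with a scan of the raw records.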

PHASE 4: Prove Data Science Can Improve Business Outside The Clinical Setting

In this project and proof point, our customer asked if we could help their accounts payable department with fraud, waste, and abuse (FWA) detection. This department was already doing a great job detecting FWA using deterministic rules established by their domain experts. However, they were interested in how data science might improve their approaches.

In just a few weeks, we managed to detect a substantial number of FWA cases that the existing rules had missed, as covered in our webinar Machine Learning for Forensic Accounting. Furthermore, the approaches reduced the number of false positives sent for review, reducing the workload of the domain experts. These tasks were accomplished by leveraging several different approaches:

Fuzzy string matching (by calculating the Damerau-Levenshtein distance) to identify duplicate entries that may result from data-entry errors
Benford’s Law to identify falsified invoices, whose amounts follow a non-natural distribution
Anomaly detection on purchasing profiles, e.g., to identify opportunities to reduce spending by comparing hospital generic drug purchasing behavior (see Figure 4)
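The first two checks in the list above can be sketched in plain Python. This is an illustration under assumptions rather than the code used in the engagement: the distance function implements the optimal string alignment variant of Damerau-Levenshtein (adjacent transpositions count as one edit), and the Benford check uses a chi-square statistic on leading digits. The function names and sample values are hypothetical.

```python
import math
from collections import Counter


def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


# Expected leading-digit frequencies under Benford's Law.
BENFORD = {digit: math.log10(1 + 1 / digit) for digit in range(1, 10)}


def leading_digit(amount):
    """First significant digit, read from scientific notation."""
    return int(f"{abs(amount):.15e}"[0])


def benford_chi_square(amounts):
    """Chi-square statistic of observed leading digits against Benford's Law."""
    digits = [leading_digit(a) for a in amounts if a != 0]
    n = len(digits)
    counts = Counter(digits)
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in BENFORD.items())


# A transposed pair of letters is a single edit, so these look like duplicates.
print(osa_distance("ACME SUPPLY", "ACEM SUPPLY"))  # 1

# Amounts whose leading digits are spread evenly over 1-9 deviate strongly.
suspicious = [digit * 100 + 50 for digit in range(1, 10)] * 20
print(benford_chi_square(suspicious) > 15.507)  # True
```

The value 15.507 is the 5% critical value of the chi-square distribution with 8 degrees of freedom. In practice, a large statistic is a reason to send a batch of invoices for review, not proof of fraud.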

As a result, the Accounts Payable department hired their very first data scientist, which is one of many ways we measure success at PDL.

Figure 4: The heatmap shows the spend profile for the pharmacies within the healthcare provider for a single drug (defined as an active ingredient). Each column shows how a given pharmacy’s dollar spend for that particular drug is distributed across the various products available. A sample irregularity is shown where the pharmacy spends a larger fraction on a brand name drug than other pharmacies.
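The kind of irregularity highlighted in the heatmap can be approximated with a short Python sketch: convert each pharmacy’s spend on one drug into shares per product, then flag pharmacies whose share for a given product sits far above the median share. The pharmacy names, the dollar amounts, and the 0.25 threshold are hypothetical, and the median-gap rule is a simple stand-in for whatever anomaly detection method was actually used.

```python
from statistics import median


def spend_shares(spend):
    """Convert {pharmacy: {product: dollars}} into per-pharmacy spend fractions."""
    shares = {}
    for pharmacy, by_product in spend.items():
        total = sum(by_product.values())
        if total == 0:
            continue  # no spend on this drug, so no profile to compare
        shares[pharmacy] = {p: amt / total for p, amt in by_product.items()}
    return shares


def flag_outliers(spend, product, threshold=0.25):
    """Flag pharmacies whose share for `product` exceeds the median by `threshold`."""
    shares = spend_shares(spend)
    product_shares = {ph: s.get(product, 0.0) for ph, s in shares.items()}
    typical = median(product_shares.values())
    return sorted(ph for ph, s in product_shares.items() if s - typical > threshold)


# Made-up spend on one active ingredient, split by product.
spend = {
    "pharmacy_1": {"brand": 100, "generic": 900},
    "pharmacy_2": {"brand": 150, "generic": 850},
    "pharmacy_3": {"brand": 700, "generic": 300},
    "pharmacy_4": {"brand": 120, "generic": 880},
}
print(flag_outliers(spend, "brand"))  # ['pharmacy_3']
```

Comparing shares rather than raw dollars keeps large and small pharmacies on the same scale, which is the same normalization each column of the heatmap applies.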

PHASE 5: Prove That The Technology Doesn’t Require Pivotal Data Scientists

After being convinced that PDL can use Pivotal technologies to build better models faster, the customer wanted us to train their data scientists to do the same. We designed a custom training session and asked their data scientists to bring the models they were working on to see if we could improve any of them. In 5 short days, their data scientists built a brand new sepsis mortality model (which outperformed the general mortality model) and improved their EDIP (Early Detection of Impending Physical Deterioration) model significantly. This was made possible by our platform (Apache Hadoop® and HAWQ), which enabled the use of new modeling tools and extremely large-scale datasets, including bedside monitor feeds and orders.

Using Pivotal’s technology they were able to:

Perform rapid data exploration, munging and modeling of this data stored in HDFS with HAWQ’s SQL capabilities.
Have access to a variety of visualization and processing tools, including our big data machine learning library, MADlib.

It was a great experience to see their data scientists explore the whole dataset in its rawest form and build many interesting features in minutes. These would have taken them days using their old analytics tool.

Succeed At Your Own Big Data Journey

Acquiring a new technology never guarantees adoption, especially for running analytics. You may already have a shiny distributed computing platform, but if your data scientists are still extracting a sample and taking it to an in-memory solution to analyze it, you are missing the boat. Sometimes you need to teach your data scientists how to leverage this new technology, and PDL is happy to help you with that challenge. The sample technology adoption journey here is only one of many examples of how PDL has helped our customers along this path. Look for future posts on how customers engage and get educated with new technologies to discover how they can revolutionize their business.

Learning More

Pivotal Big Data Suite Product Info and Data Science Blog Entries
Pivotal Labs and Pivotal Data Labs
For help on your next data science project, contact Pivotal Data Labs.

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.


