Using Data Science To Save And Improve Lives

August 28, 2015 Sarah Aerni

Joint work performed by Sarah Aerni, Hulya Farinas, and Gautam Muralidhar.

Humanity will save more lives by analyzing big data and applying predictive models in healthcare and medicine.

For example, cholera outbreaks in Haiti would have been detected faster from Twitter feeds than with traditional epidemiological methods.

Right now, we are experiencing a technology revolution in biomedical informatics, prompted by the growth of data and open source software. First, the adoption of basic electronic health record (EHR) systems has increased substantially since 2008, as shown in this report from the U.S. government: as many as 96.9% of non-Federal, acute care hospitals have adopted certified EHR, while basic EHR adoption has grown from 9.4% in 2008 to 75.5% in 2014. Patient records are also growing with advances in genomics and related fields, including proteomics, metabolomics, and transcriptomics. In fact, we are getting so good at collecting data that some genomics institutions are approaching the 100 petabyte mark. Imagine when doctors have this information on every patient. Both consumer-wearable and in-hospital sensor data are growing in granularity, increasing in scope, and becoming more pervasive. Many imaging modalities, such as MRI, ultrasound, and echocardiography, are using images in increasingly sophisticated ways and storing the high-resolution images. One microscopy image can be as large as a terabyte, and new, higher-resolution approaches are constantly being created. These trends are making data storage and computation a much more significant cost.

The bioinformatics revolution will use data-driven predictive models to drive actions that benefit a multitude of use cases and processes. For example, wearable heart monitors, patches, or other “Internet of humans” apps may capture data and use predictive models to prevent heart attacks or other problems before they happen. An RFID tag might inform a surgeon that a sponge or an aortic cross clamp is about to be left behind in surgery. Pills and contact lenses will probably collect data as well: Novartis is working on robotic pill delivery and on smart contact lenses.

The Architecture Requirements for Bioinformatics and Medical Sensors

To put data science algorithms to good use, we need the right architecture, accurate models, and a common data platform to produce alerts or affect outcomes in real time. Most of this data is currently siloed on dated technology, in both unstructured and structured forms.

In the past, bioinformatics problems have used computational horsepower from a massive number of CPUs, but these architectures have limits at the terabyte and petabyte scale. Traditional programming languages were also not designed to deal with trillion-row matrices or extremely large arrays. Lastly, teams would ship data from one location to another for processing, and the results would be sent back, creating a huge bottleneck in network and disk I/O.

The new architecture paradigm became massively parallel processing (MPP) databases that store and process data in distributed environments and minimize data movement. Pivotal’s Greenplum database became a market leader in this space by taking the open source PostgreSQL database and adding a sophisticated capability for storing data across nodes, letting the nodes perform parts of a query or count, and then aggregating results.
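The store-locally, compute-locally, then aggregate pattern described above can be illustrated with a small Python sketch. This is a toy model and not Greenplum's implementation: the function names, the `patient_id` distribution key, and the use of threads to stand in for separate nodes are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, num_segments, key):
    """Assign each row to a segment by hashing its distribution key."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[hash(row[key]) % num_segments].append(row)
    return segments


def local_count(segment, column):
    """Work done on one segment: count values using only local rows."""
    return Counter(row[column] for row in segment)


def parallel_count(rows, column, num_segments=4, key="patient_id"):
    """Scatter rows to segments, count on each, then merge the partial counts."""
    segments = distribute(rows, num_segments, key)
    # Threads stand in for independent nodes in this sketch.
    with ThreadPoolExecutor(max_workers=num_segments) as pool:
        partials = list(pool.map(lambda seg: local_count(seg, column), segments))
    total = Counter()
    for partial in partials:
        total.update(partial)  # only small partial results are moved and merged
    return dict(total)


# Hypothetical records, invented for this example.
records = [
    {"patient_id": i, "diagnosis": "flu" if i % 3 == 0 else "cholera"}
    for i in range(10)
]
print(parallel_count(records, "diagnosis"))
```

The point of the design is that each segment ships back only a small partial count, so the raw rows never leave the node that stores them.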

MPP has worked well with tabular formats of data, but data also comes in other forms. This is where Apache Hadoop® becomes a fit, storing massive amounts of unstructured data, such as images or text. Much like parallel processing, Hadoop MapReduce separates data across nodes to process it and then aggregates the results. MapReduce was originally intended for batch workloads and can have disk access latencies, especially when stringing jobs together.
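As a rough illustration of the map, group, and aggregate flow that MapReduce follows, here is a single-process Python sketch that counts words in free-text notes. It does not use Hadoop's API; the function names and the sample notes are hypothetical, and a real job would run the map and reduce steps on different nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single total."""
    return key, sum(values)


def map_reduce(documents):
    """Run map over every document, group by key, then reduce each group."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical free-text notes, invented for this example.
notes = ["fever cough", "Fever rash"]
print(map_reduce(notes))
```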

The disk-bound, batch-oriented nature of Hadoop opened the door for Apache Spark®, which includes fault tolerance, distributed data storage, and in-memory computing. Spark's abstraction of a resilient distributed dataset (RDD) allows data to be cached in memory across machines, which avoids frequent disk reads and allows data to be reused across multiple workloads. The result is very low-latency processing for real-time needs.
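The benefit of caching can be shown with a toy Python class that loads data once and then serves every later workload from memory. This is only a single-machine sketch of the idea, not Spark's RDD API; the class name, the `loads` counter, and the heart-rate readings are assumptions made for illustration.

```python
class CachedDataset:
    """Toy stand-in for a cached dataset: read once, then reuse from memory."""

    def __init__(self, loader):
        self._loader = loader  # function that reads from slow storage
        self._data = None      # in-memory copy, filled on first use
        self.loads = 0         # how many times slow storage was actually read

    def collect(self):
        """Return the data, reading from slow storage only on the first call."""
        if self._data is None:
            self._data = list(self._loader())
            self.loads += 1
        return self._data


def read_heart_rates():
    """Hypothetical stand-in for an expensive disk or network read."""
    return [72, 75, 140, 68, 155]


readings = CachedDataset(read_heart_rates)

# Two different workloads reuse the same in-memory data.
average = sum(readings.collect()) / len(readings.collect())
high = [rate for rate in readings.collect() if rate > 120]

print(average, high, readings.loads)
```

Both workloads touch the data several times, yet the loader runs only once, which is the saving the paragraph above describes.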

The relatively recent adoption of these platforms has created a surge in the development of machine learning libraries, including MADlib, MLlib, and GraphX. We can use these tools alongside R or Python, which typically run as single-threaded applications.

Making Use of Real-Time Data through Operationalized Predictive Models

These new architectures not only speed up traditional data studies, but also open us up to new types of data analytics that allow us to save lives and improve health in real time. We can use this type of real-time architecture to reduce many costs and errors, to prevent undesirable events, and to affect care through predictive assessment and proactive actions. For example, we can use ER wait time, readmission, length of stay, hospital census, gaps in care, and other data to be proactive, with "active" being the key word.

To explore the possibilities of what we can now do, the questions to begin thinking about are:

  • What data is available?
  • When is data available—at what point in time?
  • What data is actionable and when?
  • Can the current data and model make accurate enough predictions?
  • Can we make a prediction in a timely enough fashion to take action and affect outcomes?
  • Can we change our process to capture more data, improve it, or use data to optimize the process?

By taking these approaches and asking these types of questions, we can change how care is delivered, and in some cases we already have. Today, we can predict length of stay at hospitals and go far beyond combining patient history with vitals like blood pressure or oxygen saturation to prevent a problem. Instead, we can integrate all data, no matter its size or structure. We can ensure everyone performs hand hygiene practices sufficiently. We can not only ensure that pharmaceuticals are dispensed correctly, but also design precision treatments and improve outcomes. We can use wristwatch-style sensors to alert a system to perspiration or drops in body temperature, which can be signs of low blood sugar in diabetics, and prevent a catastrophic outcome.
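To make the sensor example concrete, here is a minimal Python sketch of a streaming rule that flags a sudden drop in temperature against a rolling baseline. It is a simple threshold rule rather than a trained predictive model, and the class name, window size, threshold, and readings are hypothetical values chosen for illustration, not clinical guidance.

```python
from collections import deque


class TemperatureDropAlert:
    """Flag a reading that falls well below the average of recent readings."""

    def __init__(self, window=5, drop_threshold=0.5):
        self.window = deque(maxlen=window)    # most recent readings
        self.drop_threshold = drop_threshold  # hypothetical drop, in degrees C

    def observe(self, temperature_c):
        """Return True if this reading drops below the rolling baseline."""
        alert = False
        # Only compare once the window holds a full baseline.
        if len(self.window) == self.window.maxlen:
            baseline = sum(self.window) / len(self.window)
            alert = (baseline - temperature_c) >= self.drop_threshold
        self.window.append(temperature_c)
        return alert


# Hypothetical wrist-sensor readings, invented for this example.
monitor = TemperatureDropAlert(window=3, drop_threshold=0.5)
stream = [33.0, 33.0, 33.0, 32.9, 32.2]
alerts = [monitor.observe(reading) for reading in stream]
print(alerts)
```

In a real system the alert would be one input among many, combined with patient history and other signals before anyone acts on it.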

There are still challenges to overcome with issues around technology accessibility and cost, data privacy and security, regulations, and patient cohort selection, but the innovations are coming fast.

With these types of solutions, humanity can save more lives and improve them as well.
