Re-Architecting Genomics Pipelines to Handle the Rising Wave of Data

January 22, 2015 Sarah Aerni

This blog is part of a series with joint work performed by Sarah Aerni, Mariann Micsinai, Noelle Sio, and Ailey Crow.

The world of biomedicine is rapidly changing, creating a deluge of data that is drowning current systems and approaches for analyzing and processing information, and genomics data is the key contributor.

Genomics Data Growth

In biomedicine, the challenges around developing new or targeted therapies and treatments involve the analysis of highly variable, large-volume data types. For example, a growing number of treatments on the market now incorporate genetic information for drug dosing or for cancer therapies targeting specific mutations. Additional data from medical image processing, breast cancer pathology, and bedside monitoring in NICUs, when integrated with traditional medical data sources, can lead to better prediction of patient outcomes.

Figure 1: On the left we show a sampling of the types of digital data available for patients including (l) structured data sources and (r) unstructured or semi-structured data. These data sources are useful in gaining a better picture of a patient’s health. On the right we show the promise of building models with increasing sources of data, for example, in building prognostic models that are increasingly accurate with inclusion of additional diverse sources.

One of the fastest growing sources of digital patient data is genomics data, which encompasses RNA, exome, whole-genome, ChIP, or any other "seq" datasets. The need to process and analyze this data meaningfully is driving an enormous field of research, requiring scientists to flex their muscles in engineering, statistics, and mathematics.

Much has been accomplished in the field of genomics over the last few decades since the Human Genome Project, arguably what put bioinformatics "on the map" and sparked its explosive growth as a field. Researchers from other disciplines, such as mathematics and computer science, have been drawn to this emerging field with its promise of contributing to human health and the understanding of diseases like cancer, a disease of the genome.

In part one of this series, we provide a background on the challenges in the space, the growth of genomics data sets, the current solutions, and the approaches to re-architecting significantly faster analytics that makes teams more productive and efficient.

Advancements in Genomics for Growing Datasets

As the datasets grew and the need for computation increased, many of the tools available in bioinformatics failed to adapt to the changes in the technology landscape. The unprecedented data growth resulted from increases in the number of sequenced biological samples, the rate at which they were coming in, and also the number of reads available per sequencing run, producing up to billions of reads (a terabyte of data) as referenced in this example and in the chart below.

Figure 2: Illustration of the exponential growth of genomics data and the simultaneous decline in the cost of sequencing.
Source Data:
http://systems.illumina.com/systems/hiseq_2500_1500/performance_specifications.html
http://www.genome.gov/sequencingcosts/
http://www.ncbi.nlm.nih.gov/genbank/statistics

Often, companies would "throw hardware at the problem" rather than coming up with smarter approaches, increasing the number of machines or the speed of the processors by using large compute clusters with scheduling frameworks like Sun Grid Engine. These same compute clusters were also frequently used for running analyses on the resulting processed data, for example, to find potential disease-causing variants in populations.

While many researchers relied on increased speed and decreased cost of processors, others began to research alternative approaches. Some forays into the world of GPUs were successful, as were attempts to make use of FPGAs (Field Programmable Gate Arrays). Many researchers began to use cloud environments where they could temporarily access cheap computation without needing to purchase hardware or manage systems. This option was favored by many researchers who were not interested in building environments to handle occasional spikes in usage. Instead, they wanted the option of bursting into the cloud when local infrastructure was insufficient, assuming that the local infrastructure and applications could be easily replicated in the cloud and that the data could be distributed across the network. Cloud solutions are attractive in many instances; however, there remains a need to think critically about any in-house compute components and about paradigms for running jobs efficiently (especially off-premise).

These band-aid solutions simply shift the challenge from one of many computational limitations to a network limitation: high-performance compute clusters and cloud bursting require moving the data in order to process it. In rare instances where data will not require re-processing, these solutions can make sense, because only a small amount of data is ultimately brought back into an environment where it will be accessed and integrated. Still, even after variants have been called, mining the data for associations generally means transferring the datasets back to one of these environments each time new individuals, additional datasets for integrative analyses, or new approaches are tested. Ultimately, any scenario that moves data across networks is far from ideal.

Most recently, Apache Hadoop® has grown into a leading option for cheap, well-managed storage and distributed computation for processing raw sequencing data. These environments solve a number of problems by (1) reducing costs through the use of commodity hardware, (2) increasing the speed of computation by using many machines in parallel, and (3) reducing data movement by performing computation locally, where the data lives.

Opportunities for Re-architecting Genomics Processing Pipelines

Approaches for speeding up genomics processing pipelines and data mining extend beyond increasing the number of CPUs and reducing data movement. In processing pipelines, many innovations have developed around smart merging, splitting and indexing of flat files that contain these datasets. However, to data scientists at Pivotal, it became clear that many of these approaches were simply solving a problem by applying decades-old research and best practices in databases and ETL.

Our focus goes beyond taking current paradigms in computation and applying them to the processing and analysis of genomics data. Recent advancements in clever file formats for storing data, open source projects that keep data in memory to avoid file I/O bottlenecks, and novel compute paradigms demonstrate the readiness of the community to radically overhaul the way this data is stored, processed, and analyzed.

Along with these innovations, Pivotal leverages the maturation of database technologies such as Pivotal Greenplum DB and Pivotal HD, both of which serve as massively parallel compute engines capable of efficiently storing, managing, and processing large-scale data. Many of the tasks accomplished in the processing pipeline for sequence data are easily adapted to the Pivotal environment. Storing mapped reads in the database with additional columns that allow rapid grouping of the data in appropriate ways (e.g., by location) enables simple tasks like identifying duplicates or counting reads in a particular region of the genome, such as exons. The tasks that researchers poured energy into cleverly re-engineering simply retrace a path that decades of database research have already paved. Therefore, we focus on how to leverage these mature technologies to solve genomics problems more efficiently.
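To make the idea concrete, here is a minimal sketch of the pattern described above. The schema and data are hypothetical, and sqlite3 stands in for a massively parallel engine like Greenplum purely to show that duplicate identification and region counting reduce to ordinary SQL grouping and range predicates:

```python
import sqlite3

# Hypothetical minimal schema for mapped reads; in practice this table would
# live in a parallel database, but the SQL is the same.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reads (
        read_id  TEXT,
        chrom    TEXT,
        pos      INTEGER,   -- leftmost mapped coordinate
        strand   TEXT       -- '+' or '-'
    )
""")
conn.executemany(
    "INSERT INTO reads VALUES (?, ?, ?, ?)",
    [("r1", "chr1", 100, "+"),
     ("r2", "chr1", 100, "+"),   # duplicate of r1: same chrom/pos/strand
     ("r3", "chr1", 250, "-"),
     ("r4", "chr2", 500, "+")],
)

# Identifying duplicates: group reads by mapping coordinates and keep
# any group that occurs more than once.
dups = conn.execute("""
    SELECT chrom, pos, strand, COUNT(*) AS n
    FROM reads
    GROUP BY chrom, pos, strand
    HAVING COUNT(*) > 1
""").fetchall()
print(dups)  # [('chr1', 100, '+', 2)]

# Counting reads in a region (a hypothetical exon at chr1:50-300): a simple
# range predicate replaces custom flat-file splitting and indexing.
n_reads = conn.execute("""
    SELECT COUNT(*) FROM reads
    WHERE chrom = 'chr1' AND pos BETWEEN 50 AND 300
""").fetchone()[0]
print(n_reads)  # 3
```

With location columns indexed (or used as a distribution key in a parallel database), both queries scale to billions of reads without the bespoke file-merging machinery the post describes.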

Beyond the Processing Pipeline

These capabilities extend beyond the processing pipelines to the integration and mining of additional biomedical data sources at scale. Last year, Pivotal co-presented with Isilon on how Hadoop® and SQL on HDFS (through Pivotal HD's HAWQ) provide ideal environments for the processing and analysis of large-scale genomics data (access the webinar here). Leveraging in-database analytics allows for the joint analysis of clinical data alongside variant data, enabling rapid in-database mining of statistically significant associations with diseases or gene expression.
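The joint analysis described above can be sketched in the same way. The tables, patient IDs, and variant coordinates below are invented for illustration; the point is that one join plus a GROUP BY produces the contingency counts that an association test (e.g., chi-square or Fisher's exact) would consume, all inside the database:

```python
import sqlite3

# Hypothetical variant and clinical tables; in the Pivotal stack this join
# would run inside HAWQ/Greenplum rather than sqlite3.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE variants (patient_id TEXT, chrom TEXT, pos INTEGER, genotype TEXT);
    CREATE TABLE clinical (patient_id TEXT, has_disease INTEGER);
    INSERT INTO variants VALUES
        ('p1','chr7',140453136,'A/T'), ('p2','chr7',140453136,'A/A'),
        ('p3','chr7',140453136,'A/T'), ('p4','chr7',140453136,'A/A');
    INSERT INTO clinical VALUES ('p1',1), ('p2',0), ('p3',1), ('p4',0);
""")

# Joint analysis of variant and clinical data: a single join plus GROUP BY
# yields genotype-by-phenotype counts for one locus.
contingency = conn.execute("""
    SELECT v.genotype, c.has_disease, COUNT(*) AS n
    FROM variants v
    JOIN clinical c ON v.patient_id = c.patient_id
    WHERE v.chrom = 'chr7' AND v.pos = 140453136
    GROUP BY v.genotype, c.has_disease
    ORDER BY v.genotype, c.has_disease
""").fetchall()
print(contingency)  # [('A/A', 0, 2), ('A/T', 1, 2)]
```

Because the counts are computed where the data lives, scanning every locus in the genome against every clinical phenotype becomes a parallel SQL workload rather than a data-movement exercise.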

In this blog series, we will be discussing alternative architectures that enable processing, analysis and visualization of genomics data. In particular, we will share “how-tos” that highlight how the Pivotal Data Science team ported common genomics pipeline and analytics methods to our environment. These posts will share the SQL-based approaches that accomplish identifying duplicates, counting reads for genomic features or functional regions, and mining the genome with clinical data for discovering the genetic basis of human disease.



Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
