This Month in Data Science: January 2015

January 30, 2015 Stacey Schneider

As thought leaders, practitioners, and journalists looked forward to the year ahead in January, they identified a number of long-running trends that will become increasingly important in 2015: the increased demand for data scientists and the relative lack of skilled practitioners, the continued growth of open source tools, and the value of automation in light of ever-growing data lakes. Here’s our roundup of the top data science news of the month, both from Pivotal and beyond.

So you want to build a data science team?
Internet companies looking to start a data science team often get overwhelmed with the challenges of hiring, building, and growing a team. In this post, Rodrigo Rivera details three key factors that companies should take into account: accountability, resources, and team composition.

Why data science matters and how technology makes it possible
Derrick Harris of GigaOm profiles data science rock star Hilary Mason, formerly of Bit.ly and Accel Partners, and currently founder and CEO of research company Fast Forward Labs. In the interview, she shares her insights on the long-term value and impact of data science, how to nurture an effective team, and specifics such as the state of the art in natural language processing.

Open Source Is Data Science’s Missing Ingredient
At ReadWrite, Matt Asay makes the argument that as companies attempt to reap value from their increasingly massive Big Data stores, they may miss the value of hiring for and nurturing a culture of open source. In Asay’s opinion, the most effective data scientists are not only well-versed in open source data science tools, but also comfortable working with, and asking questions of, open data sets.

A brief look at data science’s past and future
In this episode of the O’Reilly Data Show Podcast, DJ Patil reviews the history of data science, with which he has considerable first-hand experience, and offers his take on the current state of data science and Big Data in industry and academia.

What Should Data Scientists Know?
In this post at Forbes, Howard Baldwin makes the argument that, as demand for data scientists continues to grow in 2015, critical analysis skills and familiarity with the most common data science tools will become more important than formal education, despite the number of PhDs currently in the field.

How Data Science can be used to solve issues for Teach For India
Harshith Mallya at SocialStory profiles a Project Accelerator Night hosted by DataKind Bangalore earlier in the month, which explored how data science can be utilized by Teach For India, an initiative that recruits college graduates and professionals to commit two years to teaching full-time at schools that lack resources and teachers.

The Data Center Journal 2015 Predictions: Data Science and Apache Hadoop and In-Memory, Oh My!
At the Data Center Journal, Mathias Golombek offers his predictions for the discipline in the New Year: increased sophistication of data science analytics, its ubiquity for an ever-growing number of use cases, improved efficiency as a result of Apache Hadoop® adoption, and the increased importance of in-memory analytics.

This Month in Pivotal Data Science

Re-Architecting Genomics Pipelines to Handle the Rising Wave of Data
Genomic data is quickly becoming a central part of the next generation of medicine. The future of treatments will be through precision medicine, as evidenced by the growing number of drugs targeting specific mutations, and this will require the processing of large volumes of data. Genomics data sets have grown considerably, and yesterday’s solutions will not address emerging problems with adequate performance and speed. In this post, the first of a series, Pivotal data scientists introduce a new perspective on how the re-architecture of genomics processing pipelines, using SQL on Hadoop and low-cost storage on HDFS, can propel this space by revolutionizing the processing and analysis of data.

Pivotal For Good with Crisis Text Line: A First Look
Pivotal For Good (P4G) has partnered with Crisis Text Line, whose trained specialists assist hundreds of at-risk teens every day, to use data science to understand and predict teens’ crisis needs. These insights will ultimately help foster data-driven improvements on Crisis Text Line’s current platform (e.g. their ‘switchboard’) and training/recruiting of their crisis specialists. This blog post is the first look into the challenge of measuring if texters feel their crisis situation was alleviated by an exchange with a specialist, and how they felt the conversation went.

Pivotal People—Gavin Sherry on Engineering PostgreSQL, Greenplum, HAWQ and More
In this post, we do a Q&A session with Pivotal’s vice president of engineering for data, Gavin Sherry. Gavin has played a major role in developing a number of well-known data-centric products, including significant contributions to PostgreSQL. Before big data was common, he joined Greenplum to help build the massively parallel processing engine for analyzing petabytes of data. At Pivotal, he helped lead the development of HAWQ, creating a new era of data platforms with SQL on Hadoop. In this interview, he shares his history as well as his views on where the market is heading and how his products are being used to help companies become more data driven.

Pivotal’s Top Predictions for Data Science and Big Data in 2015
Based on the thoughts of our top Pivotal data scientists and big data thought leaders, this post offers a view into our top predictions of what will happen in the data science and big data space in 2015. In an incredibly dynamic and fast-paced market, the year is sure to bring substantial innovation and market maturation, yielding exciting examples of just how significantly expertly built algorithms can benefit businesses.

All Things Pivotal Podcast Episode #10: Discussing Natural Language Processing and Churn Analysis with Mariann Micsinai and Niels Kasch
Organizations are increasingly concerned with customer churn: they want to reduce the number of customers they lose and to deeply understand why customers leave. Not just for its own sake, but to be able to take clear actions to reduce the rate of churn and to improve either the customer experience or the products they offer. There is a raft of technologies and techniques that can be applied in this area, but where to start? This week we speak with data scientists Mariann Micsinai and Niels Kasch to get their insights into these domains.
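The podcast digs into the techniques themselves; as a minimal illustration of the underlying metric, churn rate is simply the fraction of customers lost over a period. The function and figures below are hypothetical, not drawn from the episode:

```python
def churn_rate(customers_at_start, customers_lost):
    """Fraction of customers lost over the period (e.g. a month)."""
    if customers_at_start == 0:
        return 0.0
    return customers_lost / customers_at_start

# Illustrative numbers: 1,000 customers at the start of the month, 50 cancelled.
rate = churn_rate(1000, 50)
print(f"Monthly churn rate: {rate:.1%}")  # → 5.0%
```

In practice, teams go beyond this summary figure by modeling which individual customers are likely to churn, which is where the NLP and machine learning techniques discussed in the episode come in.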

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
