The societal impact of big data technologies and data science practices was demonstrated in February, with the White House appointing DJ Patil as its first chief data scientist, as well as significant news concerning the financial services sector, student debt, government benefits, and more. Within the industry, the announcement of the Open Data Platform initiative, which includes Pivotal, Hortonworks, and a number of other industry leaders, received much attention. Here’s our roundup of the top data science news of the month, both from Pivotal and beyond.
Demonstrating the increasing prominence and need for data scientists across a wide swath of society, the White House appointed DJ Patil as the country’s first chief data scientist this month. In an open letter on the White House’s website, Patil vows to increase citizens’ understanding of big data analysis, establish nationwide data policies that will continue the United States’ prominence in data science in the years ahead, establish best practices within federal organizations, and recruit leaders in the field to maintain the country’s economic competitiveness and serve as a bridge between government, academia, and industry.
Phys.org profiles Kalyan Veeramachaneni of MIT’s Any Scale Learning For All group, which is working on one of the most common and time-consuming challenges for data scientists: preparing and cleaning raw data before introducing it to machine learning systems. This can be a piecemeal, case-by-case process, which increases the difficulty of developing automated solutions. Veeramachaneni and his group have developed a tool, beatDB, which aims to automate many parts of this translation process with tools that help identify signal noise and errors, extract features, and learn from choices the data researchers have previously made.
Two Capital One fraud researchers have been charged with insider trading by the Securities and Exchange Commission. The data analysts allegedly used their access to raw credit card logs to analyze how big chains were performing, then used this information to short the companies’ stocks before official earnings reports, making big gains. While not insider trading in the sense that most people understand it, this type of “misappropriation” insider trading may set new legal precedents and have an impact on how data analysis is performed. It could intensify the focus on who has access to this kind of raw data, and lead to restrictions for data scientists and others working to find insights in these datasets.
Rather than keeping callers on hold indefinitely, call centers are increasingly using big data and data science techniques to more effectively route callers to specific and relevant agents. This report from Software Advice examines these call routing techniques and makes the case that particular customer demographics have preferences regarding the tone and manner of interaction among customer service representatives. By customizing these callers’ experiences based on their demographic preferences, businesses can increase customer satisfaction.
Student debt is an ongoing problem, with graduates footing the bill for rapidly increasing university costs and high interest rates. One of the seemingly intractable problems is the difficulty of setting interest rates for a large population of students who have little-to-no credit history. Louis Beryl aims to leverage data science to find a more nuanced approach and reduce student debt with his startup Earnest. The company, which previously offered consumer loans, recently launched a refinancing program for student loans with interest rates as low as 1.92 percent. The tradeoff for graduates is that Earnest requires access to a number of their financial and professional data sources, including their bank accounts and even LinkedIn profiles, to attain what Beryl states is a much more nuanced, detailed, and accurate picture of the graduates’ economic history and earning potential.
The New York Times reports on the New York City Human Resources Administration’s efforts to use data mining and pattern recognition software to identify suspicious activity among benefits recipients. While it’s difficult to identify fraud directly from the data the organization collects, the administration identified anomalous behavior worthy of further research among a small percentage of recipients. The agency states that the improvement in results can be seen in the numbers, with 30,000 investigations yielding $46.5 million in fraud last year versus 48,000 investigations identifying $29 million in fraud in 2009. Data privacy advocates and lawyers state that such a data-centric approach to identifying fraud misses the nuances of particular anomalous cases, and that the agency’s use of commercial services such as LexisNexis deserves closer scrutiny.
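To make the agency’s comparison concrete, the reported totals can be converted into an average yield per investigation. The short sketch below uses only the four figures quoted above; the variable and function names are illustrative, and the calculation is a simple average that says nothing about how individual cases were distributed.

```python
# Figures as reported in the paragraph above.
investigations_last_year = 30_000
fraud_last_year = 46_500_000  # dollars

investigations_2009 = 48_000
fraud_2009 = 29_000_000  # dollars


def yield_per_investigation(fraud_dollars, investigations):
    """Average fraud dollars identified per investigation."""
    return fraud_dollars / investigations


per_case_last_year = yield_per_investigation(fraud_last_year, investigations_last_year)
per_case_2009 = yield_per_investigation(fraud_2009, investigations_2009)

print(f"Last year: ${per_case_last_year:,.2f} per investigation")
print(f"2009:      ${per_case_2009:,.2f} per investigation")
print(f"Ratio:     {per_case_last_year / per_case_2009:.2f}x")
```

On these numbers, the average works out to roughly $1,550 per investigation last year versus about $604 in 2009, or about 2.6 times as much fraud identified per case.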
A lighthearted weekly podcast featuring computational political scientist Dr. Chris Albon and startup CTO Jonathon Morgan, Partially Derivative runs the gamut of data science topics, mixing industry insider talk and news with topics such as board game and sports analytics. This month, Albon and Morgan have discussed numerous topics, from Microsoft’s acquisition of Revolution Analytics to demystifying data science jargon, and from Where’s Waldo? to 50 Shades of Grey.
This Month in Pivotal Data Science
Pivotal announced this month that fifteen industry leaders would come together with the intent to create a new industry initiative, identified as the Open Data Platform (“ODP”). The new initiative includes Platinum members GE, Hortonworks, IBM, Infosys, Pivotal, SAS, a large international telecommunications firm, and Gold members AltiScale, Capgemini, CenturyLink, EMC, Splunk, Verizon Enterprise Solutions, Teradata, and VMware. In this blog post, Pivotal’s President Scott Yara explains why this announcement signals rapid evolution and standardization in the big data arena, creating a rising tide that will float all boats and bring big benefits to big data solutions in the very near term. Later that week, Roman Shaposhnik added further color to the problem, illustrating in clear terms why the Apache Hadoop® market is fragmented and why the ODP is a much-needed response.
In other major news, Pivotal announced groundbreaking product enhancements to Pivotal Big Data Suite, including plans to create the world’s first open sourced enterprise data portfolio by open sourcing HAWQ, Pivotal GemFire, and Greenplum Database. The new release also provides greater cloud deployment options, with support for bare metal commodity hardware, appliance-based delivery, virtualized instances, and now public, private, and hybrid cloud. In addition, Pivotal announced a partnership with Hortonworks that will allow the use of Hortonworks Data Platform, and will also include advanced support from Hortonworks on Hadoop engineering issues.
The stockpile of ready-to-use tools for data scientists is growing daily, dramatically speeding up time to insight for certain use cases. Today, rather than building everything from scratch, a data scientist may find that the machine learning method that she intends to use is already implemented and available to reuse. However, if the problem at hand requires a machine learning method that is not yet available in the library, then the data scientist needs to quickly implement the method in a clever way. In this blog post, Pivotal’s Regunathan Radhakrishnan demonstrates how easy it is to implement Adaboost on Pivotal Greenplum Database.
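For readers unfamiliar with the method, here is a minimal sketch of the AdaBoost idea in plain Python. It is not the Greenplum implementation described in the post: it assumes a single numeric feature, labels of +1 and -1, and simple threshold rules ("decision stumps") as weak learners, and every function name and data value in it is hypothetical. Each round picks the stump with the lowest weighted error, then increases the weight of the examples that stump got wrong so the next round focuses on them.

```python
import math


def stump_predict(x, threshold, polarity):
    """Predict `polarity` at or above the threshold, the opposite label below it."""
    return polarity if x >= threshold else -polarity


def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) for the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(
                w
                for w, x, y in zip(weights, xs, ys)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds=3):
    """Train a weighted ensemble of stumps; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # this stump's vote strength
        ensemble.append((alpha, threshold, polarity))
        # Shrink weights of correctly classified points, grow the rest.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote across all stumps."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy data (made up): no single threshold separates the two classes.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

On this toy data, no single stump can label every point correctly, but the weighted vote of three stumps can. That combination of simple rules into a stronger classifier is the core of the technique, whatever platform it runs on.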
In this post, two expert Pivotal Data Scientists explain new ways to help financial institutions address compliance. Since the world of regulations has changed, the existing compliance and governance technology and infrastructure platforms fail to meet new business requirements, often covering only archival or basic analytics. By implementing a data lake, advanced data science algorithms, and user interfaces that build analyst feedback into predictive models, compliance groups can reach a new level of operational excellence.
Upcoming Pivotal Data Science Events
Join data technology experts from Pivotal to get the latest perspective on how big data analytics and applications are transforming organizations across industries. This event provides an opportunity to learn about new developments in the rapidly changing world of big data and understand best practices in creating Internet of Things (IoT) applications. Additionally, attendees will engage in hands-on data science and application development training using Pivotal’s market-leading Big Data Suite.
The mobile communications revolution is driving the world’s major technology breakthroughs. From wearable devices to connected cars and homes, mobile technology is at the heart of worldwide innovation. As an industry, we are connecting billions of people to the transformative power of the Internet and mobilizing every device we use in our daily lives. Over the course of four days, March 2-5, 2015, Mobile World Capital Barcelona will host the world’s greatest mobile event: Mobile World Congress.
The 22nd annual SXSW Interactive Festival returns to Austin, Texas from March 13 through March 17. An incubator of cutting-edge technologies and digital creativity, the 2015 event features five days of compelling presentations and panels from the brightest minds in emerging technology, including a few presentations from Pivotal technologists.
Big, fast and smart. These three words will define the future of big data. Structure Data brings together prominent big data analysts, technologists and companies who are implementing some very cool data strategies. Join Pivotal and other companies and practitioners in New York on March 18 and 19.
Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
About the Author
Paul M. Davis