During the month of April, the growing impact of Big Data and data-driven insight on our daily lives became increasingly apparent. While pundits debated the merits of this massive sea change in data collection and analysis, its value and results were borne out this month in intriguing and surprising ways.
Top Data Science News in April 2014
Among data geeks, UPS’s 2004 announcement that its delivery vehicles would avoid taking left turns to conserve fuel has long been a source of curiosity. As this Priceonomics post explains, the company’s idiosyncratic yet data-driven policy has yielded significant efficiency gains, using simple algorithms to map routes that maximize right turns. According to the company, since 2012 the policy has “saved around 10 million gallons of gas and reduced emissions by the equivalent of taking 5,300 cars off the road for a year.”
In this post for the Wall Street Journal’s CIO Journal, Irving Wladawsky-Berger provides executives with a high-level overview of the growing importance of data scientists within the enterprise, describing data science as “one of the most exciting new professions and academic disciplines.”
Big Data may have entered the hype cycle’s dreaded trough of disillusionment if the recent media backlash is any indication, even though many critiques lack a sophisticated understanding of the tools and methodologies involved. In this post for O’Reilly Media, Mike Loukides pushes back against the backlash. He acknowledges that data scientists must be ever-vigilant and skeptical when considering the limitations of particular methodologies and data sources, but emphasizes that the Big Data revolution is well underway and is powering a great number of technologies we rely on daily. He also predicts that the trend will only continue to grow in future years.
The semi-concurrent launch of Nate Silver’s 538, Vox, and a slew of data-driven “explainer” sites from big media outlets like the Washington Post and the New York Times has driven much debate this month about the value and potential limitations of data journalism. In this Politico essay, Felix Salmon argues that the boom in data journalism is actually a good thing for the news industry and media junkies alike.
GigaOM details how the Internet of Things has the potential to revitalize the chip industry, noting the number of new opportunities and challenges that will arise as companies attempt to bring everyday physical objects into the connected world.
It may be the new hotness in boardrooms and shareable viral content, but data visualization is a centuries-old practice. In this fun post, Wired looks back at the past 900 years of tree diagrams, which emerged during the Middle Ages, when an explosion of new knowledge needed to be categorized and communicated, and draws parallels with today’s Big Data explosion.
The New York Times explores how big data software companies are threatening the profitability of legacy hardware vendors such as Oracle, IBM, Teradata, and others. It relates the current industry shift to the way microprocessor-based computing drove computer mainframe prices into the ground.
A whimsical group of Cornell researchers used data mining and deep statistical analysis to trawl the web and determine whether there are time travelers lurking in our midst. Unfortunately for the sci-fi minded among us, the researchers came up empty, but in the process they illuminated the lighter side of data analysis.
This Month in Pivotal Data Science
Pivotal’s New Big Data Suite Redefines the Economics of Big Data Including UNLIMITED Hadoop to Enterprises
This month, Pivotal changed the economics of Big Data forever, launching the Pivotal Big Data Suite. It is an annual, subscription-based software, support, and maintenance package that bundles Pivotal Greenplum Database, Pivotal GemFire, Pivotal SQLFire, Pivotal GemFire XD, and Pivotal HAWQ into a flexible pool of big and fast data products.
Pivotal’s Roman Shaposhnik reviews ApacheCon 14, which took place last week in Denver. At Pivotal’s self-described “coming out party” to the Apache Software Foundation, we worked to make an impression by starting off with a keynote, presenting and attending various sessions, and even hosting a cocktail party. In this review of the event, Shaposhnik also points community members to some of the newer technologies he believes are hot to watch and use right now.
In a video interview for the Big Data & Brews series, Pivotal’s Chief Scientist Milind Bhandarkar shares a beer with Datameer’s CEO Stefan Groschupf and provides an overview of the many features that differentiate Pivotal’s Hadoop distribution from the rest.
DSC Webinar Series: Data Science for the 99% Open Source Software for Machine Learning and Analytics
In this webinar, now available to view at Data Science Central, Pivotal’s Woo J. Jung, Sarah Aerni, and Srivatsan Ramanujam discuss some of the open source tools in their arsenal. They introduce and provide details on the tools they have used and extended for customer engagements, such as MADlib, PL/R, PL/Python, PivotalR, PyMADlib, and a host of others.
The previous blog posts in this series introduced how window functions can be used for many types of ordered data analysis. This post elaborates further on how these techniques can be extended to handle time series resampling and interpolation.
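For readers new to the idea, here is a minimal Python sketch of what resampling and interpolation mean in practice. It is not the code from the post, which works with SQL window functions; the function name `resample_linear` and the sample readings are hypothetical and exist only for illustration. The sketch takes irregularly spaced readings and fills a regular time grid by linearly interpolating between the two nearest observations.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def resample_linear(samples, start, end, step):
    """Resample irregular (timestamp, value) pairs onto a regular grid.

    Grid points that fall between two observations are linearly
    interpolated; grid points outside the observed range get None.
    """
    samples = sorted(samples)
    times = [t for t, _ in samples]
    out = []
    t = start
    while t <= end:
        i = bisect_left(times, t)
        if i < len(times) and times[i] == t:
            value = samples[i][1]        # an observation exists at this instant
        elif i == 0 or i == len(times):
            value = None                 # before the first or after the last reading
        else:
            (t0, v0), (t1, v1) = samples[i - 1], samples[i]
            frac = (t - t0) / (t1 - t0)  # dividing two timedeltas yields a float
            value = v0 + frac * (v1 - v0)
        out.append((t, value))
        t += step
    return out


# Hypothetical sensor readings taken at uneven intervals.
readings = [
    (datetime(2014, 4, 1, 9, 0), 10.0),
    (datetime(2014, 4, 1, 9, 10), 20.0),
    (datetime(2014, 4, 1, 9, 40), 50.0),
]

# Resample onto a regular 10-minute grid.
grid = resample_linear(
    readings,
    start=datetime(2014, 4, 1, 9, 0),
    end=datetime(2014, 4, 1, 9, 40),
    step=timedelta(minutes=10),
)
for ts, value in grid:
    print(ts.time(), value)
```

In a database, the same neighbor lookup is typically expressed with window functions over the ordered rows rather than with a binary search, which is the approach the series explores.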
Thursday, May 15, 2014
5:45 PM to 8:30 PM
San Francisco, CA
Twitter’s Dmitriy Ryaboy and Pivotal’s Milind Bhandarkar discuss Parquet, an open source project implementing columnar storage that supports deeply nested structures, efficient encoding and column compression schemes, and is designed to be compatible with a variety of higher-level type systems. In this talk, they will go over the Parquet design, use cases, and performance numbers.
Tuesday, May 20, 2014
5:45 PM to 8:30 PM
San Francisco, CA
Configuring and operating a Hadoop cluster is still not a trivial task and requires special consideration. In this talk, Pivotal’s Suhas Gogate will provide various tips for configuring a Hadoop cluster and for analyzing and tuning the performance of Map/Reduce applications. He will also demo “Hadoop Vaidya”, a performance advisor for Hadoop M/R, which he submitted as a Hadoop contrib project.
June 3–5, 2014
San Jose, CA
The 7th Annual Hadoop Summit will feature many of the Apache Hadoop thought leaders who will showcase successful Hadoop use cases, share development and administration tips and tricks, and educate organizations about how best to leverage Apache Hadoop as a key component in their enterprise data architecture.
About the Author
Paul M. Davis