One of the greatest challenges in working with big datasets is the need to move information out of storage for analysis. Moving data increases the chance of error and often forces practitioners to work with partial or incomplete samples. A key feature of the Pivotal HD platform and HAWQ is the ability to work with data directly within a Hadoop cluster, with no data movement required. The recent announcement of PivotalR 0.1 extends the platform’s capabilities, allowing users of the statistical programming language R to perform in-database analytics without leaving the command line.
PivotalR improves the scalability and performance of in-database analytics by letting users explore and manipulate information in the database using the R interface. PivotalR handles the necessary SQL translation, and computation is done within the database. The result is faster queries and modeling, without requiring the user to move data or work with only a portion of all the available information.
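The pattern described above (a client-side object that stands in for a table, translates operations into SQL, and lets the database do the computation) can be illustrated with a minimal sketch. This is not PivotalR code and does not reflect its implementation: the `LazyTable` class, its methods, and the `houses` table are hypothetical names invented for illustration, and Python's built-in SQLite stands in for the database.

```python
import sqlite3


class LazyTable:
    """Hypothetical stand-in for a database-backed data frame.

    It holds a table name and an accumulated filter, never the rows.
    Note: predicates are interpolated as-is, so this sketch is not
    safe against untrusted input.
    """

    def __init__(self, conn, table, where=None):
        self.conn = conn
        self.table = table
        self.where = where  # SQL predicate accumulated so far, or None

    def filter(self, predicate):
        # Return a new wrapper with the extra predicate; no query runs yet.
        if self.where is None:
            combined = predicate
        else:
            combined = f"({self.where}) AND ({predicate})"
        return LazyTable(self.conn, self.table, combined)

    def to_sql(self, select):
        # Translate the accumulated operations into one SQL statement.
        sql = f"SELECT {select} FROM {self.table}"
        if self.where is not None:
            sql += f" WHERE {self.where}"
        return sql

    def mean(self, column):
        # The database computes the aggregate; only one value comes back.
        return self.conn.execute(self.to_sql(f"AVG({column})")).fetchone()[0]


# Illustrative data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (price REAL, bedrooms INTEGER)")
conn.executemany(
    "INSERT INTO houses VALUES (?, ?)",
    [(100.0, 2), (200.0, 3), (300.0, 3), (400.0, 4)],
)

houses = LazyTable(conn, "houses")
large = houses.filter("bedrooms >= 3")
generated_sql = large.to_sql("AVG(price)")
average_price = large.mean("price")
print(generated_sql)
print(average_price)
```

The point of the design is that the rows never leave the database: the client only builds a query string and receives a single aggregate back, which is why this approach avoids both data movement and sampling.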
Practitioners familiar with R syntax can perform predictive analytics and call MADlib analytics functions in the language they already know. The PivotalR roadmap includes support for R visualizations using intelligent sampling, Chorus integration, and support for all existing MADlib algorithms.
The PivotalR 0.1 package is available for download on GitHub, along with documentation, example code, and a quick start guide. You can also learn more about PivotalR from this video walkthrough by Hai Qian of Pivotal’s Predictive Analytics Team and Woo Jae Jung of the Data Science Team.
About the Author
Paul M. Davis