
Data Warehousing with Greenplum: Open Source Massively Parallel Data Analytics

May 31, 2017

Relational databases haven’t gone away; they have evolved to integrate messy, disjointed unstructured data into a cleansed repository for analytics. With massively parallel processing (MPP), the latest generation of analytic data warehouses enables organizations to move beyond business intelligence and process a variety of advanced analytic workloads. These MPP databases expose their power through the familiarity of SQL.

This report introduces the Greenplum Database, released in 2015 as an open source project by Pivotal. Lead author Marshall Presser of Pivotal Data Engineering takes you through the Greenplum approach to data analytics and data-driven decisions, beginning with Greenplum’s shared-nothing architecture. You’ll explore data organization and storage, data loading, and query execution, as well as in-database analytics.

You’ll learn:

  • How each networked node in Greenplum’s architecture features an independent operating system, memory, and storage
  • Four deployment options to help you balance security, cost, and time to usability
  • Ways to organize data, including distribution, storage, partitioning, and loading
  • How to use the Apache MADlib (incubating) analytical library for in-database analytics, and GPText to process and analyze free-form text
  • Tools for monitoring, managing, securing, and optimizing query responses available in the Pivotal Greenplum commercial database

About the Author

Marshall Presser is a Field Chief Technology Officer for Pivotal and is based in McLean, VA. In addition to helping customers solve complex analytic problems with the Greenplum Database, he leads the Hadoop Virtual Field Team, working on issues of integrating Hadoop with relational databases.

Prior to coming to Pivotal (formerly Greenplum), he spent 12 years at Oracle, specializing in High Availability, Business Continuity, Clustering, Parallel Database Technology, Disaster Recovery, and Large Scale Database Systems. Marshall has also worked for a number of hardware vendors implementing clusters and other parallel architectures. His background includes parallel computation, operating system and compiler development, and private consulting for organizations in healthcare, financial services, and federal and state governments.

Marshall holds a B.A. in Mathematics and an M.A. in Economics and Statistics from the University of Pennsylvania and an M.Sc. in Computing from Imperial College, London.
