Tech How To: Using Pivotal Greenplum Hybrid Queries on Amazon S3 Data

April 27, 2016 Amey Banarse

 

Joint work performed by Amey Banarse and Qi Shao.

Last week, we announced that Pivotal Greenplum now does hybrid queries across data stored on-premises and on Amazon Web Services Simple Storage Service (AWS S3).

The release addresses a huge challenge in today's hybrid cloud environments: the need to perform analytical queries across cloud and on-premises data warehouses. This post gives a high-level overview of key scenarios and a deeper technical explanation of how Pivotal Greenplum's external table support for AWS S3 works, including the steps to start using it.

As background, Pivotal Greenplum has supported external tables for some time, allowing a cluster to query data stored outside it as if it were local. With this new release, Pivotal Greenplum uses external tables to query data stored natively in AWS S3 alongside data stored in your Pivotal Greenplum cluster, wherever it is located. This means a single analytical query can be segmented and distributed to either or both environments.
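For example, a single statement can combine rows kept on the cluster with rows that live only in S3. The sketch below is illustrative rather than definitive: both table names come from the hands-on section later in this post, and the trans_date column is a hypothetical example.

-- Hybrid query: trans_fact is a regular on-cluster table, while
-- trans_fact_external reads its rows directly from objects in AWS S3
-- (both are defined in the "Get Your Hands Dirty!" section below).
SELECT date_trunc('month', trans_date) AS month,
       count(*) AS transactions
FROM (
    SELECT trans_date FROM trans_fact           -- local, on-cluster data
    UNION ALL
    SELECT trans_date FROM trans_fact_external  -- data read from S3
) t
GROUP BY 1
ORDER BY 1;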

In scenarios where our customers choose to run Greenplum on AWS (a cloud-only deployment model), S3 external tables become a powerful feature that even Amazon Redshift doesn't currently have.

Common Scenarios

As companies try to figure out how to manage very large data sets, this capability offers alternative solutions for several problematic scenarios. Here are some of the more common examples.

First, some of our customers dump data from AWS S3 to disk and load it into their data warehouse before they can run scripted batches or manage their ETL/ELT processes. With direct access to AWS S3 in the form of an external table within Pivotal Greenplum, data is moved into buffers at query time rather than loaded ahead of execution. Moreover, data in AWS S3 can be analyzed in near real time as soon as it lands there. Of course, query performance is going to be better on local disk than on S3, but some processes require analysis without the delay of a full data load. These scenarios benefit greatly from this capability.

Second, data warehouses struggle to scale and keep up with the rapid pace at which datasets grow these days. In such scenarios, AWS S3 and other cloud-based options offer an additional, cost-effective, and flexible storage alternative. However, this often results in two completely separate environments (or databases). An on-premises Greenplum instance can bridge that separation between cloud and on-premises data without any additional expansion requirements for the existing cluster (assuming the master is not overloaded).

As a third example, some of Pivotal's customers in the financial services and ad-tech industries use AWS S3 as their data lake. AWS tools like EMR (Elastic MapReduce) with Hive, Pig, Spark, etc. are popular choices for data processing. With Greenplum, you gain a highly mature query optimizer, full ANSI SQL support, and a platform that can be deployed on AWS. So, it is a front-runner for data processing on AWS S3 datasets.

Lastly, data scientists use Greenplum's in-database analytics with Apache MADlib, PL/Python, PL/R, etc. for machine learning workloads. Now, they can perform the same tasks on S3 datasets as well. How cool is that?!
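As a small illustration, the sketch below runs MADlib's summary function over a table loaded from S3. It is a hedged example: madlib.summary is a standard Apache MADlib function, but MADlib must already be installed in the madlib schema, and the table names are placeholders tied to the hands-on steps later in this post.

-- Profile a table whose rows were loaded from an S3-backed external table.
SELECT madlib.summary('trans_fact',           -- source table, loaded from S3 in the steps below
                      'trans_fact_summary');  -- output table receiving per-column statistics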

Now that the context is set, let's look under the hood.

Concepts and Architecture

To understand the data flow between Pivotal Greenplum and AWS S3, we should quickly baseline how these components work. Pivotal Greenplum is a massively parallel processing (MPP) database cluster that conceptually uses multiple PostgreSQL instances for parallelism. There is also an open source version called Greenplum. AWS S3 provides secure, durable, highly scalable cloud-based object storage via a simple, easy-to-use web service interface to store and retrieve massive amounts of data. The data is stored in buckets, which are the logical units or fundamental containers of object storage.

In Pivotal Greenplum, the master holds the global system catalog and no user data. SQL queries enter through the master, and the work is distributed to the segments, where the user data is stored. The master also coordinates and presents the final results from each segment. External tables are defined with an input file format and a location, and they are backed by the AWS S3 data. SQL operations like SELECT, INSERT, and JOIN are then used against these external tables, which interface with the AWS S3 data.
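If you want to see this topology on your own cluster, the gp_segment_configuration catalog lists the master and every segment. The query below is simply a quick way to orient yourself before working with external tables.

-- The master has content = -1; every other row is a segment (or its mirror).
SELECT dbid, content, role, hostname, port
FROM gp_segment_configuration
ORDER BY content, role;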

Simple Exploratory Data Analytics Workflow

In the context of an architecture designed for exploratory analytics, as shown in the diagram below (Figure 1), data follows a standard pattern. First, data lands in AWS S3 buckets from uploads, ETL, and so on. Then, external tables are defined in Pivotal Greenplum to access this data. Exploratory queries can be run via SQL or extensions like PL/Python, PL/R, and PL/Java to gather metrics about the data. Once the data of interest is known, an internal Pivotal Greenplum table can be defined, and the data can be loaded from the external table in parallel (a brief SQL sketch of these two steps follows Figure 1). Once it is in Greenplum, heavy data science workloads and BI tools can run on the data.

Figure 1: Exploratory analytics workflow with Pivotal Greenplum and AWS S3
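Here is that compressed sketch of defining an internal table and loading it from the external table. The trans_date column and the cutoff date are hypothetical, and trans_fact_external is the external table defined in the hands-on section below.

-- Keep only the rows of interest; the load runs in parallel across the segments.
CREATE TABLE trans_fact_recent AS
SELECT *
FROM trans_fact_external
WHERE trans_date >= DATE '2016-01-01'
DISTRIBUTED RANDOMLY;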

A Deeper Dive Into How It Works

Indexing the data follows a simple process. Within Pivotal Greenplum, the S3 protocol is configured with a URL specifying the AWS S3 endpoint, the S3 bucket name, and an optional S3 file prefix. As shown in the diagram below, the Pivotal Greenplum segments send a request to list all of the objects in AWS S3 that match bucket + prefix. The matching objects are indexed and divided based on the number of segments. After this process, each object is assigned to a segment, and each segment then has a list of objects to fetch from AWS S3.

Greenplum working with Amazon S3

Fetching data is depicted in the diagram below. First, each AWS S3 object is split into chunks, and a segment fetches its assigned objects one by one. For each object, multiple threads are started in the segment, and each thread fetches a chunk. The chunks are downloaded in parallel and placed into a buffer. Once the buffer is full, the data is flushed into the database. The number of threads and the chunk size are configurable, and the buffer size is equivalent to the number of threads multiplied by the chunk size. For example, with six threads and a 64 MB chunk size (the values used in the sample s3.conf below), each segment buffers roughly 384 MB at a time.

Greenplum queries on AWS S3

Get Your Hands Dirty!

In this section, we will cover how to set up S3 external tables for a Greenplum cluster running version 4.3.8 or higher.

1. Prepare the configuration file for S3 – s3.conf
Replace <aws secret> and <aws access id> with your own AWS credentials/keypair.
Copy s3.conf to every segment host, under the same directory.

[gpadmin@127.0.0.1]$ cat <<EOF > /home/gpadmin/s3.conf
[default]
secret = "<aws secret>"
accessid = "<aws access id>"
threadnum = 6
chunksize = 67108864
loglevel = DEBUG
EOF

2. Configure the S3 protocol for Greenplum
Before creating an external table with the s3 protocol, you must configure Greenplum Database and define the necessary protocol.

-- Create a function to access the S3 protocol library
CREATE OR REPLACE FUNCTION read_from_s3() RETURNS integer AS
    '$libdir/gps3ext.so', 's3_import' LANGUAGE C STABLE;

-- Declare the S3 protocol and specify the function that is used
-- to read from an S3 bucket 
CREATE PROTOCOL s3 (
    readfunc  = read_from_s3
);
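As an optional sanity check, you can confirm the protocol was registered by looking at pg_extprotocol, the catalog table where Greenplum records custom external protocols.

-- Should return a row named 's3' once the CREATE PROTOCOL statement succeeds.
SELECT ptcname FROM pg_extprotocol;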

3. Define S3 external tables
The location for an s3 external table follows the format:

's3://s3_endpoint/bucket_name/[prefix/[prefix/]...] [config=/config_file_location]'

All the files in the S3 location (s3_endpoint/bucket_name/folder_location) are used as the source for the external table and must have the same format. In addition, each file must contain complete data rows; a data row cannot be split between files. Currently, only the TEXT and CSV formats are supported. The S3 file permissions must be set to Open/Download and View for the S3 key pair that is accessing the files.

CREATE EXTERNAL TABLE trans_fact_external (
.... -- columns
)
LOCATION('s3://s3-us-west-2.amazonaws.com/bucketname/trans_fact/trans_fact_1mil.csv config=/home/gpadmin/s3.conf')
FORMAT 'TEXT' (DELIMITER '|');
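Instead of naming a single file, the location can end with a prefix so that every matching object under it is read. The variant below is a sketch with hypothetical columns; the bucket, prefix, and config path mirror the example above.

-- All objects under trans_fact/ are listed, divided among the segments, and read in parallel.
CREATE EXTERNAL TABLE trans_fact_all_external (
    trans_id   bigint,
    trans_date date,
    amount     numeric
)
LOCATION('s3://s3-us-west-2.amazonaws.com/bucketname/trans_fact/ config=/home/gpadmin/s3.conf')
FORMAT 'TEXT' (DELIMITER '|');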

4. Query S3 External Tables
You can either query the s3 external table directly, or load the data into GPDB first and then query the internal table for better performance.

-- Query the external S3 table
SELECT count(*) FROM trans_fact_external;

-- Load the data into an internal table
CREATE TABLE trans_fact AS (SELECT * FROM trans_fact_external)
DISTRIBUTED RANDOMLY;

-- Query the internal table
SELECT * FROM trans_fact LIMIT 5;
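One follow-up worth mentioning: after a bulk load, it is good general Greenplum (and PostgreSQL) practice to collect statistics so the optimizer has accurate row counts for the new table. This is not specific to the S3 extension.

-- Gather planner statistics for the freshly loaded table
ANALYZE trans_fact;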

