Another Open Source Project From Pivotal: Introducing Outsourcer

July 7, 2015 Jon Roberts

One of my early customers had the same challenge I did back in 2007, when I first started using Pivotal Greenplum Database as a customer: loading over 1,000 tables on a daily basis, from multiple Microsoft SQL Server databases into Greenplum Database, without creating files or errors. This customer's data was riddled with formatting problems, and they didn't have the space or time to unload all of their data to flat files every day.

Because I had faced the same challenge as this customer and knew others would too, I created a repeatable solution that others could use, called Project Outsourcer. The application initially leveraged Greenplum Database external web tables to EXECUTE a program that fetched the data, and it was later enhanced to use gpfdist. At a high level, the program connects to SQL Server over JDBC, queries the data, and formats it so it can be loaded without a single error.
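To make the formatting step concrete, here is a minimal sketch (not Outsourcer's actual code) of the kind of cleanup a value pulled over JDBC needs before a tab-delimited load will succeed. The delimiter, escape, and NULL-token choices below are illustrative assumptions:

```java
// A minimal sketch of cleaning one column value before a tab-delimited
// load. Embedded tabs would add a column, and embedded newlines would
// split the row, so both must be escaped; SQL NULL needs an explicit token.
public class Formatter {
    static String clean(String value, String nullToken) {
        if (value == null) {
            return nullToken;               // render SQL NULL as an explicit token
        }
        return value.replace("\\", "\\\\")  // escape the escape character first
                    .replace("\t", "\\t")   // embedded tab would add a column
                    .replace("\r", "\\r")   // embedded CR would split the row
                    .replace("\n", "\\n");  // embedded LF would split the row
    }

    public static void main(String[] args) {
        System.out.println(clean("line1\nline2\tend", "\\N"));
        System.out.println(clean(null, "\\N"));
    }
}
```

Applying a pass like this to every value is what lets the loader promise "not a single error" even on messy source data.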

Comparing Traditional Data Loading to Outsourcer

[Figure: the traditional data loading process]

With the traditional process (see figure above), preparing the data prior to loading is the bottleneck. This is compounded when there are many tables to load and many formatting errors to overcome.

[Figure: data loading with Outsourcer]

Outsourcer never lands files and automatically cleans the data of formatting problems. With it in place, the bottleneck becomes either the source Oracle/SQL Server database or the network between the Greenplum Database master server and Oracle/SQL Server.

How Outsourcer Evolved to a Widely Used Project

Over time, I have enhanced the program for customers. It automatically creates tables based on the source DDL, queues jobs to avoid flooding the system with 1,000 concurrent INSERT statements, and performs append jobs that pull only new data from the source. After these enhancements, I made the application open source, with the blessing of EMC, so it could be shared with and used by more customers. Of course, we also hope others will enhance it! You can view the source code here: http://github.com/pivotalguru/outsourcer
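The queuing idea above can be sketched with a fixed-size thread pool: rather than firing 1,000 loads at once, cap how many run concurrently and let the rest wait in the queue. The pool size and job count here are illustrative assumptions, not Outsourcer's actual configuration:

```java
import java.util.concurrent.*;

// Sketch of throttled job execution: at most maxConcurrent loads run at
// once; the remaining jobs sit in the executor's queue until a slot frees.
public class LoadQueue {
    public static int runJobs(int jobCount, int maxConcurrent) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        CountDownLatch done = new CountDownLatch(jobCount);
        for (int i = 0; i < jobCount; i++) {
            pool.submit(() -> {
                // A real job would run one table's extract-and-load here.
                done.countDown();
            });
        }
        done.await();    // block until every queued job has finished
        pool.shutdown();
        return jobCount;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runJobs(1000, 8));
    }
}
```

The same pattern protects the source database too: it only ever sees a bounded number of concurrent extract queries, no matter how many tables are scheduled.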

After it worked for SQL Server, I received several requests to support Oracle databases, which I added and many customers have used. For example, Silver Springs Networks used it to migrate off Oracle and onto Greenplum Database. EMC IT has also used it to migrate off of legacy Oracle and Microsoft SQL Server databases and onto Greenplum Database. Many of our partners use it with various customers, and many examples come to mind—an airline, several healthcare companies, and a hospital. We also use it widely for data science projects—Pivotal engineers use it to quickly load data into Pivotal Greenplum Database or Pivotal HAWQ so that our data science practice can work with customers to gain new insights about their data.

As I added support for HAWQ, I also added replication, scheduling, and a user interface. Outsourcer was the first tool to automate table creation and data loading for HAWQ. The user interface leverages the NanoHttpd web server and simple HTML form posts. It is designed to be simple and focused on making it easier to load data.

2015: Making Outsourcer Work with Greenplum File Distribution

As adoption grew, I kept hearing requests for gpfdist, Greenplum's file distribution program. It is part of the secret sauce behind the incredible data loading speeds of Greenplum Database and HAWQ. This year, I finally enhanced the application to use external tables with gpfdist rather than external web tables. This means the application can be installed on an edge node instead of only on the master.
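To make the contrast concrete, a readable external table backed by gpfdist looks roughly like this. The table name, columns, host, port, and format options are hypothetical; the syntax follows Greenplum's CREATE EXTERNAL TABLE:

```sql
-- Hypothetical example: segments pull rows in parallel from a gpfdist
-- process running on an ETL/edge server, not from the master.
CREATE EXTERNAL TABLE ext_orders (
    order_id   integer,
    customer   varchar(100),
    amount     numeric
)
LOCATION ('gpfdist://etl-host:8081/orders.pipe')
FORMAT 'TEXT' (DELIMITER E'\t' NULL E'\\N');
```

With the earlier external web table approach, the query instead ran an EXECUTE command on the master itself, which is why moving to gpfdist both frees the master and allows the application to live on an edge node.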

Now, this really is an incredibly scalable solution. Each job gets a dedicated gpfdist process and doesn't use the master server at all to load data. If you have a 10Gb network to your source databases, the speed will really depend on how fast you can get data out of Oracle/SQL Server.

The diagram below shows how the segments connect to the gpfdist process on an ETL server to get data, using Greenplum Database's "scatter/gather streaming" (much like MapReduce). A database setting determines the number of segments that will connect.

[Figure: segments connecting to the gpfdist process on an ETL server]

Though I started working with Greenplum about 8 years ago, I’m very excited about being at Pivotal. The company has embraced open source solutions for years and encourages employees to do the same. The Outsourcer project is an example where the openness of Pivotal has allowed an application to flourish very quickly, benefitting many customers who are trying to solve the challenge of loading data at high speeds.

About the Author

Jon Roberts

Jon Roberts is a Principal Engineer leading the development of deploying Pivotal Data Suite in the Amazon and Google Marketplaces. Prior to leading this work, he held Platform Engineering and Sales Engineering roles at Pivotal dating back to 2010. Prior to joining Pivotal, he was a Greenplum customer for three years. He holds a Bachelor of Science in Business Administration from the University of Louisville. Go Cards!
