Another Open Source Project From Pivotal: Introducing Outsourcer

July 7, 2015 Jon Roberts

One of my early customers had the same challenge I did back in 2007, when I first started using Pivotal Greenplum Database as a customer: loading over 1,000 tables on a daily basis, from multiple Microsoft SQL Server databases into Greenplum Database, without creating files or errors. This customer's data was riddled with formatting problems, and they didn't have the space or time to unload all of their data to flat files every day.

Because I had faced the same challenge as this customer and knew others would too, I created a repeatable solution that others could use, called Project Outsourcer. The application initially leveraged Greenplum Database external web tables to EXECUTE a program that fetched the data, and it was later enhanced to use gpfdist. At a high level, the program connects to SQL Server over JDBC, queries the data, and formats it so it can be loaded without a single error.
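To make the formatting step concrete, here is a minimal sketch (not Outsourcer's actual code) of the kind of cleanup a value pulled over JDBC needs before a tab-delimited load will succeed. The delimiter, escape, and NULL-token choices below are illustrative assumptions:

```java
// A minimal sketch of cleaning one column value before a tab-delimited
// load. Embedded tabs would add a column, and embedded newlines would
// split the row, so both must be escaped; SQL NULL needs an explicit token.
public class Formatter {
    static String clean(String value, String nullToken) {
        if (value == null) {
            return nullToken;               // render SQL NULL as an explicit token
        }
        return value.replace("\\", "\\\\")  // escape the escape character first
                    .replace("\t", "\\t")   // embedded tab would add a column
                    .replace("\r", "\\r")   // embedded CR would split the row
                    .replace("\n", "\\n");  // embedded LF would split the row
    }

    public static void main(String[] args) {
        System.out.println(clean("line1\nline2\tend", "\\N"));
        System.out.println(clean(null, "\\N"));
    }
}
```

Applying a pass like this to every value is what lets the loader promise "not a single error" even on messy source data.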

Comparing Traditional Data Loading to Outsourcer

[Figure: the traditional data loading process]

With the traditional process (see figure above), preparing the data prior to loading is the bottleneck. This is compounded when there are many tables to load and many formatting errors to overcome.

[Figure: data loading with Outsourcer]

Outsourcer never lands files and automatically cleans the data of formatting problems. With it in place, the bottleneck becomes either the source Oracle/SQL Server database or the network between the Greenplum Database master server and Oracle/SQL Server.

How Outsourcer Evolved to a Widely Used Project

Over time, I have enhanced the program for customers. It automatically creates tables based on the source DDL, queues jobs to avoid flooding the system with 1,000 concurrent INSERT statements, and performs append jobs that pull only new data from the source. After these enhancements, I made the application open source, with the blessing of EMC, so it could be shared with and used by more customers. Of course, we also hope others will enhance it! You can view the source code here: http://github.com/pivotalguru/outsourcer
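The queuing idea above can be sketched with a fixed-size thread pool: rather than firing 1,000 loads at once, cap how many run concurrently and let the rest wait in the queue. The pool size and job count here are illustrative assumptions, not Outsourcer's actual configuration:

```java
import java.util.concurrent.*;

// Sketch of throttled job execution: at most maxConcurrent loads run at
// once; the remaining jobs sit in the executor's queue until a slot frees.
public class LoadQueue {
    public static int runJobs(int jobCount, int maxConcurrent) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        CountDownLatch done = new CountDownLatch(jobCount);
        for (int i = 0; i < jobCount; i++) {
            pool.submit(() -> {
                // A real job would run one table's extract-and-load here.
                done.countDown();
            });
        }
        done.await();    // block until every queued job has finished
        pool.shutdown();
        return jobCount;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runJobs(1000, 8));
    }
}
```

The same pattern protects the source database too: it only ever sees a bounded number of concurrent extract queries, no matter how many tables are scheduled.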

After it worked for SQL Server, I received several requests to support Oracle databases, which I added and many customers have used. For example, Silver Springs Networks used it to migrate off Oracle and onto Greenplum Database. EMC IT has also used it to migrate off of legacy Oracle and Microsoft SQL Server databases and onto Greenplum Database. Many of our partners use it with various customers, and many examples come to mind—an airline, several healthcare companies, and a hospital. We also use it widely for data science projects—Pivotal engineers use it to quickly load data into Pivotal Greenplum Database or Pivotal HAWQ so that our data science practice can work with customers to gain new insights about their data.

As I added support for HAWQ, I also added replication, scheduling, and a user interface. Outsourcer was the first tool to automate table creation and data loading for HAWQ. The user interface leverages the NanoHttpd web server and simple HTML form posts. It is designed to be simple and focused on making it easier to load data.

2015: Making Outsourcer Work with Greenplum File Distribution

As adoption grew, I kept hearing requests for gpfdist, Greenplum's file distribution program. It is part of the secret sauce behind the incredible data loading speeds of Greenplum Database and HAWQ. This year, I finally enhanced the application to use external tables with gpfdist rather than external web tables. This means the application can be installed on an edge node instead of only on the master.
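To make the contrast concrete, a readable external table backed by gpfdist looks roughly like this. The table name, columns, host, port, and format options are hypothetical; the syntax follows Greenplum's CREATE EXTERNAL TABLE:

```sql
-- Hypothetical example: segments pull rows in parallel from a gpfdist
-- process running on an ETL/edge server, not from the master.
CREATE EXTERNAL TABLE ext_orders (
    order_id   integer,
    customer   varchar(100),
    amount     numeric
)
LOCATION ('gpfdist://etl-host:8081/orders.pipe')
FORMAT 'TEXT' (DELIMITER E'\t' NULL E'\\N');
```

With the earlier external web table approach, the query instead ran an EXECUTE command on the master itself, which is why moving to gpfdist both frees the master and allows the application to live on an edge node.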

Now, this really is an incredibly scalable solution. Each job gets a dedicated gpfdist process and doesn't use the master server at all to load data. If you have a 10Gb network to your source databases, the speed will really depend on how fast you can get data out of Oracle/SQL Server.

The diagram below shows how the segments connect to the gpfdist process on an ETL server to get data, using Greenplum Database's "scatter/gather streaming" (much like MapReduce). A database setting determines the number of segments that will connect.

[Figure: segments connecting to the gpfdist process on an ETL server]

Though I started working with Greenplum about 8 years ago, I’m very excited about being at Pivotal. The company has embraced open source solutions for years and encourages employees to do the same. The Outsourcer project is an example where the openness of Pivotal has allowed an application to flourish very quickly, benefitting many customers who are trying to solve the challenge of loading data at high speeds.

About the Author

Jon Roberts

Jon Roberts is a Principal Engineer leading the development of deploying Pivotal Data Suite in the Amazon and Google Marketplaces. Prior to leading this work, he held Platform Engineering and Sales Engineering roles at Pivotal dating back to 2010. Prior to joining Pivotal, he was a Greenplum customer for three years. He holds a Bachelor of Science in Business Administration from the University of Louisville. Go Cards!
