As Apache HAWQ (incubating) approaches its first release, Pivotal is providing new single- and multi-node sandbox environments so that architects, developers, and big data administrators can quickly spin up clusters for testing the software as well as 3rd party integrations.
In this post, you will learn how we created a single-node sandbox, how we used Docker to enable a multi-node testing experience, and which one to choose based on your requirements.
About The Pivotal HDB/Apache HAWQ Sandbox
DBAs, SQL developers, and data scientists typically want to test commercial software that comes configured out-of-the-box with all the appropriate management tools. These users also need the ability to easily install third-party software, such as a BI tool, alongside the software they are testing, and their test environments are often segmented to support multiple purposes. For these users, we have created a VMware virtual machine with Pivotal HDB, Pivotal’s commercial version of Apache HAWQ, and it is now available on Pivotal Network. We have also streamlined the install to reduce the size of the resulting VM, making it easier to download. To handle future sandbox updates, we collaborated with Hortonworks to design an automated build process that lets our team easily push out new revisions, add functionality, and fix any issues that are reported.
The resulting Pivotal HDB Sandbox virtual machine is available for VMware today, but it will also be made available as an Amazon AMI in the near future. This sandbox includes all the software needed to get started and allows you to test SQL on Hadoop with in-database machine learning.
Problems With Single-Node Testing & Parallelism
While test environments have become the norm for database software, they often fail to deliver a true-to-life experience with distributed software. Typically, the management of the system is very different, as is the network communication profile. This means single-node test environments don’t capture how the software is managed, or how it fails, in a real cluster.
The single-node approach is pretty standard for education and testing, but it doesn’t really reflect the distributed environments found in production. To address this need, one of our uber-talented engineers, Jemish Patel, has taken this sandbox a step further. Jemish took on the challenge of creating a multi-node testing environment to better demonstrate Apache HAWQ’s powerful massively parallel-processing (MPP) capabilities. This method of testing is targeted toward developers and power users who gravitate toward building their own test systems.
How can we architect a test environment that resembles real-world use? Linux containers allow us to build small self-contained clusters that can be deployed within a single OS image. This gives the look-and-feel of a distributed cluster without the hardware requirements. We also don’t have to address the changes that are sometimes required when installing multiple software products on a single node. In Docker, these changes can be handled at runtime, through techniques such as port mapping.
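As a concrete illustration of runtime port mapping, two containers can both run a service on the same internal port while exposing different ports on the host, so the installations never have to be reconfigured to avoid conflicts. This is only a sketch; the image and container names below are placeholders, not the project's actual artifacts:

```
# Both containers listen on 5432 internally; the host maps them
# to different ports, so no install-time changes are needed.
docker run -d --name db-one -p 5432:5432 example-db-image   # host 5432 -> container 5432
docker run -d --name db-two -p 6432:5432 example-db-image   # host 6432 -> container 5432
```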
Containerizing Apache HAWQ With Docker
As we set out to provide an easy testing mechanism for Apache HAWQ, we wanted to offer both a testing vehicle for the platform itself and a way to build the Apache HAWQ software from its source code. This additional step was required because the Apache project does not provide binaries. We chose Docker as the platform for this new environment.
In these types of environments, Docker is a powerful deployment mechanism for a few reasons. First, because you are really only running the additional software itself and not recreating the entire OS memory image within a virtual machine, the memory footprint is much smaller. This also means the images can start almost instantly. Second, this smaller memory footprint allows us to run multiple “virtual” nodes on the same host and create a true clustered deployment. Lastly, since we are not distributing the entire OS as part of the build, the software download is much smaller and takes far less time to complete.
Regarding the image design, our original goal was Apache HAWQ testing support, but we also wanted to align with containerization design ideals. So, much of the surrounding core Hadoop stack software is purposefully absent. Our core configuration contains HDFS and YARN, and we added Apache HAWQ to that core. Any other software that could be used in conjunction with Apache HAWQ, such as Apache Hive, can be provided via additional containers that communicate over the internal network. In the end, this Docker configuration consists of four containers: one running the Hadoop NameNode and HAWQ master, and three running Hadoop DataNodes and HAWQ segment servers.
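The topology described above can be sketched with plain Docker commands. This is a minimal illustration of the one-master, three-segment layout on a shared container network; the image, network, and host names are placeholders, and the real build scripts live in the project repository:

```
# Internal network the containers use to talk to each other.
docker network create hawq-net

# One container for the Hadoop NameNode and HAWQ master.
docker run -d --net hawq-net --hostname hawq-master --name hawq-master hawq-image

# Three containers for Hadoop DataNodes and HAWQ segment servers.
for i in 1 2 3; do
  docker run -d --net hawq-net --hostname "hawq-segment$i" --name "hawq-segment$i" hawq-image
done
```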
Getting Started With Your Own HAWQ Proof Of Concept
The source code and instructions to build this Apache HAWQ containerized cluster are available on GitHub at https://github.com/jpatel-pivotal/hdb-docker. The build process downloads a pre-built Docker image from Docker Hub and sets up all the needed containers. The Docker image itself was built from a clone of the Apache HAWQ repository with a complete Maven build, and it includes PXF, HAWQ’s external data query framework.
The build also includes a few other notable packages:
- GP-ORCA, an open-source query optimizer that is common across Pivotal Greenplum Database, Apache HAWQ, and Pivotal HDB.
- Apache MADlib (incubating), an open-source library of scalable in-database machine learning algorithms.
- PL/R, so you can write database functions in the R programming language and use R packages that contain R functions and data sets.
- PL/Python, which allows you to write database functions in Python.
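To give a feel for what the procedural language packages enable, the sketch below defines and calls a simple PL/Python function from the shell. The connection parameters and database name are hypothetical, and it assumes the PL/Python language is already registered in the target database, as it is in the sandbox:

```
# Hypothetical host, port, and database; adjust for your deployment.
psql -h localhost -p 5432 -d gpadmin <<'SQL'
-- A database function whose body is ordinary Python.
CREATE FUNCTION times_two(x integer) RETURNS integer AS $$
    return x * 2
$$ LANGUAGE plpythonu;

SELECT times_two(21);
SQL
```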
Our goal is to provide end-users with a seamless experience for testing Pivotal HDB/Apache HAWQ. Once you download the Pivotal HDB Sandbox (or clone the repository discussed above) and launch your own SQL-on-Hadoop cluster with Docker, please give us feedback via the comments section below.
Your thoughts will help us evolve this model and provide more “cluster-in-a-box” testing solutions and demonstrations for other software products, including those within the Pivotal Big Data Suite.
About the Author
Dan is Director of Technical Marketing for Data and Analytics at Pivotal, with over 20 years of experience in various pre-sales and engineering roles at Sun Microsystems, EMC Corporation, and Pivotal Software. In addition to his technical marketing duties, Dan is frequently called upon to roll up his sleeves for various "Will this work?" projects. Dan is an avid collector of Marvel Comics gear, and you can usually find him wearing his Marvel Vans. In his spare time, Dan enjoys playing tennis and hiking in the Smoky Mountains.