Case Study: Analyzing Retail/e-Commerce 318x Faster on HAWQ +MADlib

October 8, 2013 Adam Bloom

In e-commerce or traditional store retail, information about purchase behavior is key to managing the profit and loss of the business. For one of our customers, they saw how HAWQ could improve the speed of analysis by over 318x and reduce analytic queries from 24 days to 6 minutes.

This type of performance gain allows business intelligence teams to impact operations by an order of magnitude.

Analysis, Statistics, and Business Intelligence in Retail

In retail, there are an endless number of questions about which items are purchased together, what promotions are working, where margins can improve, how much product should be kept in stock, and why customer loyalty improves. The answers to these questions help senior executives establish priorities and help store, region, or product line managers make advertising and promotions more profitable, increase basket size, improve profit margin, increase inventory turns, and impact similar metrics. There are large teams of people, a complex array of systems, and big investments made in analytical systems to help large retailers get a competitive edge and improve operations. Even in mid-sized e-commerce companies with smaller teams and IT departments, there is a priority placed on doing more and better analysis to increase revenues.

One of our large retail customers wanted to increase the speed, cost, and scale at which they could analyze this type of data—for example, running association rules via MADlib with greater throughput on big data. By having more capacity in analytical processes like market-basket analysis, lines of business could find a greater number of useful relationships like “people who purchase charcoal and sockeye salmon also buy several other high margin items like white wine, French gruyere, and artisan crackers.” The insight from this type of analysis leads to better advertising, merchandising, promotions, and margins, allowing the business to run more profitably.

Where HAWQ Fits in Retail or e-Commerce Environments

Our customer was using Hadoop as a central landing zone for 10 years information, and it included both unstructured and structured data. In addition, they had a robust enterprise data warehouse (EDW) with 2 years of data. They used a large number of data marts to slice and dice analytics reports, explore scenarios, run market basket analysis, and make better business decisions. The marts included data that was sourced from both the Hadoop File System (HDFS) and the data warehouse. In addition to the data marts, there were several data services and applications using the data from both Hadoop and the EDW.

Their initial plan was to keep separate data warehouse and Hadoop environments that pushed data back and forth in real time via a high-speed interconnect. When they began looking at Pivotal HD and HAWQ to improve their data mining and business intelligence operations, they had several goals: 1) they wanted to see better performance than Hive, 2) they wanted to run SQL for business intelligence and see high concurrency, and 3) they wanted to have a high-performance approach to using MADlib’s data mining association rules for statistical analysis and machine learning.

How HAWQ Improved Performance and Business Results

In comparing Hive versus HAWQ, their initial results showed key improvements—a test on one day’s worth of data saw Hive take 1593 seconds (or 26.5 minutes) versus 5 seconds in HAWQ—this improvement was 318x faster. Another test showed 24 days (or 34,560 minutes) of queries get boiled down into six minutes—a speed improvement of 5760x. These type of improvements meant more data could be used for analysis, more analysis could be done, and a greater impact could be had on the metrics that matter to executives.

They also learned that HAWQ provides out of the box support for data-parallel implementations of MADlib’s mathematical, statistical, and machine-learning methods. If you aren’t familiar with MADlib, it is an open-source, community library of in-database, statistical analytic methods and SQL-based algorithms designed with parallelism in mind. While the project has grown into version 1.0, MADlib v.0.3 was initiated in 2010 by Greenplum engineers and UC Berkeley researchers and originally included the following methods:

Supervised Learning: linear regression, logistic regression, naïve Bayes classification, decision trees, support vector machines
Unsupervised Learning: k-Means Clustering
Descriptive Statistics: Count-min sketch, Flajolet-Martin Sketch, data profiling, and quantiles
Support Modules: sparse vectors, array operations, and conjugate gradient optimization

Beyond their original goals, their look into HAWQ also made them realize several disadvantages within their initial architecture plan. One, the separate Hadoop and data warehouse systems approach meant duplicate data, duplicate storage costs, and additional network overhead. Moving data took additional cycle time as well—particularly with such large amounts of data. With HAWQ, they saw a way to simplify their architecture by having a single system. They could do this because HAWQ allowed them to run SQL on Hadoop’s File System. Simplifying the architecture meant lower capital expenditures. It also meant more productive operations because two separate systems require more resources and different skill sets to administer and operate than one.

In the HAWQ architecture, there was only one system to invest in, learn, and deploy. They didn’t have to keep developing with Hive, HBase, Pig, or MapReduce—an area with several major development investments in the future. Instead, the entire team of DBAs, users, admins, and analysts could begin using existing SQL resources and existing SQL development experience. This also meant that all the money spent and code written on past SQL data mining projects could largely be recouped. Similarly, they could still utilize recent MapReduce investments and run MapReduce jobs where it made the most sense.

To learn more about Pivotal HD and HAWQ:

Learn the basics of Hadoop in 5 Pictures
See what we learned at Hadoop Summit
Read about McKinsey research—how big data analytics might be the #1 key to US economic growth and how we are going to face a shortage of data scientists
Download Pivotal HD or Pivotal HD Single Node, including HAWQ.
Read the Pivotal HD Documentation
Sign up to use Analytics Workbench for free, our 1000-node Hadoop cluster.

About the Author

Biography

Announcing Ginkgo and Gomega: BDD-Style Testing for Golang

I’m happy to announce that Ginkgo, a BDD-style testing framework for Go, and its preferred matcher library ...

LicenseAudit improvements

Over the last week, Abhinav and I have been working on some enhancements and cleanup of LicenseAudit. For t...

Case Study: Analyzing Retail/e-Commerce 318x Faster on HAWQ +MADlib

Analysis, Statistics, and Business Intelligence in Retail

Where HAWQ Fits in Retail or e-Commerce Environments

How HAWQ Improved Performance and Business Results

About the Author

Previous

Next

Case Study: Analyzing Retail/e-Commerce 318x Faster on HAWQ +MADlib

Analysis, Statistics, and Business Intelligence in Retail

Where HAWQ Fits in Retail or e-Commerce Environments

How HAWQ Improved Performance and Business Results

About the Author

Previous

Next

Related content in this Stream

Following the xz supply chain attack blog, explore security and trust in open source with VMware Tanzu's secure container solutions and proactive measures.

VMware Tanzu empowers Netflix accelerates its service evolution and boosts the capabilities of its development teams. Tanzu helps to provide them with the platform to run on and scale.

Unveil regulatory compliance ease with VMware Tanzu Spring Runtime! Elevate audits, adhere to FIPS & NIST standards, benefit IT, DevOps, and Auditors.

Uncover open source risks and the 'Zero CVE' myth with insights on continuous lifecycle management. Discover how VMware Tanzu supports diverse projects effectively.

This 7-part blog series provides a roadmap for architecting a data science platform using VMware Tanzu. We'll delve into the building blocks of a successful platform that drives data-driven insights.

This 7-part blog series provides a roadmap for architecting a data science platform using VMware Tanzu. We'll delve into the building blocks of a successful platform that drives data-driven insights.

This 7-part blog series provides a roadmap for architecting a data science platform using VMware Tanzu. We'll delve into the building blocks of a successful platform that drives data-driven insights.

This blog provides a summary of VMware Tanzu CloudHealth news and product updates for the month of April, 2024

This 7-part blog series provides a roadmap for architecting a data science platform using VMware Tanzu. We'll delve into the building blocks of a successful platform that drives data-driven insights.

How VMware Tanzu CloudHealth helps customers uncover spiraling AWS Extended Support charges.

VMware Tanzu enhances Spring development with simplified operations, accelerated innovation, seamless microservices transition, increased security, and effortless scaling.

This 7-part blog series provides a roadmap for architecting a data science platform using VMware Tanzu. We'll delve into the building blocks of a successful platform that drives data-driven insights.

This 7-part blog series provides a roadmap for architecting a data science platform using VMware Tanzu. We'll delve into the building blocks of a successful platform that drives data-driven insights.

This 7-part blog series provides a roadmap for architecting a data science platform using VMware Tanzu. We'll delve into the building blocks of a successful platform that drives data-driven insights.

Bitnami-packaged open source software is loved by developers for its ease of use, which enables developers to directly pull a Bitnami package and seamlessly start using it with little effort.

VMware Tanzu announces the General Availability of AWS Commitment Discount Recommendations, which provides recommendations for all reservable services in AWS through VMware Tanzu CloudHealth.

Introducing VMWare Tanzu Data Hub, a self-managed Database as a Service (DBaaS) Platform, providing enterprises a way to host their internal DBaaS offering for internal business users.

In the cloud-native landscape, MCAs drive seamless compliance integration. Their expertise ensures proactive security measures align with regulatory standards for sustained innovation & collaboration.