Fail Fast And Ask More Questions Of Your Data With HDB 2.0

May 18, 2016 Dormain Drewitz


Joint work provided by Dormain Drewitz and Jeff Kelly.

Pivotal HDB 2.0, the Hadoop Native Database powered by Apache HAWQ (incubating), became generally available last week. This release marks a major milestone in the technology’s evolution from its massively parallel processing (MPP) roots towards a new category of cloud-scale analytical database, deeply integrated with the Apache Hadoop ecosystem. So, the technology is cool, but why does this really matter? In this post we’ll look at this release through the lens of digital transformation requirements.

Building software that matters with information in context

Data is everywhere, but information can be scarce. What’s worse, information buried in a business intelligence tool is under-serving its potential users.

So, how do we turn the vast amounts of data being produced by machine sensors, web clicks, and transactions into something that users can actually use? The secret is to present that information in context for the user. Some great examples include:

  • Delivery time estimates. Think: your Uber is arriving in five minutes, your Amazon package is guaranteed to be delivered by Friday, or your flight tomorrow is estimated to arrive at 7:20 pm.
  • Next best offer suggestions. In Netflix terms, this is what they recommend you watch next; on Amazon, it’s what other shoppers who bought that same Fitbit bought next.
  • Lifetime-value-ranked at-risk customer lists. More internally facing, these are reports that can prioritize sales plans to retain your most valued customers.

Far from an exhaustive list, these use cases illustrate how data delivered to users in an application helps them do their jobs, or go about their day with a better experience from the brands they’ve invested in.

In order to do that, several things need to happen, including analyzing large amounts of data. As an industry, we’ve known this for a decade—so why is it still so elusive? Part of it involves getting past the “f” word: Failure.

Yes, there are “dumb” questions, and that’s okay

When asking questions of your data (i.e., running a query) is time-consuming and complex, it’s natural to ask fewer questions. We do this by leaning on the domain knowledge of our business analysts and crafting the fewest queries required to prove out a hypothesis. In other words, we try to avoid the “failure” of asking questions that don’t prove anything.

But that only proves what you already know (or where you have a hunch). It doesn’t expose the types of deep statistical relationships that create new opportunities. Identifying these unforeseen but potentially valuable relationships requires a different approach—namely exploratory analytics.

Rather than starting with a set of predefined questions and an assumed data model, exploratory analytics techniques are highly iterative and designed to surface data in ways that make interesting relationships apparent. With exploratory analytics, data scientists sometimes hit dead ends and take wrong turns, but these “failures” often inform the next query, and the next, and the next—until hidden insights reveal themselves.

Basic reporting and minimal queries would never expose non-obvious relationships, like, for example, how Netflix subscribers love Kevin Spacey, political intrigue, and murder. Without that insight guiding Netflix to create a show that combines those topics, the world wouldn’t have the pleasure of binge watching House of Cards on the weekends.

What if your enterprise could add exploratory analytics to its data science arsenal? What if you could ask all kinds of questions of all your data, unencumbered by the fear of asking questions that may go nowhere? What game-changing insights might you uncover if you could run complex queries in minutes rather than hours against petabytes of data?

Reducing the “cost” of asking questions of all your data to support exploratory analytics and data science is the fundamental aim of Pivotal HDB, which is powered by the open source Apache HAWQ (incubating) and is part of Pivotal Big Data Suite (BDS). Pivotal HDB combines the best of a mature, high-performance MPP analytical database with the scale and affordability of Hadoop. The result is the industry’s leading Hadoop native SQL database for “speed of thought” exploratory analytics at Big Data scale.

Apache HAWQ: High performance SQL database at Hadoop scale

Since its debut three years ago, enterprises in the financial services, manufacturing, telecommunications, and media industries, among others, have turned to Pivotal HDB and Apache HAWQ to invigorate their Hadoop analytics capabilities. One such enterprise is CoreLogic, a California-based company that provides data and analytics services to its clients in the financial services and mortgage industries. Mortgage lenders, for example, rely on insights from CoreLogic to score risk associated with loan applicants and their likelihood to stay current on payments. The ability to continuously interrogate huge volumes of disparate data to identify predictive correlations and insights isn’t just a luxury for CoreLogic—it’s fundamental to its competitive differentiation.

CoreLogic is in the process of replacing its legacy data management and analytics tools with Pivotal BDS, including Pivotal HDB, to enable its data scientists and analysts to ask more questions of its data, more often, in order to accelerate the pace of insight discovery.

“The Pivotal Big Data Suite is a strategic platform on which CoreLogic will ingest, curate, innovate, and deliver data assets that unlock value across the different industries we serve,” said Matt Kjernes, Vice President, Software Architecture at CoreLogic. “HAWQ provides the batch/ad-hoc functionality and scale that allows us to reap the benefits of the Hadoop ecosystem in a fashion that is familiar to users.”

Elastic scalability for high performance analytics in Hadoop

Companies like CoreLogic will benefit from the latest iteration of Pivotal HDB, Pivotal HDB 2.0, which has a number of improvements, including a new architecture that supports elastic scalability. HDB can now dynamically allocate resources per query to scale up and down the number of processes, allowing much more flexibility and efficiency while retaining HDB’s high performance qualities. Here are three areas of focus in this release, all supporting the ultimate goal of further lowering the cost of asking questions of all your data.

  • A flexible architecture for high performance queries. Iterative analytics requires queries that return results at the “speed of thought,” but as queries become more complex, performance can suffer. Pivotal HDB brings the performance benefits of a mature, massively-parallel processing database—including dynamic pipelining and cost-based query optimization—to Hadoop. With HDB 2.0, it now includes more granular resource management capabilities, and elastic query execution to make the most efficient use of cluster resources and mitigate performance bottlenecks. Essentially, HDB 2.0 users are getting the best of MPP and Hadoop with this architecture.
  • HCatalog integration and SQL compliance to expand access to the most users and the most tools. It doesn’t matter how fast queries run if you don’t know how to write a query in the first place. Pivotal HDB 2.0 builds on its heritage of robust SQL compliance, opening up Hadoop-based analytics to the widest possible user base. Pivotal HDB 2.0 also includes HCatalog integration, which makes it easier for users to query data stored in Hive without having to define the schema in advance. This eliminates a complex and time-consuming process that discourages users from including Apache Hive™ data in their queries.
  • Data science at scale—now including path functions. Let’s say you want to analyze millions of sensor logs from cars or other machines to identify common patterns in part failure. The economics of Hadoop mean it is now affordable to store pretty much all the data, but if running machine learning and other advanced analytics queries against your data in Hadoop is complicated and time-consuming, this kind of analysis could require moving data out of Hadoop and working with small sample sizes. HDB has always supported the in-database machine learning library Apache MADlib (incubating), and the latest release of MADlib includes path functions. Path functions, which are used for pattern matching over large volumes of data, are useful in a number of analytics scenarios. In retail, for example, they are useful in analyzing website and clickstream data to identify paths to purchase. Path functions are also adept at predictive analytics, such as analyzing sensor data from industrial equipment or other machines to identify common patterns in part failure.
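To make the HCatalog integration concrete, here is a minimal sketch: with HDB 2.0, a table already defined in Hive can be queried through the hcatalog virtual schema without first registering it as an external table in HDB. The database name, table name, and columns below are hypothetical placeholders:

```sql
-- Query a Hive table directly through the hcatalog virtual schema;
-- no prior CREATE EXTERNAL TABLE or schema definition in HDB is needed.
-- (default = Hive database, web_events = Hive table; both hypothetical.)
SELECT user_id, event_type, event_time
FROM hcatalog.default.web_events
WHERE event_time >= date '2016-05-01';
```

The schema is fetched from HCatalog at query time, which is what removes the up-front definition step described above.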
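The clickstream scenario above can be sketched with a MADlib path function call. This is a rough illustration only: the table, columns, symbol definitions, and pattern are hypothetical, and the exact argument list should be checked against the MADlib path documentation rather than taken from this sketch:

```sql
-- Hypothetical clickstream table: clicks(session_id, event_time, event).
-- Find, per session, runs of one or more page views followed by a purchase.
SELECT madlib.path(
    'clicks',                                         -- source table
    'clicks_path_out',                                -- output table
    'session_id',                                     -- partition expression
    'event_time',                                     -- ordering expression
    'view:=event=''view'', buy:=event=''purchase''',  -- symbol definitions
    '(view)+(buy)',                                   -- pattern over symbols
    'count(*) AS events_to_purchase'                  -- aggregate per match
);
```

Because the matching runs in-database, the analysis stays inside Hadoop instead of being moved out and sampled.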

Remember, your data is only as valuable as your ability to ask questions of it—lots of questions. HDB 2.0 is the most advanced Hadoop native SQL database on the market, giving both sophisticated data scientists and less technically savvy business users a platform for iterative, interactive analytics at Big Data scale. It allows users to fail fast and ask more questions of their data, leading to faster time to insight. In the end, that’s what Big Data is all about.

There are a lot more exciting enhancements in HDB 2.0. To learn more, check out this white paper by Pivotal’s Lei Chang. Lei goes into greater depth on some of the specific feature enhancements, including improved query execution flow, YARN integration and resource management.


About the Author

Dormain Drewitz

Dormain leads Product Marketing for Pivotal Platform Ecosystem, including GemFire, Pivotal's PCF Services offerings, and ISV offerings for PCF. Previously, she was Director of Product Marketing for Mobile and Pivotal Data Suite. Prior to Pivotal, she was Director of Platform Marketing at Riverbed Technology. Prior to Riverbed, she spent over 5 years as a technology investment analyst, closely following enterprise infrastructure software companies and industry trends. Dormain holds a B. A. in History from the University of California at Los Angeles.
