Why MPP-based Analytical Databases Are Still Key For Enterprises

October 16, 2014 Sai Devulapalli

Over the past five years, we have seen the Apache Hadoop® ecosystem grow at an escalating pace. This week’s Strata and Hadoop World conference in New York is a testament to the level of interest this evolution has created among enterprises looking to expand their data analytic capabilities. Enterprises now have an arsenal of data analytics tools at their disposal: Data Lakes, SQL engines, parallel machine learning tools, real-time complex-event processing and online learning tools, key-value and object stores, visualization and analytics development tools, and more.

This is a good time to step back and ask what business problems we are trying to solve and how the various solutions in the market align with business objectives: use cases; cost and performance goals; and policy, maturity and regulatory needs. Framed this way, it becomes clear that there are always trade-offs among these objectives, and that there is no one-size-fits-all solution.

The figure below shows where the three commonly used analytics solutions in the industry fit in. Enterprise Data Warehouses (EDWs) have been around since the 1980s and continue to be used to store historical enterprise information. Enterprises have invested in frameworks of tools and business processes around the EDW. That said, there is a clear trend in the industry to off-load analytics processing from EDWs to Massively Parallel Processing (MPP)-based Analytic Databases and Hadoop®-based Analytics Stacks. The business drivers for this trend are fairly well understood:

the economics in storing and analyzing petabytes of data from a variety of data sources
an ecosystem of analytics development tools not tied to specific vendors
use cases requiring deep analytics such as machine learning algorithms applied to large volumes of data

Then the question comes down to what use cases are best suited for implementation on Hadoop®-based Analytics Stacks and what use cases are a better fit for MPP-based Analytics Databases. This is not to suggest that these two stacks are mutually exclusive. In fact, the opposite is true in that quite a few use cases require these stacks to be integrated and working together. Nonetheless, it is a useful exercise to identify the criteria that drive the use of each of these stacks.

Structured and unstructured data analytics: The Map-Reduce paradigm native to Apache Hadoop® has proven to be an effective tool for pre-processing unstructured and semi-structured data sources such as images, text, raw logs and XML/JSON objects. On the other hand, rapid implementation of the majority of data discovery and data science use cases requires strong support for SQL with embedded machine learning capabilities, which is normally available in an MPP-based Analytic Database. The significant shift toward SQL-based analytics is also being driven by the dearth of developers with Map-Reduce skills.

Performance and cost drivers: MPP-based Analytic Databases can be built using a shared-nothing architecture and are not constrained by the limitations of the HDFS file system. In addition, automatic parallelization of the ingest load and redistribution of the processing load in an MPP-based Analytic Database provide better latency for ad-hoc queries and better throughput for batch-mode queries. MPP-based Analytic Databases usually run on bare metal rather than in virtualized environments because their workloads are performance-intensive.

Current Support for Enterprise Grade Features: MPP-based Analytic Databases have been designed with security, authentication, disaster recovery, high availability and backup/restore in mind. Hadoop®-based Analytics Stacks, on the other hand, were originally designed for distributed operation with high availability. Additional enterprise-grade features are actively being added to both Apache and vendor-specific distributions of the Apache Hadoop® stack, so this gap in support for enterprise-grade features is likely to narrow significantly in the future.

Greenplum MPP-Based Analytic Database

Pivotal offers Greenplum Database, the industry-leading MPP-based Analytic Database, which performs data exploration and deep analytics at petabyte scale with blazing performance and supports critical IT and business requirements in security, policy and business continuity. Greenplum Database underscores Pivotal’s commitment to providing the strongest enterprise-grade SQL-based analytics offering in the market.

Architectural tenets: Greenplum Database is built using a shared-nothing architecture with collocated storage and compute. It supports parallel loading from diverse structured data sources and Apache Hadoop® data lakes, as well as massively parallel, high-performance ad-hoc queries. This enables Greenplum Database to be deployed in a diverse set of data pipeline processing architectures. Hardware capacity can be expanded incrementally, with automatic or controlled load redistribution, which minimizes lifecycle management costs.
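To make the shared-nothing idea concrete, here is a minimal Python sketch. It is a toy model written for illustration, not Greenplum Database's actual implementation: the segment count, the CRC32 hash and all function names are assumptions. Each row is placed on exactly one segment by hashing a distribution key, every segment aggregates only its own rows, and the partial results are merged at the end.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_SEGMENTS = 4  # hypothetical cluster size, chosen only for this sketch


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Pick a segment by hashing the distribution key (stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_field, num_segments=NUM_SEGMENTS):
    """Place each row on exactly one segment; segments share no data."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row[key_field], num_segments)].append(row)
    return segments


def local_sum(segment_rows, group_field, value_field):
    """Aggregate only the rows stored on a single segment."""
    totals = defaultdict(float)
    for row in segment_rows:
        totals[row[group_field]] += row[value_field]
    return dict(totals)


def parallel_sum(segments, group_field, value_field):
    """Run the local aggregation on every segment, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(
            pool.map(lambda seg: local_sum(seg, group_field, value_field), segments)
        )
    merged = defaultdict(float)
    for partial in partials:
        for group, total in partial.items():
            merged[group] += total
    return dict(merged)


# Illustrative data only.
rows = [
    {"customer": "acme", "region": "east", "amount": 10.0},
    {"customer": "globex", "region": "west", "amount": 20.0},
    {"customer": "initech", "region": "east", "amount": 5.0},
    {"customer": "acme", "region": "west", "amount": 7.0},
    {"customer": "umbrella", "region": "east", "amount": 3.0},
]
segments = distribute(rows, "customer")
totals = parallel_sum(segments, "region", "amount")
```

Because each segment works only on the rows it holds, adding segments adds both storage and compute. In this toy model, changing the segment count means re-hashing and moving rows, which loosely corresponds to the load redistribution mentioned above.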

Flexibility and Adaptability: Polymorphic storage allows columnar and row-based storage to be used simultaneously, the former for scanning large volumes of data and the latter for small lookups. This enables the solution to scale up to handle large data sets with thousands of columns, as well as to scale down, with respect to cost and latency, to handle smaller data sets. Appliance-based and software-only deployment options, column-level compression, and flexible indexing and partitioning give enterprises full control over the trade-off between performance and cost.
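The difference between the two layouts can be sketched in a few lines of Python. This is an illustrative model only, and the table and function names are hypothetical; it says nothing about how Greenplum Database lays out data on disk. A column store keeps each column's values together, so an aggregate reads one list; a row store keeps each record together, so a lookup reads one tuple.

```python
COLUMNS = ("order_id", "customer", "amount")

# Row-oriented layout: each record is stored contiguously.
row_store = [
    (1, "acme", 10),
    (2, "globex", 20),
    (3, "initech", 5),
]


def to_column_store(rows, column_names):
    """Pivot rows into one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(column_names)}


# Column-oriented layout: each column is stored contiguously.
column_store = to_column_store(row_store, COLUMNS)


def total_amount(store):
    """Scan-style query: touches only the 'amount' column."""
    return sum(store["amount"])


def lookup_order(rows, position):
    """Lookup-style query: fetches one whole record in a single access."""
    return dict(zip(COLUMNS, rows[position]))
```

With thousands of columns, the scan-style query benefits most from the columnar layout because the untouched columns are never read, while the single-record lookup is cheaper in the row layout.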

Advanced Analytics: In addition to OLAP queries such as cube and grouping set operations, Greenplum Database has the richest support in the industry for massively parallel machine learning capabilities invoked from SQL, Python, R, etc.
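For readers unfamiliar with cube and grouping set operations, the following Python sketch shows what a cube computes: the same aggregate for every subset of the grouping columns, down to the grand total. It is a plain illustration of the semantics with made-up data and a hypothetical function name, not the way an MPP database executes such a query.

```python
from collections import defaultdict
from itertools import combinations


def cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims`, including the empty subset."""
    results = {}
    for size in range(len(dims), -1, -1):
        for grouping_set in combinations(dims, size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in grouping_set)
                totals[key] += row[measure]
            results[grouping_set] = dict(totals)
    return results


# Illustrative data only.
sales = [
    {"region": "east", "product": "a", "units": 3},
    {"region": "east", "product": "b", "units": 2},
    {"region": "west", "product": "a", "units": 4},
]
result = cube(sales, ("region", "product"), "units")
```

Two grouping columns yield four grouping sets: by region and product, by region alone, by product alone, and the grand total stored under the empty tuple.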

Enterprise Grade Features: Besides cost, performance and deep analytics capabilities, enterprises need an analytics platform that conforms to their security and regulatory policies and business continuity SLAs. To this end, Greenplum Database supports row- and column-level encryption for data at rest and in motion, and a rich set of authentication and role-based access control mechanisms. Business continuity can be ensured using comprehensive High Availability with block-level replication capabilities, and full and incremental automated backup/restore with remote Disaster Recovery.

These product capabilities, along with excellent customer success initiatives, have earned Pivotal the leadership role in SQL-based enterprise analytics and machine learning. Recently, Gartner published the report “Gartner Critical Capabilities for Data Warehouse Database Management Systems”, which shares results from a survey of customers about their experiences with data warehouse DBMS products. The report scored Pivotal in the top 2 out of 16 vendors in two use cases: “Traditional Data Warehouse” and “Logical Data Warehouse”. In a third use case, “Context Independent Data Warehouse”, Pivotal scored in the top 3 relative to the 15 other vendors.

Leveraging Greenplum MPP-Based Analytic Database for Apache Hadoop®

Our leadership in SQL-based enterprise analytics and machine learning has led us to challenge the conventional thinking in the industry around the gap between MPP-based Analytic Databases and Hadoop®-based Analytics Stacks.

While most analytics vendors are investing to improve their SQL-on-Hadoop implementations, Pivotal has leveraged the decade’s worth of product development effort that went into Greenplum Database, reused this code base to build a SQL engine on Hadoop®, and enhanced it with the industry’s only cost-based query optimization framework tailored for HDFS. This SQL-on-Hadoop product is called HAWQ (Hadoop® With Query). HAWQ enables enterprises to benefit from hardened MPP-based analytic features and query performance while leveraging the Apache Hadoop® stack.

Pivotal offers one license for both HAWQ and Greenplum Database under Big Data Suite, at the price point normally found in SQL-over-Hadoop systems, and charges software licenses only for compute resources and not for the volume of data stored. This enables enterprises to switch between HAWQ and Greenplum Database without re-budgeting exercises and spending approvals for licenses as the volume of data grows and enterprise analytic needs change. Furthermore, this combined stack, shown in the figure above, can run on commodity hardware or on the DCA appliance from EMC.

The combined stack significantly lowers the business risk for enterprises by providing a choice of interoperable analytic solutions and the ability to switch between them with minimal reconfiguration, all under one license. For more information and technical details on Greenplum Database, HAWQ and Big Data Suite, visit us at Strata and Hadoop® World, subscribe to our YouTube channel, or reach out to your local Pivotal sales representative to discuss your specific business analytic needs.

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
