Greenplum Database Adds The Pivotal Query Optimizer

May 5, 2015 Dan Baskette

Today, we are releasing Pivotal Greenplum Database 4.3.5 with the new Pivotal Query Optimizer, the world’s first cost-based optimizer for Big Data. This new cost-based query optimizer represents a radical improvement over the legacy optimizer that has served the Greenplum Database well over the past 10 years.

The Pivotal Query Optimizer, available exclusively for Pivotal Greenplum Database, brings a state of the art query optimization framework to Greenplum Database that is distinguished from other optimizers in several ways:

  • Modularity. Pivotal Query Optimizer is not confined inside a single RDBMS. It is currently leveraged in both Greenplum Database and Pivotal HAWQ, but it can also be run as a standalone component to allow greater flexibility in adopting new backend systems and using the optimizer as a service. This also enables elaborate testing of the optimizer without going through the other components of the database stack.
  • Extensibility. The Pivotal Query Optimizer has been designed as a collection of independent components that can be replaced, configured, or extended separately. This significantly reduces the development costs of adding new features, and also allows rapid adoption of emerging technologies. Within the Query Optimizer, the representation of the elements of a query has been separated from how the query is optimized. This lets the optimizer treat all elements equally and avoids the issues with the imposed order of optimization steps of multi-phase optimizers.
  • Performance. The Pivotal Query Optimizer leverages a multi-core scheduler that can distribute individual optimization tasks across multiple cores to speed up the optimization process. This allows the Query Optimizer to apply all possible optimizations at the same time, which results in many more plan alternatives and a wider range of queries that can be optimized. For instance, when the Pivotal Query Optimizer was used with TPC-H Query 21, it generated 1.2 billion possible plans in 250 ms. This is especially important in Big Data Analytics, where performance challenges are magnified by the volume of data that needs to be processed. A suboptimal optimization choice could very well lead to a query that just runs forever.

Pivotal Query Optimizer enables multiple new features for query optimization. Three of them are especially impactful for the types of queries common in big data analytics:

  • Dynamic Partition Elimination. In the world of big data, partition elimination is an extremely well-known and important feature in reducing the amount of data required to be processed by the query engine. With Greenplum Database and the Pivotal Query Optimizer, if there is a cost-effective way to eliminate partitions that are not relevant to the result, they will be eliminated. Let’s look at an example:
    SELECT year FROM catalog_sales JOIN date_dim ON (date_id=date_dim.id) WHERE date_dim.month = 12 GROUP BY year;

    In this query, the list of relevant partitions depends on the outcome of the join to the date_dim table, which is not known until execution time. This is known as Dynamic Partition Elimination: scanning only the relevant partitions when those partitions can be determined only once the query begins executing. To address these types of queries, Pivotal Query Optimizer has introduced three new query operators that work together in a producer/consumer model to perform scans over partitioned tables: PartitionSelector, DynamicScan, and Sequence.

    • PartitionSelector computes all the child partition OIDs that satisfy the partition selection conditions given to it.
    • DynamicScan is responsible for passing tuples from the partitions identified by the PartitionSelector.
    • Sequence is an operator that executes its child operators and then returns the result of the last one.

    The placement of these PartitionSelectors in the query plan allows the Pivotal Query Optimizer to support more complex patterns, such as partition selection based on equality and range predicates, as well as dynamic partition elimination. Looking more closely at the SQL example above shows how all these components work together to obtain the desired result.

    QUERY PLAN
    ------------------------------------------------------------------------------
    Gather Motion 2:1  (slice2; segments: 2)  (cost=... rows=10 width=32)
      ->  Hash Join  (cost=... rows=5 width=32)
            Hash Cond: catalog_sales.date_id = date_dim.id
            ->  Dynamic Table Scan on catalog_sales (dynamic scan id: 1)  (cost=... rows=5000 width=16)
            ->  Hash  (cost=100.00..100.00 rows=50 width=4)
                  ->  Partition Selector for catalog_sales (dynamic scan id: 1)  (cost=... rows=50 width=4)
                        Filter: catalog_sales.date_id = date_dim.id
                        ->  Broadcast Motion 2:2  (slice1; segments: 2)  (cost=... rows=1 width=16)
                              ->  Sequence  (cost=... rows=1 width=16)
                                    ->  Partition Selector for date_dim (dynamic scan id: 2)  (cost=...)
                                          Partitions selected: 12 (out of 12)
                                    ->  Dynamic Table Scan on date_dim (dynamic scan id: 2)  (cost=...)
                                          Filter: month = 12
    

    The query plan is read from bottom to top. At the base of the plan, the first instances of PartitionSelector, DynamicScan, and Sequence are leveraged to set up the partition scans of the date_dim table. The second instance of PartitionSelector is where the magic happens. When the inner (right) side of the join is executed, tuples from date_dim are streamed into the PartitionSelector, where the partition selection function is applied to choose the partitions determined by the value of the join clause (date_id=date_dim.id). In this case, no Sequence operator is necessary, as the Hash Join operator enforces the left-to-right order of execution of its children. A minimal setup to reproduce this plan shape is sketched below.
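
    To make this concrete, here is a minimal, hypothetical setup for the example above. The table definitions and partition ranges are assumptions for illustration only, with catalog_sales range-partitioned on date_id; EXPLAIN then shows whether the PartitionSelector and DynamicScan operators appear in the plan.

    CREATE TABLE date_dim (id int, month int, year int)
    DISTRIBUTED BY (id);

    -- Hypothetical partitioning scheme: one child partition per range of date_id
    CREATE TABLE catalog_sales (sale_id int, date_id int, amount numeric)
    DISTRIBUTED BY (sale_id)
    PARTITION BY RANGE (date_id)
    (START (1) END (10001) EVERY (1000));

    EXPLAIN SELECT year FROM catalog_sales
    JOIN date_dim ON (date_id = date_dim.id)
    WHERE date_dim.month = 12 GROUP BY year;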

  • SubQuery Unnesting. This is probably the most significant enhancement in the Pivotal Query Optimizer, because of the heavy use of subqueries by the major BI/Reporting tools in the industry. A subquery is a query that is nested inside an outer query block, such as:
    SELECT * FROM part
     WHERE price > (SELECT avg(price) FROM part);

    To further illustrate the importance of subqueries to production analytic workloads, 20% of the 111 TPC-DS benchmark queries contain subqueries. Our testing across multiple platforms also shows that inefficient evaluation of these subqueries frequently produces queries that never terminate. The Pivotal Query Optimizer provides three main optimizations to handle subqueries:

    • Removing Unnecessary Nesting. In cases where subquery blocks are superfluous or redundant, the Pivotal Query Optimizer can simplify the query by removing those blocks. In this query example, the SQL SELECT statements provide redundant selection.
      SELECT * FROM (SELECT * FROM catalog_sales) AS foo, (SELECT * FROM date_dim) AS bar WHERE foo.date_id=bar.id;

      The optimizer simplifies the query by removing the unnecessary nesting. In the previous example, the optimizer can simplify the query to:

      SELECT * FROM catalog_sales, date_dim WHERE date_id=id;
    • Subquery Decorrelation. A Correlated Subquery (CSQ) is a subquery that uses values from the outer query. Correlation means the subquery block must be re-executed for each value produced by the outer query block, which can lead to poor performance if the optimizer is incapable of de-correlating the subquery by pulling up subquery predicates.
      SELECT * FROM retail_demo.order_lineitems o1 WHERE item_price > (SELECT avg(item_price) FROM retail_demo.order_lineitems o2 WHERE o2.product_category_id = o1.product_category_id);
      QUERY PLAN
      ------------------------------------------------------------------------------	
      Gather Motion 1:1  (slice3; segments: 1)  (cost=0.00..1808093.66 rows=457458 width=510)
       ->  Hash Join  (cost=0.00..1580257.51 rows=457458 width=510)
            Hash Cond: retail_demo.order_lineitems.product_category_id = retail_demo.order_lineitems.product_category_id
            Join Filter: retail_demo.order_lineitems.item_price > "inner".avg
            ->  Table Scan on order_lineitems  (cost=0.00..510078.69 rows=1024158 width=510)
            ->  Hash  (cost=50016.08..50016.08 rows=45 width=10)
                  ->  Broadcast Motion 1:1  (slice2; segments: 1)  (cost=0.00..50016.08 rows=45 width=10)
                        ->  Result  (cost=0.00..50014.64 rows=45 width=10)
                              ->  HashAggregate  (cost=0.00..50014.64 rows=45 width=10)
                                    Group By: retail_demo.order_lineitems.product_category_id
                                    ->  Redistribute Motion 1:1  (slice1; segments: 1)  (cost=0.00..5011.02 rows=45 width=10)
                                          Hash Key: retail_demo.order_lineitems.product_category_id
                                          ->  Result  (cost=0.00..50009.59 rows=45 width=10)
                                                ->  HashAggregate  (cost=0.00..50009.59 rows=45 width=10)
                                                      Group By: retail_demo.order_lineitems.product_category_id
                                                      ->  Table Scan on order_lineitems  (cost=0.00..10001.54 rows=1024158 width=10)
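
      In this plan, the optimizer has effectively de-correlated the subquery into a join against per-category averages. A hand-written equivalent of that rewrite (a sketch only, using the same tables) would be:

      SELECT o1.*
      FROM retail_demo.order_lineitems o1
      JOIN (SELECT product_category_id, avg(item_price) AS avg_price
            FROM retail_demo.order_lineitems
            GROUP BY product_category_id) a
        ON o1.product_category_id = a.product_category_id
      WHERE o1.item_price > a.avg_price;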
      
      
    • Conversion of Subqueries into Joins. This is a common method of handling subqueries, and many platforms depend on the user to make this optimization manually. Manual optimization introduces the risk of error, and because many BI tools generate these types of queries automatically, interpreting and rewriting them is difficult to make part of a standard workflow.
      SELECT * FROM retail_demo.order_lineitems ol WHERE Customer_ID NOT IN (SELECT Customer_ID FROM retail_demo.Customers_Dim);

      For example, the Pivotal Query Optimizer will dynamically rewrite this query to leverage an Anti Semi-Join (Not-In), as well as the DynamicScan capabilities discussed earlier.

      QUERY PLAN
      ------------------------------------------------------------------------------	
      Hash Left Anti Semi Join (Not-In)(cost=0.00..42973.78 rows=1 width=264)
        Hash Cond: order_lineitems.customer_id::bigint = customers_dim.customer_id
        -> Gather Motion 2:1 (slice1; segments: 2)(cost=0.00..2.52 rows=1 width=264)
              -> Sequence (cost=0.00..1.39 rows=1 width=264)
                   -> Result (cost=0.00..1.39 rows=1 width=264)
                        -> Function Scan on gp_partition_expansion (cost=10.00..100.00 rows=50 width=4)
                        -> Dynamic Table Scan on order_lineitems (partIndex: 0)(cost=0.00..0.13 rows=1 width=264)
        -> Hash (cost=3907.25..3907.25 rows=250000 width=8)
            -> Gather Motion 2:1 (slice2; segments: 2)(cost=0.00..3907.25 rows=250000 width=8)
                  -> Table Scan on customers_dim (cost=0.00..1953.12 rows=250000 width=8)
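
      Conceptually, and setting aside the NULL-handling subtleties that the Not-In variant of the anti-join exists to preserve, the rewrite is equivalent to this hand-written outer-join form (a sketch using the same tables):

      SELECT ol.*
      FROM retail_demo.order_lineitems ol
      LEFT JOIN retail_demo.customers_dim cd
        ON ol.customer_id = cd.customer_id
      WHERE cd.customer_id IS NULL;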
      
  • Common Table Expressions (CTE). CTEs are temporary result sets that are used for just one query and are typically heavily utilized in analytical workloads. For instance, in TPC-DS, 46 of the 111 queries leverage CTEs. Pivotal Query Optimizer introduces a new producer/consumer model for the WITH clause, much like the model introduced for Dynamic Partition Elimination. The model allows evaluating a complex expression once and consuming its output in multiple operators. This allows Greenplum Database to deal with much more complex CTEs because it is not forced to fully expand them, but instead deals with them dynamically.
    WITH v AS (SELECT a, b FROM T where c < 10) SELECT * FROM v AS v1, v AS v2 WHERE v1.a = v2.b;
    QUERY PLAN
    ------------------------------------------------------------------------------	
    Sequence  (cost=0.00..5.32 rows=1 width=32)
      ->  Shared Scan (share slice:id 0:0)  (cost=0.00..3.04 rows=1 width=1)
            ->  Materialize  (cost=0.00..3.04 rows=1 width=1)
                  ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=0.00..2.04 rows=1 width=16)
                        ->  Table Scan on t  (cost=0.00..1.03 rows=1 width=16)
                              Filter: c < 10
      ->  Hash Join  (cost=0.00..1.22 rows=1 width=32)
            Hash Cond: "outer".b = "inner".a
            ->  Shared Scan (share slice:id 0:0)  (cost=0.00..0.02 rows=1 width=16)
            ->  Hash  (cost=0.02..0.02 rows=1 width=16)
                  ->  Shared Scan (share slice:id 0:0)  (cost=0.00..0.02 rows=1 width=16)
    

    This is a simple example, but it demonstrates the consumer/producer concept well. The query plan shows a new Shared Scan operator. This operator represents the Common Table Expression and is leveraged wherever tuples from the CTE are required elsewhere in the query. This model allows Greenplum Database to provide superior performance not only for standard CTEs, but also for nested CTEs. It also allows Greenplum Database to more easily push predicates into the CTEs. For example:

    WITH v AS (SELECT a, sum(b) as s FROM T GROUP BY a) SELECT * FROM v as v1, v as v2, v as v3 WHERE v1.a < v2.a AND v1.s < v3.s AND v1.a = 10 AND v2.a = 20 AND v3.a = 30;

    In this example, pushing the predicates into the producer/consumer branches of the query can drastically reduce the amount of data the producer supplies to each consumer, which directly improves query times; a sketch of this rewrite follows.
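
    As a sketch of what that push-down achieves, the producer can be evaluated once with the combined filter applied up front, rather than materializing every group of T (the rewritten WITH block below is illustrative, not the optimizer's literal output):

    WITH v AS (SELECT a, sum(b) AS s FROM T WHERE a IN (10, 20, 30) GROUP BY a)
    SELECT * FROM v AS v1, v AS v2, v AS v3
    WHERE v1.a < v2.a AND v1.s < v3.s
      AND v1.a = 10 AND v2.a = 20 AND v3.a = 30;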

In summary, the interplay of these features is enabled by the Pivotal Query Optimizer's architecture, new operators, and the abstraction of components. The new modular design also allows each of these features to be implemented and tested with very little effect on the behavior of the other features in the optimizer. The addition of the Pivotal Query Optimizer provides significant benefits for Greenplum Database users:

  • Significant performance enhancements on a wide variety of complex analytical queries.
  • Improved manageability by reducing user-level management of database configuration parameters (GUCs).
  • Modularity that allows plan export and test-execution by support personnel, enabling faster reproduction of issues and an enhanced customer support experience.
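
To try the new optimizer yourself, it can be toggled per session through the optimizer server configuration parameter, the GUC Greenplum Database uses to switch between the legacy planner and the Pivotal Query Optimizer:

    -- Enable the Pivotal Query Optimizer for the current session
    SET optimizer = on;

    -- Confirm the active setting
    SHOW optimizer;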

About the Author

Dan Baskette

Dan is Director of Technical Marketing at Pivotal with over 20 years of experience in various pre-sales and engineering roles with Sun Microsystems, EMC Corporation, and Pivotal Software. In addition to his technical marketing duties, Dan is frequently called upon to roll-up his sleeves for various "Will this work?" type projects. Dan is an avid collector of Marvel Comics gear and you can usually find him wearing a Marvel shirt. In his spare time, Dan enjoys playing tennis and hiking in the Smoky Mountains.
