Partitioning very large Pivotal Greenplum (GPDB) or Greenplum Database tables should improve query performance—and everyone likes better performance. Typically, you conveniently partition your data on dates, numeric ranges or lists of values. But, what should you do about future dates, expanding numeric ranges or unexpected values? Specifying a default partition on your table looks tempting—it acts as a simple catch-all container making your job easy. Surely default partitions are the solution.
Well, as the adage goes, “if it’s too good to be true, it probably is.”
Default Partitions And Pivotal Query Optimizer
Default partitions force GPDB’s Pivotal Query Optimizer (PQO) to operate counter to its mission—produce plans that scan the least amount of data to accelerate query performance. Instead, the presence of a default partition on a table guarantees that the default partition, along with qualifying partitions, will be scanned to ensure a correct result set regardless of whether the default partition contains data required to satisfy the query. Worse, experience shows that default partitions typically contain significantly more data than expected. This triggers extra I/Os unnecessarily, inhibiting your queries from reaching their maximum performance potential when the default partition has no data to satisfy the query. Unless the partition is empty, this will hinder performance.
You might wonder why default partitions require a scan every time. The answer lies in what happens behind the scenes when partitions are created. Each non-default partition has a check constraint generated—indicating the starting and stopping values for each partition. PQO uses check constraints to determine which partitions contain data that satisfy the query. However, default partitions do not have check constraints; consequently, PQO does not know in advance if data on the default partition will fulfill a query. As such, PQO stipulates a scan of the default partition as part of the overall query plan.
Alternatives To Default Partitions
Okay, now that you know why default partitions may hinder query performance, what do you do as an alternative? Several solutions exist. Perhaps the most straightforward approach, albeit manual in nature, starts with creating the last partition with dates or numerical ranges greater than current data values. In time, you can split the ending partition into two using the ALTER TABLE … SPLIT PARTITION FOR command. This approach allows you to continually add partitions with newly specified ranges after the data has arrived.
CREATE TABLE bank_data (id int, age int, date date, euribor float8) DISTRIBUTED BY (id) PARTITION BY RANGE (date) ( PARTITION yr2000 START (date '2000-01-01') INCLUSIVE, PARTITION yr2001 START (date '2001-01-01') INCLUSIVE, ... PARTITION yr2016 START (date '2016-01-01') INCLUSIVE, PARTITION yrs20172030 START (date '2017-01-01') INCLUSIVE, END (date '2031-01-01') EXCLUSIVE; ALTER TABLE bank_data SPLIT PARTITION FOR ('yrs20172030') AT ('2018-01-01') INTO (PARTITION yr2017, PARTITION yrs20182030);
Another option involves defining multiple partitions with ranges greater than the current data values in anticipation of future data. If partitioned tables already exist, new partitions can be added using the ALTER TABLE … ADD PARTITION command so long as the partitioned tables do not contain default partitions. Since no data will meet the criteria of the future range partition’s check constraint, PQO will not select the partitions; thus, query performance does not degrade. The number of future range partitions, however, should be kept in moderation for administrative management purposes.
CREATE TABLE bank_data (id int, age int, date date, euribor float8) DISTRIBUTED BY (id) PARTITION BY RANGE (year) ( START (1980) END (2050) EVERY (1) );
CREATE TABLE bank_data (id int, age int, date date, euribor float8) DISTRIBUTED BY (id) PARTITION BY RANGE (date) ( PARTITION yr2000 START (date '2000-01-01') INCLUSIVE, PARTITION yr2001 START (date '2001-01-01') INCLUSIVE, ... PARTITION yr2016 START (date '2016-01-01') INCLUSIVE, END (date '2017-01-01') EXCLUSIVE; ALTER TABLE bank_data ADD PARTITION START (date '2017-01-01') INCLUSIVE END (date '2018-01-01') EXCLUSIVE;
When utilizing list partitioning, alternatives to default partitions require a little more creativity than simply splitting or adding partitions. When possible, choose range partitioning over list partitioning. If the use of list partitioning is unavoidable, consider loading data into partitioned tables using the SEGMENT REJECT LIMIT clause. After loading, use the built-in SQL function gp_read_error_log() to review records failing partition check constraints. Information from the error log may highlight additional list values to partition on. This approach does, however, require reloading the data. Also, additional partitions need to be added when new values arrive in the list (e.g. a newly created sales region).
So, when should default partitions be used? Ideally never. But, when manual intervention is not an option or when it is impossible to predict values—typically associated with list partitions—proceed with using default partitions. As an example (and as an alternative to the SEGMENT REJECT LIMIT option noted above) you can start with a default partition when loading your data to prevent the load from failing on rejected data. Then after loading has completed, the default partition can be split into two based on the values found in the default partition. (Just as a reminder, you can query the default partition—a child table—just like you query any regular table. This allows you to identify the values found only in the default partition.) Splitting the default partition yields a new check constraint on the new partition and simultaneously keeps the default partition. Once all the values have been partitioned out of the default partition (meaning the use of SPLIT PARTITION is an iterative process), the default partition can simply be dropped.
gpadmin=# CREATE TABLE test (id int, city varchar(20), state char(2)) gpadmin-# PARTITION BY LIST (state) gpadmin-# (PARTITION MA VALUES ('MA'), gpadmin(# PARTITION CA VALUES ('CA'), gpadmin(# DEFAULT PARTITION other_states); gpadmin=# ALTER TABLE test SPLIT DEFAULT PARTITION AT ('OR') gpadmin-# INTO (PARTITION OREGON, DEFAULT PARTITION); gpadmin=# ALTER TABLE test DROP DEFAULT PARTITION; NOTICE: dropped partition "other_states" for relation "test" ALTER TABLE
In the all-new Greenplum Developer course, you can learn more about partitioning and other GPDB concepts—ranging from data loading strategies to data modeling and more. Best of all, Pivotal Education offers the Greenplum Developer course in a wide array of training modalities, including our all-new eLearning option that provides the same in-depth and technical content found in a traditional classroom and is optimized for self-study.
About the AuthorMore Content by Charles Killam