The latest release of Pivotal Greenplum Database, version 4.3.3, adds a number of notable updates, including Delta Compression. This update adds another way to compress the data in a column to save space. Internal tests and customer data have demonstrated well over 100x compression on 10G worth of TIME values from a dataset.
Pivotal Greenplum DB is a Massively Parallel Processing (MPP) database, which means it spreads data over multiple nodes to harness the compute and IO power of a cluster to process petabyte-scale data sets. It also embraces polymorphic storage: the ability to store data in multiple formats within one logical table. A table can have row-oriented partitions side by side with column-oriented partitions. In addition, various compression algorithms can be applied at the table and column level.
For example, the latest three months of data in the table can be row-oriented, the next three months columnar and uncompressed, and the following three or more months columnar with compression. As far as the end user is concerned, all data is queried the same way, and no query changes are required as data moves through its lifecycle.
Delta Compression adds a new approach to compression. In addition to standard lzo and zlib column compression, Pivotal Greenplum DB has been able to perform Run Length Encoding (RLE) compression for a while now. To understand RLE, imagine a table of dinner orders. One of its columns holds the dish ordered, and when you change from row-based to columnar storage, the data stored for that column looks like this: Fish, Fish, Fish, Fish, Fish, Fish. RLE compression stores that same data as Fish(6).
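To make the Fish(6) idea concrete, here is a minimal Python sketch of run length encoding. It only illustrates the concept and is not how Greenplum implements RLE internally; the function names `rle_encode` and `rle_decode` are made up for this example.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of consecutive equal values into (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


orders = ["Fish"] * 6
encoded = rle_encode(orders)
print(encoded)  # [('Fish', 6)]

# Decoding restores the original column exactly.
assert rle_decode(encoded) == orders
```

Six stored values become a single pair, and the saving grows with the length of each run.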
For data sets with a large number of repeating values, this can save a large amount of space. Delta Compression extends the idea to data types such as integers and times by storing each value as its offset from the previous value. For example, the dates 2014-01-02, 2014-01-03, 2014-01-04 would be stored as 2014-01-02, +1, +1 with Delta Compression.
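The date example can be sketched in a few lines of Python. This is an illustration of delta encoding in general, under the assumption that offsets are counted in whole days; the helper names `delta_encode` and `delta_decode` are hypothetical and say nothing about Greenplum's actual storage format.

```python
from datetime import date, timedelta


def delta_encode(dates):
    """Return the first date plus each later date's offset (in days) from its predecessor."""
    if not dates:
        return None, []
    deltas = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    return dates[0], deltas


def delta_decode(first, deltas):
    """Rebuild the original dates by accumulating the offsets."""
    if first is None:
        return []
    result = [first]
    for offset in deltas:
        result.append(result[-1] + timedelta(days=offset))
    return result


dates = [date(2014, 1, 2), date(2014, 1, 3), date(2014, 1, 4)]
first, deltas = delta_encode(dates)
print(first.isoformat(), deltas)  # 2014-01-02 [1, 1]

assert delta_decode(first, deltas) == dates
```

The offsets are small numbers that repeat often, which is what makes them a good input for further compression.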
If we take this and combine it with RLE on the following data set:
2014-01-02, 2014-01-02, 2014-01-02, 2014-01-03, 2014-01-03, 2014-01-03, 2014-01-04, 2014-01-04, 2014-01-04
We end up with:
2014-01-02 (3), +1 (3), +1 (3)
After applying both Delta Compression and RLE, we compress the entire block with zlib.
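The three steps above can be sketched end to end in Python. To reproduce the "2014-01-02 (3), +1 (3), +1 (3)" result, this sketch groups repeated values into runs first and then expresses each run's value as an offset from the previous run; that ordering, the `rle_then_delta` helper, and the `repr`-based serialization are all assumptions made for illustration, not a description of Greenplum's on-disk format.

```python
import zlib
from datetime import date
from itertools import groupby


def rle_then_delta(dates):
    """Group equal dates into runs, then store each run's date as a day offset
    from the previous run's date. The first run keeps its full date."""
    runs = [(value, len(list(group))) for value, group in groupby(dates)]
    encoded = []
    previous = None
    for value, count in runs:
        if previous is None:
            encoded.append((value.isoformat(), count))
        else:
            encoded.append(((value - previous).days, count))
        previous = value
    return encoded


dates = [date(2014, 1, day) for day in (2, 2, 2, 3, 3, 3, 4, 4, 4)]
encoded = rle_then_delta(dates)
print(encoded)  # [('2014-01-02', 3), (1, 3), (1, 3)]

# Hypothetical serialization of the encoded block, followed by zlib compression.
payload = repr(encoded).encode("utf-8")
block = zlib.compress(payload)

# zlib is lossless, so the encoded block survives a round trip.
assert zlib.decompress(block) == payload
```

On nine values the final zlib pass saves little or nothing; its benefit shows up on large blocks, where the encoded output still contains repeated patterns.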
From customer data and testing, we are seeing well over 100x compression on 10G worth of TIME values from a customer's dataset. Even more impressive is compression of up to 5000x on a similar 10G sequence column. In addition, Pivotal Greenplum Database 4.3.3 adds the following features:
- Netbackup integration
- PL/R update to 3.1
- Fuzzy String Match module
About the Author
Scott Kahler