Data Compression
Columnar storage inherently makes compression (and decompression) more efficient than does row-oriented storage.
For row-oriented data, choosing a compression method that works well for the variety of data types in a row can be challenging, because compression for text and numeric data work best with different algorithms.
Column storage allows the algorithm to be chosen according to the data type and the data domain and range, even where the domain and range are not declared explicitly. For example, an alphabetic column GENDER defined as CHAR(1) will have only two actual values (M and F), and rather than storing an eight-bit byte, the value can be compressed to a single bit, and then the bit string can be further compressed.
Vector uses different types of algorithms from those found in most other products. Because Vector processes data so efficiently, compression—and in particular decompression—are designed to use little CPU and to reduce disk I/O. While on-disk compression ratios may be slightly lower than other products, overall performance is improved.
Compression in Vector is automatic, requiring no user intervention. Vector chooses a compression method for each column, per data block, according to its data type and distribution.
Data Type Storage Format and Compression Type
Data types are stored internally in a specific format. The type of compression used depends on the data type and distribution.
Vector can use any of the following compression methods:
• RLE (Run Length Encoding) – This method is efficient if many duplicate adjacent tuple values are present (such as in ordered columns with few unique values).
• PFOR (Patched Frame Of Reference) – This method encodes values as a small difference from a page-wide base value. The term "Patched" indicates that FOR is enhanced with a highly efficient way to handle values that fall outside the common frame of reference. PFOR is effective on any data distribution with some value distribution locality.
• PFOR-DELTA (delta encoding on top of PFOR) – In this method, the integers are made smaller by considering the differences between subsequent values. PFOR-DELTA is highly effective on ordered data.
• PDICT dictionary encoding (offsets into a dictionary of unique values) – This method is efficient if the value distribution is dominated by a limited amount of frequent values.
• LZ4 – This algorithm detects and encodes common fragments of different string values. It is particularly efficient for medium and long strings.
Most INTEGER, DECIMAL, and DATE and TIME types internally are compressed using any of the first four compression methods.
FLOAT and FLOAT4 types are stored without compression in Vector tables.
Character types (CHAR, VARCHAR, NCHAR, NVARCHAR) of lengths larger than one are stored internally as variable width strings. This data can be automatically compressed using either a per-block dictionary or LZ4 algorithm.
NULL values are stored internally as a single byte column and are compressed using the RLE method. The null indicator, if needed, is represented internally as a separate column. Loading and processing of nullable columns can be slower than non-nullable columns.
Last modified date: 11/09/2022