Data Type Storage Format and Compression Type
Each data type is stored internally in a specific format. The compression method used depends on the data type and on the distribution of the data values.
Vector can use any of the following compression methods:
RLE (Run Length Encoding) – This method is efficient if many duplicate adjacent tuple values are present (such as in ordered columns with few unique values).
PFOR (Patched Frame Of Reference) – This method encodes each value as a small difference from a page-wide base value. The term "Patched" indicates that FOR is enhanced with a highly efficient way to handle values that fall outside the common frame of reference. PFOR is effective on any data distribution that exhibits some value locality.
PFOR-DELTA (delta encoding on top of PFOR) – This method makes integers smaller by storing the differences between consecutive values. PFOR-DELTA is highly effective on ordered data.
PDICT dictionary encoding (offsets into a dictionary of unique values) – This method is efficient if the value distribution is dominated by a limited number of frequent values.
LZ4 – This algorithm detects and encodes common fragments of different string values. It is particularly efficient for medium and long strings.
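To make the first and fourth methods in the list above concrete, the following Python sketch shows run-length encoding and dictionary encoding in their simplest textbook form. It is an illustration only and does not reflect Vector's internal block layout; the function names (rle_encode, dict_encode, and so on) are hypothetical and chosen for this example.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal adjacent values into a (value, run_length) pair."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


def dict_encode(values):
    """Replace each value with its offset into a dictionary of unique values."""
    dictionary = []   # unique values, in order of first appearance
    positions = {}    # value -> offset into the dictionary
    codes = []        # one small integer per input value
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        codes.append(positions[value])
    return dictionary, codes


def dict_decode(dictionary, codes):
    """Look each offset up in the dictionary to recover the original values."""
    return [dictionary[code] for code in codes]


# An ordered column with few unique values compresses to three pairs.
print(rle_encode(["NL", "NL", "NL", "UK", "UK", "US"]))

# A column dominated by a few frequent values becomes small offsets.
print(dict_encode(["red", "blue", "red", "red", "blue"]))
```

RLE pays off only when equal values are adjacent, which is why it suits ordered columns. Dictionary encoding does not depend on order, only on the number of distinct values being small.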
Most INTEGER, DECIMAL, DATE, and TIME types are compressed internally using any of the first four compression methods.
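The two frame-of-reference methods can be illustrated on integer data with a simplified Python sketch. Several details here are assumptions made for the example and are not taken from Vector: the base is taken as the minimum value, the bit width is passed in by the caller, and out-of-range values ("patches") are kept in a plain position-to-value map. All function names are hypothetical.

```python
def pfor_encode(values, bit_width):
    """Encode a non-empty list of integers as offsets from a base value.

    Offsets that do not fit in bit_width bits are stored as patches.
    """
    base = min(values)          # assumption: use the minimum as the base
    limit = 1 << bit_width      # offsets must fall in [0, limit)
    offsets = []
    patches = {}                # position -> original value
    for position, value in enumerate(values):
        offset = value - base
        if offset < limit:
            offsets.append(offset)
        else:
            offsets.append(0)   # placeholder slot, overwritten on decode
            patches[position] = value
    return base, offsets, patches


def pfor_decode(base, offsets, patches):
    """Rebuild the values from base + offset, then apply the patches."""
    values = [base + offset for offset in offsets]
    for position, value in patches.items():
        values[position] = value
    return values


def delta_encode(values):
    """Keep the first value, then the difference between consecutive values."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Recover the original values as a running sum of the deltas."""
    values, total = [], 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values


# PFOR: five values fit in 2-bit offsets, one outlier becomes a patch.
print(pfor_encode([100, 101, 103, 100, 9000, 102], bit_width=2))

# PFOR-DELTA: delta-encode ordered data first, then apply PFOR.
ordered = [1000, 1002, 1003, 1007, 1008]
print(pfor_encode(delta_encode(ordered), bit_width=3))
```

On ordered data the deltas are small even when the values themselves are large, so the combined scheme needs fewer bits per value than PFOR alone. In this sketch the leading value of the delta sequence ends up as a patch.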
FLOAT and FLOAT4 types are stored without compression in Vector tables.
Character types (CHAR, VARCHAR, NCHAR, NVARCHAR) with a length greater than one are stored internally as variable-width strings. This data can be compressed automatically using either a per-block dictionary or the LZ4 algorithm.
NULL values are stored internally as a single-byte column and are compressed using the RLE method. The null indicator, if needed, is represented internally as a separate column. Loading and processing nullable columns can be slower than loading and processing non-nullable columns.
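The idea of a separate null indicator column can be sketched in Python as follows. This is a conceptual illustration, not Vector's actual storage code: the function names are hypothetical, and the filler value written into the data column at NULL positions is an assumption made for the example.

```python
from itertools import groupby


def split_nullable(column, filler=0):
    """Split a nullable column into a value column and a one-byte-per-row
    null indicator column (1 = NULL, 0 = not NULL)."""
    indicators = bytes(1 if value is None else 0 for value in column)
    # Assumption: NULL positions in the value column hold an arbitrary filler.
    values = [filler if value is None else value for value in column]
    return values, indicators


def rle(data):
    """Run-length encode a sequence as (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(data)]


values, indicators = split_nullable([7, 8, None, None, None, 9])
print(values)           # data column with filler at NULL positions
print(rle(indicators))  # indicator column collapses into a few runs
```

Because the indicator column usually consists of long runs of the same byte, RLE keeps its storage cost low. The extra column still has to be written and read, which is consistent with nullable columns being slower to load and process.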