Supported Compression Formats
DataFlow ships with several compression formats available by default. You can use any of them to read or write files, either by declaring the format explicitly or by letting DataFlow detect it from the file suffix. Use the CompressionFormats factory class to obtain a specific CompressionFormat implementation by passing the appropriate format identifier.
Obtaining a CompressionFormat
CompressionFormat format = CompressionFormats.lookupFormat("gzip");
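Suffix-based detection can be pictured as a lookup from file suffix to format identifier. The following is a minimal sketch of that idea, not DataFlow's implementation: the SuffixDetection class, its detectFormatId method, the suffix table, and the "bzip2" and "snappy" identifiers are all assumptions made for illustration. Only the "gzip" identifier appears in the example above.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of DataFlow: maps a file suffix to a format identifier.
final class SuffixDetection {
    // Assumed suffix table. Only the "gzip" identifier is confirmed by this document.
    private static final Map<String, String> SUFFIX_TO_FORMAT = Map.of(
            ".gz", "gzip",
            ".bz2", "bzip2",
            ".snappy", "snappy");

    // Returns the identifier whose suffix matches the file name, or empty if none does.
    static Optional<String> detectFormatId(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return SUFFIX_TO_FORMAT.entrySet().stream()
                .filter(entry -> lower.endsWith(entry.getKey()))
                .map(Map.Entry::getValue)
                .findFirst();
    }

    public static void main(String[] args) {
        System.out.println(detectFormatId("events.csv.gz")); // prints Optional[gzip]
        System.out.println(detectFormatId("events.csv"));    // prints Optional.empty
    }
}
```

An identifier found this way could then be passed to CompressionFormats.lookupFormat, while an empty result would correspond to reading the file uncompressed.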
When reading files, DataFlow segments the data into “splits.” Each split is read, parsed, and processed independently. Splitting files in this manner distributes I/O and improves the performance of distributed applications. Some compression formats support splitting the compressed data into segments for distributed reading. Others require the complete file to be processed for decompression to succeed.
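To make the idea of splits concrete, the sketch below divides a file of a given size into byte ranges. It is an illustration only, under assumptions: SplitPlanner, Split, planSplits, and the splittable flag are hypothetical names, and DataFlow's actual split logic is not described in this document. A file whose format cannot be split is planned as a single range covering the whole file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of split planning, not DataFlow's implementation.
final class SplitPlanner {
    // A contiguous byte range of a file.
    static final class Split {
        final long offset;
        final long length;

        Split(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return "[" + offset + ", +" + length + ")";
        }
    }

    // Plans byte ranges for a file. Non-splittable files yield one range for the whole file.
    static List<Split> planSplits(long fileSize, long splitSize, boolean splittable) {
        if (fileSize < 0 || splitSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and splitSize > 0");
        }
        List<Split> splits = new ArrayList<>();
        if (fileSize == 0) {
            return splits; // nothing to read
        }
        if (!splittable) {
            splits.add(new Split(0, fileSize));
            return splits;
        }
        for (long offset = 0; offset < fileSize; offset += splitSize) {
            // The last range may be shorter than splitSize.
            splits.add(new Split(offset, Math.min(splitSize, fileSize - offset)));
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(planSplits(250, 100, true));  // prints [[0, +100), [100, +100), [200, +50)]
        System.out.println(planSplits(250, 100, false)); // prints [[0, +250)]
    }
}
```

With a splittable input, each range can be handed to a different reader. With a non-splittable input, one reader has to process the entire file, which limits how much the read can be parallelized.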
The following sections describe each compression format and its support for distributed (split) read operations.
gzip
gzip is a well-known, freely available compression format. The JDK provides direct support for gzip.
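As an illustration of the JDK support, the sketch below compresses and decompresses a byte array using java.util.zip.GZIPOutputStream and java.util.zip.GZIPInputStream. It uses only the JDK and does not show how DataFlow itself reads or writes gzip data. The GzipRoundTrip class and its method names are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative JDK-only example. The class and method names are not part of DataFlow.
final class GzipRoundTrip {
    // Compresses the given bytes into gzip format.
    static byte[] compress(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the gzip stream writes the trailer, so read the buffer only afterwards.
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decompresses gzip-formatted bytes back to the original content.
    static byte[] decompress(byte[] compressed) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gzip.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] original = "id,name\n1,alpha\n2,beta\n".getBytes(StandardCharsets.UTF_8);
        byte[] restored = decompress(compress(original));
        System.out.print(new String(restored, StandardCharsets.UTF_8));
    }
}
```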
bzip2
bzip2 is a well-known, freely available compression format. It typically achieves a higher compression ratio than other formats, but it is also slower at both compression and decompression.
snappy
Snappy is a freely available compression format designed to achieve a modest level of compression at high speed.