Supported Compression Formats
DataFlow ships with several compression formats available by default. You can use any of them to read or write files, either by declaring the format explicitly or by letting DataFlow detect it from the file suffix. To obtain a specific CompressionFormat implementation, pass the appropriate format identifier to the CompressionFormats factory class.
Obtaining a CompressionFormat
CompressionFormat format = CompressionFormats.lookupFormat("gzip");
When reading files, DataFlow segments the data into “splits.” Each split is read, parsed, and processed independently. Splitting files in this manner allows I/O to be distributed, which improves the performance of distributed applications. Only some compression formats support splitting compressed data into segments for distributed reading; the others require the complete file to be processed for decompression to succeed.
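To make the idea of a split concrete, the following minimal sketch divides a file of a given length into contiguous byte ranges. It is only an illustration: the `SplitSketch` class, its `plan` method, and the fixed split size are hypothetical names and choices, not part of the DataFlow API and not necessarily how DataFlow computes its splits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the DataFlow API.
final class SplitSketch {

    // Returns {offset, length} pairs that cover [0, fileLength) without gaps or overlap.
    static List<long[]> plan(long fileLength, long splitSize) {
        if (fileLength < 0 || splitSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and splitSize must be > 0");
        }
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            // The final split may be shorter than splitSize.
            long length = Math.min(splitSize, fileLength - offset);
            splits.add(new long[] {offset, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        // Plan 4-byte splits over a 10-byte file and print each range.
        for (long[] split : plan(10, 4)) {
            System.out.println("offset=" + split[0] + " length=" + split[1]);
        }
    }
}
```

Each range could then be handed to a separate reader. For a format without split support, a reader would instead have to treat the whole file as a single range.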
The following sections describe each compression format and whether it supports distributed (split) read operations.
gzip
gzip is a well-known public domain compression format. The JDK provides direct support for gzip.
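As a rough illustration of the JDK support mentioned above, the following standalone sketch compresses and decompresses a byte array with the standard `java.util.zip.GZIPOutputStream` and `java.util.zip.GZIPInputStream` classes. It does not use any DataFlow classes; the `GzipRoundTrip` class and its method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Standalone JDK example; it does not use any DataFlow classes.
final class GzipRoundTrip {

    static byte[] compress(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed at this point, so the gzip trailer has been written.
        return buffer.toByteArray();
    }

    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            // Read from the start of the stream until it is exhausted.
            while ((n = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "hello, gzip".getBytes(StandardCharsets.UTF_8);
        byte[] restored = decompress(compress(original));
        System.out.println("round trip matches: " + Arrays.equals(original, restored));
    }
}
```

Note that the decompressor reads from the beginning of the stream through to the end, which matches the whole-file processing described for formats without split support.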
Implementation Class:
Format Identifier: gzip
Recognized suffixes: .gz, .z
Split Support: No
bzip2
bzip2 is a well-known public domain compression format. It typically achieves a higher degree of compression than the other formats, but it is also slower at both compression and decompression.
Implementation Class:
Format Identifier: bzip2
Recognized suffixes: .bz2, .bz
Split Support: Yes
snappy
Snappy is a public domain compression format designed to achieve a modest level of compression at high speeds.
Implementation Class:
Format Identifier: snappy
Recognized suffixes: .snz
Split Support: No
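The suffix-based detection described at the top of this page can be pictured as a simple lookup from file suffix to format identifier. The sketch below is an assumption-laden illustration built only from the suffixes and identifiers listed in the sections above: the `SuffixDetectionSketch` class and `formatIdFor` method are hypothetical, and the case-sensitive matching is a choice made for this example. DataFlow's own detection may behave differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper; DataFlow's own suffix detection may work differently.
final class SuffixDetectionSketch {

    // Suffixes and format identifiers as listed on this page.
    private static final Map<String, String> SUFFIX_TO_FORMAT = new LinkedHashMap<>();
    static {
        SUFFIX_TO_FORMAT.put(".gz", "gzip");
        SUFFIX_TO_FORMAT.put(".z", "gzip");
        SUFFIX_TO_FORMAT.put(".bz2", "bzip2");
        SUFFIX_TO_FORMAT.put(".bz", "bzip2");
        SUFFIX_TO_FORMAT.put(".snz", "snappy");
    }

    // Returns the format identifier for a file name, or empty if no suffix matches.
    // Matching is case-sensitive here; that is an assumption of this sketch.
    static Optional<String> formatIdFor(String fileName) {
        for (Map.Entry<String, String> entry : SUFFIX_TO_FORMAT.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"data.csv.gz", "part-0001.bz2", "events.snz", "plain.txt"}) {
            System.out.println(name + " -> " + formatIdFor(name).orElse("(no compression detected)"));
        }
    }
}
```

An identifier returned this way could then be passed to CompressionFormats.lookupFormat, as in the example near the top of this page.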
Last modified date: 12/09/2024