Reading in Parallel
The file reader operators in DataFlow can read source files in parallel. When reading in parallel, each partition of the graph processes part of the file. A file split (or, more commonly, split) is a contiguous segment of a data file spanning a range of bytes. DataFlow performs parallel reads by breaking files into a number of splits and assigning them to different partitions. Each partition parses the data contained within its splits, so the records of the file end up spread across all partitions.
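The sketch below illustrates the idea of byte-range splits. The names (FileSplit, computeSplits) are purely illustrative and are not part of the DataFlow API; it simply shows how a file of a given size can be divided into contiguous byte ranges, one per partition.

```java
// Illustrative only: divide a file into contiguous byte-range splits.
import java.util.ArrayList;
import java.util.List;

public class SplitExample {

    /** A contiguous segment of a file: a path, a starting byte offset, and a length. */
    record FileSplit(String path, long offset, long length) {}

    /** Break a file of the given size into roughly equal byte-range splits. */
    static List<FileSplit> computeSplits(String path, long fileSize, int splitCount) {
        List<FileSplit> splits = new ArrayList<>();
        long splitSize = (fileSize + splitCount - 1) / splitCount;   // ceiling division
        for (long offset = 0; offset < fileSize; offset += splitSize) {
            long length = Math.min(splitSize, fileSize - offset);
            splits.add(new FileSplit(path, offset, length));
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 1 GB file broken into 8 splits, for example one per partition.
        computeSplits("/data/input.csv", 1L << 30, 8)
            .forEach(System.out::println);
    }
}
```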
This concept of splits is the same as the one found in Hadoop. As in MapReduce, locality information can be used to improve performance by assigning splits to partitions with local access to the data. When running DataFlow against HDFS, the engine takes the locality of splits into consideration when assigning them to partitions.
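As a rough sketch of locality-aware assignment (again, hypothetical names, not the DataFlow engine's actual logic): given the hosts that store a split's bytes, as HDFS block locations report them, a partition running on one of those hosts is preferred, falling back to an arbitrary partition when no local one exists.

```java
// Illustrative only: prefer a partition whose host also stores the split's data.
import java.util.List;
import java.util.Map;

public class LocalityExample {

    /** Pick a partition for a split given the hosts that hold the split's bytes. */
    static int assignPartition(List<String> splitHosts,
                               Map<Integer, String> partitionHosts,
                               int fallbackPartition) {
        for (Map.Entry<Integer, String> e : partitionHosts.entrySet()) {
            if (splitHosts.contains(e.getValue())) {
                return e.getKey();        // local read: partition runs where the data lives
            }
        }
        return fallbackPartition;         // no local partition available: remote read
    }

    public static void main(String[] args) {
        Map<Integer, String> partitions = Map.of(0, "node-a", 1, "node-b", 2, "node-c");
        int chosen = assignPartition(List.of("node-b", "node-d"), partitions, 0);
        System.out.println("split assigned to partition " + chosen);  // partition 1 (node-b)
    }
}
```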
Parallel reading of individual files may sometimes be disabled. This can happen for a number of reasons, including at the request of the user. In that case, instead of assigning splits to partitions, DataFlow assigns whole files to partitions. Thus, when reading multiple files, at least file-level parallelism can still be achieved.
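A minimal sketch of the file-level fallback, assuming a simple round-robin distribution (the actual assignment strategy used by DataFlow may differ): each file, rather than each split, becomes the unit of work handed to a partition.

```java
// Illustrative only: distribute whole files across partitions round-robin.
import java.util.ArrayList;
import java.util.List;

public class FilePerPartitionExample {

    /** Assign each file to a partition index in round-robin order. */
    static List<List<String>> assignFiles(List<String> files, int partitionCount) {
        List<List<String>> assignments = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            assignments.add(new ArrayList<>());
        }
        for (int i = 0; i < files.size(); i++) {
            assignments.get(i % partitionCount).add(files.get(i));
        }
        return assignments;
    }

    public static void main(String[] args) {
        List<String> files = List.of("a.csv", "b.csv", "c.csv", "d.csv", "e.csv");
        List<List<String>> byPartition = assignFiles(files, 3);
        for (int p = 0; p < byPartition.size(); p++) {
            System.out.println("partition " + p + ": " + byPartition.get(p));
        }
    }
}
```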
Compressed Files
Most compression formats cannot be split at arbitrary byte offsets; they must be read from the beginning of the file (or from some format-specific block boundary). If a source file is compressed, DataFlow does not parallelize the read of that file; the file is treated as a single split. Note, however, that if multiple files are being read, the individual files are still read concurrently.
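The sketch below ties the two cases together under the same illustrative assumptions as above. Detecting compression by file extension is purely for demonstration and is not how DataFlow identifies compressed inputs: a compressed file yields a single split covering the whole file, while an uncompressed file yields multiple byte-range splits.

```java
// Illustrative only: one split for a compressed file, many for an uncompressed one.
import java.util.ArrayList;
import java.util.List;

public class CompressedSplitExample {

    record FileSplit(String path, long offset, long length) {}

    static boolean isCompressed(String path) {
        // Demonstration-only check based on extension.
        return path.endsWith(".gz") || path.endsWith(".bz2") || path.endsWith(".zip");
    }

    static List<FileSplit> splitsFor(String path, long fileSize, long targetSplitSize) {
        List<FileSplit> splits = new ArrayList<>();
        if (isCompressed(path)) {
            // Cannot seek into the middle of the compressed stream: one split for the whole file.
            splits.add(new FileSplit(path, 0, fileSize));
        } else {
            for (long offset = 0; offset < fileSize; offset += targetSplitSize) {
                splits.add(new FileSplit(path, offset, Math.min(targetSplitSize, fileSize - offset)));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(splitsFor("/data/log.gz", 500_000_000L, 128_000_000L));   // 1 split
        System.out.println(splitsFor("/data/log.csv", 500_000_000L, 128_000_000L));  // 4 splits
    }
}
```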