Writing in Parallel

Writing to a single file is an inherently serial operation. To avoid the scalability bottleneck this creates, parallel write operations in DataFlow expect a path to a directory, not a file, and produce one file for each partition of the data. These files are called fragments. Collectively they represent the entire output, just as the partitions collectively represent the entirety of the data being processed.

Writers are parallel by default and therefore produce fragments. This pairs well with the read operators in DataFlow, which read every file in a directory when a directory is given as the source. If a single output file is desired, the writer must be explicitly configured to produce one.
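
The fragment pattern can be illustrated outside DataFlow with a minimal Python sketch. It does not use the DataFlow API. The function names, the `part-00000.txt` naming scheme, and the sample data are all assumptions made for illustration. Each partition is written to its own file in a target directory, so no two workers contend for the same file, and the directory is then read back as a single data set.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_fragment(out_dir, index, rows):
    """Write one partition to its own fragment file inside out_dir."""
    # Hypothetical naming scheme; the real fragment names may differ.
    fragment = out_dir / f"part-{index:05d}.txt"
    fragment.write_text("".join(f"{row}\n" for row in rows), encoding="utf-8")
    return fragment


def write_parallel(out_dir, partitions):
    """Write every partition concurrently, one fragment per partition."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(write_fragment, out_dir, i, rows)
            for i, rows in enumerate(partitions)
        ]
        return [future.result() for future in futures]


def read_directory(src_dir):
    """Read all fragments in src_dir, treating the directory as one source."""
    rows = []
    for fragment in sorted(src_dir.iterdir()):
        rows.extend(fragment.read_text(encoding="utf-8").splitlines())
    return rows


# Sample data: three partitions of a six-row data set.
partitions = [["a", "b"], ["c"], ["d", "e", "f"]]

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "output"
    fragments = write_parallel(out_dir, partitions)
    fragment_names = sorted(path.name for path in fragments)
    rows_read = read_directory(out_dir)
```

In this sketch the output path names a directory, and the number of files in it equals the number of partitions. Reading the directory recovers all rows without the reader needing to know how many fragments were written.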