Graphs
Whether a graph is intended to be executed on a single multi-core machine or on a distributed cluster, it is composed in exactly the same way. The target environment—a single multi-core machine or a distributed cluster—is not specified until execution time. As indicated before, the details of how to execute in a given environment are left to the engine.
When a logical graph is executed, the DataFlow engine analyzes it and determines an execution plan based on the target execution environment. Each logical operator in DataFlow may potentially be executed in parallel, assigning subsets of the input to individual workers. Dividing the data into subsets is known as partitioning; the subset is known as a partition.
Logical operators provide input to this process by providing metadata—information about their input requirements and output guarantees. The processing of a set of partitions collectively represents the specified operation on the entire data set. The DataFlow framework manages partitions automatically, moving data between workers as required as it flows through the logical graph.
The analysis results in a sequence of phases, each itself represented as a dataflow graph. These are the physical graphs that will perform the actual work involved in producing the result described by the logical graph. A physical graph is the processing associated with a partition; there will be a copy of the phase’s graph executed for each partition of the data. An example of this conversion from logical to physical graphs is illustrated below.
As shown in the illustration:
• The original simple pipeline has been divided into multiple physical graphs.
• The logical graph had no explicit data parallelism, but the engine produced an execution plan that leverages it. In this case, we see four partitions of the data.
• The framework automatically injected movement of data between partitions, writing the data to disk in between phases as needed.
Despite the discussion here about physical graphs, users usually do not need to be aware of physical graphs; physical graphs only become important when debugging. Because logical graphs are relatively more important in everyday usage, it is common to simply use the term graph to refer to a logical graph.
For more information about building and executing graphs, see
Composing an Application,
Compilation Output, and
Executing an Application.