Difference Between Local and Distributed Execution
DataFlow applications run locally by default. In local mode, the application and its I/O are parallelized on the local machine. Executing an application in a distributed environment requires:
• A Hadoop cluster.
Note: Currently, distributed DataFlow execution is supported only on a Hadoop cluster.
• Integration with HDFS.
The optimal way to execute a distributed DataFlow job is to use data stored in HDFS as the input to the application. DataFlow distributes the I/O across the cluster by applying a location optimization that limits network I/O, reading HDFS blocks on the nodes where they reside. When the application is compiled and executed, the distributed portions of the physical plan are sent to the worker nodes for execution; some portions of the graph may still run locally.
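As a point of reference, the following sketch shows a graph whose source operator reads its input directly from an HDFS path, which is what allows the location optimization to take effect. The class names (LogicalGraphFactory, ReadDelimitedText) and the hdfs:// URI are illustrative assumptions based on the DataFlow operator library and may differ in your version; substitute your own source operator and path.
Reading application input from HDFS (sketch)
// Sketch only: the class names and HDFS path below are assumptions; adjust to your environment.
import com.pervasive.datarush.graphs.LogicalGraph;
import com.pervasive.datarush.graphs.LogicalGraphFactory;
import com.pervasive.datarush.operators.io.textfile.ReadDelimitedText;

LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("hdfs-input-example");

// Point the reader at data already stored in HDFS so the distributed readers
// can be co-located with the HDFS blocks they consume.
ReadDelimitedText reader = graph.add(new ReadDelimitedText("hdfs://namenode:8020/data/orders.txt"));
reader.setHeader(true);

// Downstream operators are added and connected here; when the graph is compiled
// with a cluster-enabled EngineConfig (see the sample below), the distributed
// parts of the plan run on the worker nodes.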
The engine configuration provides a cluster setting that specifies the cluster on which to execute. The following code sample sets the cluster programmatically.
Setting the cluster execution configuration
// Identify the cluster to execute on by host and port.
ClusterSpecifier clusterSpec = ClusterSpecifier.create()
    .host("galaxy.englab.local")
    .port(60032);

// Apply the cluster setting to the engine configuration, then compile and run the graph.
EngineConfig config = EngineConfig.engine().cluster(clusterSpec);
graph.compile(config).run();