Difference Between Local and Distributed Execution
DataFlow applications run locally by default. In local mode, the application and its I/O are parallelized on the local machine. Executing an application in a distributed environment requires:
• A Hadoop cluster.
Note: Currently, distributed DataFlow execution is supported only on a Hadoop cluster.
• Integration with HDFS.
The optimal way to execute a distributed DataFlow job is to use data stored in HDFS as the input to the application. DataFlow distributes the I/O across the cluster by applying a location optimization that limits network I/O, reading HDFS blocks on the nodes where they reside. When the application is compiled and executed, the distributed portions of the physical plan are sent to the worker nodes for execution; some portions of the graph may still run locally.
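As a point of reference, the following sketch shows a graph whose source operator reads its input directly from an HDFS path, which is what allows the location optimization to take effect. The class names (LogicalGraphFactory, ReadDelimitedText) and the hdfs:// URI are illustrative assumptions based on the DataFlow operator library and may differ in your version; substitute your own source operator and path.
Reading application input from HDFS (sketch)
// Sketch only: the class names and HDFS path below are assumptions; adjust to your environment.
import com.pervasive.datarush.graphs.LogicalGraph;
import com.pervasive.datarush.graphs.LogicalGraphFactory;
import com.pervasive.datarush.operators.io.textfile.ReadDelimitedText;

LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("hdfs-input-example");

// Point the reader at data already stored in HDFS so the distributed readers
// can be co-located with the HDFS blocks they consume.
ReadDelimitedText reader = graph.add(new ReadDelimitedText("hdfs://namenode:8020/data/orders.txt"));
reader.setHeader(true);

// Downstream operators are added and connected here; when the graph is compiled
// with a cluster-enabled EngineConfig (see the sample below), the distributed
// parts of the plan run on the worker nodes.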
The engine configuration provides a cluster setting that specifies the cluster on which to execute. The following code sample sets the cluster programmatically.
Setting the cluster execution configuration
// Identify the cluster to execute on by host and port.
ClusterSpecifier clusterSpec = ClusterSpecifier.create()
    .host("galaxy.englab.local")
    .port(60032);

// Apply the cluster setting to the engine configuration, then compile and run the graph.
EngineConfig config = EngineConfig.engine().cluster(clusterSpec);
graph.compile(config).run();