Using DataFlow Cluster Manager
Executing DataFlow Jobs Using YARN
The following is the basic flow for executing a DataFlow job on YARN:
1. A DataFlow client instantiates a job within a YARN cluster by contacting the YARN Resource Manager to launch an Application Master (a generic YARN submission sketch follows this list).
2. A container for the Application Master is started by YARN on a worker node within the cluster.
3. The DataFlow Application Master then negotiates with the Resource Manager for resources. After resources are obtained for the job, the Application Master launches containers through the node managers to execute the DataFlow job.
4. Partition workers (containers) execute their part of the DataFlow job on worker nodes within the cluster.
5. After all phases of a job complete, the DataFlow Application Master cleans up any launched worker containers and returns the resources to YARN. The Application Master reports final status to the Resource Manager and to the job client.
6. During execution of the job, the DataFlow Cluster Manager obtains detailed job status and metrics from each Application Master. You can view these metrics in the DataFlow Cluster Manager web application; see Monitoring DataFlow Jobs for details.
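The sketch below illustrates steps 1 and 2 of the flow using the public Hadoop YARN client API. It is not the DataFlow client's own code; the Application Master class name and the resource sizes are placeholders. The DataFlow client performs an equivalent submission internally.

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.util.Records;

public class SubmitAppMaster {
    public static void main(String[] args) throws Exception {
        // Step 1: contact the Resource Manager through a YARN client.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // Ask the Resource Manager for a new application.
        ApplicationSubmissionContext appContext =
                yarnClient.createApplication().getApplicationSubmissionContext();
        appContext.setApplicationName("dataflow-job");

        // Describe the Application Master container: the command YARN runs
        // on a worker node (step 2). The AM class shown here is hypothetical.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "java -Xmx512m com.example.DataFlowAppMaster"
                + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
                + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));
        appContext.setAMContainerSpec(amContainer);

        // Resources requested for the Application Master container.
        appContext.setResource(Resource.newInstance(1024 /* MB */, 1 /* vcores */));

        // Submit; YARN starts the AM container on a worker node in the cluster.
        ApplicationId appId = yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appId);
    }
}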
You can execute a DataFlow job using the Java API, RushScript, or KNIME.
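For the Java API route, a job is composed as a logical graph and then run. The following is a minimal sketch only; it assumes the LogicalGraph and LogicalGraphFactory classes and the ReadDelimitedText and WriteDelimitedText operators of the DataFlow (DataRush) Java API, and the package names, operator constructors, file paths, and the way the engine is pointed at a YARN cluster may differ by release.

// Minimal sketch of running a DataFlow job through the Java API.
// Class names, packages, and constructors below are assumptions and
// may differ in your DataFlow release; paths are examples only.
import com.pervasive.datarush.graphs.LogicalGraph;
import com.pervasive.datarush.graphs.LogicalGraphFactory;
import com.pervasive.datarush.io.WriteMode;
import com.pervasive.datarush.operators.io.textfile.ReadDelimitedText;
import com.pervasive.datarush.operators.io.textfile.WriteDelimitedText;

public class RunDataFlowJob {
    public static void main(String[] args) {
        // Compose a logical graph: read a delimited file and write it back out.
        LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("CopyJob");

        ReadDelimitedText reader =
                graph.add(new ReadDelimitedText("hdfs:///data/input.csv"));
        WriteDelimitedText writer =
                graph.add(new WriteDelimitedText("hdfs:///data/output.csv",
                                                 WriteMode.OVERWRITE));
        graph.connect(reader.getOutput(), writer.getInput());

        // Run the graph. When the engine is configured to target a YARN
        // cluster, this submission drives the Application Master and
        // worker-container sequence described in the flow above.
        graph.run();
    }
}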
Last modified date: 01/06/2023