Configuring DataFlow Jobs
You should configure the following settings for individual jobs:
parallelism
Specifies the number of parallel streams required (one stream = one container). The default value of zero (0) means that the job uses the maximum number of parallel streams available.
minParallelism
Specifies the minimum level of parallelism allowed. This setting is important because the resources available on a busy cluster can vary over time.
When a DataFlow job is run, the Application Master container must be started. The master resources properties determine the size of the container to request. YARN allocates and starts the Application Master container on a node with free resources.
Using the requested level of parallelism and the worker container resource settings, the Application Master decides how many containers to request. YARN then searches for available resources throughout the cluster and responds with the appropriate container allocations. YARN may allocate fewer containers than requested, depending on the resources currently available.
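The interaction between parallelism, minParallelism, and a partial YARN allocation can be sketched as follows. This is an illustrative model only: the class ParallelismPlan and its plan method are hypothetical names and are not part of the DataFlow or YARN APIs. Treating a parallelism of zero as "request everything available" is a simplifying assumption, and what DataFlow actually does when the minimum is not met is not described here, so the sketch only reports whether the minimum was reached.

```java
// Hypothetical model of the allocation outcome; not a DataFlow or YARN API.
final class ParallelismPlan {
    final int requested;        // containers the job asks for
    final int granted;          // containers assumed to be allocated
    final boolean meetsMinimum; // true if granted >= minParallelism

    ParallelismPlan(int requested, int granted, boolean meetsMinimum) {
        this.requested = requested;
        this.granted = granted;
        this.meetsMinimum = meetsMinimum;
    }

    // availableContainers stands in for whatever YARN can allocate right now.
    static ParallelismPlan plan(int parallelism, int minParallelism,
                                int availableContainers) {
        // Assumption: zero means "use as many streams as are available".
        int requested = (parallelism == 0) ? availableContainers : parallelism;
        // YARN may allocate fewer containers than requested.
        int granted = Math.min(requested, availableContainers);
        return new ParallelismPlan(requested, granted, granted >= minParallelism);
    }

    public static void main(String[] args) {
        ParallelismPlan p = plan(8, 4, 6);
        System.out.println("requested=" + p.requested
                + " granted=" + p.granted
                + " meetsMinimum=" + p.meetsMinimum);
    }
}
```

In this model, a job that requests 8 streams with a minimum of 4 still satisfies its minimum when only 6 containers are free, but not when only 3 are free.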
YARN uses the specified memory and virtual core settings to allocate and track resource usage per node. For example, a 5-node cluster with 12 GB of memory allocated per node for YARN has a total memory capacity of 60 GB. With a default container size of 2 GB, YARN can allocate 30 containers of 2 GB each. Because virtual cores are tracked as well, the number of cores per node can also limit how many containers are allocated.
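The capacity arithmetic above can be checked with a short calculation. This is a rough sketch, not a DataFlow or YARN API: the class YarnCapacity and its method are hypothetical, the virtual core counts per node (8 and 4) and the one-core-per-container size are assumed values chosen for illustration, and whether cores are actually enforced depends on how the YARN scheduler is configured.

```java
// Hypothetical helper for estimating container capacity; not a YARN API.
final class YarnCapacity {
    // Containers cannot span nodes, so capacity is computed per node
    // and limited by whichever resource runs out first.
    static int maxContainers(int nodes, int memPerNodeGb, int vcoresPerNode,
                             int containerMemGb, int containerVcores) {
        int byMemory = memPerNodeGb / containerMemGb;
        int byCores = vcoresPerNode / containerVcores;
        return nodes * Math.min(byMemory, byCores);
    }

    public static void main(String[] args) {
        // 5 nodes, 12 GB each, 2 GB containers; 8 vcores per node is assumed.
        System.out.println(maxContainers(5, 12, 8, 2, 1)); // 5 * min(6, 8) = 30
        // Same memory, but only 4 vcores per node (assumed): cores become the limit.
        System.out.println(maxContainers(5, 12, 4, 2, 1)); // 5 * min(6, 4) = 20
    }
}
```

The Application Master container also consumes resources on one node, so the number of containers left for workers is typically somewhat lower than this estimate.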
• If executing jobs with the Java API or RushScript, these settings are specified within the code.
• If executing jobs from KNIME, these settings are available in File > Preferences > Actian under the specific cluster profile.