Before You Begin

Before you set up a cluster, review the architecture of a DataFlow cluster in Execution Modes.

The first step in cluster installation is to identify the machines in the cluster. At least one machine is required for each node manager. When planning, keep in mind the following:

• Cluster Manager does not consume a significant amount of resources when running. It can be on the same host as a node manager. Cluster Manager represents a single point of failure for the cluster.

• Every machine with a node manager is responsible for running a portion of a distributed graph. We recommend that you reserve these systems for executing graphs. If they are busy with other work, the performance of one or more nodes may suffer.

Every machine that hosts either Cluster Manager or a node manager must have DataFlow installed as described in Installing DataFlow for Use with Java. The DataFlow installation contains all binaries and libraries required for both cluster clients and servers.

Node managers and Cluster Manager should be run by the same user, referred to here as the dataflow user. Configure passwordless SSH on all nodes in the cluster so that the dataflow user can SSH from the Cluster Manager machine to any node without being prompted for a password. Currently, passwordless SSH is required to start a node using the admin GUI.
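One way to confirm the passwordless SSH requirement before starting the cluster is a small check run as the dataflow user on the Cluster Manager machine. The following is a minimal sketch, not part of DataFlow: the function names and the account name "dataflow" are illustrative assumptions, and it relies only on the standard OpenSSH client options BatchMode and ConnectTimeout.

```python
import subprocess


def build_ssh_check_command(host, user="dataflow", timeout_s=5):
    """Return an ssh command that fails instead of prompting for input.

    "dataflow" is a placeholder account name (an assumption); substitute
    the account that actually runs your cluster daemons.
    """
    return [
        "ssh",
        "-o", "BatchMode=yes",                # never prompt; fail if a password is needed
        "-o", f"ConnectTimeout={timeout_s}",  # give up quickly on unreachable hosts
        f"{user}@{host}",
        "true",                               # trivial remote command
    ]


def can_ssh_without_password(host, user="dataflow", runner=subprocess.run):
    """Return True if a non-interactive ssh login to host succeeds."""
    try:
        result = runner(build_ssh_check_command(host, user), capture_output=True)
    except OSError:
        # The ssh client is missing or could not be started.
        return False
    return result.returncode == 0


def find_unreachable_nodes(hosts, user="dataflow", runner=subprocess.run):
    """Return the hosts that could not be reached without a prompt."""
    return [h for h in hosts if not can_ssh_without_password(h, user, runner)]
```

For example, calling find_unreachable_nodes with a list of your node host names returns the subset that still needs attention. Because BatchMode disables all prompts, a host whose key has not yet been accepted is also reported as a failure, so a failure does not necessarily mean that key-based login is misconfigured.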

Warning! Graphs executing on the cluster will have the operating system rights available to the account running the node manager daemon. Therefore, we strongly recommend that you have the cluster daemons run by an account with very few privileges. This reduces the possibility of unauthorized access when running workflows from remote clients.

Do not run cluster daemons as root or any other account having superuser privileges!
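If you wrap the daemon startup in your own launcher script, a guard such as the following can help enforce this rule. It is a minimal sketch for POSIX systems under stated assumptions: the function names are hypothetical, it is not part of DataFlow, and it only detects an effective user ID of 0, so it does not catch other accounts that hold elevated privileges through sudo or group membership.

```python
import os


def is_superuser(euid=None):
    """Return True if the effective user ID is 0 (root).

    os.geteuid() is available on POSIX systems only.
    """
    if euid is None:
        euid = os.geteuid()
    return euid == 0


def refuse_superuser(euid=None):
    """Raise PermissionError if the process would run as root."""
    if is_superuser(euid):
        raise PermissionError("refusing to start cluster daemons as root")
```

Calling refuse_superuser() at the top of a launcher script stops it before any daemon is started under the root account.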