Before You Begin
The following are requirements for installing and configuring DataFlow Cluster Manager within a Hadoop cluster with YARN.
• Set up the cluster.
• Ensure that your Hadoop distribution supports YARN, that it is currently running, and that you can browse the HDFS file system.
• Install DataFlow Cluster Manager on the head (master) node of the Hadoop cluster. This is generally where the Resource Manager is located.
The Hadoop cluster configuration files are required so that the Cluster Manager has access to an up-to-date cluster configuration. For Hadoop cluster installations with HA capabilities on the head nodes, this provides fail-safe operation of the DataFlow Cluster Manager.
• By default, the Hadoop configuration files are located in /etc/hadoop/. Locate the configuration files in the installed location. If they are not available there, set an environment variable named hadoop.conf.dir that points to the Hadoop configuration directory on the install machine.
For KNIME, you can add -Dhadoop.conf.dir=<path to configuration files> to the knime.ini file and restart KNIME. This allows the Cluster Manager to find the cluster configuration.
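For example, assuming the cluster configuration files were copied to /etc/hadoop/conf on the KNIME machine (this path is a placeholder for your actual configuration directory), add the following line after the -vmargs entry in knime.ini:
-Dhadoop.conf.dir=/etc/hadoop/conf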
• The Cluster Manager should be run as a non-root user. We recommend creating a user named actian and always using that user to run the Cluster Manager. Launching the Cluster Manager as a different user can cause start-up failures.
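For example, on a Linux node the user can be created with a command such as the following (adjust for your operating system and user management conventions):
$ sudo useradd -m actian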
In addition, you must manually set up the file space within HDFS for executing DataFlow applications on the cluster. By convention, Hadoop supports adding application-specific files under the /apps directory in HDFS. For DataFlow, this directory is /apps/actian/dataflow. Create the directory manually and change its owner to the actian user. Then create the following three directories within /apps/actian/dataflow:
cache
Stores the run-time information needed by DataFlow jobs.
archive
Contains DataFlow libraries and their dependencies in archive format. The archives are used in the YARN local resource cache for distributing needed libraries throughout the cluster.
extensions
Contains user-created extensions to DataFlow. Populate this directory with archives, such as .jar files, that contain user-developed DataFlow extensions.
Database providers that support JDBC supply a JDBC driver .jar file. We recommend including these driver .jar files in the extensions directory.
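For example, after the extensions directory has been created (see the commands below), a JDBC driver can be copied into it with a command such as the following; the driver file name is a placeholder for the .jar file supplied by your database vendor:
$ hadoop fs -put my-jdbc-driver.jar /apps/actian/dataflow/extensions/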
The following example provides the commands to create the directories in HDFS and set the permissions. Run these commands as the HDFS user on one of the Hadoop cluster nodes.
Creating the DataFlow directories in HDFS
$ hadoop fs -mkdir -p /apps/actian/dataflow
$ hadoop fs -mkdir -p /apps/actian/dataflow/cache
$ hadoop fs -mkdir -p /apps/actian/dataflow/archive
$ hadoop fs -mkdir -p /apps/actian/dataflow/extensions
$ hadoop fs -chown -R actian:actian /apps/actian/dataflow
$ hadoop fs -chmod 777 /apps/actian/dataflow/cache
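You can then verify the directory structure and ownership:
$ hadoop fs -ls /apps/actian/dataflow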
Clusters on Rackspace or AWS
Before executing jobs from the local KNIME instance on a cluster in Rackspace or Amazon Web Services, do the following:
1. Update the client’s hosts file to map the cluster host names to their public IP addresses. Adding the server IP addresses to the client’s hosts file allows the DataFlow client to connect using the public IP addresses rather than the private IP addresses.
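For example, the entries added to the client’s hosts file might look like the following (the IP addresses and host names are placeholders for your cluster’s public IP addresses and host names):
203.0.113.10   master.hadoop.example.com    master
203.0.113.11   worker1.hadoop.example.com   worker1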
2. Update the following properties in the hdfs-site.xml and yarn-site.xml files in the Hadoop server configuration:
Properties in hdfs-site.xml
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
Properties in yarn-site.xml
<property>
  <name>yarn.resourcemanager.bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>yarn.nodemanager.bind-host</name>
  <value>0.0.0.0</value>
</property>
3. Set a global environment variable HADOOP_HOME on the client system. Copy the hdfs-site.xml file from the server to the client and store it in the %HADOOP_HOME%/etc/hadoop directory.
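For example, on a Windows client the file might be copied with a command such as the following; the source path is a placeholder for wherever you saved the hdfs-site.xml file downloaded from the cluster:
> copy C:\Downloads\hdfs-site.xml "%HADOOP_HOME%\etc\hadoop"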
4. If you are using a firewall or iptables, ensure that the ports used by the DataFlow cluster (1099 and 1100 by default) are allowed, in addition to any ports required by Hadoop.
Clusters on AWS or Rackspace with Vector on Hadoop
When using Vector on Hadoop on AWS or Rackspace, the following additional setup is required:
1. When updating the client’s hosts file, if the cluster’s network uses private host names internally, map those private host names to the public IP addresses instead of using the public host names.
2. When configuring a new DataFlow workflow that reads from or writes to a Vector database, you must use the internal host name of the Vector master node.
Note: Do not use the IP address or external host name for the Vector connections.
3. If the DataFlow cluster is not installed on the same Hadoop cluster as Vector, then depending on the current security setup, the DataFlow cluster may not be able to use the Vector on Hadoop instance.