Installing in a Hadoop Cluster
When using DataFlow for distributed execution with HDFS, install a node manager on every machine that hosts a Hadoop DataNode. This allows DataFlow to take advantage of data locality during execution. We recommend setting up Cluster Manager on the machine hosting the HDFS NameNode, although this is not a strict requirement.
When jobs run in the cluster, DataFlow requires the location of the Hadoop installation in order to obtain the Hadoop configuration. Set this location using the node.hadoop.home property on the Machine Classes page.
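For example, if Hadoop is installed under /opt/hadoop on the cluster nodes (a hypothetical path; substitute the actual installation directory), the property value entered on the Machine Classes page would be, expressed in property form:

node.hadoop.home=/opt/hadoop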
WARNING! An issue with the transparent huge pages feature in Linux can cause significant performance degradation when running Hadoop on a Linux cluster. For more information, see the transparent huge page compaction enabled issue.
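As a hedged illustration, on many Linux distributions the current setting can be checked, and transparent huge page compaction disabled, with commands similar to the following (the sysfs paths vary by distribution; some Red Hat releases use /sys/kernel/mm/redhat_transparent_hugepage instead):

# Check the current transparent huge page settings
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

# Disable compaction (defrag) as root; repeat on every node and persist the
# change in the boot configuration so it survives a reboot
echo never > /sys/kernel/mm/transparent_hugepage/defrag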
Because the DataFlow daemons execute as a different user than the one submitting the DataFlow graph, executing graphs normally have the HDFS access rights associated with the dataflow user. It is preferable for executing graphs to have the rights of the submitting user instead. To accomplish this, the daemons must impersonate the submitting user, acting as a proxy on that user's behalf. Hadoop supports this, but you must configure which users are allowed to act as proxies. Add the following properties to core-site.xml on all machines in the Hadoop cluster:
<property>
    <name>hadoop.proxyuser.datarush.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.datarush.groups</name>
    <value>*</value>
</property>
After updating the configuration, restart the Hadoop daemons for the changes to take effect.
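How the daemons are restarted depends on the Hadoop distribution and on how the cluster is managed. On a plain Apache Hadoop installation that uses the bundled scripts, a restart might look like the following (run as the HDFS/YARN administrative user; managed distributions typically restart services through their own management console instead):

# Restart HDFS and YARN daemons so the new proxy-user settings are loaded
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh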
Note: Older versions of Hadoop do not support the wildcard syntax for hosts and groups used above. In that case, an explicit list must be provided for each property. The host list should include the fully qualified domain name of each machine running a DataFlow node manager, and the group list should include the names of all groups containing users who execute DataFlow graphs in the cluster.
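As a hedged example, with two node manager hosts and a single group (the host and group names below are placeholders), the explicit form of the properties might look like this:

<property>
    <name>hadoop.proxyuser.datarush.hosts</name>
    <value>node1.example.com,node2.example.com</value>
</property>
<property>
    <name>hadoop.proxyuser.datarush.groups</name>
    <value>dataflow-users</value>
</property>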
Setting Up Kerberos Authentication
Hadoop can be configured to require Kerberos authentication. In that case, DataFlow processes must authenticate as Kerberos service principals. A principal must be created for each machine where DataFlow is installed. The principal name should be in the format dataflow/host@REALM, where host is the fully qualified domain name of the machine and REALM is the name of the Kerberos realm.
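With MIT Kerberos, for instance, the principal for one node could be created with kadmin roughly as follows (the host name and realm are placeholders):

# Create a randomized-key service principal for one DataFlow node
kadmin -q "addprinc -randkey dataflow/node1.example.com@EXAMPLE.COM"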
The DataFlow daemons also require a keytab file to authenticate with Kerberos. Each machine requires a keytab file containing the credentials for the service principal associated with that machine. These files should be readable only by the dataflow user, because any user with access to them can authenticate as the principal.
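Continuing the MIT Kerberos example above, the keytab for a node could be exported and restricted approximately like this (the file path and principal name are placeholders):

# Export the node's credentials into a keytab file
kadmin -q "xst -k /etc/security/keytabs/dataflow.keytab dataflow/node1.example.com@EXAMPLE.COM"

# Restrict the keytab so only the dataflow user can read it
chown dataflow:dataflow /etc/security/keytabs/dataflow.keytab
chmod 400 /etc/security/keytabs/dataflow.keytab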
The procedure for creating principals and keytabs is the same as that used for configuring secure Hadoop; see Hadoop in Secure Mode. After the principals and keytabs are created, you can configure DataFlow to authenticate with Kerberos. For more information, see Configuring Kerberos Authentication.