Configuring Hadoop for DataFlow

By default, the Hadoop configuration files are found in the /etc/hadoop/conf directory. If this directory is not available, set the hadoop.conf.dir environment variable to the configuration directory for Hadoop on the installation machine.

The /etc/hbase/conf directory contains hbase-site.xml. Copy hbase-site.xml to /etc/hadoop/conf.
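The directory lookup and the copy step above can be sketched in Python. This is a minimal illustration, not part of DataFlow: the function names `resolve_hadoop_conf_dir` and `copy_hbase_site` are hypothetical, and the sketch assumes that `hadoop.conf.dir` is read as an ordinary environment variable and consulted only when the default directory is missing.

```python
import os
import shutil
from pathlib import Path

# Default locations named in the text above.
DEFAULT_HADOOP_CONF = Path("/etc/hadoop/conf")
DEFAULT_HBASE_CONF = Path("/etc/hbase/conf")


def resolve_hadoop_conf_dir(default=DEFAULT_HADOOP_CONF, env=None):
    """Return the Hadoop configuration directory (hypothetical helper).

    Uses the default directory when it exists; otherwise falls back to
    the hadoop.conf.dir environment variable (assumed lookup order).
    """
    env = os.environ if env is None else env
    if Path(default).is_dir():
        return Path(default)
    override = env.get("hadoop.conf.dir")
    if override and Path(override).is_dir():
        return Path(override)
    raise FileNotFoundError(
        "No Hadoop configuration directory found; set hadoop.conf.dir"
    )


def copy_hbase_site(hbase_conf=DEFAULT_HBASE_CONF, hadoop_conf=None):
    """Copy hbase-site.xml into the Hadoop configuration directory."""
    if hadoop_conf is None:
        hadoop_conf = resolve_hadoop_conf_dir()
    source = Path(hbase_conf) / "hbase-site.xml"
    target = Path(hadoop_conf) / "hbase-site.xml"
    # copy2 preserves file metadata such as modification time.
    return Path(shutil.copy2(source, target))
```

Calling `copy_hbase_site()` with no arguments uses the default paths and usually requires write permission on the Hadoop configuration directory.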

The following YARN properties are important and require configuration:

yarn.nodemanager.resource.memory-mb: Specifies the amount of memory per worker node that is allocated for YARN applications.

yarn.scheduler.minimum-allocation-mb: Specifies the minimum container size that YARN allocates. A container request for less memory is raised to this minimum.

yarn.nodemanager.resource.cpu-vcores: Specifies the number of "virtual" cores per worker node. This property is not always set in the YARN configuration. It defaults to the number of cores discovered on the worker nodes.

yarn.application.classpath: Specifies the classpath used by the containers (processes) that YARN launches. This classpath references all the required Hadoop jar files and their locations on the cluster nodes.
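To see how the memory and core settings described above interact, the following Python sketch estimates how many containers fit on one worker node. It is a rough illustration only: the function and parameter names are hypothetical, the numbers are made up, and it assumes the scheduler accounts for both memory and virtual cores (some scheduler configurations consider memory only).

```python
def effective_container_mb(requested_mb, minimum_allocation_mb):
    """Raise a request below the minimum container size to that minimum."""
    return max(requested_mb, minimum_allocation_mb)


def max_containers_per_node(node_memory_mb, node_vcores, requested_mb,
                            minimum_allocation_mb, vcores_per_container=1):
    """Estimate how many containers one worker node can host.

    Assumption: both memory and virtual cores limit the container count.
    """
    container_mb = effective_container_mb(requested_mb, minimum_allocation_mb)
    by_memory = node_memory_mb // container_mb
    by_cores = node_vcores // vcores_per_container
    return min(by_memory, by_cores)


# Hypothetical node: 8192 MB for YARN, 4 virtual cores, 1024 MB minimum.
# A 512 MB request is raised to 1024 MB, so memory allows 8 containers,
# but the 4 virtual cores cap the node at 4 containers.
print(max_containers_per_node(8192, 4, 512, 1024))
```

With these example values, the core count rather than the memory setting is the limiting factor, which is why both settings are worth checking together.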

Not all Hadoop distributions set yarn.application.classpath during installation. If this YARN property is not set, do the following:

1. Log in to a cluster node and run the command "yarn classpath". Copy the output; it contains the classpath required by all YARN-based applications.

2. Set the yarn.application.classpath property to the classpath entries obtained from the output of the "yarn classpath" command.

Hortonworks (HDP):
/usr/hdp/x.x.x.x-xxxx/hadoop-mapreduce/*
/usr/hdp/x.x.x.x-xxxx/hadoop-mapreduce/lib/*

Cloudera (CDH):
/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib/*
/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib/*

MapR:
/opt/mapr/hadoop/hadoop-x.x.x/share/hadoop/mapreduce/lib/*
/opt/mapr/hadoop/hadoop-x.x.x/share/hadoop/hadoop-mapreduce/lib/*

Note: For MapR, update the setting in the yarn-site.xml file on the master node and on all worker nodes.

For Hadoop version 2.7, set the yarn.application.classpath property on the client. Enter the value for yarn.application.classpath in the <HADOOP_HOME>/etc/hadoop/yarn-site.xml file.

3. Restart all YARN services after updating the YARN property. For instructions, see the documentation for your Hadoop distribution.
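Steps 1 and 2 above can be sketched in Python as follows. This is an illustrative helper, not a DataFlow tool: the function names are hypothetical, and the sketch assumes Linux cluster nodes where "yarn classpath" prints entries separated by colons. The yarn.application.classpath value is written as a comma-separated list, and the generated element is meant to be pasted into yarn-site.xml by hand.

```python
import subprocess
from xml.sax.saxutils import escape


def read_yarn_classpath():
    """Run "yarn classpath" on a cluster node and return its raw output."""
    result = subprocess.run(
        ["yarn", "classpath"], capture_output=True, text=True, check=True
    )
    return result.stdout


def split_classpath(raw, separator=":"):
    """Split raw classpath output into unique entries, keeping their order.

    Assumption: entries are colon-separated, as on Linux nodes.
    """
    entries = []
    for entry in raw.strip().split(separator):
        entry = entry.strip()
        if entry and entry not in entries:
            entries.append(entry)
    return entries


def render_property(entries):
    """Build a yarn-site.xml <property> element for the classpath."""
    value = escape(",".join(entries))
    return (
        "<property>\n"
        "  <name>yarn.application.classpath</name>\n"
        f"  <value>{value}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    # Requires the yarn command on the PATH of the current node.
    print(render_property(split_classpath(read_yarn_classpath())))
```

Review the generated value before editing yarn-site.xml, and add any distribution-specific entries that the command output does not include.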

Note: For a YARN-enabled cluster, Kerberos can be used without any additional setup. For more information about setting up Hadoop in secure mode, see the Apache Hadoop documentation.

For a cluster without YARN, Kerberos authentication must be set up. For information about configuring Kerberos, see Configuring Kerberos Authentication in the DataFlow Troubleshooting and Reference Guide.