DF 7.0.3 | Configuring Hadoop for DataFlow

Configuring Hadoop for DataFlow

By default, the Hadoop configuration files are found in the /etc/hadoop/conf directory. If this directory is not available, set the hadoop.conf.dir environment variable to the configuration directory for Hadoop on the installation machine.
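The lookup described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not DataFlow's actual implementation: the function name resolve_hadoop_conf_dir is hypothetical, the override is assumed to be a process environment variable literally named hadoop.conf.dir, and the default directory is assumed to be checked first.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default location named in this section.
DEFAULT_CONF_DIR = "/etc/hadoop/conf"


def resolve_hadoop_conf_dir(
    default_dir: str = DEFAULT_CONF_DIR,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Return the Hadoop configuration directory (hypothetical helper).

    Assumed order: use default_dir when it exists, otherwise fall back
    to the hadoop.conf.dir environment variable.
    """
    env = os.environ if env is None else env

    default = Path(default_dir)
    if default.is_dir():
        return default

    override = env.get("hadoop.conf.dir")
    if override and Path(override).is_dir():
        return Path(override)

    raise FileNotFoundError(
        f"{default_dir} not found and hadoop.conf.dir "
        "does not point to an existing directory"
    )
```

Calling resolve_hadoop_conf_dir() with no arguments uses /etc/hadoop/conf and the current process environment; passing env explicitly makes the fallback easy to try out without changing the real environment.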

By default, the /etc/hadoop/conf directory contains the following files:

• core-site.xml

• hdfs-site.xml

• yarn-site.xml

The /etc/hbase/conf directory contains hbase-site.xml. Copy hbase-site.xml to /etc/hadoop/conf.
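The file check and the copy step above could be scripted along these lines. This is a minimal sketch: the helper names missing_conf_files and copy_hbase_site are hypothetical, and the script is assumed to run with permission to write to the Hadoop configuration directory.

```python
import shutil
from pathlib import Path
from typing import List

# Files this section expects in the Hadoop configuration directory.
EXPECTED_FILES = ("core-site.xml", "hdfs-site.xml", "yarn-site.xml")


def missing_conf_files(conf_dir: Path) -> List[str]:
    """Return the expected configuration files that are absent from conf_dir."""
    return [name for name in EXPECTED_FILES if not (conf_dir / name).is_file()]


def copy_hbase_site(hbase_conf_dir: Path, hadoop_conf_dir: Path) -> Path:
    """Copy hbase-site.xml into the Hadoop configuration directory.

    Returns the path of the copied file. An existing copy is overwritten.
    """
    source = hbase_conf_dir / "hbase-site.xml"
    target = hadoop_conf_dir / "hbase-site.xml"
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    return target
```

With the default locations, this would be called as missing_conf_files(Path("/etc/hadoop/conf")) followed by copy_hbase_site(Path("/etc/hbase/conf"), Path("/etc/hadoop/conf")).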

The following YARN properties require configuration:

yarn.nodemanager.resource.memory-mb

Specifies the amount of memory per worker node allocated for YARN application usage.

yarn.scheduler.minimum-allocation-mb

Specifies the minimum container size that YARN allocates. A container request for less memory is increased to this minimum.

yarn.nodemanager.resource.cpu-vcores

Specifies the number of virtual cores per worker node. This property is not always set in the YARN configuration; when it is not set, the value defaults to the number of cores discovered on the worker nodes.

yarn.application.classpath

Specifies the classpath that should be used by the containers (processes) that are launched by YARN. This classpath references all of the required Hadoop jar files with their locations on the cluster nodes.
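To see which of these properties are currently set, the yarn-site.xml file can be read with a short script. The sketch below assumes the standard Hadoop configuration layout (property elements with name and value children inside a configuration root). The helper names are hypothetical, and effective_container_mb models only the minimum-allocation rule described above, not any further rounding a scheduler may apply.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# The four properties discussed in this section.
YARN_PROPERTIES = (
    "yarn.nodemanager.resource.memory-mb",
    "yarn.scheduler.minimum-allocation-mb",
    "yarn.nodemanager.resource.cpu-vcores",
    "yarn.application.classpath",
)


def read_yarn_properties(xml_text: str) -> Dict[str, Optional[str]]:
    """Return each property of interest, or None when it is not set."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            found[name.strip()] = (prop.findtext("value") or "").strip()
    return {key: found.get(key) for key in YARN_PROPERTIES}


def effective_container_mb(requested_mb: int, minimum_allocation_mb: int) -> int:
    """Apply the minimum-allocation rule: small requests are raised to the minimum."""
    return max(requested_mb, minimum_allocation_mb)
```

A property reported as None is not set in the file, which is the case the next procedure addresses for yarn.application.classpath.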

To set yarn.application.classpath

Not all Hadoop distributions set yarn.application.classpath during installation. If this YARN property is not set, do the following:

1. Log in to a cluster node and run the command "yarn classpath". Save the output, which includes the required classpath for all YARN-based applications.

2. Set the yarn.application.classpath property to the classpath entries obtained from the output of the "yarn classpath" command.

For Optimized Row Columnar (ORC) files, the paths for each Hadoop distribution are:

HDP

/usr/hdp/x.x.x.x-xxxx/hadoop-mapreduce/*
/usr/hdp/x.x.x.x-xxxx/hadoop-mapreduce/lib/*

Cloudera

/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib/*

MapR

/opt/mapr/hadoop/hadoop-x.x.x/share/hadoop/mapreduce/lib/*
/opt/mapr/hadoop/hadoop-x.x.x/share/hadoop/hadoop-mapreduce/lib/*

Note: For MapR, the setting must be updated in the yarn-site.xml file for the master and all the worker nodes.

For Hadoop version 2.7, set the yarn.application.classpath property on the client by entering its value in the <HADOOP_HOME>/etc/hadoop/yarn-site.xml file.

3. Restart all the YARN services after updating the YARN property. For instructions, see the documentation for your Hadoop distribution.
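Steps 1 and 2 can be sketched in Python as follows. This is an illustration under assumptions rather than a supported tool: it assumes a Linux node where "yarn classpath" prints entries separated by colons, it relies on yarn.application.classpath taking a comma-separated list, and the helper names are hypothetical. The script only produces the property element; editing yarn-site.xml and restarting the services remain manual steps.

```python
import subprocess
from typing import List
from xml.sax.saxutils import escape


def run_yarn_classpath() -> str:
    """Run "yarn classpath" on the current node and return its output."""
    result = subprocess.run(
        ["yarn", "classpath"], capture_output=True, text=True, check=True
    )
    return result.stdout


def split_classpath(yarn_classpath_output: str) -> List[str]:
    """Split colon-separated output into entries, dropping blanks and repeats."""
    entries: List[str] = []
    for entry in yarn_classpath_output.strip().split(":"):
        entry = entry.strip()
        if entry and entry not in entries:
            entries.append(entry)
    return entries


def classpath_property_xml(entries: List[str]) -> str:
    """Build the property element to place inside yarn-site.xml."""
    value = ",".join(entries)  # the property expects comma-separated entries
    return (
        "<property>\n"
        "  <name>yarn.application.classpath</name>\n"
        f"  <value>{escape(value)}</value>\n"
        "</property>"
    )
```

On a cluster node, print(classpath_property_xml(split_classpath(run_yarn_classpath()))) prints the element to paste into yarn-site.xml. Any distribution-specific ORC paths from step 2 can be appended to the entry list before the element is built.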

Note: For a YARN-enabled cluster, Kerberos can be used without any additional setup. For more information about setting up Hadoop in secure mode, see the Apache Hadoop documentation.

For a cluster without YARN, Kerberos authentication must be set up. For information about configuring Kerberos, see Configuring Kerberos Authentication in the DataFlow Troubleshooting and Reference Guide.

Last modified date: 01/06/2023