Configuring DataFlow
After you install DataFlow, we recommend configuring it for your environment by setting environment variables and by configuring Hadoop integration and Java options.
Setting DataFlow Environment Variables
After you install DataFlow, we recommend that you configure environment variables for users. You can do this on a per-user basis or by applying a global profile to your system.
Two environment variables are convenient for working with DataFlow, although updating them is not required. They are:
DR_HOME
Points to the DataFlow installation directory. If not set, it is determined automatically by the command line utilities. Throughout this documentation, DR_HOME denotes the installation directory.
PATH
Should be updated to include DR_HOME/bin, as this will make using DataFlow command line utilities easier.
On UNIX, environment variables other than those listed above can also be set in DR_HOME/conf/dr_env.sh. This file is automatically included by the DataFlow command line utilities, ensuring the settings apply to all users regardless of system configuration.
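Because DR_HOME and PATH must be in place before the utilities are invoked, set them in each user's shell profile or in a global profile. For example, the following lines could be added to a user's profile on UNIX (the installation path shown is illustrative):
export DR_HOME=/opt/dataflow
export PATH=$DR_HOME/bin:$PATH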
Integrating DataFlow with Hadoop
To integrate with Hadoop, DataFlow must be able to locate the necessary client configuration and .jar files to add to the class path. The command line utilities determine this information from the environment. If these files cannot be found, Hadoop-related features in DataFlow are unavailable.
Hadoop files are located as follows:
If DR_HADOOP_CLASSPATH is defined, it is added to the class path and no further steps are needed (see the example below).
Otherwise, DataFlow attempts to discover this information by running utility commands from the local installation of Hadoop.
First, the location of the Hadoop installation is determined by checking HADOOP_HOME. If not set, the default /usr/lib/hadoop is used.
After the installation location is determined, $HADOOP_HOME/bin/hadoop is invoked to obtain the class path.
If this command fails, the PATH is searched for the hadoop command, which, if found, is used to obtain the class path.
Note:  On Windows systems, DR_HADOOP_CLASSPATH is the only way to set the location of Hadoop configuration and .jar files.
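For example, on a UNIX system you can set DR_HADOOP_CLASSPATH explicitly from the output of the Hadoop client (a sketch; adjust HADOOP_HOME to match your installation):
export HADOOP_HOME=/usr/lib/hadoop
export DR_HADOOP_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)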
Configuring Java for DataFlow
DataFlow comes with a command line utility, dr, for launching DataFlow applications (see Using dr). It is provided as both a .bat batch file and a .sh shell script. Both files begin with settings for the JVM location and memory usage.
JVM Location
The JVM used is determined from the value of the JAVA_HOME environment variable. If you have already set this variable on your system, leave this setting as is; the JVM located in JAVA_HOME/bin is used. Otherwise, the JVM is found using the PATH.
If a specific JVM is desired, explicitly define JAVA_HOME at the top of the appropriate script, entering an absolute path to a Java installation. For example, on a Windows system, the path might resemble the following:
JAVA_HOME=C:\Program Files\Java\jdk1.7.0_55
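On a UNIX system, the equivalent line in the .sh script might resemble the following (the JDK path is illustrative):
JAVA_HOME=/usr/lib/jvm/jdk1.7.0_55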
JVM Options
JVM options are set explicitly in the script using a variable named JAVA_ARGS. JVM options are local to the session in which DataFlow runs.
The dr command uses the -server flag so that heap allocation is managed automatically by a server (64-bit) JVM. To add arguments, edit the line that sets JAVA_ARGS. For example, you can add the options that set the initial heap size (-Xms) or the maximum heap size (-Xmx). For application development, we recommend a maximum heap size of at least 50 percent of physical memory. In our testing, performance is better when -Xms and -Xmx are set to the same value.
The -server flag must appear first in JAVA_ARGS.
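For example, to run DataFlow with a fixed 4 GB heap, the JAVA_ARGS line could be edited as follows (the heap size is illustrative; choose a value appropriate for your machine):
JAVA_ARGS="-server -Xms4g -Xmx4g"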
Configuring Java Logging
The DataFlow command line utilities use log4j for logging messages. To configure this logging, modify the file $DR_HOME/conf/log4j.properties.
By default, logging writes informational messages to both the console and a log file located at ~/Dataflow.log.
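For example, a minimal log4j.properties that mirrors this default behavior might look like the following (a sketch for illustration, not the shipped configuration):
log4j.rootLogger=INFO, console, file
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d %-5p %c - %m%n
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=${user.home}/Dataflow.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d %-5p %c - %m%n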
For more information about log4j capabilities, see the log4j documentation at the Apache website.
Verifying the DataFlow Installation
On the machine where you performed the installation, open a command prompt and run either dr --version or dr -v.
The results should look like this:
C:\>dr -v
DataFlow version: <DataFlow version>
Java version: <Java version>
If the version prints successfully, then DataFlow is correctly installed.
Configuring Third-party Modules
DataFlow is shipped with built-in support for a number of third-party products. If you will not use some of these, you can disable them. Enabled modules are configured in the file $DR_HOME/conf/module.