HDFS Connectivity
This section describes how to read from and write to a Hadoop Distributed File System (HDFS). After you complete the following steps, DataConnect can read from and write to HDFS like any other file system.
1. Configure your HDFS instance to be mountable by enabling the NFS gateway. Follow the basic steps in https://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html. A sketch of the gateway startup and mount verification commands appears after step 4.
2. Mount the HDFS instance on the Linux system where you plan to run DataConnect. While logged in as root, run the mount command with the following parameters:
mount -t nfs -o vers=3,proto=tcp,nolock <server>:/ <mount point>
3. For example:
mount -t nfs -o vers=3,proto=tcp,nolock 192.168.1.50:/ /mnt/hdfs1
Here, 192.168.1.50 is the IP address of the NameNode (the HDFS instance) and /mnt/hdfs1 is the location on the client machine where you want to mount the file system.
4. Install, configure, and launch the DataConnect Studio IDE on the Linux system where HDFS is mounted. You can then create datasets that connect to the mounted HDFS file locations.
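The following is a minimal sketch of starting the NFS gateway and verifying the mount, assuming a Hadoop 2.x gateway and the example address and mount point used above (192.168.1.50 and /mnt/hdfs1). Daemon startup commands and required user accounts vary by Hadoop version and distribution, so check the linked guide for the exact procedure.
# On the Hadoop side: start the NFS gateway daemons (names vary by version/distribution).
hadoop-daemon.sh start portmap     # run as root; some setups use the system rpcbind instead
hadoop-daemon.sh start nfs3        # run as the HDFS superuser (for example, hdfs)
# From the DataConnect machine: confirm that the gateway is exporting the file system.
rpcinfo -p 192.168.1.50            # should list portmapper, mountd, and nfs services
showmount -e 192.168.1.50          # should show the HDFS root export "/ *"
# After running the mount command from step 2, verify the mount.
df -h /mnt/hdfs1                   # reports HDFS capacity if the mount succeeded
ls /mnt/hdfs1                      # lists the top-level HDFS directories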
Note:
Both the HDFS system and the system where DataConnect is installed should have a user group with the same name that contains the same list of users, and those users should have full permissions.
If you have multiple DataConnect worker machines, each worker will need to have the HDFS instance mounted.
Consider using macros for mountable file locations.
Due to limitations in HDFS or the NFS gateway when overwriting existing files, the target connector's output mode should be set to Append. If you need to delete the contents of the file before each transformation, you can use the FileDelete script function, or clear the previous output from the Linux side as sketched below.
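The following is a minimal sketch of clearing the previous output through the mount point before a run, assuming a hypothetical target file of /mnt/hdfs1/staging/output.txt. It is an illustrative alternative to the FileDelete script function, not part of the DataConnect product.
# Hypothetical pre-run cleanup on the mounted file system; adjust the path to your target file.
TARGET=/mnt/hdfs1/staging/output.txt
if [ -f "$TARGET" ]; then
    rm "$TARGET"    # remove the previous output so the Append run starts with an empty target
fi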