Supported File Systems
The file access abstraction layer in DataFlow has a number of predefined file systems that can be used. Some are enabled by default; others may require including additional jars in the classpath.
The supported file systems are:
Local files
Process streams
Hadoop Distributed File System (HDFS)
Amazon S3
File Transfer Protocol (FTP), FTPS, or SFTP
Local Files
DataFlow provides support for accessing the local file system. Unlike other file systems, paths to local files do not require a scheme prefix; they may optionally be prefixed with the file scheme.
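For example, assuming an illustrative file at /data/input.txt, either of the following forms refers to the same local file; the second uses the optional file scheme prefix:
/data/input.txt
file:/data/input.txt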
Warning!  When executing on the cluster, local paths are interpreted relative to the machine on which the graph is executing. This may or may not refer to the same file as on the machine from which the graph was invoked. Use caution when using local paths with a DataFlow cluster. Either ensure the path is visible to all machines (by copying the file or by using a distributed file system such as NFS) or configure the operator to execute locally.
Process Streams
The JDK provides access to three special streams: standard input, standard output, and standard error. These streams are environmental; they are established as part of the execution of the JVM process. Typically, these streams are associated with the console, but may be redirected elsewhere.
DataFlow provides support for addressing these streams through special schemes. As these streams are shared across the JVM, there should only be one accessor for a given stream. If this guideline is violated, the results will be undefined.
The special schemes are:
stdin
Identifies the standard input stream.
stdout
Identifies the standard output stream.
stderr
Identifies the standard error stream.
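In Java terms, these schemes refer to the standard streams of the JVM process executing the graph. The following snippet is purely illustrative (it does not use the DataFlow API); it simply names the underlying JDK streams:
// The three JVM process streams addressed by the stdin, stdout, and stderr schemes.
java.io.InputStream standardInput = System.in;    // standard input
java.io.PrintStream standardOutput = System.out;  // standard output
java.io.PrintStream standardError = System.err;   // standard error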
Hadoop Distributed File System (HDFS)
HDFS (Hadoop Distributed File System) is the distributed file system that is provided as part of Apache Hadoop. DataFlow provides support for accessing data stored on HDFS, but this support is not enabled by default.
To enable HDFS access within DataFlow, the following must be true:
The contents of $DR_HOME/lib/hadoop must be included in the classpath.
The appropriate client jars for your Hadoop installation must be included in the classpath. Additionally, we strongly suggest including the configuration files for your Hadoop installation in the classpath, although omitting them does not prevent HDFS support from being enabled.
Provided that the environment has been correctly configured as documented in Installing DataFlow for Use with Java, these conditions will be met when using Command Line Usage: dr.
Paths referring to HDFS are prefixed using the hdfs scheme identifier. The format of an HDFS path is illustrated below:
hdfs://<namenode>:<port>/<path to file>
Note:  HDFS paths in DataFlow use the same format as they do in Hadoop; any string recognized by Hadoop as a valid HDFS path is also a valid path string in DataFlow.
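For example, a file on a cluster whose name node listens at namenode.example.com on port 8020 (both values illustrative) would be addressed as:
hdfs://namenode.example.com:8020/user/dataflow/input.txt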
When executing in a DataFlow cluster, the locality information available in HDFS will be used as input to scheduling, preferring nodes that can access file splits locally over those which would access them remotely. Because of this, installing a DataFlow cluster on the same machines as the HDFS data nodes improves DataFlow’s ability to ensure data access is local, potentially increasing overall performance of executed graphs.
HDFS access uses credentials inherited from the environment of the client. The permissions of the user launching the DataFlow graph will be used for accesses done on both the client machine and the cluster machines.
Amazon S3
S3 is Amazon’s web-based cloud storage service. DataFlow provides support for accessing data stored in S3, although this support is not enabled by default. To enable support for S3, the following must be true:
The contents of $DR_HOME/lib/aws must be included in the classpath.
Executing a graph using Command Line Usage: dr will meet these conditions if the AWS module has been enabled in $DR_HOME/conf/modules.
Paths referring to S3 are prefixed using the s3 scheme identifier. An s3 path has two scheme-specific components, the bucket name and the object name. Both are optional, although if an object name is present, a bucket name must also be present. The format of an s3 path is as illustrated below:
s3://<bucket name>/<object name>
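For example, an object named sales/2023/orders.csv stored in a bucket named my-bucket (both names illustrative) would be addressed as:
s3://my-bucket/sales/2023/orders.csv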
Access to S3 requires credentials; the FileClient used to access S3 paths must have been provided appropriate credentials in order to succeed. DataFlow supports use of AWS access keys for accessing S3. The required fields in the Credentials object associated with the file system are:
access key
The Access Key ID identifying the account.
secret key
The Secret Access Key used as the shared secret for authentication.
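As an illustration, using the placeholder key pair published in the AWS documentation (never embed real keys in code or configuration checked into version control), the fields take values of the form:
access key: AKIAIOSFODNN7EXAMPLE
secret key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY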
Note:  Files stored on S3 cannot be read using splits. Therefore reads on S3 can only exhibit parallelism at a per-file granularity. Single files cannot be read in a parallel fashion.
FTP, FTPS, or SFTP
File Transfer Protocol (FTP) is a standard network protocol used to transfer files between hosts. DataFlow provides support for accessing data through an FTP connection to a host. FTPS and SFTP are also supported where secure transfer is required.
Paths referring to an FTP location are prefixed with ftp, ftps, or sftp, depending on which protocol should be used. The scheme identifier determines which credentials are used when connecting to the FTP server. The format of a path is as illustrated below:
<ftp,ftps,sftp>://<host>:<port>/<path to file>
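For example, a file served over SFTP from host ftp.example.com on port 22 (illustrative values) would be addressed as:
sftp://ftp.example.com:22/data/input.txt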
Access to FTP servers requires credentials; the FileClient used to access FTP paths must have been provided appropriate credentials in order to succeed. Some credential fields apply only to particular scheme identifiers, as noted below. The fields in the Credentials object associated with the file system, along with the schemes to which each applies, are:
username (ftp, ftps, sftp)
The username to log in to the server.
password (ftp, ftps, sftp)
The password associated with the username.
key (ftps, sftp)
The local path to the cryptographic key or certificate.
keystore password (ftps, sftp)
The password used to unlock or check the integrity of the key.
keystore type (ftps)
The type of keystore.
keymanager algorithm (ftps)
The standard name of the requested key management algorithm.
trustmanager algorithm (ftps)
The standard name of the requested trust management algorithm.
If a field is not set, the system-specific default will be used, if possible. The only required fields are username and, when security is enabled, key.
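As an illustration, a typical sftp configuration using public key authentication (all values hypothetical) might set only the following fields, leaving the rest at their defaults:
username: dataflow
key: /home/dataflow/.ssh/id_rsa
keystore password: the passphrase protecting the key, if one is set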
For more information about keystores and algorithms, see http://docs.oracle.com/javase/6/docs/technotes/guides/security/crypto/CryptoSpec.html#AppA and http://docs.oracle.com/javase/6/docs/technotes/guides/security/jsse/JSSERefGuide.html.
Note:  Files stored on an FTP server cannot be read using splits. Therefore, reads on an FTP server can only exhibit parallelism at a per-file granularity. Single files cannot be read in a parallel fashion.