Supported File Systems
The file access abstraction layer in DataFlow has a number of predefined file systems that can be used. Some are enabled by default; others may require including additional jars in the classpath.
The supported file systems are:
Local files
Process streams
Hadoop Distributed File System (HDFS)
Amazon S3
File Transfer Protocol (FTP, FTPS, or SFTP)
Azure Blob Storage
Google Cloud Platform
Local Files
DataFlow provides support for accessing the local file system. Unlike other file systems, paths to local files do not require a scheme prefix; they may optionally be prefixed with the file scheme.
WARNING!  When executing on the cluster, local paths are interpreted relative to the machine on which the graph is executing. This may or may not refer to the same file as on the machine from which the graph was invoked. Use caution when using local paths with a DataFlow cluster. Either ensure the path is visible to all machines (by copying the file or by using a distributed file system such as NFS) or configure the operator to execute locally.
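For example, both of the following forms refer to the same (hypothetical) local file; the file scheme prefix is optional:
/data/input/orders.txt
file:///data/input/orders.txt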
Process Streams
The JDK provides access to three special streams: standard input, standard output, and standard error. These streams are environmental; they are established as part of the execution of the JVM process. Typically, these streams are associated with the console, but may be redirected elsewhere.
DataFlow provides support for addressing these streams through special schemes. As these streams are shared across the JVM, there should only be one accessor for a given stream. If this guideline is violated, the results will be undefined.
The special schemes are:
stdin: Identifies the standard input stream
stdout: Identifies the standard output stream
stderr: Identifies the standard error stream
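For reference, these schemes correspond to the JVM's standard streams. The short Java sketch below simply copies standard input to standard output and writes a message to standard error; it is included only to illustrate which JVM streams the stdin, stdout, and stderr schemes refer to, not how DataFlow itself accesses them.

import java.io.IOException;

public class StandardStreams {
    public static void main(String[] args) throws IOException {
        // System.in, System.out, and System.err are the shared JVM process streams
        // addressed by the stdin, stdout, and stderr schemes.
        byte[] buffer = new byte[8192];
        int read;
        while ((read = System.in.read(buffer)) != -1) {
            System.out.write(buffer, 0, read);  // copy standard input to standard output
        }
        System.out.flush();
        System.err.println("copy complete");    // diagnostic output on standard error
    }
}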
Hadoop Distributed File System (HDFS)
HDFS (Hadoop Distributed File System) is the distributed file system that is provided as part of Apache Hadoop. DataFlow provides support for accessing data stored on HDFS, but this support is not enabled by default.
To enable HDFS access within DataFlow, the following must be true:
The contents of $DR_HOME/lib/hadoop must be included in the classpath.
The appropriate client jars for your Hadoop installation must be included in the classpath. Additionally, we strongly suggest including the configuration files for your Hadoop installation in the classpath, although omitting them does not prevent HDFS support from being enabled.
Provided that the environment has been correctly configured as documented in Installing DataFlow for Use with Java, these conditions will be met when using Command Line Usage: dr.
Paths referring to HDFS are prefixed using the hdfs scheme identifier. The format of an HDFS path is illustrated below:
hdfs://<namenode>:<port>/<path to file>
Note:  HDFS paths in DataFlow use the same format as they do in Hadoop; any string recognized by Hadoop as a valid HDFS path is also a valid path string in DataFlow.
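For example, a file stored under /data/sales on a name node reachable at namenode.example.com (host, port, and path here are placeholders) would be addressed as:
hdfs://namenode.example.com:8020/data/sales/part-00001.txt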
When executing in a DataFlow cluster, the locality information available in HDFS will be used as input to scheduling, preferring nodes that can access file splits locally over those which would access them remotely. Because of this, installing a DataFlow cluster on the same machines as the HDFS data nodes improves DataFlow’s ability to ensure data access is local, potentially increasing overall performance of executed graphs.
HDFS access uses credentials inherited from the environment of the client. The permissions of the user launching the DataFlow graph will be used for accesses done on both the client machine and the cluster machines.
Amazon S3
S3 is Amazon’s web-based cloud storage service. DataFlow provides support for accessing data stored in S3, although this support is not enabled by default. To enable support for S3, the contents of $DR_HOME/lib/aws must be included in the classpath.
Executing a graph using Command Line Usage: dr will meet these conditions if the AWS module has been enabled in $DR_HOME/conf/modules.
Paths referring to S3 locations are prefixed with a valid scheme identifier such as s3, s3a, or s3n. An S3 path has two scheme-specific components: the bucket name and the object name. Both are optional, although if an object name is present, a bucket name must also be present. The format of an S3 path is:
<s3 scheme identifier>://<bucket name>/<object name>
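For example, an object named data/customers.csv in a bucket named example-bucket (both placeholders) would be addressed as:
s3a://example-bucket/data/customers.csv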
Note:  When loading data into Amazon S3, be aware of the multi-part upload limits, since DataFlow will attempt to parallelize the workflow based on the execution capabilities of the client or cluster. Reaching these limits is frequently signified by log messages such as 'Timeout waiting for connection from pool'. If these limits are reached when executing a workflow, lowering the output sink's parallelism and increasing the write batch sizes can eliminate the related issues. For more information, see the Amazon S3 documentation.
Access to S3 requires credentials; the FileClient used to access S3 paths must have been provided appropriate credentials in order to succeed. DataFlow supports the use of AWS access keys for accessing S3, including official AWS credentials provided through global or system environment variables. The required fields in the Credentials object associated with the file system are:
access key: The Access Key ID identifying the account
endpoint: Optional custom URI used to connect to S3-compatible clusters
region: The region where the bucket is located
secret key: The Secret Access Key used as the shared secret for authentication
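The sketch below shows one plausible way to gather these fields in plain Java, reading the access and secret keys from the official AWS environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. The map and its eventual handoff to a FileClient are illustrative assumptions only; consult the DataFlow API documentation for the actual Credentials and FileClient calls.

import java.util.LinkedHashMap;
import java.util.Map;

public class S3CredentialsSketch {
    public static void main(String[] args) {
        // Hypothetical illustration: the field names mirror the table above,
        // but the actual DataFlow Credentials API may differ.
        Map<String, String> s3Fields = new LinkedHashMap<>();
        s3Fields.put("access key", System.getenv("AWS_ACCESS_KEY_ID"));      // official AWS environment variable
        s3Fields.put("secret key", System.getenv("AWS_SECRET_ACCESS_KEY"));  // official AWS environment variable
        s3Fields.put("region", "us-east-1");                                 // placeholder region
        // s3Fields.put("endpoint", "https://s3.example.internal");          // only needed for S3-compatible storage

        // In a real graph these values would be supplied to the FileClient that
        // resolves s3:// paths; the exact call is omitted here.
        s3Fields.forEach((k, v) -> System.out.println(k + " = " + (v == null ? "<unset>" : "***")));
    }
}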
Note:  Files stored on S3 cannot be read using splits. Therefore, reads on S3 can only exhibit parallelism at a per-file granularity. Single files cannot be read in a parallel fashion.
FTP, FTPS, or SFTP
File Transfer Protocol (FTP) is a standard network protocol used to transfer files between hosts. DataFlow provides support for accessing data through an FTP connection to a host. FTPS and SFTP are also supported when secure transfer is required.
Paths referring to an FTP location are prefixed with ftp, ftps, or sftp, depending on which protocol should be used. The scheme identifier determines which credentials are used when connecting to the FTP server. The format of a path is:
<ftp,ftps,sftp>://<host>:<port>/<path to file>
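For example, a file exposed over SFTP on the standard port 22 (host and path are placeholders) would be addressed as:
sftp://files.example.com:22/exports/daily/orders.csv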
Access to FTP servers requires credentials; the FileClient used to access FTP paths must have been provided appropriate credentials to succeed. Some credentials apply only when using the appropriate scheme identifier. The fields in the Credentials object associated with the file system are listed below, with the schemes to which each field applies shown in parentheses:
username (ftp, ftps, sftp): The username to log in to the server
password (ftp, ftps, sftp): The password associated with the username
key (ftps, sftp): The local path to the cryptographic key or certificate
keystore password (ftps, sftp): The password used to unlock or check the integrity of the key
keystore type (ftps): The type of keystore
keymanager algorithm (ftps): The standard name of the requested key management algorithm
trustmanager algorithm (ftps): The standard name of the requested trust management algorithm
If a field is not set, then the system-specific default is used if possible. The required fields are username and, when security is enabled, key.
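As a rough illustration of which fields typically accompany each scheme, the Java sketch below builds a credential map for an sftp connection using username, password, and key, following the table above. The map and the SFTP_PASSWORD environment variable are hypothetical stand-ins, not the documented DataFlow Credentials API.

import java.util.LinkedHashMap;
import java.util.Map;

public class FtpCredentialsSketch {
    public static void main(String[] args) {
        // Hypothetical illustration only; field names follow the table above.
        // A plain ftp connection typically needs just username and password;
        // sftp may additionally use a key and keystore password.
        Map<String, String> sftpFields = new LinkedHashMap<>();
        sftpFields.put("username", "transfer-user");                   // placeholder user
        sftpFields.put("password", System.getenv("SFTP_PASSWORD"));    // hypothetical environment variable
        sftpFields.put("key", "/home/transfer-user/.ssh/id_rsa");      // placeholder path to the private key
        sftpFields.forEach((k, v) -> System.out.println(k + " = " + (v == null ? "<unset>" : "***")));
    }
}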
Note:  Files stored on an FTP server cannot be read using splits. Therefore, reads on an FTP server can only exhibit parallelism at a per-file granularity. Single files cannot be read in a parallel fashion.
Azure Blob Storage
Azure Blob storage is Microsoft's object storage solution for the cloud. DataFlow provides support for accessing data stored in Azure through the Azure Blob File System driver.
Paths that refer to an Azure Blob Storage location are prefixed with abfs (or abfss for a secure connection). The format of the path is:
abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>
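For example, a file stored in a file system named myfilesystem under a storage account named mystorageaccount (both placeholders) would be addressed over a secure connection as:
abfss://myfilesystem@mystorageaccount.dfs.core.windows.net/data/input.csv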
Access to Azure Blob Storage requires credentials; the FileClient used to access ABFS paths must have been provided appropriate credentials to succeed. Some fields apply only when using a particular authentication method. The fields in the Credentials object associated with the file system are:
AzureAuthType: Authentication method to use; can be SharedKey, SAS, OAuth, or Public
DefaultEndpointProtocol: Either http or https
AZURE_STORAGE_ACCOUNT: Name of the Azure account
AZURE_STORAGE_KEY: Key associated with the Azure account
AZURE_STORAGE_SAS_TOKEN: Shared access signature key
AZURE_CLIENT_ID: Client ID to use when performing service principal authentication with Azure
AZURE_CLIENT_SECRET: Client secret to use when performing service principal authentication with Azure
AZURE_TENANT_ID: Tenant ID for the Azure resources
If a field is not set, the system-specific default is used if possible, or in some cases an attempt is made to acquire the value from environment variables. The only field that is always required is the account name (AZURE_STORAGE_ACCOUNT); depending on the chosen authentication method, other fields may also need to be provided.
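The Java sketch below pairs the fields with each AzureAuthType value based on standard Azure authentication modes (shared key, SAS, service principal, public); the pairings and the plain map used here are illustrative assumptions, not the documented DataFlow API.

import java.util.LinkedHashMap;
import java.util.Map;

public class AzureCredentialsSketch {
    public static void main(String[] args) {
        // Typical field combinations per AzureAuthType (assumed, based on standard Azure auth modes):
        //   SharedKey: AZURE_STORAGE_ACCOUNT + AZURE_STORAGE_KEY
        //   SAS:       AZURE_STORAGE_ACCOUNT + AZURE_STORAGE_SAS_TOKEN
        //   OAuth:     AZURE_STORAGE_ACCOUNT + AZURE_CLIENT_ID + AZURE_CLIENT_SECRET + AZURE_TENANT_ID
        //   Public:    AZURE_STORAGE_ACCOUNT only
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("AzureAuthType", "SharedKey");
        fields.put("AZURE_STORAGE_ACCOUNT", "mystorageaccount");               // placeholder account name
        fields.put("AZURE_STORAGE_KEY", System.getenv("AZURE_STORAGE_KEY"));   // read the key from the environment
        fields.forEach((k, v) -> System.out.println(k + " = " + (v == null ? "<unset>" : "***")));
    }
}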
Google Cloud Platform (GCP)
Google Cloud Platform (GCP) offers Google Cloud Storage as its cloud object storage service. DataFlow provides support for accessing data stored in Google Cloud Storage through the GCP File System driver.
Paths that refer to a Google Cloud Storage location are prefixed with gs. The format of the path is:
gs://<BUCKET_NAME>/<OBJECT_NAME>
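For example, an object stored at data/events/2024.json in a bucket named example-bucket (both placeholders) would be addressed as:
gs://example-bucket/data/events/2024.json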
Access to Google Cloud Storage requires credentials; the FileClient used to access GS paths must have been provided appropriate credentials to succeed. The fields in the Credentials object associated with the file system are:
GOOGLE_APPLICATION_CREDENTIALS: Location in the local file system of the Google service account key configuration file. This is usually a JSON file.
client_id: Client ID of the service account.
client_email: Client email address of the service account.
private_key_id: Private key identifier for the service account.
private_key: RSA private key object for the service account.
token_uri: URI of the endpoint that provides tokens.
project_id: Project used for billing.
quota_project_id: Project used for quota and billing purposes.
Note:  If the GOOGLE_APPLICATION_CREDENTIALS field is not set, the system-specific default is used if possible, or an attempt is made to acquire the value through environment variables if available. If GOOGLE_APPLICATION_CREDENTIALS is not found, the remaining credential fields are checked and used to access the service.
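The Java sketch below illustrates the lookup order described in the note: check for the GOOGLE_APPLICATION_CREDENTIALS environment variable first, and fall back to the explicit credential fields otherwise. The fallback map is a hypothetical stand-in for the Credentials object, and the values shown are placeholders.

import java.util.LinkedHashMap;
import java.util.Map;

public class GcpCredentialLookupSketch {
    public static void main(String[] args) {
        // Illustrative only: mirrors the lookup order described above,
        // not DataFlow's actual implementation.
        String keyFile = System.getenv("GOOGLE_APPLICATION_CREDENTIALS");
        if (keyFile != null) {
            System.out.println("Using service account key file: " + keyFile);
        } else {
            // Fall back to explicit credential fields (placeholder values).
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("client_email", "reader@my-project.iam.gserviceaccount.com"); // placeholder service account
            fields.put("private_key_id", "<private key id>");                        // placeholder key identifier
            fields.put("token_uri", "https://oauth2.googleapis.com/token");          // Google's standard token endpoint
            fields.put("project_id", "my-project");                                  // placeholder project
            System.out.println("Using explicit credential fields: " + fields.keySet());
        }
    }
}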