Engine Settings and Types
 
Name
Type
Default Value
Description
cluster
String or ClusterSpecifier
undefined
Specify the cluster to use for the execution environment. A cluster specification is composed of a host name and IP port number. Execution is local (not clustered) by default.
When specifying as a string, use the URL format: dr://host:port where host is the host name or IP address of the server running the cluster manager and port is the specified port number of the cluster manager.
ClusterSpecifiers may also take options to specify job settings. (See example below.)
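The dr://host:port string format described above can be checked with the JDK alone. The following is a minimal sketch, not part of the DataFlow API: the class name ClusterUrlSketch, the host name, and the port number are all hypothetical and chosen only for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only; not a DataFlow class.
final class ClusterUrlSketch {
    final String host;
    final int port;

    private ClusterUrlSketch(String host, int port) {
        this.host = host;
        this.port = port;
    }

    /** Parses a cluster string of the form dr://host:port. */
    static ClusterUrlSketch parse(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
        if (!"dr".equals(uri.getScheme())) {
            throw new IllegalArgumentException("Expected the dr:// scheme: " + url);
        }
        // getHost() is null and getPort() is -1 when the part is missing.
        if (uri.getHost() == null || uri.getPort() == -1) {
            throw new IllegalArgumentException("Expected both host and port: " + url);
        }
        return new ClusterUrlSketch(uri.getHost(), uri.getPort());
    }

    public static void main(String[] args) {
        // Placeholder host and port; substitute the cluster manager's values.
        ClusterUrlSketch cluster = parse("dr://manager.example.com:1099");
        System.out.println(cluster.host + " " + cluster.port);
    }
}
```

Validating the string before handing it to the engine gives an early, readable error when the host or port is missing.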
socketProvider
SocketProvider
DirectSocketProvider
Can be used to configure a SOCKS proxy. (See example below.)
dumpfilePath
String
undefined
Path of statistics dump after execution of the graph. Dumping of statistics is disabled by default.
maxRetries
int
0
The maximum number of retries for network communication attempts. Default: zero (implies no retries).
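To make the meaning of a retry limit concrete, the sketch below shows one common way such a limit is applied: one initial attempt followed by at most maxRetries further attempts. This is an assumption about typical retry semantics, not a description of the engine's internal implementation; RetrySketch and NetworkCall are hypothetical names.

```java
import java.io.IOException;

// Hypothetical illustration of a retry limit; not DataFlow code.
final class RetrySketch {
    /** A network operation that may fail with an I/O error. */
    interface NetworkCall<T> {
        T run() throws IOException;
    }

    /** Runs the call once, then retries up to maxRetries times on IOException. */
    static <T> T callWithRetries(NetworkCall<T> call, int maxRetries) throws IOException {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must not be negative");
        }
        IOException lastFailure = null;
        // attempt 0 is the initial try; attempts 1..maxRetries are retries.
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return call.run();
            } catch (IOException e) {
                lastFailure = e;
            }
        }
        // The loop ran at least once, so lastFailure is set here.
        throw lastFailure;
    }
}
```

With maxRetries set to zero, the call is attempted exactly once and the first failure is reported, which matches the "no retries" default.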
monitored
boolean
false
Enable or disable monitoring (the collection of runtime statistics). Default: monitoring disabled.
parallelism
int
0
Specifies the degree of parallelism. This setting determines the number of parallel streams that will be created in the physical execution plan.
It is zero by default, indicating the engine is free to choose an appropriate value. When running locally, this is usually the value returned by java.lang.Runtime.availableProcessors(). When running in the cluster, as many partitions as the cluster can support will be used.
Setting the parallelism automatically sets the minimumParallelism to the same value.
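The local default described above can be approximated with the JDK. This sketch assumes that a configured value of zero resolves to the number of available processors, which the description says is usually, but not always, the case; ParallelismSketch is a hypothetical name and not part of the DataFlow API.

```java
// Hypothetical illustration of the local default; not DataFlow code.
final class ParallelismSketch {
    /**
     * Returns the configured parallelism, or the processor count when the
     * configured value is zero (meaning "let the engine choose").
     */
    static int effectiveLocalParallelism(int configured) {
        if (configured < 0) {
            throw new IllegalArgumentException("parallelism must not be negative");
        }
        return configured == 0
                ? Runtime.getRuntime().availableProcessors()
                : configured;
    }

    public static void main(String[] args) {
        System.out.println("default: " + effectiveLocalParallelism(0));
        System.out.println("explicit: " + effectiveLocalParallelism(4));
    }
}
```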
minimumParallelism
int
0
Specifies the minimum acceptable degree of parallelism.
The default value of zero implies no minimum when running locally. Running in a cluster, zero implies the cluster default (as specified by the job.minimum.parallelism setting) should be used; see Cluster Settings for details.
storageManagementPath
Path
undefined
Path to use for storage of temporary and intermediate files during graph execution. When running in local mode, the java.io.tmpdir property setting is used by default. When running within a cluster, the configured cluster manager will be used by default.
A storage system called “drfs” (DataFlow file system) is supported within a cluster configuration. This system uses local temporary space on each machine within the cluster.
HDFS can be used in a Hadoop cluster for full fault tolerance.
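For local mode, the fallback to java.io.tmpdir can be sketched as follows. This only illustrates the documented default for local execution; the cluster case is not covered, and StoragePathSketch is a hypothetical name.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration of the local-mode default; not DataFlow code.
final class StoragePathSketch {
    /** Returns the configured path, or the JVM temporary directory if none is set. */
    static Path resolveLocal(Path configured) {
        if (configured != null) {
            return configured;
        }
        return Paths.get(System.getProperty("java.io.tmpdir"));
    }

    public static void main(String[] args) {
        System.out.println(resolveLocal(null));
    }
}
```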
classpathSpecifier
java.class.path
Specifies the classpath to use for clustered jobs and controls the cache behavior of the entries in the specified classpath. The entries in the list refer to directories or files on the client machine. Those entries are automatically synchronized to the required nodes in the cluster upon job submission.
If unspecified, the value is determined by parsing the java.class.path system property.
By default, .jar files are cached in a shared location (the node-scoped cache), so that they are reusable by other jobs that run on the same node. Directories, on the other hand, are stored in a private cache and are deleted from the cluster machine upon job termination.
For more information, see The Class Cache section in Setting Up Clusters.
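To see which entries would be picked up when the classpath is left unspecified, the java.class.path property can be split with the platform path separator. The sketch below is a plain JDK illustration with hypothetical names; it flags .jar entries, which the description above says go to the node-scoped cache, and does not reproduce the engine's own parsing.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of classpath parsing; not DataFlow code.
final class ClasspathSketch {
    /** Splits a classpath string on the given separator, skipping empty entries. */
    static List<String> entries(String classpath, String separator) {
        List<String> result = new ArrayList<>();
        for (String entry : classpath.split(Pattern.quote(separator))) {
            if (!entry.isEmpty()) {
                result.add(entry);
            }
        }
        return result;
    }

    /** True for entries that name a .jar file. */
    static boolean isJar(String entry) {
        return entry.endsWith(".jar");
    }

    public static void main(String[] args) {
        String classpath = System.getProperty("java.class.path");
        for (String entry : entries(classpath, File.pathSeparator)) {
            System.out.println(entry + (isJar(entry) ? " (.jar file)" : " (not a .jar file)"));
        }
    }
}
```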
moduleConfiguration
datarush-hadoop-apache3
 
Specifies the module to enable. This setting selects the version of Hadoop to use within DataFlow. The allowed values are:
datarush-hadoop-apache3
extensionPaths
String
 
A comma-separated list of paths referring to directories in shared storage. The paths are intended to contain extensions to the DataFlow environment on a cluster. Files found in the extension paths will be copied to the current directory of the containers created to run a DataFlow job on nodes within a cluster. Files that are archives (see below) are added to the class path.
Here are file extensions that indicate a file is an archive:
.tar.gz
.tar
.zip
.jar
Jar files are copied as-is into the local directory. Other archive types are extracted into the local directory, under a base directory with the same name as the archive file. All archives are added to the classpath of the container. Non-archive files are copied to the local directory of the container but are not added to the classpath.
Each of the paths must be contained in a shared, distributed storage system such as HDFS.
Extension paths are only supported when executing DataFlow jobs using YARN.
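The handling rules for files found in extension paths can be summarized as a small classification. This is a sketch of the rules as described above, with hypothetical names; matching is case-sensitive here, which is an assumption, since the description does not say how case is treated.

```java
// Hypothetical illustration of the extension-path rules; not DataFlow code.
final class ExtensionFileSketch {
    enum Handling {
        COPY_AND_ADD_TO_CLASSPATH,     // .jar files
        EXTRACT_AND_ADD_TO_CLASSPATH,  // .tar.gz, .tar, .zip
        COPY_ONLY                      // everything else
    }

    /** Classifies a file name by its extension. */
    static Handling handlingFor(String fileName) {
        if (fileName.endsWith(".jar")) {
            return Handling.COPY_AND_ADD_TO_CLASSPATH;
        }
        if (fileName.endsWith(".tar.gz")
                || fileName.endsWith(".tar")
                || fileName.endsWith(".zip")) {
            return Handling.EXTRACT_AND_ADD_TO_CLASSPATH;
        }
        return Handling.COPY_ONLY;
    }

    public static void main(String[] args) {
        // File names are placeholders for illustration.
        for (String name : new String[] {"udfs.jar", "models.tar.gz", "lookup.csv"}) {
            System.out.println(name + " -> " + handlingFor(name));
        }
    }
}
```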
schedulerQueue
String
 
The name of the scheduler queue to use when scheduling jobs. The scheduler queue name is valid only when using a cluster for job execution. Currently, scheduler queue names are supported only when using YARN for job execution.
The following code example shows how to create a base engine configuration and set specific engine settings.
Creating an engine configuration
EngineConfig config = EngineConfig.engine().monitored(true).parallelism(4);
Configuring a SOCKS proxy
EngineConfig config = ...;
config = config.socketProvider(new SocksSocketProvider("localhost", 1080));
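For comparison, the snippet below shows how a socket is routed through a SOCKS proxy with the plain JDK networking classes. It is a sketch of the general technique only; how SocksSocketProvider works internally is not documented here, and SocksProxySketch is a hypothetical name. The proxy address matches the placeholder used in the example above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.Socket;

// Hypothetical illustration using only java.net; not DataFlow code.
final class SocksProxySketch {
    /** Describes a SOCKS proxy listening at the given address. */
    static Proxy socksProxy(String proxyHost, int proxyPort) {
        return new Proxy(Proxy.Type.SOCKS, new InetSocketAddress(proxyHost, proxyPort));
    }

    /** Creates an unconnected socket whose later connections go through the proxy. */
    static Socket newSocket(Proxy proxy) {
        return new Socket(proxy);
    }

    public static void main(String[] args) throws IOException {
        Proxy proxy = socksProxy("localhost", 1080);
        // No connection is made until connect() is called on the socket.
        try (Socket socket = newSocket(proxy)) {
            System.out.println("connected: " + socket.isConnected());
        }
    }
}
```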
Last modified date: 06/14/2024