Engine Settings and Types

Name	Type	Default Value	Description
cluster	String or ClusterSpecifier	undefined	Specifies the cluster to use for the execution environment. A cluster specification consists of a host name and a port number. Execution is local (not clustered) by default. When specifying the cluster as a string, use the URL format dr://host:port, where host is the host name or IP address of the server running the cluster manager and port is the port number of the cluster manager. A ClusterSpecifier may also take options to specify job settings. (See example below.)
socketProvider	SocketProvider	DirectSocketProvider	Can be used to configure a SOCKS proxy. (See example below.)
dumpfilePath	String	undefined	Path of statistics dump after execution of the graph. Dumping of statistics is disabled by default.
maxRetries	int	0	The maximum number of retries for network communication attempts. Default: zero (implies no retries).
monitored	boolean	false	Enable or disable monitoring (the collection of runtime statistics). Default: monitoring disabled.
parallelism	int	0	Specifies the degree of parallelism. This setting determines the number of parallel streams that will be created in the physical execution plan. It is zero by default, indicating the engine is free to choose an appropriate value. When running locally, this is usually the value returned by java.lang.Runtime.availableProcessors(). When running in the cluster, as many partitions as the cluster can support will be used. Setting the parallelism automatically sets the minimumParallelism to the same value.
minimumParallelism	int	0	Specifies the minimum acceptable degree of parallelism. When running locally, the default value of zero implies no minimum. When running in a cluster, zero implies that the cluster default (as specified by the job.minimum.parallelism setting) should be used; see Cluster Settings for details.
storageManagementPath	Path	undefined	Path to use for storage of temporary and intermediate files during graph execution. When running in local mode, the java.io.tmpdir property setting is used by default. When running within a cluster, the configured cluster manager will be used by default. A storage system called “drfs” (DataFlow file system) is supported within a cluster configuration. This system uses local temporary space on each machine within the cluster. HDFS can be used in a Hadoop cluster for full fault tolerance.
classpathSpecifier	ClasspathSpecifier	java.class.path	Specifies the classpath to use for clustered jobs and controls the cache behavior of the entries in the specified classpath. The entries in the list refer to directories or files on the client machine. Those entries are automatically synchronized to the required nodes in the cluster upon job submission. If unspecified, the value is determined by parsing the java.class.path system property. By default, .jar files are cached in a shared location (the node-scoped cache), so that they are reusable by other jobs that run on the same node. Directories, on the other hand, are stored in a private cache and are deleted from the cluster machine upon job termination. For more information, see The Class Cache section in Setting Up Clusters.
moduleConfiguration	String or ModuleConfiguration	datarush-hadoop-apache2 datarush-hadoop-apache3 datarush-hadoop-mapr6	Specifies the module to be enabled. This is used to select the version of Hadoop to use within DataFlow. The allowed values are: • datarush-hadoop-apache2 • datarush-hadoop-apache3 • datarush-hadoop-cdh4.2 • datarush-hadoop-mapr4 • datarush-hadoop-mapr5 • datarush-hadoop-mapr6
extensionPaths	String		A comma-separated list of paths referring to directories in shared storage. The paths are intended to contain extensions to the DataFlow environment on a cluster. Files found in the extension paths are copied to the current directory of the containers created to run a DataFlow job on nodes within a cluster. The following file extensions indicate that a file is an archive: • .tar.gz • .tar • .zip • .jar Jar files are copied as-is into the local directory. The other archive file types are extracted into the local directory using a base directory with the same name as the archive file. All archives are added to the classpath of the container. Non-archive files are copied to the local directory of the container but are not added to the classpath. Each of the paths must be contained in a shared, distributed storage system such as HDFS. Extension paths are supported only when executing DataFlow jobs using YARN.
schedulerQueue	String		The name of the scheduler queue to use when scheduling jobs. The scheduler queue name is valid only when using a cluster for job execution. Currently, scheduler queue names are supported only when using YARN for job execution.
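
The string form of the cluster setting follows the dr://host:port pattern described in the table. As an illustration only, the sketch below checks that a string matches that pattern before it is passed to the engine. It uses only the standard java.net.URI class; the ClusterUrlCheck and HostPort names and the port value 1099 are hypothetical and are not part of the DataFlow API.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not a DataFlow class: validates a dr://host:port string.
class ClusterUrlCheck {

    // Simple holder for the two parts of a cluster specification.
    static final class HostPort {
        final String host;
        final int port;

        HostPort(String host, int port) {
            this.host = host;
            this.port = port;
        }
    }

    // Parses a cluster URL and rejects anything that is not dr://host:port.
    static HostPort parse(String url) {
        final URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
        if (!"dr".equals(uri.getScheme())) {
            throw new IllegalArgumentException("Expected the dr scheme: " + url);
        }
        // getHost() returns null and getPort() returns -1 when the part is missing.
        if (uri.getHost() == null || uri.getPort() < 0) {
            throw new IllegalArgumentException("Expected dr://host:port: " + url);
        }
        return new HostPort(uri.getHost(), uri.getPort());
    }

    public static void main(String[] args) {
        // The port number here is an example value, not a documented default.
        HostPort hp = parse("dr://localhost:1099");
        System.out.println(hp.host + " " + hp.port);
    }
}
```

A check like this fails early on a missing port or a wrong scheme, rather than at job submission time.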
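
The handling of files found in extensionPaths can be summarized as a small classification rule. The sketch below restates the rule from the table in code; it is not DataFlow's implementation. The ExtensionFileKind class and its members are hypothetical, and case-sensitive matching of file extensions is an assumption that the table does not confirm.

```java
// Hypothetical illustration of the extensionPaths rules; not DataFlow code.
class ExtensionFileKind {

    enum Kind {
        JAR,                // copied as-is, added to the classpath
        EXTRACTED_ARCHIVE,  // extracted into a directory, added to the classpath
        PLAIN_FILE          // copied, not added to the classpath
    }

    // Classifies a file by its extension (assumption: case-sensitive match).
    static Kind classify(String fileName) {
        if (fileName.endsWith(".jar")) {
            return Kind.JAR;
        }
        if (fileName.endsWith(".tar.gz")
                || fileName.endsWith(".tar")
                || fileName.endsWith(".zip")) {
            return Kind.EXTRACTED_ARCHIVE;
        }
        return Kind.PLAIN_FILE;
    }

    // All archives, including jar files, end up on the container classpath.
    static boolean addedToClasspath(Kind kind) {
        return kind != Kind.PLAIN_FILE;
    }

    public static void main(String[] args) {
        // The file names below are made-up examples.
        for (String name : new String[] {"udfs.jar", "models.tar.gz", "lookup.csv"}) {
            Kind kind = classify(name);
            System.out.println(name + " " + kind + " " + addedToClasspath(kind));
        }
    }
}
```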