Name | Type | Default Value | Description |
---|---|---|---|
cluster | String or ClusterSpecifier | undefined | Specifies the cluster to use for the execution environment. A cluster specification is composed of a host name and an IP port number. Execution is local (not clustered) by default. When specifying as a string, use the URL format dr://host:port, where host is the host name or IP address of the server running the cluster manager and port is the port number of the cluster manager. ClusterSpecifiers may also take options to specify job settings. (See example below.) |
socketProvider | SocketProvider | DirectSocketProvider | Can be used to configure a SOCKS proxy. (See example below.) |
dumpfilePath | String | undefined | Path to which a statistics dump is written after execution of the graph. Dumping of statistics is disabled by default. |
maxRetries | int | 0 | The maximum number of retries for network communication attempts. The default of zero means no retries. |
monitored | boolean | false | Enables or disables monitoring (the collection of runtime statistics). Monitoring is disabled by default. |
parallelism | int | 0 | Specifies the degree of parallelism. This setting determines the number of parallel streams that will be created in the physical execution plan. It is zero by default, indicating the engine is free to choose an appropriate value. When running locally, this is usually the value returned by java.lang.Runtime.availableProcessors(). When running in the cluster, as many partitions as the cluster can support will be used. Setting the parallelism automatically sets the minimumParallelism to the same value. |
minimumParallelism | int | 0 | Specifies the minimum acceptable degree of parallelism. The default value of zero implies no minimum when running locally. Running in a cluster, zero implies the cluster default (as specified by the job.minimum.parallelism setting) should be used; see Cluster Settings for details. |
storageManagementPath | Path | undefined | Path to use for storage of temporary and intermediate files during graph execution. When running in local mode, the java.io.tmpdir property setting is used by default. When running within a cluster, the configured cluster manager will be used by default. A storage system called “drfs” (DataFlow file system) is supported within a cluster configuration. This system uses local temporary space on each machine within the cluster. HDFS can be used in a Hadoop cluster for full fault tolerance. |
classpathSpecifier | ClasspathSpecifier | parsed from java.class.path | Specifies the classpath to use for clustered jobs and controls the cache behavior of the entries in the specified classpath. The entries in the list refer to directories or files on the client machine. These entries are automatically synchronized to the required nodes in the cluster upon job submission. If unspecified, the value is determined by parsing the java.class.path system property. By default, .jar files are cached in a shared location (the node-scoped cache) so that they are reusable by other jobs that run on the same node. Directories, on the other hand, are stored in a private cache and are deleted from the cluster machine upon job termination. For more information, see The Class Cache section in Setting Up Clusters. |
moduleConfiguration | String or ModuleConfiguration | datarush-hadoop-apache3 | Specifies the module to be enabled. This is used to select the version of Hadoop to use within DataFlow. Allowed values include: • datarush-hadoop-apache3 |
extensionPaths | String | undefined | A comma-separated list of paths referring to directories in shared storage. The paths are intended to contain extensions to the DataFlow environment on a cluster. Files found in the extension paths are copied to the current directory of the containers created to run a DataFlow job on nodes within a cluster. Files with the following extensions are treated as archives: • .tar.gz • .tar • .zip • .jar Jar files are copied as-is into the local directory. The other archive types are extracted into the local directory, using a base directory with the same name as the archive file. All archives are added to the class path of the container. Non-archive files are copied to the local directory of the container but are not added to the class path. Each of the paths must reside in a shared, distributed storage system such as HDFS. Extension paths are supported only when executing DataFlow jobs using YARN. |
schedulerQueue | String | undefined | The name of the scheduler queue to use when scheduling jobs. A scheduler queue name is valid only when using a cluster for job execution, and is currently supported only with YARN. |
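The dr://host:port format used by the cluster property follows ordinary URL syntax, so it can be parsed with java.net.URI. The helper below is a minimal sketch for illustration only; the class and method names are hypothetical and are not part of the DataFlow API.

```java
import java.net.URI;

public class ClusterUrl {
    /** Extracts the cluster manager host from a dr://host:port string. */
    public static String host(String url) {
        URI u = URI.create(url);
        if (!"dr".equals(u.getScheme())) {
            throw new IllegalArgumentException("expected a dr:// URL: " + url);
        }
        return u.getHost();
    }

    /** Extracts the cluster manager port from a dr://host:port string. */
    public static int port(String url) {
        return URI.create(url).getPort();
    }

    public static void main(String[] args) {
        String url = "dr://head-node:1099"; // hypothetical cluster manager address
        System.out.println(host(url) + " / " + port(url));
    }
}
```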
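To give a feel for what a SOCKS-configuring socketProvider does, the sketch below builds sockets routed through a SOCKS proxy using only standard java.net classes. The class name, constructor, and proxy address are illustrative assumptions; the real SocketProvider interface is defined by the DataFlow library.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.Socket;

public class SocksSocketProvider {
    private final Proxy proxy;

    public SocksSocketProvider(String proxyHost, int proxyPort) {
        // Proxy.Type.SOCKS causes the socket to tunnel through a SOCKS proxy.
        this.proxy = new Proxy(Proxy.Type.SOCKS,
                InetSocketAddress.createUnresolved(proxyHost, proxyPort));
    }

    /** Returns an unconnected socket configured to use the SOCKS proxy. */
    public Socket newSocket() {
        return new Socket(proxy);
    }

    public static void main(String[] args) {
        // "proxy.example.com" is a placeholder; the socket is not connected here.
        Socket s = new SocksSocketProvider("proxy.example.com", 1080).newSocket();
        System.out.println(s.isConnected());
    }
}
```

Using createUnresolved avoids a DNS lookup at construction time; resolution happens when the socket actually connects.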
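The archive-handling rules for extensionPaths can be summarized as a small predicate: the four listed extensions mark a file as an archive, and every archive except a .jar is extracted rather than copied as-is. The class below is an illustrative restatement of those rules, not DataFlow's own container-side implementation.

```java
public class ExtensionPathRules {
    /** True if the file name carries one of the archive extensions listed above. */
    public static boolean isArchive(String fileName) {
        String n = fileName.toLowerCase();
        return n.endsWith(".tar.gz") || n.endsWith(".tar")
                || n.endsWith(".zip") || n.endsWith(".jar");
    }

    /** Jar files are copied as-is; all other archive types are extracted. */
    public static boolean isExtracted(String fileName) {
        return isArchive(fileName) && !fileName.toLowerCase().endsWith(".jar");
    }

    public static void main(String[] args) {
        System.out.println(isArchive("udfs.tar.gz"));   // archive, will be extracted
        System.out.println(isExtracted("operators.jar")); // archive, but copied as-is
        System.out.println(isArchive("notes.txt"));     // plain file, not on class path
    }
}
```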