Defining Temporary Storage
DataFlow uses the local file system for temporary data storage during execution. By default, the system temporary directory is used; however, it may not provide enough space or I/O performance for large datasets. We strongly recommend configuring this setting appropriately for each system.
The node.executor.scratch.directory property specifies the directory for storing temporary files. Any temporary files created during execution are written to a subdirectory of the specified path. Multiple directories can be specified by using a comma to separate each individual path. DataFlow will then use the directories in a round-robin fashion as storage is requested.
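For example, to spread temporary storage across two disks, the property could be set as follows (the paths shown are placeholders; substitute directories appropriate for your nodes):

    node.executor.scratch.directory=/data1/dataflow/tmp,/data2/dataflow/tmp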
The node.executor.scratch.directory property can be specified on the Machine Classes page. If some nodes in your cluster have more disks than others, create separate machine classes for those nodes so their scratch directories can be configured independently.
Note:  Using multiple directories for temporary storage increases the available space and also improves the potential I/O throughput as disk access can be spread across multiple disks. When co-locating a DataFlow node with HDFS, consider using directories on the same drives that are providing storage for the Hadoop DataNode.

The directories used for temporary storage must have the same permissions as the system temporary directory: writable by all users, with the "sticky" bit set. On UNIX, this can be accomplished by using mode 1777 with the chmod command.
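For example, the following commands create a scratch directory with the required permissions (the path is a placeholder; use the same directory configured in node.executor.scratch.directory):

    mkdir -p /data1/dataflow/tmp
    chmod 1777 /data1/dataflow/tmp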