vwload in Parallel Mode
The vwload --cluster option speeds up loading by parallelizing some of the processing steps.
Note: The --cluster option should always be used in VectorH.
Requirements for Parallel vwload
If parallel mode is enabled, all processing steps, including reading input files from disk, UTF8 verification or character set conversion, parsing, and data type conversion, are done by the server (that is, by all nodes in the cluster used by VectorH). Therefore, input files must be accessible to the server rather than to the vwload client, which is the process that reads them in normal mode.
Parallel vwload cannot split a single large input file to load it in parallel. To benefit from parallelization, specify multiple input files. Parallelization works best when all input files are approximately the same size.
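Because a single large file is not split automatically, one option is to split it before loading. The following is a minimal sketch, not part of vwload: the function name `split_by_lines` and the `.partNNN` naming are hypothetical, and it assumes newline-terminated records with no line breaks embedded inside field values.

```python
import os

def split_by_lines(src_path, out_dir, parts):
    """Split src_path into at most `parts` files of roughly equal byte size.

    Cuts are made only at line ends, so no record is broken across files.
    Assumes one record per line (no newlines embedded in fields).
    Returns the list of part file paths, in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    total = os.path.getsize(src_path)
    target = max(1, -(-total // parts))  # ceiling division: bytes per part
    base = os.path.basename(src_path)
    out_paths = []
    out = None
    written = 0
    try:
        with open(src_path, "rb") as src:
            for line in src:
                # Open a new part if none is open yet, or if the current
                # part reached the target size and more parts are allowed.
                if out is None or (written >= target and len(out_paths) < parts):
                    if out is not None:
                        out.close()
                    path = os.path.join(out_dir, f"{base}.part{len(out_paths):03d}")
                    out = open(path, "wb")
                    out_paths.append(path)
                    written = 0
                out.write(line)
                written += len(line)
    finally:
        if out is not None:
            out.close()
    return out_paths
```

The resulting part files would then be placed where the servers can read them and passed to vwload as separate input files.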
Because the input files are read by the servers on all nodes in the cluster, the paths should be accessible from all cluster nodes. They can be either paths into a shared filesystem (HDFS, or network paths mounted on all nodes), or data replicated under the same path on all nodes. There is no control over which files will be read by a server on which node.
Note: The directory for vectorwise.log must also be accessible by the servers on all nodes.
To use the vwload command in parallel mode, either /proc/sys/vm/overcommit_memory (see Virtual Address Space Allocation (Linux)) must be set to 1, or the [memory] max_overalloc parameter (see max_overalloc) must be set to 0.
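The memory requirement can be checked before starting a load. The sketch below is illustrative only: the function names are hypothetical, and the max_overalloc value is assumed to be supplied by the caller (for example, after looking it up in the configuration by hand), because this section does not describe how to read it programmatically.

```python
def read_overcommit_memory(path="/proc/sys/vm/overcommit_memory"):
    """Return the raw contents of the overcommit setting, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return None

def parallel_vwload_memory_ok(overcommit_text, max_overalloc=None):
    """Return True if either documented condition holds:
    overcommit_memory is 1, or max_overalloc is 0.

    overcommit_text: contents of /proc/sys/vm/overcommit_memory, or None.
    max_overalloc:   value of [memory] max_overalloc as supplied by the
                     caller (assumption), or None if unknown.
    """
    overcommit_is_one = (
        overcommit_text is not None and overcommit_text.strip() == "1"
    )
    overalloc_is_zero = (
        max_overalloc is not None and str(max_overalloc).strip() == "0"
    )
    return overcommit_is_one or overalloc_is_zero
```

A typical call would be `parallel_vwload_memory_ok(read_overcommit_memory(), max_overalloc=0)`. Note that the check runs on the machine where the script is executed, whereas the requirement applies to the server nodes.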
Limitations of Parallel vwload
In parallel mode, compression and writing data to disk work as in regular vwload; only reading and parsing of input are parallelized (and distributed over cluster nodes). Parallel vwload also imposes less communication overhead, because all the processing is performed inside a single server process. As a result, vwload in parallel mode may be faster even when loading a single file.
In parallel mode, vwload does not output all parsing and conversion errors. It outputs only the first error encountered (if any). Therefore, we strongly recommend using the --log option to see all rows that were rejected during the load and the corresponding errors. Also, in parallel mode, vwload reports only the number of loaded tuples. It does not report the number of errors (rejected rows) or the total number of processed rows.
The following options cannot be used in parallel mode:
The following options behave differently in parallel mode:
--errcount n
In regular mode, the first n errors are ignored. In parallel mode, the first n errors in each input file are ignored. In particular, with m input files, the maximum number of ignored errors is n*m. For example, with --errcount 10 and 4 input files, up to 40 errors can be ignored.
--log path
In regular mode, the specified path is a file. vwload creates the file, which contains the rejected rows and the corresponding errors.
In parallel mode, the path specified is a directory. The directory is created if it does not exist. If errors are encountered during load, vwload creates two files for each input file. For example, if errors occur while loading file "input1", vwload creates:
path/input1_reject with rows that were not loaded (were rejected)
path/input1_errors with errors that caused those rows not to be loaded
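After a parallel load, the per-file naming above can be used to find which input files produced rejects. This is a rough sketch under stated assumptions: the helper names are hypothetical, the log directory is assumed to be visible as an ordinary local path where the script runs, and it is assumed that the file name portion of each input path is used as the prefix (this section only shows the case of a bare name, "input1").

```python
import os

def expected_log_files(log_dir, input_files):
    """Map each input file to its (reject_path, errors_path) pair.

    Follows the documented pattern path/<name>_reject and path/<name>_errors.
    Using the basename of the input path as <name> is an assumption.
    """
    result = {}
    for input_file in input_files:
        name = os.path.basename(input_file)
        result[input_file] = (
            os.path.join(log_dir, name + "_reject"),
            os.path.join(log_dir, name + "_errors"),
        )
    return result

def inputs_with_errors(log_dir, input_files):
    """Return the input files for which a reject or errors file exists."""
    found = []
    for input_file, (reject_path, errors_path) in expected_log_files(
        log_dir, input_files
    ).items():
        if os.path.exists(reject_path) or os.path.exists(errors_path):
            found.append(input_file)
    return found
```

Because the two files are created only when errors occur, an empty result from `inputs_with_errors` suggests that no rows were rejected, provided the assumptions above hold.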
Note:  VectorH does not support multiple instances of vwload loading data into a single table.