vwload in Parallel Mode
The vwload --cluster option speeds up loading by parallelizing some of the processing steps.
Requirements for Parallel vwload
When parallel mode is enabled, the server performs all input processing steps: reading input files from disk, UTF-8 verification or character set conversion, parsing, and data type conversion. The input files must therefore be accessible to the server, rather than to the vwload client as in normal mode.
vwload cannot split a single large input file to load it in parallel. To benefit from parallelization, specify multiple input files. Parallelization works best when the input files are (almost) the same size.
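Because a single file is not split for you, one option is to pre-split a large file into similarly sized pieces before loading. The following is a minimal sketch, not part of vwload: the function name `split_by_lines` and the `.partNNN` naming are hypothetical, and it assumes every record ends with a newline and that no field contains an embedded newline.

```python
import os


def split_by_lines(src_path, parts, out_dir):
    """Split src_path into at most `parts` files of roughly equal byte size.

    Cuts are made only at line ends, so no record is torn in half.
    Returns the list of chunk paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    total = os.path.getsize(src_path)
    target = max(1, -(-total // parts))  # ceiling division
    base = os.path.basename(src_path)
    paths = []
    out = None
    written = 0
    with open(src_path, "rb") as src:
        for line in src:
            # Start a new chunk when the current one is full, but never
            # create more than `parts` chunks.
            if out is None or (written >= target and len(paths) < parts):
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"{base}.part{len(paths):03d}")
                out = open(path, "wb")
                paths.append(path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return paths
```

The resulting chunk files would then be passed to vwload as separate input files. They must be written to a location the server can read.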
Linux: To use the vwload command in parallel mode, either /proc/sys/vm/overcommit_memory must be set to 1, or the [memory] max_overalloc configuration parameter must be set to 0.
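The kernel setting can be inspected before a load is started. The sketch below only reads the value; the helper name `read_overcommit_mode` is hypothetical, and it does not check the max_overalloc alternative.

```python
def read_overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Return the overcommit mode as an int, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        # Not Linux, no permission, or unexpected file content.
        return None


if __name__ == "__main__":
    mode = read_overcommit_mode()
    if mode == 1:
        print("overcommit_memory is 1")
    else:
        print("overcommit_memory is", mode, "- check max_overalloc instead")
```

A value other than 1 does not by itself rule out parallel mode, because setting max_overalloc to 0 also satisfies the requirement.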
Limitations of Parallel vwload
In parallel mode, only reading and parsing the input are parallelized; compression and writing data to disk work as in regular vwload. Parallel vwload also incurs less communication overhead, because all processing is performed inside a single server process. As a result, vwload in parallel mode may be faster even when loading a single file.
In parallel mode, vwload does not output all parsing and conversion errors; it outputs only the first error encountered (if any). We therefore strongly recommend using the --log option so that you can see all rows rejected during the load and the corresponding errors. In addition, in parallel mode vwload reports only the number of loaded tuples. It does not report the number of errors (rejected rows) or the total number of processed rows.
The following options cannot be used in parallel mode:
• --skip
• --frequency
• --verbose
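A wrapper script that assembles vwload command lines could reject these combinations before invoking the tool. The following is a minimal sketch; the function `conflicting_options` is hypothetical, it recognizes only the long option names listed above, and its handling of a `--name=value` form is an assumption about how arguments might be written, not a statement about vwload's syntax.

```python
# Options listed above as unusable in parallel mode.
PARALLEL_INCOMPATIBLE = ("--skip", "--frequency", "--verbose")


def conflicting_options(args):
    """Return the options in `args` that cannot be combined with --cluster.

    `args` is a list of command-line tokens. The result is empty when
    --cluster is absent or when no incompatible option is present.
    """
    # Keep only option names, dropping any "=value" suffix.
    names = [a.split("=", 1)[0] for a in args if a.startswith("--")]
    if "--cluster" not in names:
        return []
    return [n for n in names if n in PARALLEL_INCOMPATIBLE]
```

For example, `conflicting_options(["--cluster", "--skip", "1", "data1", "data2"])` returns `["--skip"]`, while the same list without `--cluster` returns an empty list.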
The following options behave differently in parallel mode:
--errcount n
In regular mode, the first n errors are ignored. In parallel mode, the first n errors in each input file are ignored. Consequently, with m input files, up to n*m errors can be ignored in total.
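The difference between the two modes can be made concrete with a little arithmetic. The sketch below only counts ignored errors as described above; the function names are hypothetical, and it does not model what vwload does once the limit is exceeded.

```python
def max_ignored_errors(n, num_files, parallel):
    """Upper bound on the number of errors ignored with --errcount n."""
    return n * num_files if parallel else n


def ignored_errors(errors_per_file, n, parallel):
    """Number of errors ignored, given the error count of each input file."""
    if parallel:
        # The limit applies to each input file separately.
        return sum(min(n, e) for e in errors_per_file)
    # The limit applies to the load as a whole.
    return min(n, sum(errors_per_file))
```

With three input files containing 3, 0, and 7 bad rows and n = 5, `ignored_errors([3, 0, 7], 5, parallel=True)` gives 8, whereas the regular-mode call gives 5. The upper bound in parallel mode is `max_ignored_errors(5, 3, True)`, which is 15.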
--log path
In regular mode, the specified path is a file, which vwload creates. The file contains the rejected rows and the corresponding errors.
In parallel mode, the path specified is a directory. The directory is created if it does not exist. If errors are encountered during load, vwload creates two files for each input file. For example, if errors occur while loading file "input1", vwload creates:
• path/input1_reject with rows that were not loaded (were rejected)
• path/input1_errors with errors that caused those rows not to be loaded
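Since parallel mode reports only the first error on output, a script can inspect the log directory after a load to find which input files had rejected rows. This is a minimal sketch with hypothetical helper names. It assumes the log file names are built from the base name of each input file, which matches the "input1" example above but is not stated for inputs given with a directory prefix or an extension.

```python
import os


def parallel_log_files(log_dir, input_file):
    """Return (reject_path, errors_path) expected for one input file.

    Assumption: names are derived from the input file's base name.
    """
    name = os.path.basename(input_file)
    return (os.path.join(log_dir, name + "_reject"),
            os.path.join(log_dir, name + "_errors"))


def inputs_with_errors(log_dir, input_files):
    """Return the input files for which a reject or errors file exists."""
    return [f for f in input_files
            if any(os.path.exists(p) for p in parallel_log_files(log_dir, f))]
```

An empty result from `inputs_with_errors` suggests that no rows were rejected, provided the naming assumption holds for the files in question.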
Last modified date: 08/14/2024