Distributed Data Loading with vwload

Data loading with vwload can be distributed across the cluster when there are multiple input files. The achievable parallelism is limited by the number of input files and by the total number of execution cores in the cluster.
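As a rough illustration of this limit, the sketch below treats the upper bound on parallelism as the smaller of the two numbers. That reading is an assumption, and max_load_parallelism is a hypothetical helper, not part of vwload.

```shell
# Hypothetical helper (not part of vwload): estimate the upper bound on
# load parallelism as the smaller of the input file count ($1) and the
# total number of execution cores in the cluster ($2).
max_load_parallelism() {
  if [ "$1" -lt "$2" ]; then
    echo "$1"
  else
    echo "$2"
  fi
}

# 8 input files on a cluster with 32 execution cores.
max_load_parallelism 8 32   # prints 8
```

Under this reading, splitting the input into at least as many files as there are execution cores leaves the core count as the only limit.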

To use this method of data loading, all input files should be in HDFS. Use standard utilities to copy the input files there (for example, hdfs dfs -put), or generate the input files with an application that writes directly to HDFS.
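Copying the files might look like the following sketch. The target directory /path/to/data and the local file names are placeholders, and the run and stage_input_files helpers are hypothetical. By default the script only prints the hdfs commands it would issue; set DRY_RUN=0 to execute them.

```shell
# Assumed HDFS target directory; replace with your own location.
HDFS_DIR="/path/to/data"

# Hypothetical helper: print the command (DRY_RUN=1, the default)
# or execute it (DRY_RUN=0).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create the target directory, then copy each local file into it.
stage_input_files() {
  run hdfs dfs -mkdir -p "$HDFS_DIR"
  for f in "$@"; do
    run hdfs dfs -put "$f" "$HDFS_DIR/"
  done
}

stage_input_files lineitem_1.txt lineitem_2.txt
```

The dry-run default makes it possible to review the generated commands before anything is written to the cluster.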

To enable distributed data loading, add the -c option to the vwload command line and provide the full HDFS path to all input files.

The following command loads the data from multiple lineitem input files in HDFS into the lineitem table of the dbt3 database. In the input files, fields are delimited by | and records are delimited by a newline (\n).

To load the data into the lineitem table using vwload -c

Enter a command like the following at the operating system prompt:

vwload -c --fdelim "|" --rdelim "\n" --table lineitem dbt3 hdfs://namenode:8020/path/to/data/lineitem_1.txt hdfs://namenode:8020/path/to/data/lineitem_2.txt . . .
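With many input files, typing every path by hand is error-prone. The sketch below generates the paths instead. It assumes the files are named lineitem_1.txt through lineitem_N.txt under a single HDFS directory, as in the command above. The lineitem_uris helper and the HDFS_PREFIX variable are hypothetical names; the vwload options are the ones used on this page. If vwload is not on the PATH, the script only lists the paths it would pass.

```shell
# Assumed HDFS location, taken from the example command; adjust as needed.
HDFS_PREFIX="hdfs://namenode:8020/path/to/data"

# Hypothetical helper: print the URIs for lineitem_1.txt .. lineitem_N.txt,
# one per line, where N is the first argument.
lineitem_uris() {
  i=1
  while [ "$i" -le "$1" ]; do
    echo "$HDFS_PREFIX/lineitem_$i.txt"
    i=$((i + 1))
  done
}

# Collect the URIs as positional parameters. Word splitting is intended
# here, so the paths must not contain whitespace.
set -- $(lineitem_uris 3)

if command -v vwload >/dev/null 2>&1; then
  # Same options as the command above, with the generated paths appended.
  vwload -c --fdelim "|" --rdelim "\n" --table lineitem dbt3 "$@"
else
  # vwload is not installed here: just list the paths that would be loaded.
  printf '%s\n' "$@"
fi
```

Generating the list also makes it easy to match the number of input files to the parallelism available in the cluster.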