Load Data with Spark-Vector Loader
The Loader is a command-line client utility that allows you to load files such as CSV, JSON, Parquet, or ORC through Spark into Vector using the Spark Vector Connector. The Loader is contained within the Spark Vector Connector service container image but can also be accessed as a standalone tool.
For a functional setup, you need a working Spark installation and the Loader jar file. Please contact Actian support for a copy of the Loader jar and the installation requirements for Spark.
The following section describes how to use the Loader included in the service container.
Using the Loader Included in the Service Container
To load files from local storage, you must mount a folder from the local file system into the container, as described in Section [Setting up service containers]. The content of the mounted folder appears under /opt/user_mount/ in the container file system. The Loader jar is located at /opt/spark/loader/spark_vector_loader.jar in the container.
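For example, mounting a local data folder might look like the following sketch (the local path and image name are placeholders; use the exact run options described in Setting up service containers):
docker run -d --name spark -v /home/user/data:/opt/user_mount <spark-vector-connector-image>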
To simplify the examples, we assume that the running container is named spark, the Vector host is localhost, the installation ID is VW, the database is named testdb, and the user and password are both actian.
To view all the command line options for loading files of different formats into Vector, along with their meanings, use the following command:
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar --help'
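All of the load commands in the examples that follow share the same general shape; only the format keyword (csv, json, parquet, or orc) and the format-specific options change. As a rough template (angle-bracket values are placeholders):
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load <format> -sf <source file> -vh <Vector host> -vi <installation id> -vd <database> -vu <user> -vp <password> -tt <target table> [format-specific options]'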
Examples of Loading Data with Spark-Vector Loader
The following sections show sample data files and the commands used to load their contents into Vector using the Spark-Vector Loader.
CSV File with Header
employee_header_test.csv
"name"|"salary"
"abc"|100
"xyz"|150
Command
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load csv -sf /opt/user_mount/employee_header_test.csv -vh localhost -vi VW -vd testdb -vu actian -vp actian -ct true -tt svc_csv_with_header -sc "|" -sh true'
Note: This command creates the table because -ct true is specified.
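To confirm the load, you can query the new table from any Vector SQL session (a quick check, assuming the load above succeeded; the two rows from the sample file should be returned):
SELECT name, salary FROM svc_csv_with_header;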
CSV File without Header
employee_test.csv
"abc"|100
"xyz"|150
Command
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load csv -sf /opt/user_mount/employee_test.csv -h "name string,salary int" -vh localhost -vi VW -vd testdb -vu actian -vp actian -ct true -tt svc_csv_without_header -sc "|"'
This command creates the table because -ct true is specified. Because the file has no header row, the -h option supplies the schema ("name string,salary int") that names and types the columns.
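A similar check shows that the column names come from the -h schema string rather than from the file (a sketch, assuming the load succeeded):
SELECT name, salary FROM svc_csv_without_header;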
JSON File
employee_test.json
{"name":"Michael", "salary":3000}
{"name":"Andy", "salary":4000}
{"name":"Justin", "salary":5900}
{"name":"Berta", "salary":4800}
{"name":"Raju", "salary":3900}
{"name":"Chandy", "salary":5500}
{"name":"Joey", "salary":3500}
{"name":"Mon", "salary":4000}
{"name":"Rachel", "salary":4000}
Command
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load json -sf /opt/user_mount/employee_test.json -vh localhost -vi VW -vd testdb -vu actian -vp actian -ct true -tt svc_json'
This command creates the table because -ct true is specified. The -ac true option allows unquoted JSON field names, as shown in the sketch below.
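If the JSON field names were not quoted, the -ac true option could be appended to the same command (a sketch; all other options are unchanged):
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load json -sf /opt/user_mount/employee_test.json -vh localhost -vi VW -vd testdb -vu actian -vp actian -ct true -tt svc_json -ac true'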
Parquet File
userdata.parquet
Column details:
column#   column_name         hive_datatype
===============================================
1         registration_dttm   timestamp
2         id                  int
3         first_name          string
4         last_name           string
5         email               string
6         gender              string
7         ip_address          string
8         cc                  string
9         country             string
10        birthdate           string
11        salary              double
12        title               string
13        comments            string
Command
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load parquet -sf /opt/user_mount/userdata.parquet -vh localhost -vi VW -vd testdb -vu actian -vp actian -ct true -tt svc_parquet -cols "registration_dttm,id,first_name,last_name,email,gender,ip_address,cc,country,birthdate,salary,title"'
This command creates the table because -ct true is set. The -cols option specifies the columns to load into the Vector table. The command lists every column except comments, so it creates the table svc_parquet without the comments column.
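To confirm that the rows loaded, you can run a simple query from a Vector SQL session (a sketch, assuming the load succeeded):
SELECT COUNT(*) FROM svc_parquet;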
For sample data reference, see https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet.
ORC File
userdata.orc
This ORC file does not have a header (the columns are unnamed).
Column details:
column#   column_name         hive_datatype
==============================================
1         registration_dttm   timestamp
2         id                  int
3         first_name          string
4         last_name           string
5         email               string
6         gender              string
7         ip_address          string
8         cc                  string
9         country             string
10        birthdate           string
11        salary              double
12        title               string
13        comments            string
Command
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load orc -sf /opt/user_mount/userdata.orc -vh localhost -vi VW -vd testdb -vu actian -vp actian -ct true -tt svc_orc -cols "_col0,_col1,_col2,_col3,_col4,_col5,_col6,_col7,_col8,_col9,_col10,_col11"'
Because this file does not have a header, the table is created with column names _col0, _col1, and so on. We specify _col0 through _col11, so the command creates the table svc_orc with all columns except the comments column.
Another way to load this file is to create the svc_orc table with a CREATE TABLE statement, and then run the spark-submit command without the -ct flag:
CREATE TABLE svc_orc(
registration_dttm TIMESTAMP,
id INTEGER,
first_name VARCHAR(50),
last_name VARCHAR(50),
email VARCHAR(50),
gender VARCHAR(50),
ip_address VARCHAR(50),
cc VARCHAR(50),
country VARCHAR(50),
birthdate VARCHAR(50),
salary FLOAT8,
title VARCHAR(50)
) WITH STRUCTURE=X100;
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load orc -sf /opt/user_mount/userdata.orc -vh localhost -vi VW -vd testdb -vu actian -vp actian -tt svc_orc -cols "_col0,_col1,_col2,_col3,_col4,_col5,_col6,_col7,_col8,_col9,_col10,_col11"'
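With either variant, the result can be checked from the Vector side, for example with the sql terminal monitor on the Vector host (a sketch; your connection method may differ):
sql testdb
select count(*) from svc_orc\g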
For sample data reference, see https://github.com/Teradata/kylo/tree/master/samples/sample-data/orc.