Load Data with Spark-Vector Loader
The Loader is a command-line client utility that allows you to load files such as CSV, JSON, Parquet, or ORC through Spark into Vector using the Spark Vector Connector. The Loader is contained within the Spark Vector Connector service container image but can also be accessed as a standalone tool.
For a functional setup, you need a working Spark installation and the Loader jar file. Please contact Actian support for a copy of the Loader jar and the installation requirements for Spark.
The following section describes how to use the Loader included in the service container.
Using the Loader Included in the Service Container
To load files from local storage, you must mount a folder from the local file system into the container, as described in Section [Setting up service containers]. The content of the mounted folder appears under /opt/user_mount/ in the container file system. The Loader jar is located at /opt/spark/loader/spark_vector_loader.jar in the container.
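For example, mounting a local data folder might look like the following sketch (the local path and image name are placeholders; use the exact run options described in Setting up service containers):
docker run -d --name spark -v /home/user/data:/opt/user_mount <spark-vector-connector-image>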
To simplify the examples, we assume that the running container is named spark, the Vector host is localhost, the installation ID is VW, the database is named testdb, and the user and password are both actian.
To view all the command line options for loading files of different formats into Vector, along with their meanings, use the following command:
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar --help'
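All of the load commands in the examples that follow share the same general shape; only the format keyword (csv, json, parquet, or orc) and the format-specific options change. As a rough template (angle-bracket values are placeholders):
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load <format> -sf <source file> -vh <Vector host> -vi <installation id> -vd <database> -vu <user> -vp <password> -tt <target table> [format-specific options]'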
Examples of Loading Data with Spark-Vector Loader
The following sections show sample data files and the commands used to load their contents into Vector using the Spark-Vector Loader.
CSV File with Header
employee_header_test.csv
"name"|"salary"
"abc"|100
"xyz"|150
Command
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load csv -sf /opt/user_mount/employee_header_test.csv -vh localhost -vi VW -vd testdb -vu actian -vp actian -ct true -tt svc_csv_with_header -sc "|" -sh true'
Note: This command creates the table because -ct true is specified.
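To confirm the load, you can query the new table from any Vector SQL session (a quick check, assuming the load above succeeded; the two rows from the sample file should be returned):
SELECT name, salary FROM svc_csv_with_header;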
CSV File without Header
employee_test.csv
"abc"|100
"xyz"|150
Command
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load csv -sf /opt/user_mount/employee_test.csv -h "name string,salary int" -vh localhost -vi VW -vd testdb -vu actian -vp actian -ct true -tt svc_csv_without_header -sc "|"'
This command creates the table because -ct true is specified. Because the file has no header row, the -h option supplies the schema ("name string,salary int") that names and types the columns.
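A similar check shows that the column names come from the -h schema string rather than from the file (a sketch, assuming the load succeeded):
SELECT name, salary FROM svc_csv_without_header;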
JSON File
employee_test.json
{"name":"Michael", "salary":3000}
{"name":"Andy", "salary":4000}
{"name":"Justin", "salary":5900}
{"name":"Berta", "salary":4800}
{"name":"Raju", "salary":3900}
{"name":"Chandy", "salary":5500}
{"name":"Joey", "salary":3500}
{"name":"Mon", "salary":4000}
{"name":"Rachel", "salary":4000}
Command
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load json -sf /opt/user_mount/employee_test.json -vh localhost -vi VW -vd testdb -vu actian -vp actian -ct true -tt svc_json'
This command creates the table because -ct true is specified. The -ac true option allows unquoted JSON field names, as shown in the sketch below.
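If the JSON field names were not quoted, the -ac true option could be appended to the same command (a sketch; all other options are unchanged):
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load json -sf /opt/user_mount/employee_test.json -vh localhost -vi VW -vd testdb -vu actian -vp actian -ct true -tt svc_json -ac true'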
Parquet File
userdata.parquet
Column details:
column#   column_name         hive_datatype
===============================================
1         registration_dttm   timestamp
2         id                  int
3         first_name          string
4         last_name           string
5         email               string
6         gender              string
7         ip_address          string
8         cc                  string
9         country             string
10        birthdate           string
11        salary              double
12        title               string
13        comments            string
Command
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load parquet -sf /opt/user_mount/userdata.parquet -vh localhost -vi VW -vd testdb -vu actian -vp actian -ct true -tt svc_parquet -cols "registration_dttm,id,first_name,last_name,email,gender,ip_address,cc,country,birthdate,salary,title"'
This command creates the table because -ct true is set. The -cols option specifies the columns to load into the Vector table. The command lists every column except comments, so it creates the table svc_parquet without the comments column.
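To confirm that the rows loaded, you can run a simple query from a Vector SQL session (a sketch, assuming the load succeeded):
SELECT COUNT(*) FROM svc_parquet;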
For sample data reference, see https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet.
ORC File
userdata.orc
This ORC file does not have a header (the columns are unnamed).
Column details:
column#   column_name         hive_datatype
==============================================
1         registration_dttm   timestamp
2         id                  int
3         first_name          string
4         last_name           string
5         email               string
6         gender              string
7         ip_address          string
8         cc                  string
9         country             string
10        birthdate           string
11        salary              double
12        title               string
13        comments            string
Command
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load orc -sf /opt/user_mount/userdata.orc -vh localhost -vi VW -vd testdb -vu actian -vp actian -ct true -tt svc_orc -cols "_col0,_col1,_col2,_col3,_col4,_col5,_col6,_col7,_col8,_col9,_col10,_col11"'
Because this file does not have a header, the table is created with column names _col0, _col1, and so on. We specify _col0 through _col11, so the command creates the table svc_orc with all columns except the comments column.
Another way to load this file is to create the svc_orc table with a CREATE TABLE statement, and then run the spark-submit command without the -ct flag:
CREATE TABLE svc_orc(
registration_dttm TIMESTAMP,
id INTEGER,
first_name VARCHAR(50),
last_name VARCHAR(50),
email VARCHAR(50),
gender VARCHAR(50),
ip_address VARCHAR(50),
cc VARCHAR(50),
country VARCHAR(50),
birthdate VARCHAR(50),
salary FLOAT8,
title VARCHAR(50)
) WITH STRUCTURE=X100;
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load orc -sf /opt/user_mount/userdata.orc -vh localhost -vi VW -vd testdb -vu actian -vp actian -tt svc_orc -cols "_col0,_col1,_col2,_col3,_col4,_col5,_col6,_col7,_col8,_col9,_col10,_col11"'
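With either variant, the result can be checked from the Vector side, for example with the sql terminal monitor on the Vector host (a sketch; your connection method may differ):
sql testdb
select count(*) from svc_orc\g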
For sample data reference, see https://github.com/Teradata/kylo/tree/master/samples/sample-data/orc.