Load Data with Spark-Vector Loader
The Spark-Vector Connector (spark_vector_providerX.jar) contains a Spark-Vector Loader component. The Loader is also available standalone.
The Loader is a command-line client utility that loads CSV, JSON, Parquet, and ORC files through Spark into Vector using the Spark-Vector Connector. For more information, see https://github.com/ActianCorp/spark-vector.
To view all the command-line options for loading files in various formats into Vector, download the Spark-Vector Loader assembly (spark_vector_loader-assembly-2.1.jar) from https://esd.actian.com/product/drivers/Spark, and then run the following command:
spark-submit --class com.actian.spark_vector.loader.Main spark_vector_loader-assembly-2.1.jar --help
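The examples in the following sections share a common set of options. The following sketch summarizes the general form of a load invocation; the option descriptions are inferred from the examples in this section, so treat the --help output as the authoritative reference:
# General form (option meanings inferred from the examples below):
#   -sf     source file to load
#   -h      schema, as a comma-separated list of "name type" pairs
#   -sh     true if the CSV source file has a header row
#   -sc     CSV field separator character
#   -ac     true to allow unquoted JSON fields
#   -cols   comma-separated list of source columns to load
#   -vh     Vector host name
#   -vi     Vector instance ID
#   -vd     Vector database name
#   -tt     target table name
#   -ct     true to create the target table
spark-submit --class com.actian.spark_vector.loader.Main spark_vector_loader-assembly-2.1.jar load {csv|json|parquet|orc} [options]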
Examples of Loading Data with Spark-Vector Loader
The following sections show sample data files and the commands used to load their contents into Vector using the Spark-Vector Loader.
CSV File with Header
employee_header_test.csv
"name"|"salary"
"abc"|100
"xyz"|150
Command
spark-submit --class com.actian.spark_vector.loader.Main spark_vector_loader-assembly-2.1.jar load csv -sf /home/hcl01/Spark/csv/employee_header_test.csv -h "name string,salary int" -sh true -vh usau-hcl-lnx01 -vi V2 -vd sparktest -ct true -tt svc_csv_with_header -sc "|"
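This command creates the table svc_csv_with_header because -ct true is specified; -sh true tells the Loader that the source file begins with a header row.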
CSV File without Header
employee_test.csv
"abc"|100
"xyz"|150
Command
spark-submit --class com.actian.spark_vector.loader.Main spark_vector_loader-assembly-2.1.jar load csv -sf /home/hcl01/Spark/csv/employee_test.csv -h "name string,salary int" -vh usau-hcl-lnx01 -vi V2 -vd sparktest -ct true -tt svc_csv_without_header -sc "|"
This command creates a table because -ct true is specified. Without -ct true, the target table must already exist (see the ORC example below).
JSON File
employee_test.json
{"name":"Michael", "salary":3000}
{"name":"Andy", "salary":4000}
{"name":"Justin", "salary":5900}
{"name":"Berta", "salary":4800}
{"name":"Raju", "salary":3900}
{"name":"Chandy", "salary":5500}
{"name":"Joey", "salary":3500}
{"name":"Mon", "salary":4000}
{"name":"Rachel", "salary":4000}
Command
spark-submit --class com.actian.spark_vector.loader.Main spark_vector_loader-assembly-2.1.jar load json -sf /home/hcl01/Spark/json/employee_test.json -h "name string,salary int" -ac true -vh usau-hcl-lnx01 -vi V2 -vd sparktest -ct true -tt svc_json
This command creates a table because -ct true is specified; -ac true allows unquoted JSON fields.
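For example, -ac true would permit a hypothetical variant of the sample file in which the field names are not quoted:
{name:"Michael", salary:3000}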
Parquet File
userdata.parquet
Column details:
column#   column_name           hive_datatype
==============================================
1         registration_dttm     timestamp
2         id                    int
3         first_name            string
4         last_name             string
5         email                 string
6         gender                string
7         ip_address            string
8         cc                    string
9         country               string
10        birthdate             string
11        salary                double
12        title                 string
13        comments              string
Command
spark-submit --class com.actian.spark_vector.loader.Main spark_vector_loader-assembly-2.1.jar load parquet -sf /home/hcl01/Spark/parquet/userdata.parquet -cols "registration_dttm,id,first_name,last_name,email,gender,ip_address,cc,country,birthdate,salary,title" -vh usau-hcl-lnx01 -vi V2 -vd sparktest -ct true -tt svc_parquet
This command creates a table because -ct true is specified; -cols specifies the columns to load into the Vector table. Because the column list includes every column except comments, the table svc_parquet is created without the comments column.
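Because -cols appears to accept any subset of the source columns (as the comments example above suggests), you can load a narrower projection. The following is a minimal sketch, assuming the same file and connection details as above; the target table name svc_parquet_subset is hypothetical:
spark-submit --class com.actian.spark_vector.loader.Main spark_vector_loader-assembly-2.1.jar load parquet -sf /home/hcl01/Spark/parquet/userdata.parquet -cols "id,first_name,salary" -vh usau-hcl-lnx01 -vi V2 -vd sparktest -ct true -tt svc_parquet_subset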
For sample data reference, see https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet.
ORC File
UserData.orc
The ORC file does not have a header.
Column details:
column#   column_name           hive_datatype
==============================================
1         registration_dttm     timestamp
2         id                    int
3         first_name            string
4         last_name             string
5         email                 string
6         gender                string
7         ip_address            string
8         cc                    string
9         country               string
10        birthdate             string
11        salary                double
12        title                 string
13        comments              string
Command
spark-submit --class com.actian.spark_vector.loader.Main spark_vector_loader-assembly-2.1.jar load orc -sf /home/hcl01/Spark/orc/data1.orc -cols "_col0,_col1,_col2,_col3,_col4,_col5,_col6,_col7,_col8,_col9,_col10,_col11" -vh usau-hcl-lnx01 -vi V2 -vd sparktest -ct true -tt svc_orc
This command creates a table because -ct true is specified; -cols specifies the columns to load into the Vector table. Because the file does not have a header, its columns are named _col0, _col1, and so on, and the created table uses those names. Specifying _col0 through _col11 creates the table svc_orc with all columns except the comments column.
Another way to load this file is to create the svc_orc table with a CREATE TABLE statement, and then run the spark-submit command without the -ct flag:
CREATE TABLE svc_orc(
    registration_dttm    TIMESTAMP,
    id                   INTEGER,
    first_name           VARCHAR(50),
    last_name            VARCHAR(50),
    email                VARCHAR(50),
    gender               VARCHAR(50),
    ip_address           VARCHAR(50),
    cc                   VARCHAR(50),
    country              VARCHAR(50),
    birthdate            VARCHAR(50),
    salary               FLOAT8,
    title                VARCHAR(50)
) WITH STRUCTURE=X100;
spark-submit --class com.actian.spark_vector.loader.Main spark_vector_loader-assembly-2.1.jar load orc -sf /home/hcl01/Spark/orc/data1.orc -cols "_col0,_col1,_col2,_col3,_col4,_col5,_col6,_col7,_col8,_col9,_col10,_col11" -vh usau-hcl-lnx01 -vi V2 -vd sparktest -tt svc_orc
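After the load completes, you can verify the row count from the command line. The following is a minimal sketch, assuming the Vector sql terminal monitor is available on the host and the database is sparktest:
sql sparktest <<'EOF'
SELECT COUNT(*) FROM svc_orc;\g
EOF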
For sample data reference, see https://github.com/Teradata/kylo/tree/master/samples/sample-data/orc.
Last modified date: 03/21/2024