Load Data with Spark-Vector Loader
The Spark-Vector Connector (spark_vector_providerX.jar) contains a Spark-Vector Loader component. The Loader is also available standalone.
The Loader is a command line client utility that lets you load CSV, JSON, Parquet, and ORC files through Spark into Vector, using the Spark-Vector Connector. For more information, see https://github.com/ActianCorp/spark-vector.
To view all command line options for loading files of the supported formats into Vector, download the spark_vector_loader_assembly JAR from https://esd.actian.com/product/drivers/Spark, and then run the following command:
spark-submit --class com.actian.spark_vector.loader.Main spark_vector_loader-assembly-2.1.jar --help
Examples of Loading Data with Spark-Vector Loader
The following sections show sample data files and the commands used to load their contents into Vector using the Spark-Vector Loader.
CSV File with Header
employee_header_test.csv
"name"|"salary"
"abc"|100
"xyz"|150
Command
spark-submit --class com.actian.spark_vector.loader.Main spark_vector_loader-assembly-2.1.jar load csv -sf /home/hcl01/Spark/csv/employee_header_test.csv -h "name string,salary int" -sh true -vh usau-hcl-lnx01 -vi V2 -vd sparktest -ct true -tt svc_csv_with_header -sc "|"
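To try the command above, the sample file must exist first. The following bash sketch writes the three sample lines to a file and checks the result. The temporary directory and the variable names are assumptions made for illustration; the documented example reads the file from /home/hcl01/Spark/csv instead.

```shell
#!/usr/bin/env bash

# Hypothetical working directory (the documented example uses /home/hcl01/Spark/csv).
workdir="$(mktemp -d)"
csv_file="$workdir/employee_header_test.csv"

# The quoted heredoc delimiter makes the shell write the content verbatim,
# so the double quotes and the pipe separator are preserved.
cat > "$csv_file" <<'EOF'
"name"|"salary"
"abc"|100
"xyz"|150
EOF

# Count the data rows, excluding the header line.
data_rows="$(tail -n +2 "$csv_file" | wc -l | tr -d ' ')"
echo "wrote $csv_file with $data_rows data rows"
```

The path held in csv_file would then be passed to the Loader with -sf.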
CSV File without Header
employee_test.csv
"abc"|100
"xyz"|150
Command
spark-submit --class com.actian.spark_vector.loader.Main spark_vector_loader-assembly-2.1.jar load csv -sf /home/hcl01/Spark/csv/employee_test.csv -h "name string,salary int" -vh usau-hcl-lnx01 -vi V2 -vd sparktest -ct true -tt svc_csv_without_header -sc "|"
Because -ct true is specified, the command creates the target table (here, svc_csv_without_header).
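Long one-line invocations like the one above are easy to mistype. As a sketch, the same arguments can be kept in a bash array, one option per line. The option descriptions in the comments are inferred from the examples on this page and should be confirmed with --help. DRY_RUN is a hypothetical variable introduced here, not a Loader feature; it defaults to printing the command instead of running it.

```shell
#!/usr/bin/env bash

# Values copied from the example above; adjust them for your environment.
loader_jar="spark_vector_loader-assembly-2.1.jar"

# Option meanings are inferred from the examples, not taken from the Loader's help text.
args=(
  load csv
  -sf /home/hcl01/Spark/csv/employee_test.csv  # source file
  -h "name string,salary int"                  # column names and types
  -vh usau-hcl-lnx01                           # Vector host
  -vi V2                                       # Vector instance
  -vd sparktest                                # Vector database
  -ct true                                     # create the target table
  -tt svc_csv_without_header                   # target table
  -sc "|"                                      # field separator
)

cmd=(spark-submit --class com.actian.spark_vector.loader.Main "$loader_jar" "${args[@]}")

# DRY_RUN is a hypothetical switch: 1 (the default) prints the command, 0 runs it.
if [ "${DRY_RUN:-1}" = "1" ]; then
  printf '%q ' "${cmd[@]}"
  printf '\n'
else
  "${cmd[@]}"
fi
```

Keeping the arguments in an array preserves the quoting of values such as the schema string and the "|" separator when the command is finally executed.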
JSON File
employee_test.json
{"name":"Michael", "salary":3000}
{"name":"Andy", "salary":4000}
{"name":"Justin", "salary":5900}
{"name":"Berta", "salary":4800}
{"name":"Raju", "salary":3900}
{"name":"Chandy", "salary":5500}
{"name":"Joey", "salary":3500}
{"name":"Mon", "salary":4000}
{"name":"Rachel", "salary":4000}
Command
spark-submit --class com.actian.spark_vector.loader.Main spark_vector_loader-assembly-2.1.jar load json -sf /home/hcl01/Spark/json/employee_test.json -h "name string,salary int" -ac true -vh usau-hcl-lnx01 -vi V2 -vd sparktest -ct true -tt svc_json
Because -ct true is specified, the command creates the svc_json table. The -ac true option allows unquoted JSON fields.
Parquet File
userdata.parquet
Column details:
column#   column_name           hive_datatype
==============================================
1         registration_dttm     timestamp
2         id                    int
3         first_name            string
4         last_name             string
5         email                 string
6         gender                string
7         ip_address            string
8         cc                    string
9         country               string
10        birthdate             string
11        salary                double
12        title                 string
13        comments              string
Command
spark-submit --class com.actian.spark_vector.loader.Main spark_vector_loader-assembly-2.1.jar load parquet -sf /home/hcl01/Spark/parquet/userdata.parquet -cols "registration_dttm,id,first_name,last_name,email,gender,ip_address,cc,country,birthdate,salary,title" -vh usau-hcl-lnx01 -vi V2 -vd sparktest -ct true -tt svc_parquet
Because -ct true is specified, the command creates the svc_parquet table. The -cols option specifies which columns to load into the Vector table. The list names every column except comments, so svc_parquet is created with all columns except the comments column.
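The value passed to -cols can also be derived from the full column list rather than typed by hand. The following bash sketch removes one column and joins the rest with commas. The variable names are assumptions made for illustration.

```shell
#!/usr/bin/env bash

# All columns of the sample Parquet file, in file order.
all_columns=(registration_dttm id first_name last_name email gender
             ip_address cc country birthdate salary title comments)

# Column to leave out of the load.
exclude="comments"

# Keep every column whose name differs from the excluded one.
selected=()
for col in "${all_columns[@]}"; do
  if [ "$col" != "$exclude" ]; then
    selected+=("$col")
  fi
done

# Join the remaining names with commas; IFS is changed only inside the subshell.
cols="$(IFS=,; echo "${selected[*]}")"
echo "$cols"
```

The resulting string matches the -cols value in the command above and can be passed as -cols "$cols".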
For sample data reference, see https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet.
ORC File
UserData.orc
The ORC file does not have a header.
Column details:
column#   column_name           hive_datatype
==============================================
1         registration_dttm     timestamp
2         id                    int
3         first_name            string
4         last_name             string
5         email                 string
6         gender                string
7         ip_address            string
8         cc                    string
9         country               string
10        birthdate             string
11        salary                double
12        title                 string
13        comments              string
Command
spark-submit --class com.actian.spark_vector.loader.Main spark_vector_loader-assembly-2.1.jar load orc -sf /home/hcl01/Spark/orc/data1.orc -cols "_col0,_col1,_col2,_col3,_col4,_col5,_col6,_col7,_col8,_col9,_col10,_col11" -vh usau-hcl-lnx01 -vi V2 -vd sparktest -ct true -tt svc_orc
Because -ct true is specified, the command creates the svc_orc table. The -cols option specifies which columns to load into the Vector table. Because the file does not have a header, the columns are named _col0, _col1, and so on. The command lists _col0 through _col11, so svc_orc is created with all columns except the thirteenth column (_col12), which is the comments column.
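For headerless files with many columns, the positional names can be generated instead of typed. The following sketch builds the _col0 through _col11 list used above. The variable names are assumptions made for illustration.

```shell
#!/usr/bin/env bash

# Number of leading columns to load (12 selects _col0 through _col11).
count=12

cols=""
i=0
while [ "$i" -lt "$count" ]; do
  # Prepend a comma only when the list already has an entry.
  cols="${cols:+$cols,}_col$i"
  i=$((i + 1))
done

echo "$cols"
```

The generated string can then be passed as -cols "$cols".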
Another way to load this file is to create the svc_orc table with a CREATE TABLE statement, and then run the spark-submit command without the -ct option:
CREATE TABLE svc_orc(
    registration_dttm    TIMESTAMP,
    id                   INTEGER,
    first_name           VARCHAR(50),
    last_name            VARCHAR(50),
    email                VARCHAR(50),
    gender               VARCHAR(50),
    ip_address           VARCHAR(50),
    cc                   VARCHAR(50),
    country              VARCHAR(50),
    birthdate            VARCHAR(50),
    salary               FLOAT8,
    title                VARCHAR(50)
) WITH STRUCTURE=X100;
spark-submit --class com.actian.spark_vector.loader.Main spark_vector_loader-assembly-2.1.jar load orc -sf /home/hcl01/Spark/orc/data1.orc -cols "_col0,_col1,_col2,_col3,_col4,_col5,_col6,_col7,_col8,_col9,_col10,_col11" -vh usau-hcl-lnx01 -vi V2 -vd sparktest -tt svc_orc
For sample data reference, see https://github.com/Teradata/kylo/tree/master/samples/sample-data/orc.
Last modified date: 12/06/2024