Load Data with Spark Loader
The Spark Loader is a command-line client utility that loads files in formats such as CSV, JSON, Parquet, and ORC through Spark into Actian Ingres using the Spark Connector. The Spark Loader is included in the Spark Connector service container image but can also be used as a standalone tool.
For a functional setup, you need a working Spark installation and the Loader JAR file. Contact Actian support for a copy of the Loader JAR file and the installation requirements for Spark.
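For example, once Spark is installed, the standalone Loader can be invoked with spark-submit in the same way as shown for the container below. The JAR path here is only a placeholder for wherever you have stored the file:
spark-submit --class com.actian.spark_vector.loader.Main /path/to/spark_vector_loader.jar --help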
Using the Loader Included in the Service Container
To load files from local storage, you must mount a folder from the local file system into the container, as described in Setting up service containers. The content of the mounted folder appears under /opt/user_mount/ in the container file system. The Loader JAR is located at /opt/spark/loader/spark_vector_loader.jar in the container.
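As an illustration only (the image name and run options below are assumptions; see Setting up service containers for the actual procedure), mounting a local folder when starting the container makes its files visible under /opt/user_mount/:
docker run -d --name spark -v /home/user/data:/opt/user_mount <spark-connector-image>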
For example:
Assume that the running container is named spark, the Actian Ingres host is localhost, the installation ID is VW, the database is named testdb, and the user and password are both actian.
To view all the command-line options for loading file contents in different formats into Actian Ingres, run the following command:
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar --help'
Examples of Loading Data with Spark Loader
The following are the sample data files and the commands used to load their contents into Actian Ingres using the Spark Loader.
CSV File with Header
employee_header_test.csv
"name"|"salary"
"abc"|100
"xyz"|150
Command
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load csv -sf /opt/user_mount/employee_header_test.csv -vh localhost -vi VW -vd testdb -vu actian -vp actian -ct true -tt svc_csv_with_header -sc "|" -sh true'
Note:  This command creates the table because -ct true is specified.
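To verify the load, you can query the new table with the Ingres terminal monitor on the database host, for example:
sql testdb
SELECT * FROM svc_csv_with_header;\g
\q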
CSV File without Header
employee_test.csv
"abc"|100
"xyz"|150
Command
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load csv -sf /opt/user_mount/employee_test.csv -h "name string,salary int" -vh localhost -vi VW -vd testdb -vu actian -vp actian -ct true -tt svc_csv_without_header -sc "|"'
Note:  This command creates the table because -ct true is specified.
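As with the ORC example later on this page, you can instead create the target table yourself and run the same command without the -ct flag. The column types below are only an assumption based on the -h schema ("name string,salary int"):
CREATE TABLE svc_csv_without_header(
    name      VARCHAR(50),
    salary    INTEGER
) WITH STRUCTURE=X100;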
JSON File
employee_test.json
{"name":"Michael", "salary":3000}
{"name":"Andy", "salary":4000}
{"name":"Justin", "salary":5900}
{"name":"Berta", "salary":4800}
{"name":"Raju", "salary":3900}
{"name":"Chandy", "salary":5500}
{"name":"Joey", "salary":3500}
{"name":"Mon", "salary":4000}
{"name":"Rachel", "salary":4000}
Command
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load json -sf /opt/user_mount/employee_test.json -vh localhost -vi VW -vd testdb -vu actian -vp actian -ct true -tt svc_json'
Parquet File
userdata.parquet
column#   column_name           hive_datatype
==============================================
1         registration_dttm     timestamp
2         id                    int
3         first_name            string
4         last_name             string
5         email                 string
6         gender                string
7         ip_address            string
8         cc                    string
9         country               string
10        birthdate             string
11        salary                double
12        title                 string
13        comments              string
Command
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load parquet -sf /opt/user_mount/userdata.parquet -vh localhost -vi VW -vd testdb -vu actian -vp actian -ct true -tt svc_parquet -cols "registration_dttm,id,first_name,last_name,email,gender,ip_address,cc,country,birthdate,salary,title"'
This command creates the table because -ct is set to true. The -cols option specifies the columns to load into the Actian Ingres table. Because every column except comments is listed, the table svc_parquet is created without the comments column.
For sample data reference, see https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet.
ORC File
userdata.orc
This ORC file does not have a header.
column#   column_name           hive_datatype
==============================================
1         registration_dttm     timestamp
2         id                    int
3         first_name            string
4         last_name             string
5         email                 string
6         gender                string
7         ip_address            string
8         cc                    string
9         country               string
10        birthdate             string
11        salary                double
12        title                 string
13        comments              string
Command
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load orc -sf /opt/user_mount/userdata.orc -vh localhost -vi VW -vd testdb -vu actian -vp actian -ct true -tt svc_orc -cols "_col0,_col1,_col2,_col3,_col4,_col5,_col6,_col7,_col8,_col9,_col10,_col11"'
Note:  Since this file does not have a header, the table is created with column names _col0, _col1, and so on. Because _col0 through _col11 are specified, the command creates the table svc_orc with all columns except the comments column.
Another way to load this file is to create the svc_orc table using the CREATE TABLE command, and then run the spark-submit command without the -ct flag:
CREATE TABLE svc_orc(
    registration_dttm    TIMESTAMP,
    id                   INTEGER,
    first_name           VARCHAR(50),
    last_name            VARCHAR(50),
    email                VARCHAR(50),
    gender               VARCHAR(50),
    ip_address           VARCHAR(50),
    cc                   VARCHAR(50),
    country              VARCHAR(50),
    birthdate            VARCHAR(50),
    salary               FLOAT8,
    title                VARCHAR(50)
) WITH STRUCTURE=X100;
docker exec spark /bin/bash -c '/opt/spark/bin/spark-submit --class com.actian.spark_vector.loader.Main /opt/spark/loader/spark_vector_loader.jar load orc -sf /opt/user_mount/userdata.orc -vh localhost -vi VW -vd testdb -vu actian -vp actian -tt svc_orc -cols "_col0,_col1,_col2,_col3,_col4,_col5,_col6,_col7,_col8,_col9,_col10,_col11"'
The column names are matched automatically. However, the column order must match the table definition, and columns can be skipped only at the end of the list, not in the middle.
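For example, reusing the column list from the command above (shown here only to illustrate the ordering rule):
-cols "_col0,_col1,_col2,_col3,_col4,_col5,_col6,_col7,_col8,_col9,_col10,_col11"
works because it follows the order of the svc_orc columns and drops only the trailing comments column, while a list such as
-cols "_col0,_col2,_col3,_col4,_col5,_col6,_col7,_col8,_col9,_col10,_col11"
skips _col1 in the middle and is therefore not supported.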
For sample data reference, see https://github.com/Teradata/kylo/tree/master/samples/sample-data/orc.