Vector 6.2 | Initial Data Load

Deployment Guide > Deployment Guide > Data Loading > Initial Data Load

Was this helpful?

Initial Data Load

To initially load your data efficiently into Vector you should always use a bulk approach. This section discusses various methods for the initial load.

vwload Utility

The vwload utility reads data from files, directly appending data blocks to the database bypassing the SQL layer.

The utility can be run on the server that runs Vector, in which case the database files must be available locally to the server. You can also run the vwload utility from a remote machine.

Data files must be in a consistent character set encoding to fully load into Vector. If the files are not using UTF8 encoding, then you can pass the encoding as a parameter into vwload and it will load the data correctly.

Below is an example of using the vwload utility.

In a database called test we create a table called region:

CREATE TABLE region

( r_regionkey INTEGER not null,

r_name CHAR(25) not null,

r_comment VARCHAR(152) not null

);

To load the following set of data in a local data file region.tbl:

0|AFRICA|Continent of the elephants

1|AMERICA|Both North and South America

2|ASIA|Where tigers live

3|EUROPE|Many languages are spoken here

4|MIDDLE EAST|Sunny and very warm

5|AUSTRALIA|For goodness sake, please don’t leave us out

use the following statement:

vwload -t region test region.tbl

The vwload utility has options for error handling, date formats, use of enclosures and escape character, and so on. For more information, refer to the Vector User Guide. At present, time and timestamp fields must use the ANSI data representation to load correctly:

• hh24:mi:ss for time fields

• yyyy-mm-dd hh24:mi:ss.ffffff for timestamp fields

COPY VWLOAD Statement

The COPY...VWLOAD statement copies the contents of a file to a table using the VWLOAD operator. This operation is like using vwload -‑cluster on the command line, except that the load can be done from within a multi-statement transaction instead of in a separate session.

The vwload --cluster option speeds up loading by parallelizing some of the processing steps.

To use this option, multiple input files are required, and they must reside on the server, not the client. Parallelization works best when input files have (almost) the same size.

For example, bulk load the data from multiple files into the region table. Columns in the data file are delimited with a vertical bar, records are delimited by a new line, and null values are identified as NULL:

COPY region() VWLOAD FROM 'path/to/data/mytable_1.txt', 'path/to/data/mytable_2.txt' WITH INSERTMODE=BULK, FDELIM='|', RDELIM='\n', NULLVALUE='NULL', WORK_DIR = '/home/users/temp'

COPY Statement

The COPY statement is a flexible and efficient way to bulk load data from a data file into Vector. COPY can be run locally on the database server, or remotely from an SQL terminal where Net is available. Note that COPY cannot be run from a JDBC-based application because JDBC does not use Net to connect to the database.

The syntax for the COPY statement may be different than the syntax for other bulk loaders you are used to. Below is an example of using the COPY statement using the same table definition and data file as mentioned in vwload Utility.

Use the following COPY statement in an SQL terminal session:

COPY TABLE region (

r_regionkey = 'c0|',

r_name = 'c0|',

r_comment = 'c0nl'

)

FROM 'region.tbl' \g

In this example, 'c0|' represents a free format character string (versus fixed length) followed by a pipe delimiter, and 'c0nl' a free format character string followed by the newline at the end of the line.

The COPY statement also supports the commonly used comma or semicolon separated files that use double quote enclosures and the backslash as the escape character. For more information on the COPY statement, refer to the Vector SQL Language Guide.

At present COPY for Vector only supports date, time, and timestamp fields formatted using the ANSI standard:

• yyyy-mm-dd for date fields unless you use II_DATE_FORMAT to override the default

• hh24:mi:ss for time fields

• yyyy-mm-dd hh24:mi:ss.ffffff for timestamp fields

If the load fails for whatever reason, for example due to a bad record, then the default behavior is to stop loading and roll back the entire statement. To diagnose data load issues and identify the data load problem you should use the options the COPY statement provides in its WITH clause. Use ON_ERROR = CONTINUE and LOG = 'filename' to continue the load when hitting an issue, logging bad records to the file identified by 'filename'. Include ERROR_COUNT if you want to stop after a certain number of errors.

For example:

COPY TABLE region (

r_regionkey = 'c0|',

r_name = 'c0|',

r_comment = 'c0nl'

)

FROM 'region.tbl'

WITH ON_ERROR = CONTINUE

, LOG = 'region_bad_records.log' \g

SparkSQL

The Spark-Vector Connector lets you interact with a Vector database using Apache Spark.

For more information on the Spark-Vector Connector, see https://github.com/ActianCorp/spark-vector/tree/master.

To load data using Spark SQL

1. Create a Vector table using Spark SQL:

sqlContext.sql("""CREATE TEMPORARY TABLE vector_table

USING com.actian.spark_vector.sql.DefaultSource

OPTIONS (

host "vectorhost",

instance "VI",

database "databasename",

table "vector_table"

)""")

2. Load data into Vector:

sqlContext.sql("INSERT INTO vector_table SELECT * FROM spark_table")

INSERT...SELECT

Vector supports INSERT...SELECT between Ingres and Vector tables (bidirectionally). If you start with data in Ingres tables then INSERT...SELECT (or CREATE TABLE...AS SELECT) is an effective way to populate tables in your Vector database.

Batch Inserts

Vector supports batch statements through ODBC, JDBC, and .NET. Through the batch interface, inserts are internally transformed into efficient data appends. From a data storage perspective applying a large transaction that (among others) consists of many inserts into the same table is as efficient through the batch interface as it is through an intermediate staging step on the file system followed by a bulk load using vwload or the COPY statement.

Third-party Tools

As an alternative to writing programs and scripts to unload data followed by using vwload or COPY to load the data into Vector tables, you can use third-party tools such as Pentaho Data Integration that support the Vector "bulk" load operation. Also check if third-party tools support batch operations through ODBC, JDBC, or .NET to achieve efficient data loads into Vector.

Last modified date: 06/28/2024