Loading Data

• INSERT is an SQL statement that is typically used for loading small volumes of data into VectorH tables.

• COPY is an SQL statement that can read and write file based data. The files can be located on the server or on the client.

• vwload is an external utility that can load file based data. As with COPY, files can be located on the client or on the server. The vwload utility also supports a cluster mode option that uses each DataNode, performing the load in a parallel fashion.

You can use SQL for the vwload cluster mode operation with the COPY table() VWLOAD FROM 'file1', 'file2',... syntax.

• The ODBC, JDBC, and .NET drivers support the high performance API calls such as addBatch().

• The Spark-Vector Connector lets you load data using Apache Spark. For more information on the Spark-Vector Connector, see https://github.com/ActianCorp/spark-vector/tree/master.

For examples of COPY and vwload, see the SQL Language Guide and the User Guide.

VectorH relies on reasonably up to date statistics to correctly optimize queries.

Statistics can be generated by:

The simplest approach is to generate statistics for all columns and all tables that the current user owns, which can be done as follows:

This will take some time for a large schema, so you can apply fine tuning. To speed up the operation and reduce the amount of statistics data generated, generate statistics only for columns involved in joins, restrictions, or grouping, using the ‑r -a optimizedb flags or table/column listing with CREATE STATISTICS.

On server startup, histograms are automatically generated on all columns that appear in WHERE clauses and do not already have a histogram stored in the catalog. A typical use strategy is to create histograms (with optimizedb or CREATE STATISTICS) for columns whose distribution does not change, and then let Vector generate the new histograms on the other, more dynamic, columns for every server cycle.

Statistics can be generated on a table-by-table (or even individual column) basis. In addition the CREATE STATISTICS command can create column correlation statistics using the WITH COLUMN GROUP clause.

Statistics should be generated after every significant change to the distribution of data in a database, including on first loading the database, or on subsequent data loads that change the distribution of key values. Out of date statistics is a common reason for queries not performing as expected, or for worsening performance as data is loaded over time.

For example, with the DBT-3 data set used in the performance test kit, query performance without statistics is much worse that performance after statistics have been created. Statistics can be deleted using:

After significant changes are made to a database schema, overflow can occur on internal system tables (often referred to as system catalogs). For optimum performance, these internal tables should be re-balanced after the schema is created.

Sysmod is the command-line utility that re-balances the system catalogs to ensure that they are optimized for rapid access. It should be run after significant changes to the database schema, such as creating the schema for the first time or running optimizedb for the first time. It can be executed as follows: