Creating a Database and Loading Data
Create a Database
After VectorH is installed, you can create databases.
First, log in to the master node as the actian user and source your environment.
To source the environment
Enter the following commands:
su - actian
source .ingVHsh
To create a database named dbtest
createdb dbtest
Viewing Database Information
After the database is created, you can use the infodb command to confirm that the database uses the HDFS locations (for example, ii_hdfsdata as its VWROOT location).
infodb dbtest
 
==================Wed May 13 20:33:43 2015 Database Information=================
.
.
.
================================================================================
----Vectorwise directory--------------------------------------------------------
 
    Location         Flags             Physical_path
    ------------------------------------------------------------------
    ii_database      ROOT,DATA         /opt/Actian/VectorVH/ingres/data/vectorwise/dbtest
    ii_journal       JOURNAL           /opt/Actian/VectorVH/ingres/jnl/vectorwise/dbtest
    ii_checkpoint    CHECKPOINT        /opt/Actian/VectorVH/ingres/ckp/vectorwise/dbtest
    ii_hdfsbackup    CHECKPOINT,HDFS   hdfs://namenode:8020/Actian/VectorVH/ingres/ckp/vectorwise/dbtest
    ii_work          WORK              /opt/Actian/VectorVH/ingres/work/vectorwise/dbtest
    ii_hdfsdata      DATA,VWROOT,HDFS  hdfs://namenode:8020/Actian/VectorVH/ingres/data/vectorwise/dbtest
    ii_hdfswork      WORK,HDFS         hdfs://namenode:8020/Actian/VectorVH/ingres/work/vectorwise/dbtest
================================================================================
Connect to or Disconnect from a Database
To connect to a database and enter SQL commands, you can use the line-based Terminal Monitor or Actian Director.
(Actian Director is a graphical interface that lets you interact with Vector installations and perform tasks like importing data, querying data, and administering databases. You can download it from http://esd.actian.com/.)
To connect to the local database using the Terminal Monitor
1. Issue the sql command:
sql dbname
where dbname is the name of the database.
The Terminal Monitor starts.
2. Type your SQL statements at the Terminal Monitor prompt.
3. Enter \g to execute the statements.
To disconnect from the database and exit the Terminal Monitor
Enter \q.
For more information about using the Terminal Monitor, see the SQL Language Guide.
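For example, a quick session that counts the tables visible in dbtest (the statement is illustrative; iitables is a standard catalog view):

```
sql dbtest
SELECT COUNT(*) FROM iitables\g
\q
```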
Create a Table
The CREATE TABLE statement by default creates a table with X100 storage.
In this example, we create the lineitem table from the DBT-3 benchmark.
To create the lineitem table
1. Connect to the database created in the previous procedure by issuing the sql command at the operating system prompt:
sql dbtest
The Terminal Monitor starts.
2. Enter the following SQL statement at the Terminal Monitor prompt:
CREATE TABLE lineitem (
        l_orderkey INTEGER NOT NULL,
        l_partkey INTEGER NOT NULL,
        l_suppkey INTEGER NOT NULL,
        l_linenumber INTEGER NOT NULL,
        l_quantity DECIMAL(2,0) NOT NULL,
        l_extendedprice DECIMAL(8,2) NOT NULL,
        l_discount DECIMAL(2,2) NOT NULL,
        l_tax DECIMAL(2,2) NOT NULL,
        l_returnflag CHAR(1) NOT NULL,
        l_linestatus CHAR(1) NOT NULL,
        l_shipdate ANSIDATE NOT NULL,
        l_commitdate ANSIDATE NOT NULL,
        l_receiptdate ANSIDATE NOT NULL,
        l_shipinstruct CHAR(25) NOT NULL,
        l_shipmode CHAR(10) NOT NULL,
        l_comment VARCHAR(44) NOT NULL)
WITH PARTITION = (HASH ON l_orderkey NN PARTITIONS)\g
In the WITH PARTITION clause, replace NN with the number of partitions for the table. For more information, see VectorH Partitioning Guidelines.
In Cloud environments, the number of partitions may be a preconfigured default.
VectorH Partitioning Guidelines
Large tables should be partitioned to ensure distributed query execution. Queries on non-partitioned tables are executed on the master node only. Queries on partitioned tables or on partitioned tables joining with non-partitioned tables are executed on all nodes.
Partitioning is an essential part of VectorH analytical query execution. Tables are partitioned using a hash-based algorithm.
The partition key and number of partitions can be defined using the following guidelines:
We recommend at most one partition per core, evenly balanced across all nodes (including the master node), scaling this number down when the number of columns or nodes is large. The typical guideline is the total number of cores divided by 4. For example, for 4 nodes with 16 cores each, 16 partitions are recommended: (4*16)/4 = 16.
Note:  In preconfigured cloud environments, a default number of partitions is determined according to cluster topology.
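The guideline above reduces to simple arithmetic. In this sketch, the node and core counts are hypothetical:

```shell
#!/bin/sh
# Hypothetical cluster: 4 nodes, 16 physical cores per node.
NODES=4
CORES_PER_NODE=16

# Guideline: total number of cores divided by 4.
PARTITIONS=$(( NODES * CORES_PER_NODE / 4 ))

echo "Recommended partitions: $PARTITIONS"
```

For this cluster the result is 16, matching the example above.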
Tables should be partitioned on a reasonably unique key, for example, primary key, foreign key, or a combination. A partition key can be a compound key of multiple columns. The key should be reasonably unique to avoid data skew but does not have to be fully unique. Query execution will benefit greatly if tables are partitioned on join keys (for example, primary key for one table of the join and foreign key for the other).
Changing Partitioning
Tables should be repartitioned after nodes are added to or removed from the VectorH cluster.
A table can be repartitioned by using the MODIFY...TO RECONSTRUCT WITH PARTITION statement:
MODIFY table TO RECONSTRUCT WITH PARTITION=(HASH ON key NN PARTITIONS)
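For example, to repartition the lineitem table created earlier (the partition count of 32 is illustrative, such as after doubling a cluster from 4 to 8 nodes):

```
MODIFY lineitem TO RECONSTRUCT WITH PARTITION=(HASH ON l_orderkey 32 PARTITIONS)\g
```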
Loading Data
After a table is created, you can load data into it. There are a variety of methods for loading data.
Load Data with vwload Utility
The vwload command can be used to load data into a Vector table. It is less flexible but easier to use than the COPY statement and often faster.
For example, the following command, entered at the operating system prompt, loads the data in the lineitem.txt file into the lineitem table of the dbt3 database. In the lineitem.txt file, fields are delimited by | and records are delimited by a newline (\n).
vwload --fdelim "|" --rdelim "\n" --table lineitem dbt3 lineitem.txt
To further speed up data loads, vwload can load multiple data files in parallel into a single table. To enable distributed data loading, add the -c option (or --cluster) to the vwload command line and provide the full path to all input files. There must be multiple input files and all files must be accessible to all nodes in the same location. For example:
vwload -c --fdelim "|" --rdelim "\n" --table lineitem dbt3 hdfs://namenode:8020/path/to/data/lineitem_1.txt hdfs://namenode:8020/path/to/data/lineitem_2.txt . . .
The following example loads the customer table into the mydb database in parallel mode. The input files reside in Amazon S3 cloud storage. Columns in the data file are delimited with a comma, quoted strings are delimited with quotation marks ("), and records are delimited by a new line:
vwload -c --fdelim "," -q "\"" --rdelim "\n" --table customer mydb s3a://mys3bucket/path/to/data/customer1.csv s3a://mys3bucket/path/to/data/customer2.csv s3a://mys3bucket/path/to/data/customer3.csv
IMPORTANT!  Referencing your AWS credentials in the target URI from the command line exposes them. To keep your credentials safe, see the chapter “Securely Managing Amazon S3 Credentials” in the Security Guide.
Load Data with COPY VWLOAD Statement
The COPY...VWLOAD statement copies the contents of a file to a table using the VWLOAD operator. This operation is like using vwload --cluster on the command line, except that the load can be done from within a multi-statement transaction instead of in a separate session.
For example, bulk load the data from multiple files into mytable table. Columns in the data file are delimited with a vertical bar, records are delimited by a new line, and null values are identified as NULL:
COPY mytable() VWLOAD FROM 'hdfs://namenode:8020/path/to/data/mytable_1.txt', 'hdfs://namenode:8020/path/to/data/mytable_2.txt' WITH INSERTMODE=BULK, FDELIM='|', RDELIM='\n', NULLVALUE='NULL', WORK_DIR = '/home/users/temp'
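Because COPY...VWLOAD runs inside the current transaction, it can be combined with other statements and committed (or rolled back) as a unit. A sketch, where the load_audit table and file path are hypothetical:

```
COPY mytable() VWLOAD FROM 'hdfs://namenode:8020/path/to/data/mytable_1.txt'
    WITH INSERTMODE=BULK, FDELIM='|', RDELIM='\n';
INSERT INTO load_audit VALUES ('mytable', CURRENT_TIMESTAMP);
COMMIT;
```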
Load Data with Spark SQL through the Spark-Vector Connector
The Spark-Vector Connector lets you interact with a Vector database using Apache Spark.
For more information on the Spark-Vector Connector, see https://github.com/ActianCorp/spark-vector/tree/master.
To load data using Spark SQL
1. Create a Vector table using Spark SQL:
sqlContext.sql("""CREATE TEMPORARY TABLE vector_table
USING com.actian.spark_vector.sql.DefaultSource
OPTIONS (
host "vectorhost",
instance "VI",
database "databasename",
table "vector_table"
)""")
2. Load data into Vector:
sqlContext.sql("INSERT INTO vector_table SELECT * FROM spark_table")
Statistics for the Query Optimizer
Vector uses a query optimizer to develop sophisticated query execution strategies. Statistics must be generated that tell the query optimizer what your data looks like. The query optimizer uses the statistics to generate a query execution plan (QEP) that shows how your query is executed. The QEP can be reused to execute the same query.
Vector automatically generates statistics after the server is started, but you can control the generation of statistics by using the optimizedb command or the CREATE STATISTICS statement.
To run optimizedb on the database dbtest
Issue the following command from the command line:
optimizedb -zns dbtest
The -zns flag speeds the building of the histogram.
For more information on the query optimizer, see the User Guide.
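As an alternative to optimizedb, statistics can be created from within SQL. A sketch for the lineitem table (the column choice is illustrative):

```
CREATE STATISTICS FOR lineitem (l_orderkey, l_shipdate)\g
```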
Optimize System Catalogs with sysmod
Operations such as CREATE TABLE, CREATE VIEW, and creating statistics (optimizedb) add entries to the system catalogs. After the initial set of tables, views, and statistics is created, you should run the sysmod command, which modifies the system tables to optimize catalog access.
You should also run sysmod after making substantial subsequent changes, such as creating many new tables, dropping many tables, or creating statistics on new tables or columns. After the set of tables, views, and statistics is stable, sysmod is needed less often. We recommend, however, running it periodically (for example, weekly) to prevent possible degradation in query performance.
You can use sysmod at the command line or use the Modify System Tables dialog in Actian Director.
To optimize system catalogs for the empdata database
Enter the following command:
sysmod empdata
To modify only the system tables affected by optimizedb
sysmod empdata iistatistics iihistogram
Run a Reporting Query
This example reporting query (Q1) from the DBT-3 benchmark asks for the amount of business that was billed, shipped, and returned in the last quarter of 1998.
SELECT   l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
         sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc, count(*) as count_order
FROM     lineitem
WHERE    l_shipdate <= date '1998-12-01' - interval '90' day
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;\g
 
Executing . . .
 
 
 returnflag   linestatus   sum_qty      sum_base_price     sum_disc_price 
 A            F            1132676958   1698421699729.99   1613502554023.1240
 N            F            29581022     44353456023.61    42135855268.9589  
 N            O            2230403478   3344516955134.66   3177286529933.9344
 R            F            1132872903   1698719719123.67   1613789096673.0512
 
 sum_charge              avg_qty     avg_price    avg_dis    count_order
 1678049479312.961024    25.4        38236.375    0.050      44419004
 43821835543.370384      25.5        38270.478    0.049      1158947
 3304387310370.192384    25.4        38235.736    0.049      87470970
 1678345906004.148224    25.5        38243.421    0.049      44418612
 
(4 rows)
Last modified date: 01/26/2023