Getting Started : 10. Creating a Database and Loading Data
 
Share this page                  
Creating a Database and Loading Data
Create a Database
After VectorH is installed you can create your databases.
First, you must log in to the master node as the actian user and source your environment.
To source the environment
Enter the following commands:
su - actian
source .ingVHsh
To create a database named dbtest
createdb dbtest
Viewing Database Information
After the database is created, you can use the infodb command to confirm that the database is using the HDFS locations, such as II_HDFSDATA as its VWROOT location.
infodb dbtest
 
==================Wed May 13 20:33:43 2015 Database Information=================
.
.
.
================================================================================
----Vectorwise directory--------------------------------------------------------
 
    Location         Flags             Physical_path
    ------------------------------------------------------------------
    ii_database      ROOT,DATA         /opt/Actian/VectorVH/ingres/data/vectorwise/dbtest
    ii_journal       JOURNAL           /opt/Actian/VectorVH/ingres/jnl/vectorwise/dbtest
    ii_checkpoint    CHECKPOINT        /opt/Actian/VectorVH/ingres/ckp/vectorwise/dbtest
    ii_hdfsbackup    CHECKPOINT,HDFS   hdfs://namenode:8020/Actian/VectorVH/ingres/ckp/vectorwise/dbtest
    ii_work          WORK              /opt/Actian/VectorVH/ingres/work/vectorwise/dbtest
    ii_hdfsdata      DATA,VWROOT,HDFS hdfs://namenode:8020/Actian/VectorVH/ingres/data/vectorwise/dbtest
    ii_hdfswork      WORK,HDFS hdfs://namenode:8020/Actian/VectorVH/ingres/work/vectorwise/dbtest
================================================================================
Connect to or Disconnect from a Database
To connect to a database and enter SQL commands, you can use the line-based Terminal Monitor or Actian Director.
(Actian Director is a graphical interface that lets you interact with Vector installations and perform tasks like importing data, querying data, and administering databases. You can download it from http://esd.actian.com/.)
To connect to the local database using the Terminal Monitor
1. Issue the sql command:
sql dbname
where dbname is the name of the database.
The Terminal Monitor starts.
2. Type your SQL statements at the Terminal Monitor prompt.
3. Enter \g to execute the statements.
To disconnect from the database and exit the Terminal Monitor
Enter \q.
For more information about using the Terminal Monitor, see the SQL Language Guide.
Create a Table
The CREATE TABLE statement by default creates a table with VECTORWISE storage.
In this example, we create the lineitem table from the DBT-3 benchmark.
To create the lineitem table
1. Connect to the database created in the previous procedure by issuing the sql command at the operating system prompt:
sql dbtest
The Terminal Monitor starts.
2. Enter the following SQL statement at the Terminal Monitor prompt:
CREATE TABLE lineitem (
        l_orderkey INTEGER NOT NULL,
        l_partkey INTEGER NOT NULL,
        l_suppkey INTEGER NOT NULL,
        l_linenumber INTEGER NOT NULL,
        l_quantity DECIMAL(2,0) NOT NULL,
        l_extendedprice DECIMAL(8,2) NOT NULL,
        l_discount DECIMAL(2,2) NOT NULL,
        l_tax DECIMAL(2,2) NOT NULL,
        l_returnflag CHAR(1) NOT NULL,
        l_linestatus CHAR(1) NOT NULL,
        l_shipdate ANSIDATE NOT NULL,
        l_commitdate ANSIDATE NOT NULL,
        l_receiptdate ANSIDATE NOT NULL,
        l_shipinstruct CHAR(25) NOT NULL,
        l_shipmode CHAR(10) NOT NULL,
        l_comment VARCHAR(44) NOT NULL)
WITH PARTITION = (HASH ON l_orderkey NN PARTITIONS)\g
In the WITH PARTITION clause, replace NN with the number of partitions for the table. For more information, see VectorH Partitioning Guidelines.
In Cloud environments, the number of partitions may be a preconfigured default.
VectorH Partitioning Guidelines
Large tables should be partitioned to ensure distributed query execution. Queries on non-partitioned tables are executed on the master node only. Queries on partitioned tables or on partitioned tables joining with non-partitioned tables are executed on all nodes.
Partitioning is an essential part of VectorH analytical query execution. Tables are partitioned using a hash-based algorithm.
The partition key and number of partitions can be defined using the following guidelines:
We recommend at most one partition per core, evenly balanced between all nodes (including the master node), scaling down this number when the number of columns or nodes is large. The typical guideline is: Number of cores divided by 4. So, for 4 nodes with 16 cores each, 16 partitions are recommended. That is, (4*16)/4=16.
Note:  In preconfigured cloud environments, a default number of partitions is determined according to cluster topology.
Tables should be partitioned on a reasonably unique key, for example, primary key, foreign key, or a combination. A partition key can be a compound key of multiple columns. The key should be reasonably unique to avoid data skew but does not have to be fully unique. Query execution will benefit greatly if tables are partitioned on join keys (for example, primary key for one table of the join and foreign key for the other).
Changing Partitioning
Tables should be repartitioned after adding or removing nodes to or from the VectorH cluster.
A table can be repartitioned by using the MODIFY...TO RECONSTRUCT WITH PARTITION statement:
MODIFY table TO RECONSTRUCT WITH PARTITION=(HASH ON key NN PARTITIONS)
Loading Data
After a table is created, you can load data into it. There are a variety of methods for loading data.
Load Data with vwload Utility
The vwload command can be used to load data into a Vector table. It is less flexible but easier to use than the COPY statement and often faster.
For example, the following command, entered at the operating system prompt, loads the data in the lineitem.txt file into the lineitem table of the dbt3 database. In the lineitem table, fields are delimited by | and records are delimited by a new line (\n).
vwload --fdelim "|" --rdelim "\n" --table lineitem dbt3 lineitem.txt
To further speed up data loads, vwload can load multiple data files in parallel into a single table. To enable distributed data loading, add the -c option (or --cluster) to the vwload command line and provide the full path to all input files. There must be multiple input files and all files must be accessible to all nodes in the same location. For example:
vwload -c --fdelim "|" --rdelim "\n" --table lineitem dbt3 hdfs://namenode:8020/path/to/data/lineitem_1.txt hdfs://namenode:8020/path/to/data/lineitem_2.txt . . .
The following example loads the customer table into the mydb database in parallel mode. The input files reside in Amazon S3 cloud storage. Columns in the data file are delimited with a comma, quoted strings are delimited with quotation marks ("), and records are delimited by a new line:
vwload -c --fdelim "," –q "\"" --rdelim "\n" --table customer mydb s3a://mys3bucket/path/to/data/customer1.csv s3a://mys3bucket/path/to/data/customer3.csv s3a://mys3bucket/path/to/data/customer3.csv
IMPORTANT!  Referencing your AWS credentials in the target URI from the command line exposes them. To keep your credentials safe, see the chapter “Securely Managing Amazon S3 Credentials” in the Security Guide.
Load Data with COPY VWLOAD Statement
The COPY...VWLOAD statement copies the contents of a file to a table using the VWLOAD operator. This operation is like using vwload --cluster on the command line, except that the load can be done from within a multi-statement transaction instead of in a separate session.
For example, bulk load the data from multiple files into mytable table. Columns in the data file are delimited with a vertical bar, records are delimited by a new line, and null values are identified as NULL:
COPY mytable() VWLOAD FROM 'hdfs://namenode:8020/path/to/data/mytable_1.txt', 'hdfs://namenode:8020/path/to/data/mytable_2.txt' WITH INSERTMODE=BULK, FDELIM='|', RDELIM='\n', NULLVALUE='NULL', WORK_DIR = '/home/users/temp'
Load Data with Spark SQL through the Spark-Vector Connector
The Spark-Vector Connector lets you interact with a Vector database using Apache Spark.
For more information on the Spark-Vector Connector, see https://github.com/ActianCorp/spark-vector/tree/master.
To load data using Spark SQL
1. Create a Vector table using Spark SQL:
sqlContext.sql("""CREATE TEMPORARY TABLE vector_table
USING com.actian.spark_vector.sql.DefaultSource
OPTIONS (
host "vectorhost",
instance "VI",
database "databasename",
table "vector_table"
)""")
2. Load data into Vector:
sqlContext.sql("INSERT INTO vector_table SELECT * FROM spark_table")
Statistics for the Query Optimizer
Vector uses a query optimizer to develop sophisticated query execution strategies. Statistics must be generated that tell the query optimizer what your data looks like. The query optimizer uses the statistics to generate a query execution plan (QEP) that shows how your query is executed. The QEP can be reused to execute the same query.
Vector automatically generates statistics after the server is started but you can control the generation of statistics by using the optimizedb command or CREATE STATISTIC statement.
To run optimizedb on the database dbtest
Issue the following command from the command line:
optimizedb -zns dbtest
The -zns flag speeds the building of the histogram.
For more information on the query optimizer, see the User Guide.
Optimize System Catalogs with sysmod
Operations such as create table, create view, and create statistics (also known as optimizedb) add entries to the system catalogs. After the initial set of tables, views, and statistics are created, you should run the sysmod command, which modifies the system tables to optimize catalog access.
You should also run sysmod after making substantial subsequent changes, such as creating many new tables, dropping many tables, or creating statistics on new tables or columns. After the set of tables, views, and statistics is stable, running sysmod is needed less often. We recommend, however, running sysmod on a periodic basis, say weekly, to prevent any possible degradation in query performance.
You can use sysmod at the command line or use the Modify System Tables dialog in Actian Director.
To optimize system catalogs for the empdata database
Enter the following command:
sysmod empdata
To modify only the system tables affected by optimizedb
sysmod empdata iistatistics iihistogram
Run a Reporting Query
This example reporting query (Q1) from the DBT-3 benchmark asks for the amount of business that was billed, shipped, and returned in the last quarter of 1998.
SELECT   l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
         sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc, count(*) as count_order
FROM     lineitem
WHERE    l_shipdate <= date '1998-12-01' - interval '90' day
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;\g
 
Executing . . .
 
 
 returnflag   linestatus   sum_qty      sum_base_price     sum_disc_price 
 A            F            1132676958   1698421699729.99   1613502554023.1240
 N            F            29581022     44353456023.61    42135855268.9589  
 N            O            2230403478   3344516955134.66   3177286529933.9344
 R            F            1132872903   1698719719123.67   1613789096673.0512
 
 sum_charge              avg_qty     avg_price    avg_dis    count_order
 1678049479312.961024    25.4        38236.375    0.050      44419004
 43821835543.370384      25.5        38270.478    0.049      1158947
 3304387310370.192384    25.4        38235.736    0.049      87470970
 1678345906004.148224    25.5        38243.421    0.049      44418612
 
(4 rows)