Load Sample Data Using the Vector CLI
Note: To create a database and load it with data in Vector Community Edition, which has a 100 GB size limit, you must first delete the provided ontimedb database. For more information, see
Remove the Sample Databases.
So far, you have been able to query data that was made available to you in the ontimedb database. In this exercise, you will create a database, create tables in that database, load data into Vector, and generate statistics. At the end of the exercise, you will have loaded the same amount of data (approximately 175 million rows) that you ran your queries against earlier.
Create a Database
In this exercise, you will create a database named demodb.
To create a database
Enter the following command:
createdb demodb
The demodb database is created.
Create a Table
In this exercise, you will create two tables in the demodb database: ontime and carriers. You will create the tables using copy-and-paste query commands.
To create the tables
1. If it it not already started, start the sql tool by entering the following command at the $ prompt:
sql demodb
2. Copy and paste the following SQL command at the * prompt:
create table ontime(
year integer not null,
quarter i1 not null,
month i1 not null,
dayofmonth i1 not null,
dayofweek i1 not null,
flightdate ansidate not null,
uniquecarrier char(7) not null,
airlineid integer not null,
carrier char(2) default NULL,
tailnum varchar(50) default NULL,
flightnum varchar(10) not null,
originairportid integer default NULL,
originairportseqid integer default NULL,
origincitymarketid integer default NULL,
origin char(5) default NULL,
origincityname varchar(35) not null,
originstate char(2) default NULL,
originstatefips varchar(10) default NULL,
originstatename varchar(46) default NULL,
originwac integer default NULL,
destairportid integer default NULL,
destairportseqid integer default NULL,
destcitymarketid integer default NULL,
dest char(5) default NULL,
destcityname varchar(35) not null,
deststate char(2) default NULL,
deststatefips varchar(10) default NULL,
deststatename varchar(46) default NULL,
destwac integer default NULL,
crsdeptime integer default NULL,
deptime integer default NULL,
depdelay integer default NULL,
depdelayminutes integer default NULL,
depdel15 integer default NULL,
departuredelaygroups integer default NULL,
deptimeblk varchar(9) default NULL,
taxiout integer default NULL,
wheelsoff varchar(10) default NULL,
wheelson varchar(10) default NULL,
taxiin integer default NULL,
crsarrtime integer default NULL,
arrtime integer default NULL,
arrdelay integer default NULL,
arrdelayminutes integer default NULL,
arrdel15 integer default NULL,
arrivaldelaygroups integer default NULL,
arrtimeblk varchar(9) default NULL,
cancelled i1 default NULL,
cancellationcode char(1) default NULL,
diverted i1 default NULL,
crselapsedtime integer default NULL,
actualelapsedtime integer default NULL,
airtime integer default NULL,
flights integer default NULL,
distance integer default NULL,
distancegroup i1 default NULL,
carrierdelay integer default NULL,
weatherdelay integer default NULL,
nasdelay integer default NULL,
securitydelay integer default NULL,
lateaircraftdelay integer default NULL,
firstdeptime varchar(10) default NULL,
totaladdgtime varchar(10) default NULL,
longestaddgtime varchar(10) default NULL,
divairportlandings varchar(10) default NULL,
divreacheddest varchar(10) default NULL,
divactualelapsedtime varchar(10) default NULL,
divarrdelay varchar(10) default NULL,
divdistance varchar(10) default NULL,
div1airport varchar(10) default NULL,
div1airportid integer default NULL,
div1airportseqid integer default NULL,
div1wheelson varchar(10) default NULL,
div1totalgtime varchar(10) default NULL,
div1longestgtime varchar(10) default NULL,
div1wheelsoff varchar(10) default NULL,
div1tailnum varchar(10) default NULL,
div2airport varchar(10) default NULL,
div2airportid integer default NULL,
div2airportseqid integer default NULL,
div2wheelson varchar(10) default NULL,
div2totalgtime varchar(10) default NULL,
div2longestgtime varchar(10) default NULL,
div2wheelsoff varchar(10) default NULL,
div2tailnum varchar(10) default NULL,
div3airport varchar(10) default NULL,
div3airportid integer default NULL,
div3airportseqid integer default NULL,
div3wheelson varchar(10) default NULL,
div3totalgtime varchar(10) default NULL,
div3longestgtime varchar(10) default NULL,
div3wheelsoff varchar(10) default NULL,
div3tailnum varchar(10) default NULL,
div4airport varchar(10) default NULL,
div4airportid integer default NULL,
div4airportseqid integer default NULL,
div4wheelson varchar(10) default NULL,
div4totalgtime varchar(10) default NULL,
div4longestgtime varchar(10) default NULL,
div4wheelsoff varchar(10) default NULL,
div4tailnum varchar(10) default NULL,
div5airport varchar(10) default NULL,
div5airportid integer default NULL,
div5airportseqid integer default NULL,
div5wheelson varchar(10) default NULL,
div5totalgtime varchar(10) default NULL,
div5longestgtime varchar(10) default NULL,
div5wheelsoff varchar(10) default NULL,
div5tailnum varchar(10) default NULL
)
\g
3. Press Enter.
The table is created.
4. Create the carriers table and insert data into it:
create table carriers(ccode char(2) collate ucs_basic, carrier char(25) collate ucs_basic )\g
INSERT INTO carriers VALUES ('AS','Alaska Airlines (AS)'), ('AA','American Airlines (AA)'), ('DL','Delta Air Lines (DL)'), ('EV','ExpressJet Airlines (EV)'), ('F9','Frontier Airlines (F9)'), ('HA','Hawaiian Airlines (HA)'), ('B6','JetBlue Airways (B6)'), ('OO','SkyWest Airlines (OO)'), ('WN','Southwest Airlines (WN)'), ('NK','Spirit Airlines (NK)'), ('UA','United Airlines (UA)'), ('VX','Virgin America (VX)')\g
5. Quit the terminal monitor: type \q and press Enter.
Your SQL statement(s) have been committed.
You may now load data into the ontime table.
Load Data
In this exercise, you will load airline flight data into the empty ontime table you created in the previous exercise.
The bulk loader command for Vector is called vwload. This command can load up to 500,000 rows per second, depending on the speed of your hard drive.
This exercise showcases a couple of vwload options to load part of the sample data and then points you to a handy script that will load all the data for you using one command.
The ontimefiles.zip file contains CSV files with airline flight data from 1987 to the present and must be unzipped before loading the data.
To unzip and load the data files
1. In the CLI window, change to the directory where ontimefiles.zip is located and unzip the file:
cd /Vector/sample_data
unzip ontimefiles.zip
Note: This could take a while; the uncompressed data is about 75 GB.
2. Load a single CSV file with vwload:
Copy and paste the following command to load a subset of airline data for January 1988 into the ontime table of the demodb database:
vwload -H -f , -q '"' -I -t ontime demodb On_Time_On_Time_Performance_1988_1.csv
The command line uses the following options:
-H
Indicates a header is used in the data files
-f
Specifies the comma field separator
-q
Specifies the quote character, enclosed within single quotes
-I
Specifies that one too many fields in the data files should be ignored
-t
Specifies the target table ontime, the demodb database name, and the data filenames
The vwload command is executed, and the data in the file On_Time_On_Time_Performance_1988_1.csv is loaded into the ontime table of demodb.
3. Load multiple CSV files with vwload:
To speed data loading, engage multiple cores on your machine using the -c option, which enables parallelization. To show how this works, load all data files for the year 1989. Copy and paste the following command to load the data files into the ontime table of the demodb database:
vwload -H -f , -q '"' -I -c -t ontime demodb On_Time_On_Time_Performance_1989_*.csv
The vwload command is executed, and the data in the 12 files for 1989 is loaded into the ontime table of demodb.
4. Delete the data loaded in previous steps to prepare for the rest of this exercise. Enter the following command:
echo "delete from ontime\g" | sql demodb
5. Verify that the ontime table is empty:
echo "select count(*) from ontime\g" | sql demodb
Now that you know how to load single and multiple files, you can use the parallel load option to load the complete 75-GB data set into Vector.
6. Load the complete data set:
The loadall.sh script is included in ontimefiles.zip that will load all the years of data using the -c option. Run the load script:
./loadall.sh demodb
The entire data set of about 175 million rows is loaded into Vector.
Note: This can take up to 15 minutes, depending on your machine type. The more cores you have, the faster the data will load. Generally, when using the -c option, the number of input files to the vwload command should equal the number of physical cores (not virtual cores) on the machine.
To display the number of rows in the ontime table
1. At the $ prompt, enter:
echo "select count(*) from ontime\g" | sql demodb
2. Verify that the table contains over 175 million rows.
You may now optimize the data.
Optimize the Data by Generating Statistics
Vector has a sophisticated cost-based optimizer that uses statistics to optimize queries. We recommend that you collect statistics upon initially loading data and after any subsequent changes to the data.
There are several ways to collect statistics, but for this exercise you will use the command-line utility, optimizedb.
To generate statistics
Enter the following at the $ prompt:
optimizedb demodb
Using vector processing, statistics are generated for both the ontime and carriers tables.
For more information about optimizedb, see
optimizedb command.
Run More Queries
After loading and optimizing data, run some of the queries against demodb using Director or the Vector CLI. See
More Sample Queries.
Check execution time to see how fast they run.
Note: Vector will automatically tune itself to the number of cores and memory on the machine and will use more resources where it can to scale. You may shut down the EC2 instance and resize it to a larger number of cores and check the query times again.