Running the Performance Test Kit
The performance testing kit is based on the well-known DBT-3 benchmark tests on retail sales data. These tests are based on eight tables (customer, lineitem, nation, orders, partsupp, part, region, and supplier) and include several sample queries. This setup loads the data for the test and runs a sample set of queries.
The test kit has been scripted and can simply be executed as follows:
1. Install git (for example, through yum install -y git).
2. Clone the test kit to your machine:
git clone https://github.com/ActianCorp/VectorH-DBT3-Scripts.git
3. Run the test kit:
cd VectorH-DBT3-Scripts; sh load-run-dbt3-benchmark.sh
The above will run the data generator for the scale factor 1 tests (1 GB of data). The data files are generated in the current directory and have the extension .tbl (for example, customer.tbl). In addition to the files themselves, the generator also creates symbolic links in /tmp that point to each data file. If you need to generate the files in a different location, copy dbgen and dists.dss to that location and run dbgen from there.
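For example, to generate the files on a second, larger filesystem (the /data1/dbt3 path below is purely illustrative, and the commands assume dbgen and dists.dss are in the current directory; substitute a location with enough free space):
mkdir -p /data1/dbt3
cp dbgen dists.dss /data1/dbt3/
cd /data1/dbt3; ./dbgen -s 1    # scale factor 1 (approximately 1 GB of data)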
To generate larger data volumes, include the -s switch (for example, dbgen -s 10 will create 10 GB of data). If disk space is limited, the files can be generated in sections (using -S) or one table at a time (using -T). For more information on these and other command line switches, run dbgen -h or visit the web (for example, at https://github.com/electrum/tpch-dbgen).
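For example, the following would generate just the lineitem table for scale factor 10 in four separate sections. The -C switch sets the number of chunks and -S selects which chunk to build; the table letters (such as L for lineitem) and the availability of -C can be confirmed with dbgen -h for the version you build.
./dbgen -s 10 -T L -C 4 -S 1    # chunk 1 of 4, written as lineitem.tbl.1
./dbgen -s 10 -T L -C 4 -S 2
./dbgen -s 10 -T L -C 4 -S 3
./dbgen -s 10 -T L -C 4 -S 4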
Note: In the supplied scripts, the part table is not partitioned, which is suitable for scale factors up to 100. However, if you plan to test with scale factors beyond 100, partition this table by the partkey column.
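As an illustrative sketch only, a partitioned definition might look like the following. The database name dbt3db and the partition count are placeholders, and the column types follow the standard TPC-H part schema; adjust all of these to match the DDL actually used by the kit's scripts.
sql dbt3db <<'EOF'
CREATE TABLE part (
    p_partkey     INTEGER NOT NULL,
    p_name        VARCHAR(55) NOT NULL,
    p_mfgr        CHAR(25) NOT NULL,
    p_brand       CHAR(10) NOT NULL,
    p_type        VARCHAR(25) NOT NULL,
    p_size        INTEGER NOT NULL,
    p_container   CHAR(10) NOT NULL,
    p_retailprice DECIMAL(18,2) NOT NULL,
    p_comment     VARCHAR(23) NOT NULL
) WITH PARTITION = (HASH ON p_partkey 96 PARTITIONS);
\g
EOF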
Example Results
The following results were taken from a cluster with the following specification (the timings are taken after the execution engine has started, as described in Stopping and Starting a Database).
• 4 Nodes
• Per Node: 2 x 6-core CPUs @ 2.8 GHz, 288 GB RAM, 16 x 10k SAS drives, 10 GbE
• All Actian default settings
Results Calibration
To add some context to the above results, it may be worth measuring basic cluster performance in the environment where your tests are run and comparing it against the figures below. Two tests are given here: one using the Hadoop TestDFSIO utility and a second measuring network throughput with nc (netcat).
To use TestDFSIO, issue a command similar to the one below (the exact location of the utility may differ on your installation).
hadoop jar /usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 20 -fileSize 10
15/07/07 10:50:13 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
15/07/07 10:50:13 INFO fs.TestDFSIO: Date & time: Tue Jul 07 10:50:13 CDT 2015
15/07/07 10:50:13 INFO fs.TestDFSIO: Number of files: 20
15/07/07 10:50:13 INFO fs.TestDFSIO: Total MBytes processed: 200.0
15/07/07 10:50:13 INFO fs.TestDFSIO: Throughput mb/sec: 49.01960784313726
15/07/07 10:50:13 INFO fs.TestDFSIO: Average IO rate mb/sec: 54.8790283203125
15/07/07 10:50:13 INFO fs.TestDFSIO: IO rate std deviation: 15.338172504395319
15/07/07 10:50:13 INFO fs.TestDFSIO: Test exec time sec: 17.854
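After the write test completes, the same utility can also be used to measure read throughput against the files it just wrote, and then to remove its working files (same jar path as in the example above; adjust it to your installation):
hadoop jar /usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 20 -fileSize 10
hadoop jar /usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean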
To test network throughput, use the nc (netcat) and time Linux commands to measure throughput on the network interface being used by VectorH (as described in Using the Correct Ethernet Connection). The example below copies the lineitem data file for the SF100 tests from one host to another using the interface with IP address 10.0.0.2.
On the receiving host (in this case pfc11-13), type:
nc -l 1234 > /tmp/lineitem.tbl # May need to use a different port if 1234 is in use
On the sending host, type:
time nc -s 10.0.0.2 pfc11-13 1234 < /tmp/lineitem.tbl
real 5m25.776s
user 0m2.684s
sys 1m14.204s
The file is 74 GB and the transfer takes 325 seconds in this environment, giving an overall transfer rate of 231 MB/s.
If you do not have the required access to run the nc command, a similar test can be run with scp using the BindAddress parameter. The scp command is generally slower; in this environment the transfer took 396 seconds.
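A minimal sketch of the equivalent scp transfer, reusing the file and hosts from the nc example above, is:
time scp -o BindAddress=10.0.0.2 /tmp/lineitem.tbl pfc11-13:/tmp/lineitem.tbl    # BindAddress forces the copy onto the 10.0.0.2 interface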