New Features in VectorH 5.0
VectorH 5.0 contains the following new features.
Database Administration
• External tables let you read from and write to data sources stored outside of Vector. The data source must be one that Apache Spark can read from and write to, such as files stored in HDFS in formats like Parquet, ORC, or CSV, or tables in external database systems. After the external table is defined with the CREATE EXTERNAL TABLE syntax, queries can be run directly against the external data (see the first sketch after this list).
• Distributed Write-Ahead Log: The single LOG file has been split into multiple files stored in the wal directory. This feature improves performance, especially for large data sizes, by using parallel writes and removing the need for the master node to send log data over the network. It also alleviates memory pressure, speeds up COMMIT processing, and improves startup times.
• Distributed indexes, which improve scalability because the master node no longer needs to maintain remote partitions' min-max indexes and join indexes in memory. This feature speeds up DML queries and improves startup times.
• Automatic histogram generation, so you no longer have to generate statistics manually for proper query execution. This feature gives you more flexibility in managing statistics. Histograms are automatically generated for any column that appears in a WHERE clause and does not already have a histogram stored in the catalog. The histograms are generated from sample data maintained in memory by the min-max indexes.
• Clonedb utility, which lets you copy a database from one Vector instance to another, for example, across installations, machines, or clusters. Clonedb can be used to clone a production database for testing purposes.
• A requirement to specify either WITH PARTITION=(...) or WITH NOPARTITION when creating a Vector table using CREATE TABLE or CREATE TABLE AS SELECT syntax (see the partitioning sketch after this list). During installation, the configuration parameter partition_spec_required in config.dat is set to vector, which makes the partitioning decision explicit, because partitioning is an essential performance strategy in VectorH.
• UUID data type and functions: UUID identifiers can be generated automatically when inserting data. A UUID can be used as a primary key or as a partition key to ensure that data is spread evenly across nodes (see the UUID sketch after this list).
• SET SERVER_TRACE and SET SESSION_TRACE statements allow tracing of all queries processed by the DBMS Server regardless of their source, whether issued interactively or through a JDBC, ODBC, or .NET connection (see the tracing sketch after this list).
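For example, an external table over order data stored in HDFS might be defined as follows. This is a minimal sketch: the table name, columns, HDFS path, and port are illustrative, and the clause form follows the USING SPARK ... WITH REFERENCE/FORMAT syntax of CREATE EXTERNAL TABLE.

    CREATE EXTERNAL TABLE ext_orders (
        order_id  INTEGER,
        amount    DECIMAL(10,2)
    ) USING SPARK
    WITH REFERENCE='hdfs://namenode:8020/data/orders.parquet',
         FORMAT='parquet';

    -- Once defined, query the external data directly:
    SELECT COUNT(*) FROM ext_orders;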
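Because the partitioning decision is now explicit, a typical table definition names its partitioning scheme. A sketch, with an illustrative table hash-partitioned on its key column (the partition count of 16 is also illustrative; it is normally chosen to match the cluster layout):

    CREATE TABLE sales (
        sale_id  INTEGER NOT NULL,
        region   VARCHAR(20),
        amount   DECIMAL(12,2)
    ) WITH PARTITION = (HASH ON sale_id 16 PARTITIONS);

    -- Small reference tables can opt out explicitly:
    CREATE TABLE lookup_codes (
        code   INTEGER,
        descr  VARCHAR(50)
    ) WITH NOPARTITION;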
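A sketch of the UUID type used as a partition key, assuming the UUID_CREATE() and UUID_TO_CHAR() functions (the table and columns are illustrative):

    CREATE TABLE events (
        event_id  UUID NOT NULL,
        payload   VARCHAR(100)
    ) WITH PARTITION = (HASH ON event_id 16 PARTITIONS);

    -- Generate the identifier at insert time:
    INSERT INTO events (event_id, payload)
        VALUES (UUID_CREATE(), 'first event');

    -- Render the identifier in readable form:
    SELECT UUID_TO_CHAR(event_id), payload FROM events;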
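A sketch of session tracing. The statement names are as documented above; the Ingres-style NO... form used here to turn tracing off is an assumption, so check the SET statement reference for the exact toggle syntax:

    SET SESSION_TRACE;      -- start tracing queries from this session
    SELECT COUNT(*) FROM sales;
    SET NOSESSION_TRACE;    -- assumed NO... form to stop tracing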
Data Import and Export
• The Spark-Vector Connector has been enhanced to provide parallel unload, which is useful for large data volumes.
• SQL syntax for parallel vwload (COPY table() VWLOAD FROM 'file1', 'file2',...) performs the same operation as running vwload -c from the command line. Using SQL means the vwload operation can be part of a bigger transaction, which avoids the overhead of committing separate transactions and writing to disk. This is especially useful when loading data that is then used to apply updates (see the first sketch after this list).
• SQL syntax for CSV export (INSERT INTO EXTERNAL CSV 'filename'...) writes a table to the local file system. The result is either a single CSV file or a collection of CSV files, depending on whether the query is run in parallel (see the export sketch after this list).
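For example, loading two staged files and applying a follow-up update in a single transaction might look like the following sketch (the table, file paths, and update are illustrative):

    COPY sales() VWLOAD FROM
        'hdfs://namenode:8020/staging/sales1.csv',
        'hdfs://namenode:8020/staging/sales2.csv';

    UPDATE sales SET amount = amount * 1.1 WHERE region = 'EMEA';
    COMMIT;    -- one commit covers both the load and the update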
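And a sketch of the export form (the output path is illustrative):

    INSERT INTO EXTERNAL CSV '/tmp/sales_export.csv'
        SELECT * FROM sales;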
Hadoop
• Detecting YARN resources at install time and dynamically adapting the VectorH configuration.
Security
• Documentation on using Hadoop security systems Apache Knox and Apache Ranger with VectorH.