Usage Considerations
Considerations for Using VectorH
Using VectorH is like using Vector on a single node, but there are some subtle differences.
VectorH Configuration Considerations
The list of cluster nodes is written to a configuration file, which is generated during installation: $II_SYSTEM/ingres/files/hdfs/slaves.
There is a separate vectorwise.conf per node. Some configuration parameters, however, are relevant for the master node only, while others are relevant for all nodes (see the example below).
• Query execution memory size (max_memory_size) and bufferpool size (bufferpool_size) are set per node.
• num_cores is relevant on the master node only but is set to the number of cores per node. An equal number of cores on each node is assumed.
• max_parallelism_level is relevant on the master node only but is set to the total level of parallelism throughout the cluster. A recommended value is the total number of physical cores.
Most options are set correctly automatically.
Note: VectorH assumes a homogeneous cluster; that is, all nodes participating in VectorH query execution have the same amount of memory, the same number of cores, and the same execution speed.
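For example, on a 4-node cluster with 16 cores and 64 GB of RAM per node, the relevant vectorwise.conf entries might look like the sketch below. The parameter names are the ones described above; the section names, comment style, and values are illustrative assumptions, so verify them against the vectorwise.conf generated by your installation.

[memory]
max_memory_size=40G       # per node: query execution memory
bufferpool_size=12G       # per node: buffer pool size

[engine]
num_cores=16              # master node only: cores per node
max_parallelism_level=64  # master node only: total physical cores, (4*16)=64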
VectorH Storage Location Considerations
Storage considerations are as follows:
• Vector data files should be stored on an HDFS location, separate from the default storage location, II_DATABASE. An II_HDFSDATA location is created by the installation process for this purpose.
• Backup files should be stored on an HDFS location, separate from the default backup location, II_CHECKPOINT. An II_HDFSBACKUP location is created by the installation process for this purpose.
• Work locations can be on local or HDFS storage. A local storage path is created on each node; an II_HDFSWORK location is also created by the installation process for this purpose.
• The database write-ahead log (wal) directory is stored on HDFS and is accessible by all nodes.
• Data is distributed and replicated over the nodes by HDFS, influenced by the VectorH partitioning scheme.
• On the master node, the storage considerations are identical to those of Vector (standard Vector data paths).
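For illustration, assuming the location names created by the installer, a table can be placed explicitly on the HDFS data location with a standard WITH LOCATION clause. Whether an explicit clause is needed depends on how the database was created, since II_HDFSDATA is typically already the data location in a VectorH installation; the table and column names here are hypothetical:

CREATE TABLE sales (id INTEGER, amount DECIMAL(10,2)) WITH LOCATION = (ii_hdfsdata)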
VectorH Partitioning Guidelines
Large tables should be partitioned to ensure distributed query execution. Queries on non-partitioned tables are executed on the master node only. Queries on partitioned tables, including queries that join partitioned with non-partitioned tables, are executed on all nodes.
Partitioning is an essential part of the VectorH analytical query execution. Tables are partitioned using a hash-based algorithm.
The partition key and number of partitions must be defined manually using the following guidelines:
• We recommend at most one partition per core, evenly balanced across all nodes (including the master node), scaling this number down when the number of columns or nodes is large. The typical guideline is: total number of cores divided by 4. For example, for 4 nodes with 16 cores each, 16 partitions are recommended: (4*16)/4 = 16.
Note: In preconfigured cloud environments, a default number of partitions is determined according to cluster topology.
• Tables should be partitioned on a reasonably unique key, for example, primary key, foreign key, or a combination. A partition key can be a compound key of multiple columns. The key should be reasonably unique to avoid data skew but does not have to be fully unique. Query execution will benefit greatly if tables are partitioned on join keys (for example, primary key for one table of the join and foreign key for the other).
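As an illustration of these guidelines, the following sketch creates a hypothetical orders table on the 4-node, 16-cores-per-node cluster from the example above, hashed on a reasonably unique key into 16 partitions (the table and column names are assumptions):

CREATE TABLE orders (
    orderkey INTEGER NOT NULL,
    custkey INTEGER NOT NULL,
    orderdate ANSIDATE,
    totalprice DECIMAL(15,2)
) WITH PARTITION = (HASH ON orderkey 16 PARTITIONS)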
How Partitioning Is Handled
VectorH handles partitioning as follows:
• A table consists of one logical table with multiple physical tables underneath.
• Each partition is assigned to a single responsible node. All data of a specific partition is stored locally (albeit in HDFS) by the responsible node, and query execution for this partition is moved to this node as much as possible.
• Data is distributed by hashing on one or more key columns.
• In-memory updates are handled and stored by the responsible node.
• Non-partitioned tables are stored by the master node and replicated by HDFS. All nodes can read them. Non-partitioned storage is a good fit for smaller tables, such as dimension tables in a star schema. I/O access will not be as efficient as for partitioned tables, because nodes will often need to read remote data, but small tables are likely to be kept in memory, so execution speed is not affected much.
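For example, in a hypothetical star schema the small dimension table is left non-partitioned while the large fact table is partitioned on the join (foreign) key, following the guidelines above (table and column names are illustrative):

-- Small dimension table: non-partitioned, stored by the master node
CREATE TABLE dim_product (productkey INTEGER NOT NULL, name VARCHAR(100));

-- Large fact table: partitioned on the foreign key used in joins
CREATE TABLE fact_sales (
    productkey INTEGER NOT NULL,
    saledate ANSIDATE,
    quantity INTEGER
) WITH PARTITION = (HASH ON productkey 16 PARTITIONS);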
Changing Partitioning
Tables should be repartitioned after adding or removing nodes to or from the VectorH cluster.
A table can be repartitioned by using the MODIFY...TO RECONSTRUCT WITH PARTITION statement:
MODIFY table TO RECONSTRUCT WITH PARTITION = (HASH ON key NN PARTITIONS)
A less elegant way to repartition is to recreate the table:
CREATE TABLE new_table AS SELECT * FROM old_table WITH PARTITION = (HASH ON key NN PARTITIONS)
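For example, if the 4-node cluster above grows to 8 nodes with 16 cores each, the guideline of total cores divided by 4 gives (8*16)/4 = 32 partitions, and the hypothetical orders table from the earlier example could be repartitioned as follows:

MODIFY orders TO RECONSTRUCT WITH PARTITION = (HASH ON orderkey 32 PARTITIONS)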
VectorH Data Loading Guidelines
Each node loads one or more input files. Each node stores data for one or more partitions. The maximum parallelism level used for reading and parsing the data is determined by the number of input files; the maximum parallelism level for generating compressed data blocks and writing those out is determined by the number of partitions.
Use these guidelines when loading data:
• Input files can be in HDFS, but do not have to be.
• Use the --cluster option on the vwload command if data is stored in HDFS.
• Large tables should be partitioned and will be stored across all nodes.
• Partition key and the number of partitions are specified when a table is created and should be determined by the guidelines listed previously.
• COPY FROM can also be used but is slower.
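A minimal sketch of a vwload invocation for HDFS-resident files follows; the database name and file paths are hypothetical, and the exact spelling of the table option may vary by release, so check the vwload reference for your version:

vwload --cluster --table orders mydb hdfs://namenode/data/orders1.csv hdfs://namenode/data/orders2.csv

Supplying multiple input files allows the read and parse phase to run in parallel, per the guideline above.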
VectorH Startup and Shutdown
Start VectorH on the master node only.
VectorH on the slave nodes is started automatically. The in-memory state of all the nodes is kept in sync.
Shut down VectorH on the master node. The slave nodes are shut down automatically.
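Because VectorH is built on the Ingres installation framework, startup and shutdown are typically performed with the standard commands, issued on the master node (a sketch; verify the commands for your release):

ingstart    # on the master node; slave nodes are started automatically
ingstop     # on the master node; slave nodes are shut down automatically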
VectorH Monitoring Considerations
Monitoring tools include the following:
• Actian Director
• Existing Hadoop and HDFS monitoring tools
• vectorwise.log and profile files
– Per node
– Stored in a work location or a local path that exists on all nodes.
• vwinfo utility
– Displays statistics of the master node only.
VectorH Backup and Recovery Considerations
Full and incremental backup and restore are supported.
A backup directory (II_HDFSBACKUP) is defined during installation. This location is used for both the full (checkpoint file) and incremental (journal files) backups.
Note: Commands for backup (ckpdb) and restore (rollforwarddb) must be issued on the master node.
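As a sketch, assuming a database named mydb, a full backup followed by a restore might look like the following; the +j flag (keep journaling enabled so that incremental backups are possible) reflects standard Ingres behavior, but verify the flags for your release:

ckpdb +j mydb        # full backup (checkpoint); journaling enabled for incremental backups
rollforwarddb mydb   # restore from the last checkpoint and apply journal files

Both commands must be run on the master node, as noted above.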
VectorH Management Tool
Actian Director can be used to manage your VectorH installation, including the master node and all slave nodes.
Actian Director 1.3 or above supports VectorH.
VectorH Limitations
VectorH has the following limitations:
• Adding and removing nodes dynamically, and using more or fewer cores per node, are not supported. Nodes can be added or removed with manual intervention; repartitioning may then be needed.
• Proper partitioning is essential because non-local queries, especially joins, are expensive.
• Deleted space is reclaimed, but only if all data of a horizontal split (by default, 1 GB) is deleted. So, in the worst case no space is reclaimed, and in the best case all space is reclaimed. This applies only to data deleted through a DELETE statement, not to data deleted through DROP TABLE, MODIFY...TO TRUNCATED, or MODIFY...TO COMBINE operations, where space is always reclaimed.
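For example, space freed by a DELETE is reclaimed only for splits whose data is entirely deleted, whereas emptying a table through MODIFY always reclaims its space. A sketch, using hypothetical table names:

DELETE FROM orders WHERE orderdate < '2010-01-01';
-- space is reclaimed only for splits that become completely empty

MODIFY orders_archive TO TRUNCATED;
-- removing all rows this way always reclaims the space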