VectorH Partitioning Guidelines

5. Usage Considerations : Considerations for Using VectorH : VectorH Partitioning Guidelines

Share this page

Large tables should be partitioned to ensure distributed query execution. Queries on non-partitioned tables are executed on the master node only. Queries on partitioned tables or on partitioned tables joining with non-partitioned tables are executed on all nodes.

Partitioning is an essential part of the VectorH analytical query execution. Tables are partitioned using a hash-based algorithm.

The partition key and number of partitions must be defined manually using the following guidelines:

• We recommend at most one partition per core, evenly balanced between all nodes (including the master node), scaling down this number when the number of columns or nodes is large. The typical guideline is: Number of cores divided by 4. So for 4 nodes with 16 cores each, 16 partitions are recommended. That is, (4*16)/4=16.

• Tables should be partitioned on a reasonably unique key, for example, primary key, foreign key, or a combination. A partition key can be a compound key of multiple columns. The key should be reasonably unique to avoid data skew, but does not have to be fully unique. Query execution will benefit greatly if tables are partitioned on join keys (for example, primary key for one table of the join and foreign key for the other).

How Partitioning Is Handled

VectorH handles partitioning as follows:

• A table consists of one logical table with multiple physical tables underneath.

• One node is responsible for a particular partition. All data of a specific partition will be locally stored (albeit in HDFS) by the responsible node. Query execution for this partition will be moved to this node, as much as possible.

• Data is distributed by hashing on one or more key columns.

• In-memory updates are handled and stored by the responsible node.

• Non-partitioned tables are stored by the master node and replicated by HDFS. All nodes can read them. Non-partitioned tables are good for smaller tables, for example, dimension tables in a star schema. I/O access will not be as efficient as partitioned tables because nodes will often need to read remote data, but small tables will likely be kept in memory and so execution speed will not be affected much.