Stage 4--Optimize the Database Schema
Optimizing the database schema for the best possible VectorH performance generally consists of working through the steps in the following Performance Considerations.
Performance Considerations
In addition to having suitable hardware, a correctly configured Hadoop cluster, and a correctly configured VectorH environment, there are some additional items that affect the performance of VectorH.
Table Partitioning
Follow the partitioning guidelines as described in Partitioning Tables.
Large tables that are not partitioned will exhibit significantly worse performance.
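For example, a large fact table can be hash partitioned on a join key at creation time. The following is a minimal sketch using the VectorH WITH PARTITION clause; the table, columns, and partition count are illustrative and should be chosen to suit your data and cluster:

    CREATE TABLE sales_fact (
        sale_id      BIGINT        NOT NULL,
        customer_id  INTEGER       NOT NULL,
        sale_date    ANSIDATE      NOT NULL,
        amount       DECIMAL(10,2) NOT NULL
    ) WITH PARTITION = (HASH ON customer_id 32 PARTITIONS);

Partitioning on a key that is frequently used in joins helps distribute work evenly across the nodes of the cluster.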
Query Statistics
Follow the guidelines on gathering statistics as described in Generating Statistics.
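For example, statistics can be generated with the CREATE STATISTICS statement (the optimizedb utility offers equivalent functionality); the table and column names here are illustrative:

    -- Generate optimizer statistics for an entire table
    CREATE STATISTICS FOR sales_fact;

    -- Or restrict generation to the columns used in joins and predicates
    CREATE STATISTICS FOR sales_fact (customer_id, sale_date);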
Ordered Data
VectorH makes use of storage tracking structures to record details of the range of values that may be held in a particular block. These structures are used at runtime to effect block elimination based on query predicates. This approach improves query performance generally and is particularly effective when dealing with ordered data.
Most systems have such a natural ordering of data, for example an “inserted date” or “effective date” field that matches the date when the data was created. Mimicking this ordering in the test system by loading the data in the appropriate order will give a better indication of query performance in the real system.
The Actian GitHub site has a tool to help determine whether the data is being stored on-disk in sorted order for a given column.
The simplest way to guarantee that data is stored in sorted order on disk is to use an index.
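As a sketch, when loading from a staging table the rows can be inserted in sorted order; whether a given load path preserves that order should be verified (for example, with the GitHub tool mentioned above), and the clustered index described under Indexing remains the guaranteed approach. Names here are illustrative:

    -- Insert rows in effective_date order so that on-disk blocks
    -- hold narrow, non-overlapping ranges of dates
    INSERT INTO sales_fact
        SELECT * FROM staging_sales
        ORDER BY effective_date;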
Indexing
Generally speaking, VectorH is an “index free” database management system. It does, however, support two types of index structure that can be useful in certain cases.
The default index structure (created by CREATE INDEX) generates a clustered structure, causing the table data to be stored in an ordered fashion. This type of index can be used to effect an ordering of data where there is no natural ordering, as described in Ordered Data.
A secondary index (created by CREATE SECONDARY INDEX) is an in-memory structure that has no disk footprint and is rebuilt from scratch when VectorH starts. A secondary index can speed up retrieval of small amounts of data from a medium-sized table; for example, if you have dimension tables with a few million rows and queries that select only a few of those rows, a secondary index can reduce retrieval time.
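Both statements are shown below as a minimal sketch; the index, table, and column names are illustrative:

    -- Clustered index: causes the table data to be stored ordered on sale_date
    CREATE INDEX sales_date_ix ON sales_fact (sale_date);

    -- Secondary index: in-memory lookup structure on a dimension table
    CREATE SECONDARY INDEX customer_name_ix ON customer_dim (customer_name);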
Use of these structures is not generally required and, unless they can be proven to improve performance, is discouraged.
Nulls
The use of nulls should be avoided wherever possible. Every column that is defined as WITH NULL entails an additional internal column being created to support the null indicator. This not only affects storage requirements but also has a detrimental impact on performance.
Nulls are often a symptom of poor database design. If missing values must be represented, however, consider using a value that would otherwise be out of range for the column instead of defining the column WITH NULL. For example, in a “number_of_dependents” column where valid values are >=0, using a negative number to represent NULL is a better option.
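The following sketch illustrates the sentinel approach; the table name and the choice of -1 as the “unknown” value are assumptions:

    CREATE TABLE employee (
        employee_id          INTEGER  NOT NULL,
        -- NOT NULL avoids the hidden null-indicator column;
        -- -1 is an out-of-range sentinel meaning "unknown"
        number_of_dependents INTEGER1 DEFAULT -1 NOT NULL
    );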
Using Foreign Key Specifications
Specifying primary and foreign key constraints provides the optimizer with more information about the structure of the data and in some cases can improve query performance.
Such constraints are typically enforced by VectorH generating secondary indexes, and these in turn may degrade performance. However, VectorH also provides a mechanism whereby such constraints can be declared as NOT ENFORCED. This type of constraint can often help third-party tools generate “better” SQL and in turn provide better query execution. As the name implies, however, such constraints are not enforced, meaning that the actual data may not conform to the constraint.
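For example, a foreign key can be declared as NOT ENFORCED so the optimizer can use it without the overhead of an enforcing secondary index; the constraint, table, and column names below are illustrative, and the data must be known to conform:

    ALTER TABLE sales_fact
        ADD CONSTRAINT sales_customer_fk
        FOREIGN KEY (customer_id) REFERENCES customer_dim (customer_id)
        NOT ENFORCED;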
Using Correct Data Types
VectorH is optimized to perform as much query processing in memory as possible. As such, choice of appropriate data types and length is an important factor in ensuring efficient execution. In general, choosing large character data types for columns that are used in table joins is less efficient than choosing numeric types.
In addition, choosing numeric types that are larger than they need to be will also affect performance, so do not simply choose the largest data type available when designing your schema; try to pick the smallest and most appropriate data type that will still hold all of your target data.
For example, we have often observed that customers who have a background in Oracle will choose large DECIMAL data types with a scale of zero (for example, DECIMAL(38,0)) when one of the INTEGER data types (1, 2, 4, or 8-bytes) is a better choice. This can further be improved by selecting the correct INTEGER length. An 8-byte INTEGER can store numbers in the range -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807; if storing the floor number in a building, a 1-byte INTEGER may be more than adequate (‑128 to +127)!
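The floor-number example can be sketched as follows (table and column names are illustrative):

    CREATE TABLE office (
        office_id    INTEGER NOT NULL,
        -- Avoid DECIMAL(38,0) here; a 1-byte integer (-128 to +127)
        -- easily covers any plausible floor number
        floor_number INTEGER1 NOT NULL
    );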
Query Profiling
If you have taken all of the above steps and still think that your queries could perform better, then there are two diagnostic steps that we typically recommend:
• Looking at the query execution plan
• Looking at the query profile
Proper interpretation of these cannot be covered in this guide, so please contact Actian Support for assistance.
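As a starting point, the query execution plan can be displayed from a terminal monitor session using the SET QEP statement before running the query of interest; the query shown is illustrative:

    -- Display the query execution plan for subsequent statements
    SET QEP;

    SELECT c.customer_name, SUM(s.amount)
    FROM   sales_fact s
    JOIN   customer_dim c ON s.customer_id = c.customer_id
    GROUP BY c.customer_name;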