Data Size Consideration for Some Queries

To get the fastest query response times, Vector by default does not spill intermediate results to disk for joins and aggregations. As a result, using the default settings, Vector may not be able to perform a requested operation for some types of queries due to high memory usage. The following scenarios are the most typical:

• Performing an aggregation that produces a high number of records. For example:
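
A query of the following shape illustrates this scenario. It is a sketch that assumes a TPC-H style "lineitem" table; the COUNT(*) aggregate and the line_count alias are illustrative choices, not taken from this guide.

```sql
-- Illustrative only: produces one output row per distinct l_orderkey,
-- so the aggregation must hold one group per distinct key in memory.
SELECT l_orderkey, COUNT(*) AS line_count
FROM lineitem
GROUP BY l_orderkey;
```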

If the number of distinct values of "l_orderkey" is high (tens of millions or more), the query may fail, depending on available memory.

Increasing available memory can be a partial solution for this problem. Also, indexing the "lineitem" table on "l_orderkey" lets Vector use a faster and more memory-efficient way of executing the aggregation.
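
As a minimal sketch of the indexing suggestion, the statement below creates an index on the grouping column. The index name lineitem_orderkey_idx is hypothetical, and any restrictions on CREATE INDEX (for example, on when an index can be created) should be checked against the SQL reference for your Vector version.

```sql
-- Hypothetical index name; indexes lineitem on the column used in GROUP BY.
CREATE INDEX lineitem_orderkey_idx ON lineitem (l_orderkey);
```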

Another option is to let aggregation save its temporary results on disk, allowing it to scale beyond available memory. This technique, known as "spilling to disk," can be enabled by setting the [engine] enable_aggregation_disk_spilling parameter to "true".

Note: Enabling spilling to disk lets the query complete but can degrade performance severely, possibly making the query orders of magnitude slower than when it runs entirely in memory, because disk access is slower than memory access. A faster disk subsystem can partially offset the slowdown.

• Performing a join between two large tables without any restrictions. For example:
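
A query of the following shape illustrates this scenario. It is a sketch that assumes TPC-H style "orders" and "lineitem" tables joined on the order key; the column names o_orderkey and l_linenumber are assumptions and are not named in this guide.

```sql
-- Illustrative only: no WHERE clause restricts either input,
-- so the join must process every row of both tables.
SELECT o.o_orderkey, l.l_linenumber
FROM orders o
JOIN lineitem l ON o.o_orderkey = l.l_orderkey;
```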

If the number of records in "orders" is high (tens of millions or more), the query may fail, depending on available memory.

Increasing available memory can be a partial solution for this problem. Also, indexing both the "lineitem" and "orders" tables on the join key lets Vector use a faster and more memory-efficient way of executing the join.

Another option is to let the join save its temporary hash table on disk, allowing it to scale beyond available memory. This technique, known as "spilling to disk," can be enabled by setting the [engine] enable_hashjoin_disk_spilling parameter to "true".
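
The two spilling parameters named in this section could be set as in the fragment below. This is a sketch that assumes both parameters are placed under the [engine] section of the Vector configuration file; the file name, its location, and whether a restart is required depend on the installation and are not stated here.

```
[engine]
enable_aggregation_disk_spilling=true
enable_hashjoin_disk_spilling=true
```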

• Performing a duplicate-eliminating UNION where the input relations have a high number of records. For example:
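
A query of the following shape illustrates this scenario. It is a sketch that assumes "orders1" and "orders2" have compatible column lists; SELECT * is used only for brevity.

```sql
-- Illustrative only: UNION (without ALL) must eliminate duplicate rows
-- across both inputs, which behaves like an aggregation over all columns.
SELECT * FROM orders1
UNION
SELECT * FROM orders2;
```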

If the number of records in "orders1" and "orders2" is high (tens of millions or more), the query may fail, depending on available memory. Increasing available memory can be a partial solution for this problem. Also, consider using UNION ALL if you know there are no duplicates or if you can accept duplicate entries in the result.
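
The UNION ALL alternative can be sketched as follows, under the same assumption that "orders1" and "orders2" have compatible column lists. Because no duplicate elimination takes place, a row that appears in both tables appears twice in the result.

```sql
-- Illustrative only: UNION ALL concatenates the inputs
-- without the duplicate-eliminating step.
SELECT * FROM orders1
UNION ALL
SELECT * FROM orders2;
```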

Because duplicate-elimination in UNION is a special form of aggregation, the previous discussion about spilling to disk for aggregation also applies to UNION.