Data Size Consideration for Some Queries
To get the fastest query response times, Vector by default does not spill intermediate results to disk for joins and aggregations. As a result, using the default settings, Vector may not be able to perform a requested operation for some types of queries due to high memory usage. The following scenarios are the most typical:
• Performing an aggregation that produces a high number of records. For example:
SELECT l_orderkey, COUNT(*)
FROM lineitem
GROUP BY l_orderkey
If the number of distinct values of "l_orderkey" is high (tens of millions or more), the query may fail, depending on available memory.
Increasing available memory can be a partial solution for this problem. Also, indexing the "lineitem" table on "l_orderkey" allows a faster and more memory-efficient solution to be used.
Another option is to let aggregation save its temporary results on disk, allowing it to scale beyond available memory. This technique, known as "spilling to disk," can be enabled by setting the [engine] enable_aggregation_disk_spilling parameter to "true".
Note: Enabling spilling to disk will let the query complete but may cause severe performance degradation, possibly even orders of magnitude slower than when the query is running in memory. This is because disk access is slower than memory access. Using faster disk subsystems can partially alleviate the slower performance.
• Performing a join between two large tables without any restrictions. For example:
SELECT o_orderdate, l_receiptdate
FROM lineitem, orders
WHERE l_orderkey = o_orderkey
If the number of records in "orders" is high (tens of millions or more), the query may fail, depending on available memory.
Increasing available memory can be a partial solution for this problem. Also, indexing both "lineitem" and "orders" tables on the join key allows a faster and more memory-efficient solution to be used.
Another option is to let the join save its temporary hash table on disk, allowing it to scale beyond available memory. This technique, known as "spilling to disk," can be enabled by setting the [engine] enable_hashjoin_disk_spilling parameter to "true".
Note: Enabling spilling to disk will let the query complete but may cause severe performance degradation, possibly even orders of magnitude slower than when the query is running in memory. This is because disk access is slower than memory access. Using faster disk subsystems can partially alleviate the slower performance.
• Performing a duplicate-eliminating UNION where the input relations have a high number of records. For example:
SELECT o_orderkey, o_orderdate
FROM orders1
UNION
SELECT o_orderkey, o_orderdate
FROM orders2
If the number of values in orders1 and orders2 is high (tens of millions or more), the query may fail, depending on available memory. Increasing available memory can be a partial solution for this problem. Also, you can consider using a UNION ALL approach if you know there are no duplicates, or if you accept duplicate entries.
Because duplicate-elimination in UNION is a special form of aggregation, the previous discussion about spilling to disk for aggregation also applies to UNION.
Last modified date: 11/09/2022