How Much Memory Do I Need?

To get the most value out of your investment it is extremely important to size memory appropriately for a Vector deployment. Insufficient memory allocated to Vector may force you to enable spill-to-disk for various operations, in which case those queries that spill to disk may see significant performance degradation over those that do not.

Your server must have enough memory to support the Operating System (OS), Vector and any other applications running on the server. While the OS can deal with insufficient memory by swapping memory to disk, this must be avoided as it will cause a severe slowdown of the system and result in poor and unpredictable performance.

Vector uses two memory pools:

• Execution memory: the execution memory pool should be large enough to support concurrent hash joins, aggregations and sorts against your biggest tables to prevent large queries from spilling to disk. Also, if you plan to define secondary indexes prepare for the unique entries in the indexed data to reside uncompressed in execution memory at all times plus an overhead for every column value.

These memory pools are assigned per database and cannot be shared between databases. If you plan to run multiple databases concurrently then you must ensure to divide the available resources--memory in this case--between the databases that will run concurrently and make sure that the total memory consumption does not exceed the total amount of available RAM. Consider using multiple schemas instead of multiple databases to allow maximum memory resource utilization across multiple applications.

To identify how much memory is needed, identify how much memory you need for every pool (as discussed below), per database if you plan to run multiple databases concurrently, then add at least 2 GB for the OS and memory for any other applications you plan to run on the same server.

To simplify system sizing and to achieve the most predictable performance from Vector, you should dedicate your server to run Vector.

Realistically, an environment to run Vector for more than very basic testing should be configured with at least 32 GB of RAM.

Considering that multiple factors impact memory consumption, including the data size that is frequently accessed, the complexity of the queries, and query concurrency, there is no simple formula to determine the total amount of memory that will work well for your application. For an extremely rough estimate, consider the uncompressed total amount of source data you store in Vector divided by 10. For example, for a database that stores 5 TB of source data start by planning for 512 GB of RAM and adjust as necessary. Smaller databases will likely need more memory than one tenth of the source data size.

There is no simple formula that determines the amount of execution memory needed for a query. The execution memory should be large enough to support query operations against your biggest tables and must take into account concurrent query execution. In addition, if you plan to create secondary indexes then you must take the memory consumption of these indexes into consideration for sizing execution memory. Unique entries of indexed column values will reside uncompressed in execution memory, and every unique entry in the index has an additional fixed 21-byte overhead for the first occurrence with an 8-byte overhead for every subsequently repeating value.

Table joins, aggregations, and sorts, whether implicit--for example, as a result of a query with the keyword DISTINCT or queries with UNION (not UNION ALL)--or explicit, can use lots of memory when operating against large data sets.

For a large hash join the smallest side of the join should fit in memory. For example, if a query retrieves on average 30 bytes per row for the smallest side of the join and the query retrieves 100 million rows, then to complete a hash join the query will use:

For Vector to correctly identify which side of the join is the smallest (after applying filter criteria) it is important that table statistics are reliable. For more information about collecting and maintaining statistics, see the Vector User Guide.

For a large aggregation the entire (intermediate) result set should fit in memory. For example, if a query retrieves 50 bytes per row out of a 1 billion rows table (with no filter criteria), and the aggregation reduces the cardinality by a factor of 20, then the query will use:

To achieve the fastest query response times Vector does not compress data in execution memory.

Multiple queries running concurrently will each consume memory. Consider query concurrency and the workload when determining the total amount of memory required for execution memory. Also, parallel queries generally use more memory than queries that do not run in parallel.

A query will consume as much memory as it can. If a query needs more memory than is available, it will return an out-of-memory error unless you enabled spill-to-disk operations. Spill-to-disk for hash joins and aggregations is disabled by default but can be enabled in the vectorwise.conf parameter file. Spill-to-disk for sorts is always enabled and cannot be turned off. Unless configured otherwise, per database Vector allocates 50% of your physical memory to execution memory by default.

The first question to ask when you want to determine how much CBM memory you need is whether you want the data set to be able to fit in memory. I/O performance is very important to achieve maximum query performance and using main memory to store the data (in addition to disk for high availability and recovery reasons) can avoid a potential data storage layer bottleneck.

By default, Vector keeps data compressed in CBM memory. Maximum performance can be achieved when data is stored uncompressed in memory--as described in Database Configuration but this will increase the memory size requirement.

If your data set is significantly greater than your planned CBM memory size, then consider the size of the frequently accessed data, which should ideally fit into CBM memory. Vector performs smart buffer management to maximize the throughput across multiple operations when data retrieval from disk is required.

Unless configured otherwise, per database Vector will configure 25% of your available memory to CBM memory by default.

Vector is available on 64-bit Linux and Windows operating systems. You will experience dramatic query performance improvements for data analysis and reporting workloads on either platform compared to other relational databases using the same hardware.

If you use substantial amounts of memory for a single operation (greater than 10 GB) then you will likely experience better performance using Linux compared to Windows on the same system due to the more efficient large memory management capabilities of Linux.