Sizing Disk Storage
The primary storage for your Vector database is disk storage. Your storage solution must satisfy performance and space requirements.
Note: Any component in the storage infrastructure may become a bottleneck, including individual drives, disk controllers, Host Bus Adapters (HBAs, if applicable), switches (if applicable), and so on.
I/O Performance Requirement
Depending on the query and the data, Vector can process data at more than 1.5 GB/s per CPU core. To achieve this rate, the CPU cores must be fed data fast enough to keep them busy. If data is compressed, which it is by default, then decompression consumes about 25% of the CPU core’s processing capacity, leaving about 75% for query processing.
Use the following formula to calculate the maximum required bandwidth:
maximum bandwidth per core = (core processing speed * 0.75) / compression ratio
Multiply this number by the number of cores in your system to calculate the total required disk throughput to drive the maximum bandwidth across all CPU cores.
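The calculation above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the 4x compression ratio and 16-core system are assumed example values, not recommendations.

```python
def bandwidth_per_core(core_speed_gb_s, compression_ratio, decompression_cost=0.25):
    """Disk bandwidth (GB/s) needed to keep one CPU core busy.

    decompression_cost is the fraction of the core spent on decompression
    (about 25% per the text above), so 1 - decompression_cost = 0.75.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return core_speed_gb_s * (1.0 - decompression_cost) / compression_ratio


def total_bandwidth(cores, core_speed_gb_s, compression_ratio, decompression_cost=0.25):
    """Total disk throughput (GB/s) needed to drive all cores at the maximum rate."""
    return cores * bandwidth_per_core(core_speed_gb_s, compression_ratio, decompression_cost)


if __name__ == "__main__":
    # Assumed example: 1.5 GB/s per core, 4x compression, 16 cores.
    per_core = bandwidth_per_core(1.5, 4.0)
    total = total_bandwidth(16, 1.5, 4.0)
    print(f"per core: {per_core} GB/s, total: {total} GB/s")
```

With these assumed inputs, each core needs about 0.28 GB/s from disk, and the 16-core system needs 4.5 GB/s in total.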
Such a high I/O consumption requirement is typically found with relatively simple queries such as single-table scans with aggregations. For more complex queries (for example, performing many joins against small memory-resident tables), the I/O consumption requirement may be significantly less.
The data compression ratio you achieve depends on the data types in your tables as well as the data. It is common to achieve 3x to 5x compression ratios, although both higher and lower compression ratios have been observed.
Maximum data processing throughput is achieved only if all cores are fully occupied. Also, relatively simple queries require more data throughput than computation-intensive queries. If you choose not to configure your system for absolute maximum performance, be aware that the system, when fully busy executing relatively simple queries, will likely become I/O-bound and not achieve maximum performance.
Spinning Disks
The primary development focus for spinning disks over the last several years has been on increasing the capacity of a single disk. Data transfer rates have improved only slightly. In fact, simply due to the laws of physics, larger spinning disks show greater variability in data transfer rates between data stored on the outer tracks and data stored on the inner tracks of the disk. Therefore, to achieve good, consistent performance, you should choose smaller rather than bigger disks. For example, choose 146 GB disks over 500 GB (or larger) disks.
Faster spinning disks at 15k RPM have higher throughput rates than slower 10k RPM or 7.2k RPM disks. In an ideal case, a single spinning 15k RPM disk can sustain up to 150 MB/s data transfer.
Solid State Drives (SSDs)
Solid state drive technology has matured in the past few years. This maturity, combined with the data layout flexibility in Vector, makes SSD technology a viable storage option for your system. For example, you may choose SSDs for temporary database storage to improve performance for spill-to-disk operations.
A single solid-state drive (SSD) can sustain 250 MB/s or more, independent of size. SSDs are still more expensive per storage unit than spinning disks, but generously sized SSDs are widely available. For maximum performance configurations, unless most of the frequently accessed data can reside in memory, SSDs are often the best available option to drive Vector’s data processing capacity using only internal storage, as they provide excellent bandwidth for their physical size and power consumption.
A special type of SSD is the PCI card, available, for example, from Fusion-IO. Throughput on these cards is limited only by the PCI channel, which allows a high 1.5 GB/s per card. If extremely high I/O throughput rates are required and the cost of the configuration is not a major concern, then consider evaluating such cards.
RAID Configuration
Unless your entire data set fits in memory, to fully drive the data processing capacity of Vector, you must have multiple drives working in parallel. To do this, use striping across multiple drives through RAID (Redundant Array of Inexpensive/Independent Disks) volumes. Use a large stripe size of 512 KB or multiples of 512 KB to optimize I/Os sent by the OS down to disk.
Hardware controllers have a limited throughput capacity that may limit the I/O throughput below the level you need for Vector. Therefore, consider software-based RAID with multiple non-RAID controllers to avoid a bottleneck at a single hardware RAID controller if many fast drives are included in the RAID configuration (for example, more than 8 SSDs).
Also consider choosing a RAID configuration that protects against a drive failure (unless you use a different approach to achieve high availability); RAID5 or RAID6 setups are typical RAID configurations that provide a good trade-off between storage overhead, performance, and availability.
Storage Requirement
The minimum number of disks is determined by the amount of storage space required and the I/O throughput requirement. You will also likely need storage space for a data staging area on a fast RAID array to load data into Vector. Also, be sure to consider the expected database growth, along with the challenge of extending the file system when you need more storage space for your database.
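As a rough illustration of how the two requirements combine, the following Python sketch estimates a minimum drive count. It is a simplification under stated assumptions: the function name is hypothetical, per-drive throughput is assumed to scale linearly across a striped volume, RAID5 and RAID6 are assumed to cost one and two drives of capacity respectively, and the 4500 MB/s throughput target, 2000 GB capacity, and 500 GB SSD size are assumed example values. Controller limits, staging space, and growth are not modeled.

```python
import math

# Drives' worth of capacity consumed by parity (assumption: RAID0 = no protection).
PARITY_DRIVES = {"raid0": 0, "raid5": 1, "raid6": 2}


def minimum_drives(required_mb_s, required_gb, drive_mb_s, drive_gb, raid_level="raid5"):
    """Estimate the minimum number of drives for a throughput and capacity target."""
    if raid_level not in PARITY_DRIVES:
        raise ValueError(f"unsupported RAID level: {raid_level}")
    for_throughput = math.ceil(required_mb_s / drive_mb_s)  # drives needed for bandwidth
    for_capacity = math.ceil(required_gb / drive_gb)        # drives needed for space
    # The stricter of the two requirements wins; parity drives are added on top.
    return max(for_throughput, for_capacity) + PARITY_DRIVES[raid_level]


if __name__ == "__main__":
    # SSDs at 250 MB/s each (assumed 500 GB per drive), RAID5.
    print(minimum_drives(4500, 2000, drive_mb_s=250, drive_gb=500, raid_level="raid5"))
    # 15k RPM spinning disks at 150 MB/s and 146 GB each, RAID6.
    print(minimum_drives(4500, 2000, drive_mb_s=150, drive_gb=146, raid_level="raid6"))
```

In both assumed cases the throughput requirement, not the capacity requirement, determines the drive count, which is typical when sizing for maximum scan performance.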
The Vector database can be stored across multiple storage locations to allow for data expansion, or to allow distinct types of data to be stored on different types of storage. For example, you may want to store temporary files, rather than table data, on faster storage.
File System
There is no specific file system that you are required to use. On Linux you can use ext2, ext3, ext4, or XFS. On Windows, choose NTFS. Keep the following in mind:
• Some file systems have a maximum file size (for example, ext3 has a 2 TB file size limit). For a large table, a single file (which holds one column) may grow very large and reach this limit.
• Choose an XFS file system on Linux if you need maximum flexibility for your file size.