Recommended Platforms
Hadoop Distributions
VectorH depends on a working Hadoop installation and is compatible with the most common Hadoop distributions, such as Hortonworks, Cloudera, Apache Hadoop, and MapR.
Before installing VectorH, make sure a fully functional Hadoop installation is running on the target cluster. The installation script detects most environments correctly and installs VectorH on top of the existing Hadoop installation.
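As a quick pre-installation sanity check, a script along these lines can confirm that the HDFS client is on the PATH and that the cluster answers. This is only a sketch; the exact commands available depend on your Hadoop distribution.

```shell
#!/bin/sh
# Sketch: verify a working Hadoop/HDFS stack before running the
# VectorH installer. Commands are illustrative, not VectorH-specific.
if command -v hdfs >/dev/null 2>&1; then
    hdfs dfsadmin -report 2>/dev/null | head -n 5   # cluster health summary
    hdfs dfs -ls / >/dev/null 2>&1 && echo "HDFS is answering" \
        || echo "hdfs client found, but the cluster did not respond"
else
    echo "hdfs client not found in PATH; install and configure Hadoop first"
fi
```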
RAID or HDFS File System
Because VectorH stores its data in HDFS, we recommend a JBOD disk setup, letting HDFS manage the redundancy and the performance benefits of multiple drives. For non-HDFS data (the operating system and the VectorH binaries), however, we recommend a RAID volume (either RAID1 or RAID5, implemented in either hardware or software) to keep the nodes highly available.
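On Linux, the state of any software RAID volumes backing the OS and binaries can be inspected via /proc/mdstat. This is a minimal sketch; hardware RAID controllers are reported by their own vendor tools instead.

```shell
#!/bin/sh
# Sketch: inspect Linux software-RAID (md) state for the non-HDFS
# volumes. Hardware RAID requires the controller vendor's tools.
if [ -r /proc/mdstat ]; then
    cat /proc/mdstat      # lists md arrays and their sync status
else
    echo "no md (software RAID) support detected on this kernel"
fi
```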
Supported CPUs
Vector is designed to run on all modern x86_64-based hardware.
The 64-bit Linux distribution can run on all x86_64 CPUs from both Intel and AMD.
Note: CHAR and VARCHAR operations can gain performance benefits from SSE4.2 instructions, and INTEGER operations can gain performance benefits from AVX2 instructions. Check with your hardware vendor whether SSE4.2 and AVX2 instructions are supported.
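On Linux you can also check for these instruction sets directly, since the kernel exposes the CPU feature flags (lowercase) in /proc/cpuinfo:

```shell
#!/bin/sh
# Check whether the CPU advertises the SSE4.2 and AVX2 instruction
# sets that benefit CHAR/VARCHAR and INTEGER operations respectively.
for flag in sse4_2 avx2; do
    if grep -qw "$flag" /proc/cpuinfo; then
        echo "$flag: supported"
    else
        echo "$flag: NOT supported"
    fi
done
```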
Vector uses multiple cores to handle concurrent queries; if you enable parallel execution, a single statement can also use multiple CPU cores.
Disk Subsystems
For processing datasets that do not fit in memory, Vector needs, above all, a disk subsystem with high sequential-read performance. In Hadoop this is achieved with a JBOD setup using any of these technologies:
• SCSI
• SAS
• SATA
Because random lookups are not as important in a data warehousing context, SATA hardware tends to be the cost-effective option.
Solid State Drives (SSD), which typically use SATA, are more expensive per gigabyte, but high-end models deliver more than 2.5 times the sequential throughput of a single magnetic disk. They also come in a 2.5-inch form factor, allowing higher "bandwidth density".
Be sure to balance SSDs with enough disk controllers: at least one controller is needed per four drives. Because Vector uses advanced differential techniques to handle updates efficiently, the number of write operations is significantly reduced, so cost-effective MLC-flash SSDs are an option.
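A crude way to gauge a drive's sequential-read throughput is a raw read with dd after dropping the page cache. This is a sketch: "/dev/sdX" is a placeholder for an actual JBOD member device, and dropping caches requires root.

```shell
#!/bin/sh
# Sketch: measure raw sequential-read throughput of one data drive.
# "/dev/sdX" is a placeholder -- substitute a real JBOD member device.
DEV=${1:-/dev/sdX}
if [ -b "$DEV" ]; then
    sync
    echo 3 > /proc/sys/vm/drop_caches           # needs root; forces reads to hit disk
    dd if="$DEV" of=/dev/null bs=1M count=4096  # dd reports throughput when done
else
    echo "$DEV is not a block device; pass a drive to test"
fi
```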
Connectivity
Although VectorH aims to keep the inter-node communication to a minimum, there will always be a moderate amount of it. As network bandwidth is relatively low compared to direct memory bandwidth, a high-speed network is highly recommended.
For proper operation, we recommend at least a 10GigE network. 20GigE (2 x 10GigE bonded) or InfiniBand will improve performance. A combination of 10GigE or 20GigE and InfiniBand is also possible. In this case, the system can be configured such that HDFS and other Hadoop components communicate over the 10GigE or 20GigE link and VectorH exclusively communicates over the InfiniBand link. A separate, dedicated network interface exclusively for VectorH is recommended, even when InfiniBand is not available.
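Before deploying, it is worth measuring the usable inter-node bandwidth rather than trusting the link rating; iperf3 is a common tool for this. A sketch, where "node02" is a hypothetical peer host on which "iperf3 -s" has been started:

```shell
#!/bin/sh
# Sketch: measure point-to-point bandwidth between two cluster nodes.
# "node02" is a hypothetical peer; start "iperf3 -s" on it first.
PEER=${1:-node02}
if command -v iperf3 >/dev/null 2>&1; then
    iperf3 -c "$PEER" -P 4 -t 10 \
        || echo "could not reach $PEER; is the iperf3 server running there?"
else
    echo "iperf3 not installed; skipping bandwidth check"
fi
```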
Guidelines for a Balanced Platform
A "balanced" hardware configuration is one that has no clear performance bottleneck. In a balanced configuration, CPUs can process at maximum performance while there is little to no surplus capacity in other resources. A balanced configuration gives you maximum return for your investment.
Today's multi-core CPUs process data extremely fast, and most configurations cannot provide enough disk I/O bandwidth to keep the CPUs fed. For practical, non-benchmark configurations, the in-memory column buffer reduces the storage bandwidth needed to keep all cores busy at all times. Configure a system with ample query execution memory as well as a generous column buffer, so that most of the frequently accessed data stays compressed in memory. Systems with a large amount of memory and fast spinning disks typically deliver the most cost-effective solution.
Recommended hardware configuration for the NameNode:
Recommended hardware configuration for each DataNode (VectorH node):