3. Stage 1--Define and Create the Test Environment
 
Where to Undertake the Evaluation?
Evaluation Using Actian Vector
Which Hadoop Distribution to Use?
Cluster Environment: Hardware and Software
Installation
Post Installation Tasks
Using VectorH
Where to Undertake the Evaluation?
VectorH runs directly inside a Hadoop cluster and can be installed and evaluated in a number of different environments, such as:
Dedicated on-premises bare metal cluster. This is the best choice for complete control and performance testing, but has the highest costs (in time and setup effort).
In a virtualized Cloud environment, such as Amazon AWS, Rackspace, or Microsoft Azure.
In a dedicated, hardware-based Cloud environment, such as those offered by Rackspace or a hosting company.
Inside a single-node Hadoop sandbox virtual machine, as provided by a Hadoop vendor.
Inside a user-created cluster of virtual machines running on a single machine or on a VMware ESX server.
There are advantages and disadvantages to testing in each of these environments:
Cloud environments can be much faster to provision since there is no need to procure new hardware.
Virtual machines and shared infrastructure, rather than dedicated hardware, mean that performance will be variable.
Cloud environments limit your choice of CPU, memory, network, and disk configurations to those the Cloud provider offers. If none of these match your workload, you cannot customize them.
Hadoop sandbox environments are quick to get running, since they are pre-built and require little setup effort. They typically need a minimum of 8 GB of memory in your test machine to get Hadoop working, and the Actian VectorH requirements must be added on top of this. So although this footprint is smaller than most other options, it might still be too large for a typical laptop.
Sandbox environments implement only a single node of Hadoop, which is fundamentally designed as a cluster-based system. A single node does work, however, and allows a level of functional testing to be completed, but no performance or disaster recovery testing will be possible.
For rapid functional and non-functional testing, a Cloud-based environment is a good starting choice. If performance is critical, a dedicated on-premises (bare metal) cluster is the most common choice, although customers have also had recent success building performant Cloud environments through careful choice of each Cloud vendor's infrastructure options (for example, high-speed disks and plenty of CPU cores and RAM).
The key prerequisites for evaluating VectorH are:
A high-speed interconnect between data nodes. 10 Gigabit Ethernet is recommended; 1 Gigabit Ethernet will work, but performance is likely to be impaired.
Sufficient disk space in HDFS to store your test data, bearing in mind that by default HDFS keeps three copies of each data block. VectorH, however, stores data in a compressed format, so the net effect of these two factors is that the space required is frequently about the same as the size of the uncompressed CSV data (a rough sizing sketch follows this list).
At least three Hadoop data nodes (a master node and two slave nodes), in addition to the core Hadoop requirements of your distribution.
If failover/resilience testing is planned, an extra standby master node is required, and a standby data node is also useful.
If scale-out testing is planned, more data nodes are needed so that the cluster can be expanded.
For performance testing, as many CPU cores per machine as possible. In principle, VectorH will work with however many cores are available, but in practice many Big Data workloads benefit from 16 or more cores per machine.
For performance testing, at least 8 GB of RAM per CPU core. Again, depending on the workload and data volumes, more is better.
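As a rough illustration of how the disk space and memory guidelines above combine, the short Python sketch below estimates HDFS space and per-node RAM for an example workload. The 1 TB data size and the 3:1 compression ratio are illustrative assumptions only (actual compression depends on your data), so treat the output as a starting point rather than Actian-supplied sizing figures.

    # Rough sizing sketch for a VectorH evaluation cluster.
    # Assumptions (not measured figures): ~3:1 compression versus raw CSV,
    # and the default HDFS replication factor of 3.
    raw_csv_gb        = 1000   # test data size as uncompressed CSV (example value)
    compression_ratio = 3.0    # assumed VectorH compression versus CSV
    hdfs_replication  = 3      # default HDFS block replication factor

    hdfs_space_gb = raw_csv_gb / compression_ratio * hdfs_replication
    print(f"Estimated HDFS space for the test data: {hdfs_space_gb:.0f} GB")

    # Per-node memory guideline from the prerequisites above:
    cores_per_node  = 16       # 16 or more cores per machine for Big Data workloads
    ram_per_core_gb = 8        # at least 8 GB of RAM per core
    print(f"Suggested RAM per data node: {cores_per_node * ram_per_core_gb} GB")

With these assumptions the compression roughly cancels out the three-way replication, which is why the estimated HDFS space comes out close to the original CSV size, and a 16-core data node would be sized with around 128 GB of RAM.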