Evaluation Guide: 3. Stage 1: Define and Create the Test Environment
 
Where to Undertake the Evaluation?
VectorH runs directly inside a Hadoop cluster and can be installed and evaluated in a number of different environments, such as:
Dedicated on-premises bare metal cluster. This is the best choice for complete control and performance testing, but has the highest costs (in time and setup effort).
In a virtualized Cloud environment, such as Amazon AWS, Rackspace, or Microsoft Azure.
In a dedicated, hardware-based Cloud environment, such as those offered by Rackspace or a hosting company.
Inside a single-node Hadoop sandbox virtual machine, as provided by a Hadoop vendor.
Inside a user-created cluster of virtual machines running on a single machine or on a VMWare ESX server.
Each of these environments has advantages and disadvantages for testing:
Cloud environments can be much faster to provision since there is no need to procure new hardware.
Virtual machines and shared infrastructure, rather than dedicated hardware, mean that performance will be variable.
Cloud environments limit your choice of CPU, memory, network, and disk configurations to what the Cloud provider offers. If none of the available configurations match your workload, you cannot customize further.
Hadoop Sandbox environments are quick to set up, since they are pre-built and require no setup effort. They typically need a minimum of 8 GB of memory in your test machine to get Hadoop working, and Actian VectorH's requirements must be added on top of this. So although this is smaller than most other options, it might still be too large for a typical laptop.
Sandbox environments implement only a single node of Hadoop, which is fundamentally designed as a cluster-based system. A single node does work, however, and allows a level of functional testing to be completed, but no performance or disaster recovery testing will be possible.
For rapid functional and non-functional testing, a Cloud-based environment is a good choice to start with. But if performance is critical, then a dedicated on-premises (bare metal) cluster is the most common choice—although customers have recently had success with building performant environments in the Cloud also, with careful choice of each Cloud vendor’s infrastructure options (for example, high speed disks, lots of CPU cores and RAM).
The key prerequisites for evaluating VectorH are:
A high speed interconnect between data nodes. 10 Gigabit Ethernet (10 GbE) is recommended; 1 GbE will work, but performance is likely to be impaired.
Sufficient disk space in HDFS to store your test data, bearing in mind that by default HDFS stores three copies of each data block. VectorH, however, stores data in a compressed format, so the net effect of these two factors is that the space required is frequently about the same as the size of the uncompressed CSV data.
At least three Hadoop nodes (a master node and two slave nodes), in addition to the core Hadoop requirements of your distribution.
If failover/resilience testing is needed, then an extra standby master node is also needed, and a standby data node is also useful.
If scale-out testing is planned, then more data nodes are needed to be able to expand the cluster.
For performance testing, as many CPU cores per machine as possible. VectorH will work with however many cores are available, but in practice many Big Data workloads benefit from 16 or more cores per machine.
For performance testing, at least 8 GB RAM per CPU core. Again, depending on the workload and data volumes, more is better.
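As a rough illustration of the sizing guidelines above, the following sketch works through the disk and RAM arithmetic with example figures; the data size, compression ratio, and core count are assumptions for illustration, not measurements:

```shell
# Example sizing arithmetic for the prerequisites above (illustrative values).

# HDFS space: 3x block replication, roughly offset by Vector's compression
# (an assumed ~3x ratio, consistent with the guideline above).
csv_gb=500                 # example: uncompressed CSV size of the test data
replication=3              # HDFS default block replication factor
compression_ratio=3        # assumed compression from Vector's storage format
hdfs_gb=$(( csv_gb * replication / compression_ratio ))
echo "Estimated HDFS space needed: ${hdfs_gb} GB"

# RAM per node: at least 8 GB per CPU core for performance testing.
cores_per_node=16          # guideline minimum for many Big Data workloads
gb_per_core=8
ram_gb=$(( cores_per_node * gb_per_core ))
echo "Suggested minimum RAM per node: ${ram_gb} GB"
```

Substitute your own data volume and hardware figures to get a first-cut estimate before provisioning the cluster.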
Evaluation Using Actian Vector
One evaluation option is to use Actian Vector rather than Actian VectorH. Although Actian Vector does not run in Hadoop, it provides a single-node environment that is identical to VectorH in terms of schema definition, query syntax, management tools, and functionality.
This is an attractive option because it requires only a single machine rather than a cluster, so it is fast and simple to get started with, and it can be used on both Windows and Linux.
For a Windows user, there is an installation package for Vector on Windows available for download from http://esd.actian.com/product/Vector.
For a Linux developer, there is also a Linux installation package available from the same location that can be installed in the usual way (for example, using a package manager).
You can also evaluate a Linux version of Vector from a Windows machine by using a virtual machine (either locally or in a Cloud). The fastest way to do this with test data preloaded is as follows:
1. If you want to run on a local virtual machine, download and install Oracle VirtualBox from https://www.virtualbox.org/wiki/Downloads.
2. If you want to run on the Microsoft Azure cloud platform, create an Azure account first, and note your account key information.
3. Download the Vagrant developer tool from https://www.vagrantup.com/downloads.html. Windows, Linux, and Mac OS X are supported.
4. Download the Vector Vagrant project from GitHub at: https://github.com/ActianCorp/Vagrant-Vector-Install. If you do not already have Git installed, you can download the project as a zip file using the Download ZIP button on that web page and then unpack it.
5. Download Actian Vector from http://esd.actian.com/product/Vector, and place it into the Vagrant-Vector-Install folder, along with the license file that you will receive through email, and the RPM keys file.
6. If you want to use Microsoft Azure, edit the Vagrantfile in this folder and fill in your Azure account information. Then start a Command Prompt in the Vagrant-Vector-Install folder and type:
vagrant up --provider=azure
7. To use a local virtual machine, simply start a Command Prompt in the Vagrant-Vector-Install folder and type:
vagrant up
8. To log in to the machine afterwards, type:
vagrant ssh
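For the local virtual machine path, the numbered steps above reduce to a short command-line session. The repository URL and folder name are taken from the steps; copying in the installer, license file, and RPM keys (step 5) remains a manual action:

```shell
# Sketch of the local-VM route through steps 1-8 above.
# Assumes VirtualBox, Vagrant, and Git are already installed.
git clone https://github.com/ActianCorp/Vagrant-Vector-Install
cd Vagrant-Vector-Install
# Copy the Vector installer, license file, and RPM keys into this folder now.
vagrant up        # create and provision the local virtual machine
vagrant ssh       # log in to the machine once provisioning completes
```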
For further instructions, see the readme file on the Actian Github page, or the documentation for each tool or platform.
Which Hadoop Distribution to Use?
VectorH uses HDFS or MapR-FS as the storage layer for data and calls on other Hadoop services such as the NameNode and YARN. VectorH is agnostic to the Hadoop distribution used.
A supported Hadoop distribution must be installed and running for VectorH to operate. Supported versions of Hadoop distributions and underlying operating systems are described in the Product Availability Matrix (http://downloads.actian.com/media/PDFs/Product_Avail_Matrix_Vector.pdf).
Cluster Environment: Hardware and Software
Generally speaking the hardware requirements of VectorH match the recommendations of the underlying Hadoop vendors: dual CPU commodity servers with sufficient RAM, disks, and 10 GbE networking.
For disks, we recommend the higher end of the Hadoop vendor recommendations: at least 16 physical drives and, if possible, SAS drives rather than SATA or Near Line SAS. Solid state drives (SSDs) are preferred over rotating drives for maximum performance.
Some sample reference information is provided here:
Hardware Recommendations For Apache Hadoop (http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.2/bk_cluster-planning-guide/content/ch_hardware-recommendations.html)
Select the Right Hardware for Your New Hadoop Cluster (http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster/)
Cluster Hardware (https://maprdocs.mapr.com/home/AdvancedInstallation/PlanningtheCluster-hardware.html)
Consult the vendor documentation on high performance DataNodes for distribution-specific configuration details.
For the test cluster, ensure that the number of VectorH nodes is greater than the HDFS block replication factor. (As of version 4.2.2, a configuration item is available that sets the replication factor for VectorH independently of the HDFS default value.)
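The node-count requirement above can be checked with a trivial comparison; the node count and replication factor below are illustrative values (3 is the usual HDFS default for dfs.replication), not read from a live cluster:

```shell
# Sanity check: the number of VectorH nodes must exceed the HDFS block
# replication factor (example values below).
vectorh_nodes=4       # example: planned VectorH node count
replication=3         # example: HDFS default dfs.replication
if [ "$vectorh_nodes" -gt "$replication" ]; then
  result="OK: ${vectorh_nodes} nodes > replication factor ${replication}"
else
  result="NOT OK: add nodes or lower the VectorH replication factor"
fi
echo "$result"
```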