Implementation Design

Hadoop also requires at least three ZooKeeper instances to provide fault tolerance for state and configuration information. These services can run on nodes shared with other tasks.
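The figure of three follows from how a ZooKeeper ensemble stays available: it keeps serving only while a strict majority of its servers is running. The following is a minimal sketch of that arithmetic. The helper names are hypothetical and are not part of any ZooKeeper or VectorH API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble with the given number of servers."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of servers that can fail while a majority is still running."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} server(s): quorum {quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Under this rule, one or two servers tolerate no failures at all, so three is the smallest ensemble that survives the loss of a single server.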

• High availability of the VectorH master node: We recommend that any production instance be set up with high availability support for the VectorH master node, which is handled by the Red Hat Cluster Suite. For details, see the User Guide.

• Resilience of VectorH slave nodes: YARN must be used for the cluster to recover automatically when one of the slave nodes fails. As of VectorH 4.2, YARN is disabled by default, so it must be enabled explicitly. For details, see the User Guide.

• Use of edge nodes: Some customers have adopted a security practice of having end users connect only to “edge nodes” rather than opening all Hadoop data nodes to connections from outside the cluster. Such edge nodes typically need VectorH client software installed so that external users can log in.

• Security and mandated configuration: Are there any corporate security or other mandated configuration requirements (such as Kerberos) that need to be considered for the evaluation?

Beyond the core question of what to install where, you also need to decide how to structure the VectorH users, schemas, and data to meet the needs of your use case. For example, should all data and users be hosted in a single instance of VectorH, or should you adopt some segmentation across multiple schemas, databases, or installations?

A note on terminology: In VectorH, a database can contain many schemas, each of which contains multiple database objects (tables, secondary indexes, and views). Objects in different schemas can be joined in the same query, provided that the user has permission to do so.

A VectorH installation can contain many databases, but a single query cannot join data across multiple databases.
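The scoping rules above can be summarized as a small model. This is an illustrative sketch only: the `TableRef` type and the `can_join` function are hypothetical, and permissions are simplified to a set of tables the user may read, which is not how VectorH stores privileges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    """A table identified by its database, schema, and name (hypothetical type)."""
    database: str
    schema: str
    table: str


def can_join(tables, readable) -> bool:
    """True if one query could join all the tables under the rules in the text:
    every table lives in the same database and the user may read each one."""
    databases = {t.database for t in tables}
    return len(databases) == 1 and all(t in readable for t in tables)


if __name__ == "__main__":
    orders = TableRef("dw", "sales", "orders")
    staff = TableRef("dw", "hr", "staff")
    old_orders = TableRef("archive", "sales", "orders")
    everything = {orders, staff, old_orders}

    print(can_join([orders, staff], everything))       # same database, two schemas
    print(can_join([orders, old_orders], everything))  # two databases
    print(can_join([orders, staff], {orders}))         # no permission on staff
```

Only the first call succeeds: schemas do not limit what a query can join, whereas the database boundary and the user's permissions do.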

During an initial evaluation, we recommend that only one database be installed per VectorH instance, because the default resource allocation policy allocates 75% of memory and all CPU cores to the first database, and each subsequent database tries to acquire the same resources. Multiple databases can be set up in the same installation, but this requires extra configuration. It is therefore more common to have a single database per VectorH installation and use multiple schemas in that database to achieve a level of separation. Valid use cases for multiple databases exist, but it is best to gain experience with VectorH performance characteristics first.
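A short calculation shows why the defaults do not suit several databases. The 75% figure comes from the text above. The function names, and the even split used as one possible adjustment, are assumptions for illustration and not a documented VectorH policy or setting.

```python
DEFAULT_MEMORY_FRACTION = 0.75  # default share of memory per database, per the text


def memory_demand(num_databases: int,
                  fraction: float = DEFAULT_MEMORY_FRACTION) -> float:
    """Combined memory claim of all databases, as a multiple of physical memory."""
    return num_databases * fraction


def is_oversubscribed(num_databases: int,
                      fraction: float = DEFAULT_MEMORY_FRACTION) -> bool:
    """True when the databases together claim more memory than the node has."""
    return memory_demand(num_databases, fraction) > 1.0


def even_share(num_databases: int,
               budget: float = DEFAULT_MEMORY_FRACTION) -> float:
    """Hypothetical adjustment: split one 75% budget evenly across databases."""
    return budget / num_databases


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} database(s): demand {memory_demand(n):.2f}x memory, "
              f"oversubscribed={is_oversubscribed(n)}, "
              f"even share {even_share(n):.3f}")
```

With the defaults, a second database already pushes the combined claim to 1.5 times physical memory, which is why extra configuration is needed before adding one.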

Running multiple instances of VectorH independently in the same Hadoop cluster increases fault isolation, but it also increases the administration overhead of managing each instance, because users, service packs, and data are not shared between instances. In addition, each client must know which instance's IP address to connect to. Segmentation in this way can be straightforward when users are naturally separated by geography, function, or the data that they work on. If no such natural segmentation exists, the administration and management overhead of multiple instances is likely to outweigh the benefits.