Evaluation Process
Every evaluation should begin with a definition of what “success” means for that evaluation. Evaluation criteria are typically grouped into two categories:
• Functional evaluation criteria: Can I solve my problem?
• Non-functional evaluation criteria: Can I make it work in my environment?
This guide focuses primarily on the functional aspect of evaluations, that is, product installation, setup, configuration, and tuning. But non-functional tests are also important and should be added to the evaluation.
Evaluation Steps
The evaluation test process is typically as follows:
1. Define success criteria and tests to be undertaken.
2. Define and set up the test environment (hardware plus Hadoop). Validate the environment for use with VectorH, and then install VectorH.
3. Migrate the schema, data, and queries from an existing environment into VectorH.
4. Run a single-user test to validate the queries (a minimal timing-harness sketch follows this list).
5. Optimize the schema for VectorH and for the workload.
6. Run a multi-user concurrency test.
7. Run the non-functional tests.
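As an illustration of steps 4 and 6, the timing harness can be very simple: a script that replays the query set against the database and records row counts and elapsed times. The sketch below is one possible approach in Python using pyodbc; the DSN name vectorh_eval and the queries/ directory of one-statement-per-file SQL scripts are assumptions to adapt to your own driver configuration and workload.

    # Minimal single-user timing harness (sketch).
    # Assumptions: an ODBC DSN "vectorh_eval" points at the evaluation database,
    # and each *.sql file in queries/ contains exactly one SQL query.
    import glob
    import time

    import pyodbc

    def run_queries(dsn="vectorh_eval", query_dir="queries"):
        timings = {}
        conn = pyodbc.connect(f"DSN={dsn}", autocommit=True)
        cursor = conn.cursor()
        for path in sorted(glob.glob(f"{query_dir}/*.sql")):
            with open(path) as f:
                sql = f.read()
            start = time.perf_counter()
            cursor.execute(sql)
            rows = cursor.fetchall()      # capture results for validation, not just timing
            elapsed = time.perf_counter() - start
            timings[path] = (len(rows), elapsed)
            print(f"{path}: {len(rows)} rows in {elapsed:.2f}s")
        conn.close()
        return timings

    if __name__ == "__main__":
        run_queries()

For the concurrency test in step 6, the same function can be driven from a thread or process pool so that several simulated users replay the query set at the same time, while the per-query timings are still recorded.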
Typically an evaluation is conducted on a subset of the data and the workload, so the results need to be scaled to assess what will be needed for production. A scalability model must be developed to show how to do this mapping, and the evaluation stage must gather the key data points to be able to build that model.
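The scalability model does not have to be elaborate. Even a simple curve fitted to the measured data points is a useful starting point, provided its assumptions are checked. The sketch below uses entirely hypothetical measurements: it fits a straight line of average query time against data volume and extrapolates to a production volume. Whether a linear fit is appropriate depends on the workload and must be validated against what is actually measured.

    # Naive first-order scalability model (sketch, not a sizing guarantee).
    # The (data volume, average query seconds) measurements below are hypothetical;
    # gather enough real points during the evaluation to confirm the trend really
    # is linear before trusting any extrapolation.

    def linear_fit(points):
        """Least-squares fit y = a*x + b for a list of (x, y) pairs."""
        n = len(points)
        sx = sum(x for x, _ in points)
        sy = sum(y for _, y in points)
        sxx = sum(x * x for x, _ in points)
        sxy = sum(x * y for x, y in points)
        a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        b = (sy - a * sx) / n
        return a, b

    measured = [(0.5, 4.1), (1.0, 7.9), (2.0, 15.8)]   # (TB scanned, avg query seconds)
    a, b = linear_fit(measured)
    prod_tb = 10
    print(f"Predicted average query time at {prod_tb} TB: {a * prod_tb + b:.1f}s")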
Functional Test Criteria
Typical functional test criteria might be:
• What BI tools or applications will be used? How will they connect?
• Will the platform support both interactive and batch use cases?
• Will my queries work unchanged, and do I get back the results I expect?
• Where will my business logic live?
• How will I port the schema and data from my current system?
• What changes, if any, will need to be made to the schema definition?
• Do I have source data that can be used as is, or does it need to be obfuscated or even generated from scratch?
• Does the system need to support updates and changes in-place to the data?
If you simply want to run some basic tests quickly to get a feel for the product, use the performance test kit (see Running the Performance Test Kit), which is highly automated and can produce test data and query results.
Non-functional Test Criteria
Typical non-functional test criteria might include areas such as:
• What target data volumes need to be supported for production and for evaluation?
• What target data loading rates need to be supported?
• What ETL tools, if any, will be used to load the data, and what operating systems will they run on? VectorH supports clients on Windows and Linux.
• Where is the data coming from, and how often will data need to be loaded—that is, what is the target data latency from the originating source system?
• How many users need to be supported, and therefore what query concurrency level is needed? Typically we have seen an “active query per connected user” ratio of 3-5% (a small worked example follows this list).
• What proportion of my production workload do I need to test with to create a scalability model that I can trust for sizing the production environment?
• What service level is needed for the production service, and how can the Actian Analytics Platform support that Service Level Agreement?
• What happens if the master node fails?
• What happens if a slave node fails?
• Will VectorH be expected to compete with other services (for example, Hive, Impala)?
• What does hardware (CPU, memory, disk, network) usage look like during the performance tests? How balanced and saturated is the cluster?
• Are there any restrictions on the movement of data or programs (such as downloading software from esd.actian.com) that need to be addressed?
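It helps to turn these targets into numbers early, even roughly. The small worked example below uses entirely hypothetical figures together with the 3-5% rule of thumb mentioned above; substitute your own targets.

    # Back-of-envelope sizing from the non-functional targets (sketch).
    # All input figures are hypothetical; the 3-5% active-query ratio is an
    # observed rule of thumb, not a guarantee.

    connected_users = 1000
    active_ratio = 0.04                  # 4% of connected users active at once
    concurrent_queries = connected_users * active_ratio
    print(f"Expected concurrent active queries: {concurrent_queries:.0f}")        # ~40

    daily_load_gb = 500                  # new data arriving per day
    load_window_hours = 2                # window in which it must be loaded
    required_load_rate_mb_s = daily_load_gb * 1024 / (load_window_hours * 3600)
    print(f"Required sustained load rate: {required_load_rate_mb_s:.0f} MB/s")    # ~71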
Success Criteria
The context and use case determine which of the characteristics above, if any, are important to the evaluation. The most common evaluation criteria include:
• Test with the specific BI and ETL tools that will be used. How will the data get in and out?
• Check that the schema and data can be ported from an existing platform without problems.
• Check that a single-user run of a representative set of queries performs well.
• Check that multiple concurrent users can be supported without performance degrading to unacceptable levels.
• Check that data loading performance meets the use-case needs (a simple measurement sketch follows this list).
• Draft an architecture design for scaling to the target levels of users and data needed for production.
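One way to get an early feel for load performance is to time a simple batched insert through the client interface, as sketched below. The staging table, DSN name, and row counts are assumptions for illustration only. Row-at-a-time inserts through ODBC will understate what the bulk load path can achieve, so the load mechanism documented for your VectorH release should also be timed with production-like files.

    # Crude load-rate measurement (sketch): time batched inserts through ODBC.
    # Assumptions: an ODBC DSN "vectorh_eval" and an existing staging table
    # stage.events(id INT, payload VARCHAR(100)).  This measures the client
    # insert path only, not the bulk loader a production system would use.
    import time

    import pyodbc

    def measure_insert_rate(rows=100_000, batch=10_000, dsn="vectorh_eval"):
        conn = pyodbc.connect(f"DSN={dsn}", autocommit=False)
        cur = conn.cursor()
        cur.fast_executemany = True      # pyodbc option to send parameter batches more efficiently
        data = [(i, f"payload-{i}") for i in range(rows)]
        start = time.perf_counter()
        for i in range(0, rows, batch):
            cur.executemany("INSERT INTO stage.events (id, payload) VALUES (?, ?)",
                            data[i:i + batch])
        conn.commit()
        elapsed = time.perf_counter() - start
        print(f"{rows} rows in {elapsed:.1f}s -> {rows / elapsed:,.0f} rows/s")
        conn.close()

    if __name__ == "__main__":
        measure_insert_rate()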
Implementation Design
The core principles of how to set up Hadoop depend on the distribution, but some common considerations are:
• How will high availability of the Hadoop NameNode be achieved?
Cloudera and Hortonworks require that the Hadoop NameNode be managed using an active/passive arrangement. MapR has implemented a multi-master approach for the NameNode to eliminate this concern.
• What other Hadoop components are required for the solution?
As a minimum, Hadoop also requires at least three instances of ZooKeeper to provide fault tolerance for state and configuration information. These services can run on nodes shared with other tasks.
• High availability of the VectorH master node: We recommend that any production instance be set up with high availability support for the VectorH master node, which is handled by the Red Hat Cluster Suite. For details, see the User Guide.
• Resilience of VectorH slave nodes: YARN must be used if the cluster is to recover automatically when a slave node fails. Note that YARN integration is disabled by default.
• Use of edge nodes: Some customers have adopted the security practice of allowing end users to connect only to “edge nodes” rather than opening all Hadoop data nodes to connections from outside the cluster. Such edge nodes typically need the VectorH client software installed so that external users can log in.
• Are there any corporate security or other mandated configuration requirements that need to be considered for the evaluation (such as Kerberos)?
Beyond the core considerations of what to install where, the other factors to consider are how to structure the VectorH users, schema, and data to meet the needs of your use case. For example, should all data and users be hosted in a single instance of VectorH, or should you adopt some segmentation across multiple schemas, databases, or installations?
A note on terminology: In VectorH, a database can contain many schemas, each of which contains multiple database objects (tables, secondary indexes, and views). Objects in different schemas can be joined in the same query, providing that the user has permission to do so.
A VectorH installation can contain many databases, but a single query cannot join data across multiple databases.
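For example, a single query can join tables owned by different schemas in the same database simply by schema-qualifying the object names, permissions permitting. The sketch below assumes two hypothetical schemas, sales and finance, and an ODBC DSN named vectorh_eval.

    # Cross-schema join within one database (sketch).
    # The schemas, tables, and DSN below are hypothetical examples.
    import pyodbc

    conn = pyodbc.connect("DSN=vectorh_eval")
    cur = conn.cursor()
    cur.execute("""
        SELECT o.order_id, o.order_date, i.invoice_total
        FROM   sales.orders o
        JOIN   finance.invoices i ON i.order_id = o.order_id
    """)
    for row in cur.fetchmany(10):
        print(row)
    conn.close()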
During an initial evaluation, we recommend that only one database be installed per VectorH instance, since the default resource allocation policy is to allocate 75% of memory and all CPU cores to the first database, with each subsequent database also wanting to acquire the same resources. Multiple databases can be set up in the same installation, but this requires extra configuration. It is therefore more common to have a single database per VectorH installation and use multiple schemas in that database to achieve a level of separation. Valid use cases for multiple databases exist, but it is best to gain experience with VectorH performance characteristics first.
Fault isolation is increased by running multiple instances of VectorH independently in the same Hadoop cluster, but so is the administration overhead, because users, service packs, and data are not shared between instances. In addition, client connections need to be directed to the correct instance's IP address. Segmentation in this way can be straightforward when users are naturally separated by geography, function, or the data they work on; if no such natural segmentation exists, the administration and management overhead of multiple instances is likely to outweigh the gain in fault isolation.
Disaster Recovery
Another design consideration is how to deal with disaster recovery, that is, the consequences of your main cluster failing for any reason.
The simplest approach is to rely on a single cluster: if it fails, you restore it, from a backup if necessary, after resolving whatever hardware or network issues caused the failure.
Backing up and restoring a Hadoop cluster is a much broader topic than this document can cover, but the data volumes involved are typically large and can take a long time to back up and restore. Recovery is therefore a lengthy process: in addition to resolving the hardware or network problems that caused the failure, it may involve reinstalling, configuring, and patching the operating system, Hadoop, other applications, and VectorH, and then restoring the databases.
In such a scenario, the system outage could last longer than the business requirement allows for, in which case a faster recovery strategy needs to be adopted.
The most common disaster recovery strategy is to have a second cluster available to switch users and processing over to, but this raises further questions, such as how and how often to refresh the data on the second cluster from the primary, and whether to implement an active/active or active/passive design. Again, these design issues are beyond the scope of this document. One consideration, however, is that in a replication-based solution, VectorH can be used only as the target of data replication (for example, using a partner product such as Attunity Replicate or HVR High Volume Replicator), not as a source.