Disaster Recovery

Another design consideration is how to deal with disaster recovery, that is, the consequences of your main cluster failing for any reason.

The simplest approach is to rely on a single cluster. If it fails, you must restore it, from a backup if necessary, after resolving any hardware or network issues that may have caused the failure.

Backing up and restoring a Hadoop cluster is a much broader topic than this document can cover. Typically, however, the data volumes involved are large, so backups can take a long time. Restoration is also inevitably lengthy: in addition to resolving the hardware and networking problems that may have caused the failure, it can involve reinstalling, configuring, and patching the operating system, Hadoop, other applications, and VectorH, and then restoring the databases.

In such a scenario, the system outage could last longer than the business requirement allows for, in which case a faster recovery strategy needs to be adopted.
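As a rough illustration of this comparison, the following Python sketch adds up the restoration steps and checks the total against the maximum outage the business can tolerate (commonly called the recovery time objective, or RTO, a term this document does not otherwise use). The step names and durations are hypothetical placeholders, not measurements; substitute figures from your own restore tests. The sketch also assumes the steps run one after another.

```python
from datetime import timedelta

# Hypothetical step durations; replace with figures measured in your own restore tests.
RESTORE_STEPS = {
    "resolve hardware/network problems": timedelta(hours=8),
    "reinstall and patch operating system": timedelta(hours=4),
    "reinstall Hadoop and other applications": timedelta(hours=6),
    "reinstall VectorH": timedelta(hours=2),
    "restore databases from backup": timedelta(hours=12),
}


def estimated_outage(steps):
    """Sum the step durations, assuming the steps run sequentially."""
    return sum(steps.values(), timedelta())


def meets_rto(steps, rto):
    """Return True if the estimated outage fits within the allowed outage (RTO)."""
    return estimated_outage(steps) <= rto


if __name__ == "__main__":
    rto = timedelta(hours=24)  # hypothetical business requirement
    total = estimated_outage(RESTORE_STEPS)
    print(f"Estimated outage: {total}; allowed: {rto}; "
          f"within requirement: {meets_rto(RESTORE_STEPS, rto)}")
```

If the estimate exceeds the allowed outage, as it does with these placeholder figures, a single-cluster restore is unlikely to satisfy the requirement.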

The most common disaster recovery strategy is to keep a second cluster available, to which users and processing can be switched. This raises other issues, such as how, and how often, to refresh the data on the second cluster from the primary, and whether to implement an active/active or active/passive design. Again, these design issues are beyond the scope of this document. One consideration, however, is that in a replication-based solution, VectorH can be used only as the target of data replication (for example, using a partner product such as Attunity Replicate or HVR High Volume Replicator), not as a source.
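The question of how often to refresh the second cluster can be framed as simple arithmetic. The Python sketch below assumes a periodic, snapshot-style refresh in which an incomplete refresh is discarded if the primary fails mid-transfer; under that assumption, the worst-case window of lost data is the refresh interval plus the time one refresh takes. The function names and figures are hypothetical, and continuous replication products behave differently.

```python
from datetime import timedelta


def worst_case_data_loss(refresh_interval, refresh_duration):
    """Worst-case age of the data on the second cluster at the moment of failure.

    Assumes each refresh copies a snapshot taken when the refresh starts, and
    that a refresh interrupted by the failure is unusable.
    """
    return refresh_interval + refresh_duration


def max_refresh_interval(max_tolerated_loss, refresh_duration):
    """Longest refresh interval that keeps worst-case loss within the tolerated window."""
    interval = max_tolerated_loss - refresh_duration
    if interval <= timedelta(0):
        raise ValueError("a single refresh already takes longer than the tolerated loss window")
    return interval


if __name__ == "__main__":
    # Hypothetical figures: refresh every 6 hours, each refresh takes 1 hour.
    print(worst_case_data_loss(timedelta(hours=6), timedelta(hours=1)))
    # Hypothetical requirement: lose at most 4 hours of data.
    print(max_refresh_interval(timedelta(hours=4), timedelta(hours=1)))
```

A shorter interval reduces potential data loss but increases load on both clusters, which is one of the trade-offs the design needs to weigh.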