Disaster Recovery
Master Node Recovery
The VectorH High Availability (HA) Option for the Red Hat Cluster Suite can be used to configure failover of the VectorH Master Node to a designated Secondary Master Node. An existing slave node is configured to act as the Secondary Master Node during HA setup.
When the Master Node fails, the HA failover mechanism attempts to reconfigure the cluster by removing the failed node from the node list and starting the instance on the Secondary Master Node.
Requirements for Failover of Master Node
The following are required to use the master node failover feature:
Red Hat Enterprise Linux Cluster Suite
II_SYSTEM and all data locations and log locations (II_DATABASE, II_LOG, and so on) must be located on a device that is physically mounted on both the Master Node and Secondary Master Node under the same path. Use of NFS is not supported.
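For example, one quick way to check this requirement (a sketch only; node2 is a placeholder for the Secondary Master Node, and passwordless ssh is assumed) is to compare the file system backing II_SYSTEM on both nodes:
# On the Master Node: show the device and mount point backing II_SYSTEM
$ findmnt -T "$II_SYSTEM"
# On the Secondary Master Node; $II_SYSTEM expands locally, so this
# verifies that the same path is backed by the same shared device there
$ ssh node2 findmnt -T "$II_SYSTEM"
Both commands should report the same shared device mounted under the same path. Repeat the check for each data and log location.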
How Master Node Failover Works
The master node failover process works like this:
1. An existing slave node is configured as a Secondary Master Node.
2. During regular operation the Secondary Master is used as a slave node.
3. On failover, the Secondary Master is removed from the list of slaves and set as the Master Node.
4. The instance is then started on the Secondary Master with the original Master Node removed from the configuration.
5. After the failed Master Node is back online, you can reverse this process by “failing over” again back to the original Master Node. This process reverts all nodes to their original states.
Data Locality after Failover
After failover occurs, most data on HDFS will no longer be available locally, which can result in a significant performance overhead.
To resolve this, a REWRITE operation (see HA Configuration Parameters) can be run as part of the failover process to re-locate the data. Doing so, however, can take a significant amount of time and will increase the total amount of time VectorH is down. For details, see How to Add and Remove Slave Nodes in the System Administrator Guide.
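To gauge how much data is still local after a failover (a minimal sketch; the HDFS path below is a placeholder for your actual VectorH data location), you can inspect block placement with standard HDFS tooling:
# List files, blocks, and the DataNodes holding each block replica
$ hdfs fsck /Actian/VectorVH/ingres/data -files -blocks -locations
The -locations output shows which DataNodes hold each block, which can help you decide whether running a REWRITE is worth the added downtime.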
Configuring the High Availability Option for Red Hat Cluster Suite
The mkrc utility can generate a service script, ha_actian-vectorhXX, that can be used to monitor and control the VectorH instance in the cluster. The script is distributed as a template, ha.rc, in $II_SYSTEM/ingres/files/rcfiles.
Install the VectorH Service Script
To install the service script, follow these steps:
1. Log in as the instance owner and source the instance environment:
$ . ~/.ingXXsh
where XX is the VectorH instance ID.
2. Generate the VectorH service script:
$ mkrc
3. As root, install the VectorH service script under /etc/init.d:
$ su -c "$II_SYSTEM/ingres/utility/mkrc -i"
4. Disable the service:
$ su -c "chkconfig actian-vectorhXX off"
5. Verify the script has been installed correctly:
$ ls -l /etc/init.d/actian-vectorhXX
-rwxr-xr-x. 1 root root 1064 May 29 19:34 /etc/init.d/actian-vectorhXX
$ chkconfig actian-vectorhXX --list
actian-vectorhXX 0:off 1:off 2:off 3:off 4:off 5:off 6:off
Note:  It should be off by default.
6. (Optional) Install the Management Server service for Actian Director:
$ source ~/.ingXXsh
$ mkrc -s iimgmtsvc
$ su -c "$II_SYSTEM/ingres/utility/mkrc -s iimgmtsvc -i"
$ su -c "chkconfig iimgmtsvcXX off"
7. Repeat steps 1 through 5 on the secondary master node.
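If you prefer to run the repeat non-interactively, a minimal sketch from the master node (assuming passwordless ssh to the secondary master node, shown here as the placeholder node2, and sudo rights for the root-only commands; the verification in step 5 can then be run interactively):
# Source the instance environment, generate and install the script,
# then disable the service, all on the secondary master node
$ ssh node2 '. ~/.ingXXsh && mkrc && sudo $II_SYSTEM/ingres/utility/mkrc -i && sudo chkconfig actian-vectorhXX off'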
Install the Cluster Service Script
To install the service script, follow these steps:
1. Log in as the instance owner and source the instance environment:
$ . ~/.ingXXsh
where XX is the VectorH instance ID.
2. Generate the HA service script:
$ mkrc -s ha
3. As root, install the HA service script under /etc/init.d:
$ su -c "$II_SYSTEM/ingres/utility/mkrc -s ha -i"
4. Verify the script has been installed correctly:
$ ls -l /etc/init.d/ha_actian-vectorhXX
-rwxr-xr-x. 1 root root 1064 May 29 19:34 /etc/init.d/ha_actian-vectorhXX
$ chkconfig ha_actian-vectorhXX --list
ha_actian-vectorhXX 0:off 1:off 2:off 3:off 4:off 5:off 6:off
Note:  It should be off by default.
5. Repeat steps 1 through 4 on the secondary master node.
Set Up the Cluster Service
To set up the cluster service for the VectorH database, follow these steps:
1. As root, use the cluster service script to configure failover:
$ sudo service ha_actian-vectorhXX configure
where XX is the VectorH instance ID.
2. Follow the prompts to configure the Secondary Master Node.
Setting up Actian Vector H HDFS Support...
 
Setting up High Availability (HA) for the
Actian Vector H Master node requires
that the installation location pointed to by:
 
    $II_SYSTEM
 
and data locations, pointed to by:
 
    II_DATABASE
    II_CHECKPOINT
    II_JOURNAL
    II_WORK
  & II_DUMP
 
be located on a shared file system. This must be mounted
on the secondary master node under the same location as
the master node in order for HA Failover to work correctly.
 
The secondary master node must be an existing slave node.
 
Do you wish to continue? (y/n) [y] 
 
Select the slave node to be used as the secondary master node.
 
    1) nodename
    q) quit
 
[1]: 
 
'nodename' configured as secondary Master node
 
After failover, most HDFS data will no longer be available locally,
which can result in a significant performance overhead. To resolve
this a 'rewrite' operation can be run as part of the failover
process to re-locate the data.
 
Doing so, however, can take a significant amount of time and
will increase the total amount of time Actian Vector H
is down.
 
Do you wish to run a 'rewrite' operation during failover? (y/n) [n] y
 
The Actian Vector H HDFS setup program has successfully completed.
Red Hat Cluster Configuration
The ha_actian-vectorhXX script, once installed on the primary and secondary master nodes, should be used as the basis to configure a service group within the Red Hat High Availability framework.
Note:  The service group should be restricted to a failover domain containing only the two nodes capable of acting as VectorH master nodes.
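For example, with the classic Red Hat Cluster Suite (rgmanager) stack, the corresponding fragment of /etc/cluster/cluster.conf might look like the sketch below. This is illustrative only: the node, domain, and service names are placeholders, and your cluster's fencing and other settings are omitted.
<rm>
    <failoverdomains>
        <failoverdomain name="vectorh_masters" restricted="1" ordered="1">
            <failoverdomainnode name="master1" priority="1"/>
            <failoverdomainnode name="master2" priority="2"/>
        </failoverdomain>
    </failoverdomains>
    <service name="vectorhXX" domain="vectorh_masters" recovery="relocate">
        <script file="/etc/init.d/ha_actian-vectorhXX" name="vectorh_ha"/>
    </service>
</rm>
The restricted="1" attribute keeps the service on the two listed nodes only, matching the note above.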
HA Configuration Parameters
Configuration parameters in config.dat for the HA feature include:
ii.hostname.x100.hdfs.ha.rewrite_on_failover
Run a REWRITE statement on startup after failover to rebuild the data in the new configuration. Valid values: true and false.
ii.hostname.x100.hdfs.ha.master
Name of master node
ii.hostname.x100.hdfs.ha.secondary
Name of secondary master node
ii.hostname.x100.hdfs.ha.status
Current state of the instance on the node. Valid values: master and failover.
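These entries can be read with iigetres and, where a manual change is appropriate, written with iisetres. A minimal sketch, run as the instance owner with the environment sourced:
# Show the current HA state recorded for this host
$ iigetres "ii.`hostname`.x100.hdfs.ha.status"
# Enable the REWRITE-on-failover behavior
$ iisetres "ii.`hostname`.x100.hdfs.ha.rewrite_on_failover" true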
DataNode Recovery
If a Hadoop DataNode that is part of the VectorH cluster fails, and your system uses YARN, then failover is automatic. For more information, see Configuring VectorH with High Availability in YARN.
If your system does not use YARN, you can manually remove the failing node and then add another node to the VectorH cluster. For details, see How to Add and Remove Slave Nodes in the System Administrator Guide.
Configuring VectorH with High Availability in YARN
To configure VectorH with high availability in YARN, use the following settings in YARN and the DBMS.
Recommended timeout settings are as follows:
core-site.xml:
<property>
    <name>ipc.client.connect.max.retries</name>
    <description>
        Indicates the number of retries a client will make to establish a server connection.
    </description>
    <value>3</value>
</property>
<property>
    <name>ipc.client.connect.max.retries.on.timeouts</name>
    <description>
        Indicates the number of retries a client will make on socket timeout to establish a server connection.
    </description>
    <value>3</value>
</property>
yarn-site.xml:
<property>
    <name>yarn.client.nodemanager-connect.max-wait-ms</name>
    <description>
        Max time to wait to establish a connection to the NodeManager.
    </description>
    <value>50000</value>
</property>
<property>
    <name>yarn.client.nodemanager-connect.retry-interval-ms</name>
    <description>
        Time interval between each attempt to connect to the NodeManager.
    </description>
    <value>10000</value>
</property>
Vector SQL settings:
SET SESSION WITH ON_ERROR = ROLLBACK TRANSACTION
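ON_ERROR is a session-level setting, so each connection that should survive a node failure must issue it. For example, from the sql terminal monitor (the database name is a placeholder):
$ sql mydb
SET SESSION WITH ON_ERROR = ROLLBACK TRANSACTION;\g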
Slave Node Failover Behavior
When a slave node fails (for example, because of a hardware crash), all of its X100 back-end processes terminate at once, causing the current transaction to fail and ending the TCP session with the front end. Because of the ON_ERROR setting above (and assuming the failed node is not the master node), the DBMS server becomes aware of the X100 node failure and tries to restart the X100 cluster. If yarn.enabled is set to true, this restart goes through the DbAgent component, which handles the failover and restarts the slave set, on a reduced (or same-size) set of nodes, for the next SQL statement. (For more on the DbAgent, see the System Administrator Guide.)
High Availability for the DbAgent
After a failure, the DbAgent waits for the X100 back-end processes to terminate and then releases all pre-allocated cluster resources. NodeManagers that did not respond to the ResourceManager's container shutdown/stop signals are marked as unresponsive, and this information is saved in the partition assignment metadata. When a subsequent X100 restart is triggered by a new SQL statement and a container is not allocated, the DbAgent checks whether the offending nodes are on that unresponsive list. Because the NodeManagers would likely still appear alive to YARN (yarn.am.liveness-monitor.expiry-interval-ms is set to ten minutes in most distributions), the application master, that is, the YARN-submitted job, is failed immediately. The DbAgent handles that failure internally, removes the unresponsive nodes from the slave set during the re-initialization phase, and then retries or restarts the application master. The entire process should be transparent to the user.
Limitations of VectorH High Availability in YARN
Currently, measured from the time the next SQL statement is attempted, recovery takes about 25 seconds (excluding X100's log read time), regardless of what YARN knows about the states of the NodeManagers. On that basis, and given the recommended timeout values above, the total recovery time, including the failure detection time and X100's log read time, should be about 1 minute +/- 5 seconds.
Database Recovery
Full and incremental backup and restore of the database is available through the ckpdb and rollforwarddb operations.
For more information, see the chapter “Backing Up and Restoring the Database.”