High Availability for the DbAgent
After a failure, the DbAgent waits for the X100 back-end processes to terminate and then releases all pre-allocated cluster resources. Any NodeManager that did not respond to the RM's container stop signals is marked as unresponsive, and this information is saved in the partition assignment metadata. On the subsequent X100 restart, which is triggered by a new SQL statement, if a container cannot be allocated the DbAgent checks whether the offending nodes are in that unresponsive list. If they are, YARN would likely still consider those NodeManagers alive, because yarn.nm.liveness-monitor.expiry-interval-ms is set to ten minutes in most distributions. Rather than wait for that timeout, we immediately fail the application master, that is, the YARN-submitted job. We handle that failure internally: we remove the unresponsive nodes from the slave-set during the re-initialization phase and then retry or restart the application master. All of this should be transparent to the user.
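The recovery decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DbAgent's actual code: the names PartitionMetadata, record_unresponsive and plan_restart are hypothetical, nodes are identified by plain host-name strings, and the YARN interaction itself (sending stop signals, failing and resubmitting the application master) is left out.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set, Tuple


@dataclass
class PartitionMetadata:
    """Hypothetical stand-in for the partition assignment metadata."""
    unresponsive_nodes: Set[str] = field(default_factory=set)


def record_unresponsive(metadata: PartitionMetadata,
                        signalled: Iterable[str],
                        acknowledged: Iterable[str]) -> None:
    """Mark every node that received a stop signal but never acknowledged it."""
    metadata.unresponsive_nodes |= set(signalled) - set(acknowledged)


def plan_restart(slave_set: List[str],
                 failed_allocations: Iterable[str],
                 metadata: PartitionMetadata) -> Tuple[bool, List[str]]:
    """Decide whether the application master must be failed and restarted.

    Returns (restart_am, new_slave_set). The application master is restarted
    only when a node that refused a container is already known to be
    unresponsive; in that case all unresponsive nodes are dropped from the
    slave-set, preserving the order of the remaining nodes.
    """
    offending = set(failed_allocations) & metadata.unresponsive_nodes
    if not offending:
        # Nothing points at a known-unresponsive node: keep the slave-set.
        return False, list(slave_set)
    pruned = [n for n in slave_set if n not in metadata.unresponsive_nodes]
    return True, pruned
```

For example, if node2 ignored the stop signal after the failure and a container later cannot be allocated on it, plan_restart(["node1", "node2", "node3"], ["node2"], metadata) asks for an application master restart with the slave-set reduced to node1 and node3. What happens when an allocation fails on a node that is not in the unresponsive list is not covered by the text, so the sketch simply leaves the slave-set unchanged in that case.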