How the Set of Slave Nodes Is Determined
DbAgent determines the set of slave nodes by selecting the best N nodes (N = num_nodes) from the given list of M DataNodes. Nodes are ranked by their resource capabilities (such as CPU and memory) and by the amount of local data (VectorH columnar files stored in HDFS), if any.
DbAgent handles the process transparently, as follows:
1. Asks YARN for the cluster node reports
2. Queries the HDFS NameNode to find out where the VectorH files are located, assuming some data is already loaded into the system
3. Sorts the list of M nodes in descending order by cluster resources and local data size, with priority on the latter
4. Selects the top N
Initially, when no data is stored in the system, this strategy determines a set of (more or less) homogeneous nodes, which is preferable for load-balancing query workloads.
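The selection steps above can be sketched in a few lines of Python. This is only an illustration, not DbAgent's actual code: the NodeReport type, its field names, and the exact ordering of the resource fields are assumptions made for the example. The only rule taken from the description is that local data size has priority over cluster resources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeReport:
    """Hypothetical per-node summary; all field names are illustrative."""
    host: str
    local_bytes: int  # size of VectorH files stored locally in HDFS
    memory_mb: int    # memory capacity reported for the node
    vcores: int       # CPU capacity reported for the node


def select_slave_nodes(reports, num_nodes):
    """Return the top num_nodes hosts, ranked by local data, then resources."""
    ranked = sorted(
        reports,
        # Assumption: local data first, then memory, then CPU.
        key=lambda r: (r.local_bytes, r.memory_mb, r.vcores),
        reverse=True,
    )
    return [r.host for r in ranked[:num_nodes]]


reports = [
    NodeReport("node1", 0, 65536, 16),
    NodeReport("node2", 500, 32768, 8),
    NodeReport("node3", 0, 131072, 32),
    NodeReport("node4", 200, 65536, 16),
]
selected = select_slave_nodes(reports, 3)
print(selected)
```

In this sketch, node2 and node4 are chosen first because they already hold data, even though node3 has more resources. When no node holds data, the ranking falls back to resources alone.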
After data is loaded into the database, the VectorH slave set stays the same across system restarts because the data resides on those nodes. If a restart follows a system failure (node crash) and your VectorH version includes HDFS Custom Block Placement support, the missing replicas are identified from the data-locality information and re-replicated to other (possibly new) slave nodes. In that case, the slave set also adjusts automatically to a new set of slaves, based on which of the remaining DataNodes are still running. You do not have to manually modify the configuration files to remove the failed nodes.
Note: If num_nodes is less than the number of input DataNodes, the slave set tries to maintain its size by replacing a failed node with one of the remaining DataNodes. If num_nodes equals the number of input DataNodes, no spare node is available, so the slave set shrinks.
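The behavior in the note can be illustrated with a small Python sketch. The function name, its parameters, and the choice of taking spare nodes in list order are assumptions for the example. The real DbAgent presumably ranks the remaining DataNodes again, as described earlier.

```python
def adjust_slave_set(slave_set, datanodes, running, num_nodes):
    """Drop failed slaves and refill from running spare DataNodes.

    slave_set: current slave hosts (list)
    datanodes: all input DataNodes (list)
    running:   set of hosts that are still running
    """
    survivors = [h for h in slave_set if h in running]
    # Spares are running DataNodes that are not already slaves.
    spares = [h for h in datanodes if h in running and h not in survivors]
    missing = max(0, num_nodes - len(survivors))
    return survivors + spares[:missing]


datanodes = ["a", "b", "c", "d"]
running = {"a", "b", "d"}  # node "c" has crashed

# num_nodes < number of DataNodes: the failed node is replaced by a spare.
replaced = adjust_slave_set(["a", "b", "c"], datanodes, running, 3)

# num_nodes == number of DataNodes: no spare exists, so the set shrinks.
shrunk = adjust_slave_set(["a", "b", "c", "d"], datanodes, running, 4)

print(replaced, shrunk)
```

With three slaves out of four DataNodes, the crashed node "c" is replaced by "d" and the set keeps its size. With all four DataNodes in use, the set drops to three nodes.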