Managing Cluster Resources
How VectorH Interacts with YARN
VectorH interacts with YARN through a custom implementation of the YARN Client API, named DbAgent.
The DbAgent allocates the resources required by VectorH (memory and cores per container) in a YARN environment. It then starts the VectorH backend processes on a specific set of slave nodes, which is passed to it as a list of parameters.
During startup, the required set of cluster resources is registered and resource amounts are computed based on the vectorwise.conf file. If YARN is not present, the Vector Server starts without the DbAgent.
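The current set of slave nodes is listed in the slaves file, which the DbAgent can update at startup when ii.hostname.x100.yarn.update_slaves is set to true (see Configure YARN Integration). As a quick sanity check, assuming the standard installation layout, you can list the nodes the backend processes run on:
cat $II_SYSTEM/ingres/files/hdfs/slaves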
Errors and warnings are logged to the following files:
• vector-agent.log
• vector-wset-appmaster.log
• vector-workload-appmaster.log (if preemption support is enabled)
Note: Typically (although it depends on total resources) you can work on only one database if YARN is enabled, because the DbAgent allocates 75% of the operating system resources to the first X100 server. If an attempt is made to create or connect to another database, the X100 process stops running and the warning message "Not enough memory" is written to the vector-agent.log file.
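If a second database fails to start for this reason, the quickest check is the agent log. A minimal sketch, assuming the log files are written under $II_SYSTEM/ingres/files (adjust the path to your installation):
# Show recent DbAgent activity and search for the resource warning
tail -n 100 $II_SYSTEM/ingres/files/vector-agent.log
grep -i "not enough memory" $II_SYSTEM/ingres/files/vector-agent.log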
YARN Required Settings
The following YARN settings are required.
yarn-site.xml
<property>
<name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
<value>false</value>
<description>
This property indicates to the ResourceManager, as well as to clients,
whether or not the Generic History Service (GHS) is enabled.
If the GHS is enabled, the ResourceManager begins recording
historical data that the GHS can consume, and clients can
redirect to the GHS when applications finish running.
</description>
</property>
capacity-scheduler.xml
In Hadoop version 2.4 (and older), the CapacityScheduler by default uses the DefaultResourceCalculator, which takes into account only the memory dimension when computing the number of available containers; virtual cores are ignored. We recommend using the DominantResourceCalculator instead.
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
<description>
The ResourceCalculator implementation to be used to compare resources in the
scheduler. The default (i.e., DefaultResourceCalculator) only uses memory
while the DominantResourceCalculator uses dominant-resource to compare
multi-dimensional resources such as memory, CPU etc.
</description>
</property>
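After editing capacity-scheduler.xml, the scheduler configuration can usually be reloaded without restarting the ResourceManager; the exact procedure depends on your distribution, and some changes may still require a ResourceManager restart. For example:
yarn rmadmin -refreshQueues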
Actian Service ID
The installer creates actian as a service ID (system account) to reduce the chance of a conflict on one or more of the other nodes in the cluster. Service accounts have UIDs below 1000 (below 500 on some systems), which YARN by default does not allow to run containers.
To stop YARN complaining about the service ID, add the following line to container-executor.cfg under HADOOP_CONF_DIR:
allowed.system.users=actian
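For example, assuming HADOOP_CONF_DIR points at the active configuration directory and container-executor.cfg is root-owned (typical on most distributions):
id -u actian    # confirm the service account UID is in the system range
echo "allowed.system.users=actian" | sudo tee -a $HADOOP_CONF_DIR/container-executor.cfg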
Additional YARN Settings when Using Kerberos
To work around YARN-2892, an "actian" (short-name) proxy user must be created. The following configuration snippet allows the user "actian" (which can be the Kerberos principal actian@domain) to impersonate the local user "actian" by its short name.
core-site.xml
<property>
<name>hadoop.proxyuser.actian.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.actian.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.actian.users</name>
<value>actian</value>
</property>
Also, make sure that YARN is configured to support long-running applications. By default, any delegation token has a maximum lifetime of seven days. The following properties allow the YARN ResourceManager to request new tokens when the existing ones are past their maximum lifetime. YARN is then able to continue performing localization and log aggregation on behalf of the hdfs user.
yarn-site.xml
<property>
<name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
<value>true</value>
</property>
core-site.xml
<property>
<name>hadoop.proxyuser.yarn.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.yarn.groups</name>
<value>*</value>
</property>
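After changing the proxy-user properties, most distributions let you reload them without restarting the services; otherwise, restart the NameNode and ResourceManager. For example:
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration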
Configure YARN Integration
By default, integration with Apache Hadoop YARN is disabled. During installation, you are asked if you want to enable it.
You can configure integration with YARN by using the iisetres command to change the following parameters in config.dat:
ii.hostname.x100.yarn.enable
Enables or disables YARN integration.
Valid values: true and false
Default: false
ii.hostname.x100.yarn.amqueue
Specifies the YARN queue for VectorH. The DbAgent (the API through which VectorH interacts with YARN) uses this setting to acquire cluster resources from a specific queue, which must already be set up.
Valid values: the name of a YARN queue
Default: default (the default YARN queue)
ii.hostname.x100.yarn.ampriority
Specifies the DbAgent's priority for getting the VectorH required resources from YARN.
Valid values: 0 (low priority) – 10 (high priority)
Default: 10
ii.hostname.x100.yarn.update_slaves
Specifies whether YARN should update the slaves file on X100 server startup. Set this parameter to "false" when running VectorH on a subset of nodes in a Hadoop cluster with YARN enabled.
Valid values: true and false
Default: false
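For example, the following commands enable YARN integration and point the DbAgent at a dedicated queue. Replace hostname with your configuration host name; the queue name vectorh is illustrative and must already exist. The iigetres command verifies the stored value.
iisetres ii.hostname.x100.yarn.enable true
iisetres ii.hostname.x100.yarn.amqueue vectorh
iigetres ii.hostname.x100.yarn.enable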
Configure Resource Allocation in the Cluster
Cluster resources are allocated statically at startup time for the lifetime of the Vector Server. After the database is shut down, the DbAgent terminates and deallocates the resources (killing all YARN containers registered for the current database).
To influence the amount of resources allocated, you can modify the resource-related parameters in vectorwise.conf, such as [system] num_cores, [memory] max_memory_size, and [cbm] bufferpool_size (see the example under Configuring VectorH for Preemption Support).
How the Set of Slave Nodes Is Determined
The DbAgent determines the set of slave nodes by selecting the best num_nodes nodes (N) from the given list of DataNodes (M), based on their resource capabilities (such as CPU and memory) and on any local data (VectorH columnar files stored in HDFS).
DbAgent handles the process transparently, as follows:
1. Asks YARN for the cluster node reports
2. Queries the HDFS NameNode to find out where the VectorH files are located, assuming some data is already loaded into the system
3. Sorts the list of M nodes in descending order by cluster resources and local data size, with priority on the latter
4. Selects the top N
Initially, when no data is stored in the system, this strategy determines a set of (more or less) homogeneous nodes, which is preferable for load-balancing query workloads.
After data is loaded into the database, the VectorH slave set is kept the same across system restarts because the data resides on those nodes. If a restart is due to a system failure (node crash) and your VectorH version includes HDFS Custom Block Placement support, the missing replicas are identified from the data-locality information and are re-replicated to other (possibly new) slave nodes. In that case the slave set also adjusts automatically to a new set of slaves, chosen from the remaining DataNodes that are still running; you do not have to modify the configuration files manually to remove the failed nodes.
Note: If num_nodes is less than the number of input DataNodes, the slave set tries to maintain its size by replacing a failing node with another node from the remaining DataNodes. If num_nodes is equal to the number of input DataNodes, the slave set simply becomes smaller.
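To see roughly the same inputs the DbAgent works from, you can list the YARN node reports and the HDFS block locations of the VectorH files yourself (the HDFS path below is a placeholder for your database location):
yarn node -list -all
hdfs fsck /path/to/hdfs/data/location -blocks -locations -files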
Preemption Support
Preemption support allows YARN to preempt a VectorH job with a higher priority job.
This feature is important in a scenario where a queue has a defined level of cluster resources, but must wait to run applications because other queues are using all of its available resources.
If YARN preemption is enabled, a higher-priority application does not have to wait for a lower priority application (such as a VectorH query) that has taken up the available capacity. An under-served queue can begin to claim its allocated cluster resources almost immediately, without having to wait for an application in another queue to finish running.
How VectorH Preemption Support Works
After the slave nodes are started, VectorH preemption support in YARN works as follows:
1. The DbAgent starts a second application master, the WorkloadAppMaster, which handles YARN container preemption.
2. The WorkloadAppMaster allocates a fraction of the VectorH cluster resources (cores and query execution memory) and requests the needed containers with a lower priority.
3. YARN preemption acts on this set of resources first, not on the WsetApplicationMaster's core set of resources, which is required for the entire lifespan of the database.
Configuring VectorH for Preemption Support
To configure VectorH to use preemption, set the following configuration parameters in the [dbagent] section of vectorwise.conf:
• required_num_cores
• required_max_memory_size
These parameters define the minimum resources VectorH is to allocate. The difference between these values and the [system] num_cores and [memory] max_memory_size values, respectively, defines the target level of performance, which is then requested in a separate lower priority container.
The second application master (WorkloadAppMaster) tries to reach the target amount of resources, given a queue with sufficient resources. The corresponding containers for the target resources may be preempted by YARN at any time (because they are allocated with lower priority) and are eventually re-allocated to VectorH as soon as the job that needed the extra resources has finished. The required amount of resources, in contrast, is requested with a higher priority (10 by default); if those containers are nevertheless preempted by YARN, the DbAgent enters a fallback procedure that ends by shutting down all Vector servers.
Note: In YARN, because containers cannot have only core resources or only memory resources, both parameters must be set for the DbAgent to correctly compute the target amount of resources and start the subsequent application master (WorkloadAppMaster).
For example, the following settings in vectorwise.conf tell the WsetApplicationMaster to request 6 GB and 4 cores (required_max_memory_size plus bufferpool_size, and required_num_cores), and the WorkloadAppMaster to request a further 2 GB and 2 more cores (max_memory_size minus required_max_memory_size, and num_cores minus required_num_cores).
[dbagent]
required_num_cores=4
required_max_memory_size=4G
[system]
num_cores=6
[memory]
max_memory_size=6G
[cbm]
bufferpool_size=2G
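With these settings, after startup you can confirm that YARN accepted both application masters; a minimal check (the application names reported depend on your version) is:
yarn application -list -appStates RUNNING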
Preemption Support Limitations
Running two or more VectorH clusters at the same time is not supported if their entire amount of resources cannot be allocated, even when VectorH has YARN preemption enabled to displace containers from other queues. During preemption, YARN ignores the node locations from which VectorH wants resources to be allocated and may offer containers on different nodes than expected. The major consequence is that a subsequent VectorH cluster may not get the required one container per node and so cannot continue starting the Vector servers. This issue does not occur if the required amount of resources is idle at the time VectorH attempts to allocate the containers.
YARN Configuration Settings
The following YARN configuration settings are relevant for VectorH.
Memory related settings in YARN are as follows:
yarn.nodemanager.resource.memory-mb
Amount of physical memory per NodeManager, in MB, that can be allocated for containers.
yarn.scheduler.minimum-allocation-mb
The minimum allocation for every container request at the ResourceManager, in MB. Memory requests lower than the specified value will not take effect.
yarn.scheduler.maximum-allocation-mb
The maximum allocation for every container request at the ResourceManager, in MB. Memory requests higher than the specified value will not take effect.
CPU related settings in YARN are as follows:
yarn.nodemanager.resource.cpu-vcores
Number of CPU cores per NodeManager that can be allocated for containers.
yarn.scheduler.minimum-allocation-vcores
The minimum allocation for every container request at the ResourceManager, in terms of virtual CPU cores. Requests lower than the specified value will not take effect.
yarn.scheduler.maximum-allocation-vcores
The maximum allocation for every container request at the ResourceManager, in terms of virtual CPU cores. Requests higher than the specified value will not take effect.
We recommend keeping yarn.scheduler.minimum-allocation-mb constant (1024 MB, for example) and setting yarn.scheduler.maximum-allocation-mb to the same value as yarn.nodemanager.resource.memory-mb. These settings allow applications to acquire memory anywhere within the minimum-maximum range, so a YARN application that wants to acquire one large container per node, as Vector does, can do so. If multiple applications (or YARN-enabled data frameworks) run in the same cluster, the yarn.scheduler.maximum-allocation-mb property can be adjusted (for example, to 40% of yarn.nodemanager.resource.memory-mb). In this case, however, the portion of (max_memory_size + bufferpool_size) that exceeds yarn.scheduler.minimum-allocation-mb is out of band from YARN, which can increase resource contention in the cluster. The same applies to the YARN vcores settings and the Vector num_cores configuration option.
Resource Manager Schedulers
yarn.resourcemanager.scheduler.class
Specifies the scheduler implementation used by the ResourceManager. The following schedulers are relevant for VectorH:
1. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
yarn.scheduler.fair.assignmultiple
Whether to allow multiple container assignments in one heartbeat. Defaults to false. We recommend true instead.
yarn.scheduler.fair.user-as-default-queue
Whether to use the username associated with the allocation as the default queue name, in the event that a queue name is not specified. If this is set to false or unset, all jobs have a shared default queue named "default". The default is "true". If a queue placement policy is given in the allocations file, this property is ignored.
How to Add and Remove Slave Nodes
You can add nodes to or remove nodes from a VectorH cluster by using a manual process.
Note: The master node should not be removed.
Adding or removing nodes requires a full restart and reconfiguration of the cluster, but this typically takes only a few minutes. Individual data partitions, however, are allocated uniquely and evenly to each node based on the original cluster configuration, and queries against that data are intended to execute solely on the node where the data resides. Without repartitioning the data, any additional nodes in the cluster must access the data remotely, which may introduce a bottleneck on the system and result in sub-optimal performance. This is particularly acute during "cold" runs immediately after the restart. If the [cbm] bufferpool_size is sufficiently large, the performance impact will be negligible once the workload is entirely resident in memory.
Note: The VectorH HDFS block placement policy does not help in this situation, even though it is installed and configured beforehand. It does help, however, when nodes fail and are then restarted or replaced, and only if, after the restart, the number of nodes in the slave set (that is, num_nodes) remains the same. Only in such a situation will the HDFS Block Placement program, upon calling hdfs fsck /path/to/hdfs/data/location -blocks -locations -files, handle the re-replication and co-location of the data that was lost.
Similarly, after removing nodes from the cluster, data that was resident on the removed nodes must be retrieved remotely by the remaining nodes, again leading to a performance degradation. This can be minimized after the initial run by increasing the [cbm] bufferpool_size enough so that the entire workload can still be loaded into memory.
The process for adding or removing nodes in the cluster is as follows:
Note: This process automatically updates vectorwise.conf, so you may want to make a backup copy of vectorwise.conf prior to performing the process.
Note: All steps, except Step 4, should be run on the master node.
1. Shut down VectorH and verify all processes have stopped on all nodes.
2. Edit $II_SYSTEM/ingres/files/hdfs/slaves to list all the nodes to be used in the new cluster.
3. If YARN integration is enabled, make sure the following resource in config.dat is set to false. This will prevent the slaves file from being overwritten at startup.
iisetres ii.hostname.x100.yarn.update_slaves false
4. Create the "actian" user and $II_SYSTEM directory on any new nodes.
5. Modify vectorwise.conf on the master node to reflect the new cluster configuration. Set max_parallelism to the total number of cores in the cluster. (You can do this manually or by running iisuhdfs genconfig).
Issue the following command:
iisuhdfs sync
The change in vectorwise.conf is synced to all the slave nodes.
6. Restart VectorH.
7. Repartition existing tables. For example:
MODIFY table TO RECONSTRUCT WITH PARTITION = (HASH ON key NN PARTITIONS)
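For example, assuming a database named mydb and a table named lineitem keyed on l_orderkey (names and the partition count of 48 are placeholders only), the statement can be run from the master node through the terminal monitor:
sql mydb <<'EOF'
MODIFY lineitem TO RECONSTRUCT WITH PARTITION = (HASH ON l_orderkey 48 PARTITIONS);\g
COMMIT;\g
EOF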