Hadoop Requirement
You must install and configure Hadoop before installing VectorH.
IMPORTANT!  The Hadoop NameNode and DataNode must run on separate nodes, in accordance with Hadoop best practices. VectorH works only when the NameNode is operational. If the NameNode doubles as a DataNode and the DataNode runs out of disk space because of a large volume of data, the cluster stops working.
For information about supported Hadoop distributions, see the readme.
Recommended Hadoop Settings
We recommend the following Hadoop settings:
dfs.datanode.max.transfer.threads: 4096 or higher. If your Hadoop vendor recommends a higher value, follow the vendor recommendation.
dfs.replication: A value less than the number of VectorH nodes. As of VectorH 4.2.2, the [cbm] hdfs_replication configuration setting can be used instead.
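These HDFS settings are typically defined in hdfs-site.xml (or through your cluster manager's configuration UI). A sketch of the corresponding entries, assuming an illustrative 5-node VectorH cluster so that a replication factor of 3 satisfies the rule above:

```xml
<!-- hdfs-site.xml fragment (illustrative values; adjust per vendor guidance) -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>4096</value>
</property>
<property>
  <name>dfs.replication</name>
  <!-- must be less than the number of VectorH nodes -->
  <value>3</value>
</property>
```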
If you want VectorH to integrate with YARN:
ipc.client.connect.max.retries: 3
ipc.client.connect.max.retries.on.timeouts: 3
yarn.nm.liveness-monitor.expiry-interval-ms: 10000
yarn.client.nodemanager-connect.max-wait-ms: 50000
yarn.client.nodemanager-connect.retry-interval-ms: 10000
yarn.resourcemanager.system-metrics-publisher.enabled: false
yarn.am.liveness-monitor.expiry-interval-ms: 10000
yarn.scheduler.capacity.resource-calculator: org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
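The settings above span two configuration files: the ipc.* properties are Hadoop client settings that normally belong in core-site.xml, while the yarn.* properties go in yarn-site.xml. A sketch of two representative entries from the list, in standard Hadoop property syntax:

```xml
<!-- yarn-site.xml fragment (the ipc.* settings above normally live in core-site.xml) -->
<property>
  <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
  <value>10000</value>
</property>
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```

The remaining settings in the list follow the same property/name/value pattern.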
If the yarn-site.xml file contains the property "yarn.nodemanager.remote-app-log-dir" with a value of the form hdfs://var/..., you must add the NameNode host to the hdfs URI:
yarn.nodemanager.remote-app-log-dir: hdfs://your_name_node/var/...
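In yarn-site.xml property syntax, with your_name_node standing in for your actual NameNode host, the corrected property would look like:

```xml
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>hdfs://your_name_node/var/...</value>
</property>
```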
Add the following to the yarn-site.xml file if it is missing:
yarn.resourcemanager.scheduler.class: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
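Expressed in yarn-site.xml property syntax, the entry to add would look like:

```xml
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
```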
For more on YARN integration, see Enable YARN Integration on page 51.
Last modified date: 01/26/2023