Integrating DataFlow with Apache HBase
The datarush-hadoop-apache2 module supports HBase 1.1.2 in Hortonworks HDP 2.6.5.
On an HDP 2.6.5 cluster, DataFlow supports only HBase version 1.1.2.
To support HBase with HDP, manually add the HBase library path to the yarn.application.classpath and mapreduce.application.classpath properties. For example:
/usr/hdp/2.6.5.0-292/hbase/lib/*
The datarush-hadoop-apache3 module supports HBase 2.0.0 in Hortonworks HDP 3.0.1.
To support HBase with HDP, manually add the HBase library path to the yarn.application.classpath and mapreduce.application.classpath properties. For example:
/usr/hdp/3.0.1.0-187/hbase/lib/*
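These entries are appended to the existing classpath values in yarn-site.xml and mapred-site.xml (or through the cluster manager, such as Ambari). A minimal sketch for the HDP 3.0.1 case, assuming the default classpath entries already on the cluster are kept; for HDP 2.6.5, substitute /usr/hdp/2.6.5.0-292/hbase/lib/*:

<property>
  <name>yarn.application.classpath</name>
  <value>{existing entries},/usr/hdp/3.0.1.0-187/hbase/lib/*</value>
</property>
<property>
  <name>mapreduce.application.classpath</name>
  <value>{existing entries},/usr/hdp/3.0.1.0-187/hbase/lib/*</value>
</property>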
There is no mechanism to create a table through HBase Writer, and some tuning is necessary to get the desired types of data read from and written to tables. Do not insert data through HBase Shell: when data is written for the first time using HBase Writer, the writer determines the data types and stores that metadata.
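Because HBase Writer cannot create tables, the target table must exist before the job runs. It can be created up front in HBase Shell; the table and column family names below are placeholders:

create 'example_table', 'cf1'

Creating the table through the shell is fine; only inserting data through the shell should be avoided, because those rows will carry no type metadata.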
1. Read data from CSV and write it to an HBase table:
a. Add HBase Writer to write the data to the HBase table.
b. Connect to a cluster.
c. Get the schema of a table you want to write into.
d. Add the desired column qualifiers.
e. (Optional) Check or uncheck the Row ID.
f. Submit the job for execution.
The data from the CSV file is written into the HBase table. Verify it by running the scan '<table name>' command in HBase Shell on the name node.
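For example, assuming the placeholder table name used above, the first few rows can be inspected with:

scan 'example_table', {LIMIT => 5}

Each cell is listed with its row key, column (family:qualifier), timestamp, and value.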
2. Read data from HBase and write it into CSV:
a. Configure HBase Reader.
b. In Delimited Text Writer, edit the schema to remove the Row ID.
Because the Row ID is binary data and is currently a mandatory field in HBase Reader, it must be omitted manually at the destination end.
c. Submit the job.
The job reads the data from the HBase table and writes it to the CSV file.
You can check or uncheck the Row ID field only in HBase Writer. In HBase Reader it is a mandatory field and cannot be unchecked because the reader retrieves the complete data set from the HBase table. Instead, include or exclude the field in the destination operator as required.
If HBase Writer was not used to insert the data, the table may not include any usable type information. In this case, the correct behavior is for the data to be read as binary. You are responsible for any required conversions.
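For example, if the values were encoded with the standard HBase Bytes utility by another client, they can be decoded the same way. A minimal sketch in Java (requires hbase-common on the classpath; the sample fields and types are illustrative, not taken from this documentation):

import org.apache.hadoop.hbase.util.Bytes;

public class BinaryConversions {
    public static void main(String[] args) {
        // Raw cell values as they would come back from a binary read
        // (encoded here so the example is self-contained).
        byte[] rawName = Bytes.toBytes("sensor-42");
        byte[] rawReading = Bytes.toBytes(21.5d);
        byte[] rawCount = Bytes.toBytes(7);

        // The bytes carry no type information, so the caller must know
        // which type each qualifier holds before decoding.
        String name = Bytes.toString(rawName);
        double reading = Bytes.toDouble(rawReading);
        int count = Bytes.toInt(rawCount);

        System.out.println(name + " " + reading + " " + count);
    }
}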