HBase Operators
The DataFlow operator library includes several pre-built input/output operators. This section covers the HBase operators and provides details on how to use them. It includes the following topics: DeleteHBase Operator, ReadHBase Operator, and WriteHBase Operator.
DeleteHBase Operator
The DeleteHBase operator writes delete markers to a table in HBase.
A DeleteMarker can be mapped to a specific HBase cell, or it can be mapped to a column family as a subtable:
1. mapCellMarker(java.lang.String, java.lang.String, com.pervasive.datarush.hbase.DeleteHBase.DeleteMarker) – Map a DeleteMarker to a specific cell within a column family. A row key field is required to uniquely identify cells. A DeleteMarker will be inserted for each mapped cell qualifier, for each input record.
2. mapFamilyMarker(java.lang.String, com.pervasive.datarush.hbase.DeleteHBase.DeleteMarker) – Map a DeleteMarker to a column family. If mapping DeleteMarker.DeleteFamily, only a row key field is required; otherwise both a row key field and a qualifier key field are required to uniquely identify cells. A single DeleteMarker will be inserted for each input record.
A time key field can optionally be specified to allow the user to provide a timestamp value as part of the input record. If a time key field is not specified, each record defaults to the current time. If mapping DeleteMarker.Delete to delete a specific version of a cell, a time key field is required to uniquely identify cell versions. DeleteMarkers that delete past versions will use the time key field if present, or default to the current time.
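For example, the following sketch writes one family-level marker per input record. Note that setTimeFieldName is an assumed accessor name for the optional time key property (mirroring setRowFieldName), so verify it against the properties table:
DeleteHBase delete = graph.add(new DeleteHBase());
delete.setTableName("table");
delete.setRowFieldName("rowkey");
// Assumed accessor: provides per-record timestamps instead of the current time
delete.setTimeFieldName("timestamp");
// DeleteMarker.DeleteFamily requires only a row key field
delete.mapFamilyMarker("family1", DeleteMarker.DeleteFamily);
graph.connect(data.getOutput(), delete.getInput());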
The input will be repartitioned using HBase table region row key ranges. Each partition will sort its DeleteMarkers in row-key ascending, qualifier-key ascending, time-key descending order, and then write the DeleteMarkers to the appropriate regions.
If the specified HBase table does not exist, it will be created. The number of regions created will be the maximum of 4 and the level of parallelism.
Code Example
The following example demonstrates using the DeleteHBase operator to delete cells in a table in HBase.
Using the DeleteHBase Operator in Java
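// Write DeleteColumn markers for three mapped cells, keyed by the rowkey field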
DeleteHBase delete = graph.add(new DeleteHBase());
delete.setTableName("table");
delete.setRowFieldName("rowkey");
delete.mapCellMarker("family1", "qualifier1", DeleteMarker.DeleteColumn);
delete.mapCellMarker("family1", "qualifier2", DeleteMarker.DeleteColumn);
delete.mapCellMarker("family2", "qualifier1", DeleteMarker.DeleteColumn);
graph.connect(data.getOutput(), delete.getInput());
Using the DeleteHBase Operator in RushScript
var markerMap = {
    'family1':{'qualifier1':'DeleteColumn', 'qualifier2':'DeleteColumn'},
    'family2':{'qualifier1':'DeleteColumn'}
};
var data = dr.deleteHBase(
    data,
    {
        tableName:'table',
        rowFieldName:'rowkey',
        cellMarkerMap:markerMap
    });
Properties
The following table provides the properties supported by the DeleteHBase operator.
Ports
The DeleteHBase operator provides a single input port.
ReadHBase Operator
The ReadHBase operator is used to read a result set from a table in HBase. The result set is specified by various field mappings and can be filtered to only return a specific time range.
When running with parallelism enabled, each partition reads its assigned regions one region at a time in row-key order. This guarantees that records within the same region will not be split across partitions, and that each partition's output is returned in row-key, qualifier-key order.
Specifying Field Mappings
Because of the nature of NoSQL databases, a given table in HBase may not have a single schema that applies across all of its rows. Therefore, the field mappings used when reading from HBase must be specified after adding the operator to a logical graph.
Within HBase a cell is uniquely identified by a tuple consisting of {row, column, timestamp}. The cell at a particular row and column with the latest timestamp is considered the latest version of a given cell. Since HBase stores multiple previous versions of a given cell, the default behavior of the operator returns only the latest version of any retrieved cell. Cell versions can be filtered by specifying the time range with the associated properties.
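As an illustrative sketch only, restricting a read to a window of cell versions might look like the following; the setter names here (setTimeRangeStart, setTimeRangeEnd) are assumptions, so consult the operator's properties table for the actual time-range accessors:
ReadHBase reader = graph.add(new ReadHBase());
reader.setTableName("HBaseTable");
// Hypothetical accessors: return only cell versions written within the range
reader.setTimeRangeStart(startMillis);
reader.setTimeRangeEnd(endMillis);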
Because HBase by default stores all of its data as pure binary, the ReadHBase operator uses a catalog table to keep track of the data types that the raw data should be converted into. The conversion methods for specific DataFlow types are listed in the following table.
All other data types will be serialized or deserialized using the default DataFlow formats. If the entry for a particular table is missing in the catalog, the data will be returned as byte arrays. This allows the user to manually apply any required conversions.
Since HBase is a column-oriented database, the column portion of the index key consists of two parts: a column family and a column qualifier. The column family identifies one of several families that are determined when the table is initially created. Column families provide a way to logically and physically partition columns into groups such that the cells associated with a particular family are stored together in the same files on disk. The column qualifier uniquely identifies a cell and its previous versions within a column family.
DataFlow fields can be mapped to HBase cells in one of two ways:
• mapCell()—Map individual cells within a column family. Cells within a family can be of heterogeneous types, and only the mapped cells are accessed. The mapped fields will be together in a single record. An optional row-key field can be specified to uniquely identify each DataFlow record.
• mapFamily()—Map all cells within a column family as a subtable. All cells within a family are of homogeneous types, and all cells are accessed. Each cell within a family is contained in an individual record. Optional row-key and qualifier-key fields can be specified to uniquely identify each DataFlow record.
In both cases a cell can contain a single field or a record of fields. Mapping a single cell as a record of fields allows multiple fields to be packed into a single cell, greatly increasing I/O performance at the expense of reduced version granularity. All fields packed together in a single cell are versioned together, and therefore all fields must be present when writing. The default DataFlow serialization is used exclusively in this case.
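To make the subtable mapping concrete, a sketch of the mapFamily() style follows; the mapFamily() signature and the qualifier-key accessor shown here are assumptions based on the surrounding examples, not the definitive API:
ReadHBase reader = graph.add(new ReadHBase());
reader.setTableName("HBaseTable");
reader.setRowFieldName("rowkey");
// Assumed accessor for the optional qualifier-key field
reader.setQualifierFieldName("qualifier");
// Assumed signature: emits one record per cell, covering every qualifier in the family
reader.mapFamily("family1", "value");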
Using HCatalog
Field mappings can also be derived from a schema stored in HCatalog. If a table's schema is stored in HCatalog, the ReadHBase operator only needs to be directed to this HCatalog table; the HBase table name and field mappings are not required. Individual cells within a column family will be mapped, as defined in the HCatalog table mapping. The resulting DataFlow fields will have the same names as the fields in HCatalog.
To use HCatalog with the ReadHBase operator, specify the HCatalog database name and HCatalog table name where the schema is stored, rather than specifying an HBase table name and field mapping. Optionally, specify a list of fields, as named in HCatalog, to read from the table; if this list is not specified, all fields mapped in HCatalog will be read from the table.
Hadoop Configuration Properties
An HBase cluster has the following minimum connection configuration properties:
fs.default.name
The HDFS URL (such as hdfs://headnode:8020)
hbase.rootdir
The root directory where HBase data is stored (such as hdfs://headnode:8020/hbase)
hbase.zookeeper.quorum
The nodes running ZooKeeper
hbase.zookeeper.property.clientPort
The ZooKeeper client port
Additionally, other properties may occasionally need to be defined:
hive.metastore.uris
The Hive Metastore URI, if using HCatalog (such as thrift://headnode:9083)
zookeeper.znode.parent
The parent znode in ZooKeeper used by HBase. This should be defined if the default value of /hbase is not being used by your cluster. (For some Hortonworks clusters, you may need to define this property with a value of /hbase-unsecure.)
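How these properties reach the operators depends on your deployment; most commonly they are picked up from the cluster configuration files on the classpath. For reference, a minimal hbase-site.xml fragment carrying the HBase-specific values above might look like this (host names and port values are placeholders):
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://headnode:8020/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1,zk2,zk3</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>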
Code Example
The following example demonstrates using the ReadHBase operator to read an entire table in HBase. It uses the typical cell-level field mapping to read the table.
Using the ReadHBase Operator in Java
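// Read three mapped cells per row, exposing the row key as the rowkey field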
ReadHBase reader = graph.add(new ReadHBase());
reader.setTableName("HBaseTable");
reader.setRowFieldName("rowkey");
reader.mapCell("family1", "qualifier1", "data1");
reader.mapCell("family1", "qualifier2", "data2");
reader.mapCell("family2", "qualifier1", "data3");
graph.connect(reader.getOutput(), operator.getInput());
Using the ReadHBase Operator in RushScript
var data = dr.readHBase({tableName:'HBaseTable', rowFieldName:'rowkey'});
The following example demonstrates using the ReadHBase operator with HCatalog.
Using the ReadHBase Operator with HCatalog in Java
ReadHBase reader = graph.add(new ReadHBase());
reader.setHCatalogDatabase("HCatDatabase");
reader.setHCatalogTable("HCatTable");
graph.connect(reader.getOutput(), operator.getInput());
Using the ReadHBase Operator with HCatalog in RushScript
var data = dr.readHBase({hcatalogDatabase:'HCatDatabase', hcatalogTable:'HCatTable'});
Properties
The ReadHBase operator supports the following properties.
Ports
The ReadHBase operator provides a single output port.
WriteHBase Operator
The WriteHBase operator is used to write a result set to a table in HBase. If the target table for the write does not exist, it will be created automatically and an entry added to the catalog table to keep track of the table's schema. The input set, and how it should be mapped to a table in HBase, is specified by various field mappings defined within the operator. For more information on how to specify the HBase operators' field mappings, see Specifying Field Mappings under ReadHBase Operator.
When writing to a table in HBase the default behavior will generate a unique binary row key for every inserted record. The user may optionally specify a row key input field within the data set, in which case the input will be repartitioned using HBase table region row-key ranges. Each partition will individually sort blocks of rows using the row and qualifier keys and region boundaries, and will write the rows to the appropriate regions. If the row key is not specified, the input will be written to regions local to the partition and no repartitioning will be performed. A row key input field must be one of the following types to allow serialization to be performed:
• TokenTypeConstant.LONG
• TokenTypeConstant.STRING
• TokenTypeConstant.BINARY
A unique qualifier key will also be generated for each record if the user maps a family as a subtable and does not specify a qualifier key input field. The qualifier key is generated in a similar manner as the row key, and has the same type restrictions.
A timestamp field can optionally be specified to allow the user to provide a timestamp value as part of the input record. If a timestamp field is not specified, then each record will default to the current time. In both cases the timestamp value is narrowed to millisecond resolution to match HBase and will be advanced slightly to uniquely identify cells with duplicate row/qualifier keys and timestamp values. This ensures the operator is tolerant of duplicate cell versions in the input stream as long as they occur infrequently. Importing large numbers of duplicate cell versions in a short amount of time (more than thousands of duplicates per second) may result in significant time skew to maintain uniqueness.
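For instance, supplying explicit row keys and timestamps could look like the sketch below. setRowFieldName mirrors the other HBase operators, while setTimeFieldName is an assumed accessor name for the timestamp property:
WriteHBase writer = graph.add(new WriteHBase());
writer.setTableName("data");
// STRING, LONG, or BINARY field; input is repartitioned by region row-key ranges
writer.setRowFieldName("rowkey");
// Assumed accessor; without it, every record is stamped with the current time
writer.setTimeFieldName("eventTime");
writer.mapCell("family1", "qualifier1", "data1");
graph.connect(data.getOutput(), writer.getInput());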
Using HCatalog
The WriteHBase operator can use HCatalog in two ways.
If the target table is already defined in HCatalog then, as with the ReadHBase operator, no mapping needs to be defined; it is read from HCatalog instead. Define the HCatalog database and table name. Optionally, list the fields you would like to write to the table; if this property is not defined, all fields will be written.
If the target table has not yet been defined in HCatalog, a mapping must be provided. By also providing an HCatalog database and table name, a new entry for the table will be created in HCatalog, which can then be used for subsequent reads and writes of the table without manually specifying a mapping. If a list of fields is specified, only those fields will be added to the HCatalog schema; otherwise, all mapped fields are included.
The HCatalog field names will be matched to the DataFlow field names. These field names must be valid for HCatalog, containing only lowercase letters and numeric characters. If necessary, the DeriveFields Operator can be used to rename input fields to meet these requirements.
Hadoop Configuration Properties
The WriteHBase operator requires the same minimum connection configuration properties as the ReadHBase operator. See Hadoop Configuration Properties under ReadHBase Operator.
Code Example
The following example demonstrates using the WriteHBase operator to write a table in HBase. It uses the typical cell-level field mapping to write the table.
Using the WriteHBase Operator in Java
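// Row keys are generated automatically because no row key field is specified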
WriteHBase writer = graph.add(new WriteHBase());
writer.setTableName("data");
writer.mapCell("family1", "qualifier1", "data1");
writer.mapCell("family1", "qualifier2", "data2");
writer.mapCell("family2", "qualifier1", "data3");
graph.connect(data.getOutput(), writer.getInput());
The following example demonstrates using the WriteHBase operator to write to a new HCatalog table.
Using the WriteHBase Operator with a New HCatalog Table
WriteHBase writer = graph.add(new WriteHBase());
writer.setTableName("data");
writer.mapCell("family1", "qualifier1", "data1");
writer.mapCell("family1", "qualifier2", "data2");
writer.mapCell("family2", "qualifier1", "data3");
writer.setHCatalogDatabase("hCatDatabase");
writer.setHCatalogTable("hCatTable");
graph.connect(data.getOutput(), writer.getInput());
The following example demonstrates using the WriteHBase operator to write to an existing HCatalog table.
Using the WriteHBase Operator with an Existing HCatalog Table
WriteHBase writer = graph.add(new WriteHBase());
writer.setHCatalogDatabase("hCatDatabase");
writer.setHCatalogTable("hCatTable");
graph.connect(data.getOutput(), writer.getInput());
Properties
The WriteHBase operator supports the following properties.
Ports
The WriteHBase operator provides a single input port.