Hadoop Versions and Distributions
Hadoop is an ecosystem of software that provides services for the distributed processing of data. One of its core services is HDFS, a scalable and fault-tolerant distributed file system. If you want to run DataFlow in a distributed cluster, we recommend that you use a distributed file system such as HDFS.
The following distributions and versions of Hadoop are supported for use with DataFlow:
• Apache Hadoop 2.2 and 3.1
• Hortonworks distribution (HDP) version 2.3 to 2.6.5
• Cloudera’s distribution (CDH) version 4.2 up to version 5.15
• MapR distribution version 5.2 or 6.1
HBase Version
DataFlow provides both a reader and writer for accessing HBase, a scalable database built using Hadoop. The HBase support in DataFlow works with:
• Apache HBase distributed with CDH version 4.2 and later
• Hortonworks HBase distributed with HDP version 2.0 and later
Hive Version
DataFlow provides the following readers and writers for Hive: the ORCReader, ORCWriter, and ParquetReader operators.