Decision Tree Operators
Decision tree learning is a method that uses inductive inference to approximate a discrete-valued target function.
Decision trees, also referred to as classification trees or regression trees, can be used in decision analysis to represent decisions and decision making visually and explicitly. Decision tree learning is generally best suited to problems where the instances are represented by attribute-value pairs. In addition, if each attribute has a small number of distinct values, it is easier for the decision tree to reach a useful solution. A decision tree learner produces a model that consists of an arrangement of tests that provide an appropriate classification for an instance of data at every step in an analysis.
The decision tree implemented in DataFlow is a classification tree. A classification tree classifies a particular instance by sorting it down the tree from the root node to some leaf node, which then provides the classification of the instance.
Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute.
An instance in a specific data set is classified by starting at the root node of the tree, testing the attribute specified by this node, and then moving down the branch corresponding to the value of the attribute. This process is then repeated at the node on this branch and so on until a leaf node is reached and the instance can be classified.
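The walk described above can be sketched in a few lines of Java. This is an illustrative model of the process only, not the DataFlow API; the Node class, attribute names, and labels are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of classifying an instance by walking a decision tree.
// Not DataFlow code; all names here are hypothetical.
public class TreeWalk {
    static class Node {
        String testAttribute;                         // attribute tested here (null for a leaf)
        String label;                                 // classification if this is a leaf
        Map<String, Node> branches = new HashMap<>(); // attribute value -> child node

        static Node leaf(String label) {
            Node n = new Node();
            n.label = label;
            return n;
        }

        static Node test(String attribute) {
            Node n = new Node();
            n.testAttribute = attribute;
            return n;
        }
    }

    // Start at the root, test the node's attribute, follow the branch that
    // matches the instance's value, and repeat until a leaf is reached.
    static String classify(Node node, Map<String, String> instance) {
        while (node.testAttribute != null) {
            node = node.branches.get(instance.get(node.testAttribute));
        }
        return node.label;
    }

    public static void main(String[] args) {
        Node root = Node.test("outlook");
        root.branches.put("sunny", Node.leaf("no"));
        root.branches.put("overcast", Node.leaf("yes"));

        Map<String, String> instance = new HashMap<>();
        instance.put("outlook", "overcast");
        System.out.println(classify(root, instance)); // prints "yes"
    }
}
```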
DataFlow provides operators to produce and use decision tree models. The learner is used to determine the classification rules for a particular data set while the predictor can apply these rules to a data set. There is also an operator that can be used to prune a decision tree when the model provided is overly complex and does not generalize the data well. For more information, refer to the following topics:
DecisionTreeLearner Operator
The DecisionTreeLearner is the operator responsible for constructing a decision tree PMML model. The implementation is based primarily on Ross Quinlan’s book, C4.5: Programs for Machine Learning. Both Quinlan’s C4.5 implementation and this implementation have the following key features and limitations:
• Supports both numerical and categorical attributes
• Supports only categorical predictions
• Uses Information Gain/Gain Ratio as the measure of quality
• Missing value handling is done using fractional cases. This corresponds to the PMML "aggregateNodes" missing value strategy.
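As a point of reference, the Information Gain and Gain Ratio quality measures named above follow the standard C4.5 formulas, sketched below. The class and method names are illustrative only, not part of the DataFlow API:

```java
// Illustrative sketch of the standard C4.5 split-quality measures.
// Class distributions are given as arrays of per-class counts.
public class GainRatio {
    // Shannon entropy (base 2) of a class distribution.
    static double entropy(int[] classCounts) {
        double total = 0;
        for (int c : classCounts) total += c;
        double h = 0.0;
        for (int c : classCounts) {
            if (c > 0) {
                double p = c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    // Information gain: parent entropy minus the size-weighted
    // entropy of the children produced by the split.
    static double informationGain(int[] parent, int[][] children) {
        double total = 0;
        for (int c : parent) total += c;
        double weighted = 0.0;
        for (int[] child : children) {
            double size = 0;
            for (int c : child) size += c;
            weighted += (size / total) * entropy(child);
        }
        return entropy(parent) - weighted;
    }

    // Gain ratio: information gain normalized by the entropy of the
    // split itself, penalizing attributes with many distinct values.
    static double gainRatio(int[] parent, int[][] children) {
        double total = 0;
        for (int c : parent) total += c;
        double splitInfo = 0.0;
        for (int[] child : children) {
            double size = 0;
            for (int c : child) size += c;
            if (size > 0) {
                double p = size / total;
                splitInfo -= p * (Math.log(p) / Math.log(2));
            }
        }
        return splitInfo == 0 ? 0 : informationGain(parent, children) / splitInfo;
    }

    public static void main(String[] args) {
        // Classic weather example: 9 positive, 5 negative instances,
        // split three ways on a hypothetical "outlook" attribute.
        int[] parent = {9, 5};
        int[][] children = {{2, 3}, {4, 0}, {3, 2}};
        System.out.println("gain ratio: " + gainRatio(parent, children));
    }
}
```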
The following are differences between this implementation and Quinlan’s C4.5 implementation:
• Parallel/distributed implementation
• Scales to data sets that are too large for memory
• Does not support C4.5 rule generation. C4.5 is a software distribution that includes several executables; the primary focus here is the decision tree itself.
• Does not support "sub-tree raising" as part of the pruning strategy. This adds substantial processing time and is of arguable benefit.
• Currently limited to a single tree; there is no support for automatic cross-validation and tree selection
From a scaling and parallelism standpoint, the implementation is based on the following papers:
Memory Requirements
At a minimum, we require 13 bytes of RAM per row of data in order to support the row mapping data structure. This is distributed throughout the cluster, so if we have 10 nodes in the cluster and 100 million rows of data, we require 13*100 million/10 = 130 MB of RAM per node.
If the data set contains null values, this minimum memory requirement may be larger, as an extra n*12+12 bytes of bookkeeping is required for each row that must be split between child nodes, where n is the number of children of the split.
If the inMemoryDataset option is used, the memory requirements are much larger, as the attribute tables must be kept in memory. Attribute tables require roughly 32 bytes per row, per attribute. In addition, whenever attributes are split, working space for one additional attribute table is needed, so size for the number of attributes plus one. Finally, unknown (null) values may increase memory use, since splitting on an unknown value requires adding the row in question to each of the child nodes. Note that attribute tables are distributed throughout the cluster, so memory requirements for attributes scale out in the same way as for the row mapping structure mentioned above.
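The sizing rules above can be turned into a rough estimator. The per-row constants come from the figures stated in the text; the class and method names are hypothetical, and actual memory use may vary:

```java
// Back-of-the-envelope memory estimator for the sizing rules above:
// 13 bytes per row for the row mapping structure, and roughly 32 bytes
// per row, per attribute, when inMemoryDataset is enabled. Both
// structures are distributed evenly across the cluster. Illustrative
// only; not part of the DataFlow API.
public class MemoryEstimate {
    static final long ROW_MAP_BYTES = 13;   // per row
    static final long ATTRIBUTE_BYTES = 32; // per row, per attribute

    // Bytes of row-mapping memory needed on each node of the cluster.
    static long rowMapBytesPerNode(long rows, int clusterNodes) {
        return rows * ROW_MAP_BYTES / clusterNodes;
    }

    // Bytes per node for in-memory attribute tables, including working
    // space for one extra attribute table used while splitting.
    static long attributeBytesPerNode(long rows, int attributes, int clusterNodes) {
        return rows * ATTRIBUTE_BYTES * (attributes + 1) / clusterNodes;
    }

    public static void main(String[] args) {
        // The example from the text: 100 million rows on a 10-node cluster.
        long perNode = rowMapBytesPerNode(100_000_000L, 10);
        System.out.println(perNode / 1_000_000 + " MB per node"); // prints "130 MB per node"
    }
}
```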
Code Example
The following example demonstrates using the DecisionTreeLearner operator to train a predictive model using the "class" column as the target variable. The decision tree learner produces a PMML model that can be persisted. The generated PMML model can be used with the DecisionTreePredictor to predict target values.
Using the DecisionTreeLearner operator in Java
// Run the Decision Tree learner using "class" as the target column.
// All other columns are used as learning columns by default.
// Use default settings for all other properties.
DecisionTreeLearner dtLearner = graph.add(new DecisionTreeLearner());
dtLearner.setTargetColumn("class");
Using the DecisionTreeLearner operator in RushScript
var model = dr.decisionTreeLearner(data, {targetColumn:'class'});
Properties
The DecisionTreeLearner operator provides the following properties.
Ports
The DecisionTreeLearner operator provides a single input port.
The DecisionTreeLearner operator provides a single output port.
DecisionTreePredictor Operator
The DecisionTreePredictor operator applies a previously built decision tree PMML model to the input data. This operator supports most of the functionality listed at http://www.dmg.org/v4-0-1/TreeModel.html. Specifically, it supports all required elements and attributes as well as the following optional elements and attributes:
• missingValueStrategy (all strategies supported)
• missingValuePenalty
• noTrueChildStrategy (all strategies supported)
• All predicates: SimplePredicate, CompoundPredicate, SimpleSetPredicate, True, False
• ScoreDistribution
It ignores the following elements and attributes:
• EmbeddedModel
• Partition
• ModelStats
• ModelExplanation
• Targets
• LocalTransformations
• ModelVerification
• splitCharacteristic
Code Example
Using the DecisionTreePredictor operator in Java
// Create the decision tree predictor operator and add it to a graph
DecisionTreePredictor predictor = graph.add(new DecisionTreePredictor());
// Connect the predictor to an input port and a model source
graph.connect(dataSource.getOutput(), predictor.getInput());
graph.connect(modelSource.getOutput(), predictor.getModel());
// The output of the predictor is available for downstream operators to use
Using the DecisionTreePredictor operator in RushScript
// Apply the given model to the input data
var classifiedData = dr.decisionTreePredictor(model, data);
Properties
The DecisionTreePredictor operator provides the following properties.
Ports
The DecisionTreePredictor operator provides the following input ports.
The DecisionTreePredictor operator provides a single output port.
DecisionTreePruner Operator
The DecisionTreePruner operator performs pruning of the provided decision tree PMML model. This is useful when the learning data creates a model that is overly complex and does not generalize the data well. The operator examines the tree and reduces its size by removing sections that provide little power to classify the data. The overall goal of the DecisionTreePruner is therefore to reduce the complexity of the model while improving its predictive accuracy.
Note: This is a relatively inexpensive operation and thus is not parallelized.
Code Example
Using the DecisionTreePruner operator in Java
// Create the decision tree pruner and add it to the graph
DecisionTreePruner pruner = graph.add(new DecisionTreePruner());
// Connect the pruner to an input model and output model sink
graph.connect(dtLearner.getModel(), pruner.getInput());
graph.connect(pruner.getOutput(), modelWriter.getModel());
// The model produced can be used by operators needing a pruned decision tree PMML model
Using the DecisionTreePruner operator in RushScript
// Prune a decision tree model
var prunedModel = dr.decisionTreePruner(model);
Properties
The DecisionTreePruner operator provides the following properties.
Ports
The DecisionTreePruner operator provides a single input port.
The DecisionTreePruner operator provides a single output port.