Decision Tree Operators
Decision tree learning is a method that uses inductive inference to approximate a discrete-valued target function.
Decision trees, also referred to as classification trees or regression trees, can be used in decision analysis to visually and explicitly represent decisions and decision making. It is generally best suited to problems where the instances are represented by attribute-value pairs. Also, if each attribute has a small number of distinct values, it is easier for the decision tree to reach a useful solution. A decision tree learner will produce a model that consists of an arrangement of tests that provide an appropriate classification for an instance of data at every step in an analysis.
The decision tree implemented in DataFlow is a classification tree. A classification tree classifies a particular instance by sorting it down the tree from the root node to some leaf node, which then provides the classification of the instance.
Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute.
An instance in a specific data set is classified by starting at the root node of the tree, testing the attribute specified by this node, and then moving down the branch corresponding to the value of the attribute. This process is then repeated at the node on this branch and so on until a leaf node is reached and the instance can be classified.
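The root-to-leaf walk described above can be sketched as a minimal, self-contained example. The `Node` structure here is illustrative only and is not part of the DataFlow API:

```java
import java.util.Map;

// Minimal illustrative tree node: either a leaf carrying a classification,
// or an internal node that tests one attribute and branches on its value.
class Node {
    final String testAttribute;        // null for a leaf
    final Map<String, Node> branches;  // attribute value -> child node
    final String label;                // classification at a leaf

    Node(String label) { this(null, null, label); }
    Node(String testAttribute, Map<String, Node> branches) { this(testAttribute, branches, null); }
    private Node(String a, Map<String, Node> b, String l) { testAttribute = a; branches = b; label = l; }

    // Classify an instance by repeating the test-and-descend step
    // until a leaf is reached.
    String classify(Map<String, String> instance) {
        Node node = this;
        while (node.label == null) {
            node = node.branches.get(instance.get(node.testAttribute));
        }
        return node.label;
    }
}
```

For example, a tree that first tests a hypothetical "outlook" attribute and then, on the "sunny" branch, tests "humidity" would classify the instance {outlook=sunny, humidity=high} by descending two branches before reaching a leaf.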
DataFlow provides operators to produce and use decision tree models. The learner determines the classification rules for a particular data set, while the predictor applies those rules to a data set. There is also an operator that can be used to prune a decision tree when the model provided is overly complex and does not generalize well. For more information, refer to the following topics:
DecisionTreeLearner Operator
The DecisionTreeLearner is the operator responsible for constructing a decision tree PMML model. The implementation is based primarily on Ross Quinlan’s book, C4.5: Programs for Machine Learning. Both Quinlan’s C4.5 implementation and this implementation have the following key features and limitations:
Supports both numerical and categorical attributes
Supports only categorical predictions
Uses Information Gain/Gain Ratio as the measure of quality
Missing value handling is done using fractional cases. This corresponds to the PMML "aggregateNodes" missing value strategy.
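The fractional-cases approach to missing values can be illustrated with a small sketch (illustrative only, not the operator's internals): when a row's split attribute is null, the row is sent down every branch with a weight proportional to the fraction of training cases that took that branch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FractionalCases {
    // Given counts of training rows that followed each branch, return the
    // fractional weight assigned to a missing-value row for each branch.
    static Map<String, Double> splitWeights(Map<String, Integer> branchCounts, double rowWeight) {
        double total = branchCounts.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : branchCounts.entrySet()) {
            weights.put(e.getKey(), rowWeight * e.getValue() / total);
        }
        return weights;
    }
}
```

For instance, if 3 of 10 training rows took the "high" branch and 7 took "normal", a missing-value row of weight 1.0 would be split into fractions of 0.3 and 0.7.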
The following are differences between this implementation and Quinlan’s C4.5 implementation:
Parallel/distributed implementation
Scales to data sets that are too large for memory
Does not support C4.5 rule generation. C4.5 is a software distribution that includes several executables. Our primary focus is the decision tree itself.
Does not support "sub-tree raising" as part of the pruning strategy. This adds substantial processing time and is of arguable benefit.
Currently limited to a single tree; there is no support for automatic cross-validation and tree selection
Implementation, from a scaling and parallelism standpoint, is based on the following papers:
SPRINT
Memory Requirements
At a minimum, we require 13 bytes of RAM per row of data in order to support the row mapping data structure. This is distributed throughout the cluster, so if we have 10 nodes in the cluster and 100 million rows of data, we require 13*100 million/10 = 130 MB of RAM per node.
If the data set contains null values, this minimum memory requirement may be larger, as we require an extra n*12+12 bytes of bookkeeping for each row that must be split between child nodes, where n is the number of children of the split.
If the inMemoryDataset option is used, the memory requirements are much larger, as the attribute tables must be kept in memory. Attribute tables require roughly 32 bytes per row, per attribute. In addition, whenever attributes are split, working space for one extra attribute is needed, so plan for the number of attributes plus one. Finally, unknown (null) values may increase memory usage, since splitting on an unknown value requires adding the row in question to both of the child nodes. Note that attribute tables are distributed throughout the cluster, so memory requirements for attributes scale out in the same way as for the row mapping structure mentioned above.
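The sizing rules above can be turned into a rough back-of-the-envelope estimator. The constants (13 bytes per row, 32 bytes per row per attribute, attributes plus one) come from the text; the method names are illustrative:

```java
class DtMemoryEstimate {
    // Minimum per-node RAM for the row-mapping structure: 13 bytes per row,
    // distributed across the nodes in the cluster.
    static long rowMappingBytesPerNode(long rows, int clusterNodes) {
        return 13L * rows / clusterNodes;
    }

    // Rough attribute-table footprint when inMemoryDataset is enabled:
    // ~32 bytes per row per attribute, plus working space for one extra
    // attribute during splits (attributes + 1), distributed across the cluster.
    static long inMemoryBytesPerNode(long rows, int attributes, int clusterNodes) {
        return 32L * rows * (attributes + 1) / clusterNodes;
    }
}
```

Plugging in the example from the text, 100 million rows on a 10-node cluster gives 130 MB per node for the row mapping alone.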
Code Example
The following example demonstrates using the DecisionTreeLearner operator to train a predictive model using the "class" column as the target variable. The decision tree learner produces a PMML model that can be persisted. The generated PMML model can be used with the DecisionTreePredictor to predict target values.
Using the DecisionTreeLearner operator in Java
// Run the Decision Tree learner using "class" as the target column.
// All other columns are used as learning columns by default.
// Use default settings for all other properties.
DecisionTreeLearner dtLearner = graph.add(new DecisionTreeLearner());
dtLearner.setTargetColumn("class");
Using the DecisionTreeLearner operator in RushScript
var model = dr.decisionTreeLearner(data, {targetColumn:'class'});
Properties
The DecisionTreeLearner operator provides the following properties.
Name
Type
Description
targetColumn
String
The name of the column to predict. Must match the name of one of the string fields in the input.
qualityMeasure
The measure of quality to be used when determining the best split. Defaults to gain ratio.
minRecordsPerNode
int
The minimum number of records per node to allow in the tree. A split will not occur unless at least two children have this minimum number of records. Default: 2.
maxTreeNodes
int
The maximum total nodes to allow in the tree. Once exceeded, tree growth stops. Guards against unbounded memory growth. Default: 1000.
maxDistinctNominalValues
int
The maximum number of distinct nominal values to allow. Attributes with more than this number of distinct values will be filtered from the model. Default: 20.
binaryNominalSplits
boolean
Determines whether to use subsets of nominal values for splitting. The number of subsets is determined by the quality measure. If gain is selected as the splitting criterion, the operator always chooses two subsets. If gain ratio is selected, it chooses the number of subsets that maximizes the gain ratio. More children increase both the gain and the splitInfo; gain ratio is gain/splitInfo, so it balances the two. Default: false.
includedFields
List<String>
The list of columns to include for learning. An empty list means all columns of type double or string.
inMemoryDataset
boolean
Determines whether the dataset is to be kept in memory while the decision tree is being built. Default: false.
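The gain/splitInfo balance described for the binaryNominalSplits property follows the standard C4.5 formulas, sketched below. This is illustrative code, not part of the DataFlow API: splitInfo is the entropy of the split proportions, and gain ratio divides the information gain by it.

```java
class GainRatio {
    // Entropy (in bits) of a discrete distribution given raw counts.
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    // splitInfo penalizes splits with many small branches; gain ratio
    // divides the information gain by it to balance the two.
    static double gainRatio(double gain, int[] branchSizes) {
        return gain / entropy(branchSizes);
    }
}
```

A split into many tiny branches inflates splitInfo, which drives the gain ratio down; this is why maximizing gain ratio can prefer fewer subsets than maximizing raw gain.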
Ports
The DecisionTreeLearner operator provides a single input port.
Name
Type
Get Method
Description
input
getInput()
The training data that is used to build the model. Fields of type string represent categorical data. Fields of type double represent numerical data.
The DecisionTreeLearner operator provides a single output port.
Name
Type
Get Method
Description
model
getModel()
The model that is built from the training data. The model will be a PMML TreeModel described at http://www.dmg.org/v4-0-1/TreeModel.html.
DecisionTreePredictor Operator
The DecisionTreePredictor operator applies a previously built decision tree PMML model to the input data. This operator supports most of the functionality listed at http://www.dmg.org/v4-0-1/TreeModel.html. Specifically it supports all required elements and attributes as well as the following optional elements and attributes:
missingValueStrategy (all strategies supported)
missingValuePenalty
noTrueChildStrategy (all strategies supported)
All predicates: SimplePredicate, CompoundPredicate, SimpleSetPredicate, True, False
ScoreDistribution
It ignores the following elements and attributes:
EmbeddedModel
Partition
ModelStats
ModelExplanation
Targets
LocalTransformations
ModelVerification
splitCharacteristic
Code Example
Using the DecisionTreePredictor operator in Java
// Create the decision tree predictor operator and add it to a graph
DecisionTreePredictor predictor = graph.add(new DecisionTreePredictor());

// Connect the predictor to an input port and a model source
graph.connect(dataSource.getOutput(), predictor.getInput());
graph.connect(modelSource.getOutput(), predictor.getModel());

// The output of the predictor is available for downstream operators to use
Using the DecisionTreePredictor operator in RushScript
// Apply the given model to the input data
var classifiedData = dr.decisionTreePredictor(model, data);
Properties
The DecisionTreePredictor operator provides the following properties.
Name
Type
Description
appendRecordCount
boolean
Whether to append record count information. Default: false.
appendConfidence
boolean
Whether to append confidence information. Default: false.
confidencePrefix
String
The field name prefix to use for confidence. Default: "confidence_"
recordCountPrefix
String
The field name prefix to use for record counts. Default: "record_count_"
winnerField
String
The name of the winner field to output. Default: "winner"
Ports
The DecisionTreePredictor operator provides the following input ports.
Name
Type
Get Method
Description
input
getInput()
The input data to which the model is applied.
model
getModel()
Decision Tree model in PMML to apply.
The DecisionTreePredictor operator provides a single output port.
Name
Type
Get Method
Description
output
getOutput()
Results of applying the model to the input data.
DecisionTreePruner Operator
The DecisionTreePruner operator performs pruning of the provided decision tree PMML model. This is useful when the learning data creates a model that is overly complex and does not generalize well. The operator examines the tree and reduces its size by removing sections that provide little power to classify the data. The overall goal of the DecisionTreePruner is therefore to reduce the complexity of the model while improving its predictive accuracy.
Note: This is a relatively inexpensive operation and thus is not parallelized.
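The pruning decision can be sketched in the style of C4.5 error-based pruning: a subtree is collapsed to a leaf when the leaf's estimated error is no worse than the subtree's combined estimate. This is a simplified sketch (real C4.5 uses a confidence-interval upper bound on the error rate), not the operator's actual configuration:

```java
class PruneSketch {
    // Pessimistic error estimate: observed misclassifications plus a
    // continuity correction of 0.5 per leaf (simplified).
    static double estimatedErrors(int misclassified, int leaves) {
        return misclassified + 0.5 * leaves;
    }

    // Prune (collapse the subtree to a single leaf) when the leaf's
    // estimated error is no worse than the subtree's combined estimate.
    static boolean shouldPrune(int subtreeErrors, int subtreeLeaves, int leafErrors) {
        return estimatedErrors(leafErrors, 1) <= estimatedErrors(subtreeErrors, subtreeLeaves);
    }
}
```

Under this rule, a four-leaf subtree with 2 observed errors (estimate 4.0) would be replaced by a leaf with 3 observed errors (estimate 3.5), trading a little training accuracy for a simpler model.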
Code Example
Using the DecisionTreePruner operator in Java
// Create the decision tree pruner and add it to the graph
DecisionTreePruner pruner = graph.add(new DecisionTreePruner());

// Connect the pruner to an input model and output model sink
graph.connect(dtLearner.getModel(), pruner.getInput());
graph.connect(pruner.getOutput(), modelWriter.getModel());

// The model produced can be used by operators needing a pruned decision tree PMML model
Using the DecisionTreePruner in RushScript
// Prune a decision tree model
var prunedModel = dr.decisionTreePruner(model);
Properties
The DecisionTreePruner operator provides the following properties.
Name
Type
Description
configuration
The configuration that controls how pruning is performed.
Ports
The DecisionTreePruner operator provides a single input port.
Name
Type
Get Method
Description
input
getInput()
This is the original non-pruned model.
The DecisionTreePruner operator provides a single output port.
Name
Type
Get Method
Description
output
getOutput()
The model pruned according to the PruneConfiguration.
Last modified date: 03/10/2025