Decision Tree Operators
Decision tree learning is a method that uses inductive inference to approximate a discrete-valued target function.
Decision trees, also referred to as classification trees or regression trees, can be used in decision analysis to represent decisions and decision making visually and explicitly. Decision tree learning is generally best suited to problems where the instances are represented by attribute-value pairs. In addition, if each attribute has a small number of distinct values, it is easier for the decision tree to reach a useful solution. A decision tree learner produces a model that consists of an arrangement of tests that provide an appropriate classification for an instance of data at every step in an analysis.
The decision tree implemented in DataFlow is a classification tree. A classification tree classifies a particular instance by sorting it down the tree from the root node to some leaf node, which then provides the classification of the instance.
Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute.
An instance in a specific data set is classified by starting at the root node of the tree, testing the attribute specified by this node, and then moving down the branch corresponding to the value of the attribute. This process is then repeated at the node on this branch and so on until a leaf node is reached and the instance can be classified.
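The walk described above can be sketched in a few lines of Java. This is an illustrative model of the process only, not the DataFlow API; the Node class, attribute names, and labels are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of classifying an instance by walking a decision tree.
// Not DataFlow code; all names here are hypothetical.
public class TreeWalk {
    static class Node {
        String testAttribute;                         // attribute tested here (null for a leaf)
        String label;                                 // classification if this is a leaf
        Map<String, Node> branches = new HashMap<>(); // attribute value -> child node

        static Node leaf(String label) {
            Node n = new Node();
            n.label = label;
            return n;
        }

        static Node test(String attribute) {
            Node n = new Node();
            n.testAttribute = attribute;
            return n;
        }
    }

    // Start at the root, test the node's attribute, follow the branch that
    // matches the instance's value, and repeat until a leaf is reached.
    static String classify(Node node, Map<String, String> instance) {
        while (node.testAttribute != null) {
            node = node.branches.get(instance.get(node.testAttribute));
        }
        return node.label;
    }

    public static void main(String[] args) {
        Node root = Node.test("outlook");
        root.branches.put("sunny", Node.leaf("no"));
        root.branches.put("overcast", Node.leaf("yes"));

        Map<String, String> instance = new HashMap<>();
        instance.put("outlook", "overcast");
        System.out.println(classify(root, instance)); // prints "yes"
    }
}
```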
DataFlow provides operators to produce and use decision tree models. The learner is used to determine the classification rules for a particular data set while the predictor can apply these rules to a data set. There is also an operator that can be used to prune a decision tree when the model provided is overly complex and does not generalize the data well. For more information, refer to the following topics:
DecisionTreeLearner Operator
The DecisionTreeLearner is the operator responsible for constructing a decision tree PMML model. The implementation is based primarily on Ross Quinlan’s book, C4.5: Programs for Machine Learning. Both Quinlan’s C4.5 implementation and this implementation have the following key features and limitations:
• Supports both numerical and categorical attributes
• Supports only categorical predictions
• Uses Information Gain/Gain Ratio as the measure of quality
• Missing value handling is done using fractional cases. This corresponds to the PMML "aggregateNodes" missing value strategy.
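As a point of reference, the Information Gain and Gain Ratio quality measures named above follow the standard C4.5 formulas, sketched below. The class and method names are illustrative only, not part of the DataFlow API:

```java
// Illustrative sketch of the standard C4.5 split-quality measures.
// Class distributions are given as arrays of per-class counts.
public class GainRatio {
    // Shannon entropy (base 2) of a class distribution.
    static double entropy(int[] classCounts) {
        double total = 0;
        for (int c : classCounts) total += c;
        double h = 0.0;
        for (int c : classCounts) {
            if (c > 0) {
                double p = c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    // Information gain: parent entropy minus the size-weighted
    // entropy of the children produced by the split.
    static double informationGain(int[] parent, int[][] children) {
        double total = 0;
        for (int c : parent) total += c;
        double weighted = 0.0;
        for (int[] child : children) {
            double size = 0;
            for (int c : child) size += c;
            weighted += (size / total) * entropy(child);
        }
        return entropy(parent) - weighted;
    }

    // Gain ratio: information gain normalized by the entropy of the
    // split itself, penalizing attributes with many distinct values.
    static double gainRatio(int[] parent, int[][] children) {
        double total = 0;
        for (int c : parent) total += c;
        double splitInfo = 0.0;
        for (int[] child : children) {
            double size = 0;
            for (int c : child) size += c;
            if (size > 0) {
                double p = size / total;
                splitInfo -= p * (Math.log(p) / Math.log(2));
            }
        }
        return splitInfo == 0 ? 0 : informationGain(parent, children) / splitInfo;
    }

    public static void main(String[] args) {
        // Classic weather example: 9 positive, 5 negative instances,
        // split three ways on a hypothetical "outlook" attribute.
        int[] parent = {9, 5};
        int[][] children = {{2, 3}, {4, 0}, {3, 2}};
        System.out.println("gain ratio: " + gainRatio(parent, children));
    }
}
```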
The following are differences between this implementation and Quinlan’s C4.5 implementation:
• Parallel/distributed implementation
• Scales to data sets that are too large for memory
• Does not support C4.5 rule generation. C4.5 is a software distribution that includes several executables; the primary focus here is the decision tree itself.
• Does not support "sub-tree raising" as part of the pruning strategy. This adds substantial processing time and is of arguable benefit.
• Currently limited to a single tree; there is no support for automatic cross-validation and tree selection
From a scaling and parallelism standpoint, the implementation is based on the following papers:
Memory Requirements
At a minimum, we require 13 bytes of RAM per row of data in order to support the row mapping data structure. This is distributed throughout the cluster, so if we have 10 nodes in the cluster and 100 million rows of data, we require 13*100 million/10 = 130 MB of RAM per node.
If the data set contains null values, this minimum memory requirement may be larger, as an extra n*12+12 bytes of bookkeeping is required for each row that must be split between child nodes, where n is the number of children of the split.
If the inMemoryDataset option is used, the memory requirements are much larger, as the attribute tables must be kept in memory. Attribute tables require roughly 32 bytes per row, per attribute. In addition, whenever attributes are split, working space for one additional attribute table is needed, so size for the number of attributes plus one. Finally, unknown (null) values may increase memory use, since splitting on an unknown value requires adding the row in question to each of the child nodes. Note that attribute tables are distributed throughout the cluster, so memory requirements for attributes scale out in the same way as for the row mapping structure mentioned above.
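The sizing rules above can be turned into a rough estimator. The per-row constants come from the figures stated in the text; the class and method names are hypothetical, and actual memory use may vary:

```java
// Back-of-the-envelope memory estimator for the sizing rules above:
// 13 bytes per row for the row mapping structure, and roughly 32 bytes
// per row, per attribute, when inMemoryDataset is enabled. Both
// structures are distributed evenly across the cluster. Illustrative
// only; not part of the DataFlow API.
public class MemoryEstimate {
    static final long ROW_MAP_BYTES = 13;   // per row
    static final long ATTRIBUTE_BYTES = 32; // per row, per attribute

    // Bytes of row-mapping memory needed on each node of the cluster.
    static long rowMapBytesPerNode(long rows, int clusterNodes) {
        return rows * ROW_MAP_BYTES / clusterNodes;
    }

    // Bytes per node for in-memory attribute tables, including working
    // space for one extra attribute table used while splitting.
    static long attributeBytesPerNode(long rows, int attributes, int clusterNodes) {
        return rows * ATTRIBUTE_BYTES * (attributes + 1) / clusterNodes;
    }

    public static void main(String[] args) {
        // The example from the text: 100 million rows on a 10-node cluster.
        long perNode = rowMapBytesPerNode(100_000_000L, 10);
        System.out.println(perNode / 1_000_000 + " MB per node"); // prints "130 MB per node"
    }
}
```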
Code Example
The following example demonstrates using the DecisionTreeLearner operator to train a predictive model using the "class" column as the target variable. The decision tree learner produces a PMML model that can be persisted. The generated PMML model can be used with the DecisionTreePredictor to predict target values.
Using the DecisionTreeLearner operator in Java
// Run the Decision Tree learner using "class" as the target column.
// All other columns are used as learning columns by default.
// Use default settings for all other properties.
DecisionTreeLearner dtLearner = graph.add(new DecisionTreeLearner());
dtLearner.setTargetColumn("class");
Using the DecisionTreeLearner operator in RushScript
var model = dr.decisionTreeLearner(data, {targetColumn:'class'});
Properties
The DecisionTreeLearner operator provides the following properties.
Ports
The DecisionTreeLearner operator provides a single input port.
The DecisionTreeLearner operator provides a single output port.
DecisionTreePredictor Operator
The DecisionTreePredictor operator applies a previously built decision tree PMML model to the input data. This operator supports most of the functionality listed at http://www.dmg.org/v4-0-1/TreeModel.html. Specifically, it supports all required elements and attributes as well as the following optional elements and attributes:
• missingValueStrategy (all strategies supported)
• missingValuePenalty
• noTrueChildStrategy (all strategies supported)
• All predicates: SimplePredicate, CompoundPredicate, SimpleSetPredicate, True, False
• ScoreDistribution
It ignores the following elements and attributes:
• EmbeddedModel
• Partition
• ModelStats
• ModelExplanation
• Targets
• LocalTransformations
• ModelVerification
• splitCharacteristic
Code Example
Using the DecisionTreePredictor operator in Java
// Create the decision tree predictor operator and add it to a graph
DecisionTreePredictor predictor = graph.add(new DecisionTreePredictor());
// Connect the predictor to an input port and a model source
graph.connect(dataSource.getOutput(), predictor.getInput());
graph.connect(modelSource.getOutput(), predictor.getModel());
// The output of the predictor is available for downstream operators to use
Using the DecisionTreePredictor operator in RushScript
// Apply the given model to the input data
var classifiedData = dr.decisionTreePredictor(model, data);
Properties
The DecisionTreePredictor operator provides the following properties.
Ports
The DecisionTreePredictor operator provides the following input ports.
The DecisionTreePredictor operator provides a single output port.
DecisionTreePruner Operator
The DecisionTreePruner operator performs pruning of the provided decision tree PMML model. This is useful when the learning data creates a model that is overly complex and does not generalize the data well. The operator examines the tree and reduces its size by removing sections that provide little power to classify the data. The overall goal of the DecisionTreePruner is therefore to reduce the complexity of the model while improving its predictive accuracy.
Note: This is a relatively inexpensive operation and thus is not parallelized.
Code Example
Using the DecisionTreePruner operator in Java
// Create the decision tree pruner and add it to the graph
DecisionTreePruner pruner = graph.add(new DecisionTreePruner());
// Connect the pruner to an input model and output model sink
graph.connect(dtLearner.getModel(), pruner.getInput());
graph.connect(pruner.getOutput(), modelWriter.getModel());
// The model produced can be used by operators needing a pruned decision tree PMML model
Using the DecisionTreePruner operator in RushScript
// Prune a decision tree model
var prunedModel = dr.decisionTreePruner(model);
Properties
The DecisionTreePruner operator provides the following properties.
Ports
The DecisionTreePruner operator provides a single input port.
The DecisionTreePruner operator provides a single output port.