Performing Data Analysis

Building DataFlow Applications > Building DataFlow Applications > Building DataFlow Applications in Java > DataFlow Operator Library > Performing Data Analysis

Was this helpful?

Dataflow Analytics Operators

The Dataflow operator library contains several pre-built analytics operators. This section covers each of those operators and provides details on how to use them.

Covered Analytics Operations

• Association Rule Mining

• Cluster Analysis

• Decision Trees

• Using the DiscoverDomain Operator to Discover Domains

• Using the RunRScript Operator to Invoke R Scripts

• Using the KNNClassifier Operator to Classify K-Nearest Neighbors

• Naive Bayes within DataFlow

• Predictive Modeling Markup Language (PMML)

• Regression Analysis

• Statistics

• Using the SumOfSquares Operator to Compute a Sum of Squares

• Using the CountRanges Operator to Count Ranges

• Text Processing

• Using the DrawDiagnosticsChart Operator to Draw Diagnostic Charts

Association Rule Mining

Association Rule Mining within DataFlow

Association Rule Mining (ARM) is an unsupervised learning technique for finding relationships of items contained within transactions. ARM is typically applied to sales transactions where transactions contain line items. Each line item provides a transaction identifier and an item identifier. ARM uses this relationship (transaction to item) to find relationships between sets of items. The two main results of ARM are frequent item sets and association rules.

Frequent item sets capture sets of items that are considered frequent along with some statistics about the sets. Item and item sets are considered frequent if they meet a minimum support threshold. The minimum support is usually quantified as a percentage of transactions within a data set. For example, setting the minimum support to 5% implies that an item is only frequent if it appears in 5% of the transactions within a data set. If a data set contains 5 million transactions (unique transactions, not just records), then an item must appear in at least 250,000 transactions to be considered frequent.

Item sets can have a cardinality of 1 (meaning one item per set) up to a maximum value, usually denoted as k. Frequent item sets are useful in finding sets of items that appear together in transactions and may be opportunities for up-sell and cross-sell activities.

Association rules are another result of ARM. Association rules provide information about frequent item sets and how they interact with each other. From association rules, it can be determined how the presence of one set of items within a transaction influences the presence of another set of items. The rules consist of:

Antecedent

Specifies the frequent item set driving the rule (antecedent implies consequent)

Consequent

Specifies the frequent items included in a transaction due to the antecedent

Confidence

Specifies the confidence of a rule, defined as:

Confidence(Antecedent => Consequent) = Support(Antecedent U Consequent) / Support(Antecedent)

A confidence of 25% implies that the rule is correct for 25% of the transactions containing the antecedent items. The higher the confidence value, the higher the strength of the rule.

Lift

Specifies the lift of a rule, defined as:

Lift(Antecedent => Consequent) = Support(Antecedent U Consequent) / (Support(Antecedent) * Support(Consequent))

Lift provides a ratio of the support of the item sets to that expected if the item sets were independent of each other. A higher lift value indicates the two item sets are dependent on each other, implying the rule is valid. In general, the confidence measurement alone cannot be used when considering the validity of a rule. A rule can have high confidence but a low lift, implying the rule may not be useful.

Support

Specifies the ratio of the number of transactions where the antecedent and consequent are present to the total number of transactions.

DataFlow provides operators for calculating the frequent items and the association rules. If only the frequent items are wanted, use the FrequentItems operator. For a combination of the frequent items and association rules, use the FPGrowth operator. Using the ConvertARMModel Operator to Convert Association Models from PMML can be used to convert an association PMML model into other formats.

Covered ARM Operators

• Using the FrequentItems Operator to Compute Item Frequency

• Using the FPGrowth Operator to Determine Frequent Pattern Growth

• Using the ConvertARMModel Operator to Convert Association Models from PMML

Using the FrequentItems Operator to Compute Item Frequency

The FrequentItems operator computes the frequency of items within the given transactions. Two fields are needed in the input flow: one to identify transactions and another to identify items.

A minimum support value must also be specified as a percentage of the transactions an item must participate in to be considered frequent. Setting this value’s threshold lower will typically return more data. An optional label field may also be specified that will be used for display purposes instead of the transaction identifiers.

The main output of this operator is the set of frequent items. Two fields are contained in the output: the item field from the input and the frequency count of the items that are above the minimum support value.

The operator also outputs a PMML port that contains an association model. The model will be partially filled in with frequent items and transaction statistics.

Code Example

In this example we are using a movie ratings data set to determine which movies have been rated by at least 40% of the users.

Using the FrequentItems operator in Java

// Create frequent items operator with minimum support set to 40%
FrequentItems freqItems = graph.add(new FrequentItems("userID", "movieID", 0.4D));

The following example demonstrates using the operator in RushScript. Note that the operator has two output ports: the frequent items and a PMML model. Access the returned data set container variable with the port names to access a specific port.

Using the FrequentItems operator in RushScript

var freqItem = dr.frequentItems(data, {txnFieldName:'userID', itemFieldName:'movieID', minSupport:0.4});

Properties

The FrequentItems operator provides the following properties.

Name	Type	Description
txnFieldName	String	The name of the input field containing the transaction identifier.
itemFieldName	String	The name of the input field containing the item name.
labelFieldName	String	The name of the input field containing the optional labels.
minSupport	double	The minimum support percentage. Must be a value between 0 and 1.

Ports

The FrequentItems operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The data on which the frequency is calculated.

The FrequentItems operator provides the following output ports.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	Output data with item field and frequency count field.
model	PMMLPort	getModel()	Outputs partially filled association model built from the data.

Using the FPGrowth Operator to Determine Frequent Pattern Growth

The FPGrowth operator implements the FP-growth algorithm, creating a PMML model that contains generated item sets and association rules. The FP-growth algorithm is optimized to require only two passes over the input data. Other ARM algorithms such as apriori require many passes over the data.

The input data is required to have two fields: a transaction identifier and an item identifier. The transaction identifier discriminates transactions. The item identifier indicates items within transactions.

Transactions are assumed to be in line item order (one input record per item in a transaction). Transactions records are also assumed to be contiguous. An optional label field may also be specified that will be used for display purposes instead of the transaction identifiers.

The operator has two outputs: the frequent item sets and a PMML model. The frequent item sets are output in pivoted form with a set identifier and one item per row. The output model is a PMML-based association model. The model contains frequent items, frequent item sets, and association rules.

Code Example

In this example, the BPM sample data is read and used as input into the FPGrowth operator. Downstream from the FPGrowth operator, the PMML model is persisted to a local file. The PMML model contains the frequent items, frequent item sets, association rules, and various metrics about the model.

Using the FPGrowth operator in Java

// Create the FP-Growth operator.
FPGrowth fpgrowth = graph.add(new FPGrowth());
fpgrowth.setTxnFieldName("txnID");             // field containing transaction ID
fpgrowth.setItemFieldName("itemID");           // field containing item ID
fpgrowth.setMinSupport(0.02);                  // min support for frequent items
fpgrowth.setMinConfidence(0.2);                // min confidence for rules
fpgrowth.setK(2);                              // highest cardinality of item sets wanted
fpgrowth.setAnnotationText("This is an ARM model built using DataFlow");

Using the FPGrowth operator in RushScript

var model = dr.fpgrowth(data, {
    txnFieldName:'txnID',
    itemFieldName:'itemID',
    minSupport:0.02,
    minConfidence:0.2,
    k:2,
    annotationText:'This is an ARM model built using DataFlow'
    });

Properties

The FPGrowth operator provides the following properties.

Name	Type	Description
annotationText	String	The text provided will be added as an annotation to the output PMML model. This can be used to tag a model for later reference. This is opaque text added to the annotation element within the model.
k	int	The largest item-set cardinality wanted. The default value is zero, which implies that all cardinality item-sets should be discovered.
txnFieldName	String	The transaction identifier field name property.
itemFieldName	String	The item field name property.
labelFieldName	String	The optional label field name property.
minSupport	double	The minimum support property. This property is a floating point number between 0 and 1 (exclusive) that defines the percentage of transactions an item must participate in to be considered frequent.
minConfidence	double	The minimum confidence threshold. The minimum confidence value must be between 0.0 and 1.0 (exclusive).

Ports

The FPGrowth operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data. Must have at least two fields.

The FPGrowth operator provides two output ports.

Name	Type	Get Method	Description
itemSets	RecordPort	getItemSets()	The discovered frequent item sets.
model	PMMLPort	getOutput()	Outputs the generated PMML-based association model.

Using the ConvertARMModel Operator to Convert Association Models from PMML

The ConvertARMModel operator converts a PMML model created by Using the FPGrowth Operator to Determine Frequent Pattern Growth into other formats. PMML is a standard format for data mining models developed by the DMG organization.

The operator currently supports the output format: GEXF. This format is a standard that is accepted by many graph visualization tools. GEXF was originally developed for the open source Gephi tool. The format is also supported by other visualization tools.

When converting into the GEXF format, the frequent item sets of the PMML model are used. Each frequent item is added as a node in the graph. Attributes are added to each node indicating the support metric for the item.

Each relationship between items is added as an edge. Edges are weighted by how many times items are found in frequent item sets together. After the GEXF file is created, the Gephi tool can be launched to visualize the results. The Gephi tool is not included as part of DataFlow. You can download it from http://gephi.org/.

A path name is required to specify where to persist the converted model. When executed, the operator applies the specified conversion of the input PMML model into the specified format. The results are written to the specified file. If the file exists already, it will be overwritten.

Code Examples

Using the ConvertARMModel operator in Java

// Create converter and set properties.
ConvertARMModel converter = graph.add(new ConvertARMModel());
converter.setConversionType(ConversionType.GEXF);
converter.setTarget("/path/to/output/results.gexf"));

// Connect FP-Growth learner model output with converter input
graph.connect(learner.getOutput(), converter.getInput());

Using the ConvertARMModel operator in RushScript

dr.convertARMModel(model, {conversionType:'GEXF', target:'/path/to/output/results.gexf'});

Properties

The ConvertARMModel operator has the following properties.

Name	Type	Description
conversionType	ConversionType	The type of conversion to apply. This property is set using an enumerated type. Default: GEXF.
target	String	Pathname of the target file to create with the results of the conversion.

Ports

The ConvertARMModel operator provides a single input port.

Name	Type	Get Method	Description
input	PMMLPort	getInput()	The input PMML model to convert. The model must be an association model built by Using the FPGrowth Operator to Determine Frequent Pattern Growth or other ARM operator.

The ConvertARMModel operator has no output ports.

Cluster Analysis

Cluster analysis includes techniques for grouping entities based on their characteristics as well as describing the resulting groups also known as clusters. This is done by applying cluster algorithms to attributes of entities that make them similar. The goal is to create clusters such that entities that belong to the same cluster are similar with regard to the relevant attributes, but entities from two different clusters are dissimilar.

Cluster analysis is a common task in data mining as well as statistical data analysis. It helps to organize observed data and build taxonomies.

Covered Clustering Operators

• Using the KMeans Operator to Compute K-Means

• Using the ClusterPredictor Operator for Cluster Predicting

Using the KMeans Operator to Compute K-Means

The KMeans operator performs k-means computation for the given input. All included fields in the given input must be of type double. The k-means algorithm chooses a random k data points as the initial centroids. Computation ends when one of the following conditions is true:

• MaxIterations is exceeded.

• For each centroid, there is no significant difference from the corresponding centroid from the previous iteration when compared with the configured quality.

Code Example

The following example demonstrates computing the clusters for the Iris data set using k-means. The input data contains 4 numeric fields but only 3 are used and so must be explicitly specified.

Using the KMeans operator in Java

// Run k-means to create cluster assignments
KMeans kmeans = graph.add(new KMeans());
kmeans.setK(3);
kmeans.setMaxIterations(10);
kmeans.setDistanceMeasure(DistanceMeasure.EUCLIDEAN);
kmeans.setIncludedColumns(Arrays.asList(new String[] {"sepal length", "sepal width", "petal length"}));

Using the KMeans operator in RushScript

var results = dr.kmeans(data, {
    k:3,
    maxIterations:10,
    distanceMeasure:'EUCLIDEAN',
    includedColumns:['sepal length', 'sepal width', 'petal length']});

Properties

The KMeans operator provides the following properties.

Name	Type	Description
maxIterations	int	The maximum number of iterations to allow before iteration is halted. Defaults to 99.
k	int	The k value, where k is the number of centroids to compute. Default: 3.
includedColumns	List<String>	The list of learning columns to include. An empty list means all columns of type double.
distanceMeasure	DistanceMeasure	The distance measure used to measure the distance between two points when building the model. Either EUCLIDEAN or COSINE_SIMILARITY. Default: EUCLIDEAN.

Ports

The KMeans operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The training data used to build the model.

The KMeans operator provides the following output port.

Name	Type	Get Method	Description
model	PMMLPort	getModel()	The resultant PMML cluster model.

Using the ClusterPredictor Operator for Cluster Predicting

The ClusterPredictor operator assigns input data to clusters based on the provided PMML clustering model. The explicit cluster IDs will be used for the assignment if the model provides any. Otherwise, the implicit 1-based index, indicating the position in which each cluster appears in the model will be used as ID.

The input data must contain the same fields as the training data that was used to build the model (in the PMML model: clustering fields with the attribute "isCenterField" set to "true") and these fields must be of type double, float, long, or int. The resulting assignments will be part of the output alongside with the original input data.

Code Example

Using the ClusterPredictor operator in Java

// Create the cluster predictor operator and add it to a graph
ClusterPredictor predictor = graph.add(new ClusterPredictor());

// Set the name of the field containing the cluster assignments
predictor.setWinnerFieldName("label");

// Connect the predictor to an input data and model source
graph.connect(dataSource.getOutput(), predictor.getInput());
graph.connect(modelSource.getOutput(), predictor.getModel());

// The output of the predictor is available for downstream operators to use

Using the ClusterPredictor operator in RushScript

// Apply a clustering model to the given data
var results = dr.clusterPredictor(model, data, {winnerFieldName:"label"});

Properties

The ClusterPredictor operator provides the following properties.

Name	Type	Description
winnerFieldName	String	The name of the output field containing the cluster assignments. Default: "winner".

Ports

The ClusterPredictor operator provides the following input ports.

Name	Type	Get Method	Description
input	RecordPort	getInput()	Input data to which the clustering model is applied.
model	PMMLPort	getModel()	Clustering model in PMML to apply.

The ClusterPredictor operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	Results of applying the clustering model to the input data. Contains the original input data as well as a new field with the assigned cluster ID.

Decision Trees

Decision Tree Learning within DataFlow

Decision tree learning is a method that uses inductive inference to approximate a target function, which will produce discrete values.

Decision trees, also referred to as classification trees or regression trees, can be used in decision analysis to visually and explicitly represent decisions and decision making. It is generally best suited to problems where the instances are represented by attribute-value pairs. Also, if each attribute has a small number of distinct values, it is easier for the decision tree to reach a useful solution. A decision tree learner will produce a model that consists of an arrangement of tests that provide an appropriate classification for an instance of data at every step in an analysis.

The decision tree implemented in DataFlow is a classification tree. A classification tree can be used to classify a particular instance by sorting them down the tree from the root node to some leaf node, which then provides the classification of the instance.

Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute.

An instance in a specific data set is classified by starting at the root node of the tree, testing the attribute specified by this node, and then moving down the branch corresponding to the value of the attribute. This process is then repeated at the node on this branch and so on until a leaf node is reached and the instance can be classified.

DataFlow provides operators to produce and use decision tree models. The learner is used to determine the classification rules for a particular data set while the predictor can apply these rules to a data set. There is also an operator that can be used to prune a decision tree when the model provided is overly complex and does not generalize the data well.

Covered Decision Tree Operators

• Using the DecisionTreeLearner Operator to Construct a Decision Tree PMML Model

• Using the DecisionTreePredictor Operator for Decision Tree Predicting

• Using the DecisionTreePruner Operator for Decision Tree Pruning

Using the DecisionTreeLearner Operator to Construct a Decision Tree PMML Model

The DecisionTreeLearner is the operator responsible for constructing a decision tree PMML model. The implementation is based primarily on Ross Quinlan’s book, C4.5: Programs for Machine Learning. Quinlan’s C4.5 implementation (and this implementation) have the following key features and limitations:

• Supports both numerical and categorical attributes

• Supports only categorical predictions

• Uses Information Gain/Gain Ratio as the measure of quality

• Missing value handling is done using fractional cases. This corresponds to the PMML "aggregateNodes" missing value strategy.

The following are differences between this implementation and Quinlan’s C4.5 implementation:

• Parallel/distributed implementation

• Scales to data sets that are too large for memory

• Does not support C4.5 rule generation. C4.5 is a software distribution that includes several executables. Our primary focus is the decision tree itself.

• Does not support "sub-tree raising" as part of the pruning strategy. This adds substantial processing time and is of arguable benefit.

• Currently limited to single-tree; there is no support for automatic cross-validation and tree selection

Implementation, from a scaling and parallelism standpoint, is based on the following papers:

• SPRINT

• ScalParC

Memory Requirements

At a minimum, we require 13 bytes of RAM per row of data in order to support the row mapping data structure. This is distributed throughout the cluster, so if we have 10 nodes in the cluster and 100 million rows of data, we require 13*100 million/10 = 130 MB of RAM per node.

If the data set contains null values, this minimum memory requirement may be larger, as we require an extra n*12+12 bytes of bookkeeping for each row that must be split between children nodes where n is the number of children of the split.

If the inMemoryDataset option is used, the memory requirements are much larger as the attribute tables must be kept in memory. Attribute tables require roughly 32 bytes per row, per attribute. In addition, whenever attributes are split we require working space for the split, so the number of attributes+1 must be calculated. Finally, unknown(null) values may impact the memory sizes since splitting on an unknown value requires adding the row in question to both of the children nodes. Note that attribute tables are distributed throughout the cluster, so memory requirements for attributes do scale out in the same way as for the row mapping structure mentioned above.

Code Example

The following example demonstrates using the DecisionTreeLearner operator to train a predictive model using the "class" column as the target variable. The decision tree learner produces a PMML model that can be persisted. The generated PMML model can be used with the DecisionTreePredictor to predict target values.

Using the DecisionTreeLearner operator in Java

// Run the Decision Tree learner using "class" as the target column.
// All other columns are used as learning columns by default.
// Use default settings for all other properties.
DecisionTreeLearner dtLearner = graph.add(new DecisionTreeLearner());
dtLearner.setTargetColumn("class");

Using the DecisionTreeLearner operator in RushScript

var model = dr.decisionTreeLearner(data, {targetColumn:'class'});

Properties

The DecisionTreeLearner operator provides the following properties.

Name	Type	Description
targetColumn	String	The name of the column to predict. Must match the name of one of the string fields in the input.
qualityMeasure	QualityMeasure	The measure of quality to be used when determining the best split. Defaults to gain ratio.
minRecordsPerNode	int	The minimum number of records per node to allow in the tree. A split will not occur unless at least two children have this minimum number of records. Default: 2.
maxTreeNodes	int	The maximum total nodes to allow in the tree. Once exceeded, tree growth stops. Guards against unbounded memory growth. Default: 1000.
maxDistinctNominalValues	int	The maximum number of distinct nominal values to allow. Attributes with more than this number of distinct values will be filtered from the model. Default: 20.
binaryNominalSplits	boolean	Determines whether to use subsets of nominal values for splitting. The number of subsets is determined by the quality measure. If gain is selected as the splitting criteria, will always choose two subsets. If gain ratio is selected, it will choose the number of subsets that maximizes the gain ratio. More children increase both the gain and the splitInfo. Gain ratio is gain/splitInfo, so it serves to balance between the two. Default: false.
includedFields	List<String>	The list of columns to include for learning. An empty list means all columns of type double or string.
inMemoryDataset	boolean	Determines whether the dataset is to be kept in memory while the decision tree is being built. Default: false.

Ports

The DecisionTreeLearner operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The training data that is used to build the model. Fields of type string represent categorical data. Fields of type double represent numerical data.

The DecisionTreeLearner operator provides a single output port.

Name	Type	Get Method	Description
model	PMMLPort	getModel()	The model that is built from the training data. The model will be a PMML TreeModel described at http://www.dmg.org/v4-0-1/TreeModel.html.

Using the DecisionTreePredictor Operator for Decision Tree Predicting

The DecisionTreePredictor operator applies a previously built decision tree PMML model to the input data. This operator supports most of the functionality listed at http://www.dmg.org/v4-0-1/TreeModel.html. Specifically it supports all required elements and attributes as well as the following optional elements and attributes:

• missingValueStrategy (all strategies supported)

• missingValuePenalty

• noTrueChildStrategy (all strategies supported)

• All predicates: SimplePredicate, CompoundPredicate, SimpleSetPredicate, True, False

• ScoreDistribution

It ignores the following elements and attributes:

• EmbeddedModel

• Partition

• ModelStats

• ModelExplanation

• Targets

• LocalTransformations

• ModelVerification

• splitCharacteristic

Code Example

Using the DecisionTreePredictor operator in Java

// Create the decision tree predictor operator and add it to a graph
DecisionTreePredictor predictor = graph.add(new DecisionTreePredictor());

// Connect the predictor to an input port and a model source
graph.connect(dataSource.getOutput(), predictor.getInput());
graph.connect(modelSource.getOutput(), predictor.getModel());

// The output of the predictor is available for downstream operators to use

Using the DecisionTreePredictor operator in RushScript

// Apply the given model to the input data
var classifiedData = dr.decisionTreePredictor(model, data);

Properties

The DecisionTreePredictor operator provides the following properties.

Name	Type	Description
appendRecordCount	boolean	Whether to append record count information. Default: false.
appendConfidence	boolean	Whether to append confidence information. Default: false.
confidencePrefix	String	The field name prefix to use for confidence. Default: "confidence_"
recordCountPrefix	String	The field name prefix to use for record counts. Default: "record_count_"
winnerField	String	The name of the winner field to output. Default: "winner"

Ports

The DecisionTreePredictor operator provides the following input ports.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data to which the model is applied.
model	PMMLPort	getModel()	Decision Tree model in PMML to apply.

The DecisionTreePredictor operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	Results of applying the model to the input data.

Using the DecisionTreePruner Operator for Decision Tree Pruning

The DecisionTreePruner operator performs pruning of the provided decision tree PMML model. This is useful when the learning data creates a model that is overly complex and does not generalize the data well. The operator will examine the tree and reduce its size by removing sections of the tree that provide little power to classify the data. Therefore the overall goal of the DecisionTreePruner is to reduce the complexity of the model while improving its predictive accuracy.

Note: This is a relatively inexpensive operation and thus is not parallelized.

Code Example

Using the DecisionTreePruner operator in Java

// Create the decision tree pruner and add it to the graph
DecisionTreePruner pruner = graph.add(new DecisionTreePruner());

// Connect the pruner to an input model and output model sink
graph.connect(dtLearner.getModel(), pruner.getInput());
graph.connect(pruner.getOutput(), modelWriter.getModel())

// The model produced can be used by operators needing a pruned decision tree PMML model

Using the DecisionTreePruner in RushScript

// Prune a decision tree model
var prunedModel = dr.decisionTreePruner(model);

Properties

The DecisionTreePruner operator provides the following properties.

Name	Type	Description
configuration	PruneConfiguration	The configuration that controls how pruning is performed.

Ports

The DecisionTreePruner operator provides a single input port.

Name	Type	Get Method	Description
input	PMMLPort	getInput()	This is the original non-pruned model.

The DecisionTreePruner operator provides a single output port.

Name	Type	Get Method	Description
output	PMMLPort	getOutput()	The model pruned according to the PruneConfiguration.

Using the DiscoverDomain Operator to Discover Domains

The DiscoverDomain operator can be used as a utility operator for discovering the domain of String fields. Note that this is not intended to be used as a "top-level" operator; rather this is a utility for iterative operators that want to compute their domains.

Code Example

The following example demonstrates configuring and using the DiscoverDomain operator within an iterative operator.

Configuring the DiscoverDomain operator in Java

// Configuring the operator with an IterativeExecutionContext
LogicalSubgraph graph = ctx.newSubgraph();
DiscoverDomain dd = graph.add(new DiscoverDomain());
dd.setIncludedFields(Arrays.asList(new String[]{"field1", "field2", "field3"}));
dd.setMaxDiscoveredValues(100);
graph.connect(input, dd.getInput());
graph.run();
RecordTokenType type = dd.getResult();

Using DiscoverDomain in RushScript

var discoverDomain = dr.discoverDomain(data, {maxDiscoveredValues:100, includedFields:['field1', 'field2', 'field3']});

Properties

The DiscoverDomain operator provides the following properties.

Name	Type	Description
includedFields	List<String>	A list of fields to include in the discovery. The default value of "empty" implies that we include all fields.
maxDiscoveredValues	int	The maximum number of values to discover. This places a cap on the memory usage in the event that the input data contains a large number of distinct values. Fields that exceed the number of discovered values will remain as strings. This is 1000 by default.

Ports

The DiscoverDomain operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The data containing the fields to convert.

The DiscoverDomain operator does not provide an output port. Instead when the operator has , you can get the result of the domain discovery.

Name	Type	Get Method	Description
result	RecordTokenType	getResult()	The discovered field information.

Using the RunRScript Operator to Invoke R Scripts

The RunRScript operator is used to execute a snippet of R code within the context of a DataFlow operator. The R script is executed for all data contained within the given input flow. The output of the R script is pushed as the output of the operator.

The R environment must be installed on the execution platform. If this operator is to be used in a distributed environment, the R environment must be installed in all worker nodes of the cluster. The installation path must be the same on all nodes. Visit the CRAN (Comprehensive R Archive Network) website for download links and installation instructions.

By default, the RunRScript operator executes in parallel. This implies that the given R script has no data dependencies and can run independently on segmented data and produce the correct results. If this is not the case, disable parallel execution for the operator.

Note that the operator gathers all of its input data to hand off to R. The R engine then loads the data into memory in preparation for executing the script. The input data must therefore fit into memory. In a parallel or distributed environment, the data is segmented, and each segment of data is processed in a separate R engine. In this case, each data segment must fit into the memory of its executing engine.

The R script is handed an R data frame in a variable named "R". The script can use the data frame as desired. The resultant data is gathered from a data frame of the same name. Two variables are set within the R environment to support parallel execution:

partitionID

Specifies the zero-based identifier of the partition the current instance is running in

partitionCount

Specifies the total number of data partitions executing for this graph. For the scripting operator, this equates to the total number of instances of the operator replicated for execution.

These variables are numeric R variables and can be accessed directly within the user-provided R script.

The sequence of operations are these:

1. Data from the input port is cached in a local disk file.

2. The given R script is wrapped with statements to load the data from the file and store the results to another file.

3. The R script is executed using the Rscript executable.

4. The resultant data file is parsed and the data is pushed to the output of the operator.

5. The temporary input/output files are deleted.

Code Examples

The following code example demonstrates using the RunRScript operator. First, a set of R statements is defined. Next, the RunRScript operator is created and the required properties are set. Code to connect the input and output ports of the operator is not shown.

Note how the R script uses the data frame in the variable named "R" and sets the data to be output in a data frame of the same name.

Using RunRScript in Java

// Define a snippet of R code to apply to the input data
String scriptText =
            "myfunction <- function(x) { " +
            "    return (x * 2);" +
            "}\n" +
            "tmpVal <- sapply(R$sepal.length, myfunction);" +
            "R <- data.frame(tmpVal);";

// Create the script operator and set the required properties.
RunRScript script = app.add(new RunRScript());
script.setPathToRScript("/usr/bin/Rscript");
script.setOutputType(record(DOUBLE("result"));
script.setScriptSnippet(scriptText);

Using RunRScript in RushScript

var scriptText =
            'myfunction <- function(x) { ' +
            '    return (x * 2);' +
            '}\n' +
            'tmpVal <- sapply(R$sepal.length, myfunction);' +
            'R <- data.frame(tmpVal);';

var script = dr.runRScript(
    data, {
        pathToRScript:'/usr/bin/Rscript',
        scriptSnippet:scriptText,
        outputType:dr.schema().DOUBLE('result')} );

Properties

The RunRScript operator supports the following properties.

Name	Type	Description
charset	String	Sets character set value that is used to format input and output data for R. Default: UTF-8.
outputType	RecordTokenType	The number of characters to read for performing schema discovery and structural analysis.
pathToRScript	String	Fully qualified file path to the Rscript executable in the local R installation. This executable is usually found within the bin directory of the R installation.
scriptSnippet	String	The snippet of R code to execute within the R environment.
requiredDataDistribution	DataDistribution	Sets the required data distribution of the input data port of this operator. The R scripting operator has no knowledge of the processing being done by its script. As such, it cannot set its metadata for input or output. Exposing the data distribution allows setting the required distribution for the input port. For example, if the script is using vertical partitioning, the required data distribution can be set to FullDataDistribution to ensure each replica of the operator sees all of the input data.

Ports

The RunRScript operator supports the following input ports:

Name	Type	Get method	Description
input	RecordPort	getInput()	The input record data to apply the R script against.

The RunRScript operator supports the following output ports:

Name	Type	Get method	Description
output	RecordPort	getOutput()	Data resulting from the application of the given R script.

Using the KNNClassifier Operator to Classify K-Nearest Neighbors

The KNNClassifier operator applies the k-nearest neighbor algorithm to classify input data against an already classified set of example data. A naive implementation is used, with each input record being compared against all example records to find the set of example records closest to it, as measured by a user-specified measure. The input record is then classified according to a user-specified method of combining the classes of the neighbors.

The field containing the classification value (also referred to as the target feature) must be specified. It is not necessary to specify the fields used to calculate nearness (also referred to as the selected features). If omitted, they will be derived from the example and query schema using all eligible fields. The example and query records need not have the same schema. All that is required is that:

• The selected features must be present in both the example and query records and be of a numeric type (representing continuous data). In this context, a numeric type is any type that can be widened to a TokenTypeConstant.DOUBLE .

• The target feature must be present in the example records and be either numeric (as described above) or an enumerated type (representing categorical data; TokenTypeConstant.ENUM(List)).

The output consists of the query data with the resulting classification appended to it. This value is in the field named "PREDICTED_VAL".

The implementation is designed to minimize memory usage. It is possible to specify an approximate limit on the amount of memory used by the operator; it is not necessary to have sufficient memory to hold both the example and query data in memory, although performance is best in this case.

Code Example

This example uses the Iris data to classify the type of iris based on its physical characteristics.

Using the KNNClassifier operator in Java

// Initialize the KNN classifier
KNNClassifier classifier = graph.add(new KNNClassifier(10, "class"));
List<String> features = new ArrayList<String>();
features.add("sepal length");
features.add("petal length");
classifier.setSelectedFeatures(features);
classifier.setClassificationScheme(ClassificationScheme.vote());
classifier.setNearnessMeasure(NearnessMeasure.euclidean());

Using the KNNClassifier operator in RushScript

// Use the KNN classifier
var results = dr.knnClassifier(data, queryData, {
    k:10,
    targetFeature:"class",
    selectedFeatures:['sepal length', 'petal length'],
    classificationScheme:ClassificationScheme.vote(),
    nearnessMeasure:NearnessMeasure.euclidean()});

Properties

The KNNClassifier operator provides the following properties.

Name	Type	Description
scheme	ClassificationScheme	How to determine the classification of a record in the query data from the classifications of its nearest neighbors in the example data.
k	int	The size of the nearest neighbor set. The algorithm will use this many neighbors to perform classification of query data.
measure	NearnessMeasure	How to determine the nearest neighbors of a record in the query data.
targetFeature	String	The field in the example data that provides classification data.
selectedFeatures	String...	The fields to use when determining the nearest neighbors. These fields must be present in both the example and query records. They must also be numeric.
size	long	The amount of memory to use for buffering the example data. This value is in bytes.
sizeSpecifier	String	Alternate setting to specify the amount of memory to use for buffering the example data. The amount is specified using standard multipliers such as K, M, and G.

Ports

The KNNClassifier operator provides the following input ports.

Name	Type	Get Method	Description
query	RecordPort	getQuery()	The query data that will be used.
example	RecordPort	getTraining()	The example training data that will be used.

The KNNClassifier operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The query data with an additional classification field.

Naive Bayes within DataFlow

The Naive Bayes algorithm uses Bayes’ theorem to find the probability of an event occurring given the probability of another event that has already occurred.

The Naive Bayes learner produces a model that can be used with the Naive Bayes predictor as a simple probabilistic classifier based on applying Bayes’ theorem with strong, or naive, independence assumptions.

One of the advantages of a naive Bayes classifier is that it only requires a relatively small amount of training data to estimate the parameters necessary for classification.

DataFlow provides operators to produce and use naive Bayes classifiers. The learner is used to determine the classification rules for a particular data set while the predictor can apply these rules to a data set.

Covered Naive Bayes Operators

• Using the NaiveBayesLearner Operator

• Using the NaiveBayesPredictor Operator

Using the NaiveBayesLearner Operator

The NaiveBayesLearner operator is responsible for building a Naive Bayes PMML model from input data. The base algorithm used is specified at http://www.dmg.org/v4-0-1/NaiveBayes.html, with the following differences:

• Provides the ability to predict based on numerical data. For numerical data, we compute probability based on the assumption of a Gaussian distribution.

• We use Laplace smoothing in place of the "threshold" parameter.

• We provide an option to count missing values. If selected, missing values are treated like any other single distinct value. Probability is calculated in terms of the ratio of missing to non-missing.

• Calculation is performed in terms of log-likelihood rather than likelihood to avoid underflow on data with a large number of fields.

Code Example

This example uses Naive Bayes to create a predictive classification model based on the Iris data set. It uses the field "class" within the iris data as the target column. This example produces a PMML model that is persisted. This PMML model can then be used with the NaiveBayesPredictor operator to predict target values.

Using the NaiveBayesLearner operator in Java

// Run Naive Bayes using "class" as the target column.
// All other columns are used as learning columns by default.
NaiveBayesLearner nbLearner = graph.add(new NaiveBayesLearner());
nbLearner.setTargetColumn("class");

Using the NaiveBayesLearner operator in RushScript

// Run Naive Bayes using "class" as the target column.
// All other columns are used as learning columns by default.
var model = dr.naiveBayesLearner(data, {targetColumn:'class'});

Properties

The NaiveBayesLearner operator provides the following properties.

Name	Type	Description
learningColumns	List<String>	The list of columns to be used to predict the output value. Default of empty list means "everything but targetColumn".
targetColumn	String	The name of the column to be predicted. Must be a column of type string.

Ports

The NaiveBayesLearner operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data. String fields are assumed to be categorical. Double fields are assumed to be numerical. All other fields are ignored.

The NaiveBayesLearner operator provides a single output port.

Name	Type	Get Method	Description
model	PMMLPort	getModel()	The Naive Bayes PMML model.

Using the NaiveBayesPredictor Operator

The NaiveBayesPredictor operator applies a previously built Naive Bayes model to the input data. The base algorithm used is specified at http://www.dmg.org/v4-0-1/NaiveBayes.html, with the following differences:

• Provides the ability to predict based on numerical data. For numerical data, we compute probability based on the assumption of a Gaussian distribution.

• We use Laplace smoothing in place of the "threshold" parameter.

• Calculation is performed in terms of log-likelihood rather than likelihood.

Code Example

Using the NaiveBayesPredictor operator in Java

// Create the Naive Bayes predictor operator and add it to a graph
NaiveBayesPredictor predictor = graph.add(new NaiveBayesPredictor());
predictor.setAppendProbabilities(false);

// Connect the predictor to an input data and model source
graph.connect(dataSource.getOuptut(), predictor.getInput());
graph.connect(modelSource.getOutput(), predictor.getModel());

// The output of the predictor is available for downstream operators to use

Using the NaiveBayesPredictor operator in RushScript

// Apply a naive Bayes model to the given data
var classifiedData = dr.naiveBayesPredictor(model, data, {appendProbabilities:false});

Properties

The NaiveBayesPredictor operator provides the following properties.

Name	Type	Description
appendProbabilities	boolean	Whether to include probabilities in the prediction. Default: true.
laplaceCorrector	double	The Laplace corrector to be used. The Laplace corrector is a way to handle "zero" counts in the training data. Otherwise a value that was never observed in the training data results in zero probability. The default of 0.0 means no correction. The "threshold" value specified in the PMML model will always be ignored in favor of the Laplace corrector specified on a NaiveBayesPredictor.
ignoreMissingValues	boolean	Whether to ignore missing values. If set to true, missing values are ignored for the purposes of prediction; otherwise missing values are considered when calculating probability distribution. Default: true.
probabilityPrefix	String	The field name prefix to use for probabilities. Default: "probability_"
winnerField	String	The name of the winner field to output. Default: "winner"

Ports

The NaiveBayesPredictor operator provides the following input ports.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data to which the Naive Bayes model is applied.
model	PMMLPort	getModel()	Naive Bayes model in PMML to apply.

The NaiveBayesPredictor operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	Results of applying the model to the input data.

Predictive Modeling Markup Language (PMML)

The Predictive Modeling Markup Language (PMML) is an XML-based markup language used extensively by DataFlow. PMML provides a vendor-independent method of defining mining models that can be used by many of the DataFlow analytics operators. Any operator that consumes or produces a PMML model will do so through the use of a PMMLPort.

Most operators have a method called getModel() that exposes access to the PMMLPort. There are also various utility methods and classes that can be used to manipulate PMML documents in the com.pervasive.dataflow.analytics.pmml package.

Note: If you are using KNIME version 2.10 or later, then DataFlow uses PMML version 4.2.x. If you are using KNIME 2.9.x, DataFlow uses PMML version 4.0 or 4.1.

PMML Operators

• Using the ReadPMML Operator to Read PMML

• Using the WritePMML Operator to Write PMML

Using the ReadPMML Operator to Read PMML

The ReadPMML operator reads a PMML object model from a PMML file.

Code Example

Using the ReadPMML operator in Java

// Create a PMML reader
ReadPMML reader = graph.add(new ReadPMML("data/model.pmml"));

// You can also set the PMML file to read from after constructing the operator
ReadPMML reader = graph.add(new ReadPMML());
reader.setFilePath("data/model.pmml");

Using the ReadPMML operator in RushScript

// Read a PMML model
var model = dr.readPMML({filePath:'data/model.pmml'});

Properties

The ReadPMML operator provides one property.

Name	Type	Description
filePath	String	The path to the PMML file that will be read.

Ports

The ReadPMML operator provides a single output port.

Name	Type	Get Method	Description
output	PMMLPort	getOutput()	The output port that will contain the PMML model that is read from the file.

Using the WritePMML Operator to Write PMML

The WritePMML operator is used to write a PMML object model to a PMML file.

Code Example

Using the WritePMML operator in Java

// Create a PMML writer
WritePMML writer = graph.add(new WritePMML("results/model.pmml"));

// You can also set the PMML file to write to after constructing the operator
WritePMML writer = graph.add(new WritePMML());
writer.setTargetPathName("results/model.pmml");

Using the WritePMML operator in RushScript

// Write a PMML model to a local file
dr.writePMML(model, {targetPathName:'results/model.pmml'});

Properties

The WritePMML operator provides the following properties.

Name	Type	Description
targetPathName	String	The path to the PMML file that will be written.

Ports

The WritePMML operator provides a single input port.

Name	Type	Get Method	Description
input	PMMLPort	getModel()	The input port that passes a PMML model to be written to a file.

Regression Analysis

Regression Learning within DataFlow

Regression analysis includes many techniques for modelling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables.

Regression analysis is applied when you want to understand how the typical value of the dependent variable changes as the independent variables are varied. It is therefore useful in finding the relationship between variables. This information can then be used to predict the expected values of a variable given the current values of the other variables.

Covered Regression Operators

• Using the LinearRegressionLearner Operator to Learn Linear Regressions

• Using the RegressionPredictor Operator to Apply Regression Models

• Using the LogisticRegressionLearner Operator to Perform Stochastic Gradient Descent

• Using the LogisticRegressionPredictor Operator to Apply Classification Models

Using the LinearRegressionLearner Operator to Learn Linear Regressions

The LinearRegressionLearner operator performs a multivariate linear regression on the given training data. The output is a PMML model describing the resultant regression model. The model consists of the y-intercept and coefficients for each of the given independent variables.

A dependent variable must be specified. This is a field in the input that is the target of the linear regression model. One or more independent variables are also required from the input data.

This operator supports numeric as well as categorical data as input. The linear regression is performed using an Ordinary Least Squares (OLS) fit. Dummy Coding is used to handle categorical variables.

This approach requires for each of the categorical variables one value from its domain to be chosen that serves as reference for all other values in that domain during the computation of the model. Specifying reference values using operator's API is optional. If for a certain categorical variable no reference value is specified by the user, it will be randomly chosen.

The output is an estimate of coefficients for the model:

Y = a + (b1*x1 + ... + bn*xn) + (0*w1 ref + c1,1*w1,1 + ... + c1,k 1 *w1,k1 + ... + 0*w m ref + c m,1 *w m,1 + ... + c m,km *w m,km )

where

• a is the constant term (aka the intercept)

• n is the number of numeric input variables

• bi , 0 < i ≤ n, is the coefficient for numerical input variable xi

• m is the number of categorical input variables

• wiref , 0 < i ≤ m, is the reference value of the categorical variable wi

• ki , 0 < i ≤ m, is the domain size of the categorical variable wi

• ci,j, 0 < i ≤ m, 0 < j ≤ ki , is the coefficient for the jth nonreference value wi,j of the ith categorical input variable wi

The following assumptions are made about the nature of input data:

• Independent variables must be linearly independent from each other.

• Dependent variable must be noncategorical (that is, continuous and not discrete).

• All variables loosely follow the normal distribution.

Code Example

This example uses linear regression to create a predictive model based on a simple data set. It uses the field "y" as the dependent variable and the fields "x1" and "x2" as the independent variables. This example produces a PMML model that is persisted. This model can then be used with the RegressionPredictor operator to predict data values.

Using the LinearRegressionLearner operator in Java

// Run Linear Regression with y as the dependent variable field
// and x1 and x2 as the independent variable fields. x2 is a
// categorical variable field.
LinearRegressionLearner lrl = graph.add(new LinearRegressionLearner()); lrl.setDependentVariable("y"); lrl.setIndependentVariables("x1", "x2");
// Passing in reference values for categorical variable fields
// is optional. If for a certain categorical variable field no
// reference value is passed in, a value from variable field's
// domain is randomly chosen.
Map<String, String> varToRef = new HashMap<String, String>();
varToRef.put("x2", "blue");
lrl.setReferenceValues(varToRef);

Using the LinearRegressionLearner operator in RushScript

// Run Linear Regression with y as the dependent variable field
// and x1 and x2 as the independent variable fields. x2 is a
// categorical variable field.

// Passing in reference values for categorical variable fields
// is optional. If for a certain categorical variable field no
// reference value is passed in, a value from variable field's
// domain is randomly chosen.
var results = dr.linearRegressionLearner(data, {dependentVariable:'y',
independentVariables:['x1', 'x2'], referenceValues:{'x2':'blue'}});

Properties

The LinearRegressionLearner operator provides the following properties.

Name	Type	Description
dependentVariable	String	The field name of the dependent variable to use in the calculations.
independentVariables	String...	A list of fields to use as independent variables in the calculations.
referenceValues	Map<String,String>	A mapping from categorical variable field names to reference values.
singularityThreshold	Double	A threshold value used to determine effective singularity in LU decomposition. Default: Double.MIN_VALUE

Ports

The LinearRegressionLearner operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The training data used to compute the linear regression.

The LinearRegressionLearner operator provides a single output port.

Name	Type	Get Method	Description
output	PMMLPort	getOutput()	The model describing the resulting regression model.

Using the RegressionPredictor Operator to Apply Regression Models

The RegressionPredictor operator applies a previously built regression model to the input data. The model defines the independent variables used to create the model. The input data must contain the same dependent and independent variable fields as the training data that was used to build the model. The predicted value is added to the output dataflow as an additional field. All input fields are also transferred to the output.

Code Example

Using the RegressionPredictor operator in Java

// Create the regression predictor operator and add it to a graph
RegressionPredictor predictor = graph.add(new RegressionPredictor());

// Connect the predictor to an input data and model source
graph.connect(dataSource.getOutput(), predictor.getInput());
graph.connect(modelSource.getOutput(), predictor.getModel());

// The output of the predictor is available for downstream operators to use

Using the RegressionPredictor in RushScript

// Apply a regression model to the given data
var results = dr.regressionPredictor(model, data);

Properties

The RegressionPredictor operator provides the following properties.

Name	Type	Description
predictedFieldSuffix	String	The suffix to be added to the target field name. The suffix is set to "(predicted)" by default.

Ports

The RegressionPredictor operator provides the following input ports.

Name	Type	Get Method	Description
input	RecordPort	getInput()	Input data to which the regression model is applied.
model	PMMLPort	getModel()	Regression model in PMML to apply.

The RegressionPredictor operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	Results of applying the regression model to the input data. Contains a new field with the predicted value.

Using the LogisticRegressionLearner Operator to Perform Stochastic Gradient Descent

The LogisticRegressionLearner operator performs a stochastic gradient descent, a probabilistic algorithm for very large data sets, on the given training data. The output is a PMML model describing the resultant classification model.

A single dependent variable must be specified. This is a field in the input data that is the target of the logistic regression model. One or more independent variables are also required from the input data.

This operator supports numerical and categorical data as input.

Take the following considerations into account when composing the operator:

• The dependent variable must be categorical.

• Very low learning rates can cause the algorithm to take a long time to converge. It is often better to set the learning rate too high and let the algorithm lower it automatically.

• The learning rate adjustment takes an iteration to perform, so if it is set high, make sure to adjust the maxIterations property appropriately.

• The smaller the training data set, the higher the maxIterations property should be set.

• The algorithm used by the operator is much more accurate when the training data set is relatively large.

• Input data that includes null or infinite values can produce nonsensical results.

Code Example

This example uses logistic regression to create a predictive model based on the Iris data set. It uses the "class" field as the dependent variable and the remaining fields as the independent variables. This operator produces a PMML model that may be persisted. The resultant model can be used with the RegressionClassifier operator to predict data values.

Using the LogisticRegressionLearner operator in Java

// Run Logistic Regression with "class" as the target field
// and the sepal and petal length and width as independent variables
LogisticRegressionLearner learner = graph.add(new LogisticRegressionLearner());
learner.setLearningColumns(Arrays.asList("sepal length","sepal width","petal length","petal width"));
learner.setTargetColumn("class");

Using the LogisticRegressionLearner operator in RushScript

// Run Logistic Regression with "class" as the target field
// and the sepal and petal length and width as independent variables
var model = dr.logisticRegressionLearner(data, {targetColumn:'class', learningColumns:['sepal length', 'sepal width', 'petal length', 'petal width']});

Properties

The LogisticRegressionLearner operator provides the following properties.

Name	Type	Description
learningColumns	List<String>	A list of fields to use as independent variables in the calculations.
targetColumn	String	The field name of the dependent variable.
ridge	double	The regularization constant, also called lambda. The regularization constant penalizes very large coefficients and is sometimes necessary for convergence. Must be greater or equal to 0.0 and less than 1.0. It should generally be small.
learningRate	double	The learning rate that should be used at the start of the computation. This is a maximum value; the algorithm may reduce the rate if it is likely to result in divergence. The rate must be positive and should be less than 1.0. If set much higher, additional iterations will likely need to be used to adjust the rate to a more reasonable value, although this is not always the case.
maxIterations	int	The maximum number of iterations attempted before generating a model. Note that more iterations can produce a more accurate model at the cost of much greater run time.
tolerance	double	The strictness of the convergence criteria as a fraction of the total length of the coefficient vector. Note that a threshold much higher than the learning rate or above 1.0 can result in premature convergence detection.
seed	long	The seed for the random number generator used by the algorithm. The main use of the seed is to randomly reorder the input. Note that even with the same seed, results may vary based on engine settings such as the number of partitions.
maxDistinctNominalValues	int	The maximum number of distinct nominal values to allow. Attributes with more than this number of distinct values will be filtered from the model. Default: 1000.

Ports

The LogisticRegressionLearner provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The training data used to compute the logistic regression.

The LogisticRegressionLearner operator provides a single output port.

Name	Type	Get Method	Description
output	PMMLPort	getModelOutput()	The model describing the resulting classification model.

Using the LogisticRegressionPredictor Operator to Apply Classification Models

The LogisticRegressionPredictor operator implements a maximum-likelihood classifier using regression models to predict the relative likelihoods of the various categories. The operator is used to apply a previously built classification model to the input data.

The model defines the independent variables used to create the model. The input data must contain the same dependent and independent variable fields as the training data that was used to build the model.

The predicted categories are added to the output dataflow as an additional field. All input fields are also transferred to the output. The output also includes a confidence score for each category of the dependent variable.

Note: The LogisticRegressionPredictor currently only supports models produced by the LogisticRegressionLearner operator.

Code Example

Using the RegressionClassifier operator in Java

// Create the regression predictor operator and add it to a graph
LogisticRegressionPredictor classifier = graph.add(new LogisticRegressionPredictor());

// Connect the predictor to an input data and model source
graph.connect(dataSource.getOutput(), classifier.getInput());
graph.connect(modelSource.getOutput(), classifier.getModel());

// The output of the classifier is available for downstream operators to use

Using the RegressionClassifier operator in RushScript

// Classify the input data using the given model
var classifiedData = dr.logisticRegressionPredictor(model, data);

Properties

The LogisticRegressionPredictor operator provides the following properties.

Name	Type	Description
winnerField	String	The name of the winner field to output. Default: "winner"

Ports

The LogisticRegressionPredictor operator provides the following input ports.

Name	Type	Get Method	Description
input	RecordPort	getInput()	Input data to which the classification model is applied.
model	PMMLPort	getModel()	Classification model in PMML to apply.

The LogisticRegressionPredictor operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	Results of applying the classification model to the input data. Contains a new field with the winner category and a field with the probability score for each category of the input dependent variable.

Last modified date: 06/14/2024