Analytics Nodes
Association Rules
ARM Model Converter
FP-growth
Frequent Items
Classifiers
Decision Tree Learner
Decision Tree Predictor
Decision Tree Pruner
K-Nearest Neighbors Classifier
Naive Bayes Learner
Naive Bayes Predictor
SVM Learner
SVM Predictor
Clustering
Cluster Predictor
k-Means
Regression Nodes
Linear Regression (Learner)
Logistic Regression (Learner)
Logistic Regression (Predictor)
Regression (Predictor)
Viz
Diagnostics Chart Drawer
Association Rules
ARM Model Converter
ARM Model Converter converts a PMML model containing association modeling results into the selected format.
FP-growth
FP-growth mines input transactions for frequent item sets and association rules using the FP-growth algorithm.
Frequent Items
Frequent Items discovers frequent items in a dataset containing items segregated by transactions.
ARM Model Converter
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ConvertARMModel Operator to Convert Association Models from PMML.
Converts the given association PMML model into the selected target type. The results of the conversion are written to the local file system at the given target path name.
Dialog Options
Target Path
Specifies the path name of the file to write in the local file system with the results of the conversion.
Conversion Type
Specifies the type of conversion to apply to the input model.
Ports
Input Ports
0 - Input association rules PMML model.
FP-growth
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the FPGrowth Operator to Determine Frequent Pattern Growth.
The FP-growth node mines input transactions for frequent item sets and association rules using the FP-growth algorithm. A PMML association model is generated as a result.
The input dataset must consist of at least two fields: a transaction identifier and an item name. The transaction identifier is used to define transaction boundaries (which items are contained within which transactions). The item name uniquely identifies an item, which may be a SKU, an item category or any other field that represents an entity within the context of a transaction.
This node outputs PMML version 3.2 using the association model definition. The resultant PMML may be written to a file using a PMML writer node or transformed in other ways by downstream nodes. The PMML association model captures aggregate information about the transaction dataset, the frequent items and frequent item sets, and the association rules.
Dialog Options
Transaction Field
Specifies the name of the input field containing the transaction identifier. This field is used to find transaction boundaries for determining item frequencies and item sets.
Item Field
Specifies the name of the input field containing the item identifiers.
Minimum Support
Specifies the minimum support that an item must have in the input transaction dataset to be considered frequent. This option must be in the range from 0.0 to 1.0 (exclusive). It represents the percentage of transactions that a frequent item must participate in at a minimum.
Minimum Confidence
Specifies the minimum confidence that an association rule must have in the input transaction dataset to be considered interesting. This option must be in the range from 0.0 to 1.0 (exclusive). The confidence of a rule is defined as the support of the combined antecedent and consequent item sets divided by the support of the antecedent item set.
K
Item sets with a cardinality of K or smaller are generated. Larger item sets are filtered from the output and not included in the resultant PMML model. The default value is 0, which indicates that all frequent item sets, no matter the size, should be included in the resultant model.
Annotation
Includes the given text in the annotation element of the resultant PMML model. This is useful to annotate a model as needed for later recognition.
Label Field
Specifies the name of the input field containing the optional transaction identifier labels.
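The support and confidence measures described under Minimum Support and Minimum Confidence can be illustrated with a minimal Python sketch. It is not the FP-growth algorithm and not the node's implementation. The function names and the sample rows are hypothetical; only the definitions of support and confidence come from the option descriptions above.

```python
from collections import defaultdict

def transactions_from_rows(rows):
    """Group (transaction_id, item) rows into one item set per transaction."""
    baskets = defaultdict(set)
    for transaction_id, item in rows:
        baskets[transaction_id].add(item)
    return list(baskets.values())

def support(baskets, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Support of antecedent and consequent combined, divided by support of the antecedent."""
    combined = set(antecedent) | set(consequent)
    return support(baskets, combined) / support(baskets, antecedent)

# Hypothetical input: one row per (transaction identifier, item name).
rows = [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "butter"),
    (3, "bread"), (3, "milk"), (3, "butter"),
    (4, "milk"),
]
baskets = transactions_from_rows(rows)
print(support(baskets, {"bread", "milk"}))       # 2 of 4 transactions
print(confidence(baskets, {"bread"}, {"milk"}))  # 0.5 / 0.75
```

In this sample, the rule bread -> milk has a support of 0.5 and a confidence of about 0.67, so it would pass a Minimum Support of 0.4 and a Minimum Confidence of 0.6.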
Ports
Input Ports
0 - Input transaction dataset to mine for frequent item sets and association rules.
Output Ports
0 - Output port containing the resultant PMML association model.
Views
Association Model – Item Sets and Association Rules
Provides a tabular presentation of the discovered frequent item sets and association rules. You can click the columns of the frequent item sets and the association rules to change their sort order.
Association Rules Graph – Bubble chart of the Association Rules
Provides a bubble chart of the association rules. The support of each rule is graphed on the X-axis. The confidence values are plotted on the Y-axis. The lift values are used to size the bubble for each rule. To view the label for a rule, place the cursor over a bubble in the chart. The fly-over text will provide more information about the rule including the antecedent, consequent, and the lift value.
Frequent Items
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the FrequentItems Operator to Compute Item Frequency.
The frequent items node computes the frequent items contained within the input data set. Items are considered frequent if they appear in a percentage of transactions greater than the given minimum support property.
Dialog Options
Transaction Field
Specifies the name of the input field containing the transaction identifier. This field is used to find transaction boundaries for determining item frequencies and item sets.
Item Field
Specifies the name of the input field containing the item identifiers.
Minimum Support
Specifies the minimum support that an item must have in the input transaction dataset to be considered frequent. This option must be in the range from 0.0 to 1.0 (exclusive). It represents the percentage of transactions that a frequent item must participate in at a minimum.
Label Field
Specifies the name of the input field containing the optional transaction identifier labels.
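As a rough illustration of the frequency test, the following Python sketch counts in how many transactions each item appears and keeps the items whose share of transactions is greater than the minimum support. The function name and sample rows are hypothetical, and the sketch runs in memory on a single machine, unlike the node.

```python
from collections import Counter

def frequent_items(rows, min_support):
    """Return {item: support} for items whose support exceeds min_support."""
    pairs = set(rows)  # an item counts once per transaction
    transaction_count = len({transaction_id for transaction_id, _ in pairs})
    counts = Counter(item for _, item in pairs)
    return {
        item: count / transaction_count
        for item, count in counts.items()
        if count / transaction_count > min_support
    }

# Hypothetical input: one row per (transaction identifier, item name).
rows = [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "butter"),
    (3, "bread"), (3, "milk"), (3, "butter"),
    (4, "milk"),
]
result = frequent_items(rows, min_support=0.5)
print(result)
```

Here bread and milk each appear in 3 of 4 transactions and are kept. Butter appears in exactly half of the transactions, which is not greater than 0.5, so it is dropped.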
Ports
Input Ports
0 - Input transaction dataset to mine for frequent item sets and association rules.
Output Ports
0 - Output port containing the discovered frequent items.
1 - Partially constructed association model.
Classifiers
Decision Tree Learner
Decision Tree Learner creates a Decision Tree PMML model for the given input data.
Decision Tree Predictor
Decision Tree Predictor performs classification of input data based on a Decision Tree PMML model.
Decision Tree Pruner
Decision Tree Pruner prunes a Decision Tree Model.
K-Nearest Neighbors Classifier
K-Nearest Neighbors Classifier classifies or predicts unlabeled data using the k-nearest neighbors algorithm.
Naive Bayes Learner
Naive Bayes Learner creates a Naive Bayes PMML model for the given input data.
Naive Bayes Predictor
Naive Bayes Predictor performs classification of input data based on a Naive Bayes PMML model.
SVM Learner
SVM Learner builds a support vector machine model.
SVM Predictor
SVM Predictor performs classification based on a support vector machine model.
Decision Tree Learner
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DecisionTreeLearner Operator to Construct a Decision Tree PMML Model.
This node creates a Decision Tree PMML model for the input data. The implementation is primarily based on Ross Quinlan’s book C4.5: Programs for Machine Learning. This implementation and Quinlan’s C4.5 implementation have the following key features and limitations:
Support for both numerical and categorical attributes.
Support only for categorical predictions.
Use Information Gain/Gain Ratio as the measure of quality.
Handle the missing values using fractional cases. This corresponds to the PMML aggregateNodes missing-value strategy.
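The two quality measures can be sketched in Python for a single categorical attribute. This is the textbook definition of information gain and gain ratio, offered as an illustration rather than the node's code. The function names and sample values are hypothetical, and missing values (fractional cases) are not handled.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())

def gain_and_gain_ratio(attribute_values, labels):
    """Information gain and gain ratio of splitting on one categorical attribute."""
    total = len(labels)
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    # Weighted entropy of the target that remains after the split.
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    gain = entropy(labels) - remainder
    # Split information penalizes attributes with many small partitions.
    split_info = entropy(attribute_values)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

# Hypothetical data: the attribute separates the classes perfectly.
print(gain_and_gain_ratio(["sunny", "sunny", "rain", "rain"], ["no", "no", "yes", "yes"]))
# Hypothetical data: the attribute tells nothing about the class.
print(gain_and_gain_ratio(["a", "a", "b", "b"], ["yes", "no", "yes", "no"]))
```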
Following are the differences between this implementation and Quinlan’s C4.5 implementation:
Parallel/distributed implementation.
Scales to data sets that are too large for memory.
C4.5 rule generation is not supported. C4.5 is a software distribution that includes several executables. Our primary focus is the decision tree.
Subtree raising is not supported as part of the pruning strategy because it adds significant processing time.
Currently limited to single-tree. Automatic cross-validation and tree selection are not supported.
The implementation (from a scaling and parallelism standpoint) is based on the following papers:
“SPRINT: A Scalable Parallel Classifier for Data Mining.”
“ScalParC: A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Data Sets.”
Memory Requirements
The minimum memory required is 13 bytes of RAM for each row of data. This supports the row-mapping data structure and is distributed throughout the cluster. For example, if there are 10 nodes in the cluster and 100 million rows of data, then 13 * 100 million / 10 = 130 million bytes, or about 130 MB of RAM per node, is required.
If the dataset contains null values, the minimum memory required may be larger. Each row that must be split among child nodes requires an extra n*12+12 bytes of bookkeeping, where n is the number of children of the split.
If the “load data set into memory” option is used, the memory requirements are much larger because the attribute tables must be held in memory. Attribute tables require about 32 bytes for each row and attribute. In addition, working space is required when attributes are split, so add 1 to the number of attributes when calculating the requirement. Finally, unknown (null) values may increase memory use, because splitting on an unknown value requires adding the row in question to both child nodes.
Note:  Attribute tables are distributed throughout the cluster, and the memory requirements for attributes scale out similar to the row-mapping structure mentioned above.
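The estimates in this section can be turned into a small calculator. The Python sketch below only restates the figures given above (13 bytes per row, n*12+12 bytes per split row, about 32 bytes per row and attribute). The function names are hypothetical, and treating the working space as one extra attribute is an interpretation of the text, not a documented formula.

```python
def row_mapping_bytes_per_node(rows, cluster_nodes):
    """Minimum RAM per cluster node for the row-mapping structure (13 bytes per row)."""
    return 13 * rows / cluster_nodes

def null_bookkeeping_bytes(split_rows, children):
    """Extra bytes for rows with nulls that must be split among `children` child nodes."""
    return split_rows * (children * 12 + 12)

def attribute_table_bytes_per_node(rows, attributes, cluster_nodes):
    """Approximate RAM per node for in-memory attribute tables.

    Assumption: the working space for splits is counted as one extra attribute.
    """
    return 32 * rows * (attributes + 1) / cluster_nodes

MB = 1_000_000
# The example from the text: 100 million rows on a 10-node cluster.
print(row_mapping_bytes_per_node(100_000_000, 10) / MB)          # 130.0 MB per node
# Hypothetical: the same data with 20 attributes loaded into memory.
print(attribute_table_bytes_per_node(100_000_000, 20, 10) / MB)  # 6720.0 MB per node
# Hypothetical: 1 million rows with nulls at a 3-way split.
print(null_bookkeeping_bytes(1_000_000, 3) / MB)                 # 48.0 MB
```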
Dialog Options
Target column
Specifies the column to predict. Must be enumerated type.
Quality measure
Specifies the quality measure to use when determining the best split. The options available are gain and gain ratio.
Min records per-node
Specifies the minimum number of records per node (Prepruning).
Max tree size
Specifies the maximum number of nodes to allow in the tree (Prepruning).
Max distinct nominal values
Specifies the maximum number of distinct nominal values to allow. Attributes with more than this number of distinct values are filtered from the model.
Load dataset into memory
If checked, the attribute tables will be processed in memory. This is generally much faster but carries stronger memory requirements.
Use binary nominal splits
If selected, the learner uses sets of nominal values for splitting. If GAIN is selected as the splitting criterion, it always chooses a binary nominal split. If GAIN RATIO is selected, it chooses whichever split maximizes the gain ratio.
Verbose logging
Provides low-level debug logging describing split selection, and so on, if checked.
Ports
Input Ports
0 - Training data
Output Ports
0 - Decision Tree PMML Model
Views
Tree view
View of the decision tree
Decision Tree Predictor
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DecisionTreePredictor Operator for Decision Tree Predicting.
This node classifies the input data based on a Decision Tree PMML model.
The node displays red (error) at configuration time and turns green after the Decision Tree Learner input node is run and the PMML model is generated.
Dialog Options
Append record counts
If selected, the row counts of several results are included in the classified data.
Append confidences
If selected, the confidences of several results are included in the classified data.
Ports
Input Ports
0 - Decision Tree PMML Model
1 - Data to classify
Output Ports
0 - Classified data
Decision Tree Pruner
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DecisionTreePruner Operator for Decision Tree Pruning.
Prunes a decision tree model. The pruner is separate from the learner to allow experimentation with different pruning configurations without needing to retrain the decision tree. This is a relatively inexpensive operation and thus is not parallelized.
Dialog Options
Pruning confidence level
Reflects the confidence level that the current tree will see equal or fewer errors on the test set than it saw on the training set. Lower values result in more pruning; higher values result in less pruning.
Ports
Input Ports
0 - Decision Tree PMML Model (not pruned)
Output Ports
0 - Decision Tree PMML Model (pruned)
Views
Pruned tree
View of the decision tree after pruning
K-Nearest Neighbors Classifier
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the KNNClassifier Operator to Classify K-Nearest Neighbors.
Classify or predict unlabeled data using the k-nearest neighbors algorithm.
Dialog Options
K
Specifies the number of labeled neighbors to use to classify each unlabeled data point.
Target Feature
Specifies the feature of the training data to use as the label. This field must be of numeric or enumerated type. If the classification method is averaging, then this field must be numeric.
Training Features
Lets you choose the features/columns of the training data to use when computing nearest neighbor distances. These features must be of numeric type. By default, all valid features (except the Target Feature) are used.
Nearness Measure
Specifies the measurement used to determine nearest neighbor data points. Currently, Euclidean distance and cosine similarity are supported.
Classification Method
Specifies the mechanism by which the final classification is determined from the nearest neighbors. Currently, voting and averaging are supported. Averaging is only valid if the target feature is continuous.
In-Memory Data Size (MB)
Specifies the amount of memory to use (in MB) when buffering chunks of the data into memory. The higher this value, the fewer passes over the data (yielding faster performance).
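The interaction between K, Nearness Measure, and Classification Method can be sketched in Python. This is a brute-force, in-memory illustration under stated assumptions, not the node's chunked implementation. The function names and sample data are hypothetical, cosine similarity is turned into a distance (1 minus similarity) so that smaller always means nearer, and zero-length vectors are not handled.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity, so that a smaller value means a nearer neighbor."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def knn_predict(training, query, k, distance=euclidean, method="voting"):
    """training: list of (features, label). Returns the predicted label or value."""
    neighbors = sorted(training, key=lambda row: distance(row[0], query))[:k]
    labels = [label for _, label in neighbors]
    if method == "averaging":   # only valid for a numeric target
        return sum(labels) / len(labels)
    return Counter(labels).most_common(1)[0][0]   # majority vote

# Hypothetical labeled training data with an enumerated target.
labeled = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
           ((5.0, 5.0), "b"), ((5.5, 4.5), "b")]
print(knn_predict(labeled, (1.1, 1.0), k=3))

# Hypothetical training data with a numeric target, combined by averaging.
numeric = [((1.0,), 10.0), ((2.0,), 20.0), ((10.0,), 100.0)]
print(knn_predict(numeric, (1.5,), k=2, method="averaging"))
```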
Ports
Input Ports
0 - Labeled training data for k-nearest neighbor classifier.
1 - Unlabeled query data to classify.
Output Ports
0 - The query data labeled with classifications or predicted values.
Naive Bayes Learner
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the NaiveBayesLearner Operator.
Creates a Naive Bayes PMML model for the given input data. The base algorithm used is specified on the Data Mining Group website. In addition, DataFlow provides the following enhancements:
Provides the ability to predict based on numerical data. For numerical data, computes probability based on the assumption of a Gaussian distribution.
Uses Laplace smoothing in place of the threshold parameter.
Provides an option to count missing values. If selected, missing values are treated like any other single distinct value. Probability is calculated in terms of the ratio of missing to non-missing.
Calculation is performed in terms of log-likelihood rather than likelihood so as to guard against numerical underflow in high-dimensional data.
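The enhancements listed above can be illustrated with a short Python sketch. It assumes the usual additive form of Laplace smoothing, (count + corrector) / (class count + corrector * number of distinct values); the exact formula used by DataFlow is not stated here. All function names are hypothetical.

```python
import math

def gaussian_log_pdf(x, mean, variance):
    """Log of the normal density, used for numerical fields."""
    return -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)

def smoothed_log_prob(count, class_count, distinct_values, laplace):
    """Log of an additively smoothed conditional probability for a categorical value.

    Assumption: standard additive smoothing. With laplace=0.0 and count=0 the
    probability is zero and math.log raises ValueError.
    """
    return math.log((count + laplace) / (class_count + laplace * distinct_values))

def log_score(prior, log_likelihoods):
    """Unnormalized log-posterior of one class: log prior plus summed log-likelihoods."""
    return math.log(prior) + sum(log_likelihoods)

# Why logs: multiplying many small likelihoods underflows, summing their logs does not.
product = 1.0
for _ in range(1000):
    product *= 1e-5
log_sum = sum(math.log(1e-5) for _ in range(1000))
print(product)   # 0.0 after underflow
print(log_sum)   # a large negative but finite number

# Hypothetical class with prior 0.5, one numerical field and one categorical
# value that was never observed in 10 training rows (3 distinct values).
print(log_score(0.5, [gaussian_log_pdf(1.0, 0.0, 1.0),
                      smoothed_log_prob(0, 10, 3, 1.0)]))
```

The class with the highest score is the prediction; comparing scores does not require converting them back to probabilities.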
Dialog Options
Target column
Specifies the column to predict.
Ports
Input Ports
0 - Training data
Output Ports
0 - Naive Bayes PMML Model
Naive Bayes Predictor
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the NaiveBayesPredictor Operator.
Classifies input data based on a Naive Bayes PMML model. The base algorithm used is specified on the Data Mining Group website. In addition, DataFlow provides the following enhancements:
Provides the ability to predict based on numerical data. For numerical data, computes probability based on the assumption of a Gaussian distribution.
Uses Laplace smoothing in place of the threshold parameter.
Provides an option to count missing values. If selected, missing values are treated like any other single distinct value. Probability is calculated in terms of the ratio of missing to non-missing.
Calculation is performed in terms of log-likelihood rather than likelihood so as to guard against numerical underflow in high-dimensional data.
Dialog Options
Ignore missing values
If selected, missing values will be ignored for purposes of prediction.
Append probabilities
If selected, the probabilities of various outcomes will be included in classified data.
Laplace corrector
Handles zero counts in the training data. Otherwise a value that was never observed in the training data results in zero probability. The default of 0.0 means no correction.
Note:  The threshold value specified in the PMML model will always be ignored in favor of the Laplace corrector.
Ports
Input Ports
0 - Naive Bayes PMML Model
1 - Data to classify
Output Ports
0 - Classified data
SVM Learner
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the SVMPredictor Operator to Apply a Support Vector Machine Model.
Builds a PMML SupportVectorMachineModel from an input dataset. This supports two types of SVMs: CSvc and one-class. For the CSvc, it generates a PMML model for classification of a target column. For one-class, there is no target column.
Note:  This node is implemented as a wrapper for LIBSVM. Refer to the LIBSVM documentation for additional information regarding the various settings.
Dialog Options
SVM Type
Either CSvc or One Class. For CSvc, you must specify the following:
target column - The target column to be predicted by the generated model. It must be defined as a discrete (ENUMERATED) type.
c - The cost parameter. This controls the penalty parameter of the error term.
epsilon - The tolerance for termination criteria.
For One Class, you must specify the following:
nu - The nu parameter. In LIBSVM, this is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.
epsilon - The tolerance for termination criteria.
Kernel Type
The kernel function to apply. Must be one of the following:
linear - <v1,v2>
radial basis - e^(-gamma*(<v1,v1>+<v2,v2>-2*<v1,v2>))
sigmoid - tanh(gamma*<v1,v2>+coef0)
polynomial - (gamma*<v1,v2>+coef0)^degree
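The four kernel formulas above translate directly into Python, where <v1,v2> is the dot product of two vectors. This is an illustration of the formulas only, not LIBSVM code; the function names and sample vectors are hypothetical.

```python
import math

def dot(v1, v2):
    """<v1,v2>: the dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(v1, v2))

def linear(v1, v2):
    return dot(v1, v2)

def radial_basis(v1, v2, gamma):
    # <v1,v1> + <v2,v2> - 2*<v1,v2> is the squared Euclidean distance.
    return math.exp(-gamma * (dot(v1, v1) + dot(v2, v2) - 2 * dot(v1, v2)))

def sigmoid(v1, v2, gamma, coef0):
    return math.tanh(gamma * dot(v1, v2) + coef0)

def polynomial(v1, v2, gamma, coef0, degree):
    return (gamma * dot(v1, v2) + coef0) ** degree

v1, v2 = (1.0, 2.0), (3.0, 4.0)                        # <v1,v2> = 11
print(linear(v1, v2))                                  # 11.0
print(radial_basis(v1, v2, gamma=0.5))                 # exp(-0.5 * 8) = exp(-4)
print(sigmoid(v1, v2, gamma=0.1, coef0=0.0))           # tanh(1.1)
print(polynomial(v1, v2, gamma=1.0, coef0=1.0, degree=2))  # (11 + 1)^2 = 144.0
```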
Ports
Input Ports
0 - Training data
Output Ports
0 - SVM PMML Model
SVM Predictor
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the SVMPredictor Operator to Apply a Support Vector Machine Model.
The SVM Predictor is a node responsible for classification based on an SVM PMML model. This supports either CSVC SVMs or one-class SVMs. The two cases are distinguished by the presence of MiningFields whose usage type is predicted. If there are zero predicted columns, it is assumed to be a one-class SVM. Otherwise, there must be exactly one predicted column of type enumerated, in which case it is a CSVC SVM.
For CSVC SVMs, the PMML is expected to contain SupportVectorMachine elements with target category and alternate target category populated. Each of the SVMs is evaluated, adding a vote to either target category or alternate target category. The predicted value is the one receiving the most votes.
For one-class SVMs, target category and alternate target category will be ignored. The result will be -1 if the SupportVectorMachine evaluates to a number less than zero, or 1 if greater than zero.
Note:  The alternateTargetCategory attribute is part of the PMML 4.0 specification. In order to operate in PMML 3.2, we currently store this value as a PMML extension. For example:

<Extension name="alternateTargetCategory" value="3"/>
Ports
Input Ports
0 - SVM PMML Model
1 - Data to classify
Output Ports
0 - Classified data
Clustering
DataFlow for KNIME provides the following cluster analysis features:
Cluster Predictor
Cluster Predictor assigns input data to clusters based on the provided PMML model.
k-Means
k-Means computes k-means clustering and outputs a PMML model.
Cluster Predictor
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ClusterPredictor Operator for Cluster Predicting.
Assigns input data to clusters based on the provided PMML clustering model. The explicit cluster IDs are used for the assignment, if the model provides any. Otherwise, the implicit 1-based index, indicating the position in which each cluster appears in the model, is used as ID.
The input data must contain the same fields as the training data used to build the model (in the PMML model, clustering fields with the attribute isCenterField set to true). The resulting assignments are part of the output along with the original input data.
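The ID rule described above (explicit cluster ID if present, otherwise the 1-based position of the cluster in the model) can be sketched in Python. The sketch assumes that a row is assigned to the nearest cluster center by Euclidean distance; a real PMML model carries its own comparison measure. The function name and sample clusters are hypothetical.

```python
import math

def assign_cluster(point, clusters):
    """clusters: list of (cluster_id_or_None, center) in model order.

    Returns the explicit ID of the nearest cluster, or its 1-based index
    when the model provides no ID.
    """
    nearest = min(range(len(clusters)), key=lambda i: math.dist(point, clusters[i][1]))
    cluster_id, _ = clusters[nearest]
    return cluster_id if cluster_id is not None else nearest + 1

# Hypothetical model without explicit IDs: the implicit 1-based index is used.
print(assign_cluster((9.0, 8.0), [(None, (0.0, 0.0)), (None, (10.0, 10.0))]))      # 2
# Hypothetical model with explicit IDs.
print(assign_cluster((9.0, 8.0), [("low", (0.0, 0.0)), ("high", (10.0, 10.0))]))   # high
```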
Dialog Options
winner field name
Specifies the name of the output field containing the cluster assignments.
Ports
Input Ports
0 - PMML clustering model
1 - Input to clustering
Output Ports
0 - The input data labeled with cluster assignments
k-Means
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the KMeans Operator to Compute K-Means.
Computes k-means clustering on the inputs. The k-means algorithm initializes its centroids to the first k rows of input. Within each iteration, it makes cluster assignments based on the distance to the centroids. Iteration stops when cluster centroids stop changing or when the set maximum number of iterations is exceeded.
Dialog Options
number of clusters
Specifies the number of clusters to be created.
max number of iterations
Specifies the limit on the number of iterations. The k-means algorithm terminates when cluster centers stop changing or when the maximum number of iterations is reached.
distance measure
Specifies the measurement used to determine distance to the centroids. Currently, Euclidean distance and cosine similarity are supported.
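The iteration described above can be sketched as a small in-memory Python function. It follows the stated behavior (centroids start as the first k rows; iteration stops when centroids stop changing or the maximum number of iterations is reached) with Euclidean distance only. The function name and sample rows are hypothetical, and the sketch is neither parallel nor the node's implementation.

```python
import math

def k_means(rows, k, max_iterations):
    """Return the final centroids after k-means with first-k-rows initialization."""
    centroids = [tuple(row) for row in rows[:k]]
    for _ in range(max_iterations):
        # Assignment step: each row joins the cluster of its nearest centroid.
        groups = [[] for _ in range(k)]
        for row in rows:
            nearest = min(range(k), key=lambda i: math.dist(row, centroids[i]))
            groups[nearest].append(row)
        # Update step: each centroid moves to the mean of its rows.
        updated = [
            tuple(sum(column) / len(group) for column in zip(*group)) if group else centroids[i]
            for i, group in enumerate(groups)
        ]
        if updated == centroids:   # centroids stopped changing
            break
        centroids = updated
    return centroids

# Hypothetical input with two well-separated groups.
rows = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)]
print(k_means(rows, k=2, max_iterations=10))   # [(1.0, 1.5), (9.0, 9.5)]
```

Because the first k rows seed the centroids, the order of the input rows can influence the result.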
Ports
Input Port
0 - Input to clustering.
Output Port
0 - PMML cluster model
Regression Nodes
Linear Regression (Learner)
Linear Regression (Learner) performs linear regression learning.
Logistic Regression (Learner)
Logistic Regression (Learner) performs logistic regression learning.
Logistic Regression (Predictor)
Logistic Regression (Predictor) predicts a target value using a previously built logistic regression model.
Regression (Predictor)
Regression (Predictor) predicts a target value using a previously built regression model.
Linear Regression (Learner)
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the LinearRegressionLearner Operator to Learn Linear Regressions.
Performs linear regression using an Ordinary Least Squares (OLS) fit of one dependent variable (the target) on one or more independent variables (the predictors). The output is a PMML model containing the estimate of the coefficients for the linear model:
Y = b0 + b1*X1 + ... + bn*Xn
including b0, the constant term (also known as the intercept).
These assumptions are made about the nature of input data:
Independent variables must be linearly independent from each other.
The dependent variable must be numerical (that is, continuous and not discrete).
All variables loosely follow the normal distribution.
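As an illustration of an OLS fit of the model Y = b0 + b1*X1 + ... + bn*Xn, the Python sketch below solves the normal equations with plain Gaussian elimination. It is a teaching sketch under stated assumptions, not the node's algorithm: the function names and data are hypothetical, everything is held in memory, and the solver fails with a division by zero if the independent variables are not linearly independent.

```python
def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                     # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols_fit(predictors, target):
    """Return [b0, b1, ..., bn] minimizing the squared error of the linear model."""
    design = [[1.0] + list(row) for row in predictors]   # leading 1 for the intercept b0
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical data generated from Y = 1 + 2*X1 + 3*X2.
predictors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
target = [1.0, 3.0, 4.0, 6.0, 8.0]
coefficients = ols_fit(predictors, target)
print(coefficients)   # approximately [1.0, 2.0, 3.0]
```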
Dialog Options
Dependent/Target Column
Specifies the name of the dependent variable (also known as the target variable) column.
Independent Variables
Specifies the list of columns to use as independent variables for the linear regression. All supported columns are selected by default.
Reference Values
Configures the reference values for categorical independent variables. If no reference value is provided for a given categorical variable, a randomly chosen value from its domain will be selected. If domain information is not available, you have the option of manually entering the reference value.
Ports
Input Ports
0 - Training data set
Output Ports
0 - Linear regression model
Views
Regression Analysis
Charting of regression analysis results
Regression Statistics
List of model quality indicators
Logistic Regression (Learner)
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the LogisticRegressionLearner Operator to Perform Stochastic Gradient Descent.
This node performs logistic regression learning using stochastic gradient descent with one dependent variable (the target) on one or more independent variables (the predictors). The output is a PMML model containing the estimate of the coefficients for the logistic classification model.
Stochastic gradient descent parallelizes well and is very scalable. It is suitable for use on large amounts of data and scales to run within cluster environments. However, it is very sensitive to small amounts of input data. Using small data sets with the default options may develop incorrect models.
To use this node with small data sets, increase the iterations and lower the tolerance. Modifying these settings is required to ensure correct processing of the data.
Dialog Options
Dependent/Target Column
Specifies the name of the dependent (target) variable column. The dependent variable must be categorical.
Learning Rate
Specifies the rate that is used at the beginning of computation. This is a maximum value that can be reduced if it results in divergence. The rate must be a positive value.
Regularization Constant
The ridge or lambda constant penalizes very large coefficients and is sometimes required for convergence. It must be in the 0<=R<1 range.
Tolerance
Specifies the strictness of the convergence criteria as a fraction of total length of the coefficient vector. When you set a threshold greater than 1 or higher than the Learning Rate, it may result in premature convergence detection.
Maximum Iterations
Specifies the maximum number of iterations to attempt before generating a model. Setting a higher value may result in longer run times but may produce more accurate models.
Random Seed
Specifies the seed for the random number generator used by the algorithm for input reordering and partitioning.
Note:  Even if you use the same seed, the results may vary based on other settings.
Max distinct nominal values
Defines the maximum number of distinct nominal values allowed. Attributes with more than this number of distinct values are filtered from the model.
Independent Variables
Specifies the list of columns to use as independent variables for the logistic regression. Independent variables must be numerical or categorical. By default, all supported columns are selected.
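How the options above interact can be sketched with a minimal single-machine stochastic gradient descent for a binary target, written in Python. This is not the node's algorithm: the function names and data are hypothetical, the convergence test (change in the coefficient vector relative to its length, checked once per pass) is one reading of the Tolerance description, the intercept is left unregularized by assumption, and the automatic reduction of the learning rate on divergence is omitted.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sgd_logistic(rows, labels, learning_rate=0.1, regularization=0.0,
                 tolerance=1e-4, max_iterations=100, seed=0):
    """Return [intercept, b1, ..., bn] for a binary (0/1) target."""
    rng = random.Random(seed)            # seed controls input reordering
    coef = [0.0] * (len(rows[0]) + 1)
    order = list(range(len(rows)))
    for _ in range(max_iterations):
        previous = coef[:]
        rng.shuffle(order)
        for i in order:                  # one update per training row
            x = [1.0] + list(rows[i])
            error = sigmoid(sum(c * v for c, v in zip(coef, x))) - labels[i]
            for j in range(len(coef)):
                penalty = regularization * coef[j] if j > 0 else 0.0
                coef[j] -= learning_rate * (error * x[j] + penalty)
        change = math.sqrt(sum((c - p) ** 2 for c, p in zip(coef, previous)))
        length = math.sqrt(sum(c * c for c in coef))
        if change <= tolerance * length:  # converged relative to coefficient length
            break
    return coef

# Hypothetical, clearly separable data: negative x is class 0, positive x is class 1.
rows = [(-10.0,), (-9.0,), (9.0,), (10.0,)]
labels = [0, 0, 1, 1]
model = sgd_logistic(rows, labels, seed=7)
print(sigmoid(model[0] + model[1] * 9.0) > 0.5)    # True
print(sigmoid(model[0] + model[1] * -9.0) > 0.5)   # False
```

Running the function twice with the same seed and settings gives identical coefficients in this sketch, because the only source of randomness is the seeded row order.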
Ports
Input Ports
0 - Training data set
Output Ports
0 - Logistic regression model
Logistic Regression (Predictor)
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the LogisticRegressionPredictor Operator to Apply Classification Models.
This node predicts a target value by using a previously built logistic regression model. The model defines the independent variables that are used to build it. These independent variables must be available in the input data. A field for the predicted values is added to the output data set.
If the domain of the target variable is unknown, this node is in the error state. If the source data is from delimited text, the domain can be set in the reader, which allows the composition to work. Otherwise, the composition works when the workflow is executed and the upstream learner has run successfully.
Ports
Input Ports
0 - Logistic Regression Model
1 - Input data for prediction
Output Ports
0 - Input data with predicted values
Regression (Predictor)
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the RegressionPredictor Operator to Apply Regression Models.
Predicts a target value using a previously built regression model. The model defines the independent variables used to build the model. These independent variables must be present in the input data. A field for the predicted values is added to the output data set.
Ports
Input Ports
0 - Input data for prediction
1 - Regression model
Output Ports
0 - Input data with predicted values
Viz
Diagnostics Chart Drawer
Diagnostics Chart Drawer computes ROC, Gains, and Lift charts for classification models.
Diagnostics Chart Drawer
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DrawDiagnosticsChart Operator to Draw Diagnostic Charts.
To build diagnostic charts, the Diagnostics Chart Drawer node uses confidence values along with the actual target values (true class). These values are obtained using the input of this node and output of one or multiple predictors.
The supported chart types are:
The ROC (Receiver Operating Characteristics) chart provides a comparison between the true positive rate (y-axis) and false positive rate (x-axis) when the confidence threshold decreases.
The Gains chart (CRC, Cumulative Response Chart) provides the change in true positive rate (y-axis) when percentage of the targeted population (x-axis) decreases.
The Lift chart provides the change in lift (y-axis), which is the ratio between predictor result and baseline when percentage of the targeted population (x-axis) increases.
The true positive rate is the ratio of correct positive classifications to the total number of positive samples in the test set. The false positive rate is the ratio of incorrect positive classifications to the total number of negative samples in the test set.
This node can accept one to five predictors as input sources. Each predictor output must contain a column for the actual target values and confidence values (probability, score) assigned by the given predictor.
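The coordinates behind the three chart types can be sketched in Python for a single predictor. The sketch ranks test instances by confidence and lowers the threshold one instance at a time. It assumes that the lift baseline is random targeting, so that lift equals the true positive rate divided by the targeted fraction; it does not group tied confidence values and needs at least one positive and one negative sample. The function name and sample data are hypothetical.

```python
def chart_points(confidences, actuals, target_value):
    """Return (roc, gains, lift) point lists for one predictor's output."""
    ranked = sorted(zip(confidences, actuals), key=lambda pair: pair[0], reverse=True)
    positives = sum(1 for _, actual in ranked if actual == target_value)
    negatives = len(ranked) - positives
    roc, gains, lift = [], [], []
    true_pos = false_pos = 0
    for n, (_, actual) in enumerate(ranked, start=1):
        if actual == target_value:
            true_pos += 1
        else:
            false_pos += 1
        tpr = true_pos / positives          # true positive rate
        fpr = false_pos / negatives         # false positive rate
        targeted = n / len(ranked)          # fraction of the population targeted
        roc.append((fpr, tpr))
        gains.append((targeted, tpr))
        lift.append((targeted, tpr / targeted))
    return roc, gains, lift

# Hypothetical predictor output: confidence for the true class "yes" and actual values.
roc, gains, lift = chart_points([0.9, 0.8, 0.7, 0.4], ["yes", "no", "yes", "no"], "yes")
print(roc)    # [(0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(gains)  # [(0.25, 0.5), (0.5, 0.5), (0.75, 1.0), (1.0, 1.0)]
print(lift[0])  # (0.25, 2.0)
```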
Dialog Options
Chart
Chart includes:
Result Size
Specifies the number of points in each generated chart. If the provided number is greater than the number of test instances, then the number of test instances is used in the Result Size.
Target Value
Specifies a value of the target domain that defines the 'true class'.
Chart Type
ROC, Gains, or Lift.
PNG Output File (Optional)
Defines the path of the output file (local file system or HDFS). The generated chart is written to that file in PNG format.
Inputs
Inputs include:
Chart Name
Specifies the name of the curve for the input data that will appear in the chart legend.
Confidence Field
Specifies the field containing the confidence values for the true class assigned by the predictor for the input data.
Target Field
Specifies the field containing the actual target value for the input data.
Ports
Input Ports
0 - (Mandatory) Predictor Data
1 - (Optional) Predictor Data
2 - (Optional) Predictor Data
3 - (Optional) Predictor Data
4 - (Optional) Predictor Data
Output Ports
0 - Confusion matrix entries and curve coordinates for each input data set.
Views
Views provide the chart.
Chart
Displays the diagnostics chart for the input data.