Cluster Analysis Operators
Cluster analysis includes techniques for grouping entities based on their characteristics and for describing the resulting groups, also known as clusters. This is done by applying clustering algorithms to the attributes that make entities similar. The goal is to create clusters such that entities in the same cluster are similar with regard to the relevant attributes, while entities from different clusters are dissimilar.
Cluster analysis is a common task in data mining as well as statistical data analysis. It helps to organize observed data and build taxonomies. DataFlow provides operators that can be used for cluster analysis. For more information, refer to the following topics:
KMeans Operator
The KMeans operator performs k-means computation for the given input. All included fields in the given input must be of type double. The k-means algorithm chooses k random data points as the initial centroids. Computation ends when one of the following conditions is true:
• MaxIterations is exceeded.
• For each centroid, there is no significant difference from the corresponding centroid from the previous iteration when compared with the configured quality.
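The initialization and the two stopping conditions above can be illustrated with a minimal, standalone sketch in plain Java. This is not the KMeans operator's implementation: the class name KMeansSketch, its method names, the Euclidean distance, and the tolerance parameter (standing in for the configured quality) are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch only; not the DataFlow KMeans operator.
class KMeansSketch {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Picks k distinct random data points as initial centroids (partial Fisher-Yates shuffle). */
    static double[][] initialCentroids(double[][] data, int k, Random random) {
        if (k < 1 || k > data.length) {
            throw new IllegalArgumentException("k must be between 1 and the number of data points");
        }
        int[] order = new int[data.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            int j = c + random.nextInt(order.length - c);
            int tmp = order[c];
            order[c] = order[j];
            order[j] = tmp;
            centroids[c] = data[order[c]].clone();
        }
        return centroids;
    }

    /** Index of the centroid closest to the given point. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (distance(point, centroids[c]) < distance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Repeats the assign-and-update step until no centroid moves by more than
     * the tolerance, or until maxIterations steps have been performed.
     */
    static double[][] refine(double[][] data, double[][] initial, int maxIterations, double tolerance) {
        int k = initial.length;
        int dims = data[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initial[c].clone();
        }
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (double[] point : data) {
                int c = nearest(point, centroids);
                counts[c]++;
                for (int d = 0; d < dims; d++) {
                    sums[c][d] += point[d];
                }
            }
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                double[] updated = new double[dims];
                for (int d = 0; d < dims; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                if (distance(updated, centroids[c]) > tolerance) {
                    converged = false;
                }
                centroids[c] = updated;
            }
            if (converged) {
                break; // no significant change from the previous iteration
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] data = {{1, 1}, {1, 2}, {2, 1}, {2, 2}, {8, 8}, {8, 9}, {9, 8}, {9, 9}};
        double[][] start = initialCentroids(data, 2, new Random(42));
        System.out.println(Arrays.deepToString(refine(data, start, 10, 1e-6)));
    }
}
```

Because the initial centroids are chosen at random, different seeds can lead to different final centroids. The sketch separates initialization from refinement so that the refinement step can be run with known starting points.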
Code Example
The following example demonstrates computing the clusters for the Iris data set using k-means. The input data contains four numeric fields, but only three are used, so they must be explicitly specified.
Using the KMeans operator in Java
// Run k-means to create cluster assignments
KMeans kmeans = graph.add(new KMeans());
kmeans.setK(3);
kmeans.setMaxIterations(10);
kmeans.setDistanceMeasure(DistanceMeasure.EUCLIDEAN);
kmeans.setIncludedColumns(Arrays.asList("sepal length", "sepal width", "petal length"));
Using the KMeans operator in RushScript
var results = dr.kmeans(data, {
k:3,
maxIterations:10,
distanceMeasure:'EUCLIDEAN',
includedColumns:['sepal length', 'sepal width', 'petal length']});
Properties
The KMeans operator provides the following properties.
Ports
The KMeans operator provides a single input port.
The KMeans operator provides the following output port.
ClusterPredictor Operator
The ClusterPredictor operator assigns input data to clusters based on the provided PMML clustering model. If the model provides explicit cluster IDs, they are used for the assignment. Otherwise, the implicit 1-based index, indicating the position in which each cluster appears in the model, is used as the ID.
The input data must contain the same fields as the training data that was used to build the model (in the PMML model: clustering fields with the attribute "isCenterField" set to "true"), and these fields must be of type double, float, long, or int. The resulting assignments are included in the output alongside the original input data.
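The ID rule described above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the ClusterPredictor operator's code: the class ClusterAssignmentSketch, the Cluster type, and the assign method are hypothetical, the model is assumed to be center-based, and squared Euclidean distance is assumed as the comparison measure. Input values of type float, long, or int are assumed to have been widened to double beforehand.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the DataFlow ClusterPredictor operator.
class ClusterAssignmentSketch {

    /** A cluster center with an optional explicit ID (null when the model provides none). */
    static final class Cluster {
        final String id;
        final double[] center;

        Cluster(String id, double[] center) {
            this.id = id;
            this.center = center;
        }
    }

    /** Squared Euclidean distance; sufficient for finding the nearest center. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Returns the explicit ID of the nearest cluster, or its 1-based position
     * in the model when no explicit ID is present.
     */
    static String assign(double[] row, List<Cluster> clusters) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < clusters.size(); i++) {
            double d = squaredDistance(row, clusters.get(i).center);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        String id = clusters.get(best).id;
        return id != null ? id : String.valueOf(best + 1);
    }

    public static void main(String[] args) {
        List<Cluster> model = Arrays.asList(
                new Cluster(null, new double[] {0, 0}),
                new Cluster(null, new double[] {10, 10}));
        System.out.println(assign(new double[] {9, 8}, model));
    }
}
```

In this sketch the winning label plays the role of the value written to the field named by winnerFieldName in the examples that follow.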
Code Example
Using the ClusterPredictor operator in Java
// Create the cluster predictor operator and add it to a graph
ClusterPredictor predictor = graph.add(new ClusterPredictor());
// Set the name of the field containing the cluster assignments
predictor.setWinnerFieldName("label");
// Connect the predictor to an input data source and a model source
graph.connect(dataSource.getOutput(), predictor.getInput());
graph.connect(modelSource.getOutput(), predictor.getModel());
// The output of the predictor is available for downstream operators to use
Using the ClusterPredictor operator in RushScript
// Apply a clustering model to the given data
var results = dr.clusterPredictor(model, data, {winnerFieldName:"label"});
Properties
The ClusterPredictor operator provides the following properties.
Ports
The ClusterPredictor operator provides the following input ports.
The ClusterPredictor operator provides a single output port.
Last modified date: 03/10/2025