Cluster Analysis Operators
Cluster analysis comprises techniques for grouping entities by their characteristics and for describing the resulting groups, known as clusters. Clustering algorithms operate on the attributes that determine how similar the entities are. The goal is to form clusters such that entities in the same cluster are similar with respect to the relevant attributes, while entities in different clusters are dissimilar.
Cluster analysis is a common task in data mining and in statistical data analysis. It helps organize observed data and build taxonomies. DataFlow provides two operators for cluster analysis, KMeans and ClusterPredictor, which are described in the following sections.
KMeans Operator
The KMeans operator performs k-means clustering on the given input. All included fields in the input must be of type double. The k-means algorithm chooses k random data points as the initial centroids. Computation ends when either of the following conditions is true:
The maxIterations limit is exceeded.
No centroid differs significantly from its counterpart in the previous iteration, as judged against the configured quality.
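The iteration and the two stopping conditions described above can be sketched in plain Java. This is a minimal illustration, not the DataFlow implementation: the class name KMeansSketch, the tolerance parameter (standing in for the configured quality), and the seed parameter are all assumptions made for the example, and it assumes k is no larger than the number of data points.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative k-means loop; a sketch only, not the DataFlow KMeans operator. */
class KMeansSketch {

    /** Squared Euclidean distance between two points of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the given point. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Runs k-means and returns the final centroids.
     * tolerance and seed are hypothetical parameters used only in this sketch.
     */
    static double[][] fit(double[][] data, int k, int maxIterations, double tolerance, long seed) {
        Random random = new Random(seed);
        int dims = data[0].length;

        // Choose k distinct random data points as the initial centroids (partial shuffle).
        int[] indices = new int[data.length];
        for (int i = 0; i < indices.length; i++) {
            indices[i] = i;
        }
        for (int i = 0; i < k; i++) {
            int j = i + random.nextInt(indices.length - i);
            int tmp = indices[i];
            indices[i] = indices[j];
            indices[j] = tmp;
        }
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = data[indices[c]].clone();
        }

        for (int iteration = 0; iteration < maxIterations; iteration++) {
            // Assignment step: accumulate each point into its nearest centroid.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (double[] point : data) {
                int c = nearest(point, centroids);
                counts[c]++;
                for (int d = 0; d < dims; d++) {
                    sums[c][d] += point[d];
                }
            }

            // Update step: move each centroid to the mean of its assigned points.
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                double[] updated = new double[dims];
                for (int d = 0; d < dims; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                if (squaredDistance(updated, centroids[c]) > tolerance * tolerance) {
                    converged = false;
                }
                centroids[c] = updated;
            }

            // Stop early once no centroid moved by more than the tolerance.
            if (converged) {
                break;
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] data = {{0.0}, {0.1}, {10.0}, {10.1}};
        System.out.println(Arrays.deepToString(fit(data, 2, 10, 1e-9, 42L)));
    }
}
```

Because the initial centroids are random, the order of the returned centroids, and on less well-separated data the clusters themselves, can vary from run to run.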
Code Example
The following example computes clusters for the Iris data set using k-means. The input data contains four numeric fields, but only three are used, so they must be specified explicitly.
Using the KMeans operator in Java
// Run k-means to create cluster assignments
KMeans kmeans = graph.add(new KMeans());
kmeans.setK(3);
kmeans.setMaxIterations(10);
kmeans.setDistanceMeasure(DistanceMeasure.EUCLIDEAN);
kmeans.setIncludedColumns(Arrays.asList("sepal length", "sepal width", "petal length"));
Using the KMeans operator in RushScript
var results = dr.kmeans(data, {
    k:3,
    maxIterations:10,
    distanceMeasure:'EUCLIDEAN',
    includedColumns:['sepal length', 'sepal width', 'petal length']});
Properties
The KMeans operator provides the following properties.
| Name | Type | Description |
| --- | --- | --- |
| maxIterations | int | The maximum number of iterations to allow before iteration is halted. Default: 99. |
| k | int | The k value, where k is the number of centroids to compute. Default: 3. |
| includedColumns | List<String> | The list of learning columns to include. An empty list means all columns of type double. |
| distanceMeasure | DistanceMeasure | The distance measure used to measure the distance between two points when building the model. Either EUCLIDEAN or COSINE_SIMILARITY. Default: EUCLIDEAN. |
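To illustrate how the two distanceMeasure options differ, the following plain-Java sketch computes both measures for a pair of points. It is not taken from DataFlow: the class and method names are hypothetical, and how the operator turns a cosine similarity into a distance is not stated here, so the sketch only computes the similarity itself.

```java
/** Illustrative distance measures; names are hypothetical, not DataFlow API. */
class DistanceSketch {

    /** Euclidean distance: sensitive to both direction and magnitude. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Cosine similarity: depends only on the angle between the vectors.
     * Returns NaN if either vector has zero length.
     */
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] p = {1.0, 0.0};
        double[] q = {2.0, 0.0}; // same direction as p, twice the length

        // The points are one unit apart, yet point in exactly the same direction.
        System.out.println("euclidean = " + euclidean(p, q));
        System.out.println("cosine similarity = " + cosineSimilarity(p, q));
    }
}
```

Euclidean distance suits data where magnitude matters, while cosine similarity treats points that differ only in scale as alike.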
Ports
The KMeans operator provides a single input port.
| Name | Type | Get Method | Description |
| --- | --- | --- | --- |
| input |  | getInput() | The training data used to build the model. |
The KMeans operator provides the following output port.
| Name | Type | Get Method | Description |
| --- | --- | --- | --- |
| model |  | getModel() | The resultant PMML cluster model. |
ClusterPredictor Operator
The ClusterPredictor operator assigns input data to clusters based on the provided PMML clustering model. If the model provides explicit cluster IDs, they are used for the assignment. Otherwise, the implicit 1-based index, which indicates the position at which each cluster appears in the model, is used as the ID.
The input data must contain the same fields as the training data that was used to build the model (in the PMML model, the clustering fields with the attribute "isCenterField" set to "true"), and these fields must be of type double, float, long, or int. The resulting assignments are included in the output alongside the original input data.
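The ID rule described above can be sketched in plain Java. This is an illustration under stated assumptions, not the operator's implementation: the Cluster class is a hypothetical stand-in for a cluster entry in a PMML model, and the sketch assumes the winner is the cluster whose center has the smallest squared Euclidean distance to the row, whereas a real model specifies its own comparison measure.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative cluster assignment; a sketch only, not the ClusterPredictor operator. */
class ClusterAssignSketch {

    /** Hypothetical stand-in for one cluster in a PMML clustering model. */
    static final class Cluster {
        final String id;       // null when the model supplies no explicit ID
        final double[] center;

        Cluster(String id, double[] center) {
            this.id = id;
            this.center = center;
        }
    }

    /** Returns the ID of the nearest cluster, or its 1-based position if it has no ID. */
    static String assign(double[] row, List<Cluster> clusters) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < clusters.size(); i++) {
            double[] center = clusters.get(i).center;
            double distance = 0.0;
            for (int d = 0; d < row.length; d++) {
                double diff = row[d] - center[d];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        Cluster winner = clusters.get(best);
        // Prefer the explicit ID; otherwise fall back to the implicit 1-based index.
        return winner.id != null ? winner.id : String.valueOf(best + 1);
    }

    public static void main(String[] args) {
        List<Cluster> unnamed = Arrays.asList(
                new Cluster(null, new double[] {0.0, 0.0}),
                new Cluster(null, new double[] {10.0, 10.0}));
        // The row is closest to the second cluster, which has no explicit ID.
        System.out.println(assign(new double[] {9.0, 9.0}, unnamed));
    }
}
```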
Code Example
Using the ClusterPredictor operator in Java
// Create the cluster predictor operator and add it to a graph
ClusterPredictor predictor = graph.add(new ClusterPredictor());

// Set the name of the field containing the cluster assignments
predictor.setWinnerFieldName("label");

// Connect the predictor to an input data source and a model source
graph.connect(dataSource.getOutput(), predictor.getInput());
graph.connect(modelSource.getOutput(), predictor.getModel());

// The output of the predictor is available for downstream operators to use
Using the ClusterPredictor operator in RushScript
// Apply a clustering model to the given data
var results = dr.clusterPredictor(model, data, {winnerFieldName:"label"});
Properties
The ClusterPredictor operator provides the following properties.
| Name | Type | Description |
| --- | --- | --- |
| winnerFieldName | String | The name of the output field containing the cluster assignments. Default: "winner". |
Ports
The ClusterPredictor operator provides the following input ports.
| Name | Type | Get Method | Description |
| --- | --- | --- | --- |
| input |  | getInput() | Input data to which the clustering model is applied. |
| model |  | getModel() | Clustering model in PMML to apply. |
The ClusterPredictor operator provides a single output port.
| Name | Type | Get Method | Description |
| --- | --- | --- | --- |
| output |  | getOutput() | Results of applying the clustering model to the input data. Contains the original input data as well as a new field with the assigned cluster ID. |
Last modified date: 03/10/2025