DF 8.2 | Cluster Analysis Operators



Cluster Analysis Operators

Cluster analysis covers techniques for grouping entities by their characteristics and for describing the resulting groups, known as clusters. Clustering algorithms are applied to the attributes that make entities similar. The goal is to form clusters such that entities in the same cluster are similar with regard to the relevant attributes, while entities in different clusters are dissimilar.

Cluster analysis is a common task in data mining as well as statistical data analysis. It helps to organize observed data and build taxonomies. DataFlow provides operators that can be used for cluster analysis. For more information, refer to the following topics:

• KMeans Operator

• ClusterPredictor Operator

KMeans Operator

The KMeans operator performs k-means computation for the given input. All included fields in the input must be of type double. The k-means algorithm chooses k random data points as the initial centroids. Computation ends when one of the following conditions is true:

• maxIterations is exceeded.

• No centroid differs significantly from the corresponding centroid of the previous iteration, as judged against the configured quality.
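The initialization and the two stopping conditions described above can be sketched in plain Java. This is a minimal illustration of the general k-means technique, not the DataFlow implementation: the class KMeansSketch, its methods, the seed parameter, and the tolerance parameter (a hypothetical stand-in for the configured quality) are all assumptions made for this example, and the sketch uses Euclidean distance only.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative k-means loop; NOT the DataFlow KMeans implementation. */
final class KMeansSketch {

    /** Squared Euclidean distance between two points of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the centroid nearest to the point. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Returns k centroids. Requires 1 <= k <= data.length.
     * "tolerance" is a hypothetical stand-in for the configured quality.
     */
    static double[][] fit(double[][] data, int k, int maxIterations, double tolerance, long seed) {
        // Pick k distinct random data points as initial centroids (partial shuffle).
        Random random = new Random(seed);
        int[] order = new int[data.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            int j = c + random.nextInt(order.length - c);
            int swap = order[c];
            order[c] = order[j];
            order[j] = swap;
            centroids[c] = data[order[c]].clone();
        }

        int dims = data[0].length;
        // Stopping condition 1: the iteration budget is used up.
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (double[] point : data) {
                int c = nearest(point, centroids);
                counts[c]++;
                for (int d = 0; d < dims; d++) {
                    sums[c][d] += point[d];
                }
            }
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                double[] updated = new double[dims];
                for (int d = 0; d < dims; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                if (Math.sqrt(squaredDistance(updated, centroids[c])) > tolerance) {
                    converged = false;
                }
                centroids[c] = updated;
            }
            // Stopping condition 2: no centroid moved by more than the tolerance.
            if (converged) {
                break;
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {1, 0}, {1, 1}, {10, 10}, {10, 11}, {11, 10}, {11, 11}};
        for (double[] centroid : fit(data, 2, 10, 1e-9, 42L)) {
            System.out.println(Arrays.toString(centroid));
        }
    }
}
```

Because the initial centroids are chosen at random, different seeds can produce the centroids in a different order, and on less well-separated data they can produce different clusters.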

Code Example

The following example computes clusters for the Iris data set using k-means. The input data contains four numeric fields, but only three are used, so they must be specified explicitly.

Using the KMeans operator in Java

// Run k-means to create cluster assignments
KMeans kmeans = graph.add(new KMeans());
kmeans.setK(3);
kmeans.setMaxIterations(10);
kmeans.setDistanceMeasure(DistanceMeasure.EUCLIDEAN);
kmeans.setIncludedColumns(Arrays.asList("sepal length", "sepal width", "petal length"));

Using the KMeans operator in RushScript

var results = dr.kmeans(data, {
    k:3,
    maxIterations:10,
    distanceMeasure:'EUCLIDEAN',
    includedColumns:['sepal length', 'sepal width', 'petal length']});

Properties

The KMeans operator provides the following properties.

Name	Type	Description
maxIterations	int	The maximum number of iterations to allow before iteration is halted. Default: 99.
k	int	The k value, where k is the number of centroids to compute. Default: 3.
includedColumns	List<String>	The list of learning columns to include. An empty list means all columns of type double.
distanceMeasure	DistanceMeasure	The distance measure used to measure the distance between two points when building the model. Either EUCLIDEAN or COSINE_SIMILARITY. Default: EUCLIDEAN.
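The two distanceMeasure options listed in the table correspond to standard formulas, which can be sketched in plain Java as follows. This is an illustration only: the class DistanceSketch and its methods are hypothetical names, and converting cosine similarity to a distance as 1 minus the similarity is an assumption, since this page does not state how DataFlow applies COSINE_SIMILARITY internally.

```java
/** Illustrative distance functions; NOT the DataFlow DistanceMeasure implementation. */
final class DistanceSketch {

    /** Euclidean distance: square root of the sum of squared differences. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Cosine similarity: dot product divided by the product of the vector norms. */
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // A zero-length vector yields NaN here; real code would need to handle that case.
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Assumption for this sketch: treat 1 - similarity as the distance. */
    static double cosineDistance(double[] a, double[] b) {
        return 1.0 - cosineSimilarity(a, b);
    }

    public static void main(String[] args) {
        double[] p = {1.0, 2.0};
        double[] q = {2.0, 4.0};
        System.out.println("euclidean: " + euclidean(p, q));
        System.out.println("cosine distance: " + cosineDistance(p, q));
    }
}
```

The practical difference is that Euclidean distance is sensitive to magnitude, while cosine similarity compares direction only, so two points along the same direction from the origin are treated as close regardless of their lengths.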

Ports

The KMeans operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The training data used to build the model.

The KMeans operator provides the following output port.

Name	Type	Get Method	Description
model	PMMLPort	getModel()	The resultant PMML cluster model.

ClusterPredictor Operator

The ClusterPredictor operator assigns input data to clusters based on the provided PMML clustering model. If the model provides explicit cluster IDs, they are used for the assignment. Otherwise, the implicit 1-based index indicating the position at which each cluster appears in the model is used as the ID.

The input data must contain the same fields as the training data used to build the model (in the PMML model, the clustering fields with the attribute "isCenterField" set to "true"), and these fields must be of type double, float, long, or int. The resulting assignments are included in the output alongside the original input data.
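The ID rule described above (explicit cluster IDs when the model has them, otherwise the 1-based position of the cluster) can be illustrated with a small plain-Java sketch. The class ClusterAssignmentSketch, its assign method, the use of a null array to mean "no explicit IDs", and the String return type are all assumptions made for this example. The sketch also assumes squared Euclidean distance, whereas the actual operator works from the PMML model.

```java
/** Illustrative nearest-center assignment; NOT the DataFlow ClusterPredictor implementation. */
final class ClusterAssignmentSketch {

    /**
     * Returns the ID of the cluster whose center is nearest to the point.
     * explicitIds may be null, meaning the model carries no explicit IDs.
     */
    static String assign(double[] point, double[][] centers, String[] explicitIds) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double distance = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centers[c][d];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        // Explicit IDs win; otherwise fall back to the implicit 1-based position.
        return explicitIds != null ? explicitIds[best] : Integer.toString(best + 1);
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        double[] point = {9.0, 9.0};
        System.out.println(assign(point, centers, null));
        System.out.println(assign(point, centers, new String[] {"small", "large"}));
    }
}
```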

Code Example

Using the ClusterPredictor operator in Java

// Create the cluster predictor operator and add it to a graph
ClusterPredictor predictor = graph.add(new ClusterPredictor());

// Set the name of the field containing the cluster assignments
predictor.setWinnerFieldName("label");

// Connect the predictor to an input data source and a model source
graph.connect(dataSource.getOutput(), predictor.getInput());
graph.connect(modelSource.getOutput(), predictor.getModel());

// The output of the predictor is available for downstream operators to use

Using the ClusterPredictor operator in RushScript

// Apply a clustering model to the given data
var results = dr.clusterPredictor(model, data, {winnerFieldName:"label"});

Properties

The ClusterPredictor operator provides the following properties.

Name	Type	Description
winnerFieldName	String	The name of the output field containing the cluster assignments. Default: "winner".

Ports

The ClusterPredictor operator provides the following input ports.

Name	Type	Get Method	Description
input	RecordPort	getInput()	Input data to which the clustering model is applied.
model	PMMLPort	getModel()	Clustering model in PMML to apply.

The ClusterPredictor operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	Results of applying the clustering model to the input data. Contains the original input data as well as a new field with the assigned cluster ID.

Last modified date: 03/10/2025