Data Clustering Operators
DataFlow includes several prebuilt fuzzy matching operators used for clustering the results of duplicate detection and linkage discovery. For more information, refer to the following topics:
ClusterDuplicates Operator
The ClusterDuplicates operator transforms record pairs into clusters of like records, where both sides of each pair come from the same source. The output of the DiscoverDuplicates operator is a stream of record pairs, and each pair has passed the given qualifications for being a potential match.
This operator takes the record pair input and finds clusters of records that are alike. For example, suppose one row pairs records A and B, and another row pairs records B and C. The operator creates a cluster for records A, B, and C; generates a unique cluster identifier for the grouping; and outputs records A, B, and C with the generated cluster identifier.
A cluster may contain any number of records. Note that the original record pairings are lost, as are the scores.
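The grouping behavior described above can be pictured as finding connected groups of record IDs. The following is a minimal sketch in plain Java, not the operator's actual implementation: it assumes a union-find approach, and the class name PairClustering and its methods are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only (hypothetical class): groups record IDs connected through matched pairs. */
class PairClustering {

    /** Returns the representative ID of the group containing the given ID. */
    private static String find(Map<String, String> parent, String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Path compression: point every visited ID directly at the root
        String current = id;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    /**
     * Takes matched pairs (each a two-element array of record IDs) and
     * returns a map from record ID to a generated cluster identifier.
     */
    static Map<String, Integer> cluster(List<String[]> pairs) {
        Map<String, String> parent = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            String leftRoot = find(parent, pair[0]);
            String rightRoot = find(parent, pair[1]);
            if (!leftRoot.equals(rightRoot)) {
                parent.put(leftRoot, rightRoot); // merge the two groups
            }
        }
        // Assign one generated identifier per group, in order of first appearance
        Map<String, Integer> rootToCluster = new HashMap<>();
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String id : new ArrayList<>(parent.keySet())) {
            String root = find(parent, id);
            Integer clusterId = rootToCluster.get(root);
            if (clusterId == null) {
                clusterId = rootToCluster.size() + 1;
                rootToCluster.put(root, clusterId);
            }
            result.put(id, clusterId);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.add(new String[] {"A", "B"});
        pairs.add(new String[] {"B", "C"});
        pairs.add(new String[] {"X", "Y"});
        // One line per record: record ID, cluster identifier
        cluster(pairs).forEach((id, clusterId) -> System.out.println(id + "," + clusterId));
    }
}
```

In this sketch, records A, B, and C receive the same cluster identifier because B appears in both pairs, while X and Y form a separate cluster. The result keeps only record-to-cluster assignments, which mirrors the point that the original pairings and scores do not survive clustering.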
Code Example
The following code fragment demonstrates how to set up the ClusterDuplicates operator and use it within a simple dataflow.
Using the ClusterDuplicates operator in Java
// Create an empty logical graph
LogicalGraph graph = LogicalGraphFactory.newLogicalGraph();

// Create a delimited text reader for the "dedup-accounts.csv" file
ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/dedup-accounts.csv"));
reader.setHeader(true);

// Create a ClusterDuplicates operator and set the id field
ClusterDuplicates cluster = graph.add(new ClusterDuplicates());
cluster.setDataIdField("id");

// Create a delimited text writer for the results
WriteDelimitedText writer = graph.add(new WriteDelimitedText("results/cluster-accounts.csv", WriteMode.OVERWRITE));
writer.setHeader(true);
writer.setWriteSingleSink(true);

// Connect the graph
graph.connect(reader.getOutput(), cluster.getInput());
graph.connect(cluster.getOutput(), writer.getInput());

// Compile and run the graph
graph.run();
Using the ClusterDuplicates operator in RushScript
var results = dr.clusterDuplicates(data, {dataIdField:"id"});
Properties
The ClusterDuplicates operator provides the following properties.
| Name | Type | Description |
| --- | --- | --- |
| dataIdField | String | The name of the field uniquely identifying records in the original source. This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data. |
Ports
The ClusterDuplicates operator provides a single input port.
| Name | Type | Get Method | Description |
| --- | --- | --- | --- |
| input | | getInput() | The input data for the clustering operation. |
The ClusterDuplicates operator provides a single output port.
| Name | Type | Get Method | Description |
| --- | --- | --- | --- |
| output | | getOutput() | The output data from the clustering operation. |
ClusterLinks Operator
The ClusterLinks operator transforms record pairs into clusters of like records. The output of the DiscoverLinks operator is a stream of record pairs, and each pair has passed the given qualifications for being a potential match.
This operator takes the record pair input and finds clusters of records that are alike. For example, suppose one row pairs records A and B, and another row pairs records B and C. The operator creates a cluster for records A, B, and C; generates a unique cluster identifier for the grouping; and outputs records A, B, and C with the generated cluster identifier.
A cluster may contain any number of records. Note that the original record pairings are lost, as are the scores.
Code Example
The following code fragment demonstrates how to set up the ClusterLinks operator and use it within a simple dataflow.
Using the ClusterLinks operator in Java
// Create an empty logical graph
LogicalGraph graph = LogicalGraphFactory.newLogicalGraph();

// Create a delimited text reader for the "link-accounts.csv" file
ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/link-accounts.csv"));
reader.setHeader(true);

// Create a ClusterLinks operator and set the id field and field patterns
ClusterLinks cluster = graph.add(new ClusterLinks());
cluster.setDataIdField("id");
cluster.setLeftFieldPattern("left_{0}");
cluster.setRightFieldPattern("right_{0}");

// Create a delimited text writer for the results
WriteDelimitedText writer = graph.add(new WriteDelimitedText("results/cluster-accounts.csv", WriteMode.OVERWRITE));
writer.setHeader(true);
writer.setWriteSingleSink(true);

// Connect the graph
graph.connect(reader.getOutput(), cluster.getInput());
graph.connect(cluster.getOutput(), writer.getInput());

// Compile and run the graph
graph.run();
Using the ClusterLinks operator in RushScript
var results = dr.clusterLinks(data, {dataIdField:"id", leftFieldPattern:"left_{0}", rightFieldPattern:"right_{0}"});
Properties
The ClusterLinks operator provides the following properties.
| Name | Type | Description |
| --- | --- | --- |
| dataIdField | String | The name of the field uniquely identifying records on both sides of the pairs. This is a convenience mechanism for when both sides use the same name, as is the case with the output from DiscoverDuplicates. This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data. |
| leftFieldPattern | String | The naming pattern used for fields from the left side record. This will be used to determine the actual name of the left-hand ID field. |
| leftDataIdField | String | The name of the field uniquely identifying records on the left side of the pairs. This name will also be used to identify cluster members in the output. This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data. |
| rightFieldPattern | String | The naming pattern used for fields from the right side record. This will be used to determine the actual name of the right-hand ID field. |
| rightDataIdField | String | The name of the field uniquely identifying records on the right side of the pairs. This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data. |
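To make the relationship between the field patterns and the ID field names concrete, the following plain Java sketch shows how a pattern such as "left_{0}" could combine with an original field name such as "id" to give the formatted name found in the pair data. It assumes the {0} placeholder follows java.text.MessageFormat conventions, which this page does not confirm; the class FieldPatternSketch and its resolve method are hypothetical names used only for illustration.

```java
import java.text.MessageFormat;

/** Illustrative only (hypothetical class): expands a field naming pattern. */
class FieldPatternSketch {

    /**
     * Substitutes the original field name for the {0} placeholder.
     * Assumption: the pattern uses MessageFormat-style placeholders.
     */
    static String resolve(String pattern, String fieldName) {
        return MessageFormat.format(pattern, fieldName);
    }

    public static void main(String[] args) {
        // Original ID field name, as it appears in the source records
        String dataIdField = "id";
        // Formatted names as they would appear in the pair data
        System.out.println(resolve("left_{0}", dataIdField));
        System.out.println(resolve("right_{0}", dataIdField));
    }
}
```

Under this assumption, the values from the example above resolve to left_id and right_id, which is why the properties ask for the original field name rather than the formatted one.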
Ports
The ClusterLinks operator provides a single input port.
| Name | Type | Get Method | Description |
| --- | --- | --- | --- |
| input | | getInput() | The input data for the clustering operation. |
The ClusterLinks operator provides a single output port.
| Name | Type | Get Method | Description |
| --- | --- | --- | --- |
| output | | getOutput() | The output data from the clustering operation. |
Last modified date: 03/10/2025