Data Clustering Operators
DataFlow includes several prebuilt fuzzy matching operators used for clustering the results of duplicate detection and linkage discovery. For more information, refer to the following topics:
ClusterDuplicates Operator
The ClusterDuplicates operator transforms record pairs into clusters of like records, where both sides of each pair come from the same source. The DiscoverDuplicates operator produces a stream of record pairs, each of which has passed the given qualifications for being a potential match.
ClusterDuplicates takes this record pair input and finds clusters of records that are alike. For example, if one row pairs records A and B and another row pairs records B and C, the operator creates a cluster containing A, B, and C; generates a unique cluster identifier for the group; and outputs one row each for A, B, and C carrying that identifier.
A cluster may contain any number of records. Note that the original record pairings and their scores are not preserved in the output.
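The grouping described above (A-B and B-C collapsing into one cluster) can be illustrated with a small union-find sketch in plain Java. This is not the DataFlow implementation, and it does not use the DataFlow API; the class name PairClustering and the method cluster are hypothetical names chosen for illustration, and sequential integers stand in for the generated cluster identifiers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical helper, not part of DataFlow.
final class PairClustering {

    // Follow parent links to the representative record of a cluster.
    private static String find(Map<String, String> parent, String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Path compression: point every visited record directly at the root.
        String current = id;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    // Each pair is {firstId, secondId}. Returns record id -> cluster id.
    static Map<String, Integer> cluster(List<String[]> pairs) {
        Map<String, String> parent = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            String a = find(parent, pair[0]);
            String b = find(parent, pair[1]);
            if (!a.equals(b)) {
                parent.put(a, b); // merge the two clusters
            }
        }
        // Assign one sequential identifier per distinct root.
        Map<String, Integer> clusterIdByRoot = new HashMap<>();
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String id : new ArrayList<>(parent.keySet())) {
            String root = find(parent, id);
            Integer clusterId = clusterIdByRoot.get(root);
            if (clusterId == null) {
                clusterId = clusterIdByRoot.size() + 1;
                clusterIdByRoot.put(root, clusterId);
            }
            result.put(id, clusterId);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> pairs = List.of(
                new String[] {"A", "B"},
                new String[] {"B", "C"},
                new String[] {"D", "E"});
        // One output row per record: record id and its cluster id.
        cluster(pairs).forEach((id, clusterId) ->
                System.out.println(id + "," + clusterId));
    }
}
```

With the sample pairs, records A, B, and C receive one identifier, and D and E receive another. As in the operator's output, the result has one row per record, and the pairings themselves are no longer visible.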
Code Example
The following code fragment demonstrates how to set up the ClusterDuplicates operator and use it within a simple dataflow.
Using the ClusterDuplicates operator in Java
// Create an empty logical graph
LogicalGraph graph = LogicalGraphFactory.newLogicalGraph();
// Create a delimited text reader for the "dedup-accounts.csv" file
ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/dedup-accounts.csv"));
reader.setHeader(true);
// Create a ClusterDuplicates operator and set the id field
ClusterDuplicates cluster = graph.add(new ClusterDuplicates());
cluster.setDataIdField("id");
// Create a delimited text writer for the results
WriteDelimitedText writer = graph.add(new WriteDelimitedText("results/cluster-accounts.csv", WriteMode.OVERWRITE));
writer.setHeader(true);
writer.setWriteSingleSink(true);
// Connect the graph
graph.connect(reader.getOutput(), cluster.getInput());
graph.connect(cluster.getOutput(), writer.getInput());
// Compile and run the graph
graph.run();
Using the ClusterDuplicates operator in RushScript
var results = dr.clusterDuplicates(data, {dataIdField:"id"});
Properties
The ClusterDuplicates operator provides the following properties.
Ports
The ClusterDuplicates operator provides a single input port.
The ClusterDuplicates operator provides a single output port.
ClusterLinks Operator
The ClusterLinks operator transforms record pairs into clusters of like records. The DiscoverLinks operator produces a stream of record pairs, each of which has passed the given qualifications for being a potential match.
ClusterLinks takes this record pair input and finds clusters of records that are alike. For example, if one row pairs records A and B and another row pairs records B and C, the operator creates a cluster containing A, B, and C; generates a unique cluster identifier for the group; and outputs one row each for A, B, and C carrying that identifier.
A cluster may contain any number of records. Note that the original record pairings and their scores are not preserved in the output.
Code Example
The following code fragment demonstrates how to set up the ClusterLinks operator and use it within a simple dataflow.
Using the ClusterLinks operator in Java
// Create an empty logical graph
LogicalGraph graph = LogicalGraphFactory.newLogicalGraph();
// Create a delimited text reader for the "link-accounts.csv" file
ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/link-accounts.csv"));
reader.setHeader(true);
// Create a ClusterLinks operator and set the id field and field patterns
ClusterLinks cluster = graph.add(new ClusterLinks());
cluster.setDataIdField("id");
cluster.setLeftFieldPattern("left_{0}");
cluster.setRightFieldPattern("right_{0}");
// Create a delimited text writer for the results
WriteDelimitedText writer = graph.add(new WriteDelimitedText("results/cluster-accounts.csv", WriteMode.OVERWRITE));
writer.setHeader(true);
writer.setWriteSingleSink(true);
// Connect the graph
graph.connect(reader.getOutput(), cluster.getInput());
graph.connect(cluster.getOutput(), writer.getInput());
// Compile and run the graph
graph.run();
Using the ClusterLinks operator in RushScript
var results = dr.clusterLinks(data, {dataIdField:"id", leftFieldPattern:"left_{0}", rightFieldPattern:"right_{0}"});
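The leftFieldPattern and rightFieldPattern values in the examples above contain a {0} placeholder. The sketch below shows one plausible reading, in which {0} stands for a base field name so that "left_{0}" resolves to names such as left_id. This is an assumption, not documented behavior: the sketch uses the standard java.text.MessageFormat class rather than any DataFlow API, and the class FieldPatternDemo and the method expand are hypothetical.

```java
import java.text.MessageFormat;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical helper, not part of DataFlow.
final class FieldPatternDemo {

    // Substitute each base field name for the {0} placeholder in the pattern.
    static List<String> expand(String pattern, List<String> fieldNames) {
        List<String> expanded = new ArrayList<>();
        for (String name : fieldNames) {
            expanded.add(MessageFormat.format(pattern, name));
        }
        return expanded;
    }

    public static void main(String[] args) {
        // Assumed base field names, for illustration.
        List<String> fields = List.of("id", "name");
        System.out.println(expand("left_{0}", fields));
        System.out.println(expand("right_{0}", fields));
    }
}
```

Under this reading, the two sides of each input pair are distinguished purely by field naming, which is why the ClusterLinks example sets both patterns in addition to the id field.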
Properties
The ClusterLinks operator provides the following properties.
Ports
The ClusterLinks operator provides a single input port.
The ClusterLinks operator provides a single output port.
Last modified date: 03/10/2025