DF 8.2 | Data Clustering Operators

Data Clustering Operators

DataFlow includes several prebuilt fuzzy matching operators used for clustering the results of duplicate detection and linkage discovery. For more information, refer to the following topics:

• ClusterDuplicates Operator

• ClusterLinks Operator

ClusterDuplicates Operator

The ClusterDuplicates operator transforms record pairs into clusters of like records, where both sides of each pair come from the same source. Its input is typically the output of the DiscoverDuplicates operator, which is a stream of record pairs.

Each pair of records has passed the given qualifications for being a potential match. The operator takes these pairs as input and finds clusters of records that are alike. For example, suppose one row contains records A and B, and another contains records B and C. The operator creates a single cluster for records A, B, and C; generates a unique cluster identifier for the grouping; and outputs a row for each of the records A, B, and C with the generated cluster identifier.

A cluster may contain any number of records. Note that the original record pairings are lost, as are the scores.
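The A, B, C example above amounts to grouping records that are connected through any chain of pairs. The following is a minimal sketch of that idea using a union-find structure. It is an illustration only and is not taken from the DataFlow implementation; the class PairClusterSketch, its methods, and the integer cluster identifiers are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: PairClusterSketch is a hypothetical class, not part of the DataFlow API.
class PairClusterSketch {
    // Maps each record id to its parent in the union-find forest
    private final Map<String, String> parent = new HashMap<>();

    // Find the representative record id of the cluster containing the given id
    String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Path compression: point every visited id directly at the root
        String current = id;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    // Merge the clusters containing the two sides of a pair
    void union(String left, String right) {
        String leftRoot = find(left);
        String rightRoot = find(right);
        if (!leftRoot.equals(rightRoot)) {
            parent.put(leftRoot, rightRoot);
        }
    }

    // Return record id -> cluster id; the pairings themselves are not kept
    static Map<String, Integer> cluster(List<String[]> pairs) {
        PairClusterSketch sets = new PairClusterSketch();
        for (String[] pair : pairs) {
            sets.union(pair[0], pair[1]);
        }
        Map<String, Integer> rootToCluster = new HashMap<>();
        Map<String, Integer> result = new TreeMap<>();
        // Iterate over a sorted copy of the ids so cluster numbering is deterministic
        for (String id : new TreeMap<>(sets.parent).keySet()) {
            String root = sets.find(id);
            Integer clusterId = rootToCluster.get(root);
            if (clusterId == null) {
                clusterId = rootToCluster.size() + 1;
                rootToCluster.put(root, clusterId);
            }
            result.put(id, clusterId);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
            new String[] {"A", "B"},
            new String[] {"B", "C"},
            new String[] {"D", "E"});
        // Prints {A=1, B=1, C=1, D=2, E=2}
        System.out.println(cluster(pairs));
    }
}
```

Records A and C land in the same cluster even though no input row pairs them directly, and the result holds only record and cluster identifiers, which mirrors the loss of the original pairings and scores.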

Code Example

The following code fragment demonstrates how to set up the ClusterDuplicates operator and use it within a simple dataflow.

Using the ClusterDuplicates operator in Java

// Create an empty logical graph
LogicalGraph graph = LogicalGraphFactory.newLogicalGraph();

// Create a delimited text reader for the "dedup-accounts.csv" file
ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/dedup-accounts.csv"));
reader.setHeader(true);

// Create a ClusterDuplicates operator and set the id field
ClusterDuplicates cluster = graph.add(new ClusterDuplicates());
cluster.setDataIdField("id");

// Create a delimited text writer for the results
WriteDelimitedText writer = graph.add(new WriteDelimitedText("results/cluster-accounts.csv", WriteMode.OVERWRITE));
writer.setHeader(true);
writer.setWriteSingleSink(true);

// Connect the graph
graph.connect(reader.getOutput(), cluster.getInput());
graph.connect(cluster.getOutput(), writer.getInput());

// Compile and run the graph
graph.run();

Using the ClusterDuplicates operator in RushScript

var results = dr.clusterDuplicates(data, {dataIdField:"id"});

Properties

The ClusterDuplicates operator provides the following properties.

Name	Type	Description
dataIdField	String	The name of the field uniquely identifying records in the original source. This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data.

Ports

The ClusterDuplicates operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data for the clustering operation.

The ClusterDuplicates operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output from the clustering operation.

ClusterLinks Operator

The ClusterLinks operator transforms record pairs into clusters of like records. Its input is typically the output of the DiscoverLinks operator, which is a stream of record pairs. Each pair of records has passed the given qualifications for being a potential match.

The operator takes these pairs as input and finds clusters of records that are alike. For example, suppose one row contains records A and B, and another contains records B and C. The operator creates a single cluster for records A, B, and C; generates a unique cluster identifier for the grouping; and outputs a row for each of the records A, B, and C with the generated cluster identifier.

A cluster may contain any number of records. Note that the original record pairings are lost, as are the scores.

Code Example

The following code fragment demonstrates how to set up the ClusterLinks operator and use it within a simple dataflow.

Using the ClusterLinks operator in Java

// Create an empty logical graph
LogicalGraph graph = LogicalGraphFactory.newLogicalGraph();

// Create a delimited text reader for the "link-accounts.csv" file
ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/link-accounts.csv"));
reader.setHeader(true);

// Create a ClusterLinks operator and set the id field and field patterns
ClusterLinks cluster = graph.add(new ClusterLinks());
cluster.setDataIdField("id");
cluster.setLeftFieldPattern("left_{0}");
cluster.setRightFieldPattern("right_{0}");

// Create a delimited text writer for the results
WriteDelimitedText writer = graph.add(new WriteDelimitedText("results/cluster-accounts.csv", WriteMode.OVERWRITE));
writer.setHeader(true);
writer.setWriteSingleSink(true);

// Connect the graph
graph.connect(reader.getOutput(), cluster.getInput());
graph.connect(cluster.getOutput(), writer.getInput());

// Compile and run the graph
graph.run();

Using the ClusterLinks operator in RushScript

var results = dr.clusterLinks(data, {dataIdField:"id", leftFieldPattern:"left_{0}", rightFieldPattern:"right_{0}"});

Properties

The ClusterLinks operator provides the following properties.

Name	Type	Description
dataIdField	String	The name of the field uniquely identifying records on both sides of the pairs. This is a convenience mechanism for when both sides use the same name, as is the case with the output from DiscoverDuplicates. This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data.
leftFieldPattern	String	The naming pattern used for fields from the left side record. This pattern is used to determine the actual name of the left-hand ID field.
leftDataIdField	String	The name of the field uniquely identifying records on the left side of the pairs. This name will also be used to identify cluster members in the output. This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data.
rightFieldPattern	String	The naming pattern used for fields from the right side record. This pattern is used to determine the actual name of the right-hand ID field.
rightDataIdField	String	The name of the field uniquely identifying records on the right side of the pairs. This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data.
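The field patterns and ID field names combine to give the column names the operator looks for in the pair data. The sketch below shows one plausible resolution, assuming the {0} placeholder follows java.text.MessageFormat conventions; the source does not state which formatter DataFlow uses, and the class FieldPatternSketch and the method pairFieldName are hypothetical.

```java
import java.text.MessageFormat;

// Illustrative only: FieldPatternSketch is a hypothetical helper, not part of the DataFlow API.
class FieldPatternSketch {
    // Substitute the original field name for the {0} placeholder in the pattern.
    // Assumption: the pattern uses MessageFormat-style placeholders.
    static String pairFieldName(String pattern, String originalName) {
        return MessageFormat.format(pattern, originalName);
    }

    public static void main(String[] args) {
        // With dataIdField "id" and the patterns from the example above
        System.out.println(pairFieldName("left_{0}", "id"));  // prints left_id
        System.out.println(pairFieldName("right_{0}", "id")); // prints right_id
    }
}
```

Under this assumption, setting dataIdField to "id" with the patterns "left_{0}" and "right_{0}" means the pair input is expected to carry the ID columns left_id and right_id.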

Ports

The ClusterLinks operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data for the clustering operation.

The ClusterLinks operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	The output data from the clustering operation.

Last modified date: 03/10/2025