DataMatcher Nodes

Cluster Duplicates clusters records from the Discover Duplicates node into groups of similar records.

Cluster Links clusters records from the Discover Links node into groups of similar records.

Discover Duplicates discovers duplicate records within a data source using fuzzy matching algorithms.

Discover Links discovers duplicate records between two data sources using fuzzy matching algorithms.

Encode provides a library of phonetic algorithms used to index words by their pronunciation.

This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ClusterDuplicates Operator to Cluster Duplicates.

Using the results produced by the Discover Duplicates node, Cluster Duplicates groups records that were found to be similar to one another. For example, if the Discover Duplicates output indicates that record "A" and record "G" are similar, and that record "G" and record "M" are also similar, the Cluster Duplicates node clusters A, G, and M into one group and assigns it a groupId number.
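The grouping described above is a transitive closure over the similar pairs. The following is a minimal Python sketch of that idea using a union-find structure. It is not the node's actual implementation: the function name cluster_pairs, the pair format, and the sequential group numbering are assumptions made for illustration.

```python
from collections import defaultdict

def cluster_pairs(pairs):
    """Group record ids into clusters from (id_a, id_b) pairs of similar records."""
    parent = {}

    def find(x):
        # Return the representative of x's cluster, registering x if unseen.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    # Merge the clusters of every pair that was reported as similar.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each cluster under its representative.
    groups = defaultdict(list)
    for record_id in list(parent):
        groups[find(record_id)].append(record_id)

    # Hypothetical numbering: group ids are assigned sequentially from 1.
    return {gid: sorted(members)
            for gid, members in enumerate(groups.values(), start=1)}

clusters = cluster_pairs([("A", "G"), ("G", "M"), ("B", "K")])
print(clusters)
```

Records A and M end up in the same group even though they were never reported as a pair, because both are similar to G.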

You can join this output back to the original record source to visually inspect the records that were found to be similar to one another.
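As a rough illustration of joining the cluster output back to the source, the sketch below attaches a group id to each original record and sorts by group so that similar records appear next to each other. The output layout (one row per record key with a groupId column), the field names, and the helper join_groups are assumptions for this example, not the node's documented schema.

```python
def join_groups(records, cluster_rows, key_field):
    """Attach groupId to each source record that appears in the cluster output."""
    group_by_key = {row[key_field]: row["groupId"] for row in cluster_rows}
    joined = [
        {**record, "groupId": group_by_key[record[key_field]]}
        for record in records
        if record[key_field] in group_by_key  # records with no match are dropped
    ]
    # Sort by group so the members of each cluster are listed together.
    return sorted(joined, key=lambda row: row["groupId"])

# Hypothetical source records and cluster output.
records = [
    {"id": "A", "name": "Jon Smith"},
    {"id": "B", "name": "Mary Jones"},
    {"id": "G", "name": "John Smith"},
    {"id": "M", "name": "Jon Smyth"},
]
cluster_rows = [
    {"id": "A", "groupId": 1},
    {"id": "G", "groupId": 1},
    {"id": "M", "groupId": 1},
]

joined = join_groups(records, cluster_rows, "id")
for row in joined:
    print(row["groupId"], row["id"], row["name"])
```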

Specifies the field in the data source that is the key field to be used to uniquely identify this record.

This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ClusterLinks Operator to Cluster Links.

Using the results produced by the Discover Links node, Cluster Links groups records that were found to be similar to one another. For example, if the Discover Links output indicates that record "A" and record "G" are similar, and that record "G" and record "M" are also similar, the Cluster Links node clusters A, G, and M into one group and assigns it a groupId number.

You can join this output back to the original record sources to visually inspect the records that were found to be similar to one another.

Specifies the field in the left data source that is the key field to be used to uniquely identify the record.

Specifies the field in the right data source that is the key field to be used to uniquely identify the record.

This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DiscoverDuplicates Operator to Discover Duplicates.

The Discover Duplicates node finds duplicate records within a data source using fuzzy matching techniques. A set of fields can be used as key fields to index the data into blocks for field-level comparisons. A set of field comparisons can be defined, each using one of several fuzzy matching comparison algorithms.

The score from each field comparison can be given a weight. The higher the weight, the greater the effect of that field comparison on the overall record score. A filter value sets the minimum record-level score: record pairs that score below it are filtered out.

The output of the Discover Duplicates node consists of pairs of records that are likely matches based on the criteria defined for the node.
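The blocking, weighting, and filtering steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the node's implementation: the record score is assumed to be a weighted average of the field scores, difflib.SequenceMatcher stands in for the node's fuzzy comparison algorithms, and the function and field names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations, groupby

def similarity(a, b):
    """Stand-in fuzzy comparison that returns a score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def discover_duplicates(records, key_fields, comparisons, min_score):
    """Return (id, id, score) for pairs in the same block that pass the filter."""
    def block_key(record):
        return tuple(record[field] for field in key_fields)

    total_weight = sum(weight for _, weight in comparisons)
    pairs = []
    # Sorting by the key fields lets groupby split the data into blocks.
    for _, block in groupby(sorted(records, key=block_key), key=block_key):
        # Only records that share a block are compared with each other.
        for left, right in combinations(list(block), 2):
            # Assumption: the record score is a weighted average of field scores.
            score = sum(
                weight * similarity(left[field], right[field])
                for field, weight in comparisons
            ) / total_weight
            if score >= min_score:  # drop pairs that score below the filter value
                pairs.append((left["id"], right["id"], round(score, 3)))
    return pairs

records = [
    {"id": 1, "zip": "78701", "name": "Jon Smith", "city": "Austin"},
    {"id": 2, "zip": "78701", "name": "John Smith", "city": "Austin"},
    {"id": 3, "zip": "78701", "name": "Mary Jones", "city": "Austin"},
    {"id": 4, "zip": "10001", "name": "Jon Smith", "city": "New York"},
]
comparisons = [("name", 3.0), ("city", 1.0)]  # (field, weight)

pairs = discover_duplicates(records, ["zip"], comparisons, min_score=0.9)
print(pairs)
```

Record 4 is never compared with record 1 even though the names are identical, because the two records fall into different blocks. This is the trade-off of blocking: fewer comparisons, at the cost of missing matches whose key fields differ.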

Specifies the fields of the input data set to use as keys when indexing the data into blocks for record pair generation and comparison.

Sorts the input data. Your input data must be sorted by the same fields specified in the "Key fields" input. If your data is not sorted, check this box to have the node sort the data for you.

Specifies a set of field-level comparisons that use fuzzy matching algorithms. For each comparison, you pick the fields to compare, the comparison algorithm to apply, and the property settings for the algorithm. You can also assign a weight to each comparison. Comparisons with larger weights are given more consideration when computing the record pair comparison score.

Filters record pairs with an aggregate score less than this value from the output data stream.

This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DiscoverLinks Operator to Discover Links.

The Discover Links node finds duplicate records between two data sources using fuzzy matching techniques. A combination of left and right fields can be used as key fields to index the data into blocks for field-level comparisons. A set of field comparisons can be defined, each using one of several fuzzy matching comparison algorithms.

The output of the Discover Links node consists of left/right pairs of records that are likely matches based on the criteria defined for the node.
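To illustrate how left and right key fields limit which records are compared, here is a small Python sketch that generates candidate left/right pairs by blocking. It is a simplified model, not the node's implementation: the helper candidate_link_pairs and the field names are assumptions, and scoring the candidates is left out.

```python
from collections import defaultdict

def candidate_link_pairs(left, right, left_keys, right_keys):
    """Yield (left_row, right_row) for rows whose key field values match."""
    # Index the right source by its key values so each block is found directly.
    blocks = defaultdict(list)
    for row in right:
        blocks[tuple(row[field] for field in right_keys)].append(row)

    # Pair each left row only with the right rows in the matching block.
    for left_row in left:
        key = tuple(left_row[field] for field in left_keys)
        for right_row in blocks.get(key, []):
            yield left_row, right_row

# The two sources may name their key fields differently.
left = [
    {"id": "L1", "postcode": "78701"},
    {"id": "L2", "postcode": "10001"},
]
right = [
    {"id": "R1", "zip": "78701"},
    {"id": "R2", "zip": "78701"},
    {"id": "R3", "zip": "60601"},
]

candidates = [
    (l["id"], r["id"])
    for l, r in candidate_link_pairs(left, right, ["postcode"], ["zip"])
]
print(candidates)
```

Only the candidate pairs would then go through the field comparisons. Records L2 and R3 produce no candidates because no record in the other source shares their key value.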

Specifies the fields of the left input data set to use as keys when indexing the data into blocks for record pair generation and comparison.

Sorts the left input data. Your input data must be sorted by the same fields specified in the "Left Key Fields" input. If your data is not sorted, check this box to have the node sort the data for you.

Specifies the fields of the right input data set to use as keys when indexing the data into blocks for record pair generation and comparison.

Sorts the right input data. Your input data must be sorted by the same fields specified in the "Right Key Fields" input. If your data is not sorted, check this box to have the node sort the data for you.

Specifies a set of field-level comparisons that use fuzzy matching algorithms. For each comparison, you pick the fields to compare, the comparison algorithm to apply, and the property settings for the algorithm. You can also assign a weight to each comparison.

Filters record pairs with an aggregate score less than this value from the output data stream.

This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DeriveFields Operator to Compute New Fields and Available Functions.

The Encode node provides a library of encoding algorithms that help overcome phonetic and spelling discrepancies in strings. Specify the encoding type, the field on which to perform the encoding, and the resultant field name. Leaving the Target Name value blank causes the node to use the Source Field value, overwriting the source field with its encoded value.
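As an illustration of phonetic encoding, the Python sketch below implements American Soundex, a widely known algorithm of this kind, and mimics the blank Target Name behavior described above. This document does not list which encodings the node offers, so the choice of Soundex is an assumption, and the helper names soundex and encode_field are hypothetical.

```python
# Map each consonant to its Soundex digit; vowels, H, W, and Y have no code.
SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}

def soundex(word):
    """Return the four-character American Soundex code for a word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]  # the first letter is kept as-is
    previous = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        code = SOUNDEX_CODES.get(c, "")
        if code and code != previous:
            encoded += code
        if c not in "HW":  # H and W do not separate letters with the same code
            previous = code
    return (encoded + "000")[:4]  # pad with zeros or truncate to four characters

def encode_field(rows, source_field, target_name=""):
    """Encode source_field; a blank target_name overwrites the source field."""
    target = target_name or source_field
    return [{**row, target: soundex(row[source_field])} for row in rows]

rows = [{"name": "Smith"}, {"name": "Smyth"}]
print(encode_field(rows, "name", "name_code"))
print(encode_field(rows, "name"))  # blank target: "name" holds the code
```

The two spellings receive the same code, which is what makes an encoded field useful as a key field for blocking records before fuzzy comparison.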

Specifies the resultant field containing the encoded value. Leaving this setting blank means that the value of the Source Field is used.