DataMatcher Nodes
Cluster Duplicates
Cluster Duplicates clusters records from the Discover Duplicates node into groups of similar records.
Cluster Links
Cluster Links clusters records from the Discover Links node into groups of similar records.
Discover Duplicates
Discover Duplicates discovers duplicate records within a data source using fuzzy matching algorithms.
Discover Links
Discover Links discovers duplicate records between two data sources using fuzzy matching algorithms.
Encode
Encode provides a library of phonetic algorithms for indexing words by their pronunciation.
Cluster Duplicates
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ClusterDuplicates Operator to Cluster Duplicates.
Using the results produced by the Discover Duplicates node, Cluster Duplicates groups records that were found to be similar to one another. For example, if the Discover Duplicates output indicates that record "A" and record "G" are similar, and that record "G" and record "M" are also similar, the Cluster Duplicates node clusters A, G, and M into one group and assigns it a groupId number.
Using this output, it is possible to join these results back to their original record source to visually inspect records that were found to be similar to one another.
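The transitive grouping described above can be sketched in Python as a connected-components computation over the similar pairs. This is only an illustration of the idea, not the node's implementation: the function name cluster_pairs, the union-find approach, and the sequential numbering of group ids are all assumptions.

```python
def cluster_pairs(pairs):
    """Group record ids that are connected by any chain of similar pairs."""
    parent = {}

    def find(x):
        # Return the representative of x's group, registering x if new.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # shorten the chain as we walk it
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    # Assumption: group ids are numbered from 0 in first-seen order.
    group_ids = {}
    result = {}
    for record in list(parent):
        root = find(record)
        if root not in group_ids:
            group_ids[root] = len(group_ids)
        result[record] = group_ids[root]
    return result


# Hypothetical pair output: A~G and G~M chain into one group, B~K is separate.
pairs = [("A", "G"), ("G", "M"), ("B", "K")]
groups = cluster_pairs(pairs)
print(groups)
```

Records A, G, and M receive the same group id even though A and M were never compared directly, which is the behavior the example in the text describes.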
Dialog Options
Data Id Field
Specifies the key field in the data source that uniquely identifies each record.
Ports
Input Ports
0 - Discover Duplicates node output
Output Ports
0 - Clustered results
Cluster Links
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the ClusterLinks Operator to Cluster Links.
Using the results produced by the Discover Links node, Cluster Links groups records that were found to be similar to one another. For example, if the Discover Links output indicates that record "A" and record "G" are similar, and that record "G" and record "M" are also similar, the Cluster Links node clusters A, G, and M into one group and assigns it a groupId number.
Using this output, it is possible to join these results back to their original record source to visually inspect records that were found to be similar to one another.
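As a rough illustration of joining clustered results back to their sources for inspection, the Python sketch below looks up each clustered id in the left or right source and lists the records by group. The row layout (side, record id, groupId), the sample records, and the helper join_back are hypothetical; the actual output schema of the node is not described in this topic.

```python
from collections import defaultdict

# Hypothetical left and right sources, keyed by their id fields.
left_records = {1: "Jon Smith", 2: "Ann Lee"}
right_records = {1: "John Smith", 7: "Jonn Smyth", 9: "Anne Lee"}

# Assumed shape of the clustered output: (side, record id, groupId).
clustered = [
    ("left", 1, 0), ("right", 1, 0), ("right", 7, 0),
    ("left", 2, 1), ("right", 9, 1),
]

sources = {"left": left_records, "right": right_records}


def join_back(clustered_rows, sources):
    """Attach the original record to each clustered id, grouped by groupId."""
    groups = defaultdict(list)
    for side, record_id, group_id in clustered_rows:
        record = sources[side][record_id]  # look up in the matching source
        groups[group_id].append((side, record_id, record))
    return dict(groups)


by_group = join_back(clustered, sources)
for group_id, members in sorted(by_group.items()):
    print(group_id, members)
```

Keeping the side with each id matters because the left and right sources can reuse the same id value for unrelated records, as ids 1 and 1 do in this sample.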
Dialog Options
leftDataIdField
Specifies the key field in the left data source that uniquely identifies each record.
rightDataIdField
Specifies the key field in the right data source that uniquely identifies each record.
Ports
Input Ports
0 - Discover Links node output
Output Ports
0 - Clustered results
Discover Duplicates
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DiscoverDuplicates Operator to Discover Duplicates.
The Discover Duplicates node finds duplicate records within a data source using fuzzy matching techniques. A set of fields can be used as key fields to index the data into blocks for field-level comparisons. A set of field comparisons can be defined using one of several fuzzy matching comparison algorithms.
The score from each field comparison can be given a weight. The higher the weight, the greater the effect of that field comparison on the overall record score. A filter value removes record pairs whose record-level score is less than the filter value.
The output of the Discover Duplicates node consists of pairs of records that are likely matches based on the criteria defined for the node.
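The following Python sketch illustrates the three ideas above: blocking on key fields, weighted field comparisons, and filtering on the record score. It is not the node's algorithm. The sample records, the use of difflib.SequenceMatcher as the fuzzy comparison, and the weighted-average formula for the record score are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations, groupby

# Hypothetical input data set.
records = [
    {"id": 1, "zip": "78701", "name": "Jon Smith", "city": "Austin"},
    {"id": 2, "zip": "78701", "name": "John Smith", "city": "Austin"},
    {"id": 3, "zip": "78701", "name": "Mary Jones", "city": "Austin"},
    {"id": 4, "zip": "10001", "name": "Jon Smith", "city": "New York"},
]


def similarity(a, b):
    """Stand-in fuzzy comparison returning a score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def discover_duplicates(rows, key_fields, comparisons, filter_value):
    def block_key(row):
        return tuple(row[f] for f in key_fields)

    total_weight = sum(weight for _, weight in comparisons)
    pairs = []
    # Sort by the key fields so that groupby yields one block per key.
    for _, block in groupby(sorted(rows, key=block_key), key=block_key):
        # Only records in the same block are compared with each other.
        for a, b in combinations(list(block), 2):
            # Assumption: record score is the weighted average of field scores.
            score = sum(
                weight * similarity(a[field], b[field])
                for field, weight in comparisons
            ) / total_weight
            if score >= filter_value:
                pairs.append((a["id"], b["id"], round(score, 3)))
    return pairs


matches = discover_duplicates(
    records,
    key_fields=["zip"],
    comparisons=[("name", 3.0), ("city", 1.0)],  # name counts three times as much
    filter_value=0.8,
)
print(matches)
```

Record 4 has the same name as record 1 but is never compared with it because the two fall into different blocks. This shows how the choice of key fields limits which pairs can be found.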
Dialog Options
Key fields
Specifies the fields of the input data set to use as keys when indexing the data into blocks for record pair generation and comparison.
Sort Input
Sorts the input data. The input data must be sorted by the same fields specified in the "Key fields" option. If your data is not already sorted, select this option to have the node sort it for you.
Field Comparisons
Specifies a set of field-level comparisons that use fuzzy matching algorithms. You pick the fields to compare, the comparison algorithm to apply, and the property settings for the algorithm. A weight can also be assigned to each comparison. Comparisons with larger weights are given more consideration when computing the record pair comparison score.
Record Pairs Filter Value
Removes record pairs whose aggregate score is less than this value from the output data stream.
Ports
Input Ports
0 - Input port containing the data set.
Output Ports
0 - Output port containing the results of the discover duplicates operation.
Discover Links
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DiscoverLinks Operator to Discover Links.
The Discover Links node finds duplicate records between two data sources using fuzzy matching techniques. A combination of left and right fields can be used as key fields to index the data into blocks for field-level comparisons. A set of field comparisons can be defined using one of several fuzzy matching comparison algorithms.
The score from each field comparison can be given a weight. The higher the weight, the greater the effect of that field comparison on the overall record score. A filter value removes record pairs whose record-level score is less than the filter value.
The output of the Discover Links node consists of left/right pairs of records that are likely matches based on the criteria defined for the node.
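To illustrate how left and right key fields pair up blocks across two sources, the Python sketch below indexes the right source by its key fields and compares each left record only with right records in the matching block. The sample data, the function discover_links and its parameters, the difflib.SequenceMatcher comparison, and the weighted-average score are assumptions for illustration and do not reflect the node's internals.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical sources whose key and comparison fields have different names.
left = [
    {"cust_id": "L1", "postcode": "78701", "full_name": "Jon Smith"},
    {"cust_id": "L2", "postcode": "10001", "full_name": "Ann Lee"},
]
right = [
    {"acct": "R1", "zip": "78701", "name": "John Smith"},
    {"acct": "R2", "zip": "78701", "name": "Mary Jones"},
    {"acct": "R3", "zip": "10001", "name": "Anne Lee"},
]


def discover_links(left_rows, right_rows, left_id, right_id,
                   left_keys, right_keys, comparisons, filter_value):
    # Index the right source into blocks by its key fields.
    blocks = defaultdict(list)
    for row in right_rows:
        blocks[tuple(row[k] for k in right_keys)].append(row)

    total_weight = sum(weight for _, _, weight in comparisons)
    links = []
    for l_row in left_rows:
        # Compare only with right records that share the same block key.
        for r_row in blocks.get(tuple(l_row[k] for k in left_keys), []):
            score = sum(
                weight * SequenceMatcher(
                    None, l_row[lf].lower(), r_row[rf].lower()
                ).ratio()
                for lf, rf, weight in comparisons
            ) / total_weight
            if score >= filter_value:
                links.append((l_row[left_id], r_row[right_id]))
    return links


links = discover_links(
    left, right,
    left_id="cust_id", right_id="acct",
    left_keys=["postcode"], right_keys=["zip"],
    comparisons=[("full_name", "name", 1.0)],
    filter_value=0.8,
)
print(links)
```

Unlike duplicate discovery within one source, pairs are never formed between two records from the same side.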
Dialog Options
Left Key Fields
Specifies the fields of the left input data set to use as keys when indexing the data into blocks for record pair generation and comparison.
Sort Left Input
Sorts the left input data. The left input data must be sorted by the same fields specified in the "Left Key Fields" option. If your data is not already sorted, select this option to have the node sort it for you.
Right Key Fields
Specifies the fields of the right input data set to use as keys when indexing the data into blocks for record pair generation and comparison.
Sort Right Input
Sorts the right input data. The right input data must be sorted by the same fields specified in the "Right Key Fields" option. If your data is not already sorted, select this option to have the node sort it for you.
Field Comparisons
Specifies a set of field-level comparisons that use fuzzy matching algorithms. You pick the fields to compare, the comparison algorithm to apply, and the property settings for the algorithm. A weight can also be assigned to each comparison.
Record Pairs Filter Value
Removes record pairs whose aggregate score is less than this value from the output data stream.
Ports
Input Ports
0 - Input port containing the left data set.
1 - Input port containing the right data set.
Output Ports
0 - Output port containing the results of the discover links operation.
Encode
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DeriveFields Operator to Compute New Fields and Available Functions.
The Encode node provides a library of encoding algorithms that are useful for overcoming phonetic and spelling discrepancies in strings. Specify the encoding type, the field on which you want to perform the encoding, and the resultant field name. Leaving the Target Name value blank causes the node to use the Source Field value, overwriting the source field with its encoded value.
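As an illustration of what a phonetic encoding does, the Python sketch below implements American Soundex, a widely known phonetic algorithm, and applies it to a field. Soundex is used only as an example; this topic does not list which algorithms the Encoding Type option offers. The helper encode_field and its parameters are hypothetical and merely mirror the Source Field and Target Name behavior described above.

```python
def soundex(word):
    """Return the four-character American Soundex code for a word."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]                 # the first letter is kept as-is
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        digit = codes.get(ch, "")        # vowels, H, W, Y have no code
        if digit and digit != prev:
            encoded += digit
        if ch not in "HW":               # H and W do not separate equal codes
            prev = digit
    return (encoded + "000")[:4]         # pad or truncate to four characters


def encode_field(rows, source_field, target_name=None):
    """Write the encoded value to target_name, or over the source if blank."""
    target = target_name or source_field
    return [{**row, target: soundex(row[source_field])} for row in rows]


rows = [{"name": "Smith"}, {"name": "Smyth"}]
with_target = encode_field(rows, "name", "name_code")
overwritten = encode_field(rows, "name")
print(with_target)
print(overwritten)
```

The two spellings receive the same code, so an encoded field can serve as a key that tolerates spelling variation. With no target name, the original value is replaced by its code.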
Dialog Options
Encoding Type
Specifies the type of encoding to perform.
Source Field
Specifies the field on which the encoding takes place.
Target Name
Specifies the name of the resultant field that contains the encoded value. If this setting is left blank, the Source Field name is used and the source field is overwritten with its encoded value.
Ports
Input Ports
0 - Source input
Output Ports
0 - Output flow containing one or more encoded fields