Data Explorer Nodes
Data Quality Analyzer
Data Summarizer
Data Summarizer Viewer
Distinct Values
Distinct Values computes all distinct values and their counts for a given column.
Data Quality Analyzer
This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DataQualityAnalyzer Operator to Analyze Data Quality.
Evaluates a set of quality tests on an input dataset. Rows for which all tests pass are considered “clean” and are sent to the “Clean data” output. Rows for which any test fails are considered “dirty” and are sent to the “Dirty data” output. In addition, the node produces a summary model that includes the following statistics:
• Total number of rows
• Total number of rows for which at least one test involving a given field failed
• Per-test failure counts for each test involving a given field
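The clean/dirty split and the summary statistics above can be illustrated with a minimal plain-Python sketch. This is not the node's implementation or API: the function name `analyze_quality`, the `(test_name, field_name, predicate)` tuple shape, and the keys of the summary dictionary are all assumptions made for illustration.

```python
from collections import Counter


def analyze_quality(rows, tests):
    """Split rows into clean and dirty lists and collect failure statistics.

    rows  : list of dicts mapping field name -> value
    tests : list of (test_name, field_name, predicate) tuples
            (hypothetical shape, not the DataFlow API)
    """
    clean, dirty = [], []
    failures_by_test = Counter()      # per-test failure counts, keyed by (field, test)
    rows_failed_by_field = Counter()  # rows with at least one failed test on a field

    for row in rows:
        failed_fields = set()
        for test_name, field, predicate in tests:
            if not predicate(row.get(field)):
                failures_by_test[(field, test_name)] += 1
                failed_fields.add(field)
        # Count each row at most once per field, however many tests failed.
        for field in failed_fields:
            rows_failed_by_field[field] += 1
        # A row is clean only if every test passed.
        (dirty if failed_fields else clean).append(row)

    summary = {
        "total_rows": len(rows),
        "rows_failed_by_field": dict(rows_failed_by_field),
        "failures_by_test": dict(failures_by_test),
    }
    return clean, dirty, summary


if __name__ == "__main__":
    rows = [
        {"age": 34, "name": "Ada"},
        {"age": -1, "name": ""},
        {"age": None, "name": "Bob"},
    ]
    tests = [
        ("not_null", "age", lambda v: v is not None),
        ("non_negative", "age", lambda v: v is not None and v >= 0),
        ("not_empty", "name", lambda v: bool(v)),
    ]
    clean, dirty, summary = analyze_quality(rows, tests)
    print(len(clean), len(dirty), summary)
```

In this sketch a row with several failing tests on the same field contributes once to that field's row count but once per test to the per-test counts, which mirrors the two field-level statistics listed above.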
Ports
Input Ports
0 - Input data
Output Ports
0 - Clean data
1 - Dirty data
2 - Quality Summary
Views
Reports
Result reports
Data Summarizer
This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the SummaryStatistics Operator to Calculate Data Statistics.
Discovers various metrics of an input dataset, based on the configured detail level. The types of the fields, combined with the detail level, determine the set of metrics that are calculated.
If the detail level is set to fastest, the following statistics are calculated:
• Min: calculated for fields of type int, long, float, or double
• Max: calculated for fields of type int, long, float, or double
• Mean: calculated for fields of type int, long, float, or double
• Stddev: calculated for fields of type int, long, float, or double
• Variance: calculated for fields of type int, long, float, or double
• Sum: calculated for fields of type int, long, float, or double
• Missing Frequency: calculated for all fields
• Correlation: calculated for fields of type int, long, float, or double
• Covariance: calculated for fields of type int, long, float, or double
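To show why the per-field statistics in this list need only one pass over the data, here is a minimal sketch of a running accumulator. It is not the operator's code: the class name `RunningStats` is hypothetical, Welford's update is a swapped-in standard technique, and the use of the sample variance (dividing by n - 1) is an assumption, since the source does not say which definition the node uses. Correlation and covariance are omitted for brevity.

```python
import math


class RunningStats:
    """Single-pass min, max, sum, mean, variance, and missing count for one field."""

    def __init__(self):
        self.count = 0            # number of non-missing values seen
        self.missing = 0          # number of missing (None) values seen
        self.total = 0.0
        self.minimum = math.inf
        self.maximum = -math.inf
        self._mean = 0.0
        self._m2 = 0.0            # running sum of squared deviations from the mean

    def add(self, value):
        if value is None:
            self.missing += 1
            return
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        # Welford's update: numerically stable, no second pass required.
        delta = value - self._mean
        self._mean += delta / self.count
        self._m2 += delta * (value - self._mean)

    @property
    def mean(self):
        return self._mean if self.count else float("nan")

    @property
    def variance(self):
        # Sample variance (n - 1 denominator); this choice is an assumption.
        return self._m2 / (self.count - 1) if self.count > 1 else float("nan")

    @property
    def stddev(self):
        return math.sqrt(self.variance)


if __name__ == "__main__":
    stats = RunningStats()
    for v in [2.0, 4.0, None, 4.0, 6.0]:
        stats.add(v)
    print(stats.minimum, stats.maximum, stats.total, stats.mean,
          stats.variance, stats.missing)
```

Each value is examined once and then discarded, so memory use stays constant regardless of the number of rows.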
If the detail level is set to most detail, the following additional statistics are calculated:
• Intervals: calculated for fields of type int, long, float, or double
• Most Frequent Values: calculated for all fields
• Percentiles: calculated for fields of type int, long, float, or double
• Median: calculated for fields of type int, long, float, or double
• IQR: calculated for fields of type int, long, float, or double
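The additional statistics depend on the full distribution of values (sorted order or complete frequency counts), which is why they cost more than the single-pass set. The sketch below illustrates them with the Python standard library only. The function name `detailed_stats`, its parameter names, the equal-width binning for intervals, and the "inclusive" quantile method are assumptions for illustration; the source does not specify how the node defines intervals or interpolates quantiles.

```python
import statistics
from collections import Counter


def detailed_stats(values, n_quantiles=4, n_frequent=3, n_intervals=2):
    """Illustrative 'most detail' statistics for one numeric field."""
    present = [v for v in values if v is not None]

    # Quantile cut points; needs the data in sorted order internally.
    quantiles = statistics.quantiles(present, n=n_quantiles, method="inclusive")
    q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")

    # Equal-width intervals between the minimum and maximum (an assumption).
    lo, hi = min(present), max(present)
    width = (hi - lo) / n_intervals or 1  # avoid division by zero for constant data
    interval_counts = [0] * n_intervals
    for v in present:
        index = min(int((v - lo) / width), n_intervals - 1)
        interval_counts[index] += 1

    return {
        "median": statistics.median(present),
        "iqr": q3 - q1,
        "quantiles": quantiles,
        "most_frequent": Counter(present).most_common(n_frequent),
        "interval_counts": interval_counts,
    }


if __name__ == "__main__":
    result = detailed_stats([1, 2, 2, 3, 4, 5, 5, 5, 9], n_frequent=2)
    for name, value in result.items():
        print(name, value)
```

The `n_intervals`, `n_quantiles`, and `n_frequent` parameters play the same role as the corresponding dialog options described below.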
Dialog Options
detail level
Sets the detail level with which statistics are calculated. Setting to fastest provides those statistics that can be calculated in a single pass over the data. Setting to most detail is slower but provides more statistics.
number of intervals to calculate
Specifies the number of intervals to calculate for each numeric field. This setting is ignored if detail level is set to fastest.
number of quantiles to calculate
Specifies the number of quantiles to calculate for each numeric field. This setting is ignored if detail level is set to fastest.
number of frequent items to calculate
Provides a cap on the number of most frequent values to calculate. This setting is ignored if detail level is set to fastest.
expect a small number of distinct values
Set to true if the number of distinct values per column is generally expected to be small.
Ports
Input Ports
0 - Input data
Output Ports
0 - Summary statistics model
Views
Reports
Result reports
Data Summarizer Viewer
This topic describes a KNIME node. For the DataFlow operator it is based on, see Calculating Data Statistics.
Data Summarizer Viewer provides views for the PMML model produced by the Data Summarizer.
Ports
Input Ports
0 - Summary statistics model
Views
Reports
Result reports
Distinct Values
This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DistinctValues Operator to Find Distinct Values.
Distinct Values computes all distinct values and their counts for a given column.
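As a rough illustration of the output this node describes, the following plain-Python sketch counts the distinct values of one column. The function name `distinct_values` and the list-of-dicts row format are hypothetical, and treating a missing value (`None`) as its own distinct value is an assumption; the source does not say how the node handles missing values.

```python
from collections import Counter


def distinct_values(rows, column):
    """Return a mapping of each distinct value in `column` to its row count."""
    # Rows lacking the column are counted under None (an assumption).
    return Counter(row.get(column) for row in rows)


if __name__ == "__main__":
    rows = [
        {"color": "red"},
        {"color": "blue"},
        {"color": "red"},
        {"color": None},
    ]
    for value, count in distinct_values(rows, "color").items():
        print(value, count)
```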
Ports
Input Ports
0 - Input data
Output Ports
0 - Distinct values and counts