Data Explorer Nodes
Data Quality Analyzer
Data Quality Analyzer evaluates a set of quality tests on an input dataset and separates clean rows from dirty rows.
Data Summarizer
Data Summarizer provides data summary statistics.
Data Summarizer Viewer
Data Summarizer Viewer provides views for the PMML model produced by the Data Summarizer.
Distinct Values
Distinct Values computes all distinct values and their counts for a given column.
Data Quality Analyzer
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DataQualityAnalyzer Operator to Analyze Data Quality.
Evaluates a set of quality tests on an input dataset. Rows for which all tests pass are considered “clean” and are sent to the “Clean data” output. Rows for which any test fails are considered “dirty” and are sent to the “Dirty data” output. In addition, the node produces a summary model that includes the following statistics:
Total number of rows
Total number of rows for which at least one test involving a given field failed
Per-test failure counts for each test involving a given field
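The routing and the summary statistics described above can be sketched in plain Python. This is only an illustration of the logic, not the node's implementation: the function name `analyze_quality`, the `(field, test_name, predicate)` test format, and the sample rows are all hypothetical.

```python
from collections import Counter

def analyze_quality(rows, tests):
    """Split rows into clean/dirty lists and tally test failures.

    `tests` is a list of (field, test_name, predicate) triples.
    This structure is an assumption made for the sketch.
    """
    clean, dirty = [], []
    rows_failed_by_field = Counter()  # rows with at least one failed test on a field
    failures_by_test = Counter()      # failure count per (field, test) pair
    for row in rows:
        failed_fields = set()
        for field, name, predicate in tests:
            if not predicate(row.get(field)):
                failures_by_test[(field, name)] += 1
                failed_fields.add(field)
        for field in failed_fields:
            rows_failed_by_field[field] += 1
        # A row is clean only if every test passed.
        (dirty if failed_fields else clean).append(row)
    summary = {
        "total_rows": len(rows),
        "rows_failed_by_field": dict(rows_failed_by_field),
        "failures_by_test": dict(failures_by_test),
    }
    return clean, dirty, summary

# Hypothetical sample data and tests.
rows = [
    {"age": 34, "name": "Ann"},
    {"age": -1, "name": "Bob"},
    {"age": None, "name": ""},
]
tests = [
    ("age", "not_null", lambda v: v is not None),
    ("age", "non_negative", lambda v: v is not None and v >= 0),
    ("name", "not_empty", lambda v: bool(v)),
]
clean, dirty, summary = analyze_quality(rows, tests)
```

Here `clean` corresponds to the “Clean data” output, `dirty` to the “Dirty data” output, and `summary` to the quality summary.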
Ports
Input Ports
0 - Input data
Output Ports
0 - Clean data
1 - Dirty data
2 - Quality Summary
Views
Reports
Result reports.
Data Summarizer
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the SummaryStatistics Operator to Calculate Data Statistics.
Calculates metrics for an input dataset based on the configured detail level. The field types, combined with the detail level, determine which metrics are calculated.
If the detail level is set to fastest, the following statistics are calculated:
Min: calculated for fields of type int, long, float, or double
Max: calculated for fields of type int, long, float, or double
Mean: calculated for fields of type int, long, float, or double
Stddev: calculated for fields of type int, long, float, or double
Variance: calculated for fields of type int, long, float, or double
Sum: calculated for fields of type int, long, float, or double
Missing Frequency: calculated for all fields
Correlation: calculated for fields of type int, long, float, or double
Covariance: calculated for fields of type int, long, float, or double
If the detail level is set to most detail, the following additional statistics are calculated:
Intervals: calculated for fields of type int, long, float, or double
Most Frequent Values: calculated for all fields
Percentiles: calculated for fields of type int, long, float, or double
Median: calculated for fields of type int, long, float, or double
IQR: calculated for fields of type int, long, float, or double
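One way to picture the difference between the two detail levels: statistics such as min, max, sum, mean, and variance can be accumulated in a single pass over a column, while median and IQR need the values in sorted order. The sketch below illustrates this for one numeric column. It is not the node's implementation. The function names are hypothetical, and the use of sample variance (dividing by n - 1) and the "inclusive" quantile method are assumptions.

```python
import math
import statistics

def single_pass_stats(values):
    """Accumulate basic statistics in one pass (Welford's method for variance)."""
    n = 0
    missing = 0
    total = 0.0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    lo, hi = math.inf, -math.inf
    for v in values:
        if v is None:  # None stands in for a missing value
            missing += 1
            continue
        n += 1
        total += v
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
        lo, hi = min(lo, v), max(hi, v)
    # Assumption: sample variance; the node may use a different definition.
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return {
        "min": lo, "max": hi, "sum": total, "mean": mean,
        "variance": variance, "stddev": math.sqrt(variance),
        "missing": missing,
    }

def order_stats(values):
    """Median and IQR require sorted data, hence the extra work."""
    present = sorted(v for v in values if v is not None)
    q1, q2, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {"median": q2, "iqr": q3 - q1}

column = [2.0, 4.0, None, 4.0, 6.0]  # hypothetical sample column
fast = single_pass_stats(column)
detail = order_stats(column)
```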
Dialog Options
detail level
Sets the detail level with which statistics are calculated. Setting to fastest provides those statistics that can be calculated in a single pass over the data. Setting to most detail is slower but provides more statistics.
number of intervals to calculate
Specifies the number of intervals to calculate for each numeric field. This setting is ignored if detail level is set to fastest.
number of quantiles to calculate
Specifies the number of quantiles to calculate for each numeric field. This setting is ignored if detail level is set to fastest.
number of frequent items to calculate
Provides a cap on the number of most frequent values to calculate. This setting is ignored if detail level is set to fastest.
expect a small number of distinct values
Set to true if the number of distinct values per column is generally expected to be small.
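To make the interval, quantile, and frequent-item settings concrete, the following sketch shows how three such parameters could drive a calculation on a single numeric column. It is an illustration under stated assumptions, not the node's algorithm: the function and parameter names are hypothetical, intervals are assumed to be equal-width bins between the column minimum and maximum, and quantiles use Python's default method.

```python
from collections import Counter
import statistics

def detailed_stats(values, num_intervals=10, num_quantiles=4, max_frequent=5):
    """Hypothetical mapping of the three count settings to a computation."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    # Assumption: equal-width bins; fall back to width 1.0 for a constant column.
    width = (hi - lo) / num_intervals or 1.0
    counts = [0] * num_intervals
    for v in present:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), num_intervals - 1)
        counts[idx] += 1
    return {
        "intervals": counts,
        # n quantiles yield n - 1 cut points.
        "quantiles": statistics.quantiles(present, n=num_quantiles),
        # The cap limits how many of the most frequent values are reported.
        "most_frequent": Counter(present).most_common(max_frequent),
    }

column = [1, 2, 2, 3, 3, 3, 10]  # hypothetical sample column
result = detailed_stats(column, num_intervals=3, num_quantiles=4, max_frequent=2)
```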
Ports
Input Ports
0 - Input data
Output Ports
0 - Summary statistics model
Views
Reports
Result reports
Data Summarizer Viewer
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Calculating Data Statistics.
Data Summarizer Viewer provides views for the PMML model produced by the Data Summarizer.
Ports
Input Ports
0 - Summary statistics model
Views
Reports
Result reports
Distinct Values
 KNIME  This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DistinctValues Operator to Find Distinct Values.
Distinct Values computes all distinct values and their counts for a given column.
Ports
Input Ports
0 - Input data
Output Ports
0 - Distinct values and counts
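The kind of result this node produces can be approximated in a few lines of Python. The sketch below is only an analogy: the function name `distinct_values` and the sample rows are hypothetical, and the ordering of the output (most frequent first) is an assumption, not documented node behavior.

```python
from collections import Counter

def distinct_values(rows, column):
    """Return (value, count) pairs for one column, most frequent first."""
    counts = Counter(row[column] for row in rows)
    # Sort by descending count, then by value text for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))

# Hypothetical sample data.
rows = [{"color": "red"}, {"color": "blue"}, {"color": "red"}]
result = distinct_values(rows, "color")
```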