Data Explorer Nodes

Data Quality Analyzer

Data Quality Analyzer evaluates a set of quality tests on an input dataset and separates clean rows from dirty rows.

Data Summarizer

Data Summarizer provides data summary statistics.

Data Summarizer Viewer

Data Summarizer Viewer provides views for the PMML model produced by the Data Summarizer.

Distinct Values

Distinct Values computes all distinct values and their counts for a given column.

Data Quality Analyzer

Note: This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DataQualityAnalyzer Operator to Analyze Data Quality.

Evaluates a set of quality tests on an input dataset. Rows for which all tests pass are considered “clean” and are sent to the “Clean data” output. Rows for which any test fails are considered “dirty” and are sent to the “Dirty data” output. In addition, the node produces a summary model that includes the following statistics:

• Total number of rows

• Total number of rows for which at least one test involving a given field failed

• Per-test failure counts for each test involving a given field
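
As an illustration only (this is not the DataFlow implementation), the clean/dirty split and the summary counts listed above can be sketched in plain Python. The function name `analyze_quality`, the shape of the test tuples, and the sample rows are all hypothetical.

```python
from collections import Counter


def analyze_quality(rows, tests):
    """Split rows into clean and dirty lists and tally test failures.

    rows  : list of dicts mapping field name -> value
    tests : list of (test_name, field_name, predicate) tuples, where
            predicate(value) returns True when the test passes
    """
    clean, dirty = [], []
    rows_failed_by_field = Counter()  # rows with >= 1 failed test on a field
    failures_by_test = Counter()      # failure count per (field, test)

    for row in rows:
        failed_fields = set()
        for test_name, field, predicate in tests:
            if not predicate(row.get(field)):
                failed_fields.add(field)
                failures_by_test[(field, test_name)] += 1
        for field in failed_fields:
            rows_failed_by_field[field] += 1
        # A row is clean only if every test passed.
        (dirty if failed_fields else clean).append(row)

    summary = {
        "total_rows": len(rows),
        "rows_failed_by_field": dict(rows_failed_by_field),
        "failures_by_test": dict(failures_by_test),
    }
    return clean, dirty, summary


# Hypothetical sample data and tests.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": -5},
]
tests = [
    ("not_null", "age", lambda v: v is not None),
    ("non_negative", "age", lambda v: v is not None and v >= 0),
]
clean, dirty, summary = analyze_quality(rows, tests)
```

In this sketch a row that fails two tests on the same field is counted once in the per-field row count but twice in the per-test failure counts, which mirrors the distinction between the second and third statistics in the list.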

Ports

Input Ports

0 - Input data

Output Ports

0 - Clean data

1 - Dirty data

2 - Quality Summary

Views

Reports

Result reports.

Data Summarizer

Note: This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the SummaryStatistics Operator to Calculate Data Statistics.

Discovers various metrics of an input dataset, based on the configured detail level. The types of the fields, combined with the detail level, determine the set of metrics that are calculated.

If the detail level is set to fastest, the following statistics are calculated:

• Min: calculated for fields of type int, long, float, or double

• Max: calculated for fields of type int, long, float, or double

• Mean: calculated for fields of type int, long, float, or double

• Stddev: calculated for fields of type int, long, float, or double

• Variance: calculated for fields of type int, long, float, or double

• Sum: calculated for fields of type int, long, float, or double

• Missing Frequency: calculated for all fields

• Correlation: calculated for fields of type int, long, float, or double

• Covariance: calculated for fields of type int, long, float, or double
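
To show why these statistics suit the fastest level, here is a minimal Python sketch that accumulates several of them for one numeric field in a single pass, using Welford's running update for the variance. It is not the DataFlow implementation. The function name `single_pass_stats` is hypothetical, and the use of the sample variance (dividing by n - 1) is an assumption, since the source does not say which definition the node uses.

```python
import math


def single_pass_stats(values):
    """Compute basic statistics for one numeric field in a single pass.

    None is treated as a missing value. Variance uses Welford's
    running update; the n - 1 divisor is an assumption.
    """
    count = 0
    missing = 0
    total = 0.0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    lo, hi = math.inf, -math.inf

    for v in values:
        if v is None:
            missing += 1
            continue
        count += 1
        total += v
        lo = min(lo, v)
        hi = max(hi, v)
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)

    variance = m2 / (count - 1) if count > 1 else 0.0
    return {
        "count": count,
        "missing": missing,
        "min": lo if count else None,
        "max": hi if count else None,
        "sum": total,
        "mean": mean if count else None,
        "variance": variance,
        "stddev": math.sqrt(variance),
    }


stats = single_pass_stats([2, 4, 4, 4, 5, 5, 7, 9, None])
```

Each value is read once and then discarded, so memory use stays constant regardless of the number of rows.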

If the detail level is set to most detail, the following additional statistics are calculated:

• Intervals: calculated for fields of type int, long, float, or double

• Most Frequent Values: calculated for all fields

• Percentiles: calculated for fields of type int, long, float, or double

• Median: calculated for fields of type int, long, float, or double

• IQR: calculated for fields of type int, long, float, or double
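
The additional statistics need sorted data or per-value counts, which is one reason this level is slower. The Python sketch below illustrates them on a small in-memory list. It is not the DataFlow implementation. The function name `detailed_stats` and its parameters are hypothetical, equal-width intervals are an assumption, and the quantile interpolation method is an assumption, so other tools may report slightly different cut points.

```python
from collections import Counter
import statistics


def detailed_stats(values, num_intervals=4, num_quantiles=4, top_k=2):
    """Illustrative versions of the 'most detail' statistics.

    None is treated as missing and excluded. Assumes at least two
    non-missing values, as statistics.quantiles requires.
    """
    present = [v for v in values if v is not None]

    # Quantile cut points; the 'inclusive' method is an assumption.
    quantiles = statistics.quantiles(present, n=num_quantiles, method="inclusive")
    q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")

    # Equal-width intervals between min and max (an assumption).
    lo, hi = min(present), max(present)
    width = (hi - lo) / num_intervals
    counts = [0] * num_intervals
    for v in present:
        # The maximum value falls into the last interval.
        idx = min(int((v - lo) / width), num_intervals - 1) if width else 0
        counts[idx] += 1

    return {
        "median": statistics.median(present),
        "quantiles": quantiles,
        "iqr": q3 - q1,
        "interval_counts": counts,
        "most_frequent": Counter(present).most_common(top_k),
    }


detail = detailed_stats([2, 4, 4, 4, 5, 5, 7, 9, None])
```

The `num_intervals`, `num_quantiles`, and `top_k` parameters here play roles similar to the interval, quantile, and frequent-item counts that the node lets you configure.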

Dialog Options

detail level

Sets the detail level with which statistics are calculated. Setting it to fastest provides only those statistics that can be calculated in a single pass over the data. Setting it to most detail is slower but provides more statistics.

number of intervals to calculate

Specifies the number of intervals to calculate for each numeric field. This setting is ignored if detail level is set to fastest.

number of quantiles to calculate

Specifies the number of quantiles to calculate for each numeric field. This setting is ignored if detail level is set to fastest.

number of frequent items to calculate

Provides a cap on the number of most frequent values to calculate. This setting is ignored if detail level is set to fastest.

expect a small number of distinct values

Set to true if the number of distinct values per column is generally expected to be small.

Ports

Input Ports

0 - Input data

Output Ports

0 - Summary statistics model

Views

Reports

Result reports.

Data Summarizer Viewer

Note: This topic describes a KNIME node. For the DataFlow operator it is based on, see Calculating Data Statistics.

Data Summarizer Viewer provides views for the PMML model produced by the Data Summarizer.

Ports

Input Ports

0 - Summary statistics model

Views

Reports

Result reports.

Distinct Values

Note: This topic describes a KNIME node. For the DataFlow operator it is based on, see Using the DistinctValues Operator to Find Distinct Values.

Distinct Values computes all distinct values and their counts for a given column.
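
For illustration, the same kind of result can be produced in plain Python with `collections.Counter`. This is a sketch and not the DataFlow implementation. The function name `distinct_values` and the sample rows are hypothetical.

```python
from collections import Counter


def distinct_values(rows, column):
    """Return (value, count) pairs for every distinct value in a column.

    rows is a list of dicts; a missing key is counted as None.
    Pairs are ordered from most to least frequent.
    """
    counts = Counter(row.get(column) for row in rows)
    return counts.most_common()


# Hypothetical sample data.
rows = [
    {"city": "Austin"},
    {"city": "Boston"},
    {"city": "Austin"},
    {"city": "Austin"},
    {"city": "Boston"},
    {"city": "Chicago"},
]
result = distinct_values(rows, "city")
```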

Ports

Input Ports

0 - Input data

Output Ports

0 - Distinct values and counts