Field Data Discovery
Data Discovery provides insights into patterns, values, and formats within the columns of a source dataset. It serves as a powerful tool for users to identify potential issues and understand the nature of their data more deeply. By exploring the characteristics of each column, users can pinpoint anomalies, inconsistencies, or outliers that may require attention. This also helps to determine the rules and rule types that are needed for data profiling and remediation.
Data Discovery can be configured to run automatically or can be triggered manually (see Setting Data Profile Preferences). When configured to run automatically, it runs when the user connects to the source file during profile creation or opens an existing profile.
On the Rules tab, in Fields View, the rules are grouped under Field Names. Selecting a field name displays its Field Data Discovery results in the right pane. If no field is selected, no Field Data Discovery results are shown.
IMPORTANT! The content of the Field Data Discovery pane will not be displayed if you are not connected to the source.
Data Discovery options:
Note: If no rules are defined, the information and run icons will not be displayed on the Field Data Discovery pane. In such instances, users must add a rule to gain access to these icons.
The Field Data Discovery information is based on a sample of the source data and is displayed in the following tabs:
• Most Frequent Values: Shows the Value, Count (the number of times it occurs), and Frequency of occurrence for all the unique values (including blank and empty values) in the selected field. The Total, Missing, and Invalid Frequency are also displayed at the top. Most Frequent Values are discoverable for most data types. See Most Frequent Values.
• String Patterns: Shows the following details of discovered string patterns:
– Count-%: How many rows and what percentage of rows in the dataset have the current pattern.
– Input: The unique field value for which the other details are being displayed.
– RegX Pattern: The regular expression (regex) pattern for the field value.
– Display Pattern: A user-friendly pattern built from character classes such as digits, letters, special characters, and spaces.
– Literal Pattern: The same as the input value, but with regex metacharacters escaped. For example, a literal period (.) is displayed as (\.). Escaping is required if you want to use the value as a regular expression.
String Patterns are discoverable for the String data type only. See Matches Regex.
• Statistics: Shows the following information for numeric data types:
– Mode - Most frequently occurring value in the selected field.
– Min - The lowest value in the selected field.
– Max - The highest value in the selected field.
– Mean - The average of the given numbers in the selected field.
– Median - The median (middle value) of the numbers in the selected field.
– Standard Deviation - The Standard Deviation value for the selected field.
– Variance - The Variance value for the selected field.
– Sum - The sum of all values in the selected field.
– Quantile (25.0)
– Quantile (50.0)
– Quantile (75.0)
– Quantile Outlier Lower Bound
– Quantile Outlier Upper Bound
This information is discoverable for numeric data types only. For string fields, the string length is used instead. See Statistics.
• Equal Range Binning: Equal-width binning divides the range of source field values into a specified number of equally spaced intervals (default is 10) between the minimum and maximum values. This information is discoverable for numeric data types only. See Equal Range Binning.
• Possible Schema: A sample of the source field is scanned to identify the potential data type of the field.
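To make the tab descriptions more concrete, the following Python sketch approximates three of the calculations described above: the Value, Count, and Frequency rows of Most Frequent Values, the escaping behind Literal Pattern, and equal-width binning with the default of 10 intervals. It is an illustration only and not the product's implementation. The function names are hypothetical, the frequency is assumed to be a percentage of all sampled values, and placing the maximum value in the last bin is an assumption.

```python
import re
from collections import Counter


def most_frequent_values(values):
    """Return (value, count, frequency %) rows, most frequent first.

    Blank and empty values are counted like any other value.
    """
    total = len(values)
    counts = Counter(values)
    return [(v, c, 100.0 * c / total) for v, c in counts.most_common()]


def literal_pattern(value):
    """Escape regex metacharacters so the value matches itself literally."""
    return re.escape(value)


def equal_width_bins(numbers, bin_count=10):
    """Split [min, max] into bin_count equal intervals and count values in each."""
    low, high = min(numbers), max(numbers)
    width = (high - low) / bin_count
    counts = [0] * bin_count
    for x in numbers:
        if width == 0:
            # All values are identical; assumed here to fall into the first bin.
            index = 0
        else:
            # The maximum lies on the last edge, so clamp it into the final bin.
            index = min(int((x - low) / width), bin_count - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(bin_count + 1)]
    return edges, counts
```

For example, `literal_pattern("3.14")` escapes the period, which matches the Literal Pattern behavior described for the String Patterns tab, and `equal_width_bins(values)` returns 11 edges and 10 counts when the default is used.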
Last modified date: 10/22/2024