DC 12.3 | Field Data Discovery


Field Data Discovery

Data Discovery provides insights into patterns, values, and formats within the columns of a source dataset. It serves as a powerful tool for users to identify potential issues and understand the nature of their data more deeply. By exploring the characteristics of each column, users can pinpoint anomalies, inconsistencies, or outliers that may require attention. This also helps to determine the rules and rule types that are needed for data profiling and remediation.

Data Discovery can be configured to run automatically or can be triggered manually (see Setting Data Profile Preferences). When configured to run automatically, Data Discovery runs when the user connects to the source file during profile creation or opens an existing profile.

On the Rules tab, in Fields View, the rules are grouped under field names. Selecting a field name displays its Field Data Discovery results in the right pane. If no field is selected, no Field Data Discovery results are shown.

IMPORTANT! The content of the Field Data Discovery pane will not be displayed if you are not connected to the source.

Data Discovery options:

Icon	Description
(Information Icon)	Hover over this icon to display the Data Discovery preferences (see Setting Data Profile Preferences). The following information is displayed: • Whether Data Discovery is turned ON: When ON, Data Discovery runs automatically when the user connects to the source file during profile creation or opens an existing profile. When OFF, you can still trigger Data Discovery manually within the profile editor by clicking the Run icon. • Discovery sample size: The number of source records used for Field Data Discovery. This option is useful if your source data is very large; working with a sample improves performance while you design a profile and configure profile rules. The default setting is 10,000 records, but it can be changed to 1,000, 5,000, 25,000, or All Records.
(Run Icon)	Click this icon to run Data Discovery manually for the selected field. This option is useful when Data Discovery is turned OFF.

Note: If no rules are defined, the information and run icons will not be displayed on the Field Data Discovery pane. In such instances, users must add a rule to gain access to these icons.

The Field Data Discovery information is based on the sampled source data (as determined by the Discovery sample size) and is displayed in the following tabs:

• Most Frequent Values: Shows the Value, Count (the number of times it occurs), and Frequency of occurrence for all the unique values (including blank and empty values) in the selected field. The Total, Missing, and Invalid Frequency are also displayed at the top. The Most Frequent Values are discoverable for most data types. See Most Frequent Values.

• String Patterns: Shows the following details of discovered string patterns:

– Count-%: How many rows and what percentage of rows in the dataset have the current pattern.

– Input: The unique field value for which the other details are being displayed.

– RegX Pattern: The regular expression (regex) pattern for the field value.

– Display Pattern: A user-friendly pattern built from character classes such as digits, letters, special characters, and spaces.

– Literal Pattern: The same as the input value, but with regex metacharacters escaped. For example, a literal period (.) is displayed as (\.). Escaping is required if you want to use the value as a regular expression.

The String Patterns are discoverable for String data type only. See Matches Regex.
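The pattern columns described above can be illustrated with a minimal Python sketch. It is not the product's implementation: the function names, the display notation (9 for a digit, A for a letter), and the way runs of characters are generalized into a regex are all assumptions made for illustration. Only the escaping behavior of the literal pattern follows directly from the description.

```python
import re
from itertools import groupby


def char_class(ch: str) -> str:
    """Classify one character (hypothetical classes for this sketch)."""
    if ch.isascii() and ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "letter"
    return "other"  # spaces and special characters


def display_pattern(value: str) -> str:
    """User-friendly pattern: 9 = digit, A = letter (assumed notation)."""
    symbols = {"digit": "9", "letter": "A"}
    return "".join(symbols.get(char_class(ch), ch) for ch in value)


def literal_pattern(value: str) -> str:
    """Input value with regex metacharacters escaped, e.g. '.' -> '\\.'."""
    return re.escape(value)


def regex_pattern(value: str) -> str:
    """Generalize a value into a regex by collapsing runs of one class."""
    parts = []
    for cls, run in groupby(value, key=char_class):
        text = "".join(run)
        if cls == "digit":
            parts.append(r"\d{%d}" % len(text))
        elif cls == "letter":
            parts.append("[A-Za-z]{%d}" % len(text))
        else:
            parts.append(re.escape(text))  # keep other characters literal
    return "".join(parts)


sample = "AB-12.5"
print(display_pattern(sample))
print(literal_pattern(sample))
print(regex_pattern(sample))
```

The literal pattern matches only the exact input value, while the generalized regex matches any value with the same shape, which is the distinction that matters when choosing a pattern for a Matches Regex rule.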

• Statistics: Shows the following information for numeric data types:

– Mode - Most frequently occurring value in the selected field.

– Min - The lowest value in the selected field.

– Max - The highest value in the selected field.

– Mean - The average of the given numbers in the selected field.

– Median - The median (middle value) of the numbers in the selected field.

– Standard Deviation - The Standard Deviation value for the selected field.

– Variance - The Variance value for the selected field.

– Sum - The sum of all values in the selected field.

– Quantile (25.0)

– Quantile (50.0)

– Quantile (75.0)

– Quantile Outlier Lower Bound

This information is discoverable for numeric data types only; for string fields, the statistics are based on string length. See Statistics.
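To make the statistics listed above concrete, the following Python sketch computes the same set of measures with the standard library. It is an illustration only: the guide does not state whether the product uses sample or population variance, which quantile interpolation it applies, or how the outlier lower bound is derived. The sketch assumes sample variance, inclusive quantiles, and the common Q1 - 1.5 × IQR rule; the function name field_statistics is hypothetical.

```python
import statistics


def field_statistics(values):
    """Compute the measures shown on the Statistics tab (assumed formulas)."""
    # Quartiles; the interpolation method is an assumption of this sketch.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "Mode": statistics.mode(values),
        "Min": min(values),
        "Max": max(values),
        "Mean": statistics.mean(values),
        "Median": statistics.median(values),
        # Sample (n - 1) standard deviation and variance; assumed.
        "Standard Deviation": statistics.stdev(values),
        "Variance": statistics.variance(values),
        "Sum": sum(values),
        "Quantile (25.0)": q1,
        "Quantile (50.0)": q2,
        "Quantile (75.0)": q3,
        # Assumed rule: values below Q1 - 1.5 * IQR are low outliers.
        "Quantile Outlier Lower Bound": q1 - 1.5 * iqr,
    }


numbers = [1, 2, 2, 3, 4, 5, 6, 7, 8]
for name, value in field_statistics(numbers).items():
    print(f"{name}: {value}")
```

For a string field, the same function could be applied to the string lengths, for example field_statistics([len(s) for s in names]), where names is a hypothetical list of field values.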

• Equal Range Binning: Equal-width binning divides the range of source field values into a specified number of equally spaced intervals (10 by default) between the minimum and maximum values. This information is discoverable for numeric data types only. See Equal Range Binning.

• Possible Schema: A sample of the source field is scanned to identify the field's potential data type.
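The equal-width binning described for the Equal Range Binning tab can be sketched in a few lines of Python. This is a hedged illustration, not the product's algorithm: the function name equal_width_bins is hypothetical, and the choice to place the maximum value in the last bin (rather than opening an extra bin) is an assumption.

```python
def equal_width_bins(values, bins=10):
    """Split [min, max] into `bins` equally wide intervals and count values.

    Returns a list of (lower_bound, upper_bound, count) tuples.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values identical: a single degenerate bin (assumed behavior).
        return [(lo, hi, len(values))]
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value falls into the last bin instead of a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(bins)]


for lower, upper, count in equal_width_bins(list(range(0, 101))):
    print(f"{lower:6.1f} to {upper:6.1f}: {count}")
```

Because every interval has the same width, sparsely populated bins at either end of the range are a quick visual cue for skewed data or outliers.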

Last modified date: 12/03/2024