DC 12.4 | Field Data Discovery

Field Data Discovery

Data Discovery provides insights into patterns, values, and formats within the columns of a source dataset. It serves as a powerful tool for users to identify potential issues and understand the nature of their data more deeply. By exploring the characteristics of each column, users can pinpoint anomalies, inconsistencies, or outliers that may require attention. This also helps to determine the rules and rule types that are needed for data profiling and remediation.

Data Discovery can be configured to run automatically or can be triggered manually (see Setting Data Profile Preferences). When configured to run automatically, data discovery runs when the user connects to the source file during profile creation or upon opening an existing profile.

On the Rules tab, in the Field/Rule pane, rules are grouped under field names. Selecting a field name displays its Field Data Discovery results in the right pane. If no field is selected, no Field Data Discovery results are shown.

IMPORTANT! The content of the Field Data Discovery pane will not be displayed if you are not connected to the source.

Data Discovery options:

Icons	Description
(Information Icon)	Hover over this icon to display the Data Discovery preference settings (see Setting Data Profile Preferences). The following information is displayed: • Status (ON/OFF) - When Data Discovery is turned ON, data discovery runs automatically when the user connects to the source file during profile creation or upon opening an existing profile. When OFF, you can still manually trigger data discovery within the profile editor by clicking the Run icon. • Sample size - The number of source records used for Field Data Discovery. This option is useful when the source data is very large, which can slow performance. Setting a sample size improves performance while designing a profile and configuring profile rules. The default setting is 10,000 records, but it can be changed to 1,000, 5,000, 25,000, or All Records.
(Run Icon)	Click this icon to manually run Data Discovery for a selected field. This option is useful when Field Data Discovery is turned OFF.


Note: If no rules are defined, the Information and Run icons are not displayed on the Field Data Discovery pane. In that case, add a rule to gain access to these icons.

The Field Data Discovery information is based on the sampled source data (see Sample size above) and is displayed in the following tabs:

• Most Frequent Values: Shows the Value, Count (the number of times it occurs), and Frequency of occurrence for all the unique values (including blank and empty values) in the selected field. The Total and Missing counts are also displayed at the top. The Most Frequent Values are discoverable for most data types. See MostFrequentValues.
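The Value, Count, and Frequency columns described above can be approximated outside the product with a short Python sketch. The function name `most_frequent_values`, the sample data, and the rule that `None` and whitespace-only strings count as "missing" are assumptions made for illustration; the product's own definitions may differ.

```python
from collections import Counter


def most_frequent_values(values):
    """Return (total, missing, rows) for a sampled column.

    rows is a list of (value, count, frequency_percent), most common first.
    Treating None and whitespace-only strings as missing is an assumption.
    """
    total = len(values)
    if total == 0:
        return 0, 0, []
    missing = sum(1 for v in values if v is None or str(v).strip() == "")
    counts = Counter(values)
    rows = [
        (value, count, round(100.0 * count / total, 2))
        for value, count in counts.most_common()
    ]
    return total, missing, rows


# Hypothetical sample of a "state" field, including blank and empty values.
sample = ["NY", "CA", "NY", "", None, "TX", "NY", "CA"]
total, missing, rows = most_frequent_values(sample)
for value, count, freq in rows:
    print(repr(value), count, f"{freq}%")
```

Blank and empty values stay in the frequency rows, mirroring the tab's description, and are also tallied separately as missing.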

• String Patterns: Shows the Count-%, Input, RegX Pattern, Display Pattern and Literal Pattern (which are described below). The total number of Unique Patterns and the total number of records sampled are also displayed at the top.

String patterns can be copied and pasted into rules (or any enabled text box) by right-clicking and selecting Copy Regex Pattern (to copy the regular expression) or Copy Input Value (to copy the input value). For example, you can paste a copied regex pattern into the MatchesRegex rule.

The following describes the discovered string patterns:

– Count-%: The number of rows, and the percentage of rows, that have the current pattern.

– Input: The unique field value which the other String Patterns columns describe.

– RegX Pattern: The regex pattern (regular expression pattern) that matches the Input.

– Display Pattern: A user-friendly pattern (built from character classes such as digits, letters, special characters, and spaces) that describes the Input.

– Literal Pattern: The same as the Input value, but with regex metacharacters escaped. For example, a literal period (.) is displayed as (\.). Escaping is required if you want to use the value as a regular expression.

The String Patterns are discoverable for String data type only. See MatchesRegex.
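To make the relationship between the Input, RegX Pattern, Display Pattern, and Literal Pattern columns concrete, here is a minimal Python sketch. It is not the product's algorithm: the function names, the `9`/`A` display symbols, and the generated regex shape are all assumptions chosen for illustration. Only `re.escape` reflects the documented behavior directly, since it escapes regex metacharacters such as the period.

```python
import re
import string
from collections import Counter
from itertools import groupby


def _char_class(ch):
    """Classify a character as an ASCII digit, an ASCII letter, or other."""
    if ch in string.digits:
        return "digit"
    if ch in string.ascii_letters:
        return "alpha"
    return "other"


def display_pattern(value):
    """Hypothetical display notation: 9 for a digit, A for a letter."""
    symbols = {"digit": "9", "alpha": "A"}
    return "".join(symbols.get(_char_class(ch), ch) for ch in value)


def regex_pattern(value):
    """Build a regex that matches values with the same shape as the input."""
    parts = []
    for cls, run in groupby(value, key=_char_class):
        text = "".join(run)
        if cls == "digit":
            parts.append(r"\d{%d}" % len(text))
        elif cls == "alpha":
            parts.append("[A-Za-z]{%d}" % len(text))
        else:
            parts.append(re.escape(text))  # escape special characters
    return "".join(parts)


def literal_pattern(value):
    """The input value with regex metacharacters escaped."""
    return re.escape(value)


# Hypothetical sampled values; count how many rows share each display pattern.
samples = ["AB-1234.5", "CD-5678.9", "x1"]
pattern_counts = Counter(display_pattern(s) for s in samples)
print(display_pattern("AB-1234.5"), regex_pattern("AB-1234.5"))
print(literal_pattern("3.14"))
```

The literal pattern matches only the exact input, while the generated regex also matches other values of the same shape, which is why the two are useful for different rules.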

• Statistics: Shows the following information for numeric data types:

– Mode - Most frequently occurring value in the selected field.

– Min - The lowest value in the selected field.

– Max - The highest value in the selected field.

– Mean - The average of the values in the selected field.

– Median - The median (middle value) of the numbers in the selected field.

– Standard Deviation - The Standard Deviation value for the selected field.

– Variance - The Variance value for the selected field.

– Sum - The sum of all values in the selected field.

– Quantile (25.0)

– Quantile (50.0)

– Quantile (75.0)

– Quantile Outlier Lower Bound

This information is discoverable for numeric data types. For string fields, the string length is used as the value. See Statistics.
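The measures listed above can be reproduced for a small sample with Python's standard `statistics` module, which may help when sanity-checking results. This sketch rests on several assumptions that the guide does not confirm: it uses the sample (not population) standard deviation and variance, the "inclusive" quantile method, and the common Tukey rule (Q1 minus 1.5 times the interquartile range) for the outlier lower bound. The function name and data are hypothetical.

```python
import statistics


def field_statistics(values):
    """Compute summary statistics for a list of numbers.

    Assumptions: sample variance/stdev, inclusive quantiles, and
    lower bound = Q1 - 1.5 * IQR. The product may define these differently.
    """
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "mode": statistics.mode(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "standard_deviation": statistics.stdev(values),
        "variance": statistics.variance(values),
        "sum": sum(values),
        "quantile_25": q1,
        "quantile_50": q2,
        "quantile_75": q3,
        "quantile_outlier_lower_bound": q1 - 1.5 * iqr,
    }


# Hypothetical sampled numeric field.
stats = field_statistics([2, 4, 4, 5, 7, 9, 11])
for name, value in stats.items():
    print(f"{name}: {value}")

# For a string field, the same function could be applied to string lengths.
length_stats = field_statistics([len(s) for s in ["ab", "abcd", "abcd", "abcdef"]])
```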

• Equal Range Binning: Equal width binning involves dividing the range of source field values into a specified number of equally spaced intervals (default is 10) between the minimum and maximum values. This information is discoverable for numeric data types only. See EqualRangeBinning.
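Equal width binning as described above can be sketched in a few lines of Python. The function name `equal_range_bins` and the sample values are hypothetical, and placing the maximum value in the last bin (rather than in a bin of its own) is an assumption about edge handling.

```python
def equal_range_bins(values, bins=10):
    """Split [min, max] into `bins` equal-width intervals and count values.

    Returns (edges, counts). The maximum value is assigned to the last bin,
    which is an assumption about how the upper edge is handled.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if width == 0:
            index = 0  # all values identical: everything lands in bin 0
        else:
            index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


# Hypothetical sampled numeric field, using the default of 10 bins.
edges, counts = equal_range_bins([0, 5, 10, 15, 20, 25, 50, 75, 99, 100])
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left}, {right}): {count}")
```

Because every interval has the same width, skewed data shows up as a few crowded bins next to several empty ones.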

• Possible Data Type: A sample of the source field is scanned to identify the potential data type of the field. This discovery result is shown for strings that could possibly be converted to other data types, for example, string to date, time, or boolean.

Discovered data types and pictures can be copied and pasted into rules (or any enabled text box) by right-clicking and selecting Copy Type (to copy the data type) or Copy Picture (to copy the data picture). For example, you can paste a copied date format picture into the ChangeFormat rule.
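The idea behind Possible Data Type, scanning sampled strings to see whether they all convert to another type, can be illustrated with a small Python sketch. Everything here is an assumption for illustration: the candidate format lists, the boolean tokens, and the function name are hypothetical, and the `strptime` format codes shown are Python's notation, not the product's picture syntax.

```python
from datetime import datetime

# Hypothetical candidate formats, in Python strptime notation.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y"]
TIME_FORMATS = ["%H:%M:%S", "%H:%M"]
BOOLEAN_TOKENS = {"true", "false", "yes", "no"}


def _matches(value, fmt):
    """Return True if the whole string parses with the given format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def possible_data_type(values):
    """Guess a (type, format) pair that every non-blank sampled string fits."""
    non_blank = [v.strip() for v in values if v and v.strip()]
    if not non_blank:
        return ("string", None)
    if all(v.lower() in BOOLEAN_TOKENS for v in non_blank):
        return ("boolean", None)
    for fmt in DATE_FORMATS:
        if all(_matches(v, fmt) for v in non_blank):
            return ("date", fmt)
    for fmt in TIME_FORMATS:
        if all(_matches(v, fmt) for v in non_blank):
            return ("time", fmt)
    return ("string", None)


print(possible_data_type(["2024-01-31", "2023-12-01", ""]))
print(possible_data_type(["true", "False"]))
print(possible_data_type(["12:30:00", "08:15:59"]))
```

A single value that does not fit (for example, free text mixed in with dates) leaves the field classified as a string in this sketch.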

Last modified date: 01/08/2026