Data Quality Dimensions
Users can optionally organize test rules into data quality dimensions to surface specific data quality issues in the dataset. Each data quality dimension generates a data quality score, enabling users to immediately learn what types of data quality issues exist, how prevalent they are, and which fields have each issue.
There are six dimensions to choose from, each representing an essential characteristic of quality data: Accuracy, Completeness, Consistency, Timeliness, Uniqueness and Validity.
To configure data quality dimensions, users associate a test rule with a dimension when they configure the rule (by setting the
Dimensions property). Users can associate test rules with the dimensions of their choice. Test rules are
Assert,
CompareToConstant,
FuzzyMatch,
IsNotBlank,
IsNotDuplicate,
IsNotNull,
InRange,
MatchesRegex,
RemoveDuplicates and
RemoveDuplicatesFuzzyMatching.
For an example of how dimensions work and the visibility users gain from using them, see
Example: Data Quality Dimensions.
Once data quality dimensions are configured, users manage them on the
Configuration tab in the
Dimensions section (see
Managing Data Quality Dimensions), and view output results in the
Statistics tab (see
Viewing Statistics).
The following are the dimensions and the data quality characteristic they represent:
• Accuracy - The data is correct.
• Completeness - The data is present.
• Consistency - The data uses the same format or pattern across different sources.
• Timeliness - The data is recent and available.
• Uniqueness - The data is not duplicated.
• Validity - The data conforms to business rules and is within an acceptable range.
Dimensions also enable users to specify the importance levels of rules by assigning rule weights, which are figured into dimension scores. Rule weights are useful, for example, to call out business critical fields.
A dimension score is derived by aggregating individual rule pass results (fields that passed test rule criteria) using the following calculation:
(Rule PASS Percentage * Weight)/Total Weights
For details about how scores are calculated, see
How Data Quality Dimension Scores are Calculated.
All scores are updated upon profile execution and visible in the
Statistics tab (which opens automatically after a profile is executed). See
Viewing Statistics. If no dimensions are configured, dimension scores are not generated.
To disable dimensions, move all rules out of dimensions.
Example: Data Quality Dimensions
To illustrate how dimensions provide greater visibility into data quality issues, let’s compare profile execution results for a single dataset with the same rules applied: one where data quality dimensions are not configured, and another where they are configured.
Our dataset contains customer account information, including name, address, email and account number. We want to know what data is missing, duplicated, and inconsistently formatted. To get answers to these initial questions, we apply the:
• IsNotBlank and IsNotNull rule types against all fields in the dataset
• MatchesRegex rule type against the Zip and Email fields
• IsNotDuplicate rule type against the Email, Company and Account Number fields
Note: When you compare the two profile execution results, you’ll notice that the Data Quality Index (DQI) scores differ slightly. This is because when no dimensions are configured the Data Quality Index score is equal to the percentage of records that passed their rule criteria. And when dimensions are configured it's equal to the average of all the dimension scores.
Without Dimensions
When data quality dimensions are not configured, post profile execution results display a long list of all executed rules and their pass/fail results (see figure, below).
To see which records are missing, duplicated, and inconsistently formatted we must look at each rule. For example, we see individual results for each field the rule is configured against (EmailIsNotBlank, StreetIsNotBlank, ZipIsNotBlank, and so forth). We cannot see the results as a whole (for all the fields the rule is applied to).
With Dimensions
As with configuring rules, configuring data quality dimensions is an iterative process. To begin configuring dimensions, we organize the rules into three dimensions:
• Completeness: IsNotBlank and IsNotNull rules – Both of these rules find fields with missing values
• Uniqueness: IsNotDuplicate rules – This rule finds non-unique values
• Validity: MatchesRegex rules – This rule finds values that don’t conform to the specified format
When dimensions are configured, post profile execution results display the same information as when dimensions are absent. However, additional information is immediately visible. We learn what data quality issues exist. In this case, there are issues with completeness, uniqueness and validity (see figure, below).
We also learn how prevalent each issue is. The Dimension Score indicates the success rate for all the rules in each dimension as a whole:
• Uniqueness is the most prevalent issue (duplicate records) – 87.4% pass rate
• Validity is next largest issue (wrong formats) – 94.26% pass rate
• Completeness also needs to be addressed (missing information) – 98.8% pass rate
We immediately see the results as a whole (for all the fields the rule is applied to). To see which records are missing, duplicated, and inconsistently formatted we look at the Completeness dimension, Uniqueness dimension and Validity dimension (respectively). We then click on a dimension to see pass/fail results for each rule, and click a rule bar graph (in the right panel) to get a list of records.
The results also indicate what actions we need to take to improve the data. In this case, we need to create rules to:
• deduplicate data in the Email, Company and Account Number fields
• fix formats in Zip and Email fields
• populate missing data for various fields
How Data Quality Dimension Scores are Calculated
This example describes in detail how data quality dimension scores are calculated. A dimension score reflects the overall quality of data/rules associated for a dimension.
Key points about calculating dimension scores:
• The dimension score is derived by aggregating individual pass results of rules in the dimension. Scores are 1-100, where 100 is highest quality.
• Rule weights factor into the individual pass results of rules. A rule weight is the user-assigned importance of the rule. Values are 1-5, where 5 is the most important. The default is 1.
• Individual pass results of rules are referred to as Weighted Pass %. The following equation is used to calculate the Weighted Pass % (score):
(Rule PASS Percentage * Weight)/Total Weights = Weighted Pass %
To illustrate the impact that rule weights have on a dimension score, we present three scenarios for the same dimension pass results, where:
Scenario 1: Default Weights
In this example, the dimension has three rules associated with it: Account_IsNotBlank, Email_IsNotBlank and Balance_IsNotBlank. The following are the profile execution Pass % results and scores, where all the rules have the default weight:
Scenario 2: One Varied Weight
The following table shows the same profile execution Pass % results. But in this example, the Email_IsNotBlank rule has a weight of 5 assigned. Notice the impact on the Account_IsNotBlank rule Weighted Pass % value (which dropped 16 percent), and the Balance_IsNotBlank rule Weighted Pass % value (which dropped over 10 percent).
Scenario 3: All Varied Weights
The following table shows the same profile execution Pass % results. But in this example, all the rules have varying weights.
Managing Data Quality Dimensions
This topic describes how to manage dimensions in the Configuration tab, Dimensions section. The following instructions assume that you have associated at least one test rule with a dimension.
For information about associating test rules with a dimension, see
Rules Tab. Or, see the test rule
Dimension property for the following rules that support dimensions:
Assert,
CompareToConstant,
FuzzyMatch,
IsNotBlank,
IsNotDuplicate,
IsNotNull,
InRange,
MatchesRegex,
RemoveDuplicates and
RemoveDuplicatesFuzzyMatching.
For information about viewing profile execution results, see
Viewing Statistics.
Note: Changes made to rule dimensions and weights the Configuration tab, Dimensions section are reflected in the Rules tab.
In the data profile file, go to the Configuration tab and expand the Dimensions section. The Dimension and Rules are displayed in a tree view. In this tree view interface, use the right arrow to collapse a branch (close it) and the down arrow to expand a branch (open it).
You can drag and drop rules into and out of dimensions. You can also right-click on a rule to move it, or Delete it (which deletes its association with the dimension and moves the rule into -No Dimension-. Changes are reflected for the rule in the Rules tab.
Additionally, you can use the toolbar as follows:
Manage dimensions and rules also using the following options:
Save and execute the profile. View results in the
Statistics tab. See
Viewing Statistics.
Tip... To undo the dimension configuration, remove all rules out of dimensions and into the -No Dimension- folder.