Was this helpful?
Profiling Rules
Profiling rules can help you identify problems in source data. These rules are of the following types:
Some rules can be used to generate aggregate statistics, which help identify inaccuracies by examining aggregated values over large datasets.
Other rules are test rules that generate pass and fail statistics. Some of these rules also provide a method to invert the results into pass and fail files.
See Add New Rule for information about adding rules.
The following is a list of profiling rules:
Rule Name
Description
Provides users the ability to evaluate multiple fields and conditions within a single rule and it ensures that a specific relationship or calculation between fields is met.
Returns true or false after comparing a field to a constant value.
Returns true or false after comparing a field value to another field value.
Writes all the distinct (unique) values found in a source dataset into a file.
Writes all the duplicate values found in a source dataset into a file.
Calculates equal width ranges across the field values and counts the number of values in each range.
Returns true if two strings match based on the configured fuzzy matching rules and match score filter.
Returns true if a field is not blank, returns false otherwise.
Returns true if a field is not duplicated, returns false otherwise.
Returns true if a field is not null, returns false otherwise.
Checks if the value is within specified range.
Returns true if a field matches a regular expression, returns false otherwise.
Calculates the most frequent values of a field.
Calculates the following statistics for a numeric field.
Min, Max, Mean, Median, Mode, Standard Deviation, Sum, Variance.
 
Assert
Assert is a profiling rule which provides users the ability to evaluate multiple fields and conditions within a single rule and it ensures that a specific relationship or calculation between fields is met. Essentially, Assert compares the left side (actual value) with the right side (expected value) to see if they match.
Rule Properties
Rule Name
A default rule name (Assert or Assert_n, where “n” is 1,2,3, and so on) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name.
Note:  The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field.
Field Name
This is shown as empty as this rule can be used to evaluate multiple fields.
Rule Type
The type of rule that is applied to the field. That is Assert (Test rule).
Tip...  A Red Cross in a rule icon (for example ) indicates that the rule has a parameter that has not been defined.
Rule Parameters
Invert Test
Select this to report the opposite results in your pass or fail file. For more information, see Invert Rule Test.
Script (Left Expression)
The left expression is always a script and can be one or more valid expressions supported by the ExecuteExpression rule.
To add multiple regular expressions use the matchesPattern string function:
matchesPatterns('FieldName', "patternA", "patternB", "patternC")
where FieldName is the field name and patternABC are regular expressions.
For example: matchesPatterns('Account Number', "01-.*", "02-.*", "03-.*")
Click Build to open the Expression Builder dialog which helps you to construct expressions. This dialog provides a list of available fields, operators, and functions you can use to build expressions. See Using the Expression Builder.
Operator
The operator that compares the left side (actual value) with the right side (expected value). Select from one of the following comparison operators:
Equal - Checks if the left side value is equal to the right side value.
Greater - Checks if the left side value is greater than the right side value.
Greater or Equal - Checks if the left side value is greater than or equal to the right side value.
Lesser - Checks if the left side value is lesser than the right side value.
Lesser or Equal - Checks if the left side value is lesser than or equal to the right side value.
Not Equal - Checks if the left side value is not equal to the right side value.
StartsWith - Checks if the left side value starts with the right side value.
EndsWith - Checks whether the left side value ends with the right side value.
Matches - Checks whether the left side value matches the right side value (uses regex match).
Expression Type (Right Expression)
This drop-down displays options for the type of right hand side expression. The available choices depend on the selection made in the Operator field. The following options are available:
Expression - Compare the left expression against a right expression. The right expression can be specified in the Script box. This option is available for the Equal, Greater, Greater or Equal, Lesser, Lesser or Equal, Not Equal, StartsWith, and EndsWith operators.
Constant - Compare the left expression against a list of constants. You can type one or more constant values to compare and then press the Enter key or click Add. This option is available for the Equal, Greater, Greater or Equal, Lesser, Lesser or Equal, Not Equal, StartsWith, and EndsWith operators.
Field - Compare the left expression against a specific field. You can specify only one field using the Compare Field drop-down. This option is available for the Equal, Greater, Greater or Equal, Lesser, Lesser or Equal, Not Equal, StartsWith, and EndsWith operators.
Lookup - Compare the left side value with a looked up value. You can specify the Lookup value using the Lookup widget (see LookupValue rule). This option is available for the Equal, Greater, Greater or Equal, Lesser, Lesser or Equal, Not Equal, StartsWith, and EndsWith operators.
NotBlank - Check if the left side value is not blank. This option is available only when the Operator is specified as Equal.
NotNull - Check if the left expression value is not null. This option is available only when the Operator is specified as Equal.
Regex - Compare the left side value against the specified regular expression (Regex) pattern. This option is available only when the Operator is specified as Matches.
Script (Right Expression)
The right expression in some cases is a script and can be any valid expression supported by the ExecuteExpression rule.
You can click Build to open The Expression Builder dialog which helps you to construct expressions. This dialog provides a list of available fields, operators, and functions you can use to build expressions. See Using the Expression Builder.
Regular Expression (Right Expression)
The right expression when the Operator is specified as Matches. In this case the Expression Type is set to Regular Expression and cannot be changed.
You can specify a regular expression (Regex) pattern here. See Java Regular Expressions.
Compare Field (Right Expression)
Used to specify a field for comparison, when the Expression Type is Field. This drop-down is displayed only when the Expression Type is Field.
Constant box (Right Expression)
Used to specify a list of constants for comparison, when the Expression Type is Constant. This text box is displayed only when the Expression Type is Constant.
You can type one or more constant values to compare and then press the Enter key or click Add.
Lookup widget (Right Expression)
Used to specify the Lookup value for comparison, when the Expression Type is Lookup. The Lookup Widget is displayed only when the Expression Type is Lookup.
To learn how to use the lookup widget, see LookupValue rule.
Dimension
(Optional) Select a dimension to associate the rule with. There is no limit to the number of rules a dimension can be associated with. A rule can be associated with a single dimension.
A dimension represents a characteristic of data quality:
Accuracy - The data is correct.
Completeness - The data is present.
Consistency - The data uses the same format or pattern across different sources.
Timeliness - The data is recent and available.
Uniqueness - The data is not duplicated.
Validity - The data conforms to business rules and is within an acceptable range.
Each dimension generates a Dimension Score for its associated rules. The score indicates the degree to which the data meets the characteristic. Scores can be viewed in the Statistics tab post profile execution. See Viewing Statistics.
For more information about dimensions, see Rules Tab.
For information about managing dimensions, see Managing Data Quality Dimensions.
Weight
(Optional) Select the importance level of the rule. Values are 1-5, where 5 is the most important. The default value is 1. This value is reflected in the Data Quality Index (DQI) score and the Dimension Score (if the rule is associated with a dimension).
For more information, see Rules Tab.
Supported Data Types
String, Date, Time, Timestamp, Boolean, Numeric, Double, Float, Long, Integer
Remarks
For Date, Time, or Timestamp values, use the ISO 8601 format for constants. For boolean values, use true or false.
 
CompareToConstant
This test rule returns true or false after comparing a field to a constant value.
Rule Properties
Rule Name
A default rule name (<FieldName>_CompareToConstant) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name.
Note:  The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field.
Field Name
The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, myField (String).
Rule Type
The type of rule that is applied to the field. That is CompareToConstant (Test rule).
Tip...  A Red Cross in a rule icon (for example ) indicates that the rule has a parameter that has not been defined.
Rule Parameters
Invert Test
Select this to report the opposite results in your pass or fail file. For more information, see Invert Rule Test.
Operator
Select from one of the following comparison operators:
Equal
Greater
Greater or Equal
Lesser
Lesser or Equal
Not Equal
Constant
In this text box, type the constant value to compare, and then press the Enter key or click Add. For equal or not equal you can specify one or more constants.
Dimension
(Optional) Select a dimension to associate the rule with. There is no limit to the number of rules a dimension can be associated with. A rule can be associated with a single dimension.
A dimension represents a characteristic of data quality:
Accuracy - The data is correct.
Completeness - The data is present.
Consistency - The data uses the same format or pattern across different sources.
Timeliness - The data is recent and available.
Uniqueness - The data is not duplicated.
Validity - The data conforms to business rules and is within an acceptable range.
Each dimension generates a Dimension Score for its associated rules. The score indicates the degree to which the data meets the characteristic. Scores can be viewed in the Statistics tab post profile execution. See Viewing Statistics.
For more information about dimensions, see Rules Tab.
For information about managing dimensions, see Managing Data Quality Dimensions.
Weight
(Optional) Select the importance level of the rule. Values are 1-5, where 5 is the most important. The default value is 1. This value is reflected in the Data Quality Index (DQI) score and the Dimension Score (if the rule is associated with a dimension).
For more information, see Rules Tab.
Supported Data Types
String, Date, Timestamp, Boolean, Numeric, Double, Float, Long, Integer
Remarks
The constant must be of the same data type as the profiled field. The equal operator is not recommended to use on double or float field comparisons. Do not use quotation marks with string data.
The following formats are supported for date and timestamp:
yyyy-mm-dd
yyyy-mm-dd hh:mm:ss (military time)
Note:  If source has GMT timestamp data, the constant value should end with 'Z'.
CompareToField
This test rule returns true or false after comparing a field value to another field value.
Rule Properties
Rule Name
A default rule name (<FieldName>_CompareToField) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name.
Note:  The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field.
Field Name
The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, myField (String).
Rule Type
The type of rule that is applied to the field. That is CompareToField (Test rule).
Rule Parameters
Invert Test
Select this to report the opposite results in your pass or fail file. For more information, see Invert Rule Test.
Operator
Select from one of the following comparison operators:
Equal
Greater
Greater or Equal
Lesser
Lesser or Equal
Not Equal
Compare Field
Select the required field to compare from.
Dimension
(Optional) Select a dimension to associate the rule with. There is no limit to the number of rules a dimension can be associated with. A rule can be associated with a single dimension.
A dimension represents a characteristic of data quality:
Accuracy - The data is correct.
Completeness - The data is present.
Consistency - The data uses the same format or pattern across different sources.
Timeliness - The data is recent and available.
Uniqueness - The data is not duplicated.
Validity - The data conforms to business rules and is within an acceptable range.
Each dimension generates a Dimension Score for its associated rules. The score indicates the degree to which the data meets the characteristic. Scores can be viewed in the Statistics tab post profile execution. See Viewing Statistics.
For more information about dimensions, see Rules Tab.
For information about managing dimensions, see Managing Data Quality Dimensions.
Weight
(Optional) Select the importance level of the rule. Values are 1-5, where 5 is the most important. The default value is 1. This value is reflected in the Data Quality Index (DQI) score and the Dimension Score (if the rule is associated with a dimension).
For more information, see Rules Tab.
Supported Data Types
String, Date, Timestamp, Boolean, Numeric, Double, Float, Long, Integer
Remarks
The two fields must be of the same data type. The equal operator is not recommended for use on double or float field comparisons.
 
DistinctValues
This summary rule writes all the distinct (unique) values found in a source dataset into a file.
Rule Properties
Rule Name
A default rule name (<FieldName>_DistinctValues) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name.
Note:  The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field.
Field Name
The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, myField (String).
Rule Type
The type of rule that is applied to the field. That is DistinctValues (Summary rule).
Rule Parameters
Results File
Displays the distinct values output file. Default is within your project. However, you can change it.
Sort by
Select the required Sort Order from the following:
Field Data (Faster)
Frequency Count (Slower)
Note:  Frequency Count (Slower) sorts the results in descending order and hence the sorting is slower compared to Field Data sort option.
Supported Data Types
String, Date, Timestamp, Boolean, Numeric, Double, Float, Long, Integer
Remarks
The results file name must not contain the following characters:
/
\
.
,
..
 
DuplicateValues
This summary rule writes all the duplicate values found in a source dataset into a file.
Rule Properties
Rule Name
A default rule name (<FieldName>_DuplicateValues) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name.
Note:  The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field.
Field Name
The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, myField (String).
Rule Type
The type of rule that is applied to the field. That is DuplicateValues (Summary rule).
Tip...  A Red Cross in a rule icon (for example ) indicates that the rule has a parameter that has not been defined.
Rule Parameters
Results File
Displays the duplicate values output file. Default is within your project. However, you can change it.
Sort by
Select the Sort Order from the following:
Field Data (Faster)
Frequency Count (Slower).
Note:  Frequency Count (Slower) sorts the results in descending order and hence the sorting is slower compared to Field Data sort option.
Min Count
Type the required value (default is 2). The value must not be lower than 2.
Supported Data Types
String, Date, Timestamp, Boolean, Numeric, Double, Float, Long, Integer
Remarks
The results file name must not contain the following characters:
/
\
.
,
..
 
EqualRangeBinning
This summary rule calculates equal width ranges across the field values and counts the number of values in each range. The following is an example of the Equal Range Binning histogram output:
Rule Properties
Rule Name
A default rule name (<FieldName>_EqualRangeBinning) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name.
Note:  The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field.
Field Name
The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, Count (Integer).
Rule Type
The type of rule that is applied to the field. That is EqualRangeBinning (Summary rule).
Rule Parameters
RangeCount
Specify the number of ranges to create. The default is 10.
Supported Data Types
All numeric types supported as input (Numeric, Double, Float, Long, Integer)
Remarks
This rule constructs a histogram and defines an equal range for each bin. The rule algorithm works as follows:
Finds the lowest and highest values in the field to profile to determine the end points
Creates Range Count equal-width ranges between highest and lowest values
Counts the non-null values in each range
The generated histogram has a Range Count and one bar for nulls.
Exceptions: If min and max values are identical, then bar is generated. If data type is integer or long and the minimum and maximum are close together, then it generates the number of bars that fits in the graph.
Example: If minimum is 1 and maximum is 4, then rangeCount > = 4 results in one bar for each possible non-null whole value.
The behavior for interpolated breaks between ranges depends upon the input data type:
int and long: interpolated breaks are whole numbers.
float and double: interpolated breaks are double precision floating point values and can be values such as 1.6799999999999999995 due to floating-point roundoff error.
numeric: interpolated breaks are rounded for presentation. The scale used when rounding is the scale of the minimum or maximum value (whichever is larger), plus a number of digits equal to the number of digits in the rangeCount..
 
FuzzyMatch
This test rule returns true if two strings match based on the configured fuzzy matching rules and match score filter.
Rule Properties
Rule Name
A default rule name (<FieldName>_FuzzyMatch) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name.
Note:  The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field.
Field Name
The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, myField (String).
Rule Type
The type of rule that is applied to the field. That is FuzzyMatch (Test rule).
Tip...  A Red Cross in a rule icon (for example ) indicates that the rule has a parameter that has not been defined.
Rule Parameters
Invert Test
Select this to report the opposite results in your pass or fail file. For more information, see Invert Rule Test.
Fuzzy Match Algorithm
CONTAINS
DAMERAU_LEVENSHTEIN
EXACT_MATCH
JARO
JARO_WINKLER
LEVENSHTEIN
QGRAM
POSITIONAL_QGRAM
SHORTHAND
Constant
In this text box, type the constant value with which to compare, and then press the Enter key or click Add. You can compare with multiple values but at least one constant should be specified.
Fuzzy Score Filter
Add a decimal values between 0.01 to 1. Default value is 0.7. Comparison score is between 0 to 1. Records with comparison score less than this value are not considered a match for probable duplicates and are discarded. The higher the value you select, more strict is the matching.
Dimension
(Optional) Select a dimension to associate the rule with. There is no limit to the number of rules a dimension can be associated with. A rule can be associated with a single dimension.
A dimension represents a characteristic of data quality:
Accuracy - The data is correct.
Completeness - The data is present.
Consistency - The data uses the same format or pattern across different sources.
Timeliness - The data is recent and available.
Uniqueness - The data is not duplicated.
Validity - The data conforms to business rules and is within an acceptable range.
Each dimension generates a Dimension Score for its associated rules. The score indicates the degree to which the data meets the characteristic. Scores can be viewed in the Statistics tab post profile execution. See Viewing Statistics.
For more information about dimensions, see Rules Tab.
For information about managing dimensions, see Managing Data Quality Dimensions.
Weight
(Optional) Select the importance level of the rule. Values are 1-5, where 5 is the most important. The default value is 1. This value is reflected in the Data Quality Index (DQI) score and the Dimension Score (if the rule is associated with a dimension).
For more information, see Rules Tab.
Supported Data Types
String
Remarks
Supported Fuzzy Matching algorithms:
CONTAINS: Checks the small string in the long string.
DAMERAU–LEVENSHTEIN: This distance between two strings is the minimum number of operations (insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one string into another. For more information, see Damerau–Levenshtein distance.
EXACT MATCH: Checks two strings for exact match.
JARO: A measure of characters in common, being no more than half the length of the longer string in distance with consideration for transpositions.
JARO-WINKLER: This algorithm extends Jaro algorithm and uses a prefix scale “p” which gives more favorable ratings to strings that match from the beginning for a set prefix length “l”. The prefix length is set to 4 characters (the maximum possible prefix length). For more information, see Jaro–Winkler distance.
LEVENSHTEIN: This distance between two strings is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into other. For more information, see Levenshtein distance.
POSITIONAL QGRAM: Positional q-grams is an extension to q-grams with an additional property called maxDistance used in calculating the string similarity. For example, ‘peter’ contains the positional bigrams (‘pe’,0), (‘et’,1), (‘te’,2) and (‘er’,3). If maxDistance is set to 1, then (‘et’,1) will only be matched to bi-grams in the other string with positions 0 to 2
QGRAM: q-grams (also called as n-grams) is the process of breaking up a string into multiple strings each with length n (q). For example, apply q-grams on abcde with q = 3 would yield abc, bcd, cde. For example, the word ‘peter’ with q=2 would yield following bigrams - ‘pe’, ‘et’, ‘te’ and ‘er’. A q-gram similarity measure between two strings is calculated by counting the number of q-grams in common.
SHORTHAND: Determine if the shorter string is a “shorthand” of the longer. That is, if it can be produced only by deletions. If the first character of a word in the longer string is deleted, the entire word is considered deleted. For e.g. IBM is a shorthand of International Business Machines.
 
IsNotBlank
This test rule returns true if a field is not blank, returns false otherwise.
Rule Properties
Rule Name
A default rule name (<FieldName>_IsNotBlank) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name.
Note:  The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field.
Field Name
The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, myField (String).
Rule Type
The type of rule that is applied to the field. That is IsNotBlank (Test rule).
Rule Parameter
Invert Test
Select this to report the opposite results in your pass or fail file. For more information, see Invert Rule Test.
Dimension
(Optional) Select a dimension to associate the rule with. There is no limit to the number of rules a dimension can be associated with. A rule can be associated with a single dimension.
A dimension represents a characteristic of data quality:
Accuracy - The data is correct.
Completeness - The data is present.
Consistency - The data uses the same format or pattern across different sources.
Timeliness - The data is recent and available.
Uniqueness - The data is not duplicated.
Validity - The data conforms to business rules and is within an acceptable range.
Each dimension generates a Dimension Score for its associated rules. The score indicates the degree to which the data meets the characteristic. Scores can be viewed in the Statistics tab post profile execution. See Viewing Statistics.
For more information about dimensions, see Rules Tab.
For information about managing dimensions, see Managing Data Quality Dimensions.
Weight
(Optional) Select the importance level of the rule. Values are 1-5, where 5 is the most important. The default value is 1. This value is reflected in the Data Quality Index (DQI) score and the Dimension Score (if the rule is associated with a dimension).
For more information, see Rules Tab.
Supported Data Types
String
Remarks
Counts null, white space, and empty strings as blank.
 
IsNotDuplicate
This test rule returns true if a field is not duplicated, returns false otherwise.
Rule Properties
Rule Name
A default rule name (<FieldName>_IsNotDuplicate) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name.
Note:  The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field.
Field Name
The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, myField (String).
Rule Type
The type of rule that is applied to the field. That is IsNotDuplicate (Test rule).
Rule Parameter
Dimension
(Optional) Select a dimension to associate the rule with. There is no limit to the number of rules a dimension can be associated with. A rule can be associated with a single dimension.
A dimension represents a characteristic of data quality:
Accuracy - The data is correct.
Completeness - The data is present.
Consistency - The data uses the same format or pattern across different sources.
Timeliness - The data is recent and available.
Uniqueness - The data is not duplicated.
Validity - The data conforms to business rules and is within an acceptable range.
Each dimension generates a Dimension Score for its associated rules. The score indicates the degree to which the data meets the characteristic. Scores can be viewed in the Statistics tab post profile execution. See Viewing Statistics.
For more information about dimensions, see Rules Tab.
For information about managing dimensions, see Managing Data Quality Dimensions.
Weight
(Optional) Select the importance level of the rule. Values are 1-5, where 5 is the most important. The default value is 1. This value is reflected in the Data Quality Index (DQI) score and the Dimension Score (if the rule is associated with a dimension).
For more information, see Rules Tab.
Supported Data Types
String, Date, Timestamp, Boolean, Numeric, Double, Float, Long, Integer
Remarks
None.
IsNotNull
This test rule returns true if a field is not null, returns false otherwise.
Rule Properties
Rule Name
A default rule name (<FieldName>_IsNotNull) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name.
Note:  The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field.
Field Name
The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, myField (String).
Rule Type
The type of rule that is applied to the field. That is IsNotNull (Test rule).
Rule Parameter
Invert Test
Select this to report the opposite results in your pass or fail file. For more information, see Invert Rule Test.
Dimension
(Optional) Select a dimension to associate the rule with. There is no limit to the number of rules a dimension can be associated with. A rule can be associated with a single dimension.
A dimension represents a characteristic of data quality:
Accuracy - The data is correct.
Completeness - The data is present.
Consistency - The data uses the same format or pattern across different sources.
Timeliness - The data is recent and available.
Uniqueness - The data is not duplicated.
Validity - The data conforms to business rules and is within an acceptable range.
Each dimension generates a Dimension Score for its associated rules. The score indicates the degree to which the data meets the characteristic. Scores can be viewed in the Statistics tab post profile execution. See Viewing Statistics.
For more information about dimensions, see Rules Tab.
For information about managing dimensions, see Managing Data Quality Dimensions.
Weight
(Optional) Select the importance level of the rule. Values are 1-5, where 5 is the most important. The default value is 1. This value is reflected in the Data Quality Index (DQI) score and the Dimension Score (if the rule is associated with a dimension).
For more information, see Rules Tab.
Supported Data Types
String, Date, Timestamp, Boolean, Numeric, Double, Float, Long, Integer
Remarks
None
 
InRange
This test rule checks if the value is within a specified range.
Rule Properties
Rule Name
A default rule name (<FieldName>_InRange) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name.
Note:  The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field.
Field Name
The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, Count (Integer).
Rule Type
The type of rule that is applied to the field. That is InRange (Test rule).
Tip...  A Red Cross in a rule icon (for example ) indicates that the rule has a parameter that has not been defined.
Rule Parameters
Lower Bound
The lower range value. Select Inclusive to include the lower bound number in the range.
Upper Bound
The upper range value. Select Inclusive to include the upper bound number in the range.
Dimension
(Optional) Select a dimension to associate the rule with. There is no limit to the number of rules a dimension can be associated with. A rule can be associated with a single dimension.
A dimension represents a characteristic of data quality:
Accuracy - The data is correct.
Completeness - The data is present.
Consistency - The data uses the same format or pattern across different sources.
Timeliness - The data is recent and available.
Uniqueness - The data is not duplicated.
Validity - The data conforms to business rules and is within an acceptable range.
Each dimension generates a Dimension Score for its associated rules. The score indicates the degree to which the data meets the characteristic. Scores can be viewed in the Statistics tab post profile execution. See Viewing Statistics.
For more information about dimensions, see Rules Tab.
For information about managing dimensions, see Managing Data Quality Dimensions.
Weight
(Optional) Select the importance level of the rule. Values are 1-5, where 5 is the most important. The default value is 1. This value is reflected in the Data Quality Index (DQI) score and the Dimension Score (if the rule is associated with a dimension).
For more information, see Rules Tab.
Supported Data Types
Date, Timestamp, Numeric, Double, Float, Long, Integer
Remarks
The following formats are supported for Date and Timestamp:
yyyy-mm-dd
yyyy-mm-dd hh:mm:ss (military time).
Note:  If source has GMT timestamp data, the lower bound and upper bound values should end with 'Z'.
 
MatchesRegex
This test rule returns true if a field matches a specified regular expression or pattern, returns false otherwise.
Rule Properties
Rule Name
A default rule name (<FieldName>_MatchesRegex, <FieldName>_MatchesRegex_n, where “n” is 1,2,3, and so on) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name.
Note:  The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field.
Field Name
The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, myField (String).
Rule Type
The type of rule that is applied to the field. That is MatchesRegex (Test rule).
Rule Parameters
Invert Test
Select this to report the opposite results in your pass or fail file. For more information, see Invert Rule Test.
Regular Expression
Specify one or more regular expressions by typing or selecting them from the drop-down list and then clicking Add. Click to delete an expression. For information about the regular expressions available in the drop-down list, see Regular expressions available in the drop-down list.
You can also:
Copy/paste a regular expression or pattern detected by data discovery. See Field Data Discovery (also see steps, below).
Paste a macro by right-clicking in the text box and selecting the macro from the Paste Macro dialog. See Pasting Macro in Map, Profile or Process.
Create a macro by right-clicking in the text box. See Creating a Macro by Selecting a Value.
To copy/paste values detected by data discovery:
1. In the Field/Rule pane, select a field to open the Field Data Discovery pane.
2. In the Field Data Discovery pane:
Click to execute data discovery.
Select the String Patterns tab.
Right-click on the cell that contains the desired regular expression or pattern.
Do one of the following:
Select Copy Regex Pattern to copy the regular expression.
Select Copy Input Value to copy the input value.
3. In the Field/Rule pane, select the rule to open the Rule Definition pane.
4. Paste into the Regular Expression box and click Add.
5. To add another regular expression or pattern provided by data discovery, select the field again.
Dimension
(Optional) Select a dimension to associate the rule with. There is no limit to the number of rules a dimension can be associated with. A rule can be associated with a single dimension.
A dimension represents a characteristic of data quality:
Accuracy - The data is correct.
Completeness - The data is present.
Consistency - The data uses the same format or pattern across different sources.
Timeliness - The data is recent and available.
Uniqueness - The data is not duplicated.
Validity - The data conforms to business rules and is within an acceptable range.
Each dimension generates a Dimension Score for its associated rules. The score indicates the degree to which the data meets the characteristic. Scores can be viewed in the Statistics tab post profile execution. See Viewing Statistics.
For more information about dimensions, see Rules Tab.
For information about managing dimensions, see Managing Data Quality Dimensions.
Weight
(Optional) Select the importance level of the rule. Values are 1-5, where 5 is the most important. The default value is 1. This value is reflected in the Data Quality Index (DQI) score and the Dimension Score (if the rule is associated with a dimension).
For more information, see Rules Tab.
Regular expressions available in the drop-down list
AlphaNumeric Strings -->^[a-zA-Z0-9]+$
AnyString-->.*
ASCII characters only-->^([a-zA-Z0-9!\"#$%&',/:;<=>@_`|~ \\(\\)\\*\\+\\-\\.\\?\\[\\]\\\\\\^\\{\\}])*$")
Canada Postal Code-->^([A-Za-z]\d[A-Za-z][\s-]?\d[A-Zaz]\d)
Credit Card Number-->[1-9][0-9]{3} [0-9]{4} [0-9]{4} [0-9]{4}
Date - mm/dd/yyyy format-->^(3[0-1]|2[0-9]|1[0-9]|0[1-9])[\s{1}|\/|-](Jan|JAN|Feb|FEB|Mar|MAR|Apr|APR|May|MAY|Jun|JUN|Jul|JUL|Aug|AUG|Sep|SEP|Oct|OCT|Nov|NOV|Dec|DEC)[\s{1}|\/|-]\d{4}$
Date - YYMMDD format-->^((\d{2}((0[13578]|1[02])(0[1-9]|[12]\d|3[01])|(0[13456789]|1[012])(0[1-9] |[12]\d|30)|02(0[1-9]|1\d|2[0-8])))|([02468][048]|[13579][26])0229)$
Days of Week-->^(Sun|Mon|(T(ues|hurs))|Fri)(day|\.)?$|Wed(\.|nesday)?$|Sat(\.|urday)?$|T ((ue?)|(hu?r?))\.?$
Digits Only-->\d+
Email Address-->^(([A-Za-z0-9]+_+)|([A-Za-z0-9]+\\-+)|([A-Za-z0-9]+\\.+)|([A-Za-z0-9]+\\++))*[A-Za-z0-9]+@((\\w+\\-+)|(\\w+\\.))*\\w{1,63}\\.[a-zAZ]{2,6}$
ISBN_10 Number-->ISBN\x20(?=.{13}$)\d{1,5}([-])\d{1,7}\1\d{1,6}\1(\d|X)$
Mr or Mrs-->Mr|Mrs
Printable Characters-->^([a-zA-Z0-9!@#$%^&amp;amp;*()-_=+;:'&amp;quot;|~`&amp;lt;&amp;gt;?/{}] {1,5})$
Single Alphabetic character-->^[a-zA-Z]$
Social Security Number-->^[0-9]{3}-?[0-9]{2}-?[0-9]{4}$
Time - hh:mm AMPM format-->^ *(1[0-2]|[1-9]):[0-5][0-9]*(a|p|A|P)(m|M) *$
Time - hh:mm:ss format-->(([0-1][0-9])|([2][0-3])):([0-5][0-9]):([0-5][0-9])
UK Postal Code-->^([A-Z]{1,2}[0-9]{1,2}|[A-Z]{3}|[A-Z]{1,2}[0-9][A-Z])( |-)[0-9][A-Z]{2}
US Currency-->^\$?( )*\d*(.\d{1,2})?$
US Phone Number-->^[01]?[- .]?\(?[2-9]\d{2}\)?[- .]?\d{3}[- .]?\d{4}$
US Zipcode or Zip+4-->[0-9]{5}(-[0-9]{4})?
Vehicle Identification Number (VIN)-->^(([a-h,A-H,j-n,J-N,p-z,P-Z,0-9]{9})([a-h,A-H,j-n,J-N,p,P,r-t,R-T,v-z,V-Z,0-9])([a -h,A-H,j-n,J-N,p-z,P-Z,0-9])(\d{6}))$
AnyString is the default regular expression. For a description of Regular Expression options, see Remarks (below).
Supported Data Types
String
Remarks
Regex pattern rule is displayed in the Parameters column on the Rules tab and the Description column is displayed on the Result summary. Actian DataConnect stores and uses regular expressions and you can edit the expressions if required.
Description of Regular Expressions:
AlphaNumeric - Matches strings that contain character and/or numeric data.
AnyString - Matches all strings.
ASCII characters only Postal- Matches ASCII characters only postal.
Canada Postal Code - Matches postal code for Canada.
Credit Card Number - Matches a credit card number.
Date - mm/dd/yyyy format - Flexible date validator that matches most date formats.
Date_YYMMDD - Date expression validator with format YYMMDD.
Days of Week - Matches days of the week or their abbreviations.
DigitsOnly - Expression for matching one or more digits.
EmailAddress - Matches an email address.
ISBN_10 Number - Matches format of an ISBN number.
Mr or Mrs - Matches Mr or Mrs characters.
Printable Characters - Matches all printable characters.
Single Alphabetic character - Matches a single alphabetic Character.
Social Security Number - Matches U.S. social security number. Dashes (-) are optional.
Time - hh:mm AMPM format - Matches time format hh/mm AM.
Time - hh:mm:ss format - Matches time format hh:mm:ss.
US Zipcode or Zip+4 - Matches a US zip code.
Vehicle Identification Number (VIN) - Matches US Vehicle Identification Numbers (VINs).
 
MostFrequentValues
This summary rule calculates the most frequent values of a field.
Rule Properties
Rule Name
A default rule name (<FieldName>_MostFrequentValues) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name.
Note:  The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field.
Field Name
The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, myField (String).
Rule Type
The type of rule that is applied to the field. That is MostFrequentValues (Summary rule).
Rule Parameter
Top How Many
Specify the number of desired frequent values. The default is 25, which returns the top 25 most frequent values.
Supported Data Types
String, Date, Timestamp, Boolean, Numeric, Double, Float, Long, Integer
Remarks
This rule replaces some of the functionality of the DistinctValues rule.
 
Statistics
This summary rule calculates the statistics for a numeric field: Min, Max, Mean, Median, Mode, Standard Deviation, Sum, Variance.
Rule Properties
Rule Name
A default rule name (<FieldName>_Statistics) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name.
Note:  The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field.
Field Name
The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, Count (Integer).
Rule Type
The type of rule that is applied to the field. That is Statistics (Summary rule).
 
Supported Data Types
Date, Timestamp, Boolean, Numeric, Double, Float, Long, Integer.
Remarks
None
 
Last modified date: 09/22/2025