Rule Name | Description |
|---|---|
Changes the format of a string field from one format to another (applies to string fields which can be parsed as numbers/boolean/date types). | |
Changes the case of a string field to specified case (for example upper, lower, title). | |
Masks or encrypts data so that it cannot be linked to an individual, but can still be shared without risk of exposing sensitive personal information. | |
Removes specified characters or words from a string field. | |
Writes unique records to pass target and duplicate records to fail target using exact match. | |
Writes unique records to pass target and duplicate records to fail target based on fuzzy matching rules and match score filter. | |
Replaces a field value with a new value. | |
Changes the format of zip code and phone number fields to make the values uniform. | |
Trims whitespace from a String field. | |
Applies the specified time zone to a timestamp field. |
Rule Name | A default rule name (<FieldName>ChangeFormat) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name. Note: The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field. |
Field Name | The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, Joining_Date (Date). |
Rule Type | The type of rule that is applied to the field. That is ChangeFormat (Function rule). Tip... A Red Cross in a rule icon (for example |
Format Type | Select the format type of the source data (this selection populates the options available in the From Format and To Format parameters): • Boolean • Date • Double • Float • Integer • Long • Numeric • String • Time • TimeStamp |
From Format | Select source field format(s) to change from, or type into the field, then click Add (or the Enter key). Multiple formats can be added. To remove an added format click the Depending on the Format Type you have selected, you can perform the following operations: • Boolean: Replaces the different combinations of source values like yes.no, true.false, True.False, 1.0, etc., into a consistent set of values like True.False. • Date: Reformats the source field date values like MM-dd-yy, dd-MM-yyyy, etc., into a consistent format like MM.dd.yyyy. See DateTime Format. • Double: Converts source values with a format like #.##, .00, .#, etc., into a consistent format like ###. For example, set From Format to #.# and To Format to #.00 to change the format from 50 to 50.00. See Class DecimalFormat. • Float: Converts source values with a format like #.##, .00, .#, etc., into a consistently formatted float value like ###. See Class DecimalFormat. • Integer: Converts source values with a format like #, -#, etc., into a consistently formatted integer value like #. See Class DecimalFormat. • Long: Converts source values with a format like #, -#, etc., into a consistently formatted long value like #. See Class DecimalFormat. • Numeric: Converts source values with a format like #.##, .00, .#, etc., into a consistently formatted numeric value like ###. See Class DecimalFormat. • String: Accepts a list of regular expressions like .+?(\d{4})$, (.*)@(.*) etc., and a replacement expression like $0. Use this option, for example, to retrieve only the last four digits of a value (such as a credit card, phone number or social security number) by setting From Format to .+?(\d{4})$ and To format to $1. Similarly, to change the order of first and last names (e.g. John Doe) set From Format to ([a-zA-Z]+)\s+([a-zA-Z]+) and To Format to $2 $1 (so that the output becomes Doe John). See Java Regular Expressions. • Time: Reformats source values with a format like HH:mm:ss, hh:mm:ss, etc., into a consistent format like HH:mm. See DateTime Format. • TimeStamp: Reformats source values with a format like yyyy-MM-dd HH:mm:ss, MMM dd, yyyy HH:mm aa, etc., into a consistent format like MM/dd/yyyy HH:mm. See DateTime Format. |
To Format | Select the format to change the source to from the drop-down menu, or type into the field. Unlike the From Format, only one format is permitted in the To Format parameter. |
Rounding Mode | The rounding mode applies if input data is of the type Double, Float, and Numeric. Available options are: • CEILING - Rounding mode to round towards positive infinity. • DOWN - Rounding mode to round towards zero. • FLOOR - Rounding mode to round towards negative infinity. • HALF_DOWN - Rounding mode to round towards "nearest neighbor" unless both neighbors are equidistant, in which case round down. • HALF_EVEN - Rounding mode to round towards the "nearest neighbor" unless both neighbors are equidistant, in which case, round towards the even neighbor. • HALF_UP - Rounding mode to round towards "nearest neighbor" unless both neighbors are equidistant, in which case round up. • UNNECESSARY - Rounding mode to assert that the requested operation has an exact result, hence no rounding is necessary. • UP - Rounding mode to round away from zero. |
Locale | The language or country-based preferences for a user interface. Among other things, locales represent currency, date format, and numbers according to the protocols in the given region. You can choose from the available options. |
Rule Name | A default rule name (<FieldName>ChangeStringCase) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name. Note: The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field. |
Field Name | The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, Name (String). |
Rule Type | The type of rule that is applied to the field. That is ChangeStringCase (Function rule). |
Change To Case | The side from which to remove white spaces: • Camel Case - Converts string to camel case which is a programming naming convention. In this all words of a phrase are combined together to form one continuous word and each word except the first word is capitalized. • Kebab Case - Converts string to kebab case which is a way of writing phrases without spaces, where spaces are replaced with hyphens -, and the words are typically all lower case. This is a programming naming convention often used for variable names. This is a programming naming convention. • Lower Case - Converts all characters in the string to lowercase. • Pascal Case - Converts string to pascal case which is a programming naming convention. In this all words of a phrase are combined together to form one continuous word and each word is capitalized. • Snake Case - Converts string to snake case which is a way of writing phrases without spaces, where spaces are replaced with underscores _, and the words are typically all lower case. This is a programming naming convention. • Tile Case - Converts string to title case. The first letter of each word is capitalized except for certain small words like “a,” “the,” and “of.” • Upper Case - (default) Converts all characters in the string to uppercase. |
Rule Name | A default rule name (<FieldName>DeidentifyValue) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name. Note: The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field. |
Field Name | The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, Name (String). |
Rule Type | The type of rule that is applied to the field. That is DeidentifyValue (Function rule). |
Operation | Select how the data will change: • Mask - Select this option to replace the value irreversibly, then specify values for the Method, Location, N Value and Replacement Value parameters. • Encrypt - Select this option to encrypt the value. This option requires you to supply a 16-digit encryption key. The same key will be required in order to subsequently decrypt the value (if decryption is necessary). |
Method | This parameter is available when Mask is selected for the Operation. Select the action to perform on the data: • Replace N Characters: Replaces the number of characters specified in N Value. • Keep N Characters: Retains the number of characters specified in N Value. • Replace All: Replaces all characters in the field. When this option is selected, the Location and N Value fields are disregarded. |
Location | This parameter is available when Mask is selected for the Operation. Select which portion of the field to perform the selected Method on: • Start: Performs the selected Method on the beginning portion of the value. • Middle: Performs the selected Method on the center portion of the value. • End: Performs the selected Method on the end portion of the value. |
N Value | This parameter is available when Mask is selected for the Operation. Specify the number of characters to perform the selected Method on. The maximum value is 9. The default is 4. |
Replacement Value | This parameter is available when Mask is selected for the Operation. Select a replacement character, or enter a character of your choice. Predefined options are: • Asterisk • Lowercase x • Uppercase X • Hyphen (-) • Pound Sign (#) • At Sign (@) • Underscore (_) |
Key | This parameter is available when Encrypt is selected for the Operation. Enter a 16-digit encryption key. This is also the key to use to subsequently decrypt the value (if decryption is necessary). Right-click in the text box to access these options for macros: • Paste Macro: Allows you to copy/paste a macro from the Paste Macro dialog. See Pasting Macro in Map, Profile or Process. • Create Macro: Allows you to create a macro from the Create New Macro dialog. See Creating Macro using Create Macro Menu Option. Note: It is highly recommended that an encrypted macro be used for the key. An encrypted macro is more secure and can be shared with others without transmitting the key in plain text. |
Rule Name | A default rule name (<FieldName>RemoveChars) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name. Note: The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field. |
Field Name | The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, Name (String). |
Rule Type | The type of rule that is applied to the field. That is RemoveChars (Function rule). Tip... A Red Cross in a rule icon (for example |
Select the character classes to remove | Tells what to remove from the source field value. You can choose one or multiple options. The options are: • Whitespace - Removes spaces. • Digits - Removes numeric characters. • Non Printable Characters - Removes non printable characters. Non printable characters are symbols that do not visibly appear when printed or displayed. Examples include newline (\n) for line breaks and tab (\t) for spacing. • Special Characters - Removes any special characters. Special characters are the punctuation characters on your keyboard, such as: ! @ # $ % ^ & * ( ) - _ = + \ | [ ] { } ; : / ? . > |
Remove Other | Choose from one of the following options: • <None> - Choose this when you want to skip specifying the Remove Other option. • Literal - Allows you to specify literal values to search for and remove. • Regular Expressions - Allows you to specify your own regular expression (regex pattern) to search for and remove. See Remarks. • Words - Allows you to specify a comma separated list of word numbers or a range of word numbers to remove. For example “1, 2-4,” will remove 1st, 2nd through 4th word. |
Rule Name | A default rule name (RemoveDuplicates) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name. Note: The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field. |
Field Name | The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, ZIP (String). |
Rule Type | The type of rule that is applied to the field. That is RemoveDuplicates (Test rule). Tip... A Red Cross in a rule icon (for example |
Dimension | (Optional) Select a dimension to associate the rule with. There is no limit to the number of rules a dimension can be associated with. A rule can be associated with a single dimension. A dimension represents a characteristic of data quality: • Accuracy - The data is correct. • Completeness - The data is present. • Consistency - The data uses the same format or pattern across different sources. • Timeliness - The data is recent and available. • Uniqueness - The data is not duplicated. • Validity - The data conforms to business rules and is within an acceptable range. Each dimension generates a Dimension Score for its associated rules. The score indicates the degree to which the data meets the characteristic. Scores can be viewed in the Statistics tab post profile execution. See Viewing Statistics. For more information about dimensions, see Rules Tab. For information about managing dimensions, see Managing Data Quality Dimensions. |
Weight | (Optional) Select the importance level of the rule. Values are 1-5, where 5 is the most important. The default value is 1. This value is reflected in the Data Quality Index (DQI) score and the Dimension Score (if the rule is associated with a dimension). For more information, see Rules Tab. |
Input Fields | List of available fields that can be used as key fields. |
Key Fields | You can move fields from Input Fields to Key Fields to define a key field. You can specify a combination of Fields to define a key field. Note: Key Fields cannot be selected as Sorting Fields and vice versa. |
Results File | Browse and select a csv file (only csv files are supported) or specify the path and name of a csv file that will be used as the cluster file. You can use macros for filenames. When the profile is executed, you can browse the cluster file in the Cluster Data tab by selecting the Remove Duplicates rule and clicking |
Field | Click the Field box and select a sorting field from the drop-down. |
Sort Order | Click the Sort Order box and select Ascending or Descending for each sorting field. |
Click | |
Select a rule and click | |
Select a rule and click | |
Select a disabled rule and click | |
Select an enabled rule and click | |
Click the up and down icons ( |
Select | This checkbox allows you to select records manually. By selecting the desired records and clicking the Generate Rules button, you will create Unique Record Rules. This feature is particularly helpful if you have few duplicates in your source data or prefer to have Data Profiler generate the rules for you. |
$$ClusterId | The unique ID that is assigned to each cluster. Using this, you can identify and understand how many clusters exist and how many records are in each cluster. The cluster IDs are not retained, so every time the profile is executed or the cluster file is generated, the order of the clusters and the associated IDs will change. |
$$pass | $$Pass gives you information about pass records, if it is blank then the record passed the rule, if marked as “X” then the record failed the rule. |
$$rule_id | This is an ID allocated to each Unique Record Rule. The rule_id is only displayed for the unique records identified from the Unique Record Rules. |
Rule Name | A default rule name (RemoveDuplicatesFuzzyMatching) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name. Note: The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field. |
Field Name | The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, ZIP (String). |
Rule Type | The type of rule that is applied to the field. That is RemoveDuplicatesFuzzyMatching (Test rule). Tip... A Red Cross in a rule icon (for example |
Dimension | (Optional) Select a dimension to associate the rule with. There is no limit to the number of rules a dimension can be associated with. A rule can be associated with a single dimension. A dimension represents a characteristic of data quality: • Accuracy - The data is correct. • Completeness - The data is present. • Consistency - The data uses the same format or pattern across different sources. • Timeliness - The data is recent and available. • Uniqueness - The data is not duplicated. • Validity - The data conforms to business rules and is within an acceptable range. Each dimension generates a Dimension Score for its associated rules. The score indicates the degree to which the data meets the characteristic. Scores can be viewed in the Statistics tab post profile execution. See Viewing Statistics. For more information about dimensions, see Rules Tab. For information about managing dimensions, see Managing Data Quality Dimensions. |
Weight | (Optional) Select the importance level of the rule. Values are 1-5, where 5 is the most important. The default value is 1. This value is reflected in the Data Quality Index (DQI) score and the Dimension Score (if the rule is associated with a dimension). For more information, see Rules Tab. |
Input Fields | List of available fields that can be used as key fields. |
Key Fields | You can move fields from Input Fields to Key Fields to define a key field. You can specify a combination of Fields to define a key field. Note: Key Fields cannot be selected as Sorting Fields and vice versa. |
Results File | Browse and select a csv file (only csv files are supported) or specify the path and name of a csv file that will be used as the cluster file. You can use macros for filenames. When the profile is executed, you can browse the cluster file in the Cluster Data tab by selecting the Remove Duplicates rule and clicking |
Field Name | Select a field whose values are compared with each other using the algorithm specified in Comparison Type. You can specify multiple matching fields and each matching field can be configured to use any of the provided algorithms. This list includes only string fields and those which are not used as Key Fields on the Grouping tab. Only fields within the cluster are compared with each other. |
Comparison Type | A list of supported string matching algorithms/functions. For more information, see Remarks. |
Weight | Weightage given to the rule which helps to calculate the final score. Default is 1 (same weightage). Allowed range is 1 to 100. If same value is assigned to all rules, they will have same weightage. For example, rule1 - 30, rule2 - 30, rule3 - 30 will have same weightage. |
Match Score Filter | Select a decimal values between 0.01 to 1. Comparison score is between 0 to 1. Records with comparison score less than this value are not considered a match for probable duplicates and are discarded. The higher the value you select, more strict is the matching. |
Field | Click the Field box and select a sorting field from the drop-down. |
Sort Order | Click the Sort Order box and select Ascending or Descending for each sorting field. |
Click | |
Select a rule and click | |
Select a rule and click | |
Select a disabled rule and click | |
Select an enabled rule and click | |
Click the up and down icons ( |
Select | This checkbox allows you to select records manually. By selecting the desired records and clicking the Generate Rules button, you will create Unique Record Rules. This feature is particularly helpful if you have few duplicates in your source data or prefer to have Data Profiler generate the rules for you. |
$$ClusterId | The unique ID that is assigned to each cluster. Using this, you can identify and understand how many clusters exist and how many records are in each cluster. The cluster IDs are not retained, so every time the profile is executed or the cluster file is generated, the order of the clusters and the associated IDs will change. |
$$pass | $$Pass gives you information about pass records, if it is blank then the record passed the rule, if marked as “X” then the record failed the rule. |
$$rule_id | This is an ID allocated to each Unique Record Rule. The rule_id is only displayed for the unique records identified from the Unique Record Rules. |
Rule Name | A default rule name (<FieldName><ReplaceValue>) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name. Note: The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field. |
Field Name | The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, ZIP (String). |
Rule Type | The type of rule that is applied to the field. That is ReplaceValue (Function rule). Tip... A Red Cross in a rule icon (for example |
Option | Description |
Replace with Constant | Replaces the field value with the specified Constant. • Constant - Enter the constant value with which to replace. |
Replace with Field | Replaces the field value with the specified alternate source Field value. • Field - Select the field from which to replace. |
Replace with Function Results | (Available for string fields) Replaces the field value with the selected function results. You can select one or both functions: • ChangeStringCase - Changes the case of a string field to specified case (for example upper, lower, title) before writing to target field. See ChangeStringCase for more information. • StringTrim - Trims whitespace from a String field. See StringTrim for more information. |
Replace with Lookup | Replaces a source field value with a value from an incore lookup which is stored in memory (RAM). See Replace with Lookup for more information. |
Replace with RegX | (Available for string fields) Replaces what matches the specified regular expression with the replacement value. See Replace with RegX for more information. |
Select Condition | (Optional) Select the condition that must be met for replacing a value. Options are: • CompareToConstant (see CompareToConstant) • CompareToField (see CompareToField) • FuzzyMatch (see FuzzyMatch) • IsBlank - (available for string fields) - Replaces if the field value is blank. • MatchesRegex (see MatchesRegex) |
Parameter | Description |
Lookup Name | Click Add Tip... In the Lookup Configuration Wizard, connect to the lookup as with a source connector. When prompted to select the Lookup Key Field, select the lookup column for comparing against the Matching Key Field in the source or derived field. Click |
Lookup Key Field | The lookup field previously selected in the Lookup Configuration Wizard. This field contains the lookup values to compare against the Matching Key Field values. When the Lookup Key Field and the Matching Key Field values match, the Replacement Field value (from the lookup) is used. If no match is found, the field is replaced with a null or blank. |
Matching Key Field | Select the field in the source (or derived field) to match against the Lookup Key Field. When the Lookup Key Field and the Matching Key Field values match, the Replacement Field value is returned. |
Replacement Field | Select the field in the lookup that contains the replacement value. |
Select Condition | (Optional) Select the condition that must be met for replacing a value. Options are: • CompareToConstant (see CompareToConstant) • CompareToField (see CompareToField) • FuzzyMatch (see FuzzyMatch) • IsBlank - (available for string fields) - Replaces if the field value is blank. • MatchesRegex (see MatchesRegex) |
Parameter | Description | |
Pattern | The regex pattern to search for. See Java Regular Expressions. | |
Replacement | The replacement value. | |
Occurrences | The field occurrences that you want to replace. For example “1, 3-4, 7”. The “3-4” (from 3rd through 4th) is a range. You can also specify “all”. If left blank, all instances are replaced. | |
Select Condition | (Optional) Select the condition that must be met for replacing a value. Options are: • CompareToConstant (see CompareToConstant) • CompareToField (see CompareToField) • FuzzyMatch (see FuzzyMatch) • IsBlank - (available for string fields) - Replaces if the field value is blank. • MatchesRegex (see MatchesRegex) | |
Rule Name | The default rule name (<FieldName>_StandardizeFormat) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name. Note: The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field. |
Field Name | The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, Zip (String). |
Rule Type | The type of rule that is applied to the field. That is StandardizeFormat (Function rule). |
Field Type | (required) Select the relevant type of data in the field. This selection populates options in the Format Type field. Options are: • Zip - Postal zip codes. • Phone - Telephone numbers. |
Format Type | Select the type of format to apply to the field. |
• Phone - (required) Select the phone number formatting standard to use from the Format Type drop-down list: – International - The international formatting standard which includes the country code in the result. For example: +1 (202) 888 2943 – National - The national formatting standard, which does not include a country code in the result, unless the country code is explicitly included in the custom format. For example: (202) 888 2943 – E164 - The E164 formatting standard. For example: +34632123456 – RFC3966 - The RFC3966 formatting standard. For example: tel:+1-201-555-0123 • Country - Select the geographic region for the phone numbers. Each phone number format is based on a country. If there is a column named country, users can type `country` into this field so that the format applied to each phone number value is determined by the country value. This can be helpful if your data contains phone numbers from different countries. Note: Country value can be ISO2 code, or the full country name if being referenced dynamically using expressions. | |
• Zip - (required, unless Custom Format is specified) Select the formatting standard to use from the Format Type drop-down list. Each zip code format is based on a country. If there is a column named country, users can type `country` into this field so that the format applied to each zip code value is determined by the country value. This can be helpful if your data contains zip codes from different countries. In cases where a country supports multiple formats, multiple options are available to choose from. For example, both of these formats are available for the United States:United States and United States (Zip + 4). | |
Custom Format | (optional) Enter a format to use and click Add. The format specified is used (not the Format Type selected). Multiple formats can be entered. When multiple formats are specified, the first format that matches the cell value is used. The following characters have special meaning in the format: L = Letter (for zip codes only) N = Number ? = Indicates that the preceding letter or number is optional #{CC} - Inserts the country code (for phone numbers only) All other characters are included in the result at the position specified in the format. For example, the following specifies a United States zip code format: NNNNN NNNN |
Rule Name | A default rule name (<FieldName>_StringTrim) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name. Note: The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field. |
Field Name | The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, Address (String). |
Rule Type | The type of rule that is applied to the field. That is StringTrim (Function rule). |
Trim Side | The side from which to remove white spaces: • Both (default) • Left • Right |
Rule Name | A default rule name (<FieldName>_ApplyTimeZone) is provided and displayed here. However, you can edit or overwrite it. Click Reset to restore the default rule name. Note: The underscore (_) character is the only special character allowed in the name. Rule names cannot begin with a digit. If a field or column in the source data starts with a digit, 'r_' will be prepended to any rules created based on that field. |
Field Name | The field name to which the rule applies is displayed here, along with the data type in parentheses. For example, Login_Time (Timestamp). |
Rule Type | The type of rule that is applied to the field. That is ApplyTimeZone (Function rule). |
Time Zone | Select from a list of available time zones. |