Unicode (Delimited)
Unicode is a character set that uses 16 bits (two bytes) for each character and is able to include more characters than ASCII. Unicode can have 65,536 characters and therefore can be used to encode almost all the languages of the world. Unicode includes the ASCII character set within it. With this delimited text connector, you can read and write Unicode files.
Connector-Specific Notes
Source files containing null characters (0x00) embedded in a text string are not supported. All information following the null characters is stripped from the file.
Using an External Schema to Override Source Structure
When Unicode (Delimited) is the source connector, the data structure is normally set by field delimiters and the header record of the source file. However, after connecting to a file, you can override this structure by applying an external schema, for example to change field names, change their size, or even add additional fields for multiple record layouts.
Using an External Schema to Override Target Structure
When Unicode (Delimited) is the target connector, the data structure is normally set by field delimiters and the header record of the target file. However, after connecting to a file, you can override this structure by applying an external schema, for example to change field names, change their size, or even add additional fields for multiple record layouts.
Delimiter Characters Occurring as Data within a Field
Characters that delimit the start and end of a field may also appear as data within the field. To ensure that a data character is not interpreted as a delimiter, the integration platform creates an escape sequence by doubling the character when it is assumed to be data.
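The doubling rule can be sketched in a few lines of Python. This is only an illustration of the escape sequence described above, not the integration platform's own code; the function name delimit_field and the sample string are hypothetical.

```python
def delimit_field(value: str, delimiter: str = '"') -> str:
    """Wrap a value in field delimiters, doubling any delimiter found in the data."""
    escaped = value.replace(delimiter, delimiter * 2)  # data characters are doubled
    return f"{delimiter}{escaped}{delimiter}"          # enclosing delimiters are not

sample = 'She said "hi"'        # illustrative source value
result = delimit_field(sample)
print(result)
```

Only the quotation marks inside the data are doubled; the pair that encloses the field is written once.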
Quotation marks are a common example of this escape sequence. As shown below, the quotation marks enclose quoted words in an Excel source field. In the mapping to a delimited Unicode target with a quotation mark selected as field delimiter, the quotation marks are doubled for the data but not for the delimiters enclosing the field.
Excel source
The customer said, "A penny saved is a penny earned."
Delimited Unicode target, with field delimiters
"The customer said, ""A penny saved is a penny earned."""
HeaderRecord Property in the Source
If the HeaderRecord property is set to true, then a single header record is skipped at the beginning of the file. If there are later header records for the additional record types, they appear as data and can possibly cause errors.
All records are read using the same properties.
Unless truncation handling is turned off (set to Ignore), each record is read twice. Reading a single source record requires reading the discriminator record first; if the discriminator indicates a different record type, the record must be reread with the new record type. Truncation handling is also turned off momentarily while the discriminator record is read, so even if the discriminator record indicates itself as the record type, truncation handling must be turned back on and the record reread unless truncation is set to Ignore.
Simply connecting to a source file produces only one record type. If the file has multiple record types, the user must create the record type structure in a schema.
HeaderRecord Property in the Target
If the HeaderRecord property is set to true, a header is written only for the first record type in a multiple record layout.
All records are written using the same properties.
Supported Encoding
For the list of supported encodings, see Binary (International) Unicode Support.
Property Options
You can set the following source (S) and target (T) properties.
Property
S/T
Description
AlternateFieldSeparator
S
Most data files have only one field separator between all the fields; however, it is possible to have more than one field separator. If your source file has one field separator between some fields and a different separator between other fields, you can specify the second field separator here. Otherwise, you should leave this set to None (the default).
The alternate field separators available from the list are none (default), comma, tab, space, carriage return-line feed, line feed, carriage return, line feed-carriage return, ctrl-R, and pipe (|). To select a separator, click AlternateFieldSeparator. Then click the arrow to the right of the box to choose from the list of available separators. If you have an alternate field separator other than one from the list, you can type it here.
If the field separator is not a printable character, replace CR-LF with a backslash, an X, and the hexadecimal value for the separator.
The Unicode connectors read the data from the file as Unicode and look for the Unicode characters specified as the separators to break up the data into fields or records. Then, the actual Unicode data is assigned to fields or records.
AutomaticStyling
S
AutomaticStyling changes the way Unicode data is read or written. By default, AutomaticStyling is set to false, causing all data to be read or written as Text. When set to true, it determines and formats (automatically) particular data types, such as numeric and date fields.
AutomaticStyling ensures, for example, that a date field in a Unicode source file is formatted as a date field in the target file, and not as a character or as text data.
Note:  If a source file contains zip codes, you may want to leave AutomaticStyling set to false so that the leading zeros in some zip codes in the eastern United States are not deleted.
Note:  For a Unicode target file, if you set FieldDelimitStyle to Text, you must also set AutomaticStyling to true so that delimiters are placed around only the nonnumeric fields.
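The zip code note above can be illustrated outside the connector. The following Python sketch is not the integration platform's logic; it only shows why a value that looks numeric loses its leading zero once it is treated as a number. The sample value is illustrative.

```python
zip_code = "02134"             # value as it appears in the source file

as_text = zip_code             # kept as Text: the leading zero is preserved
as_number = int(zip_code)      # interpreted as a number: the leading zero is lost

print(as_text)
print(str(as_number))
```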
ByteOrder
ST
Allows you to specify the byte order of Unicode (wide) characters. The default is Auto and is determined by the architecture of your computer. The list box options are Auto (default), Little Endian and Big Endian. Little Endian byte order is generally used by Intel machines and DEC Alphas and places the least significant portion of a byte value in the left portion of the memory used to store the value. Big Endian byte order is used by IBM 370 computers, Motorola microprocessors and most RISC-based systems and stores the values in the same order as the binary representation.
EmptyFieldsNull
S
Allows you to treat all empty fields as null.
Encoding
ST
Type of encoding to use with source and target files.
Field1IsRecTypeId
S
If the first field of each record in your source file contains the Record Type ID, you can select true for this property and the integration platform treats each record as a separate record type. Within each record, field names derived from the Record Type ID are automatically generated for each field. For details, see Field1IsRecordType.
FieldDelimitStyle
T
When Unicode (Delimited) is your connector, this option determines whether the specified FieldStartDelimiter and FieldEndDelimiter are used for all fields, only for fields containing a separator, or only for text fields, as follows:
All – Places the delimiters specified in FieldStartDelimiter and FieldEndDelimiter before and after every field. Default setting is All. For example: "Smith","12345","Houston".
Partial – Places the specified delimiters before and after fields only where necessary. A field that contains a character that is the same as the field separator would have the field delimiters placed around it. A common example is a memo field that contains quotes within the data: "Customer responded with "No thank you" to my offer"
Text – Places delimiters before and after text and name fields (non-numeric fields). Numeric and date fields have no FieldStartDelimiter or FieldEndDelimiter. For example: "Smith", 12345,"Houston", 11/13/04
Non-numeric – Places delimiters before and after all nonnumeric types, such as date fields. An important difference between non-numeric and text is that non-numeric delimits date fields, while text does not.
FieldEndDelimiter
ST
Delimited Unicode files are presumed to have beginning-of-field and end-of-field delimiters. The default delimiter is a quotation mark because it is the most common. However, some files do not contain field delimiters, so this option is available for both source files and target files. To read from or write to a file with no delimiters, set FieldStartDelimiter to none.
FieldSeparator
ST
A delimited Unicode file is presumed to have a comma between each field. To specify some other field separator, click once in the FieldSeparator Current Value box. Then click the down arrow to the right of the box to display the list of options. The list box options are comma (default), tab, space, carriage return-line feed, linesep, line feed, carriage return, line feed-carriage return, a pipe (|), and no field separator. If you have or need an alternate field separator other than one from the list, you can type it here.
If the field separator is not a printable character, replace CR-LF with a backslash, an X, and the hexadecimal value for the separator.
The Unicode connectors read the data from the file as Unicode and look for the Unicode characters specified as the separators to break the data up into fields or records. Then the actual Unicode data is assigned to fields or records.
FieldStartDelimiter
ST
Delimited Unicode files are presumed to have beginning-of-field and end-of-field delimiters. The default delimiter is a quotation mark because it is the most common. However, some files do not contain field delimiters, so this option is available for both your source files and your target files. To read from or write to a file with no delimiters, set FieldEndDelimiter to none.
Header
ST
In some files, the first record is a header record. For source data, you can remove it from the input data and cause the header titles to be used automatically as field names. For target data, you can cause the field names in your source data to automatically create a header record in your target file. To identify a header record, set Header to true. The default is false.
Note:  If your target connector is Unicode (Delimited) and you are appending data to an existing file, leave Header set to false.
MaxDataLen
T
When Unicode (Delimited) is your target connector, this option allows you to specify the maximum number of characters to write to a field. If this value is set to 0 (the default), the number of characters written to a field is determined by the field length. If you set this value to a number other than zero, data may be truncated.
NullIndicator
ST
This property allows you to enter a special string used to represent null values. You can select predefined values or type any other string.
Target – When writing a null value, the contents of the null indicator string are written.
Source – A check is made to see if the null indicator is set. If it is set, the data is compared to the null indicator. If the data and the null indicator match, the field is set to null.
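The two directions of NullIndicator can be sketched as follows. This is a minimal Python illustration under stated assumptions: the indicator string "NULL" and the function names are hypothetical, and None stands in for a null field value.

```python
from typing import Optional

NULL_INDICATOR = "NULL"  # hypothetical indicator string chosen for this example

def read_field(raw: str, null_indicator: Optional[str] = NULL_INDICATOR) -> Optional[str]:
    """Source side: if an indicator is set and the data matches it, the field is null."""
    if null_indicator is not None and raw == null_indicator:
        return None
    return raw

def write_field(value: Optional[str], null_indicator: str = NULL_INDICATOR) -> str:
    """Target side: a null value is written as the indicator string."""
    return null_indicator if value is None else value

print(read_field("NULL"))
print(read_field("Houston"))
print(write_field(None))
```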
NumericFormatNormalization
S
Setting this property to true handles thousands-separators according to usage for locale when numeric strings are converted to numeric type. This property overrides any individual field settings. Supported in 9.2.2 and later. Default is false.
OrderMark
T
The Order Mark is a special character value sometimes written to a Unicode text file to indicate the byte order used for encoding each of the Unicode characters. In the integration platform, you have the option of writing a byte order mark at the beginning of Unicode (wide) output or not. The default is false. If you wish to have the byte order mark placed at the beginning of your output, change this option to true.
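To see what byte order and the byte order mark look like at the byte level, the following Python sketch encodes a single character both ways using the standard codecs module. It illustrates the general UTF-16 behavior only and makes no claim about how the integration platform writes its output.

```python
import codecs

text = "A"  # U+0041

little = text.encode("utf-16-le")  # least significant byte first
big = text.encode("utf-16-be")     # most significant byte first

# A byte order mark placed in front of the data identifies the byte order.
with_mark_le = codecs.BOM_UTF16_LE + little
with_mark_be = codecs.BOM_UTF16_BE + big

print(little, big)
print(with_mark_le, with_mark_be)

# A reader that honors the mark recovers the same text from either order.
print(with_mark_le.decode("utf-16"), with_mark_be.decode("utf-16"))
```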
RecordFieldCount
S
If your source data file has field separators but no record separator, or if it has the same separator for both the fields and the records, you should specify the RecordSeparator (most likely a blank line), leave the AlternateFieldSeparator option blank and enter the exact number of fields per record in this box. The default value is zero.
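The effect of a fixed field count can be sketched in Python. This is an illustration only, assuming a file in which a comma separates every value and nothing else marks the end of a record; the function name split_records and the sample data are hypothetical.

```python
from typing import List

def split_records(data: str, separator: str, field_count: int) -> List[List[str]]:
    """Group a flat run of separated values into records of field_count fields."""
    fields = data.split(separator)
    return [fields[i:i + field_count] for i in range(0, len(fields), field_count)]

data = "Smith,12345,Houston,Jones,67890,Austin"
records = split_records(data, ",", 3)
for record in records:
    print(record)
```

With a field count of 3, every third value closes a record.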
RecordSeparator
ST
A delimited Unicode file is presumed to have a carriage return-line feed (CR-LF) between records. To use other characters for a record separator, click RecordSeparator for a list of choices, including system default, carriage return-line feed (default), line feed, carriage return, line feed-carriage return, form feed, empty line, ctrl-E, and no record separator. To use a separator other than one from the list, enter it here. The SystemDefault setting enables the same transformation to run with CR-LF on Windows systems and LF on Unix systems without having to change this property.
If the record separator is not a printable character, replace CR-LF with a backslash, an X, and the hexadecimal value for the separator.
The Unicode connectors read the data from the file as Unicode and look for the Unicode characters specified as the separators to break the data up into fields or records. Then the actual Unicode data is assigned to fields or records.
StartOffset
S
If your source data file starts with characters that need to be excluded from the transformation, set the StartOffset option to specify at which byte of the file to begin. The default value is zero. The correct value may be determined by using the Hex Browser.
Note:  This property is set in number of bytes, not characters.
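The difference between bytes and characters matters because each character in a 16-bit encoding occupies two bytes. The Python sketch below assumes little-endian UTF-16 data with no byte order mark and two unwanted leading characters; the sample data is hypothetical.

```python
raw = "##Name,City\r\n".encode("utf-16-le")  # sample file contents as bytes

chars_to_skip = 2                 # two unwanted leading characters
start_offset = chars_to_skip * 2  # two bytes per character in this encoding

payload = raw[start_offset:].decode("utf-16-le")
print(repr(payload))
```

Skipping two characters therefore requires an offset of 4, not 2.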
StripLeadingBlanks
ST
For a Unicode source file, by default the integration platform leaves leading blanks in delimited Unicode data. If you want to delete the leading blanks, set StripLeadingBlanks to true.
For a Unicode target file, by default, the integration platform strips leading blanks in delimited Unicode data. If you want to leave the leading blanks, set StripLeadingBlanks to false.
StripTrailingBlanks
ST
For a Unicode source file, by default the integration platform keeps trailing blanks in the data. If you want to delete the trailing blanks, set StripTrailingBlanks to true.
For a Unicode target file, by default the integration platform strips trailing blanks in the data. If you want to leave the trailing blanks, set StripTrailingBlanks to false.
The field options that you may change are listed below.
StyleSampleSize
S
Sets the number of records (starting with record 1) that are analyzed to set a default width for each source field. The default value for this option is 5000. You can change the value to any number between 1 and the total number of records in your source file. As the number gets larger, more time is required to analyze the file, and it may be necessary to analyze every record to ensure that no data is truncated.
To change the value, click the StyleSampleSize Current Value box, highlight the default value and type a new value.
TransliterationIn
T
Allows you to specify a character, or a set of characters, to be filtered out of the source data. For any character in TransliterateIn, the corresponding character from the TransliterateOut property is substituted. If there is no corresponding character, the source character is filtered out completely. TransliterateIn supports C-style escape sequences such as \n (new line), \r (carriage return) and \t (tab).
TransliterationOut
T
Allows you to specify a character to be substituted for another character from the source data. For any character in TransliterateIn, the corresponding character from the TransliterateOut property is substituted. If you wish the source character to be filtered out completely, leave this field blank. If there are no characters to be transliterated, this field should be left blank. The TransliterateOut property supports C-style escape sequences such as \n (new line), \r (carriage return) and \t (tab).
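The substitution and filtering behavior described for the two transliteration properties can be sketched with Python's str.maketrans and str.translate. This is an illustration, not the integration platform's implementation, and it assumes that characters correspond by position in the two strings; the function name transliterate is hypothetical.

```python
def transliterate(data: str, chars_in: str, chars_out: str) -> str:
    """Substitute each character in chars_in with its positional partner in
    chars_out; characters in chars_in that have no partner are removed."""
    paired = chars_in[:len(chars_out)]    # characters that have a substitute
    unpaired = chars_in[len(chars_out):]  # characters to filter out completely
    table = str.maketrans(paired, chars_out[:len(paired)], unpaired)
    return data.translate(table)

# ';' is replaced by ',' and the tab has no partner, so it is filtered out.
print(transliterate("a;b\tc", ";\t", ","))

# With the output side left blank, every listed character is filtered out.
print(transliterate("a;b", ";", ""))
```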
Field1IsRecordType
If Field1IsRecordType is set to true and your first record consists of the following:
"Names", "Arnold", "Benton", "Cassidy", "Denton", "Exley", "Fenton"
Then the integration platform assigns these field names:
Names_01: Names
Names_02: Arnold
Names_03: Benton
Names_04: Cassidy
Names_05: Denton
Names_06: Exley
Names_07: Fenton
See the Field1IsRecordType entry in the table of property options for delimited Unicode connectors.
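The naming pattern in the example above can be reproduced with a short Python sketch. It is a minimal illustration that assumes the names are formed from the first field, an underscore, and a two-digit position; the function name name_fields is hypothetical.

```python
from typing import Dict, List

def name_fields(record: List[str]) -> Dict[str, str]:
    """Build field names from the record type ID held in the first field."""
    record_type = record[0]
    return {
        f"{record_type}_{position:02d}": value
        for position, value in enumerate(record, start=1)
    }

record = ["Names", "Arnold", "Benton", "Cassidy", "Denton", "Exley", "Fenton"]
fields = name_fields(record)
for name, value in fields.items():
    print(name, value)
```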
Additional Information about Encoding
You should be aware of the following regarding the Encoding property option:
Shift-JIS encoding is meaningful only in Japanese operating systems.
UCS-2 is no longer a valid encoding name, but you may use UCS2. Open the data file with a text editor and change UCS-2 to UCS2.
To display Chinese-Japanese-Korean-Vietnamese (CJKV) data in Data Browser
1. Verify that your operating system has at least one font available that corresponds to the specific character set and code page you want to use.
2. Select the Unicode connector and your encoding method in the source or target properties.
3. Go to the main menu and select View > Preferences > Fonts.
4. Choose a font that corresponds to your character set and encoding method.
Data Types
By default, all Unicode data is read as Text. Fields containing dates or numbers may be changed to a different data type.
Note:  Because of the variety of date formats in data files, we suggest you change the type of any field that contains a date to Date in the schema.
Length
These are field lengths in your data. If you need to change field lengths, reset them in the schema.
The maximum supported source field length is 2 GB.
For example, there is a field that contains numbers. The numbers are a dollar value with two decimal places, such as 7122.50. The field type default is Text and (because of values in other records) the field Size default is 10. You are transforming the data to a database application in which you want the data in this field to be numeric. If you change the source field type to Float, the field length becomes blank, the precision default is 15, and the decimal changes to 2. This field automatically appears as an appropriate numeric data type in your target schema and is a numeric field in your target data file.
If you want to set different field delimiters for fields that contain numeric data, see FieldDelimitStyle in the property options. The following data types are available:
Boolean
Date
Date/Time
Decimal
Float
Integer
Name (parses and displays a proper name into its parts, such as honorific, title, last name, middle initial, first name)
Text
Time