Encoding Tips
This section provides information that is useful to some users but is not required to run the integration products.
Troubleshooting Encoding Issues
Unicode 4.0 Limitations
Unicode 4.0 Character Encodings
Customizable ICU Encodings
Customizing Non-ICU Japanese Mappings
Byte Order Marks
Troubleshooting Encoding Issues
Encoding issues in the data integration tools usually have one of the following causes:
An incorrect encoding constant in the EZscript File() functions. See Encoding EZscript Functions and Objects.
An incorrectly specified encoding property on a connector with encoding support. See Determining Which Connector to Use.
An ASCII connector used instead of a Unicode connector. See Determining Which Connector to Use.
Source and target connector encoding properties that are not both set to the correct value.
An incorrect data type for encoded data. For instance, varchar instead of nvarchar.
A data field in the CRM or database that is not set up to handle Unicode or encoded data.
Tip: If your data does not display properly, this does not necessarily mean that the data was transformed incorrectly from the source to the target. Ensure that the following are set correctly: font and script settings, operating system options, the integration platform preferences, database clients such as TOAD, database servers, and other third-party tools such as text viewers.
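The difference between damaged data and a display problem can be illustrated outside the product. The following Python sketch is not part of the integration platform. The sample string and the choice of CP1252 as the viewer's wrong assumption are made up for illustration. It shows that reading the same bytes with the wrong encoding produces garbled text while the stored bytes stay intact.

```python
# Illustration only: plain Python, not an EZscript or connector API.
original = "Größe"                  # sample text (assumed for this example)
stored = original.encode("utf-8")   # bytes as a UTF-8 target would hold them

# A viewer that assumes CP1252 displays garbled characters...
misread = stored.decode("cp1252")

# ...but the bytes are unchanged, so the correct encoding recovers the text.
recovered = stored.decode("utf-8")

print(misread)
print(recovered)
```

If the text is readable once the viewer uses the right encoding, the transformation itself was not at fault.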
To test that your source data can be viewed correctly
1. Change the target connector to Unicode.
2. Run the transformation that converts the source data to the target.
3. View the data with a text viewer, such as Windows Notepad.
To test that your target data is being output from a database correctly
View your target data file or table in viewers other than Map Designer or Windows Notepad. For example, view the data in your database application, the CRM server, or in client views.
To determine if a file or table is Unicode
View the data in a hex viewer and check whether Unicode encoding values are present. To use the hex browser, open the data in Structured Schema Designer. For instructions, search for the words "hex browser" in the online help.
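As a complement to the hex browser, the leading bytes of a file can also be checked with a short script. The following Python sketch is an illustration, not a feature of the integration platform. The function names detect_bom and hex_preview are hypothetical. It looks only for a byte order mark, so a result of None does not prove that the data is not Unicode.

```python
import codecs

# Known BOM signatures. UTF-32 is checked before UTF-16 because the
# UTF-32 LE mark (FF FE 00 00) begins with the UTF-16 LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]

def detect_bom(data: bytes):
    """Return the encoding name implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

def hex_preview(data: bytes, count: int = 16) -> str:
    """Format the first bytes the way a hex viewer would show them."""
    return " ".join(f"{b:02X}" for b in data[:count])

sample = codecs.BOM_UTF8 + "abc".encode("utf-8")
print(hex_preview(sample))
print(detect_bom(sample))
```

For a file on disk, reading the first four bytes in binary mode and passing them to detect_bom is enough.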
Unicode 4.0 Limitations
This implementation of Unicode has the following limitations in the data integration tool set:
Data Browser
The Map Designer data browser may not allow you to view all language combinations at the same time. If your file has characters that are not compatible with the font, script, or language settings, you may see question mark characters in the browser display. For instance, the Data Browser may display certain Vietnamese characters incorrectly. This presentation issue is not due to data loss.
To read multiple languages in your files, use a text editor, such as Windows Notepad or WordPad, to view your data.
To view a single language or related language types, review the following settings:
Operating System Language = [your language set] such as East Asian languages.
Operating system fonts to read Unicode = Unicode fonts installed. For example, Arial Unicode MS is a common Windows font set.
Script Editor
Script Editor does not support Unicode literal strings in scripts.
Data Parser
The Unicode (Fixed) connector is not supported in the Data Parser under these conditions:
When UTF-8, UTF-16, or UCS-2 encoding is used.
When the CharFieldWidths property is set to character width (True).
Unicode 4.0 Character Encodings
The integration platform supports the Unicode 4.0 character encodings listed here. The Binary (International) connector supports different encodings, listed in the topic "Binary (International) Unicode Support" in the Connector Reference.
Encoding      Name
Big5          ANSI Big 5 code page
Big5 HKSCS    HK Big5
CP037         EBCDIC USA/Canada code page
CP273         EBCDIC Germany
CP277         EBCDIC Norway and Denmark
CP278         EBCDIC Sweden and Finland
CP280         EBCDIC Italy
CP284         EBCDIC Spain
CP285         EBCDIC United Kingdom
CP297         EBCDIC France
CP420         EBCDIC Arabic
CP424         EBCDIC Hebrew
CP437         DOS Latin US code page
CP500         EBCDIC International code page
CP737         DOS Greek code page
CP775         DOS Baltic Rim code page
CP838         EBCDIC Thai
CP850         DOS Latin 1 code page
CP852         DOS Latin 2 code page
CP855         DOS Cyrillic code page
CP857         DOS Turkish code page
CP860         DOS Portuguese code page
CP861         DOS Icelandic code page
CP862         DOS Hebrew code page
CP863         DOS Canada code page
CP864         DOS Arabic code page
CP865         DOS Nordic code page
CP866         DOS Cyrillic Russian code page
CP869         DOS Greek 2 code page
CP870         EBCDIC Latin 2
CP871         EBCDIC Iceland
CP874         DOS Thai code page
CP875         EBCDIC Greek code page
CP918         EBCDIC Urdu
CP932         ANSI Shift-JIS code page
CP1025        EBCDIC Cyrillic
CP1026        EBCDIC Latin5 Turkish code page
CP1047        EBCDIC Open Systems Latin-1
CP1051        Roman-8
CP1097        EBCDIC Farsi
CP1112        EBCDIC Baltic
CP1122        EBCDIC Estonia
CP1123        EBCDIC Cyrillic Ukraine
CP1130        EBCDIC Vietnamese
CP1132        EBCDIC Lao
CP1250        ANSI Latin-2 code page
CP1251        ANSI Cyrillic code page
CP1252        Standard Latin-1 code page
CP1253        ANSI Greek code page
CP1254        ANSI Turkish code page
CP1255        ANSI Hebrew code page
CP1256        ANSI Arabic code page
CP1257        ANSI Baltic code page
CP1258        ANSI Vietnamese code page
EUC-CN        Extended Unix Code (EUC) encoding for Simplified Chinese
EUC-JP        Extended Unix Code (EUC) encoding for Japanese
EUC-KR        Extended Unix Code (EUC) encoding for Korean
EUC-TW        Extended Unix Code (EUC) encoding for Traditional (Taiwanese) Chinese
GB18030       ANSI Extended GB code page
GB2312        ANSI Extended GB code page
GBK           Simplified Chinese GBK
HZ            Chinese HZ
IBM950        IBM Traditional Chinese
ISO-8859-1    ISO 8859-1 standard code page
ISO-8859-2    ISO 8859-2 standard code page
ISO-8859-3    ISO 8859-3 standard code page
ISO-8859-4    ISO 8859-4 standard code page
ISO-8859-5    ISO 8859-5 standard code page
ISO-8859-6    ISO 8859-6 standard code page
ISO-8859-7    ISO 8859-7 standard code page
ISO-8859-8    ISO 8859-8 standard code page
ISO-8859-9    ISO 8859-9 standard code page
JIS7          Japanese JIS7
JIS8          Japanese JIS8
ks_c_5601     ANSI Unified Hangul code page
OEM           Corresponding OEM code page
Shift-JIS     Unicode Consortium mapping of JIS 0201 and JIS 0208
UCS-2         Standard two-byte Unicode (nonencoded)
US-ASCII      Standard US ASCII, same as ISO 8859-1
UTF-8         UTF-8 encoding of Unicode (into 8-bit characters)
UTF-16        UTF-16 encoding of Unicode (into 16-bit characters)
UTF-32        UTF-32 encoding of Unicode (into 32-bit characters)
Windows-1251  Windows Cyrillic, with the Euro sign
Windows-1252  Windows Latin 1, with the Euro sign
Windows-1253  Windows Greek, with the Euro sign
Windows-1254  Windows Turkish, with the Euro sign
Windows-1255  Windows Hebrew, with the Euro sign
Windows-1256  Windows Arabic, with the Euro sign
Windows-1257  Windows Baltic, with the Euro sign
Windows-1258  Windows Vietnamese, with the Euro sign
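To make concrete what converting between two of these encodings involves at the byte level, here is a minimal Python sketch. It is not the integration platform's API. The names in the table are the platform's identifiers, while the codec names on the right side of the dictionary are Python's own, and pairing them by code page number is an assumption made for this illustration. The function name transcode is hypothetical.

```python
# Illustration only: Python codec names, not connector encoding names.
# The pairing below is an assumption based on the code page numbers.
PYTHON_CODEC = {
    "CP037": "cp037",
    "CP1252": "cp1252",
    "ISO-8859-1": "iso-8859-1",
    "UTF-8": "utf-8",
    "UTF-16": "utf-16",
}

def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    text = data.decode(PYTHON_CODEC[source])
    return text.encode(PYTHON_CODEC[target])

# The same five letters have different byte values in EBCDIC and UTF-8.
ebcdic = "HELLO".encode("cp037")
print(ebcdic.hex())
print(transcode(ebcdic, "CP037", "UTF-8").hex())
```

The two hex strings differ even though the text is identical, which is why both the source and the target encoding must be specified correctly.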
Customizable ICU Encodings
The following table lists customizable ICU encodings. See Customizing Character Mappings for instructions.
ICU Encoding  File in mappings Folder
Big5          windows-950-2000.ucm
Big5 HKSCS    ibm-1375_P100-2003.ucm
CP037         ibm-37_P100-1995.ucm
CP273         ibm-273_P100-1995.ucm
CP277         ibm-277_P100-1995.ucm
CP278         ibm-278_P100-1995.ucm
CP280         ibm-280_P100-1995.ucm
CP284         ibm-284_P100-1995.ucm
CP285         ibm-285_P100-1995.ucm
CP297         ibm-297_P100-1995.ucm
CP420         ibm-420_X120-1999.ucm
CP424         ibm-424_P100-1995.ucm
CP437         ibm-437_P100-1995.ucm
CP500         ibm-500_P100-1995.ucm
CP737         ibm-737_P100-1997.ucm
CP775         ibm-775_P100-1996.ucm
CP838         ibm-838_P100-1995.ucm
CP850         ibm-850_P100-1995.ucm
CP852         ibm-852_P100-1995.ucm
CP855         ibm-855_P100-1995.ucm
CP857         ibm-857_P100-1995.ucm
CP860         ibm-860_P100-1995.ucm
CP861         ibm-861_P100-1995.ucm
CP862         ibm-862_P100-1995.ucm
CP863         ibm-863_P100-1995.ucm
CP864         ibm-864_X110-1999.ucm
CP865         ibm-865_P100-1995.ucm
CP866         ibm-866_P100-1995.ucm
CP869         ibm-869_P100-1995.ucm
CP870         ibm-870_P100-1995.ucm
CP871         ibm-871_P100-1995.ucm
CP874         ibm-874_P100-1995.ucm
CP875         ibm-875_P100-1995.ucm
CP918         ibm-918_P100-1995.ucm
CP1025        ibm-1025_P100-1995.ucm
CP1026        ibm-1026_P100-1995.ucm
CP1047        ibm-1047_P100-1995.ucm
CP1051        ibm-1051_P100-1995.ucm
CP1097        ibm-1097_P100-1995.ucm
CP1112        ibm-1112_P100-1995.ucm
CP1122        ibm-1122_P100-1999.ucm
CP1123        ibm-1123_P100-1995.ucm
CP1130        ibm-1130_P100-1997.ucm
CP1132        ibm-1132_P100-1998.ucm
CP1250        ibm-5346_P100-1998.ucm
CP1251        ibm-5347_P100-1998.ucm
CP1252        ibm-5348_P100-1997.ucm
CP1253        ibm-5349_P100-1998.ucm
CP1254        ibm-5350_P100-1998.ucm
CP1255        ibm-9447_P100-2002.ucm
CP1256        windows-1256-2000.ucm
CP1257        ibm-9449_P100-2002.ucm
CP1258        ibm-5354_P100-1998.ucm
EUC-CN        ibm-1383_P110-1999.ucm
EUC-JP        ibm-954_P101-2000.ucm
EUC-KR        ibm-970_P110-1995.ucm
EUC-TW        ibm-964_P110-1999.ucm
GB18030       gb18030.ucm
GB2312        ibm-1383_P110-1999.ucm
GBK           windows-936-2000.ucm
IBM950        ibm-950_P110-1999.ucm
ISO-8859-2    ibm-912_P100-1995.ucm
ISO-8859-3    ibm-913_P100-2000.ucm
ISO-8859-4    ibm-914_P100-1995.ucm
ISO-8859-5    ibm-915_P100-1995.ucm
ISO-8859-6    ibm-1089_P100-1995.ucm
ISO-8859-7    ibm-813_P100-1995.ucm
ISO-8859-8    ibm-916_P100-1995.ucm
ISO-8859-9    ibm-920_P100-1995.ucm
ks_c_5601     windows-949-2000.ucm
Shift-JIS     ibm-943_P15A-2003.ucm
Windows-1251  ibm-5347_P100-1998.ucm
Windows-1252  ibm-5348_P100-1997.ucm
Windows-1253  ibm-5349_P100-1998.ucm
Windows-1254  ibm-5350_P100-1998.ucm
Windows-1255  ibm-9447_P100-2002.ucm
Windows-1256  windows-1256-2000.ucm
Windows-1257  ibm-9449_P100-2002.ucm
Windows-1258  ibm-5354_P100-1998.ucm
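Before editing one of these mapping files, it can help to inspect what it contains. The following Python sketch reads the mapping lines of a .ucm file, assuming the usual ICU layout in which entries such as <U0041> \x41 |0 appear between CHARMAP and END CHARMAP. It is a simplified illustration and not the procedure described in Customizing Character Mappings. The function name parse_ucm is hypothetical, the sample text is hand-written rather than copied from a shipped file, and only round-trip entries (flag |0) are kept.

```python
import re

# One mapping line: <UXXXX>, one or more \xNN bytes, then a |N flag.
LINE = re.compile(r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)")

def parse_ucm(text: str) -> dict:
    """Return {code point: bytes} for round-trip (|0) entries in the CHARMAP section."""
    mapping = {}
    in_charmap = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "CHARMAP":
            in_charmap = True
        elif line == "END CHARMAP":
            in_charmap = False
        elif in_charmap:
            m = LINE.match(line)
            if m and m.group(3) == "0":
                code_point = int(m.group(1), 16)
                encoded = bytes.fromhex(m.group(2).replace("\\x", ""))
                mapping[code_point] = encoded
    return mapping

# Hand-written sample in the style of a .ucm file (assumed for illustration).
SAMPLE = r"""
<code_set_name> "example"
CHARMAP
<U0041> \x41 |0
<U3000> \x81\x40 |0
<U00A5> \x5C |1
END CHARMAP
"""

table = parse_ucm(SAMPLE)
print({hex(cp): b.hex() for cp, b in table.items()})
```

Listing the entries this way before and after a change makes it easier to confirm that only the intended characters were remapped.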
Customizing Non-ICU Japanese Mappings
To customize any of the following Japanese encodings, contact Technical Support Engineering.
DEC
IBM78EBCDIC
IBM78EBCDIK
IBM83EBCDIC
IBM83EBCDIK
JEF78EBCDIC
JEF78EBCDIK
JEF83EBCDIC
JEF83EBCDIK
JIS78
JIS83
KEIS78EBCDIC
KEIS78EBCDIK
KEIS83EBCDIC
KEIS83EBCDIK
MELCOM
NECJIPSE
NECJIPSEINT
NECJIPSJ
NECJIPSJINT
UnisysLETSJ
Byte Order Marks
You must be familiar with how Unicode data is represented from a binary standpoint. For example, if a field is set with a size of x and x is not large enough to handle incoming double-byte data, the data may be truncated.
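The truncation risk described above comes from the gap between the number of characters in a value and the number of bytes needed to store it. The following Python sketch illustrates that gap. It is not connector code. The helper name fits and the sample value are assumptions, and it assumes that the field size is counted in bytes, which depends on the connector and data type in use.

```python
def fits(value: str, field_size_bytes: int, encoding: str) -> bool:
    """Report whether the encoded value fits in a field of the given byte size."""
    return len(value.encode(encoding)) <= field_size_bytes

text = "データ"   # three characters (sample value assumed for this example)

# The same three characters need a different number of bytes per encoding.
for encoding in ("utf-8", "utf-16-le", "shift_jis"):
    size = len(text.encode(encoding))
    print(encoding, size, fits(text, 4, encoding))
```

A field sized for three single-byte characters is too small for this value in every encoding shown.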
UTF-8 files often begin with a byte order mark (BOM), a short sequence of leading bytes (EF BB BF in UTF-8) at the start of a data stream that distinguishes Unicode data from ASCII data.
For instance, a binary view of your data file may show extra bytes at the beginning of the file. These bytes are not necessary when handling typed data in databases and are removed during the transformation. Usually, a BOM wastes space and complicates string concatenation.
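The concatenation problem can be seen with a few bytes. The following Python sketch is an illustration and does not show how the transformation engine removes the BOM. The function name strip_utf8_bom is hypothetical.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

first = codecs.BOM_UTF8 + b"abc"
second = codecs.BOM_UTF8 + b"def"

# Naive concatenation leaves the second BOM in the middle of the data.
naive = first + second
clean = strip_utf8_bom(first) + strip_utf8_bom(second)

# Python's utf-8-sig codec drops a single leading BOM while decoding.
print(first.decode("utf-8-sig"))
print(clean)
```

Stripping the mark from each piece before joining avoids a stray U+FEFF character in the middle of the combined data.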