Understanding Unicode and Other Encoding Types

Unicode is an international character set. Like other character sets such as American Standard Code for Information Interchange (ASCII), the Unicode character set provides a standard correspondence between the binary numbers that computers understand, and the letters, digits, and punctuation that people understand.

Unlike ASCII, however, Unicode provides a code for nearly every character in nearly every written language in the world. This task requires far more than the 128 characters available in ASCII, or the 256 available in its 8-bit extensions. ASCII is a 7-bit character set, usually stored one character per 8-bit byte, while Unicode was originally designed around 16-bit characters.

Unicode characters are most commonly referred to by their hexadecimal code points, written with at least four digits (0000 to FFFF for the original 16-bit range). The numbers 0 (0000) to 127 (007F) correspond exactly to their ASCII counterparts. The correspondence between the integer values and the actual characters may be found at Unicode’s website.
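As an illustration of the hexadecimal notation and the ASCII overlap described above, the following minimal Python sketch prints code points in the conventional U+XXXX form. The helper name code_point_label is hypothetical and chosen only for this example.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    # ord() gives the integer code point; 04X pads to at least four hex digits.
    return f"U+{ord(ch):04X}"


# Latin capital A, e with acute accent, and the euro sign.
labels = {ch: code_point_label(ch) for ch in "A\u00e9\u20ac"}

# The first 128 code points (0000 to 007F) match ASCII exactly:
# decoding all 128 ASCII byte values yields the same characters
# as building them directly from the integer code points.
ascii_bytes = bytes(range(128))
same_as_ascii = ascii_bytes.decode("ascii") == "".join(chr(i) for i in range(128))

if __name__ == "__main__":
    for ch, label in labels.items():
        print(ch, label)
    print("First 128 code points match ASCII:", same_as_ascii)
```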

Unicode includes the Latin alphabet used for English, the Cyrillic alphabet used for Russian, the Greek, Hebrew, and Arabic alphabets, and other alphabets used in Europe, Africa, and Asia, such as Japanese kana, Korean hangul, and Chinese bopomofo.

A large part of the Unicode standard consists of thousands of unified character codes for Chinese, Japanese, and Korean ideographs. Adopted as an international standard in 1992, Unicode was originally a "double-byte," or 16-bit, binary number code that could represent up to 65,536 items.

No longer limited to 16 bits, Unicode can now represent more than one million code positions (1,114,112, from 0000 to 10FFFF) using three encoding forms called Unicode Transformation Formats (UTF), as shown here.

UTF Format	Number of Bytes	Application
UTF-8	Consists of one-, two-, three-, and four-byte codes	Used in World Wide Web applications. Widely used because it is backward compatible with ASCII: all 128 US-ASCII characters have the same single-byte encodings as they do in ASCII.
UTF-16	Consists of two- and four-byte codes	Used primarily for data storage and text processing. Characters in the original 16-bit range, including most Japanese, Chinese, and Korean characters, take two bytes; all others take four. Sometimes loosely called double-byte, although it is not a legacy double-byte character set (DBCS).
UTF-32	Consists of four-byte codes	Used when character handling efficiency is important.
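
The byte counts in the table can be checked with a short Python sketch. The sample characters and the helper name encoded_lengths are assumptions made for this illustration; the big-endian codec variants are used only so that no byte order mark is added to the counts.

```python
import sys

# Codec names understood by Python; the "-be" forms omit the byte order mark.
CODECS = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-be"), ("UTF-32", "utf-32-be"))


def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes one character occupies in each encoding form."""
    return {name: len(ch.encode(codec)) for name, codec in CODECS}


# Sample characters: ASCII letter, accented Latin letter,
# CJK ideograph, and a character beyond the 16-bit range.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

# 17 planes of 65,536 code points each give the full code space.
total_code_points = 17 * 0x10000

if __name__ == "__main__":
    for ch in samples:
        print(f"U+{ord(ch):04X}", encoded_lengths(ch))
    print("Total code points:", total_code_points)
    print("Highest code point: U+%X" % sys.maxunicode)
```

Only the last sample, which lies outside the original 16-bit range, needs four bytes in UTF-16; UTF-32 uses four bytes for every character.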

ASCII characters are the most commonly known non-Unicode characters. English-language characters usually occupy the first 128 code points (hex 00-7F) in non-Unicode code pages. The remaining code points vary by code page: in the popular Japanese Shift-JIS character set, for example, many of the code points above 7F are reserved as lead bytes of double-byte characters and are not available for single-byte characters.
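
To see how a double-byte code page such as Shift-JIS mixes single-byte and two-byte characters, consider the following small Python sketch. It relies on Python's built-in shift_jis codec; the variable names are chosen only for this example.

```python
# An English letter keeps its single-byte ASCII value in Shift-JIS.
ascii_letter = "A".encode("shift_jis")

# Hiragana letter A (U+3042) needs two bytes: a lead byte and a trail byte.
hiragana_a = "\u3042".encode("shift_jis")

# The lead byte falls above hex 7F, which signals that another byte follows.
lead_byte = hiragana_a[0]
lead_byte_is_above_ascii = lead_byte > 0x7F

if __name__ == "__main__":
    print("A        ->", ascii_letter.hex())
    print("U+3042   ->", hiragana_a.hex())
    print("Lead byte above 7F:", lead_byte_is_above_ascii)
```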

In SQL Server 2000, the non-Unicode character data types are char, varchar, and text. These data types use the character representation scheme defined by a single-byte or double-byte code page in SQL Server.
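
Because non-Unicode data depends on a code page, the same stored byte can stand for different characters. The Python sketch below illustrates the general idea using Python's own cp1252 (Western European) and cp1251 (Cyrillic) codecs; it is not a demonstration of SQL Server behavior, and the variable names are assumptions for this example.

```python
# One stored byte, interpreted under two different single-byte code pages.
stored_byte = b"\xe9"

as_western = stored_byte.decode("cp1252")   # e with acute accent (U+00E9)
as_cyrillic = stored_byte.decode("cp1251")  # Cyrillic short i (U+0439)

# A code page can only store the characters it defines: a Cyrillic
# letter has no representation in the Western European code page.
try:
    "\u0416".encode("cp1252")
    cyrillic_fits_in_cp1252 = True
except UnicodeEncodeError:
    cyrillic_fits_in_cp1252 = False

if __name__ == "__main__":
    print("cp1252:", as_western, "cp1251:", as_cyrillic)
    print("Cyrillic letter fits in cp1252:", cyrillic_fits_in_cp1252)
```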