Concepts and Definitions
This section presents important concepts and definitions of terms used in this chapter.
Character Sets
A character set defines the list of text and other symbolic characters that are recognized by a given hardware and software system. For a system that only needs to recognize the characters used in English, the set can be as small as the letters A-Z and a-z, the numerals 0-9 and a few punctuation symbols. Support for additional languages increases the size of the character set. For example, European languages add characters with accents and other diacriticals. Other languages have completely different characters.
Legacy Character Sets
A character is represented in a computer system as a numerical value (getting from a character to a number and back is called encoding and is discussed below). To represent all of the characters in a large character set requires a large range of numerical values. In order to efficiently use what was once limited and expensive storage resources, computer systems have commonly used a byte (8-bits) to store character representations. This limited the size of the character set to 256 characters. Consequently, we now have a legacy of specialized character sets for representing specific languages, such as Japanese, or limited groups of languages, such as Western European. PSQL supports a number of these legacy character sets.
The Unicode Character Set
The Unicode standard defines a character set that contains every character used in spoken languages in the world (see www.unicode.org). Unicode also expands the concept of a character set by defining additional annotation information to specify letter spacing, right-to-left behavior, word and line breaks, and so forth. This allows applications to properly display and manipulate Unicode text. Applications, and the database, also need this additional information for such actions as case conversion and sorting.
PSQL recognizes the Unicode character set, providing support for character data storage and retrieval in whatever languages are required by the application.
Encoding
Encoding is the association of each character in a character set with a numerical value. Initially, computer systems and system programming languages did not distinguish between characters and bytes and encoding was simply the byte value corresponding to a particular character. In order to respond to the need to display more characters than would be possible to encode in a single byte, different encodings have been defined. Encodings for larger character sets may use multiple bytes per character.
Legacy Encodings
For legacy character sets, the encoding is defined in a code page. You can think of the code page as the lookup table for converting from a character to a value (or a value to a character). It is important that applications that use text always use the same code page. A character that is stored in the database using one code page may be displayed as a different character when read using a different encoding.
The binary unit of all legacy code pages is an 8-bit byte. Most legacy code pages currently in use are supersets of ASCII (American Standard Code for Information Interchange), a 7-bit code defining a character set consisting of 128 control codes and printable characters. By using the 8th bit an additional 128 codes are possible for a total of 256 character encodings in a single byte value.
Microsoft Windows has two groups of code pages, ANSI and OEM code pages. Although code pages in both of these groups are extended ASCII code pages, neither group distinguishes between characters and bytes.
In ANSI (American National Standards Institute) code pages, the non-ASCII values (values greater than 127) represent international characters that are usually customized for a language or group of languages. ANSI code pages are typically used for byte string applications that use a graphical user interface on Windows systems. ANSI code pages are also sometimes referred to as active code pages (ACP, referred to as the client code page in PSQL). A Windows application always has one currently active ANSI code page. For example, the default active code page for English on Windows is code page 1252.
OEM (original equipment manufacturer) code pages are, as the name implies, code pages developed by a given manufacturer for specific systems. These code pages were originally used for MS-DOS and are still used for console applications. The usual OEM code page for English is code page 437. A typical OEM code page has a character set similar to one of the ANSI code pages, but uses a different character order for values greater than 127.
Unicode Encodings
In the Unicode character set, each character is assigned a unique value called a code point. That code point value is then encoded for storage. The code points are organized into planes. Each plane can contain 65536 code points. The first plane, plane 0, is named the Basic Multilingual Plane (BMP) and contains the majority of the code points currently defined. (Unicode has provision for up to 17 planes. At the time of this writing, only the first six contain code points.) The Unicode standard has several methods of encoding the code points. Two that are commonly used are UTF-8 and UCS-2. UTF-8 encodes character code point values to a byte string using one to four bytes per character. UCS-2 encodes character code point values using 16-bit values, often referred to as wide characters.
PSQL recognizes the Basic Multilingual Plane code points and is compatible with applications that use the Unicode encodings UTF-8 for byte strings and UCS-2 for wide character strings. The binary unit for UTF-8 is 8 bit. The binary unit for UCS-2 is 16-bit, wide character.
Declaring Encodings
The database code page is a PSQL database property that declares the encoding of character data stored in the database. The purpose is to help insure that character data can be interpreted correctly. However, the database code page property is just a declaration. PSQL does not validate the encoding of the data and metadata that an application inserts into a database. The application is responsible for ensuring that character data is stored and retrieved in a particular encoding. Note that the database code page only applies to text encoded with legacy code pages or UTF-8. Wide character text is always encoded using UCS-2. A proper setting is required if the engine will be converting between wide character text and byte-string text. The default value of the database code page is the system code page of the OS where the engine is running.
The PSQL SQL access methods infer a client code page for byte strings exchanged between the application and the access method. (Wide character strings are always encoded with UCS-2.) On Windows, the access method assumes that the application is respecting the ACP (Active Code Page) for byte strings. On Linux and OS X, the access method assumes that the application is respecting the encoding of the locale, which is usually given by the LANG environment variable.
PSQL provides methods to ensure compatible encoding between the database engine and clients. For example, an application can specify that it wants the PSQL SQL client to translate data automatically between the database code page and the client application. This is referred to as automatic translation. Note, however, that automatic translation can translate characters only if they are present in the character sets of both code pages (the code page on the server machine and the code page on the client machine).
For backward compatibility, automatic translation in access methods is disabled by default. The application must configure the access method to enable automatic translation. When possible, the recommended method is to set the database code page and configure the access method to read and use that value.
Collation and Sorting
Collation is the general term for the process and function of determining the sorting order of strings of characters. Collation varies by language, so it is not possible to arrange the encoding in multilanguage code pages so that simple binary string comparison produces the desired collation order for every language. If multilevel sorting is a requirement, separate data tables are required to define the correct sorting order.