A character set defines the list of text and other symbolic characters that are recognized by a given hardware and software system. For a system that only needs to recognize the characters used in English, the set can be as small as the letters A-Z and a-z, the numerals 0-9 and a few punctuation symbols. Support for additional languages increases the size of the character set. For example, European languages add characters with accents and other diacriticals. Other languages have completely different characters.
The Unicode standard defines a character set that aims to contain every character used in the world's written languages (see www.unicode.org). Unicode also expands the concept of a character set by defining additional properties for each character that govern letter spacing, right-to-left behavior, word and line breaks, and so forth. These properties allow applications to display and manipulate Unicode text properly. Applications and the database also need this information for actions such as case conversion and sorting.
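As an illustration of the per-character information described above, the following minimal sketch reads a few Unicode properties with Python's standard unicodedata module. The helper name describe is hypothetical and exists only for this example; it is not part of PSQL.

```python
import unicodedata

def describe(ch):
    """Return a few Unicode properties for a single character (hypothetical helper)."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),   # general category, e.g. Lu = uppercase letter
        "bidi": unicodedata.bidirectional(ch),  # L = left-to-right, R = right-to-left
    }

latin = describe("A")
hebrew = describe("\u05d0")  # HEBREW LETTER ALEF

print(latin)
print(hebrew)

# Case conversion also relies on Unicode data: one character may map to two.
print("straße".upper())
```

The bidirectional class is what lets a display layer decide that Hebrew text runs right to left, and the case mapping shows why case conversion cannot be treated as a simple byte-level operation.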
For legacy character sets, the encoding is defined in a code page. You can think of the code page as the lookup table for converting from a character to a value (or a value to a character). It is important that all applications that read or write the same text use the same code page. A character that is stored in the database using one code page may be displayed as a different character when read using a different code page.
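The effect of a code page mismatch can be sketched in a few lines of Python. The two code pages used here, Windows-1252 (Western European) and Windows-1251 (Cyrillic), are chosen only for illustration; any pair of code pages that assign different characters to the same byte value behaves the same way.

```python
# A character written by an application that uses code page 1252.
original = "é"
stored = original.encode("cp1252")   # the single byte 0xE9

# Reading the byte back with the same code page recovers the character.
same = stored.decode("cp1252")

# Reading the same byte with code page 1251 yields a different character.
other = stored.decode("cp1251")

print(stored, same, other)
```

The stored byte never changes. Only the lookup table used to interpret it differs, which is why every application that touches the data needs to agree on the code page.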
In the Unicode character set, each character is assigned a unique value called a code point. That code point value is then encoded for storage. The code points are organized into planes. Each plane can contain 65,536 code points. The first plane, plane 0, is named the Basic Multilingual Plane (BMP) and contains the majority of the code points currently defined. (Unicode has provision for up to 17 planes. At the time of this writing, only the first six contain code points.) The Unicode standard has several methods of encoding the code points. Two that are commonly used are UTF-8 and UCS-2. UTF-8 encodes character code point values to a byte string using one to four bytes per character. UCS-2 encodes character code point values using 16-bit values, often referred to as wide characters.
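The relationship between code points, planes, and encoded lengths can be sketched as follows. The helper name plane is hypothetical. Python has no UCS-2 codec, so this sketch uses the utf-16-le codec as a stand-in: for characters in the BMP it produces the same single 16-bit value that UCS-2 would, while a character outside the BMP needs two 16-bit values (a surrogate pair), which UCS-2 itself cannot represent.

```python
def plane(ch):
    """Return the Unicode plane of a character: the code point divided by 65,536."""
    return ord(ch) >> 16

samples = ["A", "é", "€", "\U0001F600"]

for ch in samples:
    code_point = ord(ch)
    utf8 = ch.encode("utf-8")
    wide = ch.encode("utf-16-le")  # stand-in for UCS-2 (an assumption, see above)
    print(
        f"U+{code_point:04X}  plane={plane(ch)}  "
        f"utf8_bytes={len(utf8)}  16bit_units={len(wide) // 2}"
    )
```

The first three samples are in plane 0 and take one, two, and three bytes in UTF-8 but a single 16-bit unit each. The last sample is in plane 1, takes four bytes in UTF-8, and does not fit in one 16-bit unit.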
The PSQL SQL access methods infer a client code page for byte strings exchanged between the application and the access method. (Wide character strings are always encoded with UCS-2.) On Windows, the access method assumes that the application is respecting the ACP (Active Code Page) for byte strings. On Linux and OS X, the access method assumes that the application is respecting the encoding of the locale, which is usually given by the LANG environment variable.
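To see which encoding a given environment would report, an application can query the locale. The following is a minimal sketch that uses Python's standard locale module. It shows what Python reports for the current process, not what the PSQL access method does internally, and the function name inferred_client_encoding is hypothetical.

```python
import locale

def inferred_client_encoding():
    """Return the encoding the current environment reports for byte strings.

    On Windows this is typically the active code page (for example cp1252).
    On Linux and OS X it is derived from the locale, usually set through LANG.
    Python's UTF-8 mode, if enabled, can override the reported value.
    """
    return locale.getpreferredencoding(False)

encoding = inferred_client_encoding()
print("Byte strings are assumed to use:", encoding)
```

Printing this value on both the client that writes the data and the client that reads it is a quick way to check whether they assume the same code page.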