Choosing a Character Set and Encoding
To implement a globalization strategy, you typically begin by identifying the character set required to satisfy the languages and other text and character requirements of your application. The next step is to choose the encodings that support that character set. The encoding may even differ between the database and the client applications. Let’s look at some examples.
The most global character set is Unicode. Even if clients use legacy character sets, their text can always be translated to Unicode for storage in PSQL. For a new application or a new module, storing text in UCS-2 or UTF-8 encoding is a simple approach. However, not all applications are new.
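For example, a new module could store Unicode text either in NCHAR columns, which always hold UCS-2, or in CHAR and VARCHAR columns of a database whose code page is declared as UTF-8. The following sketch uses illustrative table and column names:

    -- UCS-2 storage: NCHAR columns hold UCS-2 text regardless of the database code page
    CREATE TABLE PersonU2 (
        Id    INTEGER NOT NULL,
        Name  NCHAR(100)
    );

    -- UTF-8 storage: CHAR/VARCHAR columns follow the database code page,
    -- so in a database declared with a UTF-8 code page this column holds UTF-8 text
    CREATE TABLE PersonU8 (
        Id    INTEGER NOT NULL,
        Name  VARCHAR(100)
    );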
Another consideration for applications is the technology of the client programs. If the application uses the .NET Framework, the Java VM, or the UNICODE option with C/C++, it is already processing text as wide character strings. In these situations the main considerations are configuring PSQL to preserve that text and choosing how to store it.
If the application is using byte strings in C/C++ and the legacy PSQL ODBC driver, there are two possible paths to globalization. One is to port the application to use wide character strings; the other is to let the application continue to support the legacy code page of the client where it is installed and to arrange for translation to Unicode storage.
A very conservative approach for existing applications is to continue using your current legacy code page and take advantage of the other languages that it supports. For example, an application developed for the English-speaking market on Windows using ANSI code page 1252 or OEM code page 850 can also support Western European languages without any change in application storage. The main changes would be to localize program text.
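For example, an existing table in a database declared with code page 1252 can already accept French, German, or Spanish text in its CHAR and VARCHAR columns. The names below are illustrative:

    -- Table in a database whose code page is ANSI 1252
    CREATE TABLE Product (
        Sku           CHAR(10),
        Description   VARCHAR(80)   -- interpreted in code page 1252
    );

    -- Western European text fits in code page 1252 with no change to storage
    INSERT INTO Product VALUES ('A-100', 'Crème brûlée, paquet de 6');
    INSERT INTO Product VALUES ('B-200', 'Gebäck, große Packung');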
Note: User Data and Metadata
PSQL has two types of text that it must handle. The first is user data, which is manipulated mostly by the application but also by index ordering and by SQL string functions. The second type is metadata, which consists of the names of SQL objects, such as tables, columns, and indexes. Metadata is never stored in UCS-2; it always follows the legacy code page declared for the database. A SQL query can contain both user data, in string literals, and metadata, in object names. Thus, when discussing SQL queries we must distinguish the character sets of user data and metadata, even when the SQL text as a whole uses one of the Unicode encodings.
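For example, in the illustrative query below, the table name Product and the column name Description are metadata that follow the database code page, while the quoted string is user data:

    SELECT Description
    FROM Product
    WHERE Description = 'Crème brûlée, paquet de 6'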
PSQL is not prepared to handle mixed encodings in text storage. If text with mixed encodings must be stored, the application should store it as BINARY data and handle all encoding translation itself. PSQL assumes that all CHAR type data and SQL metadata respect the database code page, and that all NCHAR type data is UCS-2.
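As an illustration of these assumptions, the hypothetical table below keeps database-code-page text, UCS-2 text, and application-managed bytes in separate columns:

    CREATE TABLE Document (
        Code     CHAR(20),        -- CHAR data: interpreted in the database code page
        Title    NCHAR(100),      -- NCHAR data: always UCS-2
        Payload  LONGVARBINARY    -- opaque bytes; the application performs any encoding translation
    );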
The following sections cover specific storage cases: Multilingual Database Support With Unicode UTF-8 and Multilingual Database Support With Unicode UCS-2. They are followed by a section on handling legacy OEM code pages, Multilingual Database Support with Legacy and OEM Encodings.