Multilingual Database Support With Unicode UTF-8

If you choose to store text as UTF-8, you continue to use the CHAR, VARCHAR, and LONGVARCHAR relational types. You also need to consider the Unicode support of the operating system on which your application runs, the string manipulation libraries available to your application, the PSQL access methods your application uses, and any columns that may need a different data type.

When to Use Unicode UTF-8

Unicode UTF-8 encoding is a good choice for the following:

• You want to add new language support to an existing application but keep application changes fairly minimal. For example, you have a PSQL database with ANSI-only characters (English, for instance) and you want to extend your application to include data in English, German, Polish, and Czech. UTF-8 stores European scripts such as these compactly because their letters require, at most, two bytes per character.

• You are developing a web application, since many web platforms use UTF-8. Because Unicode UTF-8 is ASCII-compatible and compact for Latin-based language character sets, it is often used as a standard encoding for interchange of Unicode text.

• You are developing a Linux or OS X application that supports UTF-8 string handling.

• You are running a PSQL server on OS X.

Unicode UTF-8 Support in PSQL

One of the code pages supported by PSQL is UTF-8. To store text as UTF-8, set the database code page for your PSQL database to UTF-8.

Note that with UTF-8, strings are stored as byte strings. For byte strings, PSQL provides the relational data types CHAR, VARCHAR, and LONGVARCHAR, and the Btrieve data types STRING and ZSTRING. See also Data Types in SQL Engine Reference. Columns will likely need to be wider when storing UTF-8 because characters in European languages often require two bytes each, instead of the single byte used by legacy code pages.
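The effect on column width can be estimated before any schema change. The following is a minimal Python sketch, not part of PSQL: the sample words, the helper name byte_widths, and the choice of Windows code pages 1252 and 1250 as the legacy code pages are all assumptions made for illustration.

```python
# Compare the storage size of sample words in a legacy single-byte code page
# and in UTF-8. The words and code pages below are illustrative assumptions.
SAMPLES = [
    ("hello", "cp1252"),    # English, Windows Western European
    ("Straße", "cp1252"),   # German
    ("Łódź", "cp1250"),     # Polish, Windows Central European
    ("Příliš", "cp1250"),   # Czech
]


def byte_widths(text, legacy_codec):
    """Return (legacy_bytes, utf8_bytes) for one string."""
    return len(text.encode(legacy_codec)), len(text.encode("utf-8"))


if __name__ == "__main__":
    for word, codec in SAMPLES:
        legacy, utf8 = byte_widths(word, codec)
        print(f"{word!r}: {legacy} bytes in {codec}, {utf8} bytes in UTF-8")
```

Pure ASCII text keeps the same width, while each accented letter in these samples grows from one byte to two.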

All string data inserted by your application for existing CHAR, VARCHAR, and LONGVARCHAR data types is interpreted as UTF-8 strings. You can configure the PSQL SQL access methods to automatically translate to UTF-8 (see Access Methods for Unicode UTF-8 Support).

When the database code page is UTF-8 and the client environment supports Unicode (wide character or UTF-8), SQL text supports Unicode characters in CHAR literals. With any other database code page, general Unicode characters must be in NCHAR literals.

Collation and Sorting

PSQL supports only code point order for collation and sorting with UTF-8 storage.
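To see what code point order means in practice, consider the following Python sketch. It does not show how PSQL implements sorting; it only illustrates the ordering itself, using Python's default string comparison (which is by code point) and a byte-wise comparison of the UTF-8 encodings. The function names and the word list are assumptions made for illustration.

```python
def code_point_sort(strings):
    """Sort by Unicode code point, which is Python's default str ordering."""
    return sorted(strings)


def utf8_byte_sort(strings):
    """Sort by the raw UTF-8 bytes, as a byte-wise comparison would."""
    return sorted(strings, key=lambda s: s.encode("utf-8"))


WORDS = ["zebra", "Zürich", "änderung", "apple"]

if __name__ == "__main__":
    print(code_point_sort(WORDS))
    print(utf8_byte_sort(WORDS))
```

Both functions return the same order because byte-wise comparison of UTF-8 preserves code point order. Uppercase letters sort before lowercase ones, and accented letters sort after "z", which differs from the dictionary order that users of a particular language may expect.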

Access Methods for Unicode UTF-8 Support

The PSQL access methods ODBC, JDBC, and ADO.NET support translation to UTF-8 storage. These access methods exchange text values with the application as UCS-2 wide character strings or as legacy byte strings for the ANSI ODBC drivers. When properly configured, the access methods translate the application text values to UTF-8 for transmission to the storage engine.

• If your application uses the ANSI ODBC driver on Windows, all data will be converted by the Windows Driver Manager to the client legacy code page for byte strings. This results in the loss of any characters that are not in the legacy character set. To avoid this loss, you may need to convert your application to use the Unicode ODBC driver.

• If your application uses the ANSI ODBC driver on Linux or OS X, you should set the application locale to use UTF-8 as the string encoding. For completeness, also declare pvtranslate=auto in the connection string and declare the database code page to be UTF-8.

• For JDBC, your application needs to specify pvtranslate=auto in the connection string to the JDBC driver. See Connection String Overview in JDBC Driver Guide.

• For ADO.NET, your application needs to specify pvtranslate=auto in the connection string to the database engine. See Adding Connections in Data Provider for .NET Guide.
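The character loss described for the ANSI ODBC driver on Windows can be illustrated outside of ODBC. The Python sketch below round-trips text through a legacy code page. It is only an analogy: the helper name is hypothetical, code page 1252 is an assumed client code page, and Python's "replace" error handler stands in for whatever substitution the Windows Driver Manager actually performs.

```python
def through_legacy_code_page(text, codec="cp1252"):
    """Round-trip text through a legacy code page.

    Characters that the code page cannot represent are replaced with "?",
    which models the loss that occurs in a forced legacy conversion.
    """
    legacy_bytes = text.encode(codec, errors="replace")
    return legacy_bytes.decode(codec)


if __name__ == "__main__":
    for word in ["Straße", "Łódź"]:
        print(word, "->", through_legacy_code_page(word))
```

German text survives because its characters exist in code page 1252, while the Polish letters that are missing from that code page are lost.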

Migrating an Existing Database to Unicode UTF-8

All text data must be converted from any legacy code page to UTF-8. Columns will likely need to be widened to accommodate the longer UTF-8 byte strings. Any non-ASCII metadata, such as table names, must be converted from the legacy code page to UTF-8. Given these combined changes, it is reasonable to migrate the database by copying from the old schema, using the legacy code page, to the new schema with UTF-8 as the database code page.
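The conversion and widening steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a PSQL migration tool: the function names are hypothetical, the values are assumed to have been read from the old database as raw bytes, and code page 1250 is an assumed legacy code page.

```python
def transcode_to_utf8(legacy_bytes, legacy_codec):
    """Decode a legacy-encoded value and re-encode it as UTF-8."""
    return legacy_bytes.decode(legacy_codec).encode("utf-8")


def required_width(values, legacy_codec):
    """Return the widest UTF-8 byte length among the transcoded values."""
    return max(
        (len(transcode_to_utf8(v, legacy_codec)) for v in values),
        default=0,
    )


if __name__ == "__main__":
    # Hypothetical column values as stored under code page 1250.
    city_column = [b"Praha", b"\xa3\xf3d\x9f"]  # the second value is "Łódź"
    print(required_width(city_column, "cp1250"))
```

Running required_width over each text column of the old schema gives a lower bound for the column width needed in the new UTF-8 schema.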

Note: In the special case where all existing data and metadata are pure ASCII, it is possible to just change the database code page to UTF-8, because all existing (7-bit) ASCII byte strings are also valid UTF-8 byte strings.
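Whether a data set qualifies for this shortcut can be checked by scanning for bytes above 0x7F. The Python sketch below assumes the values have already been read from the database as raw bytes; the function name and sample values are hypothetical.

```python
def is_pure_ascii(values):
    """Return True if every byte string contains only 7-bit ASCII bytes."""
    return all(v.isascii() for v in values)


if __name__ == "__main__":
    # Hypothetical samples: two ASCII names and one code page 1252 value.
    print(is_pure_ascii([b"Customer", b"Order_Id"]))
    print(is_pure_ascii([b"Customer", b"Caf\xe9"]))
    # An ASCII byte string decodes to the same text under either encoding.
    print(b"Customer".decode("ascii") == b"Customer".decode("utf-8"))
```

If the check fails for any data or metadata value, the copy-based migration applies instead.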