Database Globalization
Globalization Features of PSQL
This chapter contains the following sections:
Overview
Concepts and Definitions
Choosing a Character Set and Encoding
Multilingual Database Support with Unicode UTF-8
Multilingual Database Support with Unicode UCS-2
Multilingual Database Support with Legacy and OEM Encodings
Database Code Page and Client Encoding
Unicode Support in PSQL Utilities
Support for Collation and Sorting
Locale Support
Overview
In this context, globalization means adapting computer software to different languages. It is now commonplace for data to be accessed by users around the globe and for applications to present that data in the user's own language. Support for globalization in PSQL allows your application to store text in multiple languages in the same database. This means that your application can store, process, and retrieve data in whatever language is required.
This chapter explains the PSQL features with which you can support globalization for your applications. It discusses overall approaches to globalization and the particular PSQL features that support globalized applications. By default, PSQL remains backward compatible with legacy text encodings and this chapter also discusses the settings that ease globalization for applications using legacy encodings.
Concepts and Definitions
This section presents important concepts and definitions of terms used in this chapter.
Character Sets
A character set defines the list of text and other symbolic characters that are recognized by a given hardware and software system. For a system that only needs to recognize the characters used in English, the set can be as small as the letters A-Z and a-z, the numerals 0-9 and a few punctuation symbols. Support for additional languages increases the size of the character set. For example, European languages add characters with accents and other diacriticals. Other languages have completely different characters.
Legacy Character Sets
A character is represented in a computer system as a numerical value (getting from a character to a number and back is called encoding and is discussed below). Representing all of the characters in a large character set requires a large range of numerical values. To make efficient use of what were once limited and expensive storage resources, computer systems have commonly used a single byte (8 bits) to store character representations. This limited the size of the character set to 256 characters. Consequently, we now have a legacy of specialized character sets for representing specific languages, such as Japanese, or limited groups of languages, such as Western European. PSQL supports a number of these legacy character sets.
The Unicode Character Set
The Unicode standard defines a character set that contains every character used in spoken languages in the world (see www.unicode.org). Unicode also expands the concept of a character set by defining additional annotation information to specify letter spacing, right-to-left behavior, word and line breaks, and so forth. This allows applications to properly display and manipulate Unicode text. Applications, and the database, also need this additional information for such actions as case conversion and sorting.
PSQL recognizes the Unicode character set, providing support for character data storage and retrieval in whatever languages are required by the application.
Encoding
Encoding is the association of each character in a character set with a numerical value. Initially, computer systems and system programming languages did not distinguish between characters and bytes and encoding was simply the byte value corresponding to a particular character. In order to respond to the need to display more characters than would be possible to encode in a single byte, different encodings have been defined. Encodings for larger character sets may use multiple bytes per character.
Legacy Encodings
For legacy character sets, the encoding is defined in a code page. You can think of the code page as the lookup table for converting from a character to a value (or a value to a character). It is important that applications that use text always use the same code page. A character that is stored in the database using one code page may be displayed as a different character when read using a different encoding.
The binary unit of all legacy code pages is an 8-bit byte. Most legacy code pages currently in use are supersets of ASCII (American Standard Code for Information Interchange), a 7-bit code defining a character set consisting of 128 control codes and printable characters. By using the eighth bit an additional 128 codes are possible for a total of 256 character encodings in a single byte value.
Microsoft Windows has two groups of code pages, ANSI and OEM code pages. Although code pages in both of these groups are extended ASCII code pages, neither group distinguishes between characters and bytes.
In ANSI code pages, non-ASCII values (those greater than 127) represent international characters that are usually customized for a language or group of languages. ANSI code pages are typically used for byte string applications that use a graphical user interface on Windows systems. An ANSI code page is also sometimes referred to as the active code page (ACP, referred to in PSQL as the client code page). A Windows application always has one currently active code page. For example, the default active code page for English on Windows is code page 1252.
OEM (original equipment manufacturer) code pages are, as the name implies, code pages developed by a given manufacturer for specific systems. These code pages were originally used for MS-DOS and are still used for console applications. The usual OEM code page for English is code page 437. A typical OEM code page has a character set similar to an ANSI code page but uses a different character order for values greater than 127.
Unicode Encodings
In character encoding systems, each character is assigned a unique value called a code point which can be used for encoding data. The code points are organized into planes. Each plane can contain 65,536 (2^16) code points. Unicode has provision for up to 17 planes. The first plane, plane 0, is named the Basic Multilingual Plane (BMP) and contains the majority of the code points currently defined. At the time of this writing, only planes 0, 1, 2, 15, and 16 contain code points. The Unicode standard has several methods of encoding the code points. Two that are commonly used are UTF-8 and UCS-2. UTF-8 encodes character code point values to a byte string using 1-4 bytes per character. UCS-2 encodes character code point values using 16-bit values, often referred to as wide characters.
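As a concrete illustration, the following Java sketch (using only the standard java.nio.charset API, not a PSQL API) encodes the same three BMP characters in both forms and reports the storage sizes:

import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        // 'A' (U+0041), 'é' (U+00E9), and '語' (U+8A9E) are all BMP code points.
        String text = "Aé語";

        // UTF-8 uses 1-4 bytes per code point: 1 + 2 + 3 = 6 bytes here.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        // UCS-2/UTF-16 uses one 16-bit unit per BMP code point: 3 units = 6 bytes.
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16LE);

        System.out.println("UTF-8 bytes:  " + utf8.length);   // 6
        System.out.println("UTF-16 bytes: " + utf16.length);  // 6
    }
}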
PSQL recognizes the Basic Multilingual Plane code points and is compatible with applications that use the Unicode encodings UTF-8 for byte strings and UCS-2 for wide character strings. The binary unit for UTF-8 is the 8-bit byte; the binary unit for UCS-2 is the 16-bit wide character.
Declaring Encodings
The database code page is a PSQL database property that declares the encoding of character data stored in the database. The purpose is to help ensure that character data can be interpreted correctly. However, the database code page property is simply a declaration. PSQL does not validate the encoding of the data and metadata that an application inserts into a database. The application is responsible for ensuring that character data is stored and retrieved in a particular encoding. Note that the database code page applies only to text encoded with legacy code pages or UTF-8. Wide character text is encoded using UCS-2. A proper database code page setting is required for the engine to convert between wide character text and byte-string text. The default value of the database code page is the system code page of the operating system where the engine is running.
In PSQL the SQL access methods infer a client code page for byte strings exchanged between the application and the access method. On Windows, the access method assumes that the application is respecting the active code page (ACP) for byte strings. On Linux, macOS, and Raspbian, the access method assumes that the application is respecting the encoding of the locale, which is usually set in the LANG environment variable.
PSQL provides methods to ensure compatible encoding between the database engine and clients. For example, an application can specify that it wants the PSQL SQL client to translate data automatically between the database code page and the client encoding. This is referred to as automatic translation. Note, however, that automatic translation can translate characters only if they are present in the character sets of both code pages (the code page on the server machine and the code page on the client machine).
For backward compatibility, automatic translation in access methods is disabled by default. The application must configure the access method to enable automatic translation. When possible, the recommended method is to set the database code page and configure the access method to read and use that value.
Collation and Sorting
Collation is the general term for the process and function of determining the sorting order of strings of characters. Collation varies by language, so it is not possible to arrange the encoding in multilanguage code pages so that simple binary string comparison produces the desired collation order for every language. If multilevel sorting is a requirement, separate data tables are required to define the correct sorting order.
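To see why binary comparison cannot stand in for language-aware collation, consider this short Java sketch using the standard java.text.Collator API (not a PSQL API):

import java.text.Collator;
import java.util.Locale;

public class CollationDemo {
    public static void main(String[] args) {
        // Binary (code point) comparison puts 'ä' (U+00E4) after 'z' (U+007A)...
        System.out.println("binary: " + "ä".compareTo("z"));        // positive: 'ä' > 'z'

        // ...but German dictionary collation sorts 'ä' with 'a', before 'z'.
        Collator german = Collator.getInstance(Locale.GERMAN);
        System.out.println("german: " + german.compare("ä", "z"));  // negative: 'ä' < 'z'
    }
}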
Choosing a Character Set and Encoding
To implement a globalization strategy you typically begin by identifying the character set required based on the languages or other text and character requirements you need to satisfy. The next step is to choose the encodings that support the character set. The encoding used may even be different for the database and for client applications. Let’s look at some examples.
The most global character set is the Unicode character set. Even if legacy character sets are used in the clients, they can all be translated to Unicode for storage in PSQL. For a new application or a new module, storing text in UCS-2 or UTF-8 encoding is a simple approach. However, not all applications are new.
Another consideration for applications is the technology of the client programs. If the application uses the .NET framework, the Java VM, or the UNICODE option with C/C++, the application is already processing text using wide character strings. In these situations the main consideration is configuring PSQL to preserve that text and choosing how to store it.
If the application is using byte strings in C/C++ and the legacy PSQL ODBC driver, there are two possible paths to globalization. One is to port the application to use wide character strings; the other is to let the application continue to support the legacy code page of the client where it is installed and to arrange for translation to Unicode storage.
A very conservative approach for existing applications is to continue using your current legacy code page and take advantage of the other languages that it supports. For example, an application developed for the English-speaking market on Windows using ANSI code page 1252 or OEM code page 850 can also support Western European languages without any change in application storage. The main changes would be to localize program text.
Note User Data and Metadata
PSQL has two types of text that it must handle. The first is user data, which is manipulated mostly by the application, and also by index ordering and by SQL string functions. The second type is metadata, which is the names of SQL objects such as tables, columns, and indexes. Metadata is not stored in UCS-2 encoding; it follows the legacy code page given by the database code page declaration. SQL queries can contain both user data in string literals and metadata in object names. Thus when discussing SQL queries, we must distinguish the character sets of user data and metadata, even when we are using one of the Unicode encodings for the SQL text as a whole.
PSQL is not prepared to handle mixed encodings in text storage. The application should consider such text to be BINARY storage and handle all encoding translations in the application. PSQL assumes that all CHAR type data and SQL metadata respect the database code page, and that all NCHAR data is UCS-2.
The following topics cover specific storage cases:
Multilingual Database Support with Unicode UTF-8
Multilingual Database Support with Unicode UCS-2
In addition, the following topic covers handling of legacy OEM code pages:
Multilingual Database Support with Legacy and OEM Encodings
Multilingual Database Support with Unicode UTF-8
If you choose to store text as UTF-8 you will continue to use the CHAR, VARCHAR, and LONGVARCHAR relational types. You also need to consider such aspects as the Unicode support for the operating system on which your application runs, the string manipulation libraries available to your application, the PSQL access methods your application uses, any columns that may need a different data type, and so forth.
When to Use Unicode UTF-8
Unicode UTF-8 encoding is a good choice for the following:
You want to add new language support to an existing application but keep application changes fairly minimal. For example, you have a PSQL database with ANSI-only characters (English, for instance). You want to extend your application to include data in English, German, Polish, and Czech. UTF-8 provides compact storage for European scripts because it requires at most two bytes per character.
Web applications, since many web platforms use UTF-8. Because Unicode UTF-8 is ASCII-compatible and compact for Latin-based language character sets, it is often used as a standard encoding for interchange of Unicode text.
A Linux or macOS application that supports UTF-8 string handling
A PSQL server on macOS
Unicode UTF-8 Support in PSQL
One of the code pages supported by PSQL is UTF-8. For UTF-8 text storage, you would set the DB code page for your PSQL database to UTF-8.
Note that with UTF-8, string storage is byte strings. For byte strings, PSQL provides the relational data types CHAR, VARCHAR, and LONGVARCHAR, and the Btrieve data types STRING and ZSTRING. See also Data Types in SQL Engine Reference. Columns will likely be wider when storing UTF-8 because European languages often require two bytes per character instead of a single byte for legacy code pages.
All string data inserted by your application for existing CHAR, VARCHAR and LONGVARCHAR data types are interpreted as UTF-8 strings. You can configure the PSQL SQL access methods to automatically translate to UTF-8 (see Access Methods for Unicode UTF-8 Support).
When the database code page is UTF-8 and the client environment supports Unicode (wide character or UTF-8), SQL text supports Unicode characters in CHAR literals. With any other database code page, general Unicode characters must be in NCHAR literals.
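For example, the two literal forms differ only in the N prefix, as in the following sketch (the table and column names are hypothetical):

public class LiteralForms {
    // Usable for Unicode text only when the database code page is UTF-8:
    static final String CHAR_LITERAL =
        "SELECT id FROM person WHERE last_name = 'Müller'";

    // Required for general Unicode text with any other database code page:
    static final String NCHAR_LITERAL =
        "SELECT id FROM person WHERE last_name = N'Müller'";
}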
Collation and Sorting with Unicode UTF-8
By default, PSQL supports code point order for collation and sorting with UTF-8 storage.
Access Methods for Unicode UTF-8 Support
The PSQL access methods ODBC, JDBC, and ADO.NET support translation to UTF-8 storage. These access methods exchange text values with the application as UCS-2 wide character strings or as legacy byte strings for the ANSI ODBC drivers. When properly configured, the access methods translate the application text values to UTF-8 for transmission to the storage engine.
If your application uses the ANSI ODBC driver on Windows, all data will be converted by the Windows Driver Manager to the client legacy code page for byte strings. This results in the loss of any characters that are not in the legacy character set. You may also need to convert your application to use the Unicode ODBC driver.
If your application uses the ANSI ODBC driver on Linux or macOS, you should set the application locale to use UTF-8 as the string encoding. For completeness, also declare pvtranslate=auto in the connection string and declare the database code page to be UTF-8.
For JDBC, your application needs to specify pvtranslate=auto in the connection string to the JDBC driver. See Connection String Overview in JDBC Driver Guide.
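A minimal connection sketch follows; the URL host, port, and database name are placeholders, and the exact syntax for appending pvtranslate=auto should be confirmed in Connection String Overview in JDBC Driver Guide:

import java.sql.Connection;
import java.sql.DriverManager;

public class Utf8Connection {
    public static void main(String[] args) throws Exception {
        // Placeholder server, port, and database; pvtranslate=auto requests
        // automatic translation between application strings and the UTF-8
        // database code page.
        String url = "jdbc:pervasive://localhost:1583/DEMODATA;pvtranslate=auto";
        try (Connection conn = DriverManager.getConnection(url)) {
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}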
For ADO.NET, your application needs to specify pvtranslate=auto in the connection string to the database engine. See Adding Connections in Data Provider for .NET Guide.
Migrating an Existing Database to Unicode UTF-8
All text data must be converted from any legacy code page to UTF-8. Columns will likely need to be widened to accommodate the longer UTF-8 byte strings. Any non-ASCII metadata, such as table names, must be converted from the legacy code page to UTF-8. Given these combined changes, it is reasonable to migrate the database by copying from the old schema, using the legacy code page, to the new schema with UTF-8 as the database code page.
Note In the special case where all existing data and metadata is pure ASCII, it is possible to just change the database code page to UTF-8.
All existing (7-bit) ASCII byte strings are also valid UTF-8 byte strings.
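A simple way to verify that special case is to scan the existing byte strings before changing the declaration. A minimal Java sketch of such a check (how the data is fetched is up to the application):

public class AsciiCheck {
    // Returns true only if every byte is 7-bit ASCII. If all stored text and
    // metadata pass this test, the bytes are already valid UTF-8 and the
    // database code page can simply be redeclared as UTF-8.
    static boolean isPureAscii(byte[] data) {
        for (byte b : data) {
            if ((b & 0x80) != 0) {  // high bit set: not 7-bit ASCII
                return false;
            }
        }
        return true;
    }
}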
Multilingual Database Support with Unicode UCS-2
If you choose to store text as UCS-2 you will use the NCHAR, NVARCHAR, and NLONGVARCHAR relational types. This has no effect on your ability to also store other text in the CHAR family of relational types. You also need to consider such aspects as the Unicode support for the operating system on which your application runs, the string manipulation libraries available to your application, the PSQL access methods your application uses, any columns that may need a different data type, and so forth.
When to Use Unicode UCS-2
Unicode UCS-2 is a good option for the following situations:
Your application supports Asian character data. With UCS-2, all characters are stored as two bytes which provides more compact storage than UTF-8 for Asian character data.
Your application will not be globalizing all of the storage and so will have wide character columns together with legacy byte string columns.
Your application needs better compatibility for wide character data with Java, ADO.NET, or wide character clients. Such applications use UCS-2 for application strings.
Unicode UCS-2 Support in PSQL
Wide character storage is a separate type from byte-string storage, and the two may be used side by side.
For wide character strings, PSQL provides the relational data types NCHAR, NVARCHAR, and NLONGVARCHAR, and the Btrieve data types WSTRING and WZSTRING. All data inserted by your application for NCHAR, NVARCHAR, NLONGVARCHAR, WSTRING and WZSTRING data types is interpreted as wide character strings.
PSQL supports Unicode characters in NCHAR literals in SQL query text. Text data in CHAR literals is translated to the database code page. Keep in mind that database code pages other than UTF-8 cannot handle most Unicode characters.
Collation and Sorting with Unicode UCS-2
By default, PSQL supports code point order for collation and sorting with UCS-2 storage.
Access Methods for Unicode UCS-2 Support
Using NCHAR storage with the SQL access methods requires that applications specify the appropriate NCHAR SQL types to the access method. Use of CHAR types can cause conversion from NCHAR to CHAR and loss of data if the database code page is not UTF-8.
ODBC applications should use the SQL_WCHAR, SQL_WVARCHAR, and SQL_WLONGVARCHAR SQL types for parameters.
Windows ODBC applications should use wide character strings, identified to ODBC as SQL_C_WCHAR. Also, Windows ODBC applications should use the PSQL Unicode ODBC driver. Use of the legacy ANSI ODBC driver will result in loss of data because the Windows Driver Manager will convert all wide character strings to byte strings.
Linux, macOS, and Raspbian ODBC applications should use a locale with a UTF-8 encoding and use UTF-8 in application byte strings (SQL_C_CHAR) to provide data for NCHAR values (SQL_WCHAR).
JDBC applications should set string parameters to type NVARCHAR or use the ‘N’ methods such as setNString() to send string data as an NCHAR SQL type. Also, set pvtranslate=auto in the connection string and set the database code page. Otherwise, JDBC will remain backward compatible and use the declared or default byte-string encoding for string data. See Connection String Overview in JDBC Driver Guide.
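A sketch of that parameter pattern follows; the connection URL and table name are placeholders, and setNString is the standard JDBC 4.0 method for sending a value as an NCHAR SQL type:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class NcharParameterDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; pvtranslate=auto enables automatic translation.
        String url = "jdbc:pervasive://localhost:1583/DEMODATA;pvtranslate=auto";
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO person (last_name) VALUES (?)")) {
            // setNString sends the value as an NCHAR SQL type (wide character),
            // so the text is not forced through the database code page.
            ps.setNString(1, "日本語");
            ps.executeUpdate();
        }
    }
}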
ADO.NET applications need to specify pvtranslate=auto in the connection string to the database engine. See Adding Connections in Data Provider for .NET Guide.
In all cases, use NCHAR literals in SQL text.
Migrating an Existing Database to Unicode UCS-2
The tasks for migrating an existing database to support a UCS-2 application involve converting byte strings to wide character strings. The extent of the conversion depends on your application. ALTER TABLE can be used to convert columns provided that the database code page is set to match the encoding for existing CHAR columns. It is only necessary to convert columns that will store globalized data. Columns that will never store characters outside of the character set of the database code page can remain as CHAR types.
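A column-by-column conversion might look like the following sketch; the table and column names are hypothetical, and the exact ALTER TABLE syntax should be verified in SQL Engine Reference:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class WidenColumn {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:pervasive://localhost:1583/DEMODATA"; // placeholder URL
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Convert one byte-string column to a wide character type. The
            // engine interprets the existing CHAR bytes using the database
            // code page, so that code page must be declared correctly first.
            stmt.executeUpdate(
                    "ALTER TABLE person ALTER COLUMN last_name NVARCHAR(50)");
        }
    }
}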
If your application uses the ANSI ODBC driver on Windows, you need to convert your application to wide character data and use the PSQL Unicode ODBC driver.
Multilingual Database Support with Legacy and OEM Encodings
If you choose to store text using a legacy or OEM code page, you will use the CHAR, VARCHAR, and LONGVARCHAR relational types. You may use wide character applications if you configure the access method to do conversion to your legacy code page.
When to Use a Legacy Code Page
Using a legacy code page is a compact and efficient way to store text data provided that the legacy code page defines a character set that meets the needs of your application.
Legacy Code Page Support in PSQL
The PSQL database engine assumes that CHAR data is encoded using the database code page. The access methods have configuration options that help ensure that this is the case.
Older applications that use OEM code pages for storage did not declare the database code page. You can continue to work with this situation so long as the application is careful to use the appropriate relational string functions.
Collation and Sorting with Legacy Code Pages
The default collation in PSQL is binary ordering of the encoded bytes. The declared database code page does not affect this default collation.
PSQL provides both alternate collating sequence (ACS) and international sorting rules (ISR) mechanisms for controlling collation. These may be declared on Btrieve indexes and on relational columns. PSQL will always use a particular ACS for the ASCII character set when using the CASE declaration on a relational column or the case-insensitive bit on a Btrieve index.
Access Methods for Legacy Code Pages
All PSQL access methods support byte-string text. Some assume that the database code page is the same as the application’s code page, and some allow configuration.
The ANSI ODBC drivers provide a DSN setting, OEM/ANSI translation, which declares that the application OEM code page (as distinct from ANSI code page) is the database code page. The ANSI ODBC drivers also support the encoding property in connection strings and the pvtranslate=auto option.
The Unicode ODBC driver always operates as if the PvTranslate=True property is set and translates application CHAR data to the database code page as needed.
The JDBC driver has an explicit encoding property in the connection string. The driver will set encoding to the database code page if the pvtranslate=auto property is given.
The ADO.NET provider has an explicit encoding connection property. The driver will set Encoding to the database code page if the PvTranslate=Auto connection property is given.
The PDAC access method for Delphi has the OEMConversion property that controls whether CHAR data is sent to the engine using the OEM code page or the ANSI (ACP) code page.
Migrating an Existing Database to a Different Legacy Code Page
Changing the code page of a database requires copying the data to a new database that uses the new code page. The copy must be done by your application; PSQL does not translate existing data from one code page to another.
If your database metadata use characters that require translation, those changes must be made when creating the schema of the new database. Table names are a common example of this type of metadata.
Database Code Page and Client Encoding
As discussed in Concepts and Definitions above, encoding determines how character data is interpreted and therefore how it must be translated between the PSQL database engine and a PSQL client application. PSQL handles much of the complexity of the encoding between client and server across the various combinations of operating system, language, and access method. The encoding support is divided into database code page and client encoding. The two types of encoding are separate but interrelated.
Database code page and client encoding apply only to the Relational Engine. The MicroKernel Engine is not affected.
Database Code Page
The database code page is a database property that specifies the encoding used for character data stored in the database. The default database code page is “server default,” meaning the operating system code page on the server where the database engine is running. (The operating system code page is generally referred to as the “OS encoding,” which is the phrase used throughout this chapter.)
Database code page is particularly handy if you need to manually copy PSQL DDFs to another platform with a different OS encoding and still have the metadata correctly interpreted by the database engine.
When you create a database with PSQL, the default is to use the active code page of the machine on which the database engine is running. For example, on a machine running Windows for English, the code page assigned is 1252. PSQL encoding translation is set to None. Your application must either use the same code page as the PSQL database or ensure that encoding translation is set to Automatic.
Supported Code Pages
PSQL supports the following code pages. All of the pages listed use byte string storage.
ASCII
EUCJP
ISO8859_1
UTF-8
Windows Code Pages
CP437, CP737, CP775, CP850, CP852, CP855, CP857, CP858, CP862, CP866, CP932
CP1250, CP1251, CP1252, CP1253, CP1254, CP1255, CP1256, CP1257, CP1258
Client Encoding
Client encoding is the data encoding used by an application on a PSQL client. An application can manage text in any encoding it chooses. A compatible encoding must be established between the database engine and the client application.
PSQL can automatically translate between different encodings used by the database engine and clients provided the characters are present in both the code page on the server machine and the code page on the client machine.
Data translation, if required, occurs at the client. Translation is not always required – for example, when client and server OS encoding match.
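The effect of a character that is missing from the target code page can be demonstrated with the standard Java charset API; "windows-1252" is the Java name for the ANSI code page 1252 mentioned earlier:

import java.nio.charset.Charset;

public class LossyTranslation {
    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");

        // 'é' exists in code page 1252, so it survives a round trip.
        String kept = new String("é".getBytes(cp1252), cp1252);

        // '語' has no mapping in code page 1252; the encoder substitutes '?'.
        String lost = new String("語".getBytes(cp1252), cp1252);

        System.out.println(kept);  // é
        System.out.println(lost);  // ?
    }
}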
Encoding Support in PCC
You can use PCC to set the database code page when you create a database or to modify the code page setting for an existing database.
Note Changing the database code page property does not change any data in the database. However, changing the database code page for an existing database will affect how existing data entries are interpreted.
PSQL Control Center (PCC) is, itself, a client application to the database engine. As a client, PCC lets you specify the encoding to use for each database session when PCC reads and inserts metadata and data. The default for an existing database is to use the encoding of the machine where PCC is running. This is the legacy behavior of PCC. The default for a new database is to use automatic translation. See PCC Connection Encoding in PSQL User's Guide.
The interaction between the PCC connection encoding setting and the database code page is as follows. PCC connection encoding applies only to PCC. It has no effect on other client applications.
PCC connection encoding set to a specific encoding: PCC ignores the database code page and uses the specified encoding to read and insert CHAR data, string literals, and metadata. NCHAR data is not affected by this setting. (This is the legacy behavior of PCC.)
PCC connection encoding set to Automatic Translation: PCC and the database automatically establish the encoding for CHAR data and metadata. String literals in queries are sent to the engine as Unicode. NCHAR data is not affected by this setting.
Encoding Support in Btrieve API
When using the Btrieve API, you must provide file names and paths in the local encoding used in your application. The Btrieve API handles the differences between OS encoding on the server and client.
Encoding Support in DTI
When using the Distributed Tuning Interface (DTI), you must provide file names and paths in the local encoding used in your application. DTI handles the differences between OS encoding on the server and client.
If you use the DTI API to create a database, you may specify the database code page property at the time of creation. This property may be used by SQL access methods to configure automatic translation of character data.
Encoding Support in ADO.NET
The .NET Framework and .NET applications use UTF-16 strings. These must be translated to a code page when storing text in CHAR columns.
The connection property PVTranslate=Auto sets the connection encoding to the database code page. It is also possible to set the encoding property directly.
For more information, see Adding Connections, PsqlConnectionStringBuilder Object and Character Set Conversions in Data Provider for .NET Guide.
Encoding Support in JDBC
The Java Virtual Machine and Java applications use UTF-16 strings. These must be translated to a code page when storing text in CHAR columns.
The connection property PVTranslate=Auto will set the connection encoding to the database code page. It is also possible to set the encoding property directly.
When the PvTranslate=Auto property is set, the JDBC driver will send string literals to the engine as Unicode. Without this setting, the legacy behavior is to translate string literals to the database code page. If your application uses NCHAR string literals (for example, N'ABC'), it should set the PvTranslate=Auto connection property.
See Connection String Elements in JDBC Driver Guide.
Encoding Support in ODBC
The PSQL ODBC drivers support a number of mechanisms to control client encoding.
When configuring a DSN, it is possible to select the encoding options Automatic, OEM/ANSI, and None. The Automatic setting causes the driver to translate from the client encoding to the database code page. The OEM/ANSI setting causes the driver to translate from the client encoding to the corresponding OEM code page. The None setting prevents the driver from doing any text translation. See Encoding Translation in ODBC Guide for more details.
Legacy Conversion Methods for OEM-to-ANSI Data
If a database has OEM character data in it, a legacy solution is to specify OEM/ANSI conversion in the access method. This topic discusses some legacy methods for Linux clients using OEM character data.
Note While the legacy methods are still supported, the recommendation is to specify the OEM code page for the database and have the access methods use automatic translation as discussed above.
See also OEM/ANSI Conversion in ODBC Guide.
When using ODBC on Japanese Win32 systems, the encoding is expected to be SHIFT-JIS.
Japanese versions of Linux typically have their encoding set to EUC-JP or UTF-8 by default.
When using Japanese versions of Linux, a client can connect to another Linux server (for example, locally) or to a Win32 SHIFT-JIS server. It is also possible to connect to a database encoded in SHIFT-JIS but located on a Linux server.
Use the following instructions for your listed configuration. In each case, it is assumed that the application itself does not do any conversion and uses the encoding that is native for the machine.
Connecting a Linux EUC-JP Client to a Win32 SHIFT-JIS Server
Connecting a Linux UTF-8 Client to a Win32 SHIFT-JIS Server
Connecting a Linux EUC-JP Client to a Linux EUC-JP Server
Connecting a Linux UTF-8 Client to a Linux UTF-8 Server
Connecting a Linux UTF-8 Client to a Linux EUC-JP Server
Connecting a Linux EUC-JP Client to a Linux EUC-JP Server, with SHIFT-JIS Encoding Used to Store Data on the Server
Connecting a Linux EUC-JP Client to a Win32 SHIFT-JIS Server
The server requires that everything is received as SHIFT-JIS. The client requires that the server send everything as EUC-JP.
To accomplish this, the client DSN settings in ODBC.INI (located by default in /usr/local/psql/etc) used to connect to the given database should be set up as follows:
[dbclient]
Driver=/usr/local/psql/lib/libodbcci.so
Description=PSQL ODBC Client Interface: JPN-2000SERVER:1583/dbclient
ServerDSN=DEMODATA
ServerName=JPN-2000SERVER:1583
TranslationDLL=/usr/local/psql/lib/libxlate.so.10
TranslationOption=90000932
The TranslationDLL line sets the translation library for the ODBC client interface to use.
The TranslationOption line specifies that translation is needed from 9000 (EUC-JP) to 0932 (SHIFT-JIS).
Using this example, all data coming from the client will be translated to SHIFT-JIS before it is sent to the server, and to EUC-JP before the data is sent back to the client.
Connecting a Linux UTF-8 Client to a Win32 SHIFT-JIS Server
The server requires that everything is received as SHIFT-JIS. The client requires that the server send everything as UTF-8.
To accomplish this, the client DSN settings in ODBC.INI (by default in /usr/local/psql/etc) used to connect to the given database should be set up as follows:
[dbclient]
Driver=/usr/local/psql/lib/libodbcci.so
Description=PSQL ODBC Client Interface: JPN-2000SERVER:1583/dbclient
ServerDSN=DEMODATA
ServerName=JPN-2000SERVER:1583
TranslationDLL=/usr/local/psql/lib/libxlate.so.10
TranslationOption=90010932
The TranslationDLL line sets the translation library for the ODBC client interface to use.
The TranslationOption line specifies that translation is needed from 9001 (UTF-8) to 0932 (SHIFT-JIS).
Using this example, all data coming from the client will be translated to SHIFT-JIS before it is sent to the server, and to UTF-8 before the data is sent back to the client.
Connecting a Linux EUC-JP Client to a Linux EUC-JP Server
Using this configuration, no changes to the DSN description are needed. Use the DSN as it was created by the dsnadd tool.
Connecting a Linux UTF-8 Client to a Linux UTF-8 Server
Using this configuration, no changes to the DSN description are needed. Use the DSN as it was created by the dsnadd tool. See dsnadd in PSQL User's Guide.
Connecting a Linux UTF-8 Client to a Linux EUC-JP Server
The server requires that everything is received as EUC-JP. The client requires that the server send everything as UTF-8.
To accomplish this, the client DSN settings in ODBC.INI (by default in /usr/local/psql/etc) used to connect to the given database should be set up as follows:
[dbclient]
Driver=/usr/local/psql/lib/libodbcci.so
Description=PSQL ODBC Client Interface: JPN-2000SERVER:1583/dbclient
ServerDSN=DEMODATA
ServerName=JPN-2000SERVER:1583
TranslationDLL=/usr/local/psql/lib/libxlate.so.10
TranslationOption=90019000
The TranslationDLL line sets the translation library for the ODBC client interface to use.
The TranslationOption line specifies that translation is needed from 9001 (UTF-8) to 9000 (EUC-JP).
Using this example, all data coming from the client will be translated to EUC-JP before it is sent to the server, and to UTF-8 before the data is sent back to the client.
Connecting a Linux EUC-JP Client to a Linux EUC-JP Server, with SHIFT-JIS Encoding Used to Store Data on the Server
This situation is possible if you have a SHIFT-JIS database on a Win32 engine and you want to move all the files to the Linux EUC-JP server. In this case, the database resides on an EUC-JP Linux machine, but all the data inside the DDF files and data files is in SHIFT-JIS.
In this case, your DSN should be set up as follows:
[dbclient]
Driver=/usr/local/psql/lib/libodbcci.so
Description=PSQL ODBC Client Interface: JPN-2000SERVER:1583/dbclient
ServerDSN=DEMODATA
ServerName=JPN-2000SERVER:1583
TranslationDLL=/usr/local/psql/lib/libxlate.so.10
TranslationOption=90000932
CodePageConvert=932
The last line specifies that even though the server uses EUC-JP encoding, it should treat the data on the server as SHIFT-JIS.
Encoding Support for Wide ODBC Driver
PSQL supports UCS-2 with ODBC through a driver for wide character data and through defaults for DSN encoding translation. See Encoding Translation and ODBC Connection Strings in ODBC Guide.
ODBC Driver for Applications with Wide Character Data
PSQL provides an ODBC driver for 32-bit and 64-bit applications that use wide character data. The driver is for Windows operating systems only and is an addition to the previous set of drivers.
Table 14 PSQL ODBC Driver for Wide Character Data
Driver Name: PSQL ODBC Unicode Interface
Discussion: Connects to a local or remote named database.
With the 32-bit ODBC Administrator, creates 32-bit DSNs for use by 32-bit applications that use wide character data. The 32-bit driver is installed with all PSQL editions.
With the 64-bit ODBC Administrator, creates 64-bit DSNs for use by 64-bit applications that use wide character data. The 64-bit driver is installed with all PSQL editions when installing on a 64-bit platform.
On Linux, the system encoding is usually UTF-8, which allows SQL text to contain any Unicode character code point. The PSQL ODBC Unicode Interface driver is not available on Linux because an application can use the PSQL ODBC Client Interface driver with UTF-8. A Linux application can handle wide character data either as UTF-16 strings (SQL_C_WCHAR) or request conversion to the system encoding (usually UTF-8) as SQL_C_CHAR. SQL text using UTF-8 is compatible with the existing Pervasive ODBC Client Interface driver so an additional ODBC driver on Linux is not required.
Default for DSN Encoding Translation
The encoding translation options for a DSN specify how character data is translated between the PSQL database engine and a PSQL client application that uses ODBC. The default for encoding translation depends on the PSQL ODBC driver that you use.
Table 15 DSN Encoding Translation Default
PSQL ODBC Unicode Interface: Automatic. The connection string parameter PvTranslate also defaults to "auto."
PSQL ODBC Interface: None. Same default as the previous version of PSQL.
PSQL ODBC Client Interface: None. Same default as the previous version of PSQL.
PSQL ODBC Engine Interface: None. Same default as the previous version of PSQL.
The ODBC drivers process SQL text differently depending on the driver and the setting for the DSN encoding translation.
Table 16 PSQL ODBC Driver and DSN Encoding Translation Setting Effect on SQL Text
Automatic setting:
PSQL ODBC Unicode Interface: SQL text is converted to UTF-8 and then sent to the database engine. The code pages for client, server, and database are ignored. (See note 1.)
PSQL ODBC Interface, PSQL ODBC Client Interface, and PSQL ODBC Engine Interface: SQL text is converted to the database code page and then sent to the database engine.
None setting:
All drivers: SQL text is not translated between the client and the database engine. (See note 2.)
OEM/ANSI setting:
All drivers: SQL text in the client code page is converted to the OEM/ANSI encoding and then sent to the database engine. (See note 3.)
Note 1: With encoding translation set to Automatic, you can use NCHAR columns and NCHAR literals with wide character data.
Note 2: The assumption is that the client and database engine use the same operating system encoding.
Note 3: If the SQL text is wide character, it is first converted to the client encoding. If the SQL text is not wide character, it is already in the client encoding. The SQL text is then converted to the OEM encoding and sent to the database engine.
Unicode Support in PSQL Utilities
Unicode Support in PSQL Control Center (PCC)
See also the section Encoding Support in PCC earlier in this chapter.
Dialogs for Opening and Saving Files
The PCC dialogs for opening and saving SQL documents, saving exported schemas, and importing and exporting table data have all been enhanced to accommodate a variety of file encodings. Previously, these files were presumed to be in the default system code page. It is now possible to select among a number of Unicode encodings when saving files. When opening a file, the new dialogs detect whether the file uses a byte order mark (BOM) to identify the Unicode encoding. The opening dialogs also allow you to set the expected encoding of the file. For your convenience, a new PCC setting controls the default encoding used in these dialogs.
For more information on these new features, see PSQL User's Guide under the topics Dialogs for File Open and File Save, Wide Character Data Support for Import Data, Export Data, and Export Schema, and File Encoding Preferences.
Bulk Data Utility (BDU)
The Bulk Data Utility (BDU) is a command line tool that allows you to load data from a delimited text file into a PSQL table. A command line parameter, -c encoding, is provided to specify the data encoding to use when loading the data file. The encoding options are UTF-8, UTF-16LE, and UTF-16BE. If a data file contains a byte order mark (BOM), BDU uses the encoding specified by the BOM. That is, if a data file uses a BOM to indicate an encoding of UTF-8, UTF-16LE, or UTF-16BE, BDU uses that encoding regardless of what value you specify for the encoding parameter on the command line. Without a BOM or the -c parameter, BDU defaults to using the system code page.
See bdu in PSQL User's Guide.
Support for Collation and Sorting
What Is Collation and Sorting?
Collation refers to the ordering of a set of binary values matched with characters in a code page. Collations can differ in important ways. For example, one code page may put digits before letters and another may put them after. Sorting is the rearrangement of data so that text is in collation order.
PSQL supports the specification of a named collation on byte-string text segments. Indexes then sort the record keys according to the specified collation.
Sort Order with No Collation Sequence Specified
When no collation is specified, PSQL by default sorts characters in code point order. The default is ascending order from lowest value to highest. You can change this to descending order. See Sort Order in PSQL Programmer’s Guide for more information.
Collation Support in Wide Character Columns
PSQL supports the default collation of Unicode data according to code point order. In addition to UTF-16 data, it can also sort multibyte UTF-8 text in code point order.
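Code point order on UTF-8 byte strings works because of a property of the encoding itself: comparing UTF-8 bytes as unsigned values produces the same order as comparing code points. A small Java check of that property:

import java.nio.charset.StandardCharsets;

public class Utf8OrderCheck {
    // Unsigned byte-wise comparison of the UTF-8 encodings of two strings...
    static int byteCompare(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < Math.min(x.length, y.length); i++) {
            int d = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (d != 0) return d;
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        // ...agrees in sign with a direct code point comparison.
        String a = "é", b = "語";
        System.out.println(byteCompare(a, b));                    // negative
        System.out.println(a.codePointAt(0) - b.codePointAt(0));  // negative
    }
}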
Collation Support Using an Alternate Collating Sequence (ACS)
You can specify an alternative to the default code page collation order. This user-defined alternate collating sequence, or ACS, is a mapping between the code page collation order and the desired collation order. You can define one or more alternate sequences for determining the collation of string keys of type STRING, LSTRING, and ZSTRING. For example, you can use a user-defined ACS to specify a collation that places numbers after letters or changes the ordering of upper- and lowercase letters. PSQL comes with an ACS named upper.alt that maps lowercase letters to uppercase letters so that they sort as equivalent. This result could also be achieved by setting case insensitivity, but the example shows what can be done with an ACS.
Essentially, the user-defined ACS is a table that associates the code page sequence position for a character with the alternate desired sequence position. Creating an ACS is described in Alternate Collating Sequences in PSQL Programmer's Guide, including examples. You specify the ACS for key value fields in the definition of the layout of the data file. See Specifying an Alternate Collating Sequence for a Key in this guide and Data Layout in PSQL Programmer's Guide.
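Conceptually, that table is just a 256-entry byte map. The following Java sketch builds an upper.alt-style mapping in memory; it illustrates the concept only and is not the on-disk .alt file format (see Alternate Collating Sequences in PSQL Programmer's Guide for that):

public class AcsMapSketch {
    public static void main(String[] args) {
        // Start with the identity mapping: each code sorts at its own position.
        byte[] map = new byte[256];
        for (int i = 0; i < 256; i++) {
            map[i] = (byte) i;
        }
        // Fold lowercase ASCII letters onto their uppercase positions, so that
        // 'a'..'z' collate as 'A'..'Z', the effect of the supplied upper.alt.
        for (int c = 'a'; c <= 'z'; c++) {
            map[c] = (byte) (c - 'a' + 'A');
        }
        System.out.println((map['a'] & 0xFF) == 'A'); // true
    }
}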
For additional information about setting an ACS, see Create (14), Create Index (31) and Get Next Extended (36) in Btrieve API Guide, Alternate Collating Sequence (ACS) Files in DDF Builder User’s Guide and SET DEFAULTCOLLATE in SQL Engine Reference.
Collation Support Using an International Sort Rule (ISR)
Another type of ACS is an international sort rule or ISR. An ISR is a predefined alternate collating sequence for language-specific sort orders. You can use an ISR to correctly sort languages such as German with the letters ä, ö, ü (sorted as ae, oe, ue) and ß (sorted as ss). PSQL provides a number of ISR tables in the collate.cfg file in your PSQL installation. Examples of their use can be found in Sample Collations Using International Sorting Rules in PSQL Programmer’s Guide. See the references for alternate collating sequences, above, for more information.
Collation Support Using an ICU Unicode Collation
PSQL supports two Unicode collations for use with UTF-8 or UTF-16 data if you need sorting other than the default binary collation. These alternate Unicode collations are based on the International Components for Unicode (ICU) libraries, release version 54. The following table summarizes the collations.
ICU Collation Name: u54-msft_enus_0. Installed File: u54-msft_enus_0.txt. Description: Emulates the ISR collation MSFT_ENUS01252_0. The emulation applies only to the 1252 subset of Unicode. Characters outside this range are sorted according to the ICU root collation.
ICU Collation Name: root. Installed File: icudt54l.dat. Description: Defines the default ICU collation and other configuration data.
ICU collations are used like ISR table names, except that instead of having names starting with PVSW_ or MSFT_, their names must have the prefix u54- or be simply root. In addition, these collations can be applied only to the following Unicode data types:
STRING (assumed to be UTF-8)
ZSTRING (assumed to be UTF-8)
WSTRING
WZSTRING
A default PSQL installation provides the two ICU collations in the same location as the collate.cfg file. For generic sorting, the default ICU collation is referred to by the name root, and its configuration data resides in the file icudt54l.dat. For locale-specific sorting, supplemental data resides in files whose names start with u54 and end with a .txt extension. PSQL currently supports one supplemental ICU collation.
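As a usage sketch, an ICU collation name appears where an ISR name would otherwise be used, for example with SET DEFAULTCOLLATE; the connection URL is a placeholder, and the exact statement syntax should be confirmed under SET DEFAULTCOLLATE in SQL Engine Reference:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class IcuCollationSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:pervasive://localhost:1583/DEMODATA"; // placeholder URL
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Make the supplemental ICU collation the default collation for
            // string columns and indexes created afterward in this session.
            stmt.execute("SET DEFAULTCOLLATE = 'u54-msft_enus_0'");
        }
    }
}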
For more information about ICU collations, see the ICU Project website.
Locale Support
An important aspect of globalization is locale, which is a model and definition of a native-language environment. A locale consists of a number of categories for which country-dependent formatting or other specifications exist. For example, a locale defines date and time formatting conventions, monetary conventions, decimal formatting conventions, and collation (sort) order. Depending on the operating system, a locale may be called a region.
More than one locale can be associated with a particular language, which allows for regional differences. For example, English can have a United States locale and a Great Britain locale.
When executing string functions, PSQL uses the locale of the operating system on which the database engine is running. PSQL uses the locale of the Client when converting data types as requested by the application through one of the PSQL access methods.
For more information, see SET DECIMALSEPARATORCOMMA, Comma as Decimal Separator and SET TIME ZONE in SQL Engine Reference.