Database Globalization
This section contains the following topics:
Overview
Globalization means, in this context, adapting computer software to different languages. It is now commonplace that data be accessed by users around the globe and that applications present the data in the user's own language. Support for globalization in Zen allows your application to store text in multiple languages in the same database. This means that your application can store, process, and retrieve data in whatever language is required.
This chapter explains the Zen features with which you can support globalization for your applications. It discusses overall approaches to globalization and the particular Zen features that support globalized applications. By default, Zen remains backward compatible with legacy text encodings and this chapter also discusses the settings that ease globalization for applications using legacy encodings.
Concepts and Definitions
This section presents important concepts and definitions of terms used in this chapter.
Character Sets
A character set defines the list of text and other symbolic characters that are recognized by a given hardware and software system. For a system that only needs to recognize the characters used in English, the set can be as small as the letters A-Z and a-z, the numerals 0-9 and a few punctuation symbols. Support for additional languages increases the size of the character set. For example, European languages add characters with accents and other diacriticals. Other languages have completely different characters.
Legacy Character Sets
A character is represented in a computer system as a numerical value (getting from a character to a number and back is called encoding and is discussed below). To represent all of the characters in a large character set requires a large range of numerical values. In order to efficiently use what was once limited and expensive storage resources, computer systems have commonly used a byte (8-bits) to store character representations. This limited the size of the character set to 256 characters. Consequently, we now have a legacy of specialized character sets for representing specific languages, such as Japanese, or limited groups of languages, such as Western European. Zen supports a number of these legacy character sets.
The Unicode Character Set
The Unicode standard defines a character set that contains every character used in spoken languages in the world (see
www.unicode.org). Unicode also expands the concept of a character set by defining additional annotation information to specify letter spacing, right-to-left behavior, word and line breaks, and so forth. This allows applications to properly display and manipulate Unicode text. Applications, and the database, also need this additional information for such actions as case conversion and sorting.
Zen recognizes the Unicode character set, providing support for character data storage and retrieval in whatever languages are required by the application.
Encoding
Encoding is the association of each character in a character set with a numerical value. Initially, computer systems and system programming languages did not distinguish between characters and bytes and encoding was simply the byte value corresponding to a particular character. In order to respond to the need to display more characters than would be possible to encode in a single byte, different encodings have been defined. Encodings for larger character sets may use multiple bytes per character.
Legacy Encodings
For legacy character sets, the encoding is defined in a code page. You can think of the code page as the lookup table for converting from a character to a value (or a value to a character). It is important that applications that use text always use the same code page. A character that is stored in the database using one code page may be displayed as a different character when read using a different encoding.
The binary unit of all legacy code pages is an 8-bit byte. Most legacy code pages currently in use are supersets of ASCII (American Standard Code for Information Interchange), a 7-bit code defining a character set consisting of 128 control codes and printable characters. By using the eighth bit an additional 128 codes are possible for a total of 256 character encodings in a single byte value.
Microsoft Windows has two groups of code pages, ANSI and OEM code pages. Although code pages in both of these groups are extended ASCII code pages, neither group distinguishes between characters and bytes.
In ANSI code pages, non-ASCII values (those greater than 127) represent international characters that are usually customized for a language or group of languages. ANSI code pages are typically used for byte string applications that use a graphical user interface on Windows systems. An ANSI code page is also sometimes referred to as the active code page (ACP, referred to in Zen as the client code page). A Windows application always has one currently active code page. For example, the default active code page for English on Windows is code page 1252.
OEM (original equipment manufacturer) code pages are, as the name implies, code pages developed by a given manufacturer for specific systems. These code pages were originally used for MS-DOS and are still used for console applications. The usual OEM code page for English is code page 437. A typical OEM code page has a character set similar to an ANSI code page but uses a different character order for values greater than 127.
Unicode Encodings
In character encoding systems, each character is assigned a unique value called a code point which can be used for encoding data. The code points are organized into planes. Each plane can contain 65536 (216) code points. Unicode has provision for up to 17 planes. The first plane, plane 0, is named the Basic Multilingual Plane (BMP) and contains the majority of the code points currently defined. At the time of this writing, only planes 0, 1, 2, 15, and 16 contain code points. The Unicode standard has several methods of encoding the code points. Two that are commonly used are UTF-8 and UCS-2. UTF-8 encodes character code point values to a byte string using 1–4 bytes per character. UCS-2 encodes character code point values using 16-bit values, often referred to as wide characters.
Zen recognizes the Basic Multilingual Plane code points and is compatible with applications that use the Unicode encodings UTF-8 for byte strings and UCS-2 for wide character strings. The binary unit for UTF-8 is 8-bit. The binary unit for UCS-2 is 16-bit, wide character.
Declaring Encodings
The database code page is a Zen database property that declares the encoding of character data stored in the database. The purpose is to help insure that character data can be interpreted correctly. However, the database code page property is simply a declaration. Zen does not validate the encoding of the data and metadata that an application inserts into a database. The application is responsible for ensuring that character data is stored and retrieved in a particular encoding. Note that the database code page applies only to text encoded with legacy code pages or UTF-8. Wide character text is encoded using UCS-2. A proper setting is required for the engine to convert between wide character text and byte-string text. The default value of the database code page is the system code page of the operating system where the engine is running.
In Zen the SQL access methods infer a client code page for byte strings exchanged between the application and the access method. On Windows, the access method assumes that the application is respecting the active code page (ACP) for byte strings. On Linux, macOS, and Raspbian, the access method assumes that the application is respecting the encoding of the locale, which is usually set in the LANG environment variable.
Zen provides methods to ensure compatible encoding between the database engine and clients. For example, an application can specify that it wants the Zen SQL client to translate data automatically between the database code page and the client application. This is referred to as automatic translation. Note, however, that automatic translation can translate characters only if they are present in the character sets of both code pages (the code page on the server machine and the code page on the client machine).
For backward compatibility, automatic translation in access methods is disabled by default. The application must configure the access method to enable automatic translation. When possible, the recommended method is to set the database code page and configure the access method to read and use that value.
Collation and Sorting
Collation is the general term for the process and function of determining the sorting order of strings of characters. Collation varies by language, so it is not possible to arrange the encoding in multilanguage code pages so that simple binary string comparison produces the desired collation order for every language. If multilevel sorting is a requirement, separate data tables are required to define the correct sorting order.
Choosing a Character Set and Encoding
To implement a globalization strategy you typically begin by identifying the character set required based on the languages or other text and character requirements you need to satisfy. The next step is to choose the encodings that support the character set. The encoding used may even be different for the database and for client applications. Let's look at some examples.
The most global character set is the Unicode character set. Even if legacy character sets are used in the clients, they can all be translated to Unicode for storage in Zen. For a new application or a new module, storing text in UCS-2 or UTF-8 encoding is a simple approach. However, not all applications are new.
Another consideration for applications is the technology of the client programs. If the application uses the .NET framework, the Java VM, or the UNICODE option with C/C++, the application is already processing text using wide character strings. In these situations the main consideration is configuring Zen to preserve that text and choosing how to store it.
If the application is using byte strings in C/C++ and the legacy Zen ODBC driver, there are two possible paths to globalization. One is to port the application to use wide character strings; the other is to let the application continue to support the legacy code page of the client where it is installed and to arrange for translation to Unicode storage.
A very conservative approach for existing applications is to continue using your current legacy code page and take advantage of the other languages that it supports. For example, an application developed for the English-speaking market on Windows using ANSI code page 1252 or OEM code page 850 can also support Western European languages without any change in application storage. The main changes would be to localize program text.
Note: User Data and Metadata. Zen has two types of text that it must handle. The first is user data, which is mostly manipulated by the application, and also by index ordering and by SQL string functions. The second type is metadata, which is the names of SQL objects, such as tables, columns, and indexes. Metadata does not handle UCS-2 encoding, and so follows the legacy code page of the database code page declaration. SQL queries can contain both user data in string literals, and metadata in object names. Thus when discussing SQL queries, we must distinguish the character sets of user data and metadata, even when we are using one of the Unicode encodings for the SQL text as a whole.
Zen is not prepared to handle mixed encodings in text storage. The application should consider such text to be BINARY storage and handle all encoding translations in the application. Zen assumes that all CHAR type data and SQL metadata respect the database code pag, and that all NCHAR data is UCS-2.
The following topics cover specific storage cases:
In addition, the following topic covers handling of legacy OEM code pages:
Multilingual Database Support with Unicode UTF-8
If you choose to store text as UTF-8 you will continue to use the CHAR, VARCHAR, and LONGVARCHAR relational types. You also need to consider such aspects as the Unicode support for the operating system on which your application runs, the string manipulation libraries available to your application, the Zen access methods your application uses, any columns that may need a different data type, and so forth.
When to Use Unicode UTF-8
Unicode UTF-8 encoding is a good choice for the following:
• You want to add new language support to an existing application but keep application changes fairly minimal. For example, you have a Zen database with ANSI-only characters (English, for instance). You want to extend your application to include data in English, German, Polish, and Czech. UTF-8 provides compact storage requirements for European scripts because it requires, at most, two bytes per character.
• Web applications, since many web platforms use UTF-8. Because Unicode UTF-8 is ASCII-compatible and compact for Latin-based language character sets, it is often used as a standard encoding for interchange of Unicode text.
• A Linux or macOS application that supports UTF-8 string handling
• A Zen server on macOS
Unicode UTF-8 Support in Zen
One of the code pages supported by Zen is UTF-8. For UTF-8 text storage, you would set the DB code page for your Zen database to UTF-8.
Note that with UTF-8, string storage is byte strings. For byte strings, Zen provides the relational data types CHAR, VARCHAR, and LONGVARCHAR, and the Btrieve data types STRING and ZSTRING. See also
Data Types in
SQL Engine Reference. Columns will likely be wider when storing UTF-8 because European languages often require two bytes per character instead of a single byte for legacy code pages.
All string data inserted by your application for existing CHAR, VARCHAR and LONGVARCHAR data types are interpreted as UTF-8 strings. You can configure the Zen SQL access methods to automatically translate to UTF-8 (see
Access Methods for Unicode UTF-8 Support).
When the database code page is UTF-8 and the client environment supports Unicode (wide character or UTF-8), SQL text supports Unicode characters in CHAR literals. With any other database code page, general Unicode characters must be in NCHAR literals.
Collation and Sorting
By default, Zen supports code point order for collation and sorting with UTF-8 storage.
Access Methods for Unicode UTF-8 Support
The Zen access methods ODBC, JDBC, and ADO.NET support translation to UTF-8 storage. These access methods exchange text values with the application as UCS-2 wide character strings or as legacy byte strings for the ANSI ODBC drivers. When properly configured, the access methods translate the application text values to UTF-8 for transmission to the storage engine.
• If your application uses the ANSI ODBC driver on Windows, all data will be converted by the Windows Driver Manager to the client legacy code page for byte strings. This results in the loss of any characters that are not in the legacy character set. You may also need to convert your application to use the Unicode ODBC driver.
• If your application uses the ANSI ODBC driver on Linux or macOS, you should set the app locale to use UTF-8 as the string encoding. For completeness, also declare pvtranslate=auto in the connection string and declare the database code page to be UTF-8.
• For JDBC, your application needs to specify pvtranslate=auto in the connection string to the JDBC driver. See
Connection String Overview in
JDBC Driver Guide.
• For ADO.NET, your application needs to specify pvtranslate=auto in the connection string to the database engine. See
Adding Connections in
Data Providers for ADO.NET Guide.
Migrating an Existing Database to Unicode UTF-8
All text data must be converted from any legacy code page to UTF-8. Columns will likely need to be widened to accommodate the longer UTF-8 byte strings. Any non-ASCII metadata, such as table names, must be converted from the legacy code page to UTF-8. Given these combined changes, it is reasonable to migrate the database by copying from the old schema, using the legacy code page, to the new schema with UTF-8 as the database code page.
Note: In the special case where all existing data and metadata is pure ASCII, it is possible to just change the database code page to UTF-8. All existing (7-bit) ASCII byte strings are also valid UTF-8 byte strings.
Multilingual Database Support with Unicode UCS-2
If you choose to store text as UCS-2 you will use the NCHAR, NVARCHAR, and NLONGVARCHAR relational types. This has no effect on your ability to also store other text in the CHAR family of relational types. You also need to consider such aspects as the Unicode support for the operating system on which your application runs, the string manipulation libraries available to your application, the Zen access methods your application uses, any columns that may need a different data type, and so forth.
When to Use Unicode UCS-2
Unicode UCS-2 is a good option for the following situations:
• Your application supports Asian character data. With UCS-2, all characters are stored as two bytes which provides more compact storage than UTF-8 for Asian character data.
• Your application will not be globalizing all of the storage and so will have wide character columns together with legacy byte string columns.
• Your application needs better compatibility for wide character data with Java, ADO.NET, or wide character clients. Such applications use UCS-2 for application strings.
Unicode UCS-2 Support in Zen
Wide character storage is a separate type from byte-string storage and may be used alongside byte-string storage.
For wide character strings, Zen provides the relational data types NCHAR, NVARCHAR, and NLONGVARCHAR, and the Btrieve data types WSTRING and WZSTRING. All data inserted by your application for NCHAR, NVARCHAR, NLONGVARCHAR, WSTRING and WZSTRING data types is interpreted as wide character strings.
Zen supports Unicode characters in NCHAR literals in SQL query text. Text data in CHAR literals is translated to the database code page. Keep in mind that database code pages other than UTF-8 cannot handle most Unicode characters.
Collation and Sorting with Unicode UCS-2
By default, Zen supports code point order for collation and sorting with UCS-2 storage.
Access Methods for Unicode UCS-2 Support
Using NCHAR storage with the SQL access methods requires that applications specify the appropriate NCHAR SQL types to the access method. Use of CHAR types can cause conversion from NCHAR to CHAR and loss of data if the database code page is not UTF-8.
• ODBC applications should use the SQL_WCHAR, SQL_WVARCHAR, and SQL_WLONGVARCHAR SQL types for parameters.
• Windows ODBC applications should be using wide character strings, identified to ODBC as SQL_C_WCHAR. Also, Windows ODBC applications should use the Zen Unicode ODBC driver. Use of the legacy ANSI ODBC driver will result in loss of data because the Windows Driver Manager will convert all wide character strings to byte string.
• Linux, macOS, and Raspbian ODBC applications should use a locale with a UTF-8 encoding and use UTF-8 in application byte strings (SQL_C_CHAR) to provide data for NCHAR values (SQL_WCHAR).
• JDBC applications should set string parameters to type NVARCHAR or use the ‘N' methods such as setNString() to send string data as an NCHAR SQL type. Also, set pvtranslate=auto in the connection string and set the database code page. Otherwise, JDBC will remain backward compatible and use the declared or default byte-string encoding for string data. See
Connection String Overview in
JDBC Driver Guide.
• ADO.NET applications need to specify pvtranslate=auto in the connection string to the database engine. See
Adding Connections in
Data Providers for ADO.NET Guide.
In all cases, use NCHAR literals in SQL text.
Migrating an Existing Database to Unicode UCS-2
The tasks for migrating an existing database to support a UCS-2 application involve converting byte strings to wide character strings. The extent of the conversion depends on your application. ALTER TABLE can be used to convert columns provided that the database code page is set to match the encoding for existing CHAR columns. It is only necessary to convert columns that will store globalized data. Columns that will never store characters outside of the character set of the database code page can remain as CHAR types.
If your application uses the ANSI ODBC driver on Windows, you need to convert your application to wide character data and use the Zen Unicode ODBC driver.
Multilingual Database Support with Legacy and OEM Encodings
If you choose to store text using a legacy or OEM code page, you will use the CHAR, VARCHAR, and LONGVARCHAR relational types. You may use wide character applications if you configure the access method to do conversion to your legacy code page.
When to Use a Legacy Code Page
Using a legacy code page is a compact and efficient way to store text data provided that the legacy code page defines a character set that meets the needs of your application.
Legacy Code Page Support in Zen
The Zen database engine assumes that CHAR data is encoded using the database code page. The access methods have configuration options that help ensure that this is the case.
Older applications that use OEM code pages for storage did not declare the database code page. You can continue to work with this situation so long as the application is careful to use the appropriate relational string functions.
Collation and Sorting with Legacy Code Pages
The default collation in Zen is binary ordering of the encoded bytes. The declared database code page does not affect this default collation.
Zen provides both alternate collating sequence (ACS) and international sorting rules (ISR) mechanisms for controlling collation. These may be declared on Btrieve indexes and on relational columns. Zen will always use a particular ACS for the ASCII character set when using the CASE declaration on a relational column or the case-insensitive bit on a Btrieve index.
Access Methods for Legacy Code Pages
All Zen access methods support byte-string text. Some assume that the database code page is the same as the application's code page, and some allow configuration.
• The ANSI ODBC drivers provide a DSN setting, OEM/ANSI translation, which declares that the application OEM code page (as distinct from ANSI code page) is the database code page. The ANSI ODBC drivers also support the encoding property in connection strings and the pvtranslate=auto option.
• The Unicode ODBC driver always operates as if the pvtranslate_true property is set and translates application CHAR data to the database code page as needed.
• The JDBC driver has an explicit encoding property in the connection string. The driver will set encoding to the database code page if the pvtranslage=auto property is given.
• The ADO.NET provider has an explicit encoding connection property. The driver will set Encoding to the database code page if the PvTranslate=Auto connection property is given.
• The PDAC access method for Delphi has the OEMConversion property that controls whether CHAR data is sent to the engine using the OEM code page or the ANSI (ACP) code page.
Migrating an Existing Database to a Different Legacy Code Page
Changing the code page of a database requires copying the data to a new database that uses the new code page. The copy must be done by your application. Zen does not translate code pages within a transaction.
If your database metadata use characters that require translation, those changes must be made when creating the schema of the new database. Table names are a common example of this type of metadata.
Database Code Page and Client Encoding
As discussed in
Concepts and Definitions, above, encoding specifies how character data is translated between the Zen database engine and a Zen client application. Zen handles much of the complexity of the encoding between client and server and the various combinations of operating system, languages, and access method. The encoding enhancements are divided into database code page and client encoding. The two types of encoding are separate but interrelated.
Database code page and client encoding apply only to the Relational Engine. The MicroKernel Engine is not affected.
Database Code Page
The database code page is a database property that specifies the encoding used for character data stored in the database. The default database code page is "server default," meaning the operating system code page on the server where the database engine is running. (The operating system code page is generally referred to as the "OS encoding," which is the phrase used throughout this chapter.)
Database code page is particularly handy if you need to manually copy Zen DDFs to another platform with a different OS encoding and still have the metadata correctly interpreted by the database engine.
When you create a database with Zen, the default is to use the active code page for the machine on which the database engine is running. For example, on a machine running Windows for English, the code page is assigned to 1252. Zen encoding translation is set to None. Your application must either use the same code page as the Zen database, or ensure that encoding translation is set to Automatic.
Supported Code Pages
Zen supports the following code pages. All of the pages listed use byte string storage.
• ASCII
• EUCJP
• ISO8859_1
• UTF-8
• Windows Code Pages
• CP437, CP737, CP775, CP850, CP852, CP855, CP857, CP858, CP862, CP866, CP932
• CP1250, CP1251, CP1252, CP1253, CP1254, CP1255, CP1256, CP1257, CP1258
Client Encoding
Client encoding is the data encoding used by an application on a Zen client. An application can manage text in any encoding it chooses. A compatible encoding must be established between the database engine and the client application.
Zen can automatically translate between different encodings used by the database engine and clients provided the characters are present in both the code page on the server machine and the code page on the client machine.
Data translation, if required, occurs at the client. Translation is not always required – for example, when client and server OS encoding match.
Encoding Support in ZenCC
Note: You can use ZenCC to set the database code page when you create a database or to modify the code page setting for an existing database.
Note: Changing the database code page property does not change any data in the database. However, changing the database code page for an existing database will affect how existing data entries are interpreted.
Zen Control Center (ZenCC) is, itself, a client application to the database engine. As a client, ZenCC lets you specify the encoding to use for each database session when ZenCC reads and inserts metadata and data. The default for an existing database is to use the encoding of the machine where ZenCC is running. This is the legacy behavior of ZenCC. The default for a new database is to use automatic translation. See
ZenCC Connection Encoding in
Zen User’s Guide.
The following table explains the interaction between the settings for ZenCC connection encoding and database code page. ZenCC connection encoding applies only to ZenCC. It has no effect on other client applications.
Encoding Support in Btrieve API
When using the Btrieve API, you must provide file names and paths in the local encoding used in your application. The Btrieve API handles the differences between OS encoding on the server and client.
Encoding Support in DTI
When using the Distributed Tuning Interface (DTI), you must provide file names and paths in the local encoding used in your application. DTI handles the differences between OS encoding on the server and client.
If you use the DTI API to create a database, you may specify the database code page property at the time of creation. This property may be used by SQL access methods to configure automatic translation of character data.
Encoding Support in ADO.NET
The .NET Framework and .NET applications use UTF-16 strings. These must be translated to a code page when storing text in CHAR columns.
The connection property PVTranslate=Auto sets the connection encoding to the database code page. It is also possible to set the encoding property directly.
Encoding Support in JDBC
The Java Virtual Machine and Java applications use UTF-16 strings. These must be translated to a code page when storing text in CHAR columns.
The connection property PVTranslate=Auto will set the connection encoding to the database code page. It is also possible to set the encoding property directly.
When the PvTranslate=Auto property is set, the JDBC driver will send string literals to the engine as Unicode. Without this setting, the legacy behavior is to translate string literals to the database code page. If your application uses NCHAR string literals (e.g., "N'ABC'"), it should set the PvTranslate=Auto connection property.
See
Connection String Elements in
JDBC Driver Guide.
Encoding Support in ODBC
The Zen ODBC drivers support a number of mechanisms to control client encoding.
When configuring a DSN, it is possible to select the encoding options Automatic, OEM/ANSI, and None. The Automatic setting causes the driver to translate from the client encoding to the database code page. The OEM/ANSI setting causes the driver to translate from the client encoding to the corresponding OEM code page. The None setting prevents the driver from doing any text translation. See
Encoding Translation in
ODBC Guide for more details.
Legacy Conversion Methods for OEM-to-ANSI Data
If a database has OEM character data in it, a legacy solution is to specify OEM/ANSI conversion in the access method. This topic discusses some legacy methods for Linux clients using OEM character data.
Note: While the legacy methods are still supported, the recommendation is to specify the OEM code page for the database and have the access methods use automatic translation as discussed above.
See also
OEM/ANSI Conversion in
ODBC Guide.
When using ODBC, Win32 encoding is expected to be SHIFT-JIS.
Japanese versions of Linux by default have their encodings typically set to EUC-JP or UTF-8.
When using Japanese versions of Linux, a client can connect to another Linux server (for example, locally) or to a Win32 SHIFT-JIS server. It is also possible to connect to a database encoded in SHIFT-JIS but located on a Linux server.
Use the following instructions for your listed configuration. In each case, it is assumed that the application itself does not do any conversion and uses the encoding that is native for the machine.
Connecting a Linux EUC-JP Client to a Win32 SHIFT-JIS Server
The server requires that everything is received as SHIFT-JIS. The client requires that the server send everything as EUC-JP.
To accomplish this, the client DSN settings in ODBC.INI (located by default in /usr/local/actianzen/etc) used to connect to the given database should be set up as follows:
[dbclient]
Driver=/usr/local/actianzen/lib/libodbcci.so
Description=Zen ODBC Client Interface: JPN-2000SERVER:1583/dbclient
ServerDSN=DEMODATA
ServerName=JPN-2000SERVER:1583
TranslationDLL=/usr/local/actianzen/lib/libxlate.so.10
TranslationOption=90000932
The TranslationDLL line sets the translation library for the ODBC client interface to use.
The TranslationOption line specifies that translation is needed from 9000 (EUC-JP) to 0932 (SHIFT-JIS).
Using this example, all data coming from the client will be translated to SHIFT-JIS before it is sent to the server, and to EUC-JP before the data is sent back to the client.
Connecting a Linux UTF-8 Client to a Win32 SHIFT-JIS Server
The server requires that everything is received as SHIFT-JIS. The client requires that the server send everything as UTF-8.
To accomplish this, the client DSN settings in ODBC.INI (by default in /usr/local/actianzen/etc) used to connect to the given database should be set up as follows:
[dbclient]
Driver=/usr/local/actianzen/lib/libodbcci.so
Description=Zen ODBC Client Interface: JPN-2000SERVER:1583/dbclient
ServerDSN=DEMODATA
ServerName=JPN-2000SERVER:1583
TranslationDLL=/usr/local/actianzen/lib/libxlate.so.10
TranslationOption=90010932
The TranslationDLL line sets the translation library for the ODBC client interface to use.
The TranslationOption line specifies that translation is needed from 9001 (UTF-8) to 0932 (SHIFT-JIS).
Using this example, all data coming from the client will be translated to SHIFT-JIS before it is sent to the server, and to UTF-8 before the data is sent back to the client.
Connecting a Linux EUC-JP Client to a Linux EUC-JP Server
Using this configuration, no changes to the DSN description are needed. Use the DSN as it was created by the dsnadd tool.
Connecting a Linux UTF-8 Client to a Linux UTF-8 Server
Using this configuration, no changes to the DSN description are needed. Use the DSN as it was created by the
dsnadd tool. See
dsnadd in
Zen User’s Guide.
Connecting a Linux UTF-8 Client to a Linux EUC-JP Server
The server requires that everything is received as EUC-JP. The client requires that server send everything as UTF-8.
To accomplish this, the client DSN settings in ODBC.INI (by default in /usr/local/actianzen/etc) used to connect to the given database should be set up as follows:
[dbclient]
Driver=/usr/local/actianzen/lib/libodbcci.so
Description=Zen ODBC Client Interface: JPN-2000SERVER:1583/dbclient
ServerDSN=DEMODATA
ServerName=JPN-2000SERVER:1583
TranslationDLL=/usr/local/actianzen/lib/libxlate.so.10
TranslationOption=90019000
The TranslationDLL line sets the translation library for the ODBC client interface to use.
The TranslationOption line specifies that translation is needed from 9001 (EUC-JP) to 9000 (UTF-8).
Using this example, all data coming from the client will be translated to EUC-JP before it is sent to the server, and to UTF-8 before the data is sent back to the client.
Connecting a Linux EUC-JP Client to a Linux EUC-JP Server, with SHIFT-JIS Encoding Used to Store Data on the Server
This situation is possible if you have a SHIFT-JIS database on a Win32 engine, and you want to move all the files to the Linux EUC-JP server. In this case, the database resides on a EUC-JP Linux machine, but all the data inside the DDF files and data files are in SHIFT-JIS.
In this case, your DSN should be set up as follows:
[dbclient]
Driver=/usr/local/actianzen/lib/libodbcci.so
Description=Zen ODBC Client Interface: JPN-2000SERVER:1583/dbclient
ServerDSN=DEMODATA
ServerName=JPN-2000SERVER:1583
TranslationDLL=/usr/local/actianzen/lib/libxlate.so.10
TranslationOption=90000932
CodePageConvert=932
The last line specifies that even though the server uses EUC-JP encoding, it should treat the data on the server as SHIFT-JIS.
Encoding Support for Wide ODBC Driver
Zen supports UCS-2 with ODBC with a driver for wide character data and defaults for DSN encoding translation. See
Encoding Translation and
ODBC Connection Strings in
ODBC Guide.
ODBC Driver for Applications with Wide Character Data
Zen provides an ODBC driver for 32-bit and 64-bit applications that use wide character data. The driver is for Windows operating systems only and is an addition to the previous set of drivers.
On Linux, the system encoding is usually UTF-8, which allows SQL text to contain any Unicode character code point. The Zen ODBC Unicode Interface driver is not available on Linux because an application can use the Zen ODBC Client Interface driver with UTF-8. A Linux application can handle wide character data either as UTF-16 strings (SQL_C_WCHAR) or request conversion to the system encoding (usually UTF-8) as SQL_C_CHAR. SQL text using UTF-8 is compatible with the existing Pervasive ODBC Client Interface driver so an additional ODBC driver on Linux is not required.
Default for DSN Encoding Translation
The encoding translation options for a DSN specify how character data is translated between the Zen database engine and a Zen client application that uses ODBC. The default for encoding translation depends on the Zen ODBC driver that you use.
The ODBC drivers process SQL text differently depending on the driver and the setting for the DSN encoding translation.
Unicode Support in Zen Utilities
Unicode Support in Zen Control Center
Dialogs for Opening and Saving Files
The ZenCC dialogs for opening and saving SQL documents, saving exported schemas, and importing and exporting table data have all been enhanced to accommodate a variety of file encodings. Previously, these files were presumed to be in the default system code page. It is not possible to select a number of Unicode encodings when saving files. When opening a file, the new dialogs detect whether the file uses a byte order mark (BOM) to identify the Unicode encoding. The opening dialogs also allow you to set the expected encoding of the file. For your convenience, a new ZenCC setting controls the default encoding used in these dialogs.
For more information on these new features, see Zen User’s Guide under the topics Dialogs for File Open and File Save, Wide Character Data Support for Import Data, Export Data, and Export Schema, and File Encoding Preferences.
Bulk Data Utility (BDU)
The Bulk Data Utility (BDU) is a command line tool that allows you to load data from a delimited text file into a Zen table. A command line parameter, -c encoding, is provided to specify the data encoding to use when loading the data file. The encoding options are UTF-8, UTF-16LE, and UTF-16BE. If a data file contains a byte order mark (BOM), BDU uses the encoding specified by the BOM. That is, if a data file uses a BOM to indicate an encoding of UTF-8, UTF-16LE, or UTF-16BE, BDU uses that encoding regardless of what value you specify for the encoding parameter on the command line. Without a BOM or the -c parameter, BDU defaults to using the system code page.
See
bdu in
Zen User’s Guide.
Support for Collation and Sorting
What Is Collation and Sorting?
Collation refers to the ordering of a set of binary values matched with characters in a code page. Collations can differ in important ways. For example, one code page may put digits before letters and another may put them after. Sorting is the rearrangement of data so that text is in collation order.
Zen supports the specification of a named collation on byte-string text segments. Indexes then sort the record keys according to the specified collation.
Sort Order with No Collation Sequence Specified
When no collation is specified, Zen by default sorts characters in code point order. The default is ascending order from lowest value to highest. You can change this to descending order. See
Sort Order in
Zen Programmer’s Guide for more information.
Collation Support in Wide Character Columns
Zen supports the default collation of Unicode data according to code point order. In addition to working with UTF-16, it also can sort multibyte UTF-8 text in code point order.
Collation Support Using an Alternate Collating Sequence (ACS)
You can specify an alternative to the default code page collation order. This user-defined alternate collating sequence or ACS is a mapping between the code page collation order and the desired collation order. You can define one or more alternate sequences for determining the collation of string keys of type STRING, LSTRING, and ZSTRING. For example, you can use a user-defined ACS to specify a collation that places numbers after letters or changes the ordering of upper- and lowercase letters. Zen comes with an ACS named upper.alt that maps the lowercase letters to uppercase letters to sort as equivalent. This result could also be achieved by setting case insensitivity but the example shows what can be done with an ACS.
Essentially, the user-defined ACS is a table that associates the code page sequence position for a character with the alternate desired sequence position. Creating an ACS is described in
Alternate Collating Sequences in
Zen Programmer’s Guide, including examples. You specify the ACS for key value fields in the definition of the layout of the data file. See
Specifying a Alternate Collating Sequence for a Key in this guide and
Data Layout in
Zen Programmer’s Guide.
For additional information about setting an ACS, see
Create (14),
Create Index (31) and
Get Next Extended (36) in
Btrieve API Guide,
Alternate Collating Sequence (ACS) Files in
DDF Builder User’s Guide and
SET DEFAULTCOLLATE in
SQL Engine Reference.
Collation Support Using an International Sort Rule (ISR)
Another type of ACS is an international sort rule or ISR. An ISR is a predefined alternate collating sequence for language-specific sort orders. You can use an ISR to correctly sort languages such as German with the letters ä, ö, ü (sorted as ae, oe, ue) and ß (sorted as ss). Zen provides a number of ISR tables in the collate.cfg file in your Zen installation. Examples of their use can be found in
Sample Collations Using International Sorting Rules in
Zen Programmer’s Guide. See the references for alternate collating sequences, above, for more information.
Collation Support Using an ICU Unicode Collation
Zen supports two Unicode collations for use with UTF-8 or UTF-16 data if you need sorting other than the default binary collation. These alternate Unicode collations are based on the International Components for Unicode (ICU) libraries, release version 54. The following table summarizes the collations.
ICU collations are used like ISR table names, except that instead of having names starting with PVSW_ or MSFT_, their names must have the prefix u54- or be simply root. In addition, these collations can be applied only to the following Unicode data types:
• STRING (assumed to be UTF-8)
• ZSTRING (assumed to be UTF-8)
• WSTRING
• WZSTRING
A default Zen installation provides the two ICU collations in the same location as the collate.cfg file. For generic sorting, the default ICU collation is referred to by the name root, and its configuration data resides in the file icudt54l.dat. For locale-specific sorting, supplemental data resides in files whose names start with u54 and end with a .txt extension. Zen currently supports one supplemental ICU collation.
For more information about ICU collations, see the
ICU Project website.
Locale Support
An important aspect of globalization is locale, which is a model and definition of a native-language environment. A locale consists of a number of categories for which country-dependent formatting or other specifications exist. For example, a locale defines date and time formatting conventions, monetary conventions, decimal formatting conventions, and collation order. Depending on the operating system, a locale may be called a region.
More than one locale can be associated with a particular language, which allows for regional differences. For example, English can have a United States locale and a Great Britain locale.
When executing string functions, Zen uses the locale of the operating system where the database engine is running. Zen uses the locale of the client when converting data types as requested by the application through one of the Zen access methods.