Supported Collation Sequences
In a computer, if no other collation sequence is enforced, the sequence derived from the machine’s native character set—either ASCII or EBCDIC—is used. The sorting order of these character sets derives from the internal numeric representation of each character.
In addition to the sequences derived from ASCII and EBCDIC, Ingres supports the following local collation sequences:
• Multi, a sequence derived from the DEC Multinational Character Set
• Spanish, a sequence derived from the Spanish language
• Unicode_french, a French Unicode collation sequence for Unicode columns
If none of these sequences adequately fills your needs, you can write your own local collation sequence.
Multi Collation Sequence
The multi collation sequence is based on the DEC Multinational Character Set. This character set adds several vowels with diacritical marks to the standard 7-bit ASCII character set.
Following are the comparison sequences for the multi sequence that differ from those of ASCII:
A < À < Á < Â < Ã < Ä < B
C < Ç < D < E < È < É < Ê < Ë
I < Ì < Í < Î < Ï < J
N < Ñ < O
O < Ò < Ó < Ô < Õ < Ö < Œ < P
U < Ù < Ú < Û < Ü < V
Y < Ÿ < Z < Æ< Ø < Å
a < à < á < â < ã < ä < b
c < ç < d < e < è < é < ê < ë
i < ì < í < î < ï < j
n < ñ < o
o < ò < ó < ô < õ < ö < œ < p
ss < ß < st
u < ù < ú < û < ü < v
y < ÿ < z < æ < ø < å
For example:
cote < côte < czar < cæsar
Pattern matching rules:
• Œ, œ, ß are special when pattern match searching is used.
• Œ matches O_ or O% as well as Œ
• œ matches o_ or o% as well as œ
• ß matches s_ or s% as well as ß
Spanish Collation Sequence
The Spanish collation sequence is based on the multi sequence but contains additional support for the Spanish letters ll and ch. Listed below are the comparison sequences for the Spanish collation sequence that differ from those of ASCII. Some pattern matching rules are also described.
A < À < Á <Â < Ã < Ä < B
CZ < CÅ < Ç < CH < Ch < D < E < È < É < Ê < Ë
I < Ì < Í < Î < Ï < J
LZ < LÅ < LL < LI < M
N < Ñ < O
O < Ò < Ó< Ô < Õ < Ö < Œ < P
U < Ù < Ú < Û < Ü <V
Y < Ÿ < Z <Æ< Ø < Å
a < à < á < â < ã < ä < b
cz < câ < ç < cH < ch < d < e < è < é < ê < ë
i < ì < í < î < ï < j
lz < lå < lL < ll < m
n < ñ < o
o < ò < ó < ô < õ < ö < œ < p
ss < ß < st
u < ù < ú < û < ü < v
y < ÿ < z < æ < ø < å
Examples:
loop < llama
cote < côte < czar < cæsar < chair
The pattern matching rules are:
• Œ, œ, ß, Ç, ç are special when pattern match searching is used.
• Œ matches O_ or O% as well as Œ
• œ matches o_ or o% as well as œ
• ß matches s_ or s% as well as ß
• Ç matches C_ or C% as well as Ç
• ç matches c_ or c% as well as ç
Unicode_french Collation Sequence
The unicode_french collation is applicable to Unicode columns. It follows French sorting order for Unicode character strings where the accented characters are sorted from left to right instead of from right to left.
For example, in the default Unicode collation, the following is the sort order for these four Unicode strings:
cote
coté
côte
côté
The string starting with co.. comes before cô...
In unicode_french collation, the Unicode strings are sorted in the following order:
cote
côte
coté
côté
The strings ending in ..e are sorted before strings ending in ..é.
You can create a Unicode enabled database with unicode_french collation as follows:
createdb ‑iunicode_french unidb
or
createdb ‑nunicode_french unidb
Custom Collation Sequence for NCS Installation
If your installation uses a National Character Set (NCS) and you have special needs that are not met by the available collation sequences, you can write your own. Ingres allows you to write a collation sequence that has any of the following characteristics:
• Character skipping—one or more specified characters are ignored for collation
• One-to-one mapping—a character can be substituted for another or weighted differently for collation
• Many-to-one mapping—groups of characters can be substituted for a single character or weight value for collation
• Many-to-many mapping—groups of characters can be substituted for a sequence of characters or weight values for collation
Guidelines for Creating a Custom Collation File
Keep the following points in mind as you design and test your custom collation file:
• Never create a production database with an untested collation sequence. Always test your collation file on a sample database. Each time that you modify the collation sequence to correct any bugs, you must unload the database, destroy the old database, install the new sequence, create a database with the new sequence, and reload the database.
• Some collation sequences allow two strings that are different to compare as equal. These sequences are called information loss sequences. An example of this type of sequence is a sequence that ignores case.
Problems that can result from such a sequence are:
• If duplicates are not allowed, the DBMS drops all but one string.
• If duplicates are not allowed, the DBMS does not allow you to add a row to a table if it appears to match an existing row.
• The hash storage structure cannot detect when two equal but different strings are placed in a hashed relation.
• In a query on a hash table, the “=” operator can only fetch one of the ‘equal’ strings that matches the expression.
Because of these problems, we suggest that you do not use information loss sequences and the hash storage structure together.
How You Write a Customized Collation Sequence
To create a customized collation sequence, follow these steps:
1. Write a description file.
2. Run the description file through the aducompile utility.
3. Test your collation sequence with a small sample database.
Description File--Describe Collation Sequence
To define a custom collation sequence you must create a description file, which consists of a list of “instructions” that, taken as a whole, describe the collation sequence. Each instruction must appear on a separate line in the file.
The format of each instruction is:
value:string
where:
value
Determines the numerical weight assigned to string. (The internal numerical weight of each character determines where a character appears in the sort order.)
The value can have any of the following formats:
char+number
Instructs sorting of the specified string after the specified character and before the next higher-weighted character in the character set. For example, in the following instruction, string1 is mapped as a single character that is ordered immediately after the letter H and before I in a sorted sequence:
H+1:string1
In the following instruction, string2 sorts after string1 and before the letter I.
H+2:string2
You can specify H+1:string or Hz+1:string and both sorts in the same manner, that is, after H and before I. However, the two examples do not behave the same when pattern matching is applied. To illustrate using an example from the Spanish language, the following instruction maps CH as a single character that exists between C and D:
C+1:CH
If you ask for a pattern match using the format C%, instances of CH are not returned. The alternative, Cz+1:CH, maps CH into two characters, C and a virtual character just after z. This causes CH to match as two characters. A pattern match using the format C% finds the instances of CH.
charstring
Sorts the specified string as the equivalent of the specified charstring. For example, in the following instruction, the word tax sorts as if it were the word revenue:
revenue:tax
+number
Gives the specified string the internal numerical weight specified by given number. The number must be between 0 and 32766. The weighting of a character in this manner is less portable than giving the character a relative weight.
+*
Causes the specified string to be ignored when collation is performed. For example, in the following instruction, the “?”is ignored whenever collation takes place.
+*:?
(empty)
When no value is specified (the instruction takes the form: string), the collation compiler ignores the instruction. Use this format to insert comments into your collation sequence. For example:
:This is a comment
string
Is any character or character string. An empty string causes a syntax error.
The aducompile Utility
The aducompile utility compiles the description file for your collation sequence into a binary file and installs that file as a collation sequence that can be used. You must be the installation owner to use this utility. Be sure to give your resulting collation file a unique name so that you do not overwrite any existing collation files.
Your new collation sequence is located at $II_SYSTEM/ingres/files/collation/collation_name.
Note: In UNIX, all system users must have rights to read the new collation file.
Custom Collation Sequence for UTF8 Installation
The format for defining a custom collation for UTF8 installations differs slightly from that for National Character Set installations.
1. Create a Unicode collation source file. This example shows a caseless Unicode custom collation file, named nocaseU.uce.
# Unicode collation element table to enforce the equivalence for # the 26 Latin uppercase and lowercase characters ONLY.
# March 27, 2017
# My Name
@version 0.0.000
0041 ; [.06C3.0020.0008.0041] # LATIN CAPITAL LETTER A
0042 ; [.06D7.0020.0008.0042] # LATIN CAPITAL LETTER B
0043 ; [.06EB.0020.0008.0043] # LATIN CAPITAL LETTER C
0044 ; [.06F7.0020.0008.0044] # LATIN CAPITAL LETTER D
0045 ; [.0713.0020.0008.0045] # LATIN CAPITAL LETTER E
0046 ; [.073F.0020.0008.0046] # LATIN CAPITAL LETTER F
0047 ; [.0747.0020.0008.0047] # LATIN CAPITAL LETTER G
0048 ; [.0767.0020.0008.0048] # LATIN CAPITAL LETTER H
0049 ; [.0781.0020.0008.0049] # LATIN CAPITAL LETTER I
004A ; [.0795.0020.0008.004A] # LATIN CAPITAL LETTER J
004B ; [.07A5.0020.0008.004B] # LATIN CAPITAL LETTER K
004C ; [.07B1.0020.0008.004C] # LATIN CAPITAL LETTER L
004D ; [.07D9.0020.0008.004D] # LATIN CAPITAL LETTER M
004E ; [.07E1.0020.0008.004E] # LATIN CAPITAL LETTER N
004F ; [.07F9.0020.0008.004F] # LATIN CAPITAL LETTER O
0050 ; [.0811.0020.0008.0050] # LATIN CAPITAL LETTER P
0051 ; [.081D.0020.0008.0051] # LATIN CAPITAL LETTER Q
0052 ; [.0829.0020.0008.0052] # LATIN CAPITAL LETTER R
0053 ; [.0851.0020.0008.0053] # LATIN CAPITAL LETTER S
0054 ; [.0869.0020.0008.0054] # LATIN CAPITAL LETTER T
0055 ; [.0881.0020.0008.0055] # LATIN CAPITAL LETTER U
0056 ; [.0899.0020.0008.0056] # LATIN CAPITAL LETTER V
0057 ; [.08A5.0020.0008.0057] # LATIN CAPITAL LETTER W
0058 ; [.08AD.0020.0008.0058] # LATIN CAPITAL LETTER X
0059 ; [.08B1.0020.0008.0059] # LATIN CAPITAL LETTER Y
005A ; [.08BD.0020.0008.005A] # LATIN CAPITAL LETTER Z
0061 ; [.06C3.0020.0008.0061] # LATIN SMALL LETTER A
0062 ; [.06D7.0020.0008.0062] # LATIN SMALL LETTER B
0063 ; [.06EB.0020.0008.0063] # LATIN SMALL LETTER C
0064 ; [.06F7.0020.0008.0064] # LATIN SMALL LETTER D
0065 ; [.0713.0020.0008.0065] # LATIN SMALL LETTER E
0066 ; [.073F.0020.0008.0066] # LATIN SMALL LETTER F
0067 ; [.0747.0020.0008.0067] # LATIN SMALL LETTER G
0068 ; [.0767.0020.0008.0068] # LATIN SMALL LETTER H
0069 ; [.0781.0020.0008.0069] # LATIN SMALL LETTER I
006A ; [.0795.0020.0008.006A] # LATIN SMALL LETTER J
006B ; [.07A5.0020.0008.006B] # LATIN SMALL LETTER K
006C ; [.07B1.0020.0008.006C] # LATIN SMALL LETTER L
006D ; [.07D9.0020.0008.006D] # LATIN SMALL LETTER M
006E ; [.07E1.0020.0008.006E] # LATIN SMALL LETTER N
006F ; [.07F9.0020.0008.006F] # LATIN SMALL LETTER O
0070 ; [.0811.0020.0008.0070] # LATIN SMALL LETTER P
0071 ; [.081D.0020.0008.0071] # LATIN SMALL LETTER Q
0072 ; [.0829.0020.0008.0072] # LATIN SMALL LETTER R
0073 ; [.0851.0020.0008.0073] # LATIN SMALL LETTER S
0074 ; [.0869.0020.0008.0074] # LATIN SMALL LETTER T
0075 ; [.0881.0020.0008.0075] # LATIN SMALL LETTER U
0076 ; [.0899.0020.0008.0076] # LATIN SMALL LETTER V
0077 ; [.08A5.0020.0008.0077] # LATIN SMALL LETTER W
0078 ; [.08AD.0020.0008.0078] # LATIN SMALL LETTER X
0079 ; [.08B1.0020.0008.0079] # LATIN SMALL LETTER Y
007A ; [.08BD.0020.0008.007A] # LATIN SMALL LETTER Z
2. Compile the nocaseU.uce file:
aducompile nocaseU.uce nocaseU -u
3. Create the database with the custom collation:
createdb -nnocaseU mydb