Thread: locale-specific sort algorithms undocumented?
In the course of creating a new PG 7.4.3 server, I chose UTF-8 as my locale. I now find that sorting is very different with that setting: It appears, through trial and error, that all non-alphanumeric characters are completely ignored by ORDER BY. Thus, traditional tricks like prefixing a name with a leading space to force it to the head of an alphabetized list don't work anymore. This is unexpected and undesired, but curiously, I can't find a word written anywhere that defines the sort process for various locale settings. Can anyone point me to any documentation on this? Second, can anyone offer any positive or negative implications of choosing a non-default locale (for English language databases)? I chose UTF because I perceive it as more of a standard encoding approach, but I'll have to reinit the server with a diffeent setting if I can't control its predilection to ignoring non-alphanumerics. Thanks for sharing any insights. John Gunther Bucks vs Bytes Inc Innovative, advanced web sites for all occasions.
On Tue, 20 Jul 2004, John Gunther wrote: > In the course of creating a new PG 7.4.3 server, I chose UTF-8 as my > locale. I now find that sorting is very different with that setting: It As a note, UTF-8 is not a locale. My guess is that you are actually using something like en_XX.UTF-8 (for some XX) given your comments below. > appears, through trial and error, that all non-alphanumeric characters > are completely ignored by ORDER BY. Thus, traditional tricks like > prefixing a name with a leading space to force it to the head of an > alphabetized list don't work anymore. This is unexpected and undesired, > but curiously, I can't find a word written anywhere that defines the > sort process for various locale settings. Can anyone point me to any > documentation on this? Pretty much we punt to your system documentation, because we rely on the facilities provided by it for handling the collation. "PostgreSQL uses the standard ISO C and POSIX locale facilities provided by the server operating system. For additional information refer to the documentation of your system." In practice, most systems seem to provide collation similar to dictionaries for English (at least) which means that symbols and spaces aren't significant for first pass comparisons. > Second, can anyone offer any positive or negative implications of > choosing a non-default locale (for English language databases)? I chose > UTF because I perceive it as more of a standard encoding approach, but > I'll have to reinit the server with a diffeent setting if I can't > control its predilection to ignoring non-alphanumerics. I think it mostly depends on what you want. en_US provides you dictionary collation, which means that "A C" > "AB" but it also means that "AC" > "Ab".
John Gunther <mail@bucksvsbytes.com> writes: > In the course of creating a new PG 7.4.3 server, I chose UTF-8 as my > locale. UTF-8 is a character set encoding, not a locale. I suspect you may have chosen en_US or something like that as your locale. > I now find that sorting is very different with that setting: It > appears, through trial and error, that all non-alphanumeric characters > are completely ignored by ORDER BY. I doubt they are ignored completely, but they probably are ignored in the first-order comparison. However, this is not PostgreSQL's bailiwick; we just use whatever strcoll() behavior the C library provides. The correct place to be looking for this documentation is in the libc or glibc docs. regards, tom lane
Tom Lane wrote: > > I now find that sorting is very different with that setting: It > > appears, through trial and error, that all non-alphanumeric > > characters are completely ignored by ORDER BY. > > I doubt they are ignored completely, but they probably are ignored in > the first-order comparison. The way this more or less works is: First pass: letters, numbers Second pass: accents Third pass: upper/lower case Fourth pass: punctuation characters This is all enshrined in various standards such as ISO/IEC 14651, national standards based on it, and independent technical standards such as the Unicode Collation Algorithm. The latter in fact allows what many people appear to be looking for: a "variable weighting" option that allows you to promote punctuation characters to the first pass. But I don't think any operating system implements that, yet. -- Peter Eisentraut http://developer.postgresql.org/~petere/