Thread: locale-specific sort algorithms undocumented?

locale-specific sort algorithms undocumented?

From
John Gunther
Date:
In the course of creating a new PG 7.4.3 server, I chose UTF-8 as my
locale. I now find that sorting is very different with that setting: It
appears, through trial and error, that all non-alphanumeric characters
are completely ignored by ORDER BY. Thus, traditional tricks like
prefixing a name with a leading space to force it to the head of an
alphabetized list don't work anymore. This is unexpected and undesired,
but curiously, I can't find a word written anywhere that defines the
sort process for various locale settings. Can anyone point me to any
documentation on this?

Second, can anyone offer any positive or negative implications of
choosing a non-default locale (for English language databases)? I chose
UTF because I perceive it as more of a standard encoding approach, but
I'll have to reinit the server with a diffeent setting if I can't
control its predilection to ignoring non-alphanumerics.

Thanks for sharing any insights.

John Gunther
Bucks vs Bytes Inc

Innovative, advanced web sites for all occasions.


Re: locale-specific sort algorithms undocumented?

From
Stephan Szabo
Date:
On Tue, 20 Jul 2004, John Gunther wrote:

> In the course of creating a new PG 7.4.3 server, I chose UTF-8 as my
> locale. I now find that sorting is very different with that setting: It

As a note, UTF-8 is not a locale.  My guess is that you are actually using
something like en_XX.UTF-8 (for some XX) given your comments below.

> appears, through trial and error, that all non-alphanumeric characters
> are completely ignored by ORDER BY. Thus, traditional tricks like
> prefixing a name with a leading space to force it to the head of an
> alphabetized list don't work anymore. This is unexpected and undesired,
> but curiously, I can't find a word written anywhere that defines the
> sort process for various locale settings. Can anyone point me to any
> documentation on this?

Pretty much we punt to your system documentation, because we rely on the
facilities provided by it for handling the collation.
 "PostgreSQL uses the standard ISO C and POSIX locale facilities provided
by the server operating system. For additional information refer to the
documentation of your system."

In practice, most systems seem to provide collation similar to
dictionaries for English (at least) which means that symbols and spaces
aren't significant for first pass comparisons.

> Second, can anyone offer any positive or negative implications of
> choosing a non-default locale (for English language databases)? I chose
> UTF because I perceive it as more of a standard encoding approach, but
> I'll have to reinit the server with a diffeent setting if I can't
> control its predilection to ignoring non-alphanumerics.

I think it mostly depends on what you want. en_US provides you dictionary
collation, which means that "A C" > "AB" but it also means that "AC" >
"Ab".

Re: locale-specific sort algorithms undocumented?

From
Tom Lane
Date:
John Gunther <mail@bucksvsbytes.com> writes:
> In the course of creating a new PG 7.4.3 server, I chose UTF-8 as my
> locale.

UTF-8 is a character set encoding, not a locale.  I suspect you may
have chosen en_US or something like that as your locale.

> I now find that sorting is very different with that setting: It
> appears, through trial and error, that all non-alphanumeric characters
> are completely ignored by ORDER BY.

I doubt they are ignored completely, but they probably are ignored in
the first-order comparison.  However, this is not PostgreSQL's
bailiwick; we just use whatever strcoll() behavior the C library
provides.  The correct place to be looking for this documentation
is in the libc or glibc docs.

            regards, tom lane

Re: locale-specific sort algorithms undocumented?

From
Peter Eisentraut
Date:
Tom Lane wrote:
> > I now find that sorting is very different with that setting: It
> > appears, through trial and error, that all non-alphanumeric
> > characters are completely ignored by ORDER BY.
>
> I doubt they are ignored completely, but they probably are ignored in
> the first-order comparison.

The way this more or less works is:

First pass: letters, numbers
Second pass: accents
Third pass: upper/lower case
Fourth pass: punctuation characters

This is all enshrined in various standards such as ISO/IEC 14651,
national standards based on it, and independent technical standards
such as the Unicode Collation Algorithm.

The latter in fact allows what many people appear to be looking for: a
"variable weighting" option that allows you to promote punctuation
characters to the first pass.  But I don't think any operating system
implements that, yet.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/