Re: OK, that's one LOCALE bug report too many... - Mailing list pgsql-hackers

From Peter Eisentraut
Subject Re: OK, that's one LOCALE bug report too many...
Msg-id Pine.LNX.4.21.0011242345230.791-100000@peter.localdomain
In response to Re: OK, that's one LOCALE bug report too many...  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: OK, that's one LOCALE bug report too many...
List pgsql-hackers
Tom Lane writes:

> >> Also, since "LC_COLLATE=en_US" seems to misbehave rather spectacularly
> >> on recent RedHat releases, I propose that initdb change "en_US" to "C"
> >> if it finds that setting.  (Are there any platforms where there are
> >> non-bogus differences between the two?)
> 
> > There *should* be differences and it is definitely not okay to mix them
> > up.
> 
> I have now received positive proof that en_US sort order on RedHat is
> broken.  For example, it asserts
>     '/root/' < '/root0'
> but
>     '/root/t' > '/root0'
> I defy you to find anyone in the US who will say that that is a
> reasonable definition of string collation.  

That's certainly very odd, but Unixware does this too, so it's probably
some sort of standard.  And a few other European/Latin locales I tried
also do this.

But here's another example of why C and en_US are different.

peter ~$ cat foo
Delta
écrire
Beta
alpha
gamma
peter ~$ LC_COLLATE=C sort foo
Beta
Delta
alpha
gamma
écrire
peter ~$ LC_COLLATE=en_US sort foo
alpha
Beta
Delta
écrire
gamma

The C locale sorts strictly by character code.  But in the en_US locale
the accented letter is put into a "natural" position, and the upper and
lower case letters are grouped together.  Intuitively, the en_US order is
the one in which you'd look things up in a dictionary.
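
To make the difference concrete without depending on which locales are
installed, here is a rough Python sketch.  The function dictionary_key is a
hypothetical approximation of mine (strip accents, fold case), not what the
C library's en_US collation actually does; it merely reproduces the two
orderings shown above for this particular word list.

```python
import unicodedata

# The word list from the example above; "\u00e9" is the precomposed e-acute.
words = ["Delta", "\u00e9crire", "Beta", "alpha", "gamma"]


def dictionary_key(word):
    """Hypothetical stand-in for a dictionary-style collation key."""
    # Split accented letters into base letter + combining mark, then drop
    # the marks so that e-acute is filed next to plain "e".
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold case so upper and lower case letters are grouped together; the
    # original word is only a tie-breaker.
    return (base.casefold(), word)


# Plain sorted() compares code points, which is what the C locale does.
c_order = sorted(words)
dictionary_order = sorted(words, key=dictionary_key)

print(c_order)
print(dictionary_order)
```

Against a real locale one would use locale.strxfrm as the sort key after a
successful locale.setlocale call, but that depends on the locale being
present on the machine.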

This also explains (to me at least) the example you have above:  When you
look up words in a dictionary you ignore "funny characters".  My American
Heritage Dictionary explains:

: Entries are listed in alphabetical order without taking into account
: spaces or hyphens.

So at least this concept isn't that far out.
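
A toy model shows how ignoring "funny characters" yields exactly the two
comparisons you quoted.  The key function below is an assumption made for
illustration (compare letters and digits first, use the full string only to
break ties); I am not claiming this is how the RedHat or Unixware locale
tables are really defined.

```python
def ignore_punctuation_key(s):
    """Hypothetical key: punctuation is ignored on the first pass."""
    # Keep only letters and digits for the primary comparison.
    primary = "".join(ch for ch in s if ch.isalnum())
    # Fall back to the untouched string so distinct strings never tie.
    return (primary, s)


def dictionary_less(a, b):
    """True if a sorts before b under the toy dictionary-style key."""
    return ignore_punctuation_key(a) < ignore_punctuation_key(b)


# "/root/" reduces to "root", a prefix of "root0", so it sorts first.
first = dictionary_less("/root/", "/root0")
# "/root/t" reduces to "roott", and "t" sorts after "0".
second = dictionary_less("/root0", "/root/t")
# Plain code-point comparison (the C locale) puts "/" before "0" instead.
c_locale_second = "/root/t" < "/root0"

print(first, second, c_locale_second)
```

Under this model both '/root/' < '/root0' and '/root/t' > '/root0' hold at
the same time, while the C locale would sort '/root/t' before '/root0'.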


> Do you think there are cases where setlocale(,NULL) will give back
> "POSIX" rather than "C"?  We can certainly test for either.

I know there are (old) systems that reject LANG=C as an invalid locale,
but I don't know what setlocale returns there.
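
Testing for either name is cheap.  A minimal sketch, shown in Python for
brevity; the helper name is_c_locale is made up for this example, and the
only assumption is that "C" and "POSIX" name the same locale, which is what
POSIX specifies.

```python
import locale


def is_c_locale(name):
    """Hypothetical helper: treat "C" and "POSIX" as the same locale."""
    return name in ("C", "POSIX")


# Calling setlocale without a locale argument only queries the current
# setting, like setlocale(LC_COLLATE, NULL) in C; it changes nothing.
current = locale.setlocale(locale.LC_COLLATE)

print(current, is_c_locale(current))
```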

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/


