Tom Lane writes:
> >> Also, since "LC_COLLATE=en_US" seems to misbehave rather spectacularly
> >> on recent RedHat releases, I propose that initdb change "en_US" to "C"
> >> if it finds that setting. (Are there any platforms where there are
> >> non-bogus differences between the two?)
>
> > There *should* be differences and it is definitely not okay to mix them
> > up.
>
> I have now received positive proof that en_US sort order on RedHat is
> broken. For example, it asserts
> '/root/' < '/root0'
> but
> '/root/t' > '/root0'
> I defy you to find anyone in the US who will say that that is a
> reasonable definition of string collation.
That's certainly very odd, but Unixware does this too, so it's probably
some sort of standard. And a few other European/Latin locales I tried
also do this.

But here's another example of why C and en_US are different.
peter ~$ cat foo
Delta
écrire
Beta
alpha
gamma
peter ~$ LC_COLLATE=C sort foo
Beta
Delta
alpha
gamma
écrire
peter ~$ LC_COLLATE=en_US sort foo
alpha
Beta
Delta
écrire
gamma
The C locale sorts strictly by character code. But in the en_US locale
the accented letter is put into a "natural" position, and the upper and
lower case letters are grouped together. Intuitively, the en_US order is
the order in which you'd look things up in a dictionary.
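
To make the difference concrete, here is a rough sketch in Python. Plain
sorted() compares by code point, which gives the same result as the C
locale for this list. The dictionary_key helper is a hypothetical
stand-in that strips accents and folds case; it is only an approximation
of what en_US does, not the locale's actual rules.

```python
import unicodedata

# Same words as in the file "foo" above; \u00e9 is the precomposed e-acute.
words = ["Delta", "\u00e9crire", "Beta", "alpha", "gamma"]

def dictionary_key(word):
    # Hypothetical approximation of a dictionary-style order:
    # decompose accented letters, drop the combining marks, fold case.
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()

c_order = sorted(words)                         # strict code point order
dict_order = sorted(words, key=dictionary_key)  # accents and case ignored

print(c_order)
print(dict_order)
```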
This also explains (to me at least) the example you have above: When you
look up words in a dictionary you ignore "funny characters". My American
Heritage Dictionary explains:
: Entries are listed in alphabetical order without taking into account
: spaces or hyphens.
So at least this concept isn't that far out.
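
As a sketch of that idea, assume a comparison that simply drops every
character that is not a letter or digit before comparing. This is a
simplification I made up for illustration (ignore_punctuation_key is not
what any real locale implements), but it reproduces both results from
your example, while plain code point comparison does not.

```python
def ignore_punctuation_key(s):
    # Hypothetical simplification: keep only letters and digits,
    # the way a dictionary ignores spaces and hyphens.
    return "".join(ch for ch in s if ch.isalnum())

a, b, c = "/root/", "/root0", "/root/t"

# Dictionary-style comparison: "root" < "root0" and "roott" > "root0".
dict_a_lt_b = ignore_punctuation_key(a) < ignore_punctuation_key(b)
dict_c_gt_b = ignore_punctuation_key(c) > ignore_punctuation_key(b)

# Code point comparison (C locale behaviour): '/' sorts before '0',
# so both strings containing "/root/" sort before "/root0".
c_a_lt_b = a < b
c_c_gt_b = c > b

print(dict_a_lt_b, dict_c_gt_b)
print(c_a_lt_b, c_c_gt_b)
```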
> Do you think there are cases where setlocale(,NULL) will give back
> "POSIX" rather than "C"? We can certainly test for either.
I know there are (old) systems that reject LANG=C as an invalid locale,
but I don't know what setlocale returns there.
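
Testing for either name is cheap. A minimal sketch, using Python's
locale module as a stand-in for the C call (is_c_collation is a
hypothetical helper, and I am assuming "C" and "POSIX" are the only two
spellings that matter):

```python
import locale

def is_c_collation(name):
    # Treat both spellings of the default locale as equivalent.
    return name in ("C", "POSIX")

# Passing None queries the current setting without changing it,
# like setlocale(LC_COLLATE, NULL) in C.
current = locale.setlocale(locale.LC_COLLATE, None)
print(current, is_c_collation(current))
```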
--
Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/