Thread: OK, that's one LOCALE bug report too many...

OK, that's one LOCALE bug report too many...

From

Tom Lane

Date:

24 November 2000, 16:20:31

... and I am not going to allow 7.1 to go out without a fix for this
class of problems.  I'm fed up ;-)

As near as I can tell from the setlocale() man page, the only locale
categories that are really hazardous for us are LC_COLLATE and LC_CTYPE;
the other categories like LC_MONETARY affect only I/O routines, not
sort ordering, and so cannot result in corrupt indices.

I propose, therefore, that in an --enable-locale installation, initdb
should save its values for LC_COLLATE and LC_CTYPE in pg_control, and
backend startup should restore these settings from pg_control.  Other
locale categories will continue to be acquired from the postmaster
environment.  This will eliminate the class of bugs associated with
index corruption from not always starting the postmaster with the same
locale settings, while not forcing people to do an initdb to change
harmless settings.

Also, since "LC_COLLATE=en_US" seems to misbehave rather spectacularly
on recent RedHat releases, I propose that initdb change "en_US" to "C"
if it finds that setting.  (Are there any platforms where there are
non-bogus differences between the two?)

Finally, until we have a really bulletproof solution for LIKE indexing
optimization, I will disable that optimization if --enable-locale is
compiled *and* LC_COLLATE is not C.  Better to get "LIKE is slow" bug
reports than "LIKE gives wrong answers" bug reports.

Comments?  Anyone think that initdb should lock down more categories
than just these two?
        regards, tom lane

Re: OK, that's one LOCALE bug report too many...

From

Peter Eisentraut

Date:

24 November 2000, 17:08:26

Tom Lane writes:

> I propose, therefore, that in an --enable-locale installation, initdb
> should save its values for LC_COLLATE and LC_CTYPE in pg_control, and
> backend startup should restore these settings from pg_control.

Note that when these are unset there might still be a "catch-all" locale
value coming from the LANG env. var. (or LC_ALL on some systems).

> Also, since "LC_COLLATE=en_US" seems to misbehave rather spectacularly
> on recent RedHat releases, I propose that initdb change "en_US" to "C"
> if it finds that setting.  (Are there any platforms where there are
> non-bogus differences between the two?)

There *should* be differences and it is definitely not okay to mix them
up.

> Finally, until we have a really bulletproof solution for LIKE indexing
> optimization, I will disable that optimization if --enable-locale is
> compiled *and* LC_COLLATE is not C.  Better to get "LIKE is slow" bug
> reports than "LIKE gives wrong answers" bug reports.

(C or POSIX)

I have a question about that optimization:  If you have X LIKE 'foo%',
wouldn't it be enough to use X >= 'foo' (which certainly works for any
locale I've ever heard of)?  Why do you need the X <= 'foo???' at all?

> Comments?  Anyone think that initdb should lock down more categories
> than just these two?

Not sure whether LC_CTYPE is necessary.

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/

Re: OK, that's one LOCALE bug report too many...

From

Tom Lane

Date:

24 November 2000, 17:31:40

Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane writes:
>> I propose, therefore, that in an --enable-locale installation, initdb
>> should save its values for LC_COLLATE and LC_CTYPE in pg_control, and
>> backend startup should restore these settings from pg_control.

> Note that when these are unset there might still be a "catch-all" locale
> value coming from the LANG env. var. (or LC_ALL on some systems).

Actually, what I intend to do while writing pg_control is read the
current effective values via "setlocale(category, NULL)" --- then it
shouldn't matter where they came from, no?
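
A minimal sketch of that idea, not the actual initdb or pg_control code: setlocale(category, NULL) only queries the effective locale name, so the name can be captured once and re-applied later no matter which environment variable supplied it. The SavedLocale struct, the LOCALE_NAME_MAX limit, and both function names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hypothetical fixed-size slot; this is not the real pg_control layout. */
#define LOCALE_NAME_MAX 128

typedef struct SavedLocale
{
    char collate[LOCALE_NAME_MAX];
    char ctype[LOCALE_NAME_MAX];
} SavedLocale;

/* Copy the effective name of one category; 0 on success, -1 on failure. */
static int
copy_locale_name(int category, char *dst)
{
    /* A NULL second argument queries the locale without changing it. */
    const char *name = setlocale(category, NULL);

    if (name == NULL || strlen(name) >= LOCALE_NAME_MAX)
        return -1;
    /* Copy now: a later setlocale() call may overwrite the returned buffer. */
    strcpy(dst, name);
    return 0;
}

/* What an initdb-like tool could do: record the two hazardous categories. */
int
capture_locale(SavedLocale *out)
{
    assert(out != NULL);
    if (copy_locale_name(LC_COLLATE, out->collate) != 0)
        return -1;
    return copy_locale_name(LC_CTYPE, out->ctype);
}

/* What a backend could do at startup: force the recorded values back. */
int
restore_locale(const SavedLocale *saved)
{
    assert(saved != NULL);
    if (setlocale(LC_COLLATE, saved->collate) == NULL)
        return -1;
    if (setlocale(LC_CTYPE, saved->ctype) == NULL)
        return -1;
    return 0;
}
```

Other categories such as LC_MONETARY are deliberately left alone here, matching the proposal that they keep coming from the postmaster environment.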

This brings up a question I had just come across while doing further
research: backend/main/main.c does 

#ifdef USE_LOCALE
    setlocale(LC_CTYPE, "");    /* take locale information from an
                                 * environment */
    setlocale(LC_COLLATE, "");
    setlocale(LC_MONETARY, "");
#endif

which seems a little odd --- why not setlocale(LC_ALL, "") ?  Karel
Zak said in a thread around 8/15/00 that this is deliberate, but
I don't quite see why.

>> Also, since "LC_COLLATE=en_US" seems to misbehave rather spectacularly
>> on recent RedHat releases, I propose that initdb change "en_US" to "C"
>> if it finds that setting.  (Are there any platforms where there are
>> non-bogus differences between the two?)

> There *should* be differences and it is definitely not okay to mix them
> up.

I have now received positive proof that en_US sort order on RedHat is
broken.  For example, it asserts
	'/root/' < '/root0'
but
	'/root/t' > '/root0'
I defy you to find anyone in the US who will say that that is a
reasonable definition of string collation.  

Of course, if you prefer the notion of disabling LIKE optimization
on a default RedHat installation, we can go ahead and accept en_US.
But I say it's broken and we shouldn't use it.

>> Finally, until we have a really bulletproof solution for LIKE indexing
>> optimization, I will disable that optimization if --enable-locale is
>> compiled *and* LC_COLLATE is not C.  Better to get "LIKE is slow" bug
>> reports than "LIKE gives wrong answers" bug reports.

> (C or POSIX)

Do you think there are cases where setlocale(,NULL) will give back
"POSIX" rather than "C"?  We can certainly test for either.

> I have a question about that optimization:  If you have X LIKE 'foo%',
> wouldn't it be enough to use X >= 'foo' (which certainly works for any
> locale I've ever heard of)?  Why do you need the X <= 'foo???' at all?

Because you need a two-sided index constraint, not a one-sided one.
Otherwise you're probably better off doing a sequential scan ---
scanning 50% of the table (on average) via an index will be slower
than sequential.
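
To illustrate what a two-sided constraint can look like, here is a sketch that assumes plain byte-wise comparison (that is, C-locale collation). It turns the prefix of X LIKE 'foo%' into the range 'foo' <= X < 'fop' by incrementing the last byte. This is one possible construction, not necessarily what the PostgreSQL optimizer does, and the function names are made up for the example. It is exactly the step that stops being valid once a locale compares strings in some other order.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a malloc'd exclusive upper bound for all strings that start with
 * "prefix", under strcmp() ordering.  Returns NULL if no bound exists
 * (empty prefix, or every byte is 0xFF) or if allocation fails.
 */
char *
prefix_upper_bound(const char *prefix)
{
    size_t  len;
    char   *bound;

    assert(prefix != NULL);
    len = strlen(prefix);
    bound = malloc(len + 1);
    if (bound == NULL)
        return NULL;
    memcpy(bound, prefix, len + 1);

    while (len > 0)
    {
        unsigned char last = (unsigned char) bound[len - 1];

        if (last < 0xFF)
        {
            /* Bump the last byte and cut the string there. */
            bound[len - 1] = (char) (last + 1);
            bound[len] = '\0';
            return bound;
        }
        len--;      /* 0xFF cannot be incremented: drop it and carry left */
    }

    free(bound);
    return NULL;
}

/* Nonzero if prefix <= s < bound in strcmp() order. */
int
in_prefix_range(const char *s, const char *prefix, const char *bound)
{
    return strcmp(s, prefix) >= 0 && strcmp(s, bound) < 0;
}
```

With both bounds available, an index scan can start at 'foo' and stop at 'fop' instead of running to the end of the index.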

>> Comments?  Anyone think that initdb should lock down more categories
>> than just these two?

> Not sure whether LC_CTYPE is necessary.

I'm not either, but I'm afraid to let it float...
        regards, tom lane

Re: OK, that's one LOCALE bug report too many...

From

Peter Eisentraut

Date:

24 November 2000, 18:12:56

Tom Lane writes:

> >> Also, since "LC_COLLATE=en_US" seems to misbehave rather spectacularly
> >> on recent RedHat releases, I propose that initdb change "en_US" to "C"
> >> if it finds that setting.  (Are there any platforms where there are
> >> non-bogus differences between the two?)
> 
> > There *should* be differences and it is definitely not okay to mix them
> > up.
> 
> I have now received positive proof that en_US sort order on RedHat is
> broken.  For example, it asserts
>     '/root/' < '/root0'
> but
>     '/root/t' > '/root0'
> I defy you to find anyone in the US who will say that that is a
> reasonable definition of string collation.  

That's certainly very odd, but Unixware does this too, so it's probably
some sort of standard.  And a few other European/Latin locales I tried
also do this.

But here's another example of why C and en_US are different.

peter ~$ cat foo
Delta
écrire
Beta
alpha
gamma
peter ~$ LC_COLLATE=C sort foo
Beta
Delta
alpha
gamma
écrire
peter ~$ LC_COLLATE=en_US sort foo
alpha
Beta
Delta
écrire
gamma

The C locale sorts strictly by character code.  But in the en_US locale
the accented letter is put into a "natural" position, and the upper and
lower case letters are grouped together.  Intuitively, the en_US order is
the one in which you'd look things up in a dictionary.
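
The same contrast can be sketched in C, under the assumption that the word list is UTF-8 encoded: strcmp() always compares by byte value, while strcoll() follows whatever LC_COLLATE the program has selected with setlocale(). The function names here are illustrative only.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: strict byte order, independent of any locale. */
static int
cmp_bytes(const void *a, const void *b)
{
    return strcmp(*(const char *const *) a, *(const char *const *) b);
}

/* qsort() comparator: order defined by the current LC_COLLATE setting. */
static int
cmp_locale(const void *a, const void *b)
{
    return strcoll(*(const char *const *) a, *(const char *const *) b);
}

/* Sort an array of strings in place, by bytes or by locale rules. */
void
sort_words(const char **words, size_t n, int use_locale)
{
    assert(words != NULL);
    qsort(words, n, sizeof(words[0]), use_locale ? cmp_locale : cmp_bytes);
}
```

Calling sort_words() with use_locale set to 0 reproduces the LC_COLLATE=C listing above. Calling setlocale(LC_COLLATE, "en_US") first and then passing 1 should give the dictionary-style listing, provided that locale is installed on the system.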

This also explains (to me at least) the example you have above:  When you
look up words in a dictionary you ignore "funny characters".  My American
Heritage Dictionary explains:

: Entries are listed in alphabetical order without taking into account
: spaces or hyphens.

So at least this concept isn't that far out.

> Do you think there are cases where setlocale(,NULL) will give back
> "POSIX" rather than "C"?  We can certainly test for either.

I know there are (old) systems that reject LANG=C as invalid locale, but I
don't know what setlocale returns there.

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/

Re: OK, that's one LOCALE bug report too many...

From

Tom Lane

Date:

24 November 2000, 18:45:24

Peter Eisentraut <peter_e@gmx.net> writes:
>> I have now received positive proof that en_US sort order on RedHat is
>> broken.  For example, it asserts
>> '/root/' < '/root0'
>> but
>> '/root/t' > '/root0'
>> I defy you to find anyone in the US who will say that that is a
>> reasonable definition of string collation.  

> That's certainly very odd, but Unixware does this too, so it's probably
> some sort of standard.  And a few other European/Latin locales I tried
> also do this.

I don't have very many platforms to try, but HPUX does not think that
en_US sorts that way.  It may well be standard in some European locales,
but there's a reason why C locale acts the way it does: that behavior is
the accepted one on this side of the pond.  Sufficiently well accepted
that it was quite a few years before American programmers noticed there
was any reason to behave differently ;-)

> This also explains (to me at least) the example you have above:  When you
> look up words in a dictionary you ignore "funny characters".  My American
> Heritage Dictionary explains:
> : Entries are listed in alphabetical order without taking into account
> : spaces or hyphens.

That's workable for an English dictionary, where symbols other than
letters are (a) rare and (b) usually irrelevant to the meaning.  Do
you think anyone would tolerate treating "/" as a noise character in a
listing of Unix filenames, to take one counterexample?  Unfortunately,
en_US does so.

This'd be less of a problem if we had support for per-column charset
and locale specifications.  There'd be no objection to sorting a column
that contains only (or mostly) words like that.  But I've got strong
doubts that the average user of a default RedHat installation expects
*all* data to get sorted that way, or that he wants us to honor a
default that he didn't ask for to the extent of disabling LIKE
optimization to make it work.

I suppose we could do it that way and add a FAQ entry:
Q.  Why are my LIKE queries so slow?
A.  Change your locale to C, then dump, initdb, reload.

But somehow I don't think that'll go over well...
        regards, tom lane

Re: OK, that's one LOCALE bug report too many...

From

Lamar Owen

Date:

24 November 2000, 19:11:04

Tom Lane wrote:
> that contains only (or mostly) words like that.  But I've got strong
> doubts that the average user of a default RedHat installation expects
> *all* data to get sorted that way, or that he wants us to honor a
> default that he didn't ask for to the extent of disabling LIKE
> optimization to make it work.

The change in collation for RedHat >6.0 is deliberate -- and conforms to
ISO standards.  There was noise in an unmentionable list at an
unmentionable time about why it was this way -- and the result was a
seesaw -- it was almost turned back to 'conventional' collation, but was
then put back into ISO-conforming shape.

Ask Trond (teg@redhat.com) about it.

> I suppose we could do it that way and add a FAQ entry:
> 
>         Q.  Why are my LIKE queries so slow?
> 
>         A.  Change your locale to C, then dump, initdb, reload.
> 
> But somehow I don't think that'll go over well...

Methinks you are very right.  Very right.

I am not at all happy about the 'broken' RedHat locale -- the quick and
dirty solution is to remove or rename '/etc/sysconfig/i18n' -- but that
doesn't cure the root issue.

Oh, and to make matters that much worse, on a RedHat system it doesn't
matter if you build with or without --enable-locale -- locale support is
in the libc used, and locale support gets used regardless of what you
select on the configure line :-(.   Been there; distributed that in the
6.5.x 'nl' RPM series.

But it sounds to me like you're on the right track, Tom.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

Re: OK, that's one LOCALE bug report too many...

From

Tom Lane

Date:

24 November 2000, 19:21:11

Lamar Owen <lamar.owen@wgcr.org> writes:
> Oh, and to make matters that much worse, on a RedHat system it doesn't
> matter if you build with or without --enable-locale -- locale support is
> in the libc used, and locale support gets used regardless of what you
> select on the configure line :-(.

I don't follow.  Of course locale support is in libc; where else would
it be?  But without --enable-locale, we will never call setlocale().
Surely even RedHat is not so broken that they default to non-C locale
in a program that has not called setlocale()?  That directly contravenes
the letter of the ISO C standard, IIRC.
        regards, tom lane

Re: OK, that's one LOCALE bug report too many...

From

Tom Lane

Date:

24 November 2000, 19:22:07

Lamar Owen <lamar.owen@wgcr.org> writes:
> I am not at all happy about the 'broken' RedHat locale -- the quick and
> dirty solution is to remove or rename '/etc/sysconfig/i18n' -- but that
> doesn't cure the root issue.

Actually, that suggestion points out that just nailing down LC_COLLATE
at initdb time isn't sufficient, at least not on systems where libc's
locale behavior depends on user-alterable external files.  Even with
my proposed initdb change in place, a user could still corrupt his
indices by removing or replacing /etc/sysconfig/i18n.  Ugh.  Not sure
I see a way around this, though, short of dumping libc and bringing
along our own locale support.

Of course, we might end up doing that anyway to support column-specific
locales.  I suspect setlocale() is far too slow on many implementations
to be executed again for every string comparison :-(
        regards, tom lane

Re: OK, that's one LOCALE bug report too many...

From

Tom Lane

Date:

24 November 2000, 19:33:06

Possible compromise: let initdb accept en_US, but have it spit out a
warning message:

NOTICE: initializing database with en_US collation order.
If you're not certain that's what you want, then it's probably not what
you want.  We recommend you set LC_COLLATE to "C" and re-initdb.
For more information see <appropriate place in admin guide>

Thoughts?
        regards, tom lane

Re: OK, that's one LOCALE bug report too many...

From

Lamar Owen

Date:

24 November 2000, 19:45:34

Tom Lane wrote:
> Lamar Owen <lamar.owen@wgcr.org> writes:
> > Oh, and to make matters that much worse, on a RedHat system it doesn't
> > matter if you build with or without --enable-locale -- locale support is
> > in the libc used, and locale support gets used regardless of what you
> > select on the configure line :-(.
> But without --enable-locale, we will never call setlocale().
> Surely even RedHat is not so broken that they default to non-C locale
> in a program that has not called setlocale()?  That directly contravenes
> the letter of the ISO C standard, IIRC.

I just know this -- regression tests failed the same way with the 'nl'
non-locale RPM's as they did (and do) with the regular locale-enabled
RPM's.  Collation was the same, regardless of the --enable-locale
setting.  I got lots of 'bug' reports about the RPM's failing
regression, giving an unexpected sort order (see the archives -- the
best model thread's start post is:
http://www.postgresql.org/mhonarc/pgsql-hackers/1999-12/msg00587.html). 
I was pretty ignorant back then of some of these issues :-).

Apparently RedHat is _that_ broken in that respect (among others). 
Thankfully some of RedHat's more egregious faults have been fixed in
7.....

But then again what Unix isn't broken in some respect :-).
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

Re: OK, that's one LOCALE bug report too many...

From

Tom Lane

Date:

24 November 2000, 20:27:46

Lamar Owen <lamar.owen@wgcr.org> writes:
> Collation was the same, regardless of the --enable-locale
> setting.  I got lots of 'bug' reports about the RPM's failing
> regression, giving an unexpected sort order (see the archives -- the
> best model thread's start post is:
> http://www.postgresql.org/mhonarc/pgsql-hackers/1999-12/msg00587.html). 

Hmm.  I reviewed that thread and found this comment from you:

: > Any differences in the environment variables maybe?
: 
: In a nutshell, yes.  /etc/sysconfig/i18n on the fresh install sets LANG,
: LC_ALL, and LINGUAS all to be "en_US".  The upgraded machine at home doesn't
: have an /etc/sysconfig/i18n -- nor does the RH 6.0 box.

That makes it sound like /etc/sysconfig/i18n is not what I'd assumed
(namely, a data file read at runtime by libc) but only a bit of shell
script that sets exported environment variables during bootup.  I don't
have that file here, so could you enlighten me as to exactly what it
is/does?

If it is just setting some default environment variables for the system,
then it isn't anything we can't deal with by forcing setlocale() at
postmaster start.  That'd make me feel a lot better ;-)
        regards, tom lane

Re: OK, that's one LOCALE bug report too many...

From

Peter Eisentraut

Date:

24 November 2000, 20:31:24

Tom Lane writes:

> Possible compromise: let initdb accept en_US, but have it spit out a
> warning message:
> 
> NOTICE: initializing database with en_US collation order.
> If you're not certain that's what you want, then it's probably not what
> you want.  We recommend you set LC_COLLATE to "C" and re-initdb.
> For more information see <appropriate place in admin guide>

I certainly don't like treating en_US specially, when in fact all locales
are affected by this.  You could print a general notice that the database
system will be initialized with a (non-C, non-POSIX) locale and that this
may/will affect the performance in certain cases.  Maybe a
--disable-locale switch to initdb as well?

But IMHO we're not in the business of nitpicking or telling people how to
write, install, or use their operating systems when the issue is not a
show-stopper type, but really an aesthetics/convenience issue.

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/

Re: OK, that's one LOCALE bug report too many...

From

Tom Lane

Date:

24 November 2000, 20:36:46

Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane writes:
>> Possible compromise: let initdb accept en_US, but have it spit out a
>> warning message:

> I certainly don't like treating en_US specially, when in fact all locales
> are affected by this.

Well, my thought was that another locale, say en_FR, would be far more
likely to be something that the system's user had explicitly chosen to
use at some point, and thus there's less reason to suppose that he
doesn't know what he's getting into.  However, I have no objection to
printing such a complaint whenever the locale is one that will defeat
LIKE optimization --- how does that sound?
        regards, tom lane

Re: OK, that's one LOCALE bug report too many...

From

Lamar Owen

Date:

24 November 2000, 21:05:50

Tom Lane wrote:
> Lamar Owen <lamar.owen@wgcr.org> writes:
> > Collation was the same, regardless of the --enable-locale
> > setting.  I got lots of 'bug' reports about the RPM's failing
> Hmm.  I reviewed that thread and found this comment from you:
> : In a nutshell, yes.  /etc/sysconfig/i18n on the fresh install sets LANG,
> : LC_ALL, and LINGUAS all to be "en_US".  The upgraded machine at home doesn't
> : have an /etc/sysconfig/i18n -- nor does the RH 6.0 box.
> That makes it sound like /etc/sysconfig/i18n is not what I'd assumed
> (namely, a data file read at runtime by libc) but only a bit of shell
> script that sets exported environment variables during bootup.  I don't
> have that file here, so could you enlighten me as to exactly what it
> is/does?

Oh, yes, sorry -- /etc/sysconfig/i18n is read during sysinit,
immediately before starting swap (IOW, it's only read the once).  On my
RH 6.2 box, it is the following line:

----- /etc/sysconfig/i18n -------
LANG="en_US"
------------- EOF ---------------

It's the same on a fresh RedHat 7.0 install.

> If it is just setting some default environment variables for the system,
> then it isn't anything we can't deal with by forcing setlocale() at
> postmaster start.  That'd make me feel a lot better ;-)

Then you need to feel a lot better :-).....
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

Re: OK, that's one LOCALE bug report too many...

From

Don Baccus

Date:

24 November 2000, 22:04:42

At 07:32 PM 11/24/00 -0500, Tom Lane wrote:
>Possible compromise: let initdb accept en_US, but have it spit out a
>warning message:
>
>NOTICE: initializing database with en_US collation order.
>If you're not certain that's what you want, then it's probably not what
>you want.  We recommend you set LC_COLLATE to "C" and re-initdb.
>For more information see <appropriate place in admin guide>
>
>Thoughts?

Are you SURE you want to use en_US collation? [no]

(ask the question, default to no?)

Yes, a question in initdb is ugly, this whole thing is ugly.



- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.

Re: OK, that's one LOCALE bug report too many...

From

Tom Lane

Date:

24 November 2000, 22:08:34

Don Baccus <dhogaza@pacifier.com> writes:
> Are you SURE you want to use en_US collation? [no]
> (ask the question, default to no?)

> Yes, a question in initdb is ugly, this whole thing is ugly.

A question in initdb won't fly for RPM installations, since the RPMs
try to do initdb themselves (or am I wrong about that?)
        regards, tom lane

Re: OK, that's one LOCALE bug report too many...

From

Lamar Owen

Date:

24 November 2000, 22:40:33

Tom Lane wrote:
> Don Baccus <dhogaza@pacifier.com> writes:
> > Are you SURE you want to use en_US collation? [no]
> > (ask the question, default to no?)
> > Yes, a question in initdb is ugly, this whole thing is ugly.
> A question in initdb won't fly for RPM installations, since the RPMs
> try to do initdb themselves (or am I wrong about that?)

The RPMset initdb's the first time the initscript is run to start
postmaster, not at installation time.

A command-line argument to initdb would suffice to override -- maybe a
'--initlocale' parameter??  Now, what sort of default for
--initlocale.....
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

Re: OK, that's one LOCALE bug report too many...

From

Tom Lane

Date:

24 November 2000, 22:44:28

Lamar Owen <lamar.owen@wgcr.org> writes:
> A command-line argument to initdb would suffice to override -- maybe a
> '--initlocale' parameter??

Hardly need one, when setting LANG or LC_ALL will do just as well.

> Now, what sort of default for --initlocale.....

I think your complaints about RedHat's default are right back in your
lap ;-).  Do you want to ignore their default, or not?
        regards, tom lane

Re: OK, that's one LOCALE bug report too many...

From

Lamar Owen

Date:

24 November 2000, 23:13:08

Tom Lane wrote:
> Lamar Owen <lamar.owen@wgcr.org> writes:
> I think your complaints about RedHat's default are right back in your
> lap ;-).  Do you want to ignore their default, or not?

Yes, I want to ignore their default.  This problem is more than just
cosmetic, thanks to the bugs that sparked this thread.

I can do things in the initscript if necessary.  That only helps the
RPM's, though, not those from-source RedHat installations.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

Re: OK, that's one LOCALE bug report too many...

From

Peter Eisentraut

Date:

25 November 2000, 10:38:26

Lamar Owen writes:

> Yes, I want to ignore their default.

If you want to do that then the infinitely better solution is to compile
without locale support in the first place.  (Make the locale-enabled
server a separate package.)  Alternatively, set the locale of the postgres
user to POSIX.

> I can do things in the initscript if necessary.  That only helps the
> RPM's, though, not those from-source RedHat installations.

The subject of this whole discussion was IIRC the "default Red Hat
installation".  Those who compile from source can always make more
informed decisions about what features to enable.

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/

Re: OK, that's one LOCALE bug report too many...

From

Peter Eisentraut

Date:

25 November 2000, 10:44:31

Tom Lane writes:

> > I certainly don't like treating en_US specially, when in fact all locales
> > are affected by this.
> 
> Well, my thought was that another locale, say en_FR, would be far more
> likely to be something that the system's user had explicitly chosen to
> use at some point,

IIRC, the default locale is chosen during the installation process of Red
Hat, so any locale is explicitly chosen.  If Red Hat does not provide a
means to set the C locale as the default, that is Red Hat's fault.  But
then it should also be Red Hat's job (and Red Hat's decision) to install
PostgreSQL in a certain way or other to account for that.  Compiles from
source don't count here, those users enabled locale explicitly anyway.

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/

Re: OK, that's one LOCALE bug report too many...

From

Lamar Owen

Date:

25 November 2000, 14:39:08

Peter Eisentraut wrote:
> Lamar Owen writes:
> > Yes, I want to ignore their default.
> If you want to do that then the infinitely better solution is to compile
> without locale support in the first place.  (Make the locale-enabled
> server a separate package.)  Alternatively, set the locale of the postgres
> user to POSIX.

Ok, let me repeat -- the '--enable-locale' setting will not affect the
collation sequence problem on RedHat.  If you set PostgreSQL to use
locale, it uses it.  If you configure PostgreSQL to not use locale, the
collation set by LANG, LC_ALL, or LC_COLLATE is _STILL_ honored, thanks
to the libc used.

During the 6.5.x cycle, I built, for performance reasons, RPM's without
locale/multibyte support.  These were referred to as the 'nl' RPM's. 
Please see the thread I referred to to see how running with the
'non-locale' RPM's did not in the least solve the problem or change the
symptoms.

Setting the locale environment for the postmaster process is a
possibility, but I'll have to do some testing to see if there are any
interaction problems.  And this still only helps RPM users, as my
initscript is not part of the canonical tarball.

> > I can do things in the initscript if necessary.  That only helps the
> > RPM's, though, not those from-source RedHat installations.
> The subject of this whole discussion was IIRC the "default Red Hat
> installation".  Those who compile from source can always make more
> informed decisions about what features to enable.

Those who compile from source and configure for no locale support will
get a nasty surprise on RedHat 6.1 and later.

Even though a different library function is used to do the comparison
for sorts and orderings, libc (in particular, glibc 2.1) _still_ uses
the LC_ALL, LANG, or LC_COLLATE setting to determine collation. For the
--enable-locale case, the function used is  strcoll(); if not,
strncmp(). See varstr_cmp() in src/backend/utils/adt/varlena.c.

IOW, it is advisable to always enable locale on RedHat, as then you can
at least know what to expect.  And you then will still get unexpected
results unless you do some locale work -- and, unfortunately, RedHat
6.x's locale documentation was sketchy at best; nonexistent at worst.  I
haven't seen RedHat 7's printed documentation yet, so I can't comment on
it.

For reference on this issue, please see the archives, in particular the
following messages:
http://www.postgresql.org/mhonarc/pgsql-hackers/1999-12/msg00678.html
http://www.postgresql.org/mhonarc/pgsql-hackers/1999-12/msg00685.html
(where I got the function names above....)
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

Re: OK, that's one LOCALE bug report too many...

From

Peter Eisentraut

Date:

25 November 2000, 16:39:12

Lamar Owen writes:

> Ok, let me repeat -- the '--enable-locale' setting will not affect the
> collation sequence problem on RedHat.  If you set PostgreSQL to use
> locale, it uses it.  If you configure PostgreSQL to not use locale, the
> collation set by LANG, LC_ALL, or LC_COLLATE is _STILL_ honored, thanks
> to the libc used.

Well, I'm looking at Red Hat 7.0 here and the locale variables are most
certainly getting ignored in the default compile.  Moreover, at no point
did strncmp() in glibc behave as you claim.  You can look at it yourself
here:

http://subversions.gnu.org/cgi-bin/cvsweb/glibc/sysdeps/generic/strncmp.c

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/

Re: OK, that's one LOCALE bug report too many...

From

Tom Lane

Date:

25 November 2000, 17:15:30

Peter Eisentraut <peter_e@gmx.net> writes:
> Lamar Owen writes:
>> Ok, let me repeat -- the '--enable-locale' setting will not affect the
>> collation sequence problem on RedHat.  If you set PostgreSQL to use
>> locale, it uses it.  If you configure PostgreSQL to not use locale, the
>> collation set by LANG, LC_ALL, or LC_COLLATE is _STILL_ honored, thanks
>> to the libc used.

> Well, I'm looking at Red Hat 7.0 here and the locale variables are most
> certainly getting ignored in the default compile.  Moreover, at no point
> did strncmp() in glibc behave as you claim.

I'm having a hard time believing Lamar's recollection, also.  I wonder
if there could have been some other factor involved?  One possible line
of thought: a non-locale-enabled compilation, installed to replace a
locale-enabled one, would behave rather inconsistently if run on the
same database used by the locale-enabled version (since indexes will
still be in locale order).  Depending on what tests you did, you might
well think that it was still running locale-enabled.

BTW: as of my commits of an hour ago, the above failure mode is no
longer possible, since a non-locale-enabled Postgres will now refuse to
start up in a database that shows any locale other than 'C' in pg_control.
        regards, tom lane
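
The startup refusal described in the message above could look roughly like the following. This is a sketch under stated assumptions, not the committed code: the ControlLocale struct stands in for whatever pg_control actually stores, and the function name and message text are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of the locale names recorded at initdb time. */
typedef struct ControlLocale
{
    const char *lc_collate;
    const char *lc_ctype;
} ControlLocale;

/*
 * Return NULL if the server may start on this database, otherwise a
 * message explaining the refusal.  "locale_enabled" is nonzero for a
 * build configured with --enable-locale.
 */
const char *
startup_locale_error(const ControlLocale *ctl, int locale_enabled)
{
    assert(ctl != NULL && ctl->lc_collate != NULL && ctl->lc_ctype != NULL);

    /* A locale-enabled build can restore whatever was recorded. */
    if (locale_enabled)
        return NULL;

    /* A build without locale support can only honor plain "C". */
    if (strcmp(ctl->lc_collate, "C") != 0 || strcmp(ctl->lc_ctype, "C") != 0)
        return "database was initialized with a non-C locale, "
               "but this server was built without locale support";

    return NULL;
}
```

Refusing to start is what rules out the failure mode in question: a non-locale build can no longer silently run against indexes that were built in locale order.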

Re: OK, that's one LOCALE bug report too many...

From

Bruce Momjian

Date:

25 November 2000, 18:04:21

> I'm having a hard time believing Lamar's recollection, also.  I wonder
> if there could have been some other factor involved?  One possible line
> of thought: a non-locale-enabled compilation, installed to replace a
> locale-enabled one, would behave rather inconsistently if run on the
> same database used by the locale-enabled version (since indexes will
> still be in locale order).  Depending on what tests you did, you might
> well think that it was still running locale-enabled.
> 
> BTW: as of my commits of an hour ago, the above failure mode is no
> longer possible, since a non-locale-enabled Postgres will now refuse to
> start up in a database that shows any locale other than 'C' in pg_control.

Do locale-enabled compiles have the LIKE optimization disabled always?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: OK, that's one LOCALE bug report too many...

From

Tom Lane

Date:

25 November 2000, 18:13:51

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Do locale-enabled compiles have the LIKE optimization disabled always?

No.  They do a run-time check to see what locale is active.
        regards, tom lane
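
A run-time check of that kind might be sketched as follows. The helper names are hypothetical, and accepting "POSIX" alongside "C" is an assumption taken from the earlier "(C or POSIX)" remark in this thread rather than from the actual source.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Nonzero if the name denotes the byte-order locale ("C" or "POSIX"). */
int
is_c_locale_name(const char *name)
{
    assert(name != NULL);
    return strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0;
}

/*
 * Nonzero if the process currently collates in byte order, in which case
 * turning a LIKE prefix into an index range is safe.
 */
int
like_optimization_safe(void)
{
    /* Query only: a NULL second argument leaves the locale unchanged. */
    const char *current = setlocale(LC_COLLATE, NULL);

    return current != NULL && is_c_locale_name(current);
}
```

A program that has never called setlocale() starts out in the "C" locale, so this check passes there. Once setlocale(LC_COLLATE, "") has picked up something like en_US from the environment, the check fails and the planner would fall back to the slow but correct path.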

Re: OK, that's one LOCALE bug report too many...

From

teg@redhat.com (Trond Eivind Glomsrød)

Date:

26 November 2000, 16:12:19

Tom Lane <tgl@sss.pgh.pa.us> writes:

> Also, since "LC_COLLATE=en_US" seems to misbehave rather spectacularly
> on recent RedHat releases, I propose that initdb change "en_US" to "C"
> if it finds that setting.  

It does not misbehave in glibc (it's not Red Hat specific).
Basically, glibc is the old

From a discussion on a semi-internal list, written by Alan Cox:

************************************************************************
I read the ISO doc (god, it's boring)

Ok

Ulrich is right for the spec. It's the official correct filing order for
more than just computing

I think the right answer may be this

Default to ISOblah including sort remaining sorting AbBb..
Document this and also how to switch just the collation series to Unix
style in the README files and docs that come with the release (like we
documented how to turn off color ls)

Ultimately this comes down to:
Unix behaviour since 197x versus librarians and others since considerably
earlier. We are breaking Unix behaviour but I can now sort of appreciate
the thinking behind this.

************************************************************************

-- 
Trond Eivind Glomsrød
Red Hat, Inc.

Re: OK, that's one LOCALE bug report too many...

From

teg@redhat.com (Trond Eivind Glomsrød)

Date:

26 November 2000, 16:21:27

teg@redhat.com (Trond Eivind Glomsrød) writes:

> Tom Lane <tgl@sss.pgh.pa.us> writes:
> 
> > Also, since "LC_COLLATE=en_US" seems to misbehave rather spectacularly
> > on recent RedHat releases, I propose that initdb change "en_US" to "C"
> > if it finds that setting.  
> 
> It does not misbehave in glibc (it's not Red Hat specific).
> Basically, glibc is the old

Oops, here's the rest:

glibc with the C/POSIX locale will make things work the old computer
way:
AB...Zab..z

With en_US, it works the ISO way:
A/a B/b ... Z/z 

-- 
Trond Eivind Glomsrød
Red Hat, Inc.

Re: OK, that's one LOCALE bug report too many...

From

Karel Zak

Date:

27 November 2000, 05:09:47

On Fri, 24 Nov 2000, Tom Lane wrote:

> Peter Eisentraut <peter_e@gmx.net> writes:
> > Tom Lane writes:
> >> I propose, therefore, that in an --enable-locale installation, initdb
> >> should save its values for LC_COLLATE and LC_CTYPE in pg_control, and
> >> backend startup should restore these settings from pg_control.
> 
> > Note that when these are unset there might still be a "catch-all" locale
> > value coming from the LANG env. var. (or LC_ALL on some systems).
> 
> Actually, what I intend to do while writing pg_control is read the
> current effective values via "setlocale(category, NULL)" --- then it
> shouldn't matter where they came from, no?
> 
> This brings up a question I had just come across while doing further
> research: backend/main/main.c does 
> 
> #ifdef USE_LOCALE
>     setlocale(LC_CTYPE, "");    /* take locale information from an
>                                  * environment */
>     setlocale(LC_COLLATE, "");
>     setlocale(LC_MONETARY, "");
> #endif
> 
> which seems a little odd --- why not setlocale(LC_ALL, "") ?  Karel
> Zak said in a thread around 8/15/00 that this is deliberate, but
> I don't quite see why.

LC_ALL would also set:

    LC_NUMERIC and LC_TIME

In the backend we use some locale-sensitive routines like strftime() and
sprintf() (and more?).

timeofday() produces its output via strftime(), so if you set LC_ALL, a
query like:

    select timeofday()::timestamp;

will (IMHO) crash.

With float numbers and the decimal point I am not sure. If *all* numbers
follow the locale setting, and all routines and utilities expect the
locale-style decimal point, we will probably not see a problem. But what
happens in a client program if the FE knows nothing about the current BE
setting? The BE sends a number with the locale's (Czech) decimal point,
"123,456", the FE is set to "en_US", and the result of the client's atof()
is "123.000"...

And so on.

We need *robust* and correct BE<->FE and column-specific locale support;
without this we can use the locale-sensitive to_char() for numbers and pray
and hope that everything in PG is right :-)

We need (TODO?):

- column-specific locale setting
- an FE routine for obtaining a column's locale setting, like
  PQflocale(const PGresult *res, int field_index);
- on-the-fly recoding of numbers (and date/time?!) if the BE and FE use
  different locales
- rebuilding indexes for a new locale setting
- fast locale information for date/time and support for locale-sensitive
  date/time parsing (IMHO almost impossible to write this)

... etc.
 
It is much too long a way to LC_ALL.
                Karel

PS. IMHO the current PG locale setting is not bad. I know of bigger
problems, for example the non-existent error codes and the thread-ignorant
FE lib. With these problems it is not possible to write a good, large, and
robust FE.

SV: OK, that's one LOCALE bug report too many...

From

"Jarmo Paavilainen"

Date:

27 November 2000, 07:32:22

Hi,

...
> LC_NUMERIC and LC_TIME
...
> timeofday() produces its output via strftime(); if you set LC_ALL, a query
> like: select timeofday()::timestamp;

Actually *I would* expect it to return a localized string. But then again I
always expect BE to use '.' as decimal point ( I must be damaged :-/ ).

...
> We need *robust* and correct BE<->FE and column-specific locale support,

I agree :-) And the easiest (and only robust) way would be to define which
char is the decimal point and how a date/time must be formatted to be
accepted in an INSERT or SELECT, and leave the job of localization to the
FE. (I do not know what SQL9_ says about this, and frankly I do not care.)

And then to sorting (and comparing) of strings. PostgreSQL should decide on
one charset (UTF8, UTF16) and expect clients (FE) to enforce that. Yes,
some sorting would be wrong, but in most cases it would be correct.
PostgreSQL will never be able to do correct indexing in a mixed-locale
environment if it does not have one index tree (hash or whatever) per locale.
But with UTF8 it could do a good (if not perfect) job.

Something like this for sorting:
noise-chars-in-any-order..0..1..A..a..e..é..E..È..U..Ü..u..ü..Z..z..Ö..ö
And as time/date/timestamp format:
2000.11.27 12:55.01.000000
would be a good compromise.

This may feel like moving the trouble from BE to FE, but *I think* this
is the only solution that would always work (if not perfectly...). And this
would remove the whole "--enable-locale, which locale to use" problem. Also,
if someone wanted to connect with a new, unknown locale, it would work
without changes on the BE side.

And to the erroneous results from "SELECT * FROM myTable where strString >
'abc'". This suggestion would not solve all of those, but it would solve
most of them. And *I think* any comparison other than = and != on a string
is prone to errors (even as an optimization of LIKE).

// Jarmo

Re: OK, that's one LOCALE bug report too many...

From

Lamar Owen

Date:

27 November 2000, 13:40:31

Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > Lamar Owen writes:
> >> Ok, let me repeat -- the '--enable-locale' setting will not affect the
> >> collation sequence problem on RedHat.  If you set PostgreSQL to use
> >> locale, it uses it.  If you configure PostgreSQL to not use locale, the
> >> collation set by LANG, LC_ALL, or LC_COLLATE is _STILL_ honored, thanks
> >> to the libc used.
> > Well, I'm looking at Red Hat 7.0 here and the locale variables are most
> > certainly getting ignored in the default compile.  Moreover, at no point
> > did strncmp() in glibc behave as you claim.

Try on RH 6.x.  It is possible RH 7 has this behavior fixed -- I have
not built _any_ no-locale RPM's since 6.5.3 -- and the last OS I built
that on was RH 6.2.  Amend my statement above to read 'collation
sequence problem on RedHat 6.x, where x>0.'

> I'm having a hard time believing Lamar's recollection, also.

It's in the archives.  Not just my (often bad) recollections..... :-)

Of course, RH 7.0's behavior and RH 6.1's behavior (which was the
version I reported having the problem in the archive message thread) may
not be congruent.

>  I wonder
> if there could have been some other factor involved?  One possible line
> of thought: a non-locale-enabled compilation, installed to replace a
> locale-enabled one, would behave rather inconsistently if run on the
> same database used by the locale-enabled version (since indexes will
> still be in locale order).  Depending on what tests you did, you might
> well think that it was still running locale-enabled.

No index was involved.  The simple test script referred to in that
thread was all that was used.  I even went through an initdb cycle for
it.  However, I am willing to test again with freshly built 'no-locale'
RPM's on RH 6.2 and RH7 to see, if there is need.

All I need to do now is to make sure that the initscript starts
postmaster with the 'C' locale if the locale is set to 'en_US'.  Or is
that _really_ what we want, here?

> BTW: as of my commits of an hour ago, the above failure mode is no
> longer possible, since a non-locale-enabled Postgres will now refuse to
> start up in a database that shows any locale other than 'C' in pg_control.

Good.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11