Thread: OK, that's one LOCALE bug report too many...
... and I am not going to allow 7.1 to go out without a fix for this class of problems. I'm fed up ;-) As near as I can tell from the setlocale() man page, the only locale categories that are really hazardous for us are LC_COLLATE and LC_CTYPE; the other categories like LC_MONETARY affect only I/O routines, not sort ordering, and so cannot result in corrupt indices. I propose, therefore, that in an --enable-locale installation, initdb should save its values for LC_COLLATE and LC_CTYPE in pg_control, and backend startup should restore these settings from pg_control. Other locale categories will continue to be acquired from the postmaster environment. This will eliminate the class of bugs associated with index corruption from not always starting the postmaster with the same locale settings, while not forcing people to do an initdb to change harmless settings. Also, since "LC_COLLATE=en_US" seems to misbehave rather spectacularly on recent RedHat releases, I propose that initdb change "en_US" to "C" if it finds that setting. (Are there any platforms where there are non-bogus differences between the two?) Finally, until we have a really bulletproof solution for LIKE indexing optimization, I will disable that optimization if --enable-locale is compiled *and* LC_COLLATE is not C. Better to get "LIKE is slow" bug reports than "LIKE gives wrong answers" bug reports. Comments? Anyone think that initdb should lock down more categories than just these two? regards, tom lane
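A minimal sketch of the save-and-restore idea proposed above, using only standard C. The `SavedLocale` struct, its buffer size, and the function names are hypothetical stand-ins for whatever would actually be stored in pg_control; the thread does not give that layout.

```c
#include <locale.h>
#include <string.h>

#define LOCALE_NAME_MAX 64      /* hypothetical buffer size, not from the thread */

/* Hypothetical stand-in for the two settings initdb would record. */
typedef struct SavedLocale
{
    char lc_collate[LOCALE_NAME_MAX];
    char lc_ctype[LOCALE_NAME_MAX];
} SavedLocale;

/* Copy the effective name of one category; returns 0 on success. */
static int
copy_locale_name(int category, char *dst, size_t dstlen)
{
    const char *name = setlocale(category, NULL);   /* query only, no change */

    if (name == NULL || strlen(name) >= dstlen)
        return -1;
    strcpy(dst, name);
    return 0;
}

/* initdb side: snapshot whatever values are currently in effect. */
int
save_locale(SavedLocale *out)
{
    if (copy_locale_name(LC_COLLATE, out->lc_collate, sizeof out->lc_collate) != 0)
        return -1;
    return copy_locale_name(LC_CTYPE, out->lc_ctype, sizeof out->lc_ctype);
}

/* Backend side: force the two hazardous categories back to the saved values. */
int
restore_locale(const SavedLocale *in)
{
    if (setlocale(LC_COLLATE, in->lc_collate) == NULL)
        return -1;
    if (setlocale(LC_CTYPE, in->lc_ctype) == NULL)
        return -1;
    return 0;
}
```

Other categories such as LC_MONETARY are deliberately left alone here, so they would still follow the postmaster environment.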
Tom Lane writes: > I propose, therefore, that in an --enable-locale installation, initdb > should save its values for LC_COLLATE and LC_CTYPE in pg_control, and > backend startup should restore these settings from pg_control. Note that when these are unset there might still be a "catch-all" locale value coming from the LANG env. var. (or LC_ALL on some systems). > Also, since "LC_COLLATE=en_US" seems to misbehave rather spectacularly > on recent RedHat releases, I propose that initdb change "en_US" to "C" > if it finds that setting. (Are there any platforms where there are > non-bogus differences between the two?) There *should* be differences and it is definitely not okay to mix them up. > Finally, until we have a really bulletproof solution for LIKE indexing > optimization, I will disable that optimization if --enable-locale is > compiled *and* LC_COLLATE is not C. Better to get "LIKE is slow" bug > reports than "LIKE gives wrong answers" bug reports. (C or POSIX) I have a question about that optimization: If you have X LIKE 'foo%', wouldn't it be enough to use X >= 'foo' (which certainly works for any locale I've ever heard of)? Why do you need the X <= 'foo???' at all? > Comments? Anyone think that initdb should lock down more categories > than just these two? Not sure whether LC_CTYPE is necessary. -- Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane writes:
>> I propose, therefore, that in an --enable-locale installation, initdb
>> should save its values for LC_COLLATE and LC_CTYPE in pg_control, and
>> backend startup should restore these settings from pg_control.

> Note that when these are unset there might still be a "catch-all" locale
> value coming from the LANG env. var. (or LC_ALL on some systems).

Actually, what I intend to do while writing pg_control is read the current effective values via "setlocale(category, NULL)" --- then it shouldn't matter where they came from, no?

This brings up a question I had just come across while doing further research: backend/main/main.c does

    #ifdef USE_LOCALE
        setlocale(LC_CTYPE, "");    /* take locale information from an
                                     * environment */
        setlocale(LC_COLLATE, "");
        setlocale(LC_MONETARY, "");
    #endif

which seems a little odd --- why not setlocale(LC_ALL, "") ? Karel Zak said in a thread around 8/15/00 that this is deliberate, but I don't quite see why.

>> Also, since "LC_COLLATE=en_US" seems to misbehave rather spectacularly
>> on recent RedHat releases, I propose that initdb change "en_US" to "C"
>> if it finds that setting. (Are there any platforms where there are
>> non-bogus differences between the two?)

> There *should* be differences and it is definitely not okay to mix them
> up.

I have now received positive proof that en_US sort order on RedHat is broken. For example, it asserts
    '/root/' < '/root0'
but
    '/root/t' > '/root0'
I defy you to find anyone in the US who will say that that is a reasonable definition of string collation.

Of course, if you prefer the notion of disabling LIKE optimization on a default RedHat installation, we can go ahead and accept en_US. But I say it's broken and we shouldn't use it.

>> Finally, until we have a really bulletproof solution for LIKE indexing
>> optimization, I will disable that optimization if --enable-locale is
>> compiled *and* LC_COLLATE is not C. Better to get "LIKE is slow" bug
>> reports than "LIKE gives wrong answers" bug reports.

> (C or POSIX)

Do you think there are cases where setlocale(,NULL) will give back "POSIX" rather than "C"? We can certainly test for either.

> I have a question about that optimization: If you have X LIKE 'foo%',
> wouldn't it be enough to use X >= 'foo' (which certainly works for any
> locale I've ever heard of)? Why do you need the X <= 'foo???' at all?

Because you need a two-sided index constraint, not a one-sided one. Otherwise you're probably better off doing a sequential scan --- scanning 50% of the table (on average) via an index will be slower than sequential.

>> Comments? Anyone think that initdb should lock down more categories
>> than just these two?

> Not sure whether LC_CTYPE is necessary.

I'm not either, but I'm afraid to leave it float...

regards, tom lane
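The two-sided constraint can be sketched in C, assuming plain character-code comparison (the C locale), which is the only case the optimization is meant to stay enabled for. `prefix_upper_bound` and `in_prefix_range` are hypothetical helpers, not backend functions: the first derives an exclusive upper bound by incrementing the last byte of the prefix, so that X LIKE 'foo%' can be bracketed as X >= 'foo' AND X < 'fop'.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Write into `out` a string that sorts after every string starting with
 * `prefix` under byte-order comparison. Returns false if no such bound
 * exists (empty prefix or all bytes 0xFF) or the buffer is too small.
 */
bool
prefix_upper_bound(const char *prefix, char *out, size_t outlen)
{
    size_t len = strlen(prefix);

    if (len + 1 > outlen)
        return false;
    memcpy(out, prefix, len + 1);

    /* Drop trailing 0xFF bytes, which cannot be incremented. */
    while (len > 0 && (unsigned char) out[len - 1] == 0xFF)
        out[--len] = '\0';
    if (len == 0)
        return false;

    out[len - 1] = (char) ((unsigned char) out[len - 1] + 1);
    return true;
}

/* True if `s` lies in [prefix, upper); strcmp compares as unsigned bytes. */
bool
in_prefix_range(const char *s, const char *prefix, const char *upper)
{
    return strcmp(s, prefix) >= 0 && strcmp(s, upper) < 0;
}
```

Under a collation that ignores or reorders characters, as in the '/root/' example, the same bracket can exclude rows that do match the pattern, which is why the sketch is only valid for byte-order comparison.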
Tom Lane writes:
> >> Also, since "LC_COLLATE=en_US" seems to misbehave rather spectacularly
> >> on recent RedHat releases, I propose that initdb change "en_US" to "C"
> >> if it finds that setting. (Are there any platforms where there are
> >> non-bogus differences between the two?)
>
> > There *should* be differences and it is definitely not okay to mix them
> > up.
>
> I have now received positive proof that en_US sort order on RedHat is
> broken. For example, it asserts
> '/root/' < '/root0'
> but
> '/root/t' > '/root0'
> I defy you to find anyone in the US who will say that that is a
> reasonable definition of string collation.

That's certainly very odd, but Unixware does this too, so it's probably some sort of standard. And a few other European/Latin locales I tried also do this. But here's another example of why C and en_US are different.

peter ~$ cat foo
Delta
écrire
Beta
alpha
gamma

peter ~$ LC_COLLATE=C sort foo
Beta
Delta
alpha
gamma
écrire

peter ~$ LC_COLLATE=en_US sort foo
alpha
Beta
Delta
écrire
gamma

The C locale sorts strictly by character code. But in the en_US locale the accented letter is put into a "natural" position, and the upper and lower case letters are grouped together. Intuitively, the en_US order is the one in which you'd look things up in a dictionary.

This also explains (to me at least) the example you have above: When you look up words in a dictionary you ignore "funny characters". My American Heritage Dictionary explains:

: Entries are listed in alphabetical order without taking into account
: spaces or hyphens.

So at least this concept isn't that far out.

> Do you think there are cases where setlocale(,NULL) will give back
> "POSIX" rather than "C"? We can certainly test for either.

I know there are (old) systems that reject LANG=C as invalid locale, but I don't know what setlocale returns there.

--
Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
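The difference shown with sort(1) above comes down to which comparison function is applied. A small sketch, assuming nothing beyond standard C: strcmp() always compares by character code, while strcoll() follows the current LC_COLLATE setting. The function names here are illustrative only.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: strict character-code order, independent of locale. */
int
cmp_bytes(const void *a, const void *b)
{
    return strcmp(*(const char *const *) a, *(const char *const *) b);
}

/* qsort comparator: order defined by the current LC_COLLATE setting. */
int
cmp_collate(const void *a, const void *b)
{
    return strcoll(*(const char *const *) a, *(const char *const *) b);
}

/* Sort `n` strings under the named collation; returns 0 on success. */
int
sort_words(const char **words, size_t n, const char *collate_locale)
{
    if (setlocale(LC_COLLATE, collate_locale) == NULL)
        return -1;              /* locale not available on this system */
    qsort(words, n, sizeof *words, cmp_collate);
    return 0;
}
```

Calling `sort_words(words, n, "C")` gives the character-code order from the first listing. Passing "en_US" should give the dictionary-style order only where that locale is installed and behaves as described in this thread, so the return value needs checking.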
Peter Eisentraut <peter_e@gmx.net> writes:
>> I have now received positive proof that en_US sort order on RedHat is
>> broken. For example, it asserts
>> '/root/' < '/root0'
>> but
>> '/root/t' > '/root0'
>> I defy you to find anyone in the US who will say that that is a
>> reasonable definition of string collation.

> That's certainly very odd, but Unixware does this too, so it's probably
> some sort of standard. And a few other European/Latin locales I tried
> also do this.

I don't have very many platforms to try, but HPUX does not think that en_US sorts that way. It may well be standard in some European locales, but there's a reason why C locale acts the way it does: that behavior is the accepted one on this side of the pond. Sufficiently well accepted that it was quite a few years before American programmers noticed there was any reason to behave differently ;-)

> This also explains (to me at least) the example you have above: When you
> look up words in a dictionary you ignore "funny characters". My American
> Heritage Dictionary explains:
> : Entries are listed in alphabetical order without taking into account
> : spaces or hyphens.

That's workable for an English dictionary, where symbols other than letters are (a) rare and (b) usually irrelevant to the meaning. Do you think anyone would tolerate treating "/" as a noise character in a listing of Unix filenames, to take one counterexample? Unfortunately, en_US does so.

This'd be less of a problem if we had support for per-column charset and locale specifications. There'd be no objection to sorting a column that contains only (or mostly) words like that. But I've got strong doubts that the average user of a default RedHat installation expects *all* data to get sorted that way, or that he wants us to honor a default that he didn't ask for to the extent of disabling LIKE optimization to make it work.

I suppose we could do it that way and add a FAQ entry:

    Q. Why are my LIKE queries so slow?

    A. Change your locale to C, then dump, initdb, reload.

But somehow I don't think that'll go over well...

regards, tom lane
Tom Lane wrote: > that contains only (or mostly) words like that. But I've got strong > doubts that the average user of a default RedHat installation expects > *all* data to get sorted that way, or that he wants us to honor a > default that he didn't ask for to the extent of disabling LIKE > optimization to make it work. The change in collation for RedHat >6.0 is deliberate -- and conforms to ISO standards. There was noise in an unmentionable list at an unmentionable time about why it was this way -- and the result was a seesaw -- it was almost turned back to 'conventional' collation, but was then put back into ISO-conforming shape. Ask Trond (teg@redhat.com) about it. > I suppose we could do it that way and add a FAQ entry: > > Q. Why are my LIKE queries so slow? > > A. Change your locale to C, then dump, initdb, reload. > > But somehow I don't think that'll go over well... Methinks you are very right. Very right. I am not at all happy about the 'broken' RedHat locale -- the quick and dirty solution is to remove or rename '/etc/sysconfig/i18n' -- but that doesn't cure the root issue. Oh, and to make matters that much worse, on a RedHat system it doesn't matter if you build with or without --enable-locale -- locale support is in the libc used, and locale support gets used regardless of what you select on the configure line :-(. Been there; distributed that in the 6.5.x 'nl' RPM series. But it sounds to me like you're on the right track, Tom. -- Lamar Owen WGCR Internet Radio 1 Peter 4:11
Lamar Owen <lamar.owen@wgcr.org> writes: > Oh, and to make matters that much worse, on a RedHat system it doesn't > matter if you build with or without --enable-locale -- locale support is > in the libc used, and locale support gets used regardless of what you > select on the configure line :-(. I don't follow. Of course locale support is in libc; where else would it be? But without --enable-locale, we will never call setlocale(). Surely even RedHat is not so broken that they default to non-C locale in a program that has not called setlocale()? That directly contravenes the letter of the ISO C standard, IIRC. regards, tom lane
Lamar Owen <lamar.owen@wgcr.org> writes: > I am not at all happy about the 'broken' RedHat locale -- the quick and > dirty solution is to remove or rename '/etc/sysconfig/i18n' -- but that > doesn't cure the root issue. Actually, that suggestion points out that just nailing down LC_COLLATE at initdb time isn't sufficient, at least not on systems where libc's locale behavior depends on user-alterable external files. Even with my proposed initdb change in place, a user could still corrupt his indices by removing or replacing /etc/sysconfig/i18n. Ugh. Not sure I see a way around this, though, short of dumping libc and bringing along our own locale support. Of course, we might end up doing that anyway to support column-specific locales. I suspect setlocale() is far too slow on many implementations to be executed again for every string comparison :-( regards, tom lane
Possible compromise: let initdb accept en_US, but have it spit out a warning message:

NOTICE: initializing database with en_US collation order.
If you're not certain that's what you want, then it's probably not what
you want. We recommend you set LC_COLLATE to "C" and re-initdb.
For more information see <appropriate place in admin guide>

Thoughts?

regards, tom lane
Tom Lane wrote: > Lamar Owen <lamar.owen@wgcr.org> writes: > > Oh, and to make matters that much worse, on a RedHat system it doesn't > > matter if you build with or without --enable-locale -- locale support is > > in the libc used, and locale support gets used regardless of what you > > select on the configure line :-(. > But without --enable-locale, we will never call setlocale(). > Surely even RedHat is not so broken that they default to non-C locale > in a program that has not called setlocale()? That directly contravenes > the letter of the ISO C standard, IIRC. I just know this -- regression tests failed the same way with the 'nl' non-locale RPM's as they did (and do) with the regular locale-enabled RPM's. Collation was the same, regardless of the --enable-locale setting. I got lots of 'bug' reports about the RPM's failing regression, giving an unexpected sort order (see the archives -- the best model thread's start post is: http://www.postgresql.org/mhonarc/pgsql-hackers/1999-12/msg00587.html). I was pretty ignorant back then of some of these issues :-). Apparently RedHat is _that_ broken in that respect (among others). Thankfully some of RedHat's more egregious faults have been fixed in 7..... But then again what Unix isn't broken in some respect :-). -- Lamar Owen WGCR Internet Radio 1 Peter 4:11
Lamar Owen <lamar.owen@wgcr.org> writes: > Collation was the same, regardless of the --enable-locale > setting. I got lots of 'bug' reports about the RPM's failing > regression, giving an unexpected sort order (see the archives -- the > best model thread's start post is: > http://www.postgresql.org/mhonarc/pgsql-hackers/1999-12/msg00587.html). Hmm. I reviewed that thread and found this comment from you: : > Any differences in the environment variables maybe? : : In a nutshell, yes. /etc/sysconfig/i18n on the fresh install sets LANG, : LC_ALL, and LINGUAS all to be "en_US". The upgraded machine at home doesn't : have an /etc/sysconfig/i18n -- nor does the RH 6.0 box. That makes it sound like /etc/sysconfig/i18n is not what I'd assumed (namely, a data file read at runtime by libc) but only a bit of shell script that sets exported environment variables during bootup. I don't have that file here, so could you enlighten me as to exactly what it is/does? If it is just setting some default environment variables for the system, then it isn't anything we can't deal with by forcing setlocale() at postmaster start. That'd make me feel a lot better ;-) regards, tom lane
Tom Lane writes: > Possible compromise: let initdb accept en_US, but have it spit out a > warning message: > > NOTICE: initializing database with en_US collation order. > If you're not certain that's what you want, then it's probably not what > you want. We recommend you set LC_COLLATE to "C" and re-initdb. > For more information see <appropriate place in admin guide> I certainly don't like treating en_US specially, when in fact all locales are affected by this. You could print a general notice that the database system will be initialized with a (non-C, non-POSIX) locale and that this may/will affect the performance in certain cases. Maybe a --disable-locale switch to initdb as well? But IMHO we're not in the business of nitpicking or telling people how to write, install, or use their operating systems when the issue is not a show-stopper type, but really an aesthetics/convenience issue. -- Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
Peter Eisentraut <peter_e@gmx.net> writes: > Tom Lane writes: >> Possible compromise: let initdb accept en_US, but have it spit out a >> warning message: > I certainly don't like treating en_US specially, when in fact all locales > are affected by this. Well, my thought was that another locale, say en_FR, would be far more likely to be something that the system's user had explicitly chosen to use at some point, and thus there's less reason to suppose that he doesn't know what he's getting into. However, I have no objection to printing such a complaint whenever the locale is one that will defeat LIKE optimization --- how does that sound? regards, tom lane
Tom Lane wrote:
> Lamar Owen <lamar.owen@wgcr.org> writes:
> > Collation was the same, regardless of the --enable-locale
> > setting. I got lots of 'bug' reports about the RPM's failing

> Hmm. I reviewed that thread and found this comment from you:

> : In a nutshell, yes. /etc/sysconfig/i18n on the fresh install sets LANG,
> : LC_ALL, and LINGUAS all to be "en_US". The upgraded machine at home doesn't
> : have an /etc/sysconfig/i18n -- nor does the RH 6.0 box.

> That makes it sound like /etc/sysconfig/i18n is not what I'd assumed
> (namely, a data file read at runtime by libc) but only a bit of shell
> script that sets exported environment variables during bootup. I don't
> have that file here, so could you enlighten me as to exactly what it
> is/does?

Oh, yes, sorry -- /etc/sysconfig/i18n is read during sysinit, immediately before starting swap (IOW, it's only read the once). On my RH 6.2 box, it is the following line:

----- /etc/sysconfig/i18n -------
LANG="en_US"
------------- EOF ---------------

It's the same on a fresh RedHat 7.0 install.

> If it is just setting some default environment variables for the system,
> then it isn't anything we can't deal with by forcing setlocale() at
> postmaster start. That'd make me feel a lot better ;-)

Then you need to feel a lot better :-).....

--
Lamar Owen WGCR Internet Radio 1 Peter 4:11
At 07:32 PM 11/24/00 -0500, Tom Lane wrote: >Possible compromise: let initdb accept en_US, but have it spit out a >warning message: > >NOTICE: initializing database with en_US collation order. >If you're not certain that's what you want, then it's probably not what >you want. We recommend you set LC_COLLATE to "C" and re-initdb. >For more information see <appropriate place in admin guide> > >Thoughts? Are you SURE you want to use en_US collation? [no] (ask the question, default to no?) Yes, a question in initdb is ugly, this whole thing is ugly. - Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service and other goodies at http://donb.photo.net.
Don Baccus <dhogaza@pacifier.com> writes: > Are you SURE you want to use en_US collation? [no] > (ask the question, default to no?) > Yes, a question in initdb is ugly, this whole thing is ugly. A question in initdb won't fly for RPM installations, since the RPMs try to do initdb themselves (or am I wrong about that?) regards, tom lane
Tom Lane wrote: > Don Baccus <dhogaza@pacifier.com> writes: > > Are you SURE you want to use en_US collation? [no] > > (ask the question, default to no?) > > Yes, a question in initdb is ugly, this whole thing is ugly. > A question in initdb won't fly for RPM installations, since the RPMs > try to do initdb themselves (or am I wrong about that?) The RPMset initdb's the first time the initscript is run to start postmaster, not at installation time. A command-line argument to initdb would suffice to override -- maybe a '--initlocale' parameter?? Now, what sort of default for --initlocale..... -- Lamar Owen WGCR Internet Radio 1 Peter 4:11
Lamar Owen <lamar.owen@wgcr.org> writes: > A command-line argument to initdb would suffice to override -- maybe a > '--initlocale' parameter?? Hardly need one, when setting LANG or LC_ALL will do just as well. > Now, what sort of default for --initlocale..... I think your complaints about RedHat's default are right back in your lap ;-). Do you want to ignore their default, or not? regards, tom lane
Tom Lane wrote: > Lamar Owen <lamar.owen@wgcr.org> writes: > I think your complaints about RedHat's default are right back in your > lap ;-). Do you want to ignore their default, or not? Yes, I want to ignore their default. This problem is more than just cosmetic, thanks to the bugs that sparked this thread. I can do things in the initscript if necessary. That only helps the RPM's, though, not those from-source RedHat installations. -- Lamar Owen WGCR Internet Radio 1 Peter 4:11
Lamar Owen writes: > Yes, I want to ignore their default. If you want to do that then the infinitely better solution is to compile without locale support in the first place. (Make the locale-enabled server a separate package.) Alternatively, set the locale of the postgres user to POSIX. > I can do things in the initscript if necessary. That only helps the > RPM's, though, not those from-source RedHat installations. The subject of this whole discussion was IIRC the "default Red Hat installation". Those who compile from source can always make more informed decisions about what features to enable. -- Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
Tom Lane writes: > > I certainly don't like treating en_US specially, when in fact all locales > > are affected by this. > > Well, my thought was that another locale, say en_FR, would be far more > likely to be something that the system's user had explicitly chosen to > use at some point, IIRC, the default locale is chosen during the installation process of Red Hat, so any locale is explicitly chosen. If Red Hat does not provide a means to set the C locale as the default, that is Red Hat's fault. But then it should also be Red Hat's job (and Red Hat's decision) to install PostgreSQL in a certain way or other to account for that. Compiles from source don't count here, those users enabled locale explicitly anyway. -- Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
Peter Eisentraut wrote: > Lamar Owen writes: > > Yes, I want to ignore their default. > If you want to do that then the infinitely better solution is to compile > without locale support in the first place. (Make the locale-enabled > server a separate package.) Alternatively, the locale of the postgres > user to POSIX. Ok, let me repeat -- the '--enable-locale' setting will not affect the collation sequence problem on RedHat. If you set PostgreSQL to use locale, it uses it. If you configure PostgreSQL to not use locale, the collation set by LANG, LC_ALL, or LC_COLLATE is _STILL_ honored, thanks to the libc used. During the 6.5.x cycle, I built, for performance reasons, RPM's without locale/multibyte support. These were referred to as the 'nl' RPM's. Please see the thread I referred to to see how running with the 'non-locale' RPM's did not in the least solve the problem or change the symptoms. Setting the locale environment for the postmaster process is a possibility, but I'll have to do some testing to see if there are any interaction problems. And this still only helps RPM users, as my initscript is not part of the canonical tarball. > > I can do things in the initscript if necessary. That only helps the > > RPM's, though, not those from-source RedHat installations. > The subject of this whole discussion was IIRC the "default Red Hat > installation". Those who compile from source can always make more > informed decisions about what features to enable. Those who compile from source and configure for no locale support will get a nasty surprise on RedHat 6.1 and later. Even though a different library function is used to do the comparison for sorts and orderings, libc (in particular, glibc 2.1) _still_ uses the LC_ALL, LANG, or LC_COLLATE setting to determine collation. For the --enable-locale case, the function used is strcoll(); if not, strncmp(). See varstr_cmp() in src/backend/utils/adt/varlena.c. 
IOW, it is advisable to always enable locale on RedHat, as then you can at least know what to expect. And you then will still get unexpected results unless you do some locale work -- and, unfortunately, RedHat 6.x's locale documentation was sketchy at best; nonexistent at worst. I haven't seen RedHat 7's printed documentation yet, so I can't comment on it. For reference on this issue, please see the archives, in particular the following messages: http://www.postgresql.org/mhonarc/pgsql-hackers/1999-12/msg00678.html http://www.postgresql.org/mhonarc/pgsql-hackers/1999-12/msg00685.html (where I got the function names above....) -- Lamar Owen WGCR Internet Radio 1 Peter 4:11
Lamar Owen writes: > Ok, let me repeat -- the '--enable-locale' setting will not affect the > collation sequence problem on RedHat. If you set PostgreSQL to use > locale, it uses it. If you configure PostgreSQL to not use locale, the > collation set by LANG, LC_ALL, or LC_COLLATE is _STILL_ honored, thanks > to the libc used. Well, I'm looking at Red Hat 7.0 here and the locale variables are most certainly getting ignored in the default compile. Moreover, at no point did strncmp() in glibc behave as you claim. You can look at it yourself here: http://subversions.gnu.org/cgi-bin/cvsweb/glibc/sysdeps/generic/strncmp.c -- Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
Peter Eisentraut <peter_e@gmx.net> writes: > Lamar Owen writes: >> Ok, let me repeat -- the '--enable-locale' setting will not affect the >> collation sequence problem on RedHat. If you set PostgreSQL to use >> locale, it uses it. If you configure PostgreSQL to not use locale, the >> collation set by LANG, LC_ALL, or LC_COLLATE is _STILL_ honored, thanks >> to the libc used. > Well, I'm looking at Red Hat 7.0 here and the locale variables are most > certainly getting ignored in the default compile. Moreover, at no point > did strncmp() in glibc behave as you claim. I'm having a hard time believing Lamar's recollection, also. I wonder if there could have been some other factor involved? One possible line of thought: a non-locale-enabled compilation, installed to replace a locale-enabled one, would behave rather inconsistently if run on the same database used by the locale-enabled version (since indexes will still be in locale order). Depending on what tests you did, you might well think that it was still running locale-enabled. BTW: as of my commits of an hour ago, the above failure mode is no longer possible, since a non-locale-enabled Postgres will now refuse to start up in a database that shows any locale other than 'C' in pg_control. regards, tom lane
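The startup refusal described above can be sketched as a simple check. This is a hypothetical illustration, not the committed code: the function names are invented, and treating "POSIX" as equivalent to "C" is an assumption taken from the earlier "(C or POSIX)" remark rather than from the commit itself.

```c
#include <stdbool.h>
#include <string.h>

/* True for the locale names that mean plain character-code ordering. */
bool
is_c_locale(const char *name)
{
    return strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0;
}

/*
 * Check for a build without locale support: such a server only ever sorts
 * in C order, so it must not open a database whose indexes were built
 * under another collation. Returns true if startup may proceed.
 */
bool
nonlocale_build_may_start(const char *saved_collate, const char *saved_ctype)
{
    return is_c_locale(saved_collate) && is_c_locale(saved_ctype);
}
```

The two arguments stand for the LC_COLLATE and LC_CTYPE values that initdb recorded in pg_control.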
> I'm having a hard time believing Lamar's recollection, also. I wonder
> if there could have been some other factor involved? One possible line
> of thought: a non-locale-enabled compilation, installed to replace a
> locale-enabled one, would behave rather inconsistently if run on the
> same database used by the locale-enabled version (since indexes will
> still be in locale order). Depending on what tests you did, you might
> well think that it was still running locale-enabled.
>
> BTW: as of my commits of an hour ago, the above failure mode is no
> longer possible, since a non-locale-enabled Postgres will now refuse to
> start up in a database that shows any locale other than 'C' in pg_control.

Do locale-enabled compiles have the LIKE optimization disabled always?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Do locale-enabled compiles have the LIKE optimization disabled always? No. They do a run-time check to see what locale is active. regards, tom lane
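A run-time check of this kind might look like the following sketch. It assumes the decision depends only on the active LC_COLLATE name; the function name is hypothetical, and accepting "POSIX" as well as "C" follows the suggestion made earlier in the thread.

```c
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/*
 * Decide at run time whether the LIKE index optimization may be used:
 * only when the active collation is plain character-code order.
 */
bool
like_index_optimization_safe(void)
{
    const char *collate = setlocale(LC_COLLATE, NULL);  /* query only */

    if (collate == NULL)
        return false;           /* unknown locale: be conservative */
    return strcmp(collate, "C") == 0 || strcmp(collate, "POSIX") == 0;
}
```

Returning false in the doubtful case matches the stated preference for "LIKE is slow" reports over "LIKE gives wrong answers" reports.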
Tom Lane <tgl@sss.pgh.pa.us> writes: > Also, since "LC_COLLATE=en_US" seems to misbehave rather spectacularly > on recent RedHat releases, I propose that initdb change "en_US" to "C" > if it finds that setting. It does not misbehave in glibc (it's not Red Hat specific). Basically, glibc is the old From a discussion on a semi-internal list, written by Alan Cox: ************************************************************************ I read the ISO doc (god Its boring) Ok Ulrich is right for the spec. Its the official correct filing order for more than just in computing I think the right answer maybe this Default to ISOblah including sort remaining sorting AbBb.. Document this and also how to switch just the collation series to Unix style in the README files and docs that come with the release (like we documented how to turn off color ls Ultimately this comes down to: Unix behaviour since 197x versus librarians and others since considerably earlier. We are breaking Unix behaviour but I can now sort of appreciate the thinking behind this. ************************************************************************ -- Trond Eivind Glomsrød Red Hat, Inc.
teg@redhat.com (Trond Eivind Glomsrød) writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: > > > Also, since "LC_COLLATE=en_US" seems to misbehave rather spectacularly > > on recent RedHat releases, I propose that initdb change "en_US" to "C" > > if it finds that setting. > > It does not misbehave in glibc (it's not Red Hat specific). > Basically, glibc is the old Oops, here's the rest: glibc with the C/POSIX locale will make things work the old computer way: AB...Zab..z With en_US, it works the iso way: A/a B/b ... Z/z -- Trond Eivind Glomsrød Red Hat, Inc.
On Fri, 24 Nov 2000, Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > Tom Lane writes:
> >> I propose, therefore, that in an --enable-locale installation, initdb
> >> should save its values for LC_COLLATE and LC_CTYPE in pg_control, and
> >> backend startup should restore these settings from pg_control.
>
> > Note that when these are unset there might still be a "catch-all" locale
> > value coming from the LANG env. var. (or LC_ALL on some systems).
>
> Actually, what I intend to do while writing pg_control is read the
> current effective values via "setlocale(category, NULL)" --- then it
> shouldn't matter where they came from, no?
>
> This brings up a question I had just come across while doing further
> research: backend/main/main.c does
>
> #ifdef USE_LOCALE
>     setlocale(LC_CTYPE, "");    /* take locale information from an
>                                  * environment */
>     setlocale(LC_COLLATE, "");
>     setlocale(LC_MONETARY, "");
> #endif
>
> which seems a little odd --- why not setlocale(LC_ALL, "") ? Karel
> Zak said in a thread around 8/15/00 that this is deliberate, but
> I don't quite see why.

LC_ALL also sets LC_NUMERIC and LC_TIME, and in the backend we use some locale-sensitive routines like strftime() and sprintf() (and more?). timeofday() produces its output via strftime(), so if you set LC_ALL, a query like:

    select timeofday()::timestamp;

will (IMHO) crash.

With float numbers and the decimal point I am not sure. If *all* numbers follow the locale setting and all routines and utilities expect the correct locale-style decimal point, we probably will not see any problem. But what will happen in a client program if the FE knows nothing about the current BE setting? The BE sends a number with the locale's (Czech) decimal point, "123,456", the FE is set to "en_US", and the result of the client's atod() is "123.000".... And etc...etc...

We need *robust*, correct BE<->FE and column-specific locale support; without this we can use the locale-sensitive to_char() for numbers and pray and hope that everything in PG is right :-)

We need (TODO?):

- column-specific locale setting
- FE routine to obtain a column's locale setting, like PQflocale(const PGresult *res, int field_index);
- on-the-fly number (and date/time?!) recoding if the BE and FE use different locales
- rebuilding indexes for a new locale setting
- fast locale information for date/time and support for locale-sensitive date/time parsing (IMHO almost impossible to write this)
- ... etc.

It is too long a way to LC_ALL.

Karel

PS. IMHO the current PG locale setting is not bad. I know bigger problems, for example the nonexistent error codes and the thread-ignorant FE lib. With these problems it is not possible to write a good, large and robust FE.
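The decimal-point mismatch described above can be reproduced with standard C alone. In this sketch `parse_with_locale` is a hypothetical helper, and strtod() stands in for the client-side conversion the message calls atod(). It shows a parser whose LC_NUMERIC expects '.' stopping at the comma of "123,456".

```c
#include <locale.h>
#include <stdlib.h>

/*
 * Parse `text` under the numeric locale `loc`. On return *rest points at
 * the first character that was not consumed. Returns 0.0 with *rest set
 * to `text` if the locale cannot be set.
 */
double
parse_with_locale(const char *text, const char *loc, const char **rest)
{
    char   *end;
    double  value;

    if (setlocale(LC_NUMERIC, loc) == NULL)
    {
        *rest = text;
        return 0.0;
    }
    value = strtod(text, &end);
    *rest = end;
    return value;
}
```

With `loc` set to "C", parsing "123,456" yields 123.0 and leaves `*rest` pointing at the comma, so the fractional part is silently dropped unless the caller checks for leftover input.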
Hi, ... > LC_NUMERIC and LC_TIME ... > The timeofday() make output via strftime() if you set LC_ALL, a query > like: select timeofday()::timestamp; Actually *I would* expect it to return a localized string. But then again I always expect the BE to use '.' as the decimal point (I must be damaged :-/ ). ... > We need *robust* BE<->FE correct and comumns specific local supporte, I agree :-) And the easiest (and only robust) way would be to define which char is the decimal point and how a date/time must be formatted to be accepted on an INSERT or SELECT, and leave the job of localization to the FE. (I do not know what SQL9_ says about this, and frankly I do not care.) And then to sorting (and comparing) of strings. PostgreSQL should decide on one charset (UTF8, UTF16) and expect clients (FE) to enforce that. Yes, some sorting would be wrong, but in most cases it would be correct. PostgreSQL will never be able to do correct indexing in a mixed-locale environment if it does not have one index tree (hash or whatever) per locale. But with UTF8 it could do a good (if not perfect) job. Something like this for sorting: noise-chars-in-any-order..0..1..A..a..e..é..E..È..U..Ü..u..ü..Z..z..Ö..ö And as time/date/timestamp format: 2000.11.27 12:55.01.000000 would be a good compromise. This maybe feels like moving the trouble from BE to FE, but *I think* this is the only solution that would always work (if not perfectly...). And this would remove all the problems with the "--enable-locale, which locale to use" question. Also, if someone wanted to connect with a new, unknown locale, it would work without changes on the BE side. And to the erroneous results from "SELECT * FROM myTable where strString > 'abc'": this suggestion would not solve all of those, but it would solve most of them. And *I think* any comparison other than = and != on a string is prone to errors (even as an optimization of LIKE). // Jarmo
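[Editor's sketch] The ordering question running through this thread comes down to the difference between strcmp(), which compares bytes, and strcoll(), which follows the current LC_COLLATE. A minimal C illustration follows; the helper names byte_order() and locale_order() are invented for this note. In the "C" locale the two agree; under another LC_COLLATE they may not, which is why an index built under one setting cannot be trusted under another.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Reduce a comparison result to -1, 0 or 1. */
static int sign(int x)
{
    return (x > 0) - (x < 0);
}

/* Hypothetical helper: byte-wise order, independent of any locale. */
int byte_order(const char *a, const char *b)
{
    return sign(strcmp(a, b));
}

/* Hypothetical helper: order under the current LC_COLLATE setting. */
int locale_order(const char *a, const char *b)
{
    return sign(strcoll(a, b));
}
```

After setlocale(LC_COLLATE, "C"), locale_order("Zebra", "apple") and byte_order("Zebra", "apple") both give -1, since 'Z' sorts before 'a' by byte value. A dictionary-style locale would typically put "apple" first instead.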
Tom Lane wrote: > Peter Eisentraut <peter_e@gmx.net> writes: > > Lamar Owen writes: > >> Ok, let me repeat -- the '--enable-locale' setting will not affect the > >> collation sequence problem on RedHat. If you set PostgreSQL to use > >> locale, it uses it. If you configure PostgreSQL to not use locale, the > >> collation set by LANG, LC_ALL, or LC_COLLATE is _STILL_ honored, thanks > >> to the libc used. > > Well, I'm looking at Red Hat 7.0 here and the locale variables are most > > certainly getting ignored in the default compile. Moreover, at no point > > did strncmp() in glibc behave as you claim. Try on RH 6.x. It is possible RH 7 has this behavior fixed -- I have not built _any_ no-locale RPM's since 6.5.3 -- and the last OS I built that on was RH 6.2. Amend my statement above to read 'collation sequence problem on RedHat 6.x, where x>0.' > I'm having a hard time believing Lamar's recollection, also. It's in the archives. Not just my (often bad) recollections..... :-) Of course, RH 7.0's behavior and RH 6.1's behavior (which was the version I reported having the problem in the archive message thread) may not be congruent. > I wonder > if there could have been some other factor involved? One possible line > of thought: a non-locale-enabled compilation, installed to replace a > locale-enabled one, would behave rather inconsistently if run on the > same database used by the locale-enabled version (since indexes will > still be in locale order). Depending on what tests you did, you might > well think that it was still running locale-enabled. No index was involved. The simple test script referred to in that thread was all that was used. I even went through an initdb cycle for it. However, I am willing to test again with freshly built 'no-locale' RPM's on RH 6.2 and RH7 to see, if there is need. All I need to do now is to make sure that the initscript starts postmaster with the 'C' locale if the locale is set to 'en_US'. Or is that _really_ what we want, here? 
> BTW: as of my commits of an hour ago, the above failure mode is no > longer possible, since a non-locale-enabled Postgres will now refuse to > start up in a database that shows any locale other than 'C' in pg_control. Good. -- Lamar Owen WGCR Internet Radio 1 Peter 4:11
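[Editor's sketch] The startup rule Tom describes (a build without locale support refuses a database whose pg_control records anything other than 'C') could look roughly like the check below. This is not the committed code. Both function names are invented, the locale names are assumed to have been read from pg_control already, and treating "POSIX" as equivalent to "C" is an assumption taken from Peter's "(C or POSIX)" remark earlier in the thread.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: nonzero if the name denotes plain byte-order
 * behavior. Accepting "POSIX" as well as "C" is an assumption. */
int is_c_locale(const char *name)
{
    return name != NULL &&
           (strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0);
}

/* Hypothetical startup check for a build without locale support:
 * allow startup only if both saved categories are C/POSIX. */
int nonlocale_build_may_start(const char *saved_collate,
                              const char *saved_ctype)
{
    return is_c_locale(saved_collate) && is_c_locale(saved_ctype);
}
```

Under these assumptions, nonlocale_build_may_start("C", "C") returns 1, while nonlocale_build_may_start("en_US", "C") returns 0, so the mixed-up case Tom describes would be rejected at startup rather than showing up later as inconsistent index order.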