Thread: RPM init-script: Why the locale setting?
Hello, In the init-script contained in the RPMs downloadable from the PostgreSQL site (I checked the one for Fedora), an explicit locale is set before running initdb. - And the explicit locale is not "C". This means that a PostgreSQL installation will not use indexes for LIKE queries (I just ran into this). See http://www.postgresql.org/docs/faqs/FAQ.html#4.8 I suggest that the init-script be rewritten so that LANG and LC_ALL are unset before initdb is run (which happens the first time PostgreSQL is started after the RPM-based installation). -- Greetings from Troels Arvin, Copenhagen, Denmark
On 04/04/2004 21:48 Troels Arvin wrote: > Hello, > > In the init-script contained in the RPMs downloadable from the PostgreSQL > site (I checked the one for Fedora), an explicit locale is set before > running initdb. - And the explicit locale is not "C". > > This means that a PostgreSQL installation will not use indexes for LIKE > queries (I just ran into this). See > http://www.postgresql.org/docs/faqs/FAQ.html#4.8 No. It says that [normal] indexes won't be used for: select foo from bar where col like '%abc'; or select foo from bar where col like '%abc%'; or ILIKE is used. And even then you can use a functional index of the form CREATE INDEX tabindex ON tab (lower(col)); > I suggest that the init-script be rewritten so that LANG and LC_ALL are > unset before initdb is run (which happens the first time PostgreSQL is > started after the RPM-based installation). I'll admit that I don't know what effect this would have but I'm interested to find out. regards -- Paul Thomas +------------------------------+---------------------------------------------+ | Thomas Micro Systems Limited | Software Solutions for Business | | Computer Consultants | http://www.thomas-micro-systems-ltd.co.uk | +------------------------------+---------------------------------------------+
On Sun, 4 Apr 2004, Troels Arvin wrote: > Hello, > > In the init-script contained in the RPMs downloadable from the PostgreSQL > site (I checked the one for Fedora), an explicit locale is set before > running initdb. - And the explicit locale is not "C". > > This means that a PostgreSQL installation will not use indexes for LIKE > queries (I just ran into this). See > http://www.postgresql.org/docs/faqs/FAQ.html#4.8 Technically you should be able to use an index in the appropriate *_pattern_ops opclass, but yes, normal indexes aren't used. > I suggest that the init-script be rewritten so that LANG and LC_ALL are > unset before initdb is run (which happens the first time PostgreSQL is > started after the RPM-based installation). Wouldn't this get in the way of having the server "do the right thing" when in a locale that doesn't collate by "C" rules?
Troels Arvin <troels@arvin.dk> writes: > In the init-script contained in the RPMs downloadable from the PostgreSQL > site (I checked the one for Fedora), an explicit locale is set before > running initdb. - And the explicit locale is not "C". Only if you don't have a sysconfig file: # Just in case no locale was set, use en_US [ ! -f /etc/sysconfig/i18n ] && echo "LANG=en_US" > $PGDATA/../initdb.i18n I agree though that it seems like a bad choice to default to en_US rather than C. Lamar, any reason why it's like that? regards, tom lane
On Sunday 04 April 2004 10:50 pm, Tom Lane wrote: > Troels Arvin <troels@arvin.dk> writes: > > In the init-script contained in the RPMs downloadable from the PostgreSQL > > site (I checked the one for Fedora), an explicit locale is set before > > running initdb. - And the explicit locale is not "C". > Only if you don't have a sysconfig file: > # Just in case no locale was set, use en_US > [ ! -f /etc/sysconfig/i18n ] && echo "LANG=en_US" > $PGDATA/../initdb.i18n > I agree though that it seems like a bad choice to default to en_US > rather than C. Lamar, any reason why it's like that? Yes. A bit of history before I enclose an e-mail from Trond Eivind Glomsrød (former Red Hat internal PostgreSQL RPMmaintainer) on the subject. I am only enclosing a single e-mail of an exchange that occurred over a period of a couple of weeks; I have pretty much whole exchange archived if you want to read more, although I cannot reveal the whole exchange due to some NDA stuff in it. Although it might be OK at this point, since that was, after all, 3 years ago. Back in PostgreSQL 7.1 days, locale settings and the issue of a database being initdb'ed in one locale and the postmaster starting in another locale reared up its head. I 'solved' the issue by hardcoding LC_ALL=C in the initscript. This had the side-effect of making the regression tests pass. Trond wasn't happy with my choice of C locale, and here is why: Re: Thought you might find this very interesting. From: teg@redhat.com (Trond Eivind Glomsrød) To: Lamar Owen <lamar.owen@wgcr.org> Lamar Owen <lamar.owen@wgcr.org> writes: > On Friday 25 May 2001 15:04, you wrote: > > Lamar Owen <lamar.owen@wgcr.org> writes: > > > > I also intend to kill the output from database initialization. > > > > I thought you had, at least in the RedHat 7.1 7.0.3 set. > > > Yup, but it has started showing up again in PostgreSQL 7.1.x > > I need to sync that in with this set. I've fixed a couple of issues with the inistscript, I'll send it to you when it's finished.... even after sourcing a file with locale values, the postmaster process doesn't seem to respect it. I'll need to make this work before I build (I've confirmed that the current way of handling this, using "C", is not acceptable. The locale needs to be different, and if that causes problems for pgsql, it's a bug in pgsql which needs fixing - handling other aspects, like ordering, in a bad way isn't an acceptable workaround. > > "C" equals broken for non-English locales, and isn't an acceptable choice. > > That is one argument I'll not be involved in, as I'm so used to the ASCII > sequence that it is second-nature, thus disqualifying me from commenting on > any collation issues. 1) It's not a vaslid choice for English - if you're looking in a lexicon, you'll find Aspen, bridge, Cambridge, not Aspen, Cambridge, bridge. 2) It's much worse in other locales... it gets the order of chaaracters wrong as well. Here is a test: create table bar( ord varchar(40), foo int, primary key(ord)); insert into bar values('ære',2); insert into bar values('åre',3); insert into bar values('are',4); insert into bar values('zsh',5); insert into bar values('begynne',6); insert into bar values('øve',7); select ord,foo from bar order by ord; Here is a valid result: are | 4 begynne | 6 zsh | 5 ære | 2 øve | 7 åre | 3 Here is an invalid result: are | 4 begynne | 6 zsh | 5 åre | 3 ære | 2 øve | 7 The last one is what you get with LANG=C - as you can see, the ordering of the Norwegian characters is wrong. The same would be the issue for pretty much any non-English characters - their number in the character table (as used by C) is not the same as their location in the local alphabet (as used by the local locale). -- Trond Eivind Glomsrød Red Hat, Inc. So there is a reason it is like it is. If you want to change that in the local setting, you will have to reinitdb in C locale (and edit /var/lib/pgsql/initdb.i18n accordingly, and be prepared for collation differences and problems). The initial initdb is done in the system locale, unless one does not exist, in which case en_US is used (again, so that when you do store non-English characters you get sane ordering, and so that you get the mixed-case ordering preferred by many people). The initdb locale settings are stored in initdb.i18n, and they are re-sourced everytime postgresql is started to prevent data corruption if postmaster is started with a different locale from the initdb. Tom, is the data corruption issue still an issue with 7.4.x, or is this just historical? It has been a long time since I've looked in this corner of the RPM.... :-) -- Lamar Owen Director of Information Technology Pisgah Astronomical Research Institute 1 PARI Drive Rosman, NC 28772 (828)862-5554 www.pari.edu
Lamar Owen <lowen@pari.edu> writes: > ... The initdb locale > settings are stored in initdb.i18n, and they are re-sourced everytime > postgresql is started to prevent data corruption if postmaster is started > with a different locale from the initdb. Tom, is the data corruption issue > still an issue with 7.4.x, or is this just historical? That's historical. For several versions now, the LC_COLLATE and LC_CTYPE settings seen by initdb have been saved in pg_control and re-adopted by the postmaster at start, so that index order corruption problems are impossible. We do still adopt other settings such as LC_MESSAGES from the postmaster environment, although I believe that these will generally be read from postgresql.conf if you haven't toyed with what initdb puts into that file. In short then I doubt there's a need for initdb.i18n anymore. It would make more sense to have postgres' bash_profile source /etc/sysconfig/i18n directly. The question of what postgresql.init should do if there's no available LANG or LC_ALL setting seems orthogonal to me. I do not find Trond's arguments convincing at all: a person who feels that C locale is broken ought to set up /etc/sysconfig/i18n to specify another locale. The POSIX standards say that the default locale in the absence of any environmental variable is C, not en_US, and the fact that Trond doesn't like that default doesn't give him license to change it, nor IMHO to try to make an end run around the standard by pressuring initscript authors to override the POSIX spec. I have no objection to making en_US the default at the sysconfig level, but inserting it in lower levels of the system seems at best misguided. regards, tom lane
On Monday 05 April 2004 02:02 pm, Tom Lane wrote: > In short then I doubt there's a need for initdb.i18n anymore. It would > make more sense to have postgres' bash_profile source /etc/sysconfig/i18n > directly. Probably a good idea, then. I'll look at removing that cruft in the next release; although, you may get to another release before I do, in which case do with as you see fit (unless you just want to leave it to me...:-)) > The question of what postgresql.init should do if there's no available > LANG or LC_ALL setting seems orthogonal to me. I do not find Trond's > arguments convincing at all: a person who feels that C locale is broken > ought to set up /etc/sysconfig/i18n to specify another locale. The > POSIX standards say that the default locale in the absence of any > environmental variable is C, not en_US, and the fact that Trond doesn't > like that default doesn't give him license to change it, nor IMHO to try > to make an end run around the standard by pressuring initscript authors > to override the POSIX spec. I have no objection to making en_US the > default at the sysconfig level, but inserting it in lower levels of the > system seems at best misguided. Well, Trond no longer has the reins, no? :-) However, I would like to see a sane default that is consistent system-wide: if the whole system defaults to en_US in the presence of no environment variable, then PostgreSQL should default the same way. What does LSB say (which is where the RPMset has to live)? I personally favored a default at C locale and have no problem reinstating that if that is really a sane default. -- Lamar Owen Director of Information Technology Pisgah Astronomical Research Institute 1 PARI Drive Rosman, NC 28772 (828)862-5554 www.pari.edu