Thread: locale support
There is a serious problem with PostgreSQL's locale support on certain platforms with certain locale combinations: ordering, indexes, etc. are simply broken because strcoll() does not work. One such combination is RedHat 6.2J (the Japanese localized version) with the ja_JP.eucJP locale. Here is a test program that exposes the problem.

#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(void)
{
    static char *s1 = "a Japanese string";
    static char *s2 = "another Japanese string";

    /* adopt the locale named by the environment */
    setlocale(LC_ALL, "");
    printf("%d\n", strcoll(s1, s2));
    printf("%d\n", strcoll(s2, s1));
    return 0;
}

This program prints 0s, which means strcoll() regards those different Japanese strings as the same! I know this is not PostgreSQL's fault but that of the broken locale data on certain platforms. The problem makes it impossible to use the PostgreSQL RPMs in Japan.

I'm looking for solutions/workarounds for this problem. Maybe we should disable locale support at runtime if strcoll() does not work?

Comments?
--
Tatsuo Ishii
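The idea of disabling locale support at runtime when strcoll() misbehaves could look roughly like the sketch below. It is only an illustration, not PostgreSQL code: the helper names collation_looks_broken() and init_locale_with_fallback() are hypothetical, and the probe strings are whatever pair the caller knows must sort differently in the target locale.

```c
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper (not PostgreSQL code): returns 1 when strcoll()
 * reports two byte-wise different strings as equal under the current
 * LC_COLLATE setting, which is the symptom described above.
 */
static int collation_looks_broken(const char *a, const char *b)
{
    return strcmp(a, b) != 0 && strcoll(a, b) == 0;
}

/*
 * Adopt the locale named by the environment, then fall back to "C"
 * collation if the probe pair fails the sanity check.
 * Returns 1 if the fallback was taken, 0 otherwise.
 */
static int init_locale_with_fallback(const char *a, const char *b)
{
    setlocale(LC_ALL, "");
    if (collation_looks_broken(a, b)) {
        /* Only collation is reset; LC_CTYPE and the rest stay as set. */
        setlocale(LC_COLLATE, "C");
        return 1;
    }
    return 0;
}
```

One caveat with this approach: a correct locale may legitimately treat some distinct strings as equal for collation purposes, so the probe pair has to be chosen from strings that are known to differ in that locale.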
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> I know this is not PostgreSQL's fault but the broken locale data on
> certain platforms. The problem makes it impossible to use PostgreSQL
> RPMs in Japan.

> I'm looking for solutions/workarounds for this problem.

Build a set of RPMs without locale support?

regards, tom lane
On Mon, Feb 12, 2001 at 09:59:37PM -0500, Tom Lane wrote:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > I know this is not PostgreSQL's fault but the broken locale data on
> > certain platforms. The problem makes it impossible to use PostgreSQL
> > RPMs in Japan.
>
> > I'm looking for solutions/workarounds for this problem.
>
> Build a set of RPMs without locale support?

Run it with LC_ALL="C".

Nathan Myers
ncm@zembu.com
> > I know this is not PostgreSQL's fault but the broken locale data on
> > certain platforms. The problem makes it impossible to use PostgreSQL
> > RPMs in Japan.
>
> > I'm looking for solutions/workarounds for this problem.
>
> Build a set of RPMs without locale support?

>Run it with LC_ALL="C".

Neither of them seems an ideal solution for the RPMs. It would be nice if we could distribute a single binary and a single startup file in the RPM.
--
Tatsuo Ishii
Nathan Myers wrote:
>
> On Mon, Feb 12, 2001 at 09:59:37PM -0500, Tom Lane wrote:
> > Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > > I know this is not PostgreSQL's fault but the broken locale data on
> > > certain platforms. The problem makes it impossible to use PostgreSQL
> > > RPMs in Japan.
> >
> > > I'm looking for solutions/workarounds for this problem.
> >
> > Build a set of RPMs without locale support?
>
> Run it with LC_ALL="C".

It would help if there was a sample working LC_ALL=xxx line in /etc/rc.d/init.d/postgresql.

As it stands now it is a real pita to get the LC_xx settings down to the real postmaster through all the layers (and to guess whether they took effect after each restart ;)

---------
Hannu
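To take the guessing out of whether the LC_xx settings reached a process, the process itself can report what it ended up with. The sketch below is a minimal illustration, assuming nothing beyond the standard setlocale() query form (a NULL second argument returns the current setting without changing it); the helper names effective_locale() and report_locale() are hypothetical and not part of PostgreSQL or its init script.

```c
#include <locale.h>
#include <stdio.h>

/*
 * Returns the name of the locale currently in effect for one category
 * (for example LC_COLLATE), or "(unknown)" if the C library does not
 * report it. Passing NULL to setlocale() queries without changing.
 */
static const char *effective_locale(int category)
{
    const char *name = setlocale(category, NULL);
    return name != NULL ? name : "(unknown)";
}

/*
 * Write the categories that affect sorting and character
 * classification to the given stream, e.g. a startup log.
 */
static void report_locale(FILE *out)
{
    fprintf(out, "LC_COLLATE=%s\n", effective_locale(LC_COLLATE));
    fprintf(out, "LC_CTYPE=%s\n", effective_locale(LC_CTYPE));
}
```

A C program starts in the "C" locale, so a report like this is only meaningful after the program has called setlocale(LC_ALL, "") to pick up the environment.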
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> I know this is not PostgreSQL's fault but the broken locale data on
> certain platforms. The problem makes it impossible to use PostgreSQL
> RPMs in Japan.
>> > I'm looking for solutions/workarounds for this problem.
>>
>> Build a set of RPMs without locale support?

>> Run it with LC_ALL="C".

> Both of them seem not ideal solutions for RPM. It would be nice if we
> could distribute single binary and start up file in RPM.

If you can find a non-intrusive way to do that, sure ... but I don't think that we should expend any great amount of effort, nor uglify the code, in order to cater to a demonstrably broken library on one particular platform. The LC_ALL answer seems the best to me.

regards, tom lane
Tom Lane wrote:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > I know this is not PostgreSQL's fault but the broken locale data on
> > certain platforms. The problem makes it impossible to use PostgreSQL
> > RPMs in Japan.
> > I'm looking for solutions/workarounds for this problem.

> >> Build a set of RPMs without locale support?

> >> Run it with LC_ALL="C".

> > Both of them seem not ideal solutions for RPM. It would be nice if we
> > could distribute single binary and start up file in RPM.

> If you can find a non-intrusive way to do that, sure ... but I don't
> think that we should expend any great amount of effort, nor uglify the
> code, in order to cater to a demonstrably broken library on one
> particular platform.

Tatsuo, what is LC_ALL (or the other locale envvars) set to when you run the program? The man page for setlocale() on my machine documents that main() starts in C or POSIX locale mode by default. The call to setlocale(LC_ALL, "") reads the envvars and sets the locale accordingly. Maybe RedHat's 6.2J isn't setting up the locale properly to begin with? See what /etc/sysconfig/i18n contains -- if it is empty or doesn't exist, then the locale is simply not set up. But you specifically mention the particular locale....

Ok, what combinations _do_ work? We _know_ C or POSIX works -- but which ones don't work, on RH >6.1? While I want to make sure that a broken locale data set isn't used, I also want to make sure that a good locale set isn't thrown out, either. Forcing LC_COLLATE=C is overkill, IMHO. And building without locale support doesn't work, either, because, at least on RH 6.1, strncmp() is buggered to use the locale's collation.

The real solution is for the vendors to fix their broken locales.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
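One way to find out which combinations do work is to probe each candidate locale by name with a pair of strings that ought to sort differently. The sketch below is only an illustration under stated assumptions: probe_collation() and the probe_result values are hypothetical names, and the caller supplies the sample strings already encoded for the locale being tested (for ja_JP.eucJP one might try "\xa4\xa2" and "\xa4\xa4", which are assumed here to be the EUC-JP bytes for two different hiragana).

```c
#include <locale.h>
#include <string.h>

enum probe_result { PROBE_UNAVAILABLE, PROBE_SUSPECT, PROBE_OK };

/* Reduce a comparison result to -1, 0 or 1. */
static int sign_of(int v)
{
    return (v > 0) - (v < 0);
}

/*
 * Switch to the named locale and check strcoll() on one pair of strings.
 * PROBE_UNAVAILABLE: the C library does not know the locale.
 * PROBE_SUSPECT:     different strings compare equal, or the two
 *                    comparison directions disagree with each other.
 * PROBE_OK:          no problem seen for this pair.
 * Side effect: the process is left in the "C" locale after a probe.
 */
static enum probe_result probe_collation(const char *locale_name,
                                         const char *a, const char *b)
{
    int ab, ba;
    enum probe_result result = PROBE_OK;

    if (setlocale(LC_ALL, locale_name) == NULL)
        return PROBE_UNAVAILABLE;

    ab = sign_of(strcoll(a, b));
    ba = sign_of(strcoll(b, a));

    if (strcmp(a, b) != 0 && ab == 0)
        result = PROBE_SUSPECT;
    else if (ab != -ba)
        result = PROBE_SUSPECT;

    setlocale(LC_ALL, "C");
    return result;
}
```

A single passing pair proves little, so a real survey would run several pairs per locale and treat PROBE_SUSPECT as a reason to inspect that locale by hand rather than as proof that it is broken.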
Lamar Owen writes:
> And building without locale support doesn't work, either, because, at
> least on RH 6.1, strncmp() is buggered to use the locale's collation.

I don't think so. On RH 6.1, strncmp() is the same it's ever been:

int
strncmp (s1, s2, n)
     const char *s1;
     const char *s2;
     size_t n;
{
  unsigned reg_char c1 = '\0';
  unsigned reg_char c2 = '\0';

  if (n >= 4)
    {
      size_t n4 = n >> 2;
      do
        {
          c1 = (unsigned char) *s1++;
          c2 = (unsigned char) *s2++;
          if (c1 == '\0' || c1 != c2)
            return c1 - c2;
          c1 = (unsigned char) *s1++;
          c2 = (unsigned char) *s2++;
          if (c1 == '\0' || c1 != c2)
            return c1 - c2;
          c1 = (unsigned char) *s1++;
          c2 = (unsigned char) *s2++;
          if (c1 == '\0' || c1 != c2)
            return c1 - c2;
          c1 = (unsigned char) *s1++;
          c2 = (unsigned char) *s2++;
          if (c1 == '\0' || c1 != c2)
            return c1 - c2;
        } while (--n4 > 0);
      n &= 3;
    }

  while (n > 0)
    {
      c1 = (unsigned char) *s1++;
      c2 = (unsigned char) *s2++;
      if (c1 == '\0' || c1 != c2)
        return c1 - c2;
      n--;
    }

  return c1 - c2;
}

--
Peter Eisentraut
peter_e@gmx.net
http://yi.org/peter-e/
Peter Eisentraut wrote:
> Lamar Owen writes:
> > And building without locale support doesn't work, either, because, at
> > least on RH 6.1, strncmp() is buggered to use the locale's collation.

> I don't think so. On RH 6.1, strncmp() is the same it's ever been:
[snip]

Is that the code after any glibc RPM patches are applied? 'Pristine source, perhaps -- but patch like crazy!' Reference the classic 'Reflections on Trusting Trust' by Ken Thompson (which you have probably read already, but, for those on-list who may not have read this classic work on security, you can find the paper at http://www.acm.org/classics/sep95/). Although reading the glibc spec file indicates that patching isn't done in the 'conventional' manner here. (Lovely).

I base my assertion on running test queries on a RedHat 6.1 box over a year ago, using the non-locale 6.5.3 RPMset I distributed at that point (I distributed non-locale RPMs because of its speed being greater in indexing, etc). The user who was having difficulties also tried the non-locale RPMset -- and no change, until removing /etc/sysconfig/i18n. I've referenced the thread before in the archives; see the message http://www.postgresql.org/mhonarc/pgsql-hackers/1999-12/msg00678.html for the middle of the thread.

But, of course, that was 6.5.3. If 7.x behaves differently, I wouldn't know, as I've not built a 'non-locale' RPMset of 7.x. But, I can if needed. Or try the test queries on your own RH 7 box, with a non-locale build.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Lamar Owen writes:
> > I don't think so. On RH 6.1, strncmp() is the same it's ever been:
> [snip]
>
> Is that the code after any glibc RPM patches are applied?

Yes.

> I base my assertion on running test queries on a RedHat 6.1 box over a
> year ago, using the non-locale 6.5.3 RPMset I distributed at that point
> (I distributed non-locale RPMs because of its speed being greater in
> indexing, etc). The user who was having difficulties also tried the
> non-locale RPMset -- and no change, until removing /etc/sysconfig/i18n.

I recall that thread, but the conclusion that was reached (that strncmp() is at fault in some way) was never proved sufficiently.

--
Peter Eisentraut
peter_e@gmx.net
http://yi.org/peter-e/
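The open question of whether strncmp() is influenced by the locale can be checked directly on a suspect machine. The sketch below is a minimal illustration, not something taken from the referenced thread: strncmp_ignores_locale() is a hypothetical helper that compares the ordering strncmp() gives for one pair of strings under the "C" locale and under a named locale. Since strncmp() is specified as a byte-wise comparison, the two orderings are expected to agree.

```c
#include <locale.h>
#include <string.h>

/* Reduce a comparison result to -1, 0 or 1. */
static int sign_of(int v)
{
    return (v > 0) - (v < 0);
}

/*
 * Returns 1 if strncmp() orders (a, b) the same way under the "C"
 * locale and under locale_name, 0 if the orderings differ, and -1 if
 * either locale cannot be set. An empty locale_name means "whatever
 * the environment says". The process is left in the "C" locale.
 */
static int strncmp_ignores_locale(const char *locale_name,
                                  const char *a, const char *b, size_t n)
{
    int in_c, in_named;

    if (setlocale(LC_ALL, "C") == NULL)
        return -1;
    in_c = sign_of(strncmp(a, b, n));

    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;
    in_named = sign_of(strncmp(a, b, n));

    setlocale(LC_ALL, "C");
    return in_c == in_named;
}
```

A pair such as "B" and "a" is a reasonable probe, because many non-C locales collate it in the opposite order from the byte values; a result of 0 for such a pair would support the claim that strncmp() has been made locale-aware.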
> Tatsuo, what is LC_ALL (or the other locale envvars) set to when you run
> the program? The man page for setlocale() on my machine documents that
> the main() starts in C or POSIX locale mode by default. The call to
> setlocale(LC_ALL, "") reads the envvars and sets the locale
> accordingly. Maybe RedHat's 6.2J isn't setting up the locale properly
> to begin with? See what /etc/sysconfig/i18n contains -- if it is empty
> or doesn't exist, then locale is simply not set up. But you specifically
> mention the particular locale....

It's "ja_JP.eucJP". That locale definitely exists, so I guess its contents are broken...

> Ok, what combinations _do_ work? We _know_ C or POSIX works -- but
> which ones don't work, on RH >6.1? While I want to make sure that a
> broken locale data set isn't used, I also want to make sure that a good
> locale set isn't thrown out, either. Forcing to LC_COLLATE=C is
> overkill, IMHO. And building without locale support doesn't work,

I guess most single-byte locales work. However, I seriously doubt that locales for multibyte languages would work.

> either, because, at least on RH 6.1, strncmp() is buggered to use the
> locale's collation.

Really? I see PostgreSQL installations without locale support working just fine on RH 6.1J.

> The real solution is for the vendors to fix their broken locales.

Of course.
--
Tatsuo Ishii