Thread: locale support

locale support

From
Tatsuo Ishii
Date:
There is a serious problem with PostgreSQL's locale support on
certain combinations of platform and locale: ordering, indexes, etc.
are broken because strcoll() does not work. One such combination is
RedHat 6.2J (the Japanese localized version) with the ja_JP.eucJP
locale. Here is a test program that exposes the problem.

#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(void)
{
    static char *s1 = "a Japanese string";
    static char *s2 = "another Japanese string";

    setlocale(LC_ALL, "");    /* take the locale from the environment */
    printf("%d\n", strcoll(s1, s2));
    printf("%d\n", strcoll(s2, s1));
    return 0;
}

This program prints 0 twice, which means strcoll() regards those two
different Japanese strings as equal!

I know this is not PostgreSQL's fault but the broken locale data on
certain platforms. The problem makes it impossible to use PostgreSQL
RPMs in Japan.

I'm looking for solutions/workarounds for this problem. Maybe we
should disable locale support at runtime if strcoll() does not work?
Comments?
--
Tatsuo Ishii
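
A minimal sketch of the runtime check floated above, assuming a heuristic probe is acceptable: two strings that differ byte-wise are passed to strcoll() in both orders, and if the results are zero or inconsistent, only LC_COLLATE is reset to "C". The names strcoll_distinguishes() and init_collation() are hypothetical and are not PostgreSQL functions; a correct locale may legitimately treat some distinct strings as equal, so the probe strings would need to be chosen with care.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/*
 * Heuristic: returns 1 if strcoll() orders the two byte-wise distinct probe
 * strings consistently in both directions, 0 otherwise.
 */
int strcoll_distinguishes(const char *a, const char *b)
{
    int ab, ba;

    assert(a != NULL && b != NULL);
    assert(strcmp(a, b) != 0);      /* the probes must really differ */

    ab = strcoll(a, b);
    ba = strcoll(b, a);
    return ab != 0 && ba != 0 && (ab < 0) != (ba < 0);
}

/*
 * Hypothetical startup helper (not PostgreSQL code): adopt the locale from
 * the environment, then fall back to "C" collation if the probe fails.
 * Returns 1 if the fallback was taken, 0 otherwise.
 */
int init_collation(const char *probe_a, const char *probe_b)
{
    setlocale(LC_ALL, "");
    if (strcoll_distinguishes(probe_a, probe_b))
        return 0;
    setlocale(LC_COLLATE, "C");
    return 1;
}
```

On the platform described, the probes would be two strings encoded in the locale's own character set (EUC-JP for ja_JP.eucJP), since ASCII probes may well compare correctly even when the multibyte collation data is broken.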


Re: locale support

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> I know this is not PostgreSQL's fault but the broken locale data on
> certain platforms. The problem makes it impossible to use PostgreSQL
> RPMs in Japan.

> I'm looking for solutions/workarounds for this problem.

Build a set of RPMs without locale support?
        regards, tom lane


Re: locale support

From
ncm@zembu.com (Nathan Myers)
Date:
On Mon, Feb 12, 2001 at 09:59:37PM -0500, Tom Lane wrote:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > I know this is not PostgreSQL's fault but the broken locale data on
> > certain platforms. The problem makes it impossible to use PostgreSQL
> > RPMs in Japan.
> 
> > I'm looking for solutions/workarounds for this problem.
> 
> Build a set of RPMs without locale support?

Run it with LC_ALL="C".

Nathan Myers
ncm@zembu.com
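
As a rough illustration of what running with LC_ALL="C" amounts to inside the process, the sketch below sets the variable itself and then calls setlocale(LC_ALL, ""), after which strcoll() is expected to order strings by byte value just as strcmp() does. The helper names force_c_locale() and collation_matches_bytes() are made up for this example, and setenv() is assumed to be available (it is POSIX, not ISO C).

```c
#define _POSIX_C_SOURCE 200112L     /* for setenv() */
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Reduce a comparison result to -1, 0 or 1. */
static int sign_of(int v)
{
    return (v > 0) - (v < 0);
}

/*
 * Override LC_ALL in this process's environment and re-read the locale.
 * Returns 1 on success, 0 on failure.
 */
int force_c_locale(void)
{
    if (setenv("LC_ALL", "C", 1) != 0)
        return 0;
    return setlocale(LC_ALL, "") != NULL;
}

/* Returns 1 if strcoll() and strcmp() agree on the order of a and b. */
int collation_matches_bytes(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return sign_of(strcoll(a, b)) == sign_of(strcmp(a, b));
}
```

In practice the variable would be set in the environment that starts the postmaster rather than in the program; the in-process setenv() call is only there to keep the example self-contained.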


Re: locale support

From
Tatsuo Ishii
Date:
> > I know this is not PostgreSQL's fault but the broken locale data on
> > certain platforms. The problem makes it impossible to use PostgreSQL
> > RPMs in Japan.
> 
> > I'm looking for solutions/workarounds for this problem.
> 
> Build a set of RPMs without locale support?

> Run it with LC_ALL="C".

Neither of them seems an ideal solution for RPMs. It would be nice if we
could distribute a single binary and startup file in the RPM.
--
Tatsuo Ishii


Re: locale support

From
Hannu Krosing
Date:
Nathan Myers wrote:
> 
> On Mon, Feb 12, 2001 at 09:59:37PM -0500, Tom Lane wrote:
> > Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > > I know this is not PostgreSQL's fault but the broken locale data on
> > > certain platforms. The problem makes it impossible to use PostgreSQL
> > > RPMs in Japan.
> >
> > > I'm looking for solutions/workarounds for this problem.
> >
> > Build a set of RPMs without locale support?
> 
> Run it with LC_ALL="C".

It would help if there were a sample working LC_ALL=xxx line in
/etc/rc.d/init.d/postgresql.

As it stands now, it is a real pita to get the LC_xx settings down to the
real postmaster through all the layers (and guessing whether they took
effect after each restart ;)

---------
Hannu
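
One way to take the guessing out of this, sketched under the assumption that a small diagnostic can be run (or linked in) under the same environment as the server: setlocale() with a NULL second argument only queries the current setting, so it can report what each category actually ended up as. The functions effective_locale() and report_locale() are hypothetical names for this example. Note that the values reflect the environment only after the process has called setlocale(LC_ALL, "").

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Name of the locale currently in effect for a category; changes nothing. */
const char *effective_locale(int category)
{
    const char *name = setlocale(category, NULL);   /* NULL = query only */
    return name != NULL ? name : "(unknown)";
}

/* Writes one "LABEL=value" line per category; returns the line count. */
int report_locale(FILE *out)
{
    static const struct { int category; const char *label; } cats[] = {
        { LC_COLLATE, "LC_COLLATE" },
        { LC_CTYPE,   "LC_CTYPE"   },
        { LC_ALL,     "LC_ALL"     },
    };
    size_t i;
    int lines = 0;

    assert(out != NULL);
    for (i = 0; i < sizeof cats / sizeof cats[0]; i++) {
        fprintf(out, "%s=%s\n", cats[i].label,
                effective_locale(cats[i].category));
        lines++;
    }
    return lines;
}
```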


Re: locale support

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> I know this is not PostgreSQL's fault but the broken locale data on
> certain platforms. The problem makes it impossible to use PostgreSQL
> RPMs in Japan.
>> 
> I'm looking for solutions/workarounds for this problem.
>> 
>> Build a set of RPMs without locale support?

>> Run it with LC_ALL="C".

> Neither of them seems an ideal solution for RPMs. It would be nice if we
> could distribute a single binary and startup file in the RPM.

If you can find a non-intrusive way to do that, sure ... but I don't
think that we should expend any great amount of effort, nor uglify the
code, in order to cater to a demonstrably broken library on one
particular platform.

The LC_ALL answer seems the best to me.
        regards, tom lane


Re: locale support

From
Lamar Owen
Date:
Tom Lane wrote:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > I know this is not PostgreSQL's fault but the broken locale data on
> > certain platforms. The problem makes it impossible to use PostgreSQL
> > RPMs in Japan.
> > I'm looking for solutions/workarounds for this problem.

> >> Build a set of RPMs without locale support?
> >> Run it with LC_ALL="C".
> > Neither of them seems an ideal solution for RPMs. It would be nice if we
> > could distribute a single binary and startup file in the RPM.
> If you can find a non-intrusive way to do that, sure ... but I don't
> think that we should expend any great amount of effort, nor uglify the
> code, in order to cater to a demonstrably broken library on one
> particular platform.

Tatsuo, what is LC_ALL (or the other locale envvars) set to when you run
the program?  The man page for setlocale() on my machine documents that
the main() starts in C or POSIX locale mode by default.  The call to
setlocale(LC_ALL, "") reads the envvars and sets the locale
accordingly.  Maybe RedHat's 6.2J isn't setting up the locale properly
to begin with?  See what /etc/sysconfig/i18n contains -- if it is empty
or doesn't exist, then locale is simply not set up. But you specifically
mention the particular locale....

Ok, what combinations _do_ work?  We _know_ C or POSIX works -- but
which ones don't work, on RH >6.1?  While I want to make sure that a
broken locale data set isn't used, I also want to make sure that a good
locale set isn't thrown out, either.  Forcing to LC_COLLATE=C is
overkill, IMHO.  And building without locale support doesn't work,
either, because, at least on RH 6.1, strncmp() is buggered to use the
locale's collation.

The real solution is for the vendors to fix their broken locales.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
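
To survey which locales do and do not work, a probe along the following lines could be run once per locale name. This is a sketch only: probe_locale() is a hypothetical helper, the "does strcoll() tell two different strings apart" test is a heuristic rather than a full validation of the locale data, and the probe strings are assumed to be encoded in the character set of the locale under test. The previous locale is saved and restored so the survey does not disturb the caller.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Returns -1 if the named locale cannot be selected (or on allocation
 * failure), 0 if strcoll() reports the two probe strings as equal,
 * 1 if it tells them apart in both directions.
 */
int probe_locale(const char *name, const char *a, const char *b)
{
    const char *current;
    char *saved;
    int result = -1;

    assert(name != NULL && a != NULL && b != NULL);

    /* The string returned by setlocale() may be overwritten by later
     * calls, so keep a private copy for restoring afterwards. */
    current = setlocale(LC_ALL, NULL);
    if (current == NULL)
        return -1;
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_ALL, name) != NULL)
        result = (strcoll(a, b) != 0 && strcoll(b, a) != 0) ? 1 : 0;

    setlocale(LC_ALL, saved);       /* put things back as they were */
    free(saved);
    return result;
}
```

Going by the report at the top of the thread, calling this with "ja_JP.eucJP" and two different EUC-JP strings on RedHat 6.2J would be expected to return 0.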


Re: locale support

From
Peter Eisentraut
Date:
Lamar Owen writes:

> And building without locale support doesn't work, either, because, at
> least on RH 6.1, strncmp() is buggered to use the locale's collation.

I don't think so.  On RH 6.1, strncmp() is the same it's ever been:

int
strncmp (s1, s2, n)
     const char *s1;
     const char *s2;
     size_t n;
{
  unsigned reg_char c1 = '\0';
  unsigned reg_char c2 = '\0';

  if (n >= 4)
    {
      size_t n4 = n >> 2;
      do
        {
          c1 = (unsigned char) *s1++;
          c2 = (unsigned char) *s2++;
          if (c1 == '\0' || c1 != c2)
            return c1 - c2;
          c1 = (unsigned char) *s1++;
          c2 = (unsigned char) *s2++;
          if (c1 == '\0' || c1 != c2)
            return c1 - c2;
          c1 = (unsigned char) *s1++;
          c2 = (unsigned char) *s2++;
          if (c1 == '\0' || c1 != c2)
            return c1 - c2;
          c1 = (unsigned char) *s1++;
          c2 = (unsigned char) *s2++;
          if (c1 == '\0' || c1 != c2)
            return c1 - c2;
        } while (--n4 > 0);
      n &= 3;
    }

  while (n > 0)
    {
      c1 = (unsigned char) *s1++;
      c2 = (unsigned char) *s2++;
      if (c1 == '\0' || c1 != c2)
        return c1 - c2;
      n--;
    }

  return c1 - c2;
}

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/



Re: locale support

From
Lamar Owen
Date:
Peter Eisentraut wrote:
> Lamar Owen writes:
> > And building without locale support doesn't work, either, because, at
> > least on RH 6.1, strncmp() is buggered to use the locale's collation.

> I don't think so.  On RH 6.1, strncmp() is the same it's ever been:
[snip]

Is that the code after any glibc RPM patches are applied?  'Pristine
source, perhaps -- but patch like crazy!'  Reference the classic
'Reflections on Trusting Trust' by Ken Thompson (which you have probably
read already, but, for those on-list who may not have read this classic
work on security, you can find the paper at
http://www.acm.org/classics/sep95/).  Although reading the glibc spec
file indicates that patching isn't done in the 'conventional' manner
here. (Lovely).

I base my assertion on running test queries on a RedHat 6.1 box over a
year ago, using the non-locale 6.5.3 RPMset I distributed at that point
(I distributed non-locale RPMs because of its speed being greater in
indexing, etc).  The user who was having difficulties also tried the
non-locale RPMset -- and no change, until removing /etc/sysconfig/i18n. 
I've referenced the thread before in the archives; see the message
http://www.postgresql.org/mhonarc/pgsql-hackers/1999-12/msg00678.html
for the middle of the thread.

But, of course, that was 6.5.3.  If 7.x behaves differently, I wouldn't
know, as I've not built a 'non-locale' RPMset of 7.x.  But, I can if
needed.  Or try the test queries on your own RH 7 box, with a non-locale
build.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11


Re: locale support

From
Peter Eisentraut
Date:
Lamar Owen writes:

> > I don't think so.  On RH 6.1, strncmp() is the same it's ever been:
> [snip]
>
> Is that the code after any glibc RPM patches are applied?

Yes.

> I base my assertion on running test queries on a RedHat 6.1 box over a
> year ago, using the non-locale 6.5.3 RPMset I distributed at that point
> (I distributed non-locale RPMs because of its speed being greater in
> indexing, etc).  The user who was having difficulties also tried the
> non-locale RPMset -- and no change, until removing /etc/sysconfig/i18n.

I recall that thread, but the conclusion that was reached (that strncmp()
is at fault in some way) was never proved sufficiently.

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/
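
The open question of whether strncmp() is affected by the locale could be checked empirically with something like the sketch below. It compares the sign of the library's strncmp() against a plain unsigned-byte loop (a simplified rewrite of the routine quoted earlier in the thread, without the unrolling) after selecting a given locale. The names bytewise_ncmp() and strncmp_is_bytewise() are invented for this example, and the function leaves the selected locale in effect.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Reduce a comparison result to -1, 0 or 1. */
static int sgn(int v)
{
    return (v > 0) - (v < 0);
}

/* Reference: compare at most n bytes as unsigned char, stopping at NUL. */
static int bytewise_ncmp(const char *s1, const char *s2, size_t n)
{
    unsigned char c1 = '\0';
    unsigned char c2 = '\0';

    while (n-- > 0) {
        c1 = (unsigned char) *s1++;
        c2 = (unsigned char) *s2++;
        if (c1 == '\0' || c1 != c2)
            break;
    }
    return c1 - c2;
}

/*
 * Returns -1 if the locale cannot be selected, otherwise 1 if strncmp()
 * agrees in sign with the byte-wise reference under that locale, else 0.
 */
int strncmp_is_bytewise(const char *locale_name,
                        const char *a, const char *b, size_t n)
{
    assert(locale_name != NULL && a != NULL && b != NULL);
    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;
    return sgn(strncmp(a, b, n)) == sgn(bytewise_ncmp(a, b, n));
}
```

A result of 0 for some locale and pair of strings would support the claim that strncmp() follows the locale's collation; a result of 1 across the strings that misbehaved would point elsewhere.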



Re: locale support

From
Tatsuo Ishii
Date:
> Tatsuo, what is LC_ALL (or the other locale envvars) set to when you run
> the program?  The man page for setlocale() on my machine documents that
> the main() starts in C or POSIX locale mode by default.  The call to
> setlocale(LC_ALL, "") reads the envvars and sets the locale
> accordingly.  Maybe RedHat's 6.2J isn't setting up the locale properly
> to begin with?  See what /etc/sysconfig/i18n contains -- if it is empty
> or doesn't exist, then locale is simply not set up. But you specifically
> mention the particular locale....

It's "ja_JP.eucJP". That locale definitely exists, so I guess its
contents are broken...

> Ok, what combinations _do_ work?  We _know_ C or POSIX works -- but
> which ones don't work, on RH >6.1?  While I want to make sure that a
> broken locale data set isn't used, I also want to make sure that a good
> locale set isn't thrown out, either.  Forcing to LC_COLLATE=C is
> overkill, IMHO.  And building without locale support doesn't work,

I guess most single-byte locales work. However, I seriously doubt that
locales for multibyte languages would work.

> either, because, at least on RH 6.1, strncmp() is buggered to use the
> locale's collation.

Really? I see PostgreSQL installations without the locale support work
just fine on RH 6.1J.

> The real solution is for the vendors to fix their broken locales.

Of course.
--
Tatsuo Ishii