Thread: equal operator fails on two identical strings if initdb uses the traditional chinese locale

Hi,

I'm running PostgreSQL v8 beta4 on Win2K. The default language
selected in Win2K is Big5.

I am using the Windows installer to install it. Everything is
left as default except that the locale for initdb is set to
"traditional-chinese".

Here is a test (run in pgadmin III):
1. createdb db1 -E Unicode
2. psql db1
3. create table t1 ( s varchar(20) primary key );
4. insert into t1 values('xyz'); Note that x, y and z are all
    Chinese characters.
5. select * from t1; It shows that record just fine.
6. select * from t1 where s='xyz'; It fails to find that record.
7. select * from t1 where s like 'xy%'; It finds that record.

If I reinstall pgsql but leave the locale as the default ("C"),
then the above test passes.

BTW, the locale for traditional chinese in postgresql.conf is
set to "traditional-chinese" literally. Shouldn't it be
zh_TW?

Thanks!
Kent Tong wrote:
> I'm running PostgreSQL v8 beta4 on Win2K. The default language
> selected in Win2K is Big5.

Big5 is an encoding, not a language.

> I am using the Windows installer to install it. Everything is
> left as default except that the locale for initdb is set to
> "traditional-chinese".
>
> Here is a test (run in pgadmin III):
> 1. createdb db1 -E Unicode

Probably your locale does not support Unicode.  You need to pick an
encoding that matches your locale or vice versa.

> BTW, the locale for traditional chinese in postgresql.conf is
> set to "traditional-chinese" literally. Shouldn't it be
> zh_TW?

That depends on what locale names the Windows operating system
understands.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: equal operator fails on two identical strings if initdb

From
Kent Tong
Date:
Peter Eisentraut wrote:
>>Here is a test (run in pgadmin III):
>>1. createdb db1 -E Unicode
>
>
> Probably your locale does not support Unicode.  You need to pick an
> encoding that matches your locale or vice versa.

Is there any way to check?
I have other programs reading and writing Unicode on this
computer without problems.

>>BTW, the locale for traditional chinese in postgresql.conf is
>>set to "traditional-chinese" literally. Shouldn't it be
>>zh_TW?
>
>
> That depends on what locale names the Windows operating system
> understands.

Are you using the locale routines in mingw?

Re: equal operator fails on two identical strings if initdb

From
Peter Eisentraut
Date:
Kent Tong wrote:
> Is there any way to check?

On a POSIX system, you can do

$ LC_ALL=<some_locale> locale charmap

and verify manually that the printed charmap (= character set encoding)
matches what you use in PostgreSQL.  I don't know whether an equivalent
interface exists on Windows.

> I have other programs reading and writing Unicode on this
> computer without problems.

Reading and writing Unicode is not a problem.  But if you run the string
comparison operators, PostgreSQL passes the Unicode strings from your
database to the operating system's collation routines, which will
compare them thinking they are Big5 (or whatever) strings, which will
result in the random behavior you observed.  You need to set an
appropriate locale so that the operating system also thinks they are in
Unicode.

> Are you using the locale routines in mingw?

I believe we do.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: equal operator fails on two identical strings if initdb

From
Kent Tong
Date:
Peter Eisentraut wrote:
> On a POSIX system, you can do
>
> $ LC_ALL=<some_locale> locale charmap
>
> and verify manually that the printed charmap (= character set encoding)
> matches what you use in PostgreSQL.  I don't know whether an equivalent
> interface exists on Windows.

Right, there is no such command.

> Reading and writing Unicode is not a problem.  But if you run the string
> comparison operators, PostgreSQL passes the Unicode strings from your
> database to the operating system's collation routines, which will
> compare them thinking they are Big5 (or whatever) strings, which will
> result in the random behavior you observed.  You need to set an
> appropriate locale so that the operating system also thinks they are in
> Unicode.

You mean the OS fails to convert unicode strings to Big5 or the
OS assumes the bytes are already in Big5?

It is the locale used for initdb or the default system locale
set in Windows that is used by the collation routines that you
mentioned above?

I just double checked my config and found that the default locale
is US english. The "supported languages" are:
* Traditional Chinese (default)
* Simplified Chinese
* Western Europe and United States.

Re: equal operator fails on two identical strings if initdb

From
Tom Lane
Date:
Kent Tong <kent@cpttm.org.mo> writes:
> You mean the OS fails to convert unicode strings to Big5 or the
> OS assumes the bytes are already in Big5?

The latter.

> It is the locale used for initdb or the default system locale
> set in Windows that is used by the collation routines that you
> mentioned above?

The former.

The real problem here, IMHO, is that Postgres allows you to select a
"database encoding" setting that is different from the encoding implied
by the initdb locale (ie, the LC_CTYPE setting).  If you make this
mistake, PG will carefully store data byte sequences in the specified
"database encoding" ... and then pass them to strcoll() for comparison
... and strcoll() will assume that the data is in the encoding
associated with LC_CTYPE.

This is partially bad design on our part (we should really not have
invented a per-database encoding selection when the locale setting is
not per-database) and partially bad design on the part of the C standard
(which doesn't provide any very sane way to find out what encoding is
implied by an LC_CTYPE setting).

I think the only real fix is to abandon the C library's locale routines
and find or write our own library with a better API.  This has been on
the TODO list for a long time but no one's quite wished to face up to
doing it ...

In the meantime, make sure your encoding setting agrees with the
LC_CTYPE value that initdb used.

            regards, tom lane