Thread: equal operator fails on two identical strings if initdb uses the traditional chinese locale
equal operator fails on two identical strings if initdb uses the traditional chinese locale
From
Kent Tong
Date:
Hi, I'm running PostgreSQL v8 beta4 on Win2K. The default language selected in Win2K is Big5. I am using the Windows installer to install it. Everything is left as default except that the locale for initdb is set to "traditional-chinese". Here is a test (run in pgadmin III): 1. createdb db1 -E Unicode 2. psql db1 3. create table t1 ( s varchar(20) primary key ); 4. insert into t1 values('xyz'); Note that x, y and z are all Chinese characters. 5. select * from t1; It shows that record just fine. 6. select * from t1 where s='xyz'; It fails to find that record. 7. select * from t1 where s like 'xy%'; It finds that record. If I reinstall pgsql but leave the locale as the default ("C"), then the above test passes. BTW, the locale for traditional chinese in postgresql.conf is set to "traditional-chinese" literally. Shouldn't it be zh_TW? Thanks!
Re: equal operator fails on two identical strings if initdb uses the traditional chinese locale
From
Peter Eisentraut
Date:
Kent Tong wrote: > I'm running PostgreSQL v8 beta4 on Win2K. The default language > selected in Win2K is Big5. Big5 is an encoding, not a language. > I am using the Windows installer to install it. Everything is > left as default except that the locale for initdb is set to > "traditional-chinese". > > Here is a test (run in pgadmin III): > 1. createdb db1 -E Unicode Probably your locale does not support Unicode. You need to pick an encoding that matches your locale or vice versa. > BTW, the locale for traditional chinese in postgresql.conf is > set to "traditional-chinese" literally. Shouldn't it be > zh_TW? That depends on what locale names the Windows operating system understands. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Peter Eisentraut wrote: >>Here is a test (run in pgadmin III): >>1. createdb db1 -E Unicode > > > Probably your locale does not support Unicode. You need to pick an > encoding that matches your locale or vice versa. Is there any way to check? I have other programs reading and writing Unicode on this computer without problems. >>BTW, the locale for traditional chinese in postgresql.conf is >>set to "traditional-chinese" literally. Shouldn't it be >>zh_TW? > > > That depends on what locale names the Windows operating system > understands. Are you using the locale routines in mingw?
Kent Tong wrote: > Is there any way to check? On a POSIX system, you can do $ LC_ALL=<some_locale> locale charmap and verify manually that the printed charmap (= character set encoding) matches what you use in PostgreSQL. I don't know whether an equivalent interface exists on Windows. > I have other programs reading and writing Unicode on this > computer without problems. Reading and writing Unicode is not a problem. But if you run the string comparison operators, PostgreSQL passes the Unicode strings from your database to the operating system's collation routines, which will compare them thinking they are Big5 (or whatever) strings, which will result in the random behavior you observed. You need to set an appropriate locale so that the operating system also thinks they are in Unicode. > Are you using the locale routines in mingw? I believe we do. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Peter Eisentraut wrote: > On a POSIX system, you can do > > $ LC_ALL=<some_locale> locale charmap > > and verify manually that the printed charmap (= character set encoding) > matches what you use in PostgreSQL. I don't know whether an equivalent > interface exists on Windows. Right, there is no such command. > Reading and writing Unicode is not a problem. But if you run the string > comparison operators, PostgreSQL passes the Unicode strings from your > database to the operating system's collation routines, which will > compare them thinking they are Big5 (or whatever) strings, which will > result in the random behavior you observed. You need to set an > appropriate locale so that the operating system also thinks they are in > Unicode. You mean the OS fails to convert unicode strings to Big5 or the OS assumes the bytes are already in Big5? It is the locale used for initdb or the default system locale set in Windows that is used by the collation routines that you mentioned above? I just double checked my config and found that the default locale is US english. The "supported languages" are: * Traditional Chinese (default) * Simplified Chinese * Western Europe and United States.
Kent Tong <kent@cpttm.org.mo> writes: > You mean the OS fails to convert unicode strings to Big5 or the > OS assumes the bytes are already in Big5? The latter. > It is the locale used for initdb or the default system locale > set in Windows that is used by the collation routines that you > mentioned above? The former. The real problem here, IMHO, is that Postgres allows you to select a "database encoding" setting that is different from the encoding implied by the initdb locale (ie, the LC_CTYPE setting). If you make this mistake, PG will carefully store data byte sequences in the specified "database encoding" ... and then pass them to strcoll() for comparison ... and strcoll() will assume that the data is in the encoding associated with LC_CTYPE. This is partially bad design on our part (we should really not have invented a per-database encoding selection when the locale setting is not per-database) and partially bad design on the part of the C standard (which doesn't provide any very sane way to find out what encoding is implied by an LC_CTYPE setting). I think the only real fix is to abandon the C library's locale routines and find or write our own library with a better API. This has been on the TODO list for a long time but no one's quite wished to face up to doing it ... In the meantime, make sure your encoding setting agrees with the LC_CTYPE value that initdb used. regards, tom lane