Thread: Case Conversion Fix for MB Chars
Here's small patch to fix case conversion problems for ILIKE and (Oracle Compat.) lower/upper functions. (Related bug report is here: http://archives.postgresql.org/pgsql-bugs/2005-10/msg00001.php) In tests it succeeded for Turkish characters while using LATIN5 encoding. But when encoding is UNICODE it still doesn't work. (IMHO, for latin-N encodings there will be no compatibility problems; for Unicode, I've no idea.) Regards.
Attachment
Last minute edit: On 11/26/05, Volkan YAZICI <volkan.yazici@gmail.com> wrote: - In tests it succeeded for Turkish characters while using LATIN5 - encoding. But when encoding is UNICODE it still doesn't work. (IMHO, - for latin-N encodings there will be no compatibility problems; for - Unicode, I've no idea.) + In tests it succeeded for Turkish characters while using both + LATIN5 and UNICODE encodings. Furthermore, it shouldn't raise + any compatibility problems for LATIN-N encodings, but for + UNICODE, I don't have an exact idea. Regards.
Volkan YAZICI <volkan.yazici@gmail.com> writes: > Here's small patch to fix case conversion problems for ILIKE and > (Oracle Compat.) lower/upper functions. (Related bug report is here: > http://archives.postgresql.org/pgsql-bugs/2005-10/msg00001.php) Doesn't this just move the failure cases around? The really fundamental problem is that tolower/toupper don't work on wchar_t. regards, tom lane
[Sorry if the post is duplicated. But I don't receive and ACK from majordamo.] On 11/27/05, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Doesn't this just move the failure cases around? I don't think so. I've tried to fix every tolower/toupper related problem (depending on the MB characters) I found in the code. If there's still a problem with them, it should be related with the pg_wchar_tb concept in utils/mb/wchar.c (AFAIC, the last remaning part about this case conversion mess is the one related with PQfnumber - which needs a very different approach.) > The really fundamental problem is that tolower/toupper don't work > on wchar_t. Yes, indeed. I agree with you to find a proper solution for character set handling. But, IMHO, it's better to have a-patchy working system instead of a not working one. Put yourself in our position, we're trying to give enterprise solutions to our customers and ILIKE, lower/upper doesn't work in the product that we support. We can work with a small patch instead of waiting next releases for an exact solution to the problem.
On 11/27/05, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Doesn't this just move the failure cases around? I don't think so. I've tried to fix every tolower/toupper related problem (depending on the MB characters) I found in the code. If there's still a problem with them, it should be related with the pg_wchar_tb concept in utils/mb/wchar.c (AFAIC, the last remaning part about this case conversion mess is the one related with PQfnumber - which needs a very different approach.) > The really fundamental problem is that tolower/toupper don't work > on wchar_t. Yes, indeed. I agree with you to find a proper solution for character set handling. But, IMHO, it's better to have a-patchy working system instead of a not working one. Put yourself in our position, we're trying to give enterprise solutions to our customers and ILIKE, lower/upper doesn't work in the product that we support. We can work with a small patch instead of waiting next releases for an exact solution to the problem. Regards.
Volkan YAZICI <volkan.yazici@gmail.com> writes: > On 11/27/05, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The really fundamental problem is that tolower/toupper don't work >> on wchar_t. > Yes, indeed. I agree with you to find a proper solution for character > set handling. But, IMHO, it's better to have a-patchy working system > instead of a not working one. But you just agreed that it doesn't work. It might be that there are degrees of not-working-ness here, but before adopting a partial solution I would like to see some reasoning why it won't make things worse for other people. I think that what you are proposing could lead to arbitrarily bad behavior (up to and including server crashes) depending on what libc's toupper/tolower functions are coded to do with out-of-range inputs. Exactly what cases have you tried, and on what platforms? regards, tom lane
On 11/27/05, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Yes, indeed. I agree with you to find a proper solution for character > > set handling. But, IMHO, it's better to have a-patchy working system > > instead of a not working one. > > But you just agreed that it doesn't work. You get me wrong. I tried to to say it works for latin5 and unicode. When I look at the latin-n implementation of PostgreSQL, see that there doesn't exist a far difference between 'em. So there shouldn't be a problem with latin-n encodings. But, as I try to explain, I don't have so much knowledge on unicode characters and cannot claim an exact answer about it. > It might be that there are degrees of not-working-ness here, but before > adopting a partial solution I would like to see some reasoning why it > won't make things worse for other people. I think that what you are > proposing could lead to arbitrarily bad behavior (up to and including > server crashes) depending on what libc's toupper/tolower functions are > coded to do with out-of-range inputs. Furthermore, I just used the methods already in the code. If it'll cause any crashes in the server side, it won't be my fault. (Except logical errors I made.) > Exactly what cases have you tried, and on what platforms? I don't have a buildfarm at home. Tests made on an i686 with a 2.6.12.5 kernel. Here's a short list of cases I tried with both latin5 and unicode charsets: - lower/upper functions with Turkish characters. - ILIKE matches with both lower and upper case Turkish characters. (Above testes succeeded for non-Turkish characters too.) Regards.
On 11/27/05, Volkan YAZICI <volkan.yazici@gmail.com> wrote: > Tests made on an i686 with a > 2.6.12.5 kernel. Here's a short list of cases I tried with both latin5 > and unicode charsets: > - lower/upper functions with Turkish characters. > - ILIKE matches with both lower and upper case Turkish characters. > (Above testes succeeded for non-Turkish characters too.) I read the above paragraph again and realized the out of usability of it. Here's a modified one: Test's made on a Debian GNU/Linux (stable) 3.1 by patching src/backend/utils/adt/like.c (r1.62) and src/backend/utils/adt/oracle_compat.c (r1.64) files. Related software versions: - gcc-3.3 [3.3.5-13] - libc6-dev [2.3.2.ds1-22] - locales [2.3.2.ds1-22] Tried test cases using patched CVS HEAD: [For Latin5] $ usr/bin/initdb -D var/data $ LANG="tr_TR.ISO-8859-9" usr/bin/postmaster -D var/data $ usr/bin/createdb -E latin5 test_latin5 $ usr/bin/psql test_latin5 Welcome to psql 8.2devel, the PostgreSQL interactive terminal. Type: \copyright for distribution terms \h for help with SQL commands \? for help with psql commands \g or terminate with semicolon to execute query \q to quit test_latin5=# SHOW client_encoding; client_encoding ----------------- LATIN5 (1 row) test_latin5=# SELECT upper('abcdefgğhıijklmnoöprsştuüvyz qwx 0123456789'); upper ------------------------------------------- ABCDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ QWX 0123456789 (1 row) test_latin5=# SELECT test_latin5-# lower('ABCDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ QWX 0123456789'); lower --------------------------------------------- abcdefgğhıijklmnoöprsştuüvyz qwx 0123456789 (1 row) test_latin5=# BEGIN; BEGIN test_latin5=# CREATE TEMP TABLE t (v varchar); CREATE TABLE test_latin5=# COPY t FROM stdin; Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself. >> ı123 >> I123 >> i123 >> İ123 >> \. test_latin5=# SELECT v FROM t; v ------ ı123 I123 i123 İ123 (4 rows) test_latin5=# SELECT v FROM t WHERE v ILIKE 'ı%'; v ------ ı123 I123 (2 rows) test_latin5=# SELECT v FROM t WHERE v ILIKE 'I%'; v ------ ı123 I123 (2 rows) test_latin5=# SELECT v FROM t WHERE v ILIKE 'i%'; v ------ i123 İ123 (2 rows) test_latin5=# SELECT v FROM t WHERE v ILIKE 'İ%'; v ------ i123 İ123 (2 rows) test_latin5=# ROLLBACK; ROLLBACK [For UNICODE] Same steps as above with LANG="tr_TR.UTF-8" and database/client encoding as UNICODE. Hope this tests help. Regards.
Where are we on this patch? Is it to be applied? --------------------------------------------------------------------------- Volkan YAZICI wrote: > On 11/27/05, Volkan YAZICI <volkan.yazici@gmail.com> wrote: > > Tests made on an i686 with a > > 2.6.12.5 kernel. Here's a short list of cases I tried with both latin5 > > and unicode charsets: > > - lower/upper functions with Turkish characters. > > - ILIKE matches with both lower and upper case Turkish characters. > > (Above testes succeeded for non-Turkish characters too.) > > I read the above paragraph again and realized the out of usability of > it. Here's a modified one: > > Test's made on a Debian GNU/Linux (stable) 3.1 by patching > src/backend/utils/adt/like.c (r1.62) and > src/backend/utils/adt/oracle_compat.c (r1.64) files. Related software > versions: > - gcc-3.3 [3.3.5-13] > - libc6-dev [2.3.2.ds1-22] > - locales [2.3.2.ds1-22] > > Tried test cases using patched CVS HEAD: > > [For Latin5] > $ usr/bin/initdb -D var/data > $ LANG="tr_TR.ISO-8859-9" usr/bin/postmaster -D var/data > $ usr/bin/createdb -E latin5 test_latin5 > $ usr/bin/psql test_latin5 > Welcome to psql 8.2devel, the PostgreSQL interactive terminal. > > Type: \copyright for distribution terms > \h for help with SQL commands > \? for help with psql commands > \g or terminate with semicolon to execute query > \q to quit > > test_latin5=# SHOW client_encoding; > client_encoding > ----------------- > LATIN5 > (1 row) > > test_latin5=# SELECT upper('abcdefg?h?ijklmno?prs?tu?vyz qwx 0123456789'); > upper > ------------------------------------------- > ABCDEFG?HI?JKLMNO?PRS?TU?VYZ QWX 0123456789 > (1 row) > > test_latin5=# SELECT > test_latin5-# lower('ABCDEFG?HI?JKLMNO?PRS?TU?VYZ QWX 0123456789'); > lower > --------------------------------------------- > abcdefg?h?ijklmno?prs?tu?vyz qwx 0123456789 > (1 row) > > test_latin5=# BEGIN; > BEGIN > test_latin5=# CREATE TEMP TABLE t (v varchar); > CREATE TABLE > test_latin5=# COPY t FROM stdin; > Enter data to be copied followed by a newline. > End with a backslash and a period on a line by itself. > >> ?123 > >> I123 > >> i123 > >> ?123 > >> \. > test_latin5=# SELECT v FROM t; > v > ------ > ?123 > I123 > i123 > ?123 > (4 rows) > > test_latin5=# SELECT v FROM t WHERE v ILIKE '?%'; > v > ------ > ?123 > I123 > (2 rows) > > test_latin5=# SELECT v FROM t WHERE v ILIKE 'I%'; > v > ------ > ?123 > I123 > (2 rows) > > test_latin5=# SELECT v FROM t WHERE v ILIKE 'i%'; > v > ------ > i123 > ?123 > (2 rows) > > test_latin5=# SELECT v FROM t WHERE v ILIKE '?%'; > v > ------ > i123 > ?123 > (2 rows) > > test_latin5=# ROLLBACK; > ROLLBACK > > [For UNICODE] > Same steps as above with LANG="tr_TR.UTF-8" and database/client > encoding as UNICODE. > > Hope this tests help. > > > Regards. > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Hi, On Thu, 1 Dec 2005, Bruce Momjian wrote: > Where are we on this patch? Is it to be applied? The patch worked on FreeBSD 5.4-CURRENT. Maybe what Tom wants is more port reports, hmm? Volkan stated out the reasons that why we need this patch, I'll not repeat it again. Regards, -- Devrim GUNDUZ Kivi Bilişim Teknolojileri - http://www.kivi.com.tr devrim~gunduz.org, devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr http://www.gunduz.org
On 12/1/05, Bruce Momjian <pgman@candle.pha.pa.us> wrote: > Where are we on this patch? Is it to be applied? After Tom's advice (he was doubtful about the patch), while I was thinking about how to improve the spectrum of tests, decided to use src/test/mb. In the tests, patch just succeded for unicode and failed on big5, euc_cn, euc_jp, euc_kr, euc_tw, mule_internal, sjis encodings. Fails' reason can be my wrong configuration too. (I've made tests on a both unicode and latin-5 encoded databases.) AFAIC, euc* encodings break latin-n and fixed latin-n encodings break euc*. Seems like a deadlock. Furthermore, I'm not aware of the status of ICU (will it be applied with 8.2?), thus we can go a way through like distributing latin-n fixed PostgreSQL sources/binaries on postgresql.org.tr. Regards.
Volkan YAZICI wrote: > On 12/1/05, Bruce Momjian <pgman@candle.pha.pa.us> wrote: > > Where are we on this patch? Is it to be applied? > > After Tom's advice (he was doubtful about the patch), while I was > thinking about how to improve the spectrum of tests, decided to use > src/test/mb. In the tests, patch just succeded for unicode and failed > on big5, euc_cn, euc_jp, euc_kr, euc_tw, mule_internal, sjis > encodings. Fails' reason can be my wrong configuration too. (I've made > tests on a both unicode and latin-5 encoded databases.) Do those encodings even have uppercase letters? > AFAIC, euc* encodings break latin-n and fixed latin-n encodings break > euc*. Seems like a deadlock. Furthermore, I'm not aware of the status > of ICU (will it be applied with 8.2?), thus we can go a way through > like distributing latin-n fixed PostgreSQL sources/binaries on > postgresql.org.tr. People have talked about ICU but I don't know if anyone is working on it now. I think the big problem is that while your patch works for some cases, it fails for others, and there is no good way to know/test which will work and which will not. Is that accurate? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Last minute edit: src/test/mb seems a little bit old. I've tested SQL files in src/test/mb/sql with the expected results in src/test/mb/expected manually and it worked. (Output files need a little bit editing, like removing lines similar to "CREATE TABLE".) But it'll be better if any EUC users will try 'em manually too. On 12/2/05, Bruce Momjian <pgman@candle.pha.pa.us> wrote: > Volkan YAZICI wrote: > > After Tom's advice (he was doubtful about the patch), while I was > > thinking about how to improve the spectrum of tests, decided to use > > src/test/mb. In the tests, patch just succeded for unicode and failed > > on big5, euc_cn, euc_jp, euc_kr, euc_tw, mule_internal, sjis > > encodings. Fails' reason can be my wrong configuration too. (I've made > > tests on a both unicode and latin-5 encoded databases.) > > Do those encodings even have uppercase letters? According to what IRC folks, yes. > People have talked about ICU but I don't know if anyone is working on it > now. Furthermore, there're some unofficial ICU patches for PostgreSQL around. Like the one @ http://people.freebsd.org/~girgen/postgresql-icu/README.html > I think the big problem is that while your patch works for some cases, > it fails for others As I mentioned in the above, it seems like it's working for other ones too. > and there is no good way to know/test which will > work and which will not. Is that accurate? You don't want to commit this patch because it breaks[*] EUC like encodings. But OTOH, it fixes LatinN and UNICODE encodings. I'm really wondering, while we're trying to protect the EUC encodings still working, why there's not any EUC users around to take care of EUC tests? Doesn't EUC have any problems? Do ILIKE, upper/lower work for them properly? [*] If I didn't make a mistake, manual tests succeded for EUC like encodings too. You can think the reverse of the subject too. Think LatinN and UNICODE as working and somebody submitted a patch which fixes EUC encodings by breaking the previous ones. What will be the reaction of PostgreSQL team in this situation? Regards.