Thread: Case Conversion Fix for MB Chars

Case Conversion Fix for MB Chars

From
Volkan YAZICI
Date:
Here's small patch to fix case conversion problems for ILIKE and
(Oracle Compat.) lower/upper functions. (Related bug report is here:
http://archives.postgresql.org/pgsql-bugs/2005-10/msg00001.php)

In tests it succeeded for Turkish characters while using LATIN5
encoding. But when encoding is UNICODE it still doesn't work. (IMHO,
for latin-N encodings there will be no compatibility problems; for
Unicode, I've no idea.)


Regards.

Attachment

Re: Case Conversion Fix for MB Chars

From
Volkan YAZICI
Date:
Last minute edit:

On 11/26/05, Volkan YAZICI <volkan.yazici@gmail.com> wrote:
- In tests it succeeded for Turkish characters while using LATIN5
- encoding. But when encoding is UNICODE it still doesn't work. (IMHO,
- for latin-N encodings there will be no compatibility problems; for
- Unicode, I've no idea.)
+ In tests it succeeded for Turkish characters while using both
+ LATIN5 and UNICODE encodings. Furthermore, it shouldn't raise
+ any compatibility problems for LATIN-N encodings, but for
+ UNICODE, I don't have an exact idea.

Regards.

Re: Case Conversion Fix for MB Chars

From
Tom Lane
Date:
Volkan YAZICI <volkan.yazici@gmail.com> writes:
> Here's small patch to fix case conversion problems for ILIKE and
> (Oracle Compat.) lower/upper functions. (Related bug report is here:
> http://archives.postgresql.org/pgsql-bugs/2005-10/msg00001.php)

Doesn't this just move the failure cases around?

The really fundamental problem is that tolower/toupper don't work
on wchar_t.

            regards, tom lane

Re: Case Conversion Fix for MB Chars

From
Volkan YAZICI
Date:
[Sorry if the post is duplicated. But I don't receive and ACK from majordamo.]

On 11/27/05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Doesn't this just move the failure cases around?

I don't think so. I've tried to fix every tolower/toupper related
problem (depending on the MB characters) I found in the code. If
there's still a problem with them, it should be related with the
pg_wchar_tb concept in utils/mb/wchar.c (AFAIC, the last remaning part
about this case conversion mess is the one related with PQfnumber -
which needs a very different approach.)

> The really fundamental problem is that tolower/toupper don't work
> on wchar_t.

Yes, indeed. I agree with you to find a proper solution for character
set handling. But, IMHO, it's better to have a-patchy working system
instead of a not working one. Put yourself in our position, we're
trying to give enterprise solutions to our customers and ILIKE,
lower/upper doesn't work in the product that we support. We can work
with a small patch instead of waiting next releases for an exact
solution to the problem.

Re: Case Conversion Fix for MB Chars

From
Volkan YAZICI
Date:
On 11/27/05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Doesn't this just move the failure cases around?

I don't think so. I've tried to fix every tolower/toupper related
problem (depending on the MB characters) I found in the code. If
there's still a problem with them, it should be related with the
pg_wchar_tb concept in utils/mb/wchar.c (AFAIC, the last remaning part
about this case conversion mess is the one related with PQfnumber -
which needs a very different approach.)

> The really fundamental problem is that tolower/toupper don't work
> on wchar_t.

Yes, indeed. I agree with you to find a proper solution for character
set handling. But, IMHO, it's better to have a-patchy working system
instead of a not working one. Put yourself in our position, we're
trying to give enterprise solutions to our customers and ILIKE,
lower/upper doesn't work in the product that we support. We can work
with a small patch instead of waiting next releases for an exact
solution to the problem.


Regards.

Re: Case Conversion Fix for MB Chars

From
Tom Lane
Date:
Volkan YAZICI <volkan.yazici@gmail.com> writes:
> On 11/27/05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The really fundamental problem is that tolower/toupper don't work
>> on wchar_t.

> Yes, indeed. I agree with you to find a proper solution for character
> set handling. But, IMHO, it's better to have a-patchy working system
> instead of a not working one.

But you just agreed that it doesn't work.

It might be that there are degrees of not-working-ness here, but before
adopting a partial solution I would like to see some reasoning why it
won't make things worse for other people.  I think that what you are
proposing could lead to arbitrarily bad behavior (up to and including
server crashes) depending on what libc's toupper/tolower functions are
coded to do with out-of-range inputs.  Exactly what cases have you
tried, and on what platforms?

            regards, tom lane

Re: Case Conversion Fix for MB Chars

From
Volkan YAZICI
Date:
On 11/27/05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Yes, indeed. I agree with you to find a proper solution for character
> > set handling. But, IMHO, it's better to have a-patchy working system
> > instead of a not working one.
>
> But you just agreed that it doesn't work.

You get me wrong. I tried to to say it works for latin5 and unicode.
When I look at the latin-n implementation of PostgreSQL, see that
there doesn't exist a far difference between 'em. So there shouldn't
be a problem with latin-n encodings. But, as I try to explain, I don't
have so much knowledge on unicode characters and cannot claim an exact
answer about it.

> It might be that there are degrees of not-working-ness here, but before
> adopting a partial solution I would like to see some reasoning why it
> won't make things worse for other people. I think that what you are
> proposing could lead to arbitrarily bad behavior (up to and including
> server crashes) depending on what libc's toupper/tolower functions are
> coded to do with out-of-range inputs.

Furthermore, I just used the methods already in the code. If it'll
cause any crashes in the server side, it won't be my fault. (Except
logical errors I made.)

> Exactly what cases have you tried, and on what platforms?

I don't have a buildfarm at home. Tests made on an i686 with a
2.6.12.5 kernel. Here's a short list of cases I tried with both latin5
and unicode charsets:
- lower/upper functions with Turkish characters.
- ILIKE matches with both lower and upper case Turkish characters.
(Above testes succeeded for non-Turkish characters too.)


Regards.

Re: Case Conversion Fix for MB Chars

From
Volkan YAZICI
Date:
On 11/27/05, Volkan YAZICI <volkan.yazici@gmail.com> wrote:
> Tests made on an i686 with a
> 2.6.12.5 kernel. Here's a short list of cases I tried with both latin5
> and unicode charsets:
> - lower/upper functions with Turkish characters.
> - ILIKE matches with both lower and upper case Turkish characters.
> (Above testes succeeded for non-Turkish characters too.)

I read the above paragraph again and realized the out of usability of
it. Here's a modified one:

Test's made on a Debian GNU/Linux (stable) 3.1 by patching
src/backend/utils/adt/like.c (r1.62) and
src/backend/utils/adt/oracle_compat.c (r1.64) files. Related software
versions:
  - gcc-3.3 [3.3.5-13]
  - libc6-dev [2.3.2.ds1-22]
  - locales [2.3.2.ds1-22]

Tried test cases using patched CVS HEAD:

[For Latin5]
$ usr/bin/initdb -D var/data
$ LANG="tr_TR.ISO-8859-9" usr/bin/postmaster -D var/data
$ usr/bin/createdb -E latin5 test_latin5
$ usr/bin/psql test_latin5
Welcome to psql 8.2devel, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

test_latin5=# SHOW client_encoding;
 client_encoding
-----------------
 LATIN5
(1 row)

test_latin5=# SELECT upper('abcdefgğhıijklmnoöprsştuüvyz qwx 0123456789');
                   upper
-------------------------------------------
 ABCDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ QWX 0123456789
(1 row)

test_latin5=# SELECT
test_latin5-# lower('ABCDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ QWX 0123456789');
                    lower
---------------------------------------------
 abcdefgğhıijklmnoöprsştuüvyz qwx 0123456789
(1 row)

test_latin5=# BEGIN;
BEGIN
test_latin5=# CREATE TEMP TABLE t (v varchar);
CREATE TABLE
test_latin5=# COPY t FROM stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> ı123
>> I123
>> i123
>> İ123
>> \.
test_latin5=# SELECT v FROM t;
  v
------
 ı123
 I123
 i123
 İ123
(4 rows)

test_latin5=# SELECT v FROM t WHERE v ILIKE 'ı%';
  v
------
 ı123
 I123
(2 rows)

test_latin5=# SELECT v FROM t WHERE v ILIKE 'I%';
  v
------
 ı123
 I123
(2 rows)

test_latin5=# SELECT v FROM t WHERE v ILIKE 'i%';
  v
------
 i123
 İ123
(2 rows)

test_latin5=# SELECT v FROM t WHERE v ILIKE 'İ%';
  v
------
 i123
 İ123
(2 rows)

test_latin5=# ROLLBACK;
ROLLBACK

[For UNICODE]
Same steps as above with LANG="tr_TR.UTF-8" and database/client
encoding as UNICODE.

Hope this tests help.


Regards.

Re: Case Conversion Fix for MB Chars

From
Bruce Momjian
Date:
Where are we on this patch?   Is it to be applied?

---------------------------------------------------------------------------

Volkan YAZICI wrote:
> On 11/27/05, Volkan YAZICI <volkan.yazici@gmail.com> wrote:
> > Tests made on an i686 with a
> > 2.6.12.5 kernel. Here's a short list of cases I tried with both latin5
> > and unicode charsets:
> > - lower/upper functions with Turkish characters.
> > - ILIKE matches with both lower and upper case Turkish characters.
> > (Above testes succeeded for non-Turkish characters too.)
>
> I read the above paragraph again and realized the out of usability of
> it. Here's a modified one:
>
> Test's made on a Debian GNU/Linux (stable) 3.1 by patching
> src/backend/utils/adt/like.c (r1.62) and
> src/backend/utils/adt/oracle_compat.c (r1.64) files. Related software
> versions:
>   - gcc-3.3 [3.3.5-13]
>   - libc6-dev [2.3.2.ds1-22]
>   - locales [2.3.2.ds1-22]
>
> Tried test cases using patched CVS HEAD:
>
> [For Latin5]
> $ usr/bin/initdb -D var/data
> $ LANG="tr_TR.ISO-8859-9" usr/bin/postmaster -D var/data
> $ usr/bin/createdb -E latin5 test_latin5
> $ usr/bin/psql test_latin5
> Welcome to psql 8.2devel, the PostgreSQL interactive terminal.
>
> Type:  \copyright for distribution terms
>        \h for help with SQL commands
>        \? for help with psql commands
>        \g or terminate with semicolon to execute query
>        \q to quit
>
> test_latin5=# SHOW client_encoding;
>  client_encoding
> -----------------
>  LATIN5
> (1 row)
>
> test_latin5=# SELECT upper('abcdefg?h?ijklmno?prs?tu?vyz qwx 0123456789');
>                    upper
> -------------------------------------------
>  ABCDEFG?HI?JKLMNO?PRS?TU?VYZ QWX 0123456789
> (1 row)
>
> test_latin5=# SELECT
> test_latin5-# lower('ABCDEFG?HI?JKLMNO?PRS?TU?VYZ QWX 0123456789');
>                     lower
> ---------------------------------------------
>  abcdefg?h?ijklmno?prs?tu?vyz qwx 0123456789
> (1 row)
>
> test_latin5=# BEGIN;
> BEGIN
> test_latin5=# CREATE TEMP TABLE t (v varchar);
> CREATE TABLE
> test_latin5=# COPY t FROM stdin;
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself.
> >> ?123
> >> I123
> >> i123
> >> ?123
> >> \.
> test_latin5=# SELECT v FROM t;
>   v
> ------
>  ?123
>  I123
>  i123
>  ?123
> (4 rows)
>
> test_latin5=# SELECT v FROM t WHERE v ILIKE '?%';
>   v
> ------
>  ?123
>  I123
> (2 rows)
>
> test_latin5=# SELECT v FROM t WHERE v ILIKE 'I%';
>   v
> ------
>  ?123
>  I123
> (2 rows)
>
> test_latin5=# SELECT v FROM t WHERE v ILIKE 'i%';
>   v
> ------
>  i123
>  ?123
> (2 rows)
>
> test_latin5=# SELECT v FROM t WHERE v ILIKE '?%';
>   v
> ------
>  i123
>  ?123
> (2 rows)
>
> test_latin5=# ROLLBACK;
> ROLLBACK
>
> [For UNICODE]
> Same steps as above with LANG="tr_TR.UTF-8" and database/client
> encoding as UNICODE.
>
> Hope this tests help.
>
>
> Regards.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Case Conversion Fix for MB Chars

From
Devrim GUNDUZ
Date:
Hi,

On Thu, 1 Dec 2005, Bruce Momjian wrote:

> Where are we on this patch?   Is it to be applied?

The patch worked on FreeBSD 5.4-CURRENT. Maybe what Tom wants is more port
reports, hmm?

Volkan stated out the reasons that why we need this patch, I'll not repeat
it again.

Regards,
--
Devrim GUNDUZ
Kivi Bilişim Teknolojileri         -          http://www.kivi.com.tr
devrim~gunduz.org, devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
                       http://www.gunduz.org

Re: Case Conversion Fix for MB Chars

From
Volkan YAZICI
Date:
On 12/1/05, Bruce Momjian <pgman@candle.pha.pa.us> wrote:
> Where are we on this patch? Is it to be applied?

After Tom's advice (he was doubtful about the patch), while I was
thinking about how to improve the spectrum of tests, decided to use
src/test/mb. In the tests, patch just succeded for unicode and failed
on big5, euc_cn, euc_jp, euc_kr, euc_tw, mule_internal, sjis
encodings. Fails' reason can be my wrong configuration too. (I've made
tests on a both unicode and latin-5 encoded databases.)

AFAIC, euc* encodings break latin-n and fixed latin-n encodings break
euc*. Seems like a deadlock. Furthermore, I'm not aware of the status
of ICU (will it be applied with 8.2?), thus we can go a way through
like distributing latin-n fixed PostgreSQL sources/binaries on
postgresql.org.tr.


Regards.

Re: Case Conversion Fix for MB Chars

From
Bruce Momjian
Date:
Volkan YAZICI wrote:
> On 12/1/05, Bruce Momjian <pgman@candle.pha.pa.us> wrote:
> > Where are we on this patch? Is it to be applied?
>
> After Tom's advice (he was doubtful about the patch), while I was
> thinking about how to improve the spectrum of tests, decided to use
> src/test/mb. In the tests, patch just succeded for unicode and failed
> on big5, euc_cn, euc_jp, euc_kr, euc_tw, mule_internal, sjis
> encodings. Fails' reason can be my wrong configuration too. (I've made
> tests on a both unicode and latin-5 encoded databases.)

Do those encodings even have uppercase letters?

> AFAIC, euc* encodings break latin-n and fixed latin-n encodings break
> euc*. Seems like a deadlock. Furthermore, I'm not aware of the status
> of ICU (will it be applied with 8.2?), thus we can go a way through
> like distributing latin-n fixed PostgreSQL sources/binaries on
> postgresql.org.tr.

People have talked about ICU but I don't know if anyone is working on it
now.

I think the big problem is that while your patch works for some cases,
it fails for others, and there is no good way to know/test which will
work and which will not.  Is that accurate?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Case Conversion Fix for MB Chars

From
Volkan YAZICI
Date:
Last minute edit:
src/test/mb seems a little bit old. I've tested SQL files in
src/test/mb/sql with the expected results in src/test/mb/expected
manually and it worked. (Output files need a little bit editing, like
removing lines similar to "CREATE TABLE".) But it'll be better if any
EUC users will try 'em manually too.

On 12/2/05, Bruce Momjian <pgman@candle.pha.pa.us> wrote:
> Volkan YAZICI wrote:
> > After Tom's advice (he was doubtful about the patch), while I was
> > thinking about how to improve the spectrum of tests, decided to use
> > src/test/mb. In the tests, patch just succeded for unicode and failed
> > on big5, euc_cn, euc_jp, euc_kr, euc_tw, mule_internal, sjis
> > encodings. Fails' reason can be my wrong configuration too. (I've made
> > tests on a both unicode and latin-5 encoded databases.)
>
> Do those encodings even have uppercase letters?

According to what IRC folks, yes.

> People have talked about ICU but I don't know if anyone is working on it
> now.

Furthermore, there're some unofficial ICU patches for PostgreSQL
around. Like the one @
http://people.freebsd.org/~girgen/postgresql-icu/README.html

> I think the big problem is that while your patch works for some cases,
> it fails for others

As I mentioned in the above, it seems like it's working for other ones too.

> and there is no good way to know/test which will
> work and which will not. Is that accurate?

You don't want to commit this patch because it breaks[*] EUC like
encodings. But OTOH, it fixes LatinN and UNICODE encodings. I'm really
wondering, while we're trying to protect the EUC encodings still
working, why there's not any EUC users around to take care of EUC
tests? Doesn't EUC have any problems? Do ILIKE, upper/lower work for
them properly?

[*] If I didn't make a mistake, manual tests succeded for EUC like
encodings too.

You can think the reverse of the subject too. Think LatinN and UNICODE
as working and somebody submitted a patch which fixes EUC encodings by
breaking the previous ones. What will be the reaction of PostgreSQL
team in this situation?


Regards.