bugs with certain Asian multibyte charsets - Mailing list pgsql-hackers

From Tatsuo Ishii
Subject bugs with certain Asian multibyte charsets
Date
Msg-id 20051224.182533.121216450.t-ishii@sraoss.co.jp
List pgsql-hackers
I have found a long-standing bug in the handling of certain Asian
multibyte charsets (the original report was made by Mr. Ishida).

Some text operations on certain Asian charsets, such as EUC_JP code
set 3 (JIS X 0212), produce wrong results. As far as I know, these
include:

- strpos
- regular expression

It seems LIKE is not affected by this bug.

The bug has been there since 6.4. The reason we did not notice the bug
is that the affected charsets are rarely used. Other charsets affected
by the bug are EUC_CN code sets 2 and 3 (it seems they are not used at
all) and EUC_TW code sets 2 and 3 (it seems code set 3 is not used). As
far as I know, EUC_KR is not affected (I believe code sets 2 and 3 are
not used in EUC_KR).

Here is a description of the bug.

- strpos

In EUC_JP database:

SELECT strpos(hextostr('8faaa18faae1'), hextostr('8faae1'));

returns 1 instead of 2, where hextostr() is a hexadecimal-to-string
conversion function developed by Mr. Ishida. Each three-byte sequence
starting with 8f is a JIS X 0212 letter encoded in EUC_JP (for
example, 8faaa18faae1 consists of 2 EUC_JP letters).
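
To make the byte layout concrete, here is a minimal sketch in C of how
the lead byte of an EUC_JP character determines its length. The function
names euc_jp_char_len() and euc_jp_count_chars() are made up for this
illustration and are not the backend's own routines; the sketch also
assumes well-formed input.

```c
#include <stddef.h>

#define SS2 0x8e  /* single shift 2: code set 2 (half-width katakana) */
#define SS3 0x8f  /* single shift 3: code set 3 (JIS X 0212) */

/* Hypothetical helper: byte length of the EUC_JP character starting at lead. */
static int euc_jp_char_len(unsigned char lead)
{
    if (lead == SS3)
        return 3;           /* SS3 + two bytes */
    if (lead == SS2)
        return 2;           /* SS2 + one byte */
    if (lead & 0x80)
        return 2;           /* code set 1 (JIS X 0208): two bytes */
    return 1;               /* code set 0: ASCII */
}

/* Hypothetical helper: count characters in a well-formed EUC_JP buffer. */
static size_t euc_jp_count_chars(const unsigned char *s, size_t nbytes)
{
    size_t i = 0, n = 0;

    while (i < nbytes)
    {
        i += (size_t) euc_jp_char_len(s[i]);
        n++;
    }
    return n;
}
```

Applied to the six bytes 8f aa a1 8f aa e1, euc_jp_count_chars() counts
2 characters, so the second one sits at character position 2.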

- regexp

SELECT hextostr('8faaa18faaa1') ~ hextostr('8faae1');

returns true instead of false (the string contains only the letter
8faaa1, so the pattern 8faae1 should not match).

Details of the bug:

In backend/utils/mb/wchar.c there are functions to convert multibyte
to wchar. When the conversion was performed, the second or third byte
was masked by 0x3f, which makes, for example, 8faaa1 and 8faae1 look
the same.
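
The collision can be reproduced with a small C sketch. This is not the
code from wchar.c; pack_cs3_masked() and pack_cs3_full() are
hypothetical names, and masking both trailing bytes is an assumption
made for illustration (masking only the last byte gives the same
collision for this pair).

```c
#include <stdint.h>

#define SS3 0x8f  /* single shift 3: introduces a code set 3 character */

/* Hypothetical lossy packing: trailing bytes are masked by 0x3f,
 * so bits 6 and 7 of each byte are thrown away. */
static uint32_t pack_cs3_masked(const unsigned char *p)
{
    return ((uint32_t) (p[1] & 0x3f) << 8) | (uint32_t) (p[2] & 0x3f);
}

/* Hypothetical lossless packing: every bit of all three bytes is kept. */
static uint32_t pack_cs3_full(const unsigned char *p)
{
    return ((uint32_t) SS3 << 16) | ((uint32_t) p[1] << 8) | (uint32_t) p[2];
}
```

Since 0xa1 & 0x3f and 0xe1 & 0x3f are both 0x21, the masked packing maps
8faaa1 and 8faae1 to the same value 0x2a21, while the full packing keeps
them apart as 0x8faaa1 and 0x8faae1. Any comparison done on the masked
values therefore treats the two letters as equal, which is consistent
with the strpos and regexp results above.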

I'm going to commit fixes for 7.3-stable, 7.4-stable, 8.0-stable,
8.1-stable and current.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

