Re: Patch for collation using ICU - Mailing list pgsql-hackers

From John Hansen
Subject Re: Patch for collation using ICU
Msg-id 5066E5A966339E42AA04BA10BA706AE50A9318@rodrick.geeknet.com.au
In response to Patch for collation using ICU  (Palle Girgensohn <girgen@pingpong.net>)
Responses Re: Patch for collation using ICU
List pgsql-hackers
Tatsuo Ishii wrote:
> Sent: Tuesday, May 10, 2005 12:32 AM
> To: John Hansen
> Cc: pgman@candle.pha.pa.us; girgen@pingpong.net;
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
>
> > > -----Original Message-----
> > > From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> > > Sent: Sunday, May 08, 2005 11:08 PM
> > > To: John Hansen
> > > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net;
> > > pgsql-hackers@postgresql.org
> > > Subject: Re: [HACKERS] Patch for collation using ICU
> > >
> > > > > > I don't buy it. If current conversion tables does the right
> > > > > > thing, why we need to replace. Or if conversion tables are not
> > > > > > correct, why don't you fix it? I think the rule of character
> > > > > > conversion will not change frequently, especially for LATIN
> > > > > > languages. Thus maintaining cost is not too high.
> > > >
> > > > > I never said we need to, but if we're going to implement ICU,
> > > > > then we might as well go all the way.
> > >
> > > So you admit there's no benefit using ICU for replacing existing
> > > conversions?
> > >
> > > > Besides ICU does not support all existing conversions, I think ICU
> > > > has serious flaw for using conversion. If I understand correctly,
> > > > ICU uses UNICODE internally to do the conversion. For example, to
> > > > implement SJIS->EUC_JP conversion, ICU first converts SJIS to
> > > > UNICODE then converts UNICODE to EUC_JP. Problem is these
> > > > conversion is not round trip (conversion between SJIS/EUC_JP and
> > > > UNICODE will lose some information). Thus SJIS->EUC_JP->SJIS
> > > > conversion using ICU does not preserve original text.
> >
> > Just for the record, I fetched a web page encoded in sjis, and
> > converted it to euc-jp and back using uconv from ICU 3.2, and the
> > result is that the transformed file is identical to the original.
> >
> >  uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html
> >  uconv -f EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc
> >  diff index.html index.html.sjis
>
> Not all SJIS/EUC_JP characters have the problem. You might want to
> try: Shift_JIS 0x81e6, 0x879a, 0xfa5b.
>
> BTW, I got this with ICU 3.2:
>
> $ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt
> Conversion from Unicode to codepage failed at input byte position 0.
> Unicode: 301c Error: Invalid character found
>
> The content of a.txt is 0xa1c1, which is a valid EUC_JP character.
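
The round-trip concern raised in the quoted message can be sketched with Python's built-in codecs. This is an illustration only, under the assumption that Python's cp932 and euc_jp tables are an acceptable stand-in: they are neither ICU's converters nor the PostgreSQL conversion tables, and the helper name round_trip is made up for this sketch. It pushes the three byte sequences named above through SJIS -> Unicode -> EUC_JP -> Unicode -> SJIS and reports whether each comes back unchanged.

```python
# Illustration only: Python's built-in codecs, not ICU or the PostgreSQL tables.
# Byte values taken from the quoted message (Shift_JIS 0x81e6, 0x879a, 0xfa5b).
SJIS_SAMPLES = [b"\x81\xe6", b"\x87\x9a", b"\xfa\x5b"]


def round_trip(sjis_bytes):
    """Return (pivot text, final bytes) for SJIS -> Unicode -> EUC_JP -> Unicode -> SJIS."""
    text = sjis_bytes.decode("cp932")             # SJIS -> Unicode pivot
    euc = text.encode("euc_jp")                   # Unicode -> EUC_JP
    back = euc.decode("euc_jp").encode("cp932")   # EUC_JP -> Unicode -> SJIS
    return text, back


results = [round_trip(sample) for sample in SJIS_SAMPLES]
for original, (text, back) in zip(SJIS_SAMPLES, results):
    status = "preserved" if back == original else "changed"
    print(f"{original.hex()} -> U+{ord(text):04X} -> {back.hex()} ({status})")
```

Because all three inputs share one Unicode pivot character in this table, at most one of them can come back as the bytes it started with.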

The 301c error actually makes perfect sense, since according to
unicode.org's database:
    301C ~ WAVE DASH
        This character was encoded to match JIS C 6226-1978 1-33 "wave dash".
        The JIS standards and some industry practice disagree in mapping.
        - 3030 wavy dash
        - FF5E full width tilde

PG currently maps this character to FF5E. That is wrong according to
the standards, since FF5E is only a 'similar character'.

Unfortunately, there is no mapping from 301C to Shift_JIS, as Shift_JIS
doesn't define "WAVE DASH".
All in all, I believe this behaviour is correct according to the
standards.

There'd be nothing to stop us from defining alternative mappings for the
cases where we deviate from the standard, but the question is, should we
be non-standard?
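
For what it's worth, an alternative mapping of that kind could look roughly like the sketch below. It is written against Python's cp932 codec only because that is easy to run; COMPAT_MAP and encode_with_compat are hypothetical names, and nothing here reflects how ICU or the PG conversion tables are actually structured. The strict table is tried first, and the non-standard substitution is applied only when it fails.

```python
# Hypothetical compatibility table: code points remapped only on failure.
COMPAT_MAP = {0x301C: 0xFF5E}  # WAVE DASH -> FULLWIDTH TILDE (non-standard)


def encode_with_compat(text, encoding="cp932"):
    """Encode with the strict table; fall back to the compatibility mapping."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        # Substitute the 'similar character' and retry; still raises if
        # some other character has no target in the encoding.
        return text.translate(COMPAT_MAP).encode(encoding)


print(encode_with_compat("\u301c").hex())
print(encode_with_compat("\uff5e").hex())
```

The cost of being non-standard shows up immediately: both code points end up as the same bytes, so the original character cannot be recovered on the way back.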

>
> This makes me nervous in using ICU...
> --
> Tatsuo Ishii
>
>

... John

