Re: Patch for collation using ICU - Mailing list pgsql-hackers

From John Hansen
Subject Re: Patch for collation using ICU
Date
Msg-id 5066E5A966339E42AA04BA10BA706AE50A931A@rodrick.geeknet.com.au
In response to Patch for collation using ICU  (Palle Girgensohn <girgen@pingpong.net>)
List pgsql-hackers
Tatsuo Ishii wrote:
> Sent: Tuesday, May 10, 2005 5:45 PM
> To: John Hansen
> Cc: pgman@candle.pha.pa.us; girgen@pingpong.net;
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
>
> > Tatsuo Ishii wrote:
> > > Sent: Tuesday, May 10, 2005 12:32 AM
> > > To: John Hansen
> > > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net;
> > > pgsql-hackers@postgresql.org
> > > Subject: Re: [HACKERS] Patch for collation using ICU
> > >
> > > > > -----Original Message-----
> > > > > From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> > > > > Sent: Sunday, May 08, 2005 11:08 PM
> > > > > To: John Hansen
> > > > > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net;
> > > > > pgsql-hackers@postgresql.org
> > > > > Subject: Re: [HACKERS] Patch for collation using ICU
> > > > >
> > > > > > > > I don't buy it. If current conversion tables does the right thing,
> > > > > > > > why we need to replace. Or if conversion tables are not correct, why
> > > > > > > > don't you fix it? I think the rule of character conversion will not
> > > > > > > > change frequently, especially for LATIN languages. Thus maintaining
> > > > > > > > cost is not too high.
> > > > > >
> > > > > > > I never said we need to, but if we're going to implement ICU, then we
> > > > > > > might as well go all the way.
> > > > >
> > > > > > So you admit there's no benefit using ICU for replacing existing
> > > > > > conversions?
> > > > > >
> > > > > > Besides ICU does not support all existing conversions, I think ICU
> > > > > > has serious flaw for using conversion. If I understand
> > > > > > correctly, ICU uses UNICODE internally to do the conversion. For
> > > > > > example, to implement SJIS->EUC_JP conversion, ICU first converts
> > > > > > SJIS to UNICODE then converts UNICODE to EUC_JP. Problem is these
> > > > > > conversion is not round trip (conversion between SJIS/EUC_JP and
> > > > > > UNICODE will lose some information). Thus SJIS->EUC_JP->SJIS
> > > > > > conversion using ICU does not preserve original text.
> > > >
> > > > > Just for the record, I fetched a web page encoded in sjis, and
> > > > > converted it to euc-jp and back using uconv from ICU 3.2, and the
> > > > > result is the original is identical to the transformed file.
> > > > >
> > > > >  uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html
> > > > >  uconv -f EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc
> > > > >  diff index.html index.html.sjis
> > >
> > > Not all SJIS/EUC_JP characters have the problem. You might want to
> > > try: Shift_JIS 0x81e6, 0x879a, 0xfa5b.
> > >
> > > BTW, I got this with ICU 3.2:
> > >
> > > $ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt
> > > Conversion from Unicode to codepage failed at input byte position 0.
> > > Unicode: 301c Error: Invalid character found
> > >
> > > The contents of a.txt is 0xa1c1 which is a valid EUC_JP character.
> >
> > That actually makes perfect sense, since according to unicode.org's
> > database:
> > 301C ~ WAVE DASH
> >        This character was encoded to match JIS C 6226-1978 1-33 "wave dash".
> >        The JIS standards and some industry practise disagree in mapping.
> >      - 3030 wavy dash
> >      - FF5E full width tilde
> >
> > In PG FF5E is the mapping currently used. That is obviously wrong
> > (according to the standards), as that is only a 'similar character'.
> >
> > Unfortunately, there is no mapping from 301C to shift_jis, as
> > shift_jis doesn't define "WAVE DASH".
> > In all, I believe this behaviour to be correct according to the
> > standards.
> >
> > There'd be nothing to stop us from defining alternative mappings for
> > the cases where we deviate from the standard, but the question is,
> > should we be non-standard?
>
> You missed the point. EUC_JP 0xa1c1 is a perfect valid data
> and uconv -f EUC_JP -t Shift_JIS should convert it to
> Shift_JIS 0x8160 regardless of the internal of uconv.

Studying ICU further, I found that it works fine, provided you use the
_correct_ charset for the conversion.

a.txt contains 0x81 0x60
uconv -f ibm-943_P130-1999 -t EUC_JP a.txt -o b.txt
b.txt now contains 0xa1 0xc1
uconv -t ibm-943_P130-1999 -f EUC_JP b.txt -o a.txt
a.txt still contains 0x81 0x60
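
The same kind of round-trip check can be sketched outside ICU. The snippet
below is only an illustration and rests on an assumption: it uses Python's
built-in "shift_jis" and "euc_jp" codecs as stand-ins, and their mapping
tables are not the ICU tables discussed here (there is no
ibm-943_P130-1999 in Python). The helper names convert and round_trips are
made up for this example.

```python
def convert(data: bytes, src: str, dst: str) -> bytes:
    """Convert bytes from one charset to another, pivoting through Unicode."""
    return data.decode(src).encode(dst)


def round_trips(data: bytes, a: str, b: str) -> bool:
    """Return True if converting a -> b -> a gives back the original bytes."""
    try:
        return convert(convert(data, a, b), b, a) == data
    except UnicodeError:
        # One of the two tables has no mapping for the pivot code point.
        return False


# Assumption: Python codecs stand in for the ICU converters named above.
original = bytes([0x81, 0x60])
as_euc = convert(original, "shift_jis", "euc_jp")

print(as_euc.hex(" "))
print(round_trips(original, "shift_jis", "euc_jp"))
```

The point is the one made above: the conversion is lossless only when both
tables agree on the Unicode code point used as the pivot.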

The mapping table you want is ibm-943_P130-1999.
Similarly, we'd need to find the right euc-jp (and plain jis) mapping,
assuming we want the one that strictly defines JIS X 0208, right?
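
One way to hunt for a matching pair of tables is to print the Unicode code
point each candidate assigns to the disputed character and compare. The
sketch below is hedged the same way: it inspects Python's built-in codecs
("shift_jis", "cp932", "euc_jp"), which are an assumption standing in for
the ICU mapping tables, and the helper name code_points is hypothetical.

```python
def code_points(data: bytes, charset: str) -> str:
    """Decode data with the given charset and list the resulting code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in data.decode(charset))


# The same character as encoded in each candidate charset
# (0x81 0x60 in the Shift_JIS family, 0xa1 0xc1 in EUC_JP).
candidates = {
    "shift_jis": b"\x81\x60",
    "cp932": b"\x81\x60",
    "euc_jp": b"\xa1\xc1",
}

table = {name: code_points(raw, name) for name, raw in candidates.items()}
for name, mapped in table.items():
    print(f"{name:10s} {mapped}")
```

Two charsets can be paired safely for this character only if they report
the same code point; tables that pick 301C on one side and FF5E on the
other are the ones that break the round trip.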

I trust this to put your fears to rest...

> --
> Tatsuo Ishii
>
>

... John

