Re: Patch for collation using ICU - Mailing list pgsql-hackers
From: John Hansen
Subject: Re: Patch for collation using ICU
Msg-id: 5066E5A966339E42AA04BA10BA706AE50A931A@rodrick.geeknet.com.au
In response to: Patch for collation using ICU (Palle Girgensohn <girgen@pingpong.net>)
List: pgsql-hackers
Tatsuo Ishii wrote:
> Sent: Tuesday, May 10, 2005 5:45 PM
> To: John Hansen
> Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Patch for collation using ICU
>
> > Tatsuo Ishii wrote:
> > > Sent: Tuesday, May 10, 2005 12:32 AM
> > > To: John Hansen
> > > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; pgsql-hackers@postgresql.org
> > > Subject: Re: [HACKERS] Patch for collation using ICU
> > >
> > > > > -----Original Message-----
> > > > > From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> > > > > Sent: Sunday, May 08, 2005 11:08 PM
> > > > > To: John Hansen
> > > > > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; pgsql-hackers@postgresql.org
> > > > > Subject: Re: [HACKERS] Patch for collation using ICU
> > > > >
> > > > > > > I don't buy it. If the current conversion tables do the right
> > > > > > > thing, why do we need to replace them? Or if the conversion
> > > > > > > tables are not correct, why don't you fix them? I think the rules
> > > > > > > of character conversion will not change frequently, especially
> > > > > > > for LATIN languages. Thus the maintenance cost is not too high.
> > > > > >
> > > > > > I never said we need to, but if we're going to implement ICU,
> > > > > > then we might as well go all the way.
> > > > >
> > > > > So you admit there's no benefit in using ICU to replace the
> > > > > existing conversions?
> > > > >
> > > > > Besides, ICU does not support all existing conversions, and I think
> > > > > ICU has a serious flaw for conversion use. If I understand
> > > > > correctly, ICU uses UNICODE internally to do the conversion. For
> > > > > example, to implement SJIS->EUC_JP conversion, ICU first converts
> > > > > SJIS to UNICODE, then converts UNICODE to EUC_JP. The problem is
> > > > > that these conversions are not round trip (conversion between
> > > > > SJIS/EUC_JP and UNICODE will lose some information). Thus
> > > > > SJIS->EUC_JP->SJIS conversion using ICU does not preserve the
> > > > > original text.
> > > >
> > > > Just for the record, I fetched a web page encoded in sjis, converted
> > > > it to euc-jp and back using uconv from ICU 3.2, and the result is
> > > > that the transformed file is identical to the original.
> > > >
> > > > uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html
> > > > uconv -f EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc
> > > > diff index.html index.html.sjis
> > >
> > > Not all SJIS/EUC_JP characters have the problem. You might want to
> > > try: Shift_JIS 0x81e6, 0x879a, 0xfa5b.
> > >
> > > BTW, I got this with ICU 3.2:
> > >
> > > $ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt
> > > Conversion from Unicode to codepage failed at input byte position 0.
> > > Unicode: 301c Error: Invalid character found
> > >
> > > The contents of a.txt is 0xa1c1, which is a valid EUC_JP character.
> >
> > That actually makes perfect sense, since according to unicode.org's
> > database:
> >
> >   301C ~ WAVE DASH
> >   This character was encoded to match JIS C 6226-1978 1-33 "wave dash".
> >   The JIS standards and some industry practice disagree in mapping.
> >   - 3030 wavy dash
> >   - FF5E fullwidth tilde
> >
> > In PG, FF5E is the mapping currently used. That is obviously wrong
> > (according to the standards), as that is only a 'similar character'.
> >
> > Unfortunately, there is no mapping from 301C to shift_jis, as
> > shift_jis doesn't define "WAVE DASH".
> > In all, I believe this behaviour to be correct according to the
> > standards.
> >
> > There'd be nothing to stop us from defining alternative mappings for
> > the cases where we deviate from the standard, but the question is,
> > should we be non-standard?
>
> You missed the point. EUC_JP 0xa1c1 is perfectly valid data, and
> uconv -f EUC_JP -t Shift_JIS should convert it to Shift_JIS 0x8160
> regardless of the internals of uconv.

Studying ICU further, I found that it works fine, provided you use the
_correct_ charset for the conversion.

a.txt contains 0x81 0x60

uconv -f ibm-943_P130-1999 -t EUC_JP a.txt -o b.txt

b.txt now contains 0xa1 0xc1

uconv -t ibm-943_P130-1999 -f EUC_JP b.txt -o a.txt

a.txt still contains 0x81 0x60

The mapping table you want is ibm-943_P130-1999. (The same round trip is
sketched with the ICU C API after this message.)

Similarly, we'd need to find the right euc-jp (and plain jis) mapping,
assuming we want the one that strictly defines JIS X 0208, right?

I trust this puts your fears to rest...

> --
> Tatsuo Ishii

... John
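
The round trip John demonstrates with uconv can also be checked through
ICU's C API. Below is a minimal sketch, not part of the original exchange:
it assumes an ICU development install, uses ucnv_convert() from
unicode/ucnv.h (which pivots through Unicode internally, exactly the path
discussed above), and reuses the ibm-943_P130-1999 converter name and the
0x81 0x60 / 0xa1 0xc1 byte values from John's example.

#include <stdio.h>
#include <string.h>
#include <unicode/ucnv.h>

int main(void)
{
    UErrorCode status = U_ZERO_ERROR;

    /* Shift_JIS 0x81 0x60 -- the WAVE DASH byte sequence from the example */
    const char sjis[] = { (char)0x81, (char)0x60 };
    char eucjp[16];
    char back[16];

    /* Shift_JIS (ibm-943_P130-1999 table) -> EUC-JP, via the Unicode pivot */
    int32_t euclen = ucnv_convert("EUC-JP", "ibm-943_P130-1999",
                                  eucjp, (int32_t) sizeof(eucjp),
                                  sjis, (int32_t) sizeof(sjis), &status);
    if (U_FAILURE(status)) {
        printf("to EUC-JP failed: %s\n", u_errorName(status));
        return 1;
    }
    printf("EUC-JP bytes: %02x %02x\n",
           (unsigned char) eucjp[0], (unsigned char) eucjp[1]);  /* expect a1 c1 */

    /* ...and back again: EUC-JP -> Shift_JIS */
    int32_t backlen = ucnv_convert("ibm-943_P130-1999", "EUC-JP",
                                   back, (int32_t) sizeof(back),
                                   eucjp, euclen, &status);
    if (U_FAILURE(status)) {
        printf("back to Shift_JIS failed: %s\n", u_errorName(status));
        return 1;
    }

    printf("round trip %s\n",
           (backlen == (int32_t) sizeof(sjis) && memcmp(back, sjis, backlen) == 0)
               ? "preserved" : "NOT preserved");
    return 0;
}

Compiled against ICU (e.g. cc roundtrip.c $(icu-config --ldflags)), this
should print the EUC-JP bytes and report whether the original Shift_JIS
bytes survived the SJIS -> EUC_JP -> SJIS round trip.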