Re: Patch: add conversion from pg_wchar to multibyte - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Patch: add conversion from pg_wchar to multibyte |
Date | |
Msg-id | CA+TgmoY-Tud3MnJTF0CFj1EhE1SYG+vQ5RNH_REX==-g=_tBRg@mail.gmail.com |
In response to | Re: Patch: add conversion from pg_wchar to multibyte (Tatsuo Ishii <ishii@postgresql.org>) |
Responses | Re: Patch: add conversion from pg_wchar to multibyte |
List | pgsql-hackers |
On Mon, Jul 2, 2012 at 7:33 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:
>> Yeah, I did. I think I may be a bit confused here, so let me try to
>> understand this a bit better. It seems like pg_mule2wchar_with_len
>> uses the following algorithm:
>>
>> - If the first character IS_LC1 (0x81-0x8d), decode two bytes, stored
>> with shifts of 16 and 0.
>> - If the first character IS_LCPRV1 (0x9a-0x9b), decode three bytes,
>> skipping the first one and storing the remaining two with shifts of 16
>> and 0.
>> - If the first character IS_LC2 (0x90-0x99), decode three bytes,
>> stored with shifts of 16, 8, and 0.
>> - If the first character IS_LCPRV2 (0x9c-0x9d), decode four bytes,
>> skipping the first one and storing the remaining three with offsets of
>> 16, 8, and 0.
>
> Correct.
>
>> In the reverse transformation implemented by pg_wchar2mule_with_len,
>> if the byte stored with shift 16 IS_LC1 or IS_LC2, then we decode 2 or
>> 3 bytes, respectively, exactly as I would expect. ASCII decoding is
>> also as I would expect. The case I don't understand is what happens
>> when the leading byte of the multibyte character was IS_LCPRV1 or
>> IS_LCPRV2. In that case, we ought to decode three bytes if it was
>> IS_LCPRV1 and four bytes if it was IS_LCPRV2, but actually it seems we
>> always decode 4 bytes. That implies that the IS_LCPRV1() case in
>> pg_mule2wchar_with_len is dead code,
>
> Yes, dead code unless we want to support the following encodings in the
> future (from include/mb/pg_wchar.h):
> #define LC_SISHENG 0xa0 /* Chinese SiSheng characters for
> * PinYin/ZhuYin (not supported) */
> #define LC_IPA 0xa1 /* IPA (International Phonetic Association)
> * (not supported) */
> #define LC_VISCII_LOWER 0xa2 /* Vietnamese VISCII1.1 lower-case (not
> * supported) */
> #define LC_VISCII_UPPER 0xa3 /* Vietnamese VISCII1.1 upper-case (not
> * supported) */
> #define LC_ARABIC_DIGIT 0xa4 /* Arabic digit (not supported) */
> #define LC_ARABIC_1_COLUMN 0xa5 /* Arabic 1-column (not supported) */
> #define LC_ASCII_RIGHT_TO_LEFT 0xa6 /* ASCII (left half of ISO8859-1) with
> * right-to-left direction (not
> * supported) */
> #define LC_LAO 0xa7 /* Lao characters (ISO10646 0E80..0EDF) (not
> * supported) */
> #define LC_ARABIC_2_COLUMN 0xa8 /* Arabic 1-column (not supported) */
>
>> and that any 4 byte characters
>> are always of the form 0x9d 0xf? 0x?? 0x??; maybe that's what the
>> comment there is driving at, but it's not too clear to me.
>
> Yes, that's because we only support EUC_TW and BIG5 which are using
> IS_LCPRV2 in the mule internal encoding, as stated in the comment.

OK. So, in that case, I suggest that if the leading byte is non-zero,
we emit 0x9d followed by the three available bytes, instead of first
testing whether the first byte is >= 0xf0. That test seems to serve
no purpose but to confuse the issue.

I further suggest that we improve the comments on the mule functions
for both wchar->mb and mb->wchar to make all this more clear.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
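
For readers following the thread, the decode rules quoted above and the encode rule suggested in this message can be sketched roughly as follows. This is an illustrative sketch only, not the PostgreSQL source: the function and helper names are hypothetical, the lead-byte ranges are taken from the message as quoted, and length checks on the input are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Lead-byte ranges as quoted in the message (helper names are hypothetical). */
static int is_lc1(unsigned c)    { return c >= 0x81 && c <= 0x8d; }
static int is_lc2(unsigned c)    { return c >= 0x90 && c <= 0x99; }
static int is_lcprv1(unsigned c) { return c == 0x9a || c == 0x9b; }
static int is_lcprv2(unsigned c) { return c == 0x9c || c == 0x9d; }

/*
 * Decode one character from s into *wc and return the number of bytes
 * consumed. The caller must supply enough bytes; no bounds checking here.
 */
size_t mule_to_wchar_sketch(const unsigned char *s, uint32_t *wc)
{
    assert(s != NULL && wc != NULL);
    if (is_lc1(s[0])) {            /* two bytes, shifts 16 and 0 */
        *wc = ((uint32_t) s[0] << 16) | s[1];
        return 2;
    }
    if (is_lcprv1(s[0])) {         /* three bytes, first one skipped */
        *wc = ((uint32_t) s[1] << 16) | s[2];
        return 3;
    }
    if (is_lc2(s[0])) {            /* three bytes, shifts 16, 8, 0 */
        *wc = ((uint32_t) s[0] << 16) | ((uint32_t) s[1] << 8) | s[2];
        return 3;
    }
    if (is_lcprv2(s[0])) {         /* four bytes, first one skipped */
        *wc = ((uint32_t) s[1] << 16) | ((uint32_t) s[2] << 8) | s[3];
        return 4;
    }
    *wc = s[0];                    /* everything else: single byte */
    return 1;
}

/*
 * Encode wc into out (at least 4 bytes) and return the number of bytes
 * written. The last non-ASCII branch follows the suggestion in this
 * message: any other non-zero lead byte is emitted after a 0x9d prefix,
 * with no ">= 0xf0" test.
 */
size_t wchar_to_mule_sketch(uint32_t wc, unsigned char *out)
{
    unsigned lead = (wc >> 16) & 0xff;

    assert(out != NULL);
    if (is_lc1(lead)) {
        out[0] = (unsigned char) lead;
        out[1] = (unsigned char) (wc & 0xff);
        return 2;
    }
    if (is_lc2(lead)) {
        out[0] = (unsigned char) lead;
        out[1] = (unsigned char) ((wc >> 8) & 0xff);
        out[2] = (unsigned char) (wc & 0xff);
        return 3;
    }
    if (lead != 0) {
        out[0] = 0x9d;
        out[1] = (unsigned char) lead;
        out[2] = (unsigned char) ((wc >> 8) & 0xff);
        out[3] = (unsigned char) (wc & 0xff);
        return 4;
    }
    out[0] = (unsigned char) (wc & 0xff);
    return 1;
}
```

In this sketch, an IS_LC2 sequence such as 0x91 0xa1 0xa2 round-trips unchanged, and so does an IS_LCPRV2 sequence such as 0x9d 0xf5 0xa1 0xa2. A sequence with an IS_LCPRV1 lead byte, e.g. 0x9a 0xa0 0xc1, comes back as four bytes starting with 0x9d, which is the asymmetry discussed in the thread.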