Re: Patch: add conversion from pg_wchar to multibyte - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Patch: add conversion from pg_wchar to multibyte
Date
Msg-id CA+TgmoY-Tud3MnJTF0CFj1EhE1SYG+vQ5RNH_REX==-g=_tBRg@mail.gmail.com
In response to Re: Patch: add conversion from pg_wchar to multibyte  (Tatsuo Ishii <ishii@postgresql.org>)
Responses Re: Patch: add conversion from pg_wchar to multibyte
List pgsql-hackers
On Mon, Jul 2, 2012 at 7:33 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:
>> Yeah, I did.  I think I may be a bit confused here, so let me try to
>> understand this a bit better.  It seems like pg_mule2wchar_with_len
>> uses the following algorithm:
>>
>> - If the first character IS_LC1 (0x81-0x8d), decode two bytes, stored
>> with shifts of 16 and 0.
>> - If the first character IS_LCPRV1 (0x9a-0x9b), decode three bytes,
>> skipping the first one and storing the remaining two with shifts of 16
>> and 0.
>> - If the first character IS_LC2 (0x90-0x99), decode three bytes,
>> stored with shifts of 16, 8, and 0.
>> - If the first character IS_LCPRV2 (0x9c-0x9d), decode four bytes,
>> skipping the first one and storing the remaining three with shifts of
>> 16, 8, and 0.
>
> Correct.
>
>> In the reverse transformation implemented by pg_wchar2mule_with_len,
>> if the byte stored with shift 16 IS_LC1 or IS_LC2, then we decode 2 or
>> 3 bytes, respectively, exactly as I would expect.  ASCII decoding is
>> also as I would expect.  The case I don't understand is what happens
>> when the leading byte of the multibyte character was IS_LCPRV1 or
>> IS_LCPRV2.  In that case, we ought to decode three bytes if it was
>> IS_LCPRV1 and four bytes if it was IS_LCPRV2, but actually it seems we
>> always decode 4 bytes.  That implies that the IS_LCPRV1() case in
>> pg_mule2wchar_with_len is dead code,
>
> Yes, dead code unless we want to support the following encodings in the
> future (from include/mb/pg_wchar.h):
> #define LC_SISHENG              0xa0    /* Chinese SiSheng characters for
>                                          * PinYin/ZhuYin (not supported) */
> #define LC_IPA                  0xa1    /* IPA (International Phonetic
>                                          * Association) (not supported) */
> #define LC_VISCII_LOWER         0xa2    /* Vietnamese VISCII1.1 lower-case
>                                          * (not supported) */
> #define LC_VISCII_UPPER         0xa3    /* Vietnamese VISCII1.1 upper-case
>                                          * (not supported) */
> #define LC_ARABIC_DIGIT         0xa4    /* Arabic digit (not supported) */
> #define LC_ARABIC_1_COLUMN      0xa5    /* Arabic 1-column (not supported) */
> #define LC_ASCII_RIGHT_TO_LEFT  0xa6    /* ASCII (left half of ISO8859-1) with
>                                          * right-to-left direction (not
>                                          * supported) */
> #define LC_LAO                  0xa7    /* Lao characters (ISO10646
>                                          * 0E80..0EDF) (not supported) */
> #define LC_ARABIC_2_COLUMN      0xa8    /* Arabic 2-column (not supported) */
>
>> and that any 4 byte characters
>> are always of the form 0x9d 0xf? 0x?? 0x??; maybe that's what the
>> comment there is driving at, but it's not too clear to me.
>
> Yes, that's because we only support EUC_TW and BIG5, which use
> IS_LCPRV2 in the mule internal encoding, as stated in the comment.

OK.  So, in that case, I suggest that if the leading byte is non-zero,
we emit 0x9d followed by the three available bytes, instead of first
testing whether the first byte is >= 0xf0.  That test seems to serve
no purpose but to confuse the issue.
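
To make the suggestion concrete, here is a rough sketch of the wchar->mb
logic I have in mind for a single character. This is not the code from the
patch: the function name sketch_wchar_to_mule is hypothetical, the macros
are redefined locally from the ranges quoted above, and it assumes that
LCPRV1 characters never occur, as discussed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Leading-byte ranges as quoted in this thread (redefined here for the sketch). */
#define IS_LC1(c)  ((unsigned char) (c) >= 0x81 && (unsigned char) (c) <= 0x8d)
#define IS_LC2(c)  ((unsigned char) (c) >= 0x90 && (unsigned char) (c) <= 0x99)
#define LCPRV2_B   0x9d			/* assumed prefix byte for all 4-byte characters */

/*
 * Hypothetical helper: encode one wide character into out (at least 4 bytes)
 * and return the number of bytes written.
 */
static size_t
sketch_wchar_to_mule(uint32_t wc, unsigned char *out)
{
	unsigned char lb = (wc >> 16) & 0xff;	/* byte stored with shift 16 */

	assert(out != NULL);

	if (IS_LC1(lb))
	{
		/* two bytes: leading byte, then the byte stored with shift 0 */
		out[0] = lb;
		out[1] = wc & 0xff;
		return 2;
	}
	if (IS_LC2(lb))
	{
		/* three bytes: shifts 16, 8, 0 */
		out[0] = lb;
		out[1] = (wc >> 8) & 0xff;
		out[2] = wc & 0xff;
		return 3;
	}
	if (lb != 0)
	{
		/* anything else non-zero: emit 0x9d, then the three stored bytes */
		out[0] = LCPRV2_B;
		out[1] = lb;
		out[2] = (wc >> 8) & 0xff;
		out[3] = wc & 0xff;
		return 4;
	}
	/* leading byte is zero: plain ASCII */
	out[0] = wc & 0xff;
	return 1;
}
```

Note that there is no separate >= 0xf0 test here; the only decision in the
four-byte branch is whether the byte at shift 16 is non-zero.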

I further suggest that we improve the comments on the mule functions
for both wchar->mb and mb->wchar to make all this more clear.
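
As an illustration of the level of commenting I mean, here is a sketch of
the mb->wchar direction for one character, following the four cases listed
at the top of this message. The name sketch_mule_to_wchar and its signature
are hypothetical and the macros are redefined locally; it is meant to show
the layout, not to replace pg_mule2wchar_with_len.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Leading-byte ranges as quoted in this thread (redefined here for the sketch). */
#define IS_LC1(c)     ((unsigned char) (c) >= 0x81 && (unsigned char) (c) <= 0x8d)
#define IS_LCPRV1(c)  ((unsigned char) (c) >= 0x9a && (unsigned char) (c) <= 0x9b)
#define IS_LC2(c)     ((unsigned char) (c) >= 0x90 && (unsigned char) (c) <= 0x99)
#define IS_LCPRV2(c)  ((unsigned char) (c) >= 0x9c && (unsigned char) (c) <= 0x9d)

/*
 * Hypothetical helper: decode one character from in[0..len-1] into *wc and
 * return the number of input bytes consumed (0 if len is 0).
 */
static size_t
sketch_mule_to_wchar(const unsigned char *in, size_t len, uint32_t *wc)
{
	assert(in != NULL && wc != NULL);

	if (len == 0)
		return 0;

	if (IS_LC1(in[0]) && len >= 2)
	{
		/* LC1: 2 bytes, stored with shifts 16 and 0 */
		*wc = ((uint32_t) in[0] << 16) | in[1];
		return 2;
	}
	if (IS_LCPRV1(in[0]) && len >= 3)
	{
		/* LCPRV1: 3 bytes, prefix dropped, rest stored with shifts 16 and 0 */
		*wc = ((uint32_t) in[1] << 16) | in[2];
		return 3;
	}
	if (IS_LC2(in[0]) && len >= 3)
	{
		/* LC2: 3 bytes, stored with shifts 16, 8 and 0 */
		*wc = ((uint32_t) in[0] << 16) | ((uint32_t) in[1] << 8) | in[2];
		return 3;
	}
	if (IS_LCPRV2(in[0]) && len >= 4)
	{
		/* LCPRV2: 4 bytes, prefix dropped, rest stored with shifts 16, 8, 0 */
		*wc = ((uint32_t) in[1] << 16) | ((uint32_t) in[2] << 8) | in[3];
		return 4;
	}
	/* anything else, including a truncated sequence, is taken as one byte */
	*wc = in[0];
	return 1;
}
```

Because the prefix byte is dropped in both private cases, the wide
character alone does not say which prefix was used. That is the ambiguity
the comments should spell out.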

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

