Re: Patch: add conversion from pg_wchar to multibyte - Mailing list pgsql-hackers

From Tatsuo Ishii
Subject Re: Patch: add conversion from pg_wchar to multibyte
Date
Msg-id 20120703.151747.1330940307954703732.t-ishii@sraoss.co.jp
In response to Re: Patch: add conversion from pg_wchar to multibyte  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Patch: add conversion from pg_wchar to multibyte
Re: Patch: add conversion from pg_wchar to multibyte
List pgsql-hackers
> OK.  So, in that case, I suggest that if the leading byte is non-zero,
> we emit 0x9d followed by the three available bytes, instead of first
> testing whether the first byte is >= 0xf0.  That test seems to serve
> no purpose but to confuse the issue.

Probably the code should look like this (see the comment below):
    else if (lb >= 0xf0 && lb <= 0xfe)
    {
        if (lb <= 0xf4)
            *to++ = 0x9c;
        else
            *to++ = 0x9d;
        *to++ = lb;
        *to++ = (*from >> 8) & 0xff;
        *to++ = *from & 0xff;
        cnt += 4;
    }
 

> I further suggest that we improve the comments on the mule functions
> for both wchar->mb and mb->wchar to make all this more clear.

I have added comments about the mule internal encoding, based on my
refreshed memory and on an old document found on the web
(http://mibai.tec.u-ryukyu.ac.jp/cgi-bin/info2www?%28mule%29Buffer%20and%20string).

Please take a look.  BTW, it seems the conversion between multibyte and
wchar can round-trip in the case where the leading byte is LCPRV2:

If the second byte of the wchar (of its 4 bytes; the first byte is
always 0x00) is in the range 0xf0 to 0xf4, then the first byte of the
multibyte form must be 0x9c.  If the second byte of the wchar is in the
range 0xf5 to 0xfe, then the first byte of the multibyte form must be 0x9d.
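
To illustrate the round trip described above, here is a minimal
standalone sketch.  It is not the patch code: the type wchar_sketch and
the functions lcprv2_wchar_to_mb() and lcprv2_mb_to_wchar() are
hypothetical names made up for this example, and it only assumes the
layout stated above (first wchar byte 0x00, second byte is the charset
ID, then the two character bytes).

```c
#include <stdint.h>

/* Hypothetical stand-in for pg_wchar (an unsigned 32-bit value). */
typedef uint32_t wchar_sketch;

/*
 * Encode one LCPRV2-class wide character into 4 mule bytes.
 * Returns 4 on success, or 0 if the value is not in the LCPRV2 class.
 */
static int
lcprv2_wchar_to_mb(wchar_sketch wc, unsigned char *to)
{
    unsigned char lb = (wc >> 16) & 0xff;   /* second byte of the wchar */

    if ((wc >> 24) != 0 || lb < 0xf0 || lb > 0xfe)
        return 0;

    /* The prefix byte is fully determined by the charset ID. */
    to[0] = (lb <= 0xf4) ? 0x9c : 0x9d;
    to[1] = lb;
    to[2] = (wc >> 8) & 0xff;
    to[3] = wc & 0xff;
    return 4;
}

/*
 * Decode 4 mule bytes back into a wide character.  The prefix byte
 * from[0] is dropped, since it can be recomputed from from[1].
 */
static wchar_sketch
lcprv2_mb_to_wchar(const unsigned char *from)
{
    return ((wchar_sketch) from[1] << 16) |
           ((wchar_sketch) from[2] << 8) |
           (wchar_sketch) from[3];
}
```

Because the prefix byte depends only on the charset ID, nothing is lost
when it is dropped, so encoding and then decoding gives back the
original value.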
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index d456309..1148eb5 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -37,6 +37,31 @@ typedef unsigned int pg_wchar;
 #define ISSJISTAIL(c) (((c) >= 0x40 && (c) <= 0x7e) || ((c) >= 0x80 && (c) <= 0xfc))
 
 /*
+ * Currently PostgreSQL supports 5 types of mule internal encodings:
+ *
+ * 1) 1-byte ASCII characters.  Each byte is below 0x80.
+ *
+ * 2) "Official" single byte charsets such as ISO 8859 latin1.  Each
+ *    mule character consists of 2 bytes: LC1 + C1, where LC1
+ *    identifies the charset and is in the range 0x81 to 0x8d, and C1
+ *    is in the range 0xa0 to 0xff (the ISO 8859-1 code, for example;
+ *    the high bit is always on).
+ *
+ * 3) "Private" single byte charsets such as SISHENG.  Each mule
+ *    character consists of 3 bytes: LCPRV1 + LC12 + C1, where LCPRV1
+ *    is either 0x9a (if LC12 is in the range 0xa0 to 0xdf) or 0x9b (if
+ *    LC12 is in the range 0xe0 to 0xef).
+ *
+ * 4) "Official" multibyte charsets such as JIS X0208.  Each mule
+ *    character consists of 3 bytes: LC2 + C1 + C2, where LC2
+ *    identifies the charset and is in the range 0x90 to 0x99.  C1
+ *    and C2 are in the range 0xa0 to 0xff (each high bit is on).
+ *
+ * 5) "Private" multibyte charsets such as CNS 11643-1992 Plane 3.
+ *    Each mule character consists of 4 bytes: LCPRV2 + LC22 + C1 +
+ *    C2, where LCPRV2 is either 0x9c (if LC22 is in the range 0xf0 to
+ *    0xf4) or 0x9d (if LC22 is in the range 0xf5 to 0xfe).
+ *
  * Leading byte types or leading prefix byte for MULE internal code.
  * See http://www.xemacs.org for more details.  (there is a doc titled
  * "XEmacs Internals Manual", "MULE Character Sets and Encodings"
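
The five classes listed in the comment can also be read as a mapping
from the first byte of a mule character to its total length.  The
sketch below is only an illustration of that reading; mule_char_len()
is a hypothetical name, not a function from the patch or from
PostgreSQL, and the byte ranges are taken directly from the comment.

```c
/*
 * Return the total length in bytes of a mule character, given its
 * first byte, or -1 if the byte does not start any of the five classes.
 */
static int
mule_char_len(unsigned char first)
{
    if (first < 0x80)
        return 1;               /* 1) ASCII */
    if (first >= 0x81 && first <= 0x8d)
        return 2;               /* 2) LC1 + C1 */
    if (first == 0x9a || first == 0x9b)
        return 3;               /* 3) LCPRV1 + LC12 + C1 */
    if (first >= 0x90 && first <= 0x99)
        return 3;               /* 4) LC2 + C1 + C2 */
    if (first == 0x9c || first == 0x9d)
        return 4;               /* 5) LCPRV2 + LC22 + C1 + C2 */
    return -1;                  /* not a valid leading byte */
}
```

For example, a first byte of 0x92 falls in class 4 and gives a length of
3, while 0x9d falls in class 5 and gives a length of 4.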
