Re: Patch: add conversion from pg_wchar to multibyte - Mailing list pgsql-hackers
From | Tatsuo Ishii |
---|---|
Subject | Re: Patch: add conversion from pg_wchar to multibyte |
Date | |
Msg-id | 20120703.151747.1330940307954703732.t-ishii@sraoss.co.jp |
In response to | Re: Patch: add conversion from pg_wchar to multibyte (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Patch: add conversion from pg_wchar to multibyte
Re: Patch: add conversion from pg_wchar to multibyte |
List | pgsql-hackers |
> OK. So, in that case, I suggest that if the leading byte is non-zero,
> we emit 0x9d followed by the three available bytes, instead of first
> testing whether the first byte is >= 0xf0. That test seems to serve
> no purpose but to confuse the issue.

Probably the code should look like this (see the comment below):

        else if (lb >= 0xf0 && lb <= 0xfe)
        {
            if (lb <= 0xf4)
                *to++ = 0x9c;
            else
                *to++ = 0x9d;
            *to++ = lb;
            *to++ = (*from >> 8) & 0xff;
            *to++ = *from & 0xff;
            cnt += 4;
        }

> I further suggest that we improve the comments on the mule functions
> for both wchar->mb and mb->wchar to make all this more clear.

I have added comments about the mule internal encoding, based on
refreshing my memory and on an old document found on the web
(http://mibai.tec.u-ryukyu.ac.jp/cgi-bin/info2www?%28mule%29Buffer%20and%20string).
Please take a look.

BTW, it seems that conversion between multibyte and wchar can
round-trip in the case where the leading byte is LCPRV2:

If the second byte of the wchar (out of its 4 bytes; the first byte is
always 0x00) is in the range 0xf0 to 0xf4, then the first byte of the
multibyte sequence must be 0x9c.

If the second byte of the wchar is in the range 0xf5 to 0xfe, then the
first byte of the multibyte sequence must be 0x9d.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index d456309..1148eb5 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -37,6 +37,31 @@ typedef unsigned int pg_wchar;
 #define ISSJISTAIL(c) (((c) >= 0x40 && (c) <= 0x7e) || ((c) >= 0x80 && (c)<= 0xfc))
 
 /*
+ * Currently PostgreSQL supports 5 types of mule internal encodings:
+ *
+ * 1) 1-byte ASCII characters, each byte is below 0x80.
+ *
+ * 2) "Official" single byte charsets such as ISO 8859 latin1. Each
+ * mule character consists of 2 bytes: LC1 + C1, where LC1
+ * corresponds to each charset and is in range of 0x81 to 0x8d and C1
+ * is in range of 0xa0 to 0xff (ISO 8859-1 for example, plus each
+ * high bit is on).
+ *
+ * 3) "Private" single byte charsets such as SISHENG. Each mule
+ * character consists of 3 bytes: LCPRV1 + LC12 + C1 where LCPRV1
+ * is either 0x9a (if LC12 is in range of 0xa0 to 0xdf) or 0x9b (if
+ * LC12 is in range of 0xe0 to 0xef).
+ *
+ * 4) "Official" multibyte charsets such as JIS X0208. Each mule
+ * character consists of 3 bytes: LC2 + C1 + C2 where LC2
+ * corresponds to each charset and is in range of 0x90 to 0x99. C1
+ * and C2 are in range of 0xa0 to 0xff (each high bit is on).
+ *
+ * 5) "Private" multibyte charsets such as CNS 11643-1992 Plane 3.
+ * Each mule character consists of 4 bytes: LCPRV2 + LC22 + C1 +
+ * C2, where LCPRV2 is either 0x9c (if LC22 is in range of 0xf0 to
+ * 0xf4) or 0x9d (if LC22 is in range of 0xf5 to 0xfe).
+ *
  * Leading byte types or leading prefix byte for MULE internal code.
  * See http://www.xemacs.org for more details. (there is a doc titled
  * "XEmacs Internals Manual", "MULE Character Sets and Encodings"
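
To make the LCPRV2 round-trip argument concrete, here is a minimal standalone sketch. It is not the PostgreSQL code: the function names (lcprv2_for, mule4_to_wchar, wchar_to_mule4) are hypothetical, and the wchar layout (0x00, LC22, C1, C2 from the most significant byte down) is an assumption taken from the description in this message. The point it illustrates is that the prefix byte is fully determined by LC22, so dropping it on the way to wchar loses nothing.

```c
/*
 * Sketch only: helper names are hypothetical, not from pg_wchar.h.
 * Assumed wchar layout for a private multibyte charset:
 *   bits 24-31: 0x00, bits 16-23: LC22, bits 8-15: C1, bits 0-7: C2
 */

/* Return the LCPRV2 prefix implied by LC22, or 0 if LC22 is out of range. */
static unsigned char
lcprv2_for(unsigned char lc22)
{
    if (lc22 >= 0xf0 && lc22 <= 0xf4)
        return 0x9c;
    if (lc22 >= 0xf5 && lc22 <= 0xfe)
        return 0x9d;
    return 0;
}

/* Pack LCPRV2 + LC22 + C1 + C2 into a wide char; the prefix byte is dropped. */
static unsigned int
mule4_to_wchar(const unsigned char *mb)
{
    return ((unsigned int) mb[1] << 16) |
           ((unsigned int) mb[2] << 8) |
           (unsigned int) mb[3];
}

/*
 * Rebuild the 4-byte sequence from the wide char.
 * Returns the number of bytes written (4), or 0 if the second byte of
 * the wide char is not a private multibyte charset ID.
 */
static int
wchar_to_mule4(unsigned int wc, unsigned char *mb)
{
    unsigned char lb = (wc >> 16) & 0xff;
    unsigned char prefix = lcprv2_for(lb);

    if (prefix == 0)
        return 0;
    mb[0] = prefix;             /* recovered from LC22 alone */
    mb[1] = lb;
    mb[2] = (wc >> 8) & 0xff;
    mb[3] = wc & 0xff;
    return 4;
}
```

For example, under these assumptions the sequence 0x9d 0xf6 0xa1 0xa2 packs to 0x00f6a1a2, and converting that value back yields the same four bytes, because 0xf6 falls in the 0xf5 to 0xfe range that selects 0x9d.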