Re: Patch: add conversion from pg_wchar to multibyte - Mailing list pgsql-hackers

From	Tatsuo Ishii
Subject	Re: Patch: add conversion from pg_wchar to multibyte
Date	July 3, 2012 03:20:20
Msg-id	20120703.151747.1330940307954703732.t-ishii@sraoss.co.jp
In response to	Re: Patch: add conversion from pg_wchar to multibyte (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: Patch: add conversion from pg_wchar to multibyte Re: Patch: add conversion from pg_wchar to multibyte
List	pgsql-hackers

> OK.  So, in that case, I suggest that if the leading byte is non-zero,
> we emit 0x9d followed by the three available bytes, instead of first
> testing whether the first byte is >= 0xf0.  That test seems to serve
> no purpose but to confuse the issue.

Probably the code should look like this (see the comment below):

    else if (lb >= 0xf0 && lb <= 0xfe)
    {
        if (lb <= 0xf4)
            *to++ = 0x9c;
        else
            *to++ = 0x9d;
        *to++ = lb;
        *to++ = (*from >> 8) & 0xff;
        *to++ = *from & 0xff;
        cnt += 4;
    }

> I further suggest that we improve the comments on the mule functions
> for both wchar->mb and mb->wchar to make all this more clear.

I have added comments about the mule internal encoding, based on
refreshing my memory and on an old document found on the web
(http://mibai.tec.u-ryukyu.ac.jp/cgi-bin/info2www?%28mule%29Buffer%20and%20string).

Please take a look.  BTW, it seems the conversion between multibyte and
wchar can round-trip in the case where the leading byte is LCPRV2:

If the second byte of the wchar (out of its 4 bytes; the first byte is
always 0x00) is in the range 0xf0 to 0xf4, then the first byte of the
multibyte sequence must be 0x9c.  If the second byte of the wchar is in
the range 0xf5 to 0xfe, then the first byte of the multibyte sequence
must be 0x9d.
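
To make the round-trip claim concrete, here is a minimal standalone sketch
of both directions for the LCPRV2 case.  It is not the patch code: the type
sketch_wchar and the functions mule_prv2_to_wchar() and wchar_to_mule_prv2()
are hypothetical names used only for illustration, and the sketch assumes
the wchar layout described above (0x00, LC22, C1, C2).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for pg_wchar (an unsigned 32-bit value). */
typedef uint32_t sketch_wchar;

/*
 * Multibyte -> wchar for a 4-byte private multibyte character
 * (LCPRV2 + LC22 + C1 + C2).  The LCPRV2 prefix is dropped; the
 * remaining three bytes are packed below a leading 0x00 byte.
 */
static sketch_wchar
mule_prv2_to_wchar(const unsigned char *mb)
{
    assert(mb[0] == 0x9c || mb[0] == 0x9d);
    return ((sketch_wchar) mb[1] << 16) |
           ((sketch_wchar) mb[2] << 8) |
           (sketch_wchar) mb[3];
}

/*
 * wchar -> multibyte.  The dropped LCPRV2 prefix is recovered from
 * the range of LC22 (the second byte of the wchar).  Returns the
 * number of bytes written, or 0 if LC22 is outside 0xf0..0xfe.
 */
static int
wchar_to_mule_prv2(sketch_wchar wc, unsigned char *out)
{
    unsigned char lb = (wc >> 16) & 0xff;

    if (lb < 0xf0 || lb > 0xfe)
        return 0;               /* not a private multibyte charset */

    out[0] = (lb <= 0xf4) ? 0x9c : 0x9d;
    out[1] = lb;
    out[2] = (wc >> 8) & 0xff;
    out[3] = wc & 0xff;
    return 4;
}
```

Because LCPRV2 is fully determined by the range of LC22, no information is
lost when the prefix byte is dropped on the way to wchar.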
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index d456309..1148eb5 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -37,6 +37,31 @@ typedef unsigned int pg_wchar;
 #define ISSJISTAIL(c) (((c) >= 0x40 && (c) <= 0x7e) || ((c) >= 0x80 && (c) <= 0xfc))
 
 /*
+ * Currently PostgreSQL supports 5 types of mule internal encodings:
+ *
+ * 1) 1-byte ASCII characters, each byte is below 0x80.
+ *
+ * 2) "Official" single byte charsets such as ISO 8859 latin1.  Each
+ *    mule character consists of 2 bytes: LC1 + C1, where LC1
+ *    corresponds to each charset and is in the range of 0x81 to 0x8d,
+ *    and C1 is in the range of 0xa0 to 0xff (ISO 8859-1 for example,
+ *    plus each high bit is on).
+ *
+ * 3) "Private" single byte charsets such as SISHENG.  Each mule
+ *    character consists of 3 bytes: LCPRV1 + LC12 + C1, where LCPRV1
+ *    is either 0x9a (if LC12 is in the range of 0xa0 to 0xdf) or 0x9b
+ *    (if LC12 is in the range of 0xe0 to 0xef).
+ *
+ * 4) "Official" multibyte charsets such as JIS X0208.  Each mule
+ *    character consists of 3 bytes: LC2 + C1 + C2, where LC2
+ *    corresponds to each charset and is in the range of 0x90 to 0x99.
+ *    C1 and C2 are in the range of 0xa0 to 0xff (each high bit is on).
+ *
+ * 5) "Private" multibyte charsets such as CNS 11643-1992 Plane 3.
+ *    Each mule character consists of 4 bytes: LCPRV2 + LC22 + C1 +
+ *    C2, where LCPRV2 is either 0x9c (if LC22 is in the range of 0xf0
+ *    to 0xf4) or 0x9d (if LC22 is in the range of 0xf5 to 0xfe).
+ *
  * Leading byte types or leading prefix byte for MULE internal code.
  * See http://www.xemacs.org for more details.  (there is a doc titled
  * "XEmacs Internals Manual", "MULE Character Sets and Encodings"
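
The five cases listed in the comment above can be read as a lookup from the
leading byte to the length of the mule character.  The following is only a
sketch of that reading: mule_char_len() is a hypothetical helper, not a
function from the patch, and returning 0 for a byte that fits none of the
five cases is an assumption made here for illustration.

```c
/*
 * Hypothetical helper: length in bytes of a mule internal character,
 * decided from its leading byte alone.  Returns 0 when the byte does
 * not start any of the five documented cases (an assumption of this
 * sketch, not necessarily what PostgreSQL does).
 */
static int
mule_char_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;               /* 1) ASCII */
    if (lead >= 0x81 && lead <= 0x8d)
        return 2;               /* 2) official single byte: LC1 + C1 */
    if (lead == 0x9a || lead == 0x9b)
        return 3;               /* 3) private single byte: LCPRV1 + LC12 + C1 */
    if (lead >= 0x90 && lead <= 0x99)
        return 3;               /* 4) official multibyte: LC2 + C1 + C2 */
    if (lead == 0x9c || lead == 0x9d)
        return 4;               /* 5) private multibyte: LCPRV2 + LC22 + C1 + C2 */
    return 0;                   /* not a valid leading byte in this sketch */
}
```

For example, a leading byte of 0x92 falls in the official multibyte range and
gives a 3-byte character, while 0x9d announces a 4-byte private multibyte
character.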
