Re: Patch: add conversion from pg_wchar to multibyte - Mailing list pgsql-hackers

From Tatsuo Ishii
Subject Re: Patch: add conversion from pg_wchar to multibyte
Msg-id 20120711.082326.1199398009192084540.t-ishii@sraoss.co.jp
In response to Re: Patch: add conversion from pg_wchar to multibyte  (Tatsuo Ishii <ishii@postgresql.org>)
Responses Re: Patch: add conversion from pg_wchar to multibyte
List pgsql-hackers
>>>> Tatsuo Ishii <ishii@postgresql.org> writes:
>>>>>> So far as I can see, the only LCPRVn marker code that is actually in
>>>>>> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
>>>>>> that I can find.
>>>>>> 
>>>>>> I also read in the xemacs internals doc, at
>>>>>> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
>>>>>> that XEmacs thinks the marker code for private single-byte charsets
>>>>>> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
>>>>>> moreover they think 0x9a-0x9d are potential future official multibyte
>>>>>> charset codes.  I don't know how we got to the current state of using
>>>>>> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent
>>>>>> with XEmacs.
>>>> 
>>>>> At the time when mule internal code was introduced to PostgreSQL,
>>>>> xemacs did not have multi encoding capability and mule (a patch to
>>>>> emacs) was the only implementation allowed to use multi encoding. So I
>>>>> used the specification of mule documented in the URL I wrote.
>>>> 
>>>> I see.  Given that upstream has decided that a simpler definition is
>>>> more appropriate, is there any reason not to follow their lead, to the
>>>> extent that we can do so without breaking existing on-disk data?
>>> 
>>> Please let me spend the weekend to understand their latest spec.
>> 
>> This is an intermediate report on the internal multi-byte charset
>> implementation of emacen. I have read the link Tom showed. I also made
>> a quick scan of the xemacs-21.4.0 source code, especially
>> xemacs-21.4.0/src/mule-charset.h. It seems the web document is
>> essentially a copy of the comments in that file. I also looked into
>> other places in the xemacs code, and I think I can conclude that xemacs
>> 21.4's multi-byte implementation is based on the doc on the web.
>> 
>> Next I looked into the emacs 24.1 source code because I could not find any
>> doc regarding emacs's (not xemacs's) implementation of the internal
>> multi-byte charset. I found the following in src/charset.h:
>> 
>> /* Leading-code followed by extended leading-code.    DIMENSION/COLUMN */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_11    0x9A /* 1/1 */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_12    0x9B /* 1/2 */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_21    0x9C /* 2/2 */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_22    0x9D /* 2/2 */
>> 
>> And these are used like this:
>> 
>> /* Read one non-ASCII character from INSTREAM.  The character is
>>    encoded in `emacs-mule' and the first byte is already read in
>>    C.  */
>> 
>> static int
>> read_emacs_mule_char (int c, int (*readbyte) (int, Lisp_Object), Lisp_Object readcharfun)
>> {
>> :
>> :
>>   else if (len == 3)
>>     {
>>       if (buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_11
>>           || buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_12)
>>         {
>>           charset = CHARSET_FROM_ID (emacs_mule_charset[buf[1]]);
>>           code = buf[2] & 0x7F;
>>         }
>> 
>> As far as I can tell, this is exactly how PostgreSQL handles private
>> single-byte character sets: they consist of 3 bytes, and the
>> leading byte is either 0x9a or 0x9b. Other examples regarding private
>> single-byte/multi-byte charsets can be seen in coding.c.
>> 
>> As far as I can tell, emacs and xemacs employ different
>> implementations of the multi-byte charset regarding "private"
>> charsets. Emacs's is the same as PostgreSQL's, while xemacs's is not.  I am
>> contacting the original Mule author to ask if he knows anything about
>> this.
> 
> I got a reply from the Mule author, Kenichi Handa (the mail is in
> Japanese, so I do not quote it here. If somebody wants to read
> the original mail please let me know). First of all, my understanding
> of emacs's implementation is correct according to him. He did not
> know about xemacs's implementation. Apparently the implementation of
> xemacs was not led by the original mule author.
> 
> So which one of emacs/xemacs should we follow? My suggestion is, not
> to follow xemacs, and to leave the current treatment of private
> leading byte as it is, because emacs seems to be the more "right" upstream
> compared with xemacs.
> 
>> BTW, while looking into emacs's source code, I found their charset
>> definitions are in lisp/international/mule-conf.el. According to the
>> file several new charsets have been added. Included is the patch to
>> follow their changes. This makes no changes to current behavior, since
>> the patch just changes some comments and unsupported charsets.
> 
> If there's no objection, I would like to commit this. Objection?

Done, along with a comment that we follow emacs's implementation, not
xemacs's.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
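
For reference, the 3-byte layout for private single-byte charsets discussed in this thread (a leading byte of 0x9a or 0x9b, then an extended leading byte, then the character code) can be sketched in C roughly as follows. This is only an illustration of the byte layout as described in the quoted emacs excerpt; the names is_private_single() and decode_private_single() are hypothetical and are not functions from PostgreSQL or emacs.

```c
#include <stddef.h>

/*
 * Hypothetical helper: true if c is one of the two leading bytes that
 * the thread says mark a private single-byte charset (0x9a or 0x9b).
 */
static int
is_private_single(unsigned char c)
{
    return c == 0x9a || c == 0x9b;
}

/*
 * Hypothetical helper: split a 3-byte private single-byte sequence.
 *   buf[0]  private leading byte (0x9a or 0x9b)
 *   buf[1]  extended leading byte, identifying the charset
 *   buf[2]  character code; the high bit is masked off, mirroring
 *           "code = buf[2] & 0x7F" in the quoted emacs excerpt
 * Returns 1 on success, 0 if buf does not start with such a sequence.
 */
static int
decode_private_single(const unsigned char *buf, size_t len,
                      unsigned *charset_id, unsigned *code)
{
    if (len < 3 || !is_private_single(buf[0]))
        return 0;
    *charset_id = buf[1];
    *code = buf[2] & 0x7f;
    return 1;
}
```

Under these assumptions, the bytes 0x9a 0xa0 0xc1 would decode to extended leading byte 0xa0 and code 0x41, while a sequence starting with 0x9c or 0x9d (the dimension-2 private markers) would be rejected by this helper.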
