Re: Patch: add conversion from pg_wchar to multibyte - Mailing list pgsql-hackers
From | Tatsuo Ishii |
---|---|
Subject | Re: Patch: add conversion from pg_wchar to multibyte |
Date | |
Msg-id | 20120711.082326.1199398009192084540.t-ishii@sraoss.co.jp |
In response to | Re: Patch: add conversion from pg_wchar to multibyte (Tatsuo Ishii <ishii@postgresql.org>) |
Responses | Re: Patch: add conversion from pg_wchar to multibyte |
List | pgsql-hackers |
>>>> Tatsuo Ishii <ishii@postgresql.org> writes:
>>>>>> So far as I can see, the only LCPRVn marker code that is actually in
>>>>>> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
>>>>>> that I can find.
>>>>>>
>>>>>> I also read in the xemacs internals doc, at
>>>>>> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
>>>>>> that XEmacs thinks the marker code for private single-byte charsets
>>>>>> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
>>>>>> moreover they think 0x9a-0x9d are potential future official multibyte
>>>>>> charset codes. I don't know how we got to the current state of using
>>>>>> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent
>>>>>> with XEmacs.
>>>>
>>>>> At the time when mule internal code was introduced to PostgreSQL,
>>>>> xemacs did not have multi encoding capability and mule (a patch to
>>>>> emacs) was the only implementation allowed to use multi encoding. So I
>>>>> used the specification of mule documented in the URL I wrote.
>>>>
>>>> I see. Given that upstream has decided that a simpler definition is
>>>> more appropriate, is there any reason not to follow their lead, to the
>>>> extent that we can do so without breaking existing on-disk data?
>>>
>>> Please let me spend the weekend to understand their latest spec.
>>
>> This is an intermediate report on the internal multi-byte charset
>> implementation of emacsen. I have read the link Tom showed. Also I made
>> a quick scan of the xemacs-21.4.0 source code, especially
>> xemacs-21.4.0/src/mule-charset.h. It seems the web document is
>> essentially a copy of the comments in the file. Also I looked into
>> other places in the xemacs code and I think I can conclude that xemacs
>> 21.4's multi-byte implementation is based on the doc on the web.
>>
>> Next I looked into the emacs 24.1 source code because I could not find
>> any doc regarding emacs's (not xemacs's) implementation of internal
>> multi-byte charset.
>> I found the following in src/charset.h:
>>
>> /* Leading-code followed by extended leading-code. DIMENSION/COLUMN */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_11 0x9A /* 1/1 */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_12 0x9B /* 1/2 */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_21 0x9C /* 2/2 */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_22 0x9D /* 2/2 */
>>
>> And these are used like this:
>>
>> /* Read one non-ASCII character from INSTREAM. The character is
>>    encoded in `emacs-mule' and the first byte is already read in
>>    C. */
>>
>> static int
>> read_emacs_mule_char (int c, int (*readbyte) (int, Lisp_Object), Lisp_Object readcharfun)
>> {
>>   :
>>   :
>>   else if (len == 3)
>>     {
>>       if (buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_11
>>           || buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_12)
>>         {
>>           charset = CHARSET_FROM_ID (emacs_mule_charset[buf[1]]);
>>           code = buf[2] & 0x7F;
>>         }
>>
>> As far as I can tell, this is exactly how PostgreSQL handles
>> single-byte private character sets: they consist of 3 bytes, and the
>> leading byte is either 0x9a or 0x9b. Other examples regarding
>> single-byte/multi-byte private charsets can be seen in coding.c.
>>
>> As far as I can tell, emacs and xemacs employ different
>> implementations of multi-byte charsets regarding "private"
>> charsets. Emacs's is the same as PostgreSQL's, while xemacs's is not.
>> I am contacting the original Mule author to ask if he knows anything
>> about this.
>
> I got a reply from the Mule author, Kenichi Handa (the mail is in
> Japanese, so I do not quote it here. If somebody wants to read the
> original mail please let me know). First of all, my understanding of
> emacs's implementation is correct according to him. He did not know
> about xemacs's implementation. Apparently the implementation of
> xemacs was not led by the original mule author.
>
> So which one of emacs/xemacs should we follow?
> My suggestion is not to follow xemacs, and to leave the current
> treatment of private leading bytes as it is, because emacs seems to be
> the more "right" upstream compared with xemacs.
>
>> BTW, while looking into emacs's source code, I found their charset
>> definitions are in lisp/international/mule-conf.el. According to the
>> file several new charsets have been added. Included is the patch to
>> follow their changes. This makes no changes to current behavior, since
>> the patch just changes some comments and unsupported charsets.
>
> If there's no objection, I would like to commit this. Objection?

Done, along with a comment that we follow emacs's implementation, not
xemacs's.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
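
For readers following the thread, the private-charset layout described above can be sketched in C roughly as below. This is only an illustration, not code from emacs or PostgreSQL: the macro and function names (LCPRV_*, private_seq_len, decode_private_single) are hypothetical. The 3-byte case follows the quoted emacs excerpt. The 4-byte length for 0x9c/0x9d is an assumption inferred from the "dimension 2" comments, not something stated in the message.

```c
#include <assert.h>
#include <stddef.h>

/* Private leading bytes, values as quoted from emacs src/charset.h.
   The macro names here are hypothetical shorthand. */
#define LCPRV_11 0x9A /* dimension 1 */
#define LCPRV_12 0x9B /* dimension 1 */
#define LCPRV_21 0x9C /* dimension 2 */
#define LCPRV_22 0x9D /* dimension 2 */

/* Total length in bytes of a private-charset sequence that starts with
   LEAD, or 0 if LEAD is not one of the four private leading bytes.
   Layout: leading byte, extended leading byte, then the code byte(s).
   The 4-byte case is an assumption (two code bytes for dimension 2). */
static int
private_seq_len(unsigned char lead)
{
    if (lead == LCPRV_11 || lead == LCPRV_12)
        return 3;
    if (lead == LCPRV_21 || lead == LCPRV_22)
        return 4;
    return 0;
}

/* Decode a 3-byte private single-byte-charset sequence, mirroring the
   len == 3 branch of the quoted read_emacs_mule_char excerpt: buf[1]
   selects the charset and the low 7 bits of buf[2] are the code.
   Returns 1 on success, 0 if BUF is too short or not such a sequence. */
static int
decode_private_single(const unsigned char *buf, size_t len,
                      unsigned *charset_byte, unsigned *code)
{
    assert(buf != NULL && charset_byte != NULL && code != NULL);

    if (len < 3 || private_seq_len(buf[0]) != 3)
        return 0;

    *charset_byte = buf[1];
    *code = buf[2] & 0x7F;
    return 1;
}
```

Under the xemacs scheme quoted earlier in the thread, only 0x9e and 0x9f would be private markers, so a table like this would look different. The sketch follows emacs, which is what the committed comment says PostgreSQL follows.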