Thread: Patch: add conversion from pg_wchar to multibyte

Patch: add conversion from pg_wchar to multibyte

From
Alexander Korotkov
Date:
Hackers,

The attached patch adds conversion from a pg_wchar string to a multibyte string.
This functionality is needed for my patch on index support for regular expression search http://archives.postgresql.org/pgsql-hackers/2011-11/msg01297.php .
Analyzing the conversion from multibyte to pg_wchar, I found the following types of conversion:
1) Trivial conversion for single-byte encodings. It just adds leading zeros to each byte.
2) Conversion from UTF-8 to Unicode.
3) Conversions from the euc* encodings. They write the bytes of a character into the pg_wchar in inverse order, starting from the lower byte (this explanation assumes a little-endian system).
4) Conversion from the mule encoding. This conversion is unclear to me and also seems to be lossy.

It was easy to write the inverse conversion for cases 1-3. I've changed conversion 4 to behave like conversion 3. I'm not sure my change is OK, because I didn't understand the original conversion.
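
For illustration, a minimal sketch of what the inverse of case 1 could look like (the function name and signature here are mine, not necessarily what the patch uses):

    /*
     * Sketch only: the inverse of the trivial single-byte conversion just
     * drops the zero padding again (assumes pg_wchar from mb/pg_wchar.h).
     */
    static int
    wchar2single_with_len(const pg_wchar *from, unsigned char *to, int len)
    {
        int cnt = 0;

        while (len > 0 && *from)
        {
            *to++ = *from++ & 0xFF;    /* keep only the low byte */
            len--;
            cnt++;
        }
        *to = 0;
        return cnt;
    }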

------
With best regards,
Alexander Korotkov.
Attachment

Re: Patch: add conversion from pg_wchar to multibyte

From
"Erik Rijkers"
Date:
Hi Alexander,


Perhaps I'm too early with these tests, but FWIW I reran my earlier test program against three
instances.  (the patches compiled fine, and make check was without problem).

-- 3 instances:
HEAD                 port 6542
trgm_regex           port 6547  HEAD + trgm-regexp patch (22 Nov 2011) [1]
trgm_regex_wchar2mb  port 6549  HEAD + trgm-regexp + wchar2mb patch (23 Apr 2012) [2]

[1] http://archives.postgresql.org/pgsql-hackers/2011-11/msg01297.php
[2] http://archives.postgresql.org/pgsql-hackers/2012-04/msg01095.php

-- table sizes:
  azjunk4  10^4 rows     1 MB
  azjunk5  10^5 rows    11 MB
  azjunk6  10^6 rows   112 MB
  azjunk7  10^7 rows  1116 MB

for table creation/structure, see:
[3] http://archives.postgresql.org/pgsql-hackers/2012-01/msg01094.php

Results for three instances with 4 repetitions per instance are attached.

Although the regexes I chose are somewhat arbitrary, it does show some of the good, the bad and
the ugly of the patch(es).  (Also: I've limited the tests to a range of 'workable' regexps, i.e.
avoiding unbounded regexps)

hth (and thanks, great work!),


Erik Rijkers


Attachment

Re: Patch: add conversion from pg_wchar to multibyte

From
Robert Haas
Date:
On Sun, Apr 29, 2012 at 8:12 AM, Erik Rijkers <er@xs4all.nl> wrote:
> Perhaps I'm too early with these tests, but FWIW I reran my earlier test program against three
> instances.  (the patches compiled fine, and make check was without problem).

These test results seem to be more about the pg_trgm changes than the
patch actually on this thread, unless I'm missing something.  But the
executive summary seems to be that pg_trgm might need to be a bit
smarter about costing the trigram-based search, because when the
number of trigrams is really big, using the index is
counterproductive.  Hopefully that's not too hard to fix; the basic
approach seems quite promising.

(I haven't actually looked at the patch on this thread yet to
understand how it fits in; the above comments are about the pg_trgm
regex stuff.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Patch: add conversion from pg_wchar to multibyte

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> Hopefully that's not too hard to fix; the basic approach seems
> quite promising.
After playing with trigram searches for name searches against copies
of a production database with appropriate indexing, our shop has
chosen it as the new way to do name searches here.  It's really
nice.

My biggest complaint is related to setting the threshold for the %
operator.  It seems to me that there should be a GUC to control the
default, and that there should be a way to set the threshold for
each % operator in a query (if there is more than one).  The
function names which must be used on the connection before running
the queries don't give any clue that they are related to trigrams:
show_limit() and set_limit() are nearly useless for conveying the
semantics of what they do.

Even with those issues, trigram similarity searching is IMO one of
the top five coolest things about PostgreSQL and should be promoted
heavily.
-Kevin


Re: Patch: add conversion from pg_wchar to multibyte

From
Alexander Korotkov
Date:
Hi Erik


On Sun, Apr 29, 2012 at 4:12 PM, Erik Rijkers <er@xs4all.nl> wrote:
Perhaps I'm too early with these tests, but FWIW I reran my earlier test program against three
instances.  (the patches compiled fine, and make check was without problem).

-- 3 instances:
HEAD                 port 6542
trgm_regex           port 6547  HEAD + trgm-regexp patch (22 Nov 2011) [1]
trgm_regex_wchar2mb  port 6549  HEAD + trgm-regexp + wchar2mb patch (23 Apr 2012) [2]

Actually, the wchar2mb patch doesn't affect the behaviour of trgm-regexp. It provides a correct way to do some of the encoding conversion work which the last published version of trgm-regexp does internally. So "HEAD + trgm-regexp patch" and "HEAD + trgm-regexp + wchar2mb patch" should behave similarly.
 
[1] http://archives.postgresql.org/pgsql-hackers/2011-11/msg01297.php
[2] http://archives.postgresql.org/pgsql-hackers/2012-04/msg01095.php

-- table sizes:
 azjunk4  10^4 rows     1 MB
 azjunk5  10^5 rows    11 MB
 azjunk6  10^6 rows   112 MB
 azjunk7  10^7 rows  1116 MB

for table creation/structure, see:
[3] http://archives.postgresql.org/pgsql-hackers/2012-01/msg01094.php

Results for three instances with 4 repetitions per instance are attached.

Although the regexes I chose are somewhat arbitrary, it does show some of the good, the bad and
the ugly of the patch(es).  (Also: I've limited the tests to a range of 'workable' regexps, i.e.
avoiding unbounded regexps)

Thank you for testing!
Such synthetic tests are very valuable for finding corner cases of the patch, bugs, etc.
But it would also be nice to do some tests on real-life datasets with real-life regexps, in order to see the real benefit of this approach to indexing and to compare it with other approaches. Maybe you or somebody else could obtain such datasets?

Also, I did some optimizations in the algorithm. The automaton analysis stage should become less CPU- and memory-intensive. I'll publish a new version soon.

------
With best regards,
Alexander Korotkov.

Re: Patch: add conversion from pg_wchar to multibyte

From
Alexander Korotkov
Date:
<div class="gmail_quote">On Mon, Apr 30, 2012 at 10:07 PM, Robert Haas <span dir="ltr"><<a
href="mailto:robertmhaas@gmail.com"target="_blank">robertmhaas@gmail.com</a>></span> wrote:<br /><blockquote
class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On Sun, Apr
29,2012 at 8:12 AM, Erik Rijkers <<a href="mailto:er@xs4all.nl">er@xs4all.nl</a>> wrote:<br /> > Perhaps I'm
tooearly with these tests, but FWIW I reran my earlier test program against three<br /> > instances.  (the patches
compiledfine, and make check was without problem).<br /><br /></div>These tests results seem to be more about the
pg_trgmchanges than the<br /> patch actually on this thread, unless I'm missing something.  But the<br /> executive
summaryseems to be that pg_trgm might need to be a bit<br /> smarter about costing the trigram-based search, because
whenthe<br /> number of trigrams is really big, using the index is<br /> counterproductive.  Hopefully that's not too
hardto fix; the basic<br /> approach seems quite promising.</blockquote><div class="gmail_quote"><br /></div><div
class="gmail_quote">Right.When number of trigrams is big, it is slow to scan posting list of all of them. The solution
isthis case is to exclude most frequent trigrams from index scan. But, it require some kind of statistics of trigrams
frequencieswhich we don't have. We could estimate frequencies using some hard-coded assumptions about natural
languages.Or we could exclude arbitrary trigrams. But I don't like both these ideas. This problem is also relevant for
LIKE/ILIKEsearch using trigram indexes.</div><div class="gmail_quote"><br /></div><div class="gmail_quote">Something
similarcould occur in tsearch when we search for "frequent_term & rare_term". In some situations (depending on
termsfrequencies) it's better to exclude  frequent_term from index scan and do recheck. We have relevant statistics to
dosuch decision, but it doesn't seem to be feasible to get it using current GIN interface.</div><div
class="gmail_quote"><br/></div><div class="gmail_quote">Probably you have some comments on idea of conversion from
pg_wcharto multibyte? Is it acceptable at all?</div><br />------<br />With best regards,<br />Alexander Korotkov.</div> 

Re: Patch: add conversion from pg_wchar to multibyte

From
Alexander Korotkov
Date:
On Tue, May 1, 2012 at 1:48 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
My biggest complaint is related to setting the threshold for the %
operator.  It seems to me that there should be a GUC to control the
default, and that there should be a way to set the threshold for
each % operator in a query (if there is more than one).  The
function names which must be used on the connection before running
the queries don't give any clue that they are related to trigrams:
show_limit() and set_limit() are nearly useless for conveying the
semantics of what they do.

I think this problem can be avoided by introducing a composite type representing a trigram similarity query. It could consist of a query text and a similarity threshold. This type would have a similar purpose to tsquery or query_int. It would make queries heavier, but would allow using a distinct similarity threshold for each operator in the same query.

------
With best regards,
Alexander Korotkov.

Re: Patch: add conversion from pg_wchar to multibyte

From
Robert Haas
Date:
On Tue, May 1, 2012 at 6:02 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> Right. When the number of trigrams is big, it is slow to scan the posting
> lists of all of them. The solution in this case is to exclude the most
> frequent trigrams from the index scan.  But that requires some kind of
> statistics on trigram frequencies, which we don't have. We could estimate
> frequencies using some hard-coded assumptions about natural languages, or we
> could exclude arbitrary trigrams, but I don't like either of these ideas.
> This problem is also relevant for LIKE/ILIKE search using trigram indexes.

I was thinking you could perhaps do it just based on the *number* of
trigrams, not necessarily their frequency.

> Probably you have some comments on the idea of conversion from pg_wchar to
> multibyte? Is it acceptable at all?

Well, I'm not an expert on encodings, but it seems like a logical
extension of what we're doing right now, so I don't really see why
not.  I'm confused by the diff hunks in pg_mule2wchar_with_len,
though.  Presumably either the old code is right (in which case, don't
change it) or the new code is right (in which case, there's a bug fix
needed here that ought to be discussed and committed separately from
the rest of the patch).  Maybe I am missing something.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Patch: add conversion from pg_wchar to multibyte

From
Alexander Korotkov
Date:
On Wed, May 2, 2012 at 4:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 1, 2012 at 6:02 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> Right. When the number of trigrams is big, it is slow to scan the posting
> lists of all of them. The solution in this case is to exclude the most
> frequent trigrams from the index scan.  But that requires some kind of
> statistics on trigram frequencies, which we don't have. We could estimate
> frequencies using some hard-coded assumptions about natural languages, or we
> could exclude arbitrary trigrams, but I don't like either of these ideas.
> This problem is also relevant for LIKE/ILIKE search using trigram indexes.

I was thinking you could perhaps do it just based on the *number* of
trigrams, not necessarily their frequency.

Imagine we have two queries:
1) SELECT * FROM tbl WHERE col LIKE '%abcd%';
2) SELECT * FROM tbl WHERE col LIKE '%abcdefghijk%';

The first query requires reading the posting lists of trigrams "abc" and "bcd".
The second query requires reading the posting lists of trigrams "abc", "bcd", "cde", "def", "efg", "fgh", "ghi", "hij" and "ijk".
We could decide to use an index scan for the first query and a sequential scan for the second because the number of posting lists to read is high. But that is unreasonable, because the second query is actually narrower than the first one: we can use the same index scan for it, and the recheck will remove all false positives. When the number of trigrams is high we can just exclude some of them from the index scan. That would be better than simply deciding to do a sequential scan. But the question is which trigrams to exclude? Ideally we would keep the rarest trigrams, to make the index scan cheaper.
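
To make the counting above concrete, here is a tiny standalone sketch (mine, and it deliberately ignores pg_trgm's padding and normalization rules) that enumerates the trigrams a substring pattern must cover:

    #include <stdio.h>
    #include <string.h>

    /* Print the trigrams whose posting lists a '%pattern%' search must read
     * (simplified: real pg_trgm also pads and normalizes the string). */
    static void
    print_required_trigrams(const char *pattern)
    {
        size_t n = strlen(pattern);
        size_t i;

        for (i = 0; i + 3 <= n; i++)
            printf("\"%.3s\" ", pattern + i);
        printf("\n");
    }

    int
    main(void)
    {
        print_required_trigrams("abcd");        /* "abc" "bcd" */
        print_required_trigrams("abcdefghijk"); /* "abc" ... "ijk": 9 trigrams */
        return 0;
    }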
 
> Probably you have some comments on the idea of conversion from pg_wchar to
> multibyte? Is it acceptable at all?

Well, I'm not an expert on encodings, but it seems like a logical
extension of what we're doing right now, so I don't really see why
not.  I'm confused by the diff hunks in pg_mule2wchar_with_len,
though.  Presumably either the old code is right (in which case, don't
change it) or the new code is right (in which case, there's a bug fix
needed here that ought to be discussed and committed separately from
the rest of the patch).  Maybe I am missing something.

Unfortunately, I didn't understand the original logic of pg_mule2wchar_with_len. I just made a proposal about how it could work. I hope somebody more familiar with this code will clarify the situation.

------
With best regards,
Alexander Korotkov.

Re: Patch: add conversion from pg_wchar to multibyte

From
Robert Haas
Date:
On Wed, May 2, 2012 at 9:35 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> I was thinking you could perhaps do it just based on the *number* of
>> trigrams, not necessarily their frequency.
>
> Imagine we have two queries:
> 1) SELECT * FROM tbl WHERE col LIKE '%abcd%';
> 2) SELECT * FROM tbl WHERE col LIKE '%abcdefghijk%';
>
> The first query requires reading the posting lists of trigrams "abc" and
> "bcd".  The second query requires reading the posting lists of trigrams
> "abc", "bcd", "cde", "def", "efg", "fgh", "ghi", "hij" and "ijk".
> We could decide to use an index scan for the first query and a sequential
> scan for the second because the number of posting lists to read is high.
> But that is unreasonable, because the second query is actually narrower
> than the first one: we can use the same index scan for it, and the recheck
> will remove all false positives. When the number of trigrams is high we
> can just exclude some of them from the index scan. That would be better
> than simply deciding to do a sequential scan. But the question is which
> trigrams to exclude? Ideally we would keep the rarest trigrams, to make
> the index scan cheaper.

True.  I guess I was thinking more of the case where you've got
abc|def|ghi|jkl|mno|pqr|stu|vwx|yza|....  There's probably some point
at which it becomes silly to think about using the index.

>> > Probably you have some comments on idea of conversion from pg_wchar to
>> > multibyte? Is it acceptable at all?
>>
>> Well, I'm not an expert on encodings, but it seems like a logical
>> extension of what we're doing right now, so I don't really see why
>> not.  I'm confused by the diff hunks in pg_mule2wchar_with_len,
>> though.  Presumably either the old code is right (in which case, don't
>> change it) or the new code is right (in which case, there's a bug fix
>> needed here that ought to be discussed and committed separately from
>> the rest of the patch).  Maybe I am missing something.
>
> Unfortunately I didn't understand original logic of pg_mule2wchar_with_len.
> I just did proposal about how it could be. I hope somebody more familiar
> with this code would clarify this situation.

Well, do you think the current code is buggy, or not?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Patch: add conversion from pg_wchar to multibyte

From
Alexander Korotkov
Date:
On Wed, May 2, 2012 at 5:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, May 2, 2012 at 9:35 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> Imagine we have two queries:
> 1) SELECT * FROM tbl WHERE col LIKE '%abcd%';
> 2) SELECT * FROM tbl WHERE col LIKE '%abcdefghijk%';
>
> The first query requires reading the posting lists of trigrams "abc" and
> "bcd".  The second query requires reading the posting lists of trigrams
> "abc", "bcd", "cde", "def", "efg", "fgh", "ghi", "hij" and "ijk".
> We could decide to use an index scan for the first query and a sequential
> scan for the second because the number of posting lists to read is high.
> But that is unreasonable, because the second query is actually narrower
> than the first one: we can use the same index scan for it, and the recheck
> will remove all false positives. When the number of trigrams is high we
> can just exclude some of them from the index scan. That would be better
> than simply deciding to do a sequential scan. But the question is which
> trigrams to exclude? Ideally we would keep the rarest trigrams, to make
> the index scan cheaper.

True.  I guess I was thinking more of the case where you've got
abc|def|ghi|jkl|mno|pqr|stu|vwx|yza|....  There's probably some point
at which it becomes silly to think about using the index.

Yes, such situations are also possible.

>> Well, I'm not an expert on encodings, but it seems like a logical
>> extension of what we're doing right now, so I don't really see why
>> not.  I'm confused by the diff hunks in pg_mule2wchar_with_len,
>> though.  Presumably either the old code is right (in which case, don't
>> change it) or the new code is right (in which case, there's a bug fix
>> needed here that ought to be discussed and committed separately from
>> the rest of the patch).  Maybe I am missing something.
>
> Unfortunately I didn't understand original logic of pg_mule2wchar_with_len.
> I just did proposal about how it could be. I hope somebody more familiar
> with this code would clarify this situation.

Well, do you think the current code is buggy, or not?

Probably, but I'm not sure. The conversion seems lossy to me unless I'm missing something about mule encoding.

------
With best regards,
Alexander Korotkov.

Re: Patch: add conversion from pg_wchar to multibyte

From
Alexander Korotkov
Date:
Hello, Ishii-san!

We talked at PGCon about the questions I have on the mule to wchar conversion. My questions about the pg_mule2wchar_with_len function are the following. In these parts of the code:

else if (IS_LCPRV1(*from) && len >= 3)
{
    from++;
    *to = *from++ << 16;
    *to |= *from++;
    len -= 3;
}

and

else if (IS_LCPRV2(*from) && len >= 4)
{
    from++;
    *to = *from++ << 16;
    *to |= *from++ << 8;
    *to |= *from++;
    len -= 4;
}

we skip the first byte of the original string. Are we able to restore it back from the pg_wchar?
Also, in this part of the code we shift the first byte by 16 bits:

if (IS_LC1(*from) && len >= 2)
{
    *to = *from++ << 16;
    *to |= *from++;
    len -= 2;
}
else if (IS_LCPRV1(*from) && len >= 3)
{
    from++;
    *to = *from++ << 16;
    *to |= *from++;
    len -= 3;
}

Why don't we shift it by 8 bits?
You can see my patch in this thread, where I propose purely mechanical changes to this function which make the inverse conversion possible.

------
With best regards,
Alexander Korotkov.

Re: Patch: add conversion from pg_wchar to multibyte

From
Tatsuo Ishii
Date:
Hi Alexander,

It was good seeing you in Ottawa!

> Hello, Ishii-san!
> 
> We've talked on PGCon that I've questions about mule to wchar
> conversion. My questions about pg_mule2wchar_with_len function are
> following. In these parts of code:
>
> else if (IS_LCPRV1(*from) && len >= 3)
> {
>     from++;
>     *to = *from++ << 16;
>     *to |= *from++;
>     len -= 3;
> }
> 
> and
> 
> else if (IS_LCPRV2(*from) && len >= 4)
> {
>     from++;
>     *to = *from++ << 16;
>     *to |= *from++ << 8;
>     *to |= *from++;
>     len -= 4;
> }
> 
> we skip first character of original string. Are we able to restore it back
> from pg_wchar?

I think it's possible. The first characters are defined like this:

#define IS_LCPRV1(c)    ((unsigned char)(c) == 0x9a || (unsigned char)(c) == 0x9b)
#define IS_LCPRV2(c)    ((unsigned char)(c) == 0x9c || (unsigned char)(c) == 0x9d)

It seems IS_LCPRV1 is not used in any of the encodings PostgreSQL
supports at this point, which means there's zero chance that existing
databases include LCPRV1. So you can safely ignore it.

For IS_LCPRV2, it is only used for Chinese encodings (EUC_TW and BIG5)
in backend/utils/mb/conversion_procs/euc_tw_and_big5/euc_tw_and_big5.c
and it is fixed to 0x9d.  So you can always restore the value to 0x9d.

> Also in this part of code we're shifting first byte by 16 bits:
> 
> if (IS_LC1(*from) && len >= 2)
> {
>     *to = *from++ << 16;
>     *to |= *from++;
>     len -= 2;
> }
> else if (IS_LCPRV1(*from) && len >= 3)
> {
>     from++;
>     *to = *from++ << 16;
>     *to |= *from++;
>     len -= 3;
> }
> 
> Why don't we shift it by 8 bits?

Because we want the first byte of the LC1 case to be placed in the second
byte of the wchar, i.e.:

0th byte: always 0
1st byte: leading byte (the first byte of the multibyte)
2nd byte: always 0
3rd byte: the second byte of the multibyte

Note that we always assume that the 1st byte (called the "leading byte",
LB for short) represents the id of the character set (from 0x81 to
0xff) in the MULE INTERNAL encoding. For the mapping between LB and
charsets, see pg_wchar.h.
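
For concreteness, a sketch (mine, not PostgreSQL code) of how this LC1 layout makes the conversion invertible:

    typedef unsigned int pg_wchar;  /* stand-in for the real typedef */

    /* Forward: two multibyte bytes -> one wchar, laid out 0x00 LB 0x00 C1. */
    static pg_wchar
    lc1_to_wchar(unsigned char lb, unsigned char c1)
    {
        return ((pg_wchar) lb << 16) | c1;
    }

    /* Inverse: peel the bytes back off; nothing is lost. */
    static void
    wchar_to_lc1(pg_wchar w, unsigned char *lb, unsigned char *c1)
    {
        *lb = (w >> 16) & 0xFF;     /* leading byte identifies the charset */
        *c1 = w & 0xFF;             /* second byte of the multibyte */
    }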

> You can see my patch in this thread where I propose purely mechanical
> changes in this function which make inverse conversion possible.
> 
> ------
> With best regards,
> Alexander Korotkov.


Re: Patch: add conversion from pg_wchar to multibyte

From
Alexander Korotkov
Date:
On Tue, May 22, 2012 at 11:50 AM, Tatsuo Ishii <ishii@postgresql.org> wrote:
I think it's possible. The first characters are defined like this:

#define IS_LCPRV1(c)    ((unsigned char)(c) == 0x9a || (unsigned char)(c) == 0x9b)
#define IS_LCPRV2(c)    ((unsigned char)(c) == 0x9c || (unsigned char)(c) == 0x9d)

It seems IS_LCPRV1 is not used in any of the encodings PostgreSQL
supports at this point, which means there's zero chance that existing
databases include LCPRV1. So you can safely ignore it.

For IS_LCPRV2, it is only used for Chinese encodings (EUC_TW and BIG5)
in backend/utils/mb/conversion_procs/euc_tw_and_big5/euc_tw_and_big5.c
and it is fixed to 0x9d.  So you can always restore the value to 0x9d.

> Also in this part of code we're shifting first byte by 16 bits:
>
> if (IS_LC1(*from) && len >= 2)
> {
>     *to = *from++ << 16;
>     *to |= *from++;
>     len -= 2;
> }
> else if (IS_LCPRV1(*from) && len >= 3)
> {
>     from++;
>     *to = *from++ << 16;
>     *to |= *from++;
>     len -= 3;
> }
>
> Why don't we shift it by 8 bits?

Because we want the first byte of the LC1 case to be placed in the second
byte of the wchar, i.e.:

0th byte: always 0
1st byte: leading byte (the first byte of the multibyte)
2nd byte: always 0
3rd byte: the second byte of the multibyte

Note that we always assume that the 1st byte (called the "leading byte",
LB for short) represents the id of the character set (from 0x81 to
0xff) in the MULE INTERNAL encoding. For the mapping between LB and
charsets, see pg_wchar.h.

Thanks for your comments. They clarify a lot.
But I still don't see how we can distinguish IS_LCPRV2 from IS_LC2. Isn't it possible for them to produce the same pg_wchar?

------
With best regards,
Alexander Korotkov.

Re: Patch: add conversion from pg_wchar to multibyte

From
Tatsuo Ishii
Date:
> Thanks for your comments. They clarify a lot.
> But I still don't see how we can distinguish IS_LCPRV2 from IS_LC2.
> Isn't it possible for them to produce the same pg_wchar?

If LB is in 0x90 - 0x99 range, then they are LC2.
If LB is in 0xf0 - 0xff range, then they are LCPRV2.
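
Based on those ranges (plus the leading-byte ranges from pg_wchar.h), here is a sketch, not the patch itself, of how the inverse conversion can dispatch on the byte stored at shift 16:

    /* Sketch only: how many multibyte bytes a given wchar decodes to. */
    static int
    mule_bytes_for_wchar(unsigned int w)
    {
        unsigned char lb = (w >> 16) & 0xFF;

        if (lb >= 0x81 && lb <= 0x8d)   /* LC1: official single-byte charset */
            return 2;
        if (lb >= 0x90 && lb <= 0x99)   /* LC2: official multibyte charset */
            return 3;
        if (lb >= 0xf0 && lb <= 0xfe)   /* LCPRV2: re-emit the 0x9c/0x9d prefix */
            return 4;
        return 1;                       /* plain ASCII */
    }
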
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


Re: Patch: add conversion from pg_wchar to multibyte

From
Alexander Korotkov
Date:
On Tue, May 22, 2012 at 3:27 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:
> Thanks for your comments. They clarify a lot.
> But I still don't see how we can distinguish IS_LCPRV2 from IS_LC2.
> Isn't it possible for them to produce the same pg_wchar?

If LB is in 0x90 - 0x99 range, then they are LC2.
If LB is in 0xf0 - 0xff range, then they are LCPRV2.

Thanks. I rewrote the inverse conversion from pg_wchar to mule. A new version of the patch is attached.

------
With best regards,
Alexander Korotkov.
Attachment

Re: Patch: add conversion from pg_wchar to multibyte

From
Tatsuo Ishii
Date:
> On Tue, May 22, 2012 at 3:27 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:
> 
>> > Thanks for your comments. They clarify a lot.
>> > But I still don't see how we can distinguish IS_LCPRV2 from IS_LC2.
>> > Isn't it possible for them to produce the same pg_wchar?
>>
>> If LB is in 0x90 - 0x99 range, then they are LC2.
>> If LB is in 0xf0 - 0xff range, then they are LCPRV2.
>>
> 
> Thanks. I rewrote inverse conversion from pg_wchar to mule. New version of
> patch is attached.

[forgot to cc: to the list]

I looked into your patch, especially pg_wchar2euc_with_len(const
pg_wchar *from, unsigned char *to, int len).

I think there's a little room to enhance the function:

    if (*from >> 24)
    {
        *to++ = *from >> 24;
        *to++ = (*from >> 16) & 0xFF;
        *to++ = (*from >> 8) & 0xFF;
        *to++ = *from & 0xFF;
        cnt += 4;
    }

Since the function walks through this for every single wchar, something like:

    if ((c = *from >> 24))
    {
        *to++ = c;
        *to++ = (*from >> 16) & 0xFF;
        *to++ = (*from >> 8) & 0xFF;
        *to++ = *from & 0xFF;
        cnt += 4;
    }

will save a few cycles (I'm not sure whether the optimizer produces
similar code anyway, though).
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


Re: Patch: add conversion from pg_wchar to multibyte

From
Robert Haas
Date:
On Thu, May 24, 2012 at 12:04 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> Thanks. I rewrote inverse conversion from pg_wchar to mule. New version of
> patch is attached.

Review:

It looks to me like pg_wchar2utf_with_len will not work, because
unicode_to_utf8 returns its second argument unmodified - not, as your
code seems to assume, the byte following what was already written.
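
A sketch of one way the loop could advance correctly, assuming (as in the backend) that unicode_to_utf8() writes the encoded bytes at its second argument and pg_utf_mblen() reports their length from the first byte:

    typedef unsigned int pg_wchar;
    extern unsigned char *unicode_to_utf8(pg_wchar c, unsigned char *utf8string);
    extern int pg_utf_mblen(const unsigned char *s);

    /* Sketch only: advance "to" by the encoded length, not by the return value. */
    static int
    wchar2utf_sketch(const pg_wchar *from, unsigned char *to, int len)
    {
        int cnt = 0;

        while (len > 0 && *from)
        {
            int char_len;

            unicode_to_utf8(*from, to);
            char_len = pg_utf_mblen(to);   /* how many bytes were just written */
            cnt += char_len;
            to += char_len;
            from++;
            len--;
        }
        *to = 0;
        return cnt;
    }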

MULE also looks problematic.  The code that you've written isn't
symmetric with the opposite conversion, unlike what you did in all
other cases, and I don't understand why.  I'm also somewhat baffled by
the reverse conversion: it treats a multi-byte sequence beginning with
a byte for which IS_LCPRV1(x) returns true as invalid if there are
less than 3 bytes available, but it only reads two; similarly, for
IS_LCPRV2(x), it demands 4 bytes but converts only 3.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Patch: add conversion from pg_wchar to multibyte

From
Alexander Korotkov
Date:
On Wed, Jun 27, 2012 at 11:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
It looks to me like pg_wchar2utf_with_len will not work, because
unicode_to_utf8 returns its second argument unmodified - not, as your
code seems to assume, the byte following what was already written.

Fixed.
 
MULE also looks problematic.  The code that you've written isn't
symmetric with the opposite conversion, unlike what you did in all
other cases, and I don't understand why.  I'm also somewhat baffled by
the reverse conversion: it treats a multi-byte sequence beginning with
a byte for which IS_LCPRV1(x) returns true as invalid if there are
less than 3 bytes available, but it only reads two; similarly, for
IS_LCPRV2(x), it demands 4 bytes but converts only 3.
 
Should we keep the existing pg_wchar representation for the MULE encoding? Perhaps we can modify it like in the 0.1 version of the patch, in order to make it more transparent.

------
With best regards,
Alexander Korotkov.
Attachment

Re: Patch: add conversion from pg_wchar to multibyte

From
Robert Haas
Date:
On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> MULE also looks problematic.  The code that you've written isn't
>> symmetric with the opposite conversion, unlike what you did in all
>> other cases, and I don't understand why.  I'm also somewhat baffled by
>> the reverse conversion: it treats a multi-byte sequence beginning with
>> a byte for which IS_LCPRV1(x) returns true as invalid if there are
>> less than 3 bytes available, but it only reads two; similarly, for
>> IS_LCPRV2(x), it demands 4 bytes but converts only 3.
>
> Should we save existing pg_wchar representation for MULE encoding? Probably,
> we can modify it like in 0.1 version of patch in order to make it more
> transparent.

Changing the encoding would break pg_upgrade, so -1 from me on that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Patch: add conversion from pg_wchar to multibyte

From
Alexander Korotkov
Date:
On Mon, Jul 2, 2012 at 8:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> MULE also looks problematic.  The code that you've written isn't
>> symmetric with the opposite conversion, unlike what you did in all
>> other cases, and I don't understand why.  I'm also somewhat baffled by
>> the reverse conversion: it treats a multi-byte sequence beginning with
>> a byte for which IS_LCPRV1(x) returns true as invalid if there are
>> less than 3 bytes available, but it only reads two; similarly, for
>> IS_LCPRV2(x), it demands 4 bytes but converts only 3.
>
> Should we save existing pg_wchar representation for MULE encoding? Probably,
> we can modify it like in 0.1 version of patch in order to make it more
> transparent.

Changing the encoding would break pg_upgrade, so -1 from me on that.

I didn't realize that we store pg_wchar on disk somewhere. I thought it was only an in-memory representation. Where do we store pg_wchar on disk?

------
With best regards,
Alexander Korotkov.

Re: Patch: add conversion from pg_wchar to multibyte

From
Robert Haas
Date:
On Mon, Jul 2, 2012 at 4:33 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Mon, Jul 2, 2012 at 8:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov@gmail.com>
>> wrote:
>> >> MULE also looks problematic.  The code that you've written isn't
>> >> symmetric with the opposite conversion, unlike what you did in all
>> >> other cases, and I don't understand why.  I'm also somewhat baffled by
>> >> the reverse conversion: it treats a multi-byte sequence beginning with
>> >> a byte for which IS_LCPRV1(x) returns true as invalid if there are
>> >> less than 3 bytes available, but it only reads two; similarly, for
>> >> IS_LCPRV2(x), it demands 4 bytes but converts only 3.
>> >
>> > Should we save existing pg_wchar representation for MULE encoding?
>> > Probably,
>> > we can modify it like in 0.1 version of patch in order to make it more
>> > transparent.
>>
>> Changing the encoding would break pg_upgrade, so -1 from me on that.
>
>
> I didn't realize that we store pg_wchar on disk somewhere. I thought it is
> only in-memory representation. Where do we store pg_wchar on disk?

OK, now I'm confused.  I was thinking (incorrectly) that you were
talking about changing the multibyte encoding, which of course is
saved on disk all over the place.  Changing the wchar encoding is a
different kettle of fish, and I have no idea what that would or would
not break.  But I don't see why we'd want to do such a thing.  We just
need to make the MB->WCHAR and WCHAR->MB transformations mirror images
of each other; why is that hard?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Patch: add conversion from pg_wchar to multibyte

From
Alexander Korotkov
Date:
On Tue, Jul 3, 2012 at 12:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jul 2, 2012 at 4:33 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Mon, Jul 2, 2012 at 8:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov@gmail.com>
>> wrote:
>> >> MULE also looks problematic.  The code that you've written isn't
>> >> symmetric with the opposite conversion, unlike what you did in all
>> >> other cases, and I don't understand why.  I'm also somewhat baffled by
>> >> the reverse conversion: it treats a multi-byte sequence beginning with
>> >> a byte for which IS_LCPRV1(x) returns true as invalid if there are
>> >> less than 3 bytes available, but it only reads two; similarly, for
>> >> IS_LCPRV2(x), it demands 4 bytes but converts only 3.
>> >
>> > Should we save existing pg_wchar representation for MULE encoding?
>> > Probably,
>> > we can modify it like in 0.1 version of patch in order to make it more
>> > transparent.
>>
>> Changing the encoding would break pg_upgrade, so -1 from me on that.
>
>
> I didn't realize that we store pg_wchar on disk somewhere. I thought it is
> only in-memory representation. Where do we store pg_wchar on disk?

OK, now I'm confused.  I was thinking (incorrectly) that you were
talking about changing the multibyte encoding, which of course is
saved on disk all over the place.  Changing the wchar encoding is a
different kettle of fish, and I have no idea what that would or would
not break.  But I don't see why we'd want to do such a thing.  We just
need to make the MB->WCHAR and WCHAR->MB transformations mirror images
of each other; why is that hard?

So, I provided such a transformation in versions 0.3 and 0.4, based on the explanation from Tatsuo Ishii. The problem is that both conversions are nontrivial, and it's not evident that they are mirror images (understanding that they are requires some additional assumptions about the encodings, which are not evident from the transformations themselves). I thought you mentioned that problem two messages back.

------
With best regards,
Alexander Korotkov.

Re: Patch: add conversion from pg_wchar to multibyte

From
Robert Haas
Date:
On Mon, Jul 2, 2012 at 4:46 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> So, I provided such transformation in versions 0.3 and 0.4 based on
> explanation from Tatsuo Ishii. The problem is that both conversions are
> nontrivial and it's not evident that they are mirror (understanding that
> they are mirror require some additional assumptions about encodings, not
> evident just by transformation itself). I though you mention that problem
> two message back.

Yeah, I did.  I think I may be a bit confused here, so let me try to
understand this a bit better.  It seems like pg_mule2wchar_with_len
uses the following algorithm:

- If the first character IS_LC1 (0x81-0x8d), decode two bytes, stored
with shifts of 16 and 0.
- If the first character IS_LCPRV1 (0x9a-0x9b), decode three bytes,
skipping the first one and storing the remaining two with shifts of 16
and 0.
- If the first character IS_LC2 (0x90-0x99), decode three bytes,
stored with shifts of 16, 8, and 0.
- If the first character IS_LCPRV2 (0x9c-0x9d), decode four bytes,
skipping the first one and storing the remaining three with offsets of
16, 8, and 0.

In the reverse transformation implemented by pg_wchar2mule_with_len,
if the byte stored with shift 16 IS_LC1 or IS_LC2, then we decode 2 or
3 bytes, respectively, exactly as I would expect.  ASCII decoding is
also as I would expect.  The case I don't understand is what happens
when the leading byte of the multibyte character was IS_LCPRV1 or
IS_LCPRV2.  In that case, we ought to decode three bytes if it was
IS_LCPRV1 and four bytes if it was IS_LCPRV2, but actually it seems we
always decode 4 bytes.  That implies that the IS_LCPRV1() case in
pg_mule2wchar_with_len is dead code, and that any 4 byte characters
are always of the form 0x9d 0xf? 0x?? 0x??; maybe that's what the
comment there is driving at, but it's not too clear to me.

Am I close?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Patch: add conversion from pg_wchar to multibyte

From
Tatsuo Ishii
Date:
> Yeah, I did.  I think I may be a bit confused here, so let me try to
> understand this a bit better.  It seems like pg_mule2wchar_with_len
> uses the following algorithm:
> 
> - If the first character IS_LC1 (0x81-0x8d), decode two bytes, stored
> with shifts of 16 and 0.
> - If the first character IS_LCPRV1 (0x9a-0x9b), decode three bytes,
> skipping the first one and storing the remaining two with shifts of 16
> and 0.
> - If the first character IS_LC2 (0x90-0x99), decode three bytes,
> stored with shifts of 16, 8, and 0.
> - If the first character IS_LCPRV2 (0x9c-0x9d), decode four bytes,
> skipping the first one and storing the remaining three with offsets of
> 16, 8, and 0.

Correct.

> In the reverse transformation implemented by pg_wchar2mule_with_len,
> if the byte stored with shift 16 IS_LC1 or IS_LC2, then we decode 2 or
> 3 bytes, respectively, exactly as I would expect.  ASCII decoding is
> also as I would expect.  The case I don't understand is what happens
> when the leading byte of the multibyte character was IS_LCPRV1 or
> IS_LCPRV2.  In that case, we ought to decode three bytes if it was
> IS_LCPRV1 and four bytes if it was IS_LCPRV2, but actually it seems we
> always decode 4 bytes.  That implies that the IS_LCPRV1() case in
> pg_mule2wchar_with_len is dead code,

Yes, dead code unless we want to support the following encodings in the
future (from include/mb/pg_wchar.h):

#define LC_SISHENG             0xa0    /* Chinese SiSheng characters for
                                        * PinYin/ZhuYin (not supported) */
#define LC_IPA                 0xa1    /* IPA (International Phonetic
                                        * Association) (not supported) */
#define LC_VISCII_LOWER        0xa2    /* Vietnamese VISCII1.1 lower-case (not
                                        * supported) */
#define LC_VISCII_UPPER        0xa3    /* Vietnamese VISCII1.1 upper-case (not
                                        * supported) */
#define LC_ARABIC_DIGIT        0xa4    /* Arabic digit (not supported) */
#define LC_ARABIC_1_COLUMN     0xa5    /* Arabic 1-column (not supported) */
#define LC_ASCII_RIGHT_TO_LEFT 0xa6    /* ASCII (left half of ISO8859-1) with
                                        * right-to-left direction (not
                                        * supported) */
#define LC_LAO                 0xa7    /* Lao characters (ISO10646 0E80..0EDF)
                                        * (not supported) */
#define LC_ARABIC_2_COLUMN     0xa8    /* Arabic 2-column (not supported) */

> and that any 4 byte characters
> are always of the form 0x9d 0xf? 0x?? 0x??; maybe that's what the
> comment there is driving at, but it's not too clear to me.

Yes, that's because we only support EUC_TW and BIG5 which are using
IS_LCPRV2 in the mule interal encoding, as stated in the comment.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


Re: Patch: add conversion from pg_wchar to multibyte

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> In the reverse transformation implemented by pg_wchar2mule_with_len,
> if the byte stored with shift 16 IS_LC1 or IS_LC2, then we decode 2 or
> 3 bytes, respectively, exactly as I would expect.  ASCII decoding is
> also as I would expect.  The case I don't understand is what happens
> when the leading byte of the multibyte character was IS_LCPRV1 or
> IS_LCPRV2.

Some inspection of pg_wchar.h suggests that the IS_LCPRV1 and IS_LCPRV2
cases are unused: the file doesn't define any encoding labels that match
the byte values they accept, nor do the comments suggest that Emacs has
any such labels either.  If true, it would not be much of a stretch to
believe that any code claiming to support these cases could be broken.
        regards, tom lane


Re: Patch: add conversion from pg_wchar to multibyte

From
Tom Lane
Date:
I wrote:
> Some inspection of pg_wchar.h suggests that the IS_LCPRV1 and IS_LCPRV2
> cases are unused: the file doesn't define any encoding labels that match
> the byte values they accept, nor do the comments suggest that Emacs has
> any such labels either.

Scratch that --- I was misled by the fond illusion that our code
wouldn't use magic hex literals for encoding labels.  Stuff like this:

    /* 0x9d means LCPRV2 */
    if (c1 == LC_CNS11643_1 || c1 == LC_CNS11643_2 || c1 == 0x9d)

seems to me to be well below the minimum acceptable quality standards
for Postgres code.

Having said that, grepping the src/backend/utils/mb/conversion_procs/
reveals no sign that 0x9a, 0x9b, or 0x9c are used anywhere with the
meanings that the IS_LCPRV1 and IS_LCPRV2 macros assign to them.
Furthermore, AFAICS the 0x9d case is only used in euc_tw_and_big5/,
with the following byte being one of the LC_CNS11643_[3-7] constants.

Given that these constants are treading on encoding ID namespace that
Emacs upstream might someday decide to assign, I think we'd be well
advised to *not* start installing any code that thinks that 9a-9c
mean something.
        regards, tom lane


Re: Patch: add conversion from pg_wchar to multibyte

From
Robert Haas
Date:
On Mon, Jul 2, 2012 at 7:33 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:
>> Yeah, I did.  I think I may be a bit confused here, so let me try to
>> understand this a bit better.  It seems like pg_mule2wchar_with_len
>> uses the following algorithm:
>>
>> - If the first character IS_LC1 (0x81-0x8d), decode two bytes, stored
>> with shifts of 16 and 0.
>> - If the first character IS_LCPRV1 (0x9a-0x9b), decode three bytes,
>> skipping the first one and storing the remaining two with shifts of 16
>> and 0.
>> - If the first character IS_LC2 (0x90-0x99), decode three bytes,
>> stored with shifts of 16, 8, and 0.
>> - If the first character IS_LCPRV2 (0x9c-0x9d), decode four bytes,
>> skipping the first one and storing the remaining three with offsets of
>> 16, 8, and 0.
>
> Correct.
>
>> In the reverse transformation implemented by pg_wchar2mule_with_len,
>> if the byte stored with shift 16 IS_LC1 or IS_LC2, then we decode 2 or
>> 3 bytes, respectively, exactly as I would expect.  ASCII decoding is
>> also as I would expect.  The case I don't understand is what happens
>> when the leading byte of the multibyte character was IS_LCPRV1 or
>> IS_LCPRV2.  In that case, we ought to decode three bytes if it was
>> IS_LCPRV1 and four bytes if it was IS_LCPRV2, but actually it seems we
>> always decode 4 bytes.  That implies that the IS_LCPRV1() case in
>> pg_mule2wchar_with_len is dead code,
>
> Yes, dead code unless we want to support following encodings in the
> future(from include/mb/pg_wchar.h:
> #define LC_SISHENG                      0xa0/* Chinese SiSheng characters for
>                                                                  * PinYin/ZhuYin (not supported) */
> #define LC_IPA                          0xa1/* IPA (International Phonetic Association)
>                                                                  * (not supported) */
> #define LC_VISCII_LOWER         0xa2/* Vietnamese VISCII1.1 lower-case (not
>                                                                  * supported) */
> #define LC_VISCII_UPPER         0xa3/* Vietnamese VISCII1.1 upper-case (not
>                                                                  * supported) */
> #define LC_ARABIC_DIGIT         0xa4    /* Arabic digit (not supported) */
> #define LC_ARABIC_1_COLUMN      0xa5    /* Arabic 1-column (not supported) */
> #define LC_ASCII_RIGHT_TO_LEFT  0xa6    /* ASCII (left half of ISO8859-1) with
>                                                                                  * right-to-left direction (not
>                                                                                  * supported) */
> #define LC_LAO                          0xa7/* Lao characters (ISO10646 0E80..0EDF) (not
>                                                                  * supported) */
> #define LC_ARABIC_2_COLUMN      0xa8    /* Arabic 1-column (not supported) */
>
>> and that any 4 byte characters
>> are always of the form 0x9d 0xf? 0x?? 0x??; maybe that's what the
>> comment there is driving at, but it's not too clear to me.
>
> Yes, that's because we only support EUC_TW and BIG5 which are using
> IS_LCPRV2 in the mule interal encoding, as stated in the comment.

OK.  So, in that case, I suggest that if the leading byte is non-zero,
we emit 0x9d followed by the three available bytes, instead of first
testing whether the first byte is >= 0xf0.  That test seems to serve
no purpose but to confuse the issue.

I further suggest that we improve the comments on the mule functions
for both wchar->mb and mb->wchar to make all this more clear.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Patch: add conversion from pg_wchar to multibyte

From
Tatsuo Ishii
Date:
> OK.  So, in that case, I suggest that if the leading byte is non-zero,
> we emit 0x9d followed by the three available bytes, instead of first
> testing whether the first byte is >= 0xf0.  That test seems to serve
> no purpose but to confuse the issue.

Probably the code should look like this (see below comment):

    else if (lb >= 0xf0 && lb <= 0xfe)
    {
        if (lb <= 0xf4)
            *to++ = 0x9c;
        else
            *to++ = 0x9d;
        *to++ = lb;
        *to++ = (*from >> 8) & 0xff;
        *to++ = *from & 0xff;
        cnt += 4;
    }

> I further suggest that we improve the comments on the mule functions
> for both wchar->mb and mb->wchar to make all this more clear.

I have added comments about the mule internal encoding, by refreshing
my memory and from an old document found on the web
(http://mibai.tec.u-ryukyu.ac.jp/cgi-bin/info2www?%28mule%29Buffer%20and%20string).

Please take a look.  BTW, it seems the conversion between multibyte and
wchar can round-trip in the case where the leading character is LCPRV2:

If the second byte of the wchar (out of the 4 bytes of the wchar; the
first byte is always 0x00) is in the range 0xf0 to 0xf4, then the first
byte of the multibyte must be 0x9c.  If the second byte of the wchar is
in the range 0xf5 to 0xfe, then the first byte of the multibyte must be
0x9d.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index d456309..1148eb5 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -37,6 +37,31 @@ typedef unsigned int pg_wchar;
 #define ISSJISTAIL(c) (((c) >= 0x40 && (c) <= 0x7e) || ((c) >= 0x80 && (c) <= 0xfc))
 
 /*
+ * Currently PostgreSQL supports 5 types of mule internal encodings:
+ *
+ * 1) 1-byte ASCII characters, each byte is below 0x7f.
+ *
+ * 2) "Official" single byte charsets such as ISO 8859 latin1.  Each
+ *    mule character consists of 2 bytes: LC1 + C1, where LC1 is
+ *    corresponds to each charset and in range of 0x81 to 0x8d and C1
+ *    is in rage of 0xa0 to 0xff(ISO 8859-1 for example, plus each
+ *    high bit is on).
+ *
+ * 3) "Private" single byte charsets such as SISHENG.  Each mule
+ *    character consists of 3 bytes: LCPRV1 + LC12 + C1 where LCPRV1
+ *    is either 0x9a (if LC12 is in range of 0xa0 to 0xdf) or 0x9b (if
+ *    LC12 is in range of 0xe0 to 0xef).
+ *
+ * 4) "Official" multibyte charsets such as JIS X0208.  Each mule
+ *    character consists of 3 bytes: LC2 + C1 + C2 where LC2 is
+ *    corresponds to each charset and is in rage of 0x90 to 0x99. C1
+ *    and C2 is in rage of 0xa0 to 0xff(each high bit is on).
+ *
+ * 5) "Private" multibyte charsets such as CNS 11643-1992 Plane 3.
+ *    Each mule character consists of 4 bytes: LCPRV2 + LC22 + C1 +
+ *    C2.  where LCPRV2 is either 0x9c (if LC12 is in range of 0xf0 to
+ *    0xf4) or 0x9d (if LC22 is in range of 0xf5 to 0xfe).
+ * * Leading byte types or leading prefix byte for MULE internal code. * See http://www.xemacs.org for more details.
(there is a doc titled * "XEmacs Internals Manual", "MULE Character Sets and Encodings" 

Re: Patch: add conversion from pg_wchar to multibyte

From
Alexander Korotkov
Date:
On Tue, Jul 3, 2012 at 10:17 AM, Tatsuo Ishii <ishii@postgresql.org> wrote:
> OK.  So, in that case, I suggest that if the leading byte is non-zero,
> we emit 0x9d followed by the three available bytes, instead of first
> testing whether the first byte is >= 0xf0.  That test seems to serve
> no purpose but to confuse the issue.

Probably the code should look like this (see below comment):

                else if (lb >= 0xf0 && lb <= 0xfe)
                {
                    if (lb <= 0xf4)
                          *to++ = 0x9c;
            else
                          *to++ = 0x9d;
                        *to++ = lb;
                        *to++ = (*from >> 8) & 0xff;
                        *to++ = *from & 0xff;
                        cnt += 4;

It's likely we also need to assign some names to all these numbers (0xf0, 0xf4, 0xfe, 0x9c, 0x9d). But it's hard for me to invent such names.
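
For illustration only, something along these lines might work (the names here are invented, not taken from any committed code):

    /* Hypothetical names for the marker bytes (invented for illustration) */
    #define LCPRV2_A    0x9c    /* prefix when the leading byte is 0xf0-0xf4 */
    #define LCPRV2_B    0x9d    /* prefix when the leading byte is 0xf5-0xfe */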
 
> I further suggest that we improve the comments on the mule functions
> for both wchar->mb and mb->wchar to make all this more clear.

I have added comments about mule internal encoding by refreshing my
memory and from old document found on
web(http://mibai.tec.u-ryukyu.ac.jp/cgi-bin/info2www?%28mule%29Buffer%20and%20string).

Please take a look.  BTW, it seems the conversion between multibyte and
wchar can round-trip in the case where the leading character is LCPRV2:

If the second byte of the wchar (out of the 4 bytes of the wchar; the
first byte is always 0x00) is in the range 0xf0 to 0xf4, then the first
byte of the multibyte must be 0x9c.  If the second byte of the wchar is
in the range 0xf5 to 0xfe, then the first byte of the multibyte must be
0x9d.
 
Should I integrate these code changes into my patch? Or should we commit them first?

------
With best regards,
Alexander Korotkov.

Re: Patch: add conversion from pg_wchar to multibyte

From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> It's likely we also need to assign some names to all these numbers
> (0xf0, 0xf4, 0xfe, 0x9c, 0x9d). But it's hard for me to invent such names.

The encoding ID byte values already have names (see pg_wchar.h), but the
private prefix bytes don't.  I griped about that upthread.  I agree this
code needs some basic readability cleanup that's independent of your
feature addition.  It'd likely be reasonable to do that as a separate
patch.
        regards, tom lane


Re: Patch: add conversion from pg_wchar to multibyte

From
Tatsuo Ishii
Date:
> I have added comments about mule internal encoding by refreshing my
> memory and from old document found on
> web(http://mibai.tec.u-ryukyu.ac.jp/cgi-bin/info2www?%28mule%29Buffer%20and%20string).

Any objection to apply my patch?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


Re: Patch: add conversion from pg_wchar to multibyte

From
Tom Lane
Date:
Tatsuo Ishii <ishii@postgresql.org> writes:
>> I have added comments about mule internal encoding by refreshing my
>> memory and from old document found on
>> web(http://mibai.tec.u-ryukyu.ac.jp/cgi-bin/info2www?%28mule%29Buffer%20and%20string).

> Any objection to apply my patch?

It needs a bit of copy-editing, and I think we need to do more than just
add comments: the various byte values should have #defines so that you
can grep for code usages.  I'll see what I can do with it.
        regards, tom lane


Re: Patch: add conversion from pg_wchar to multibyte

From
Tom Lane
Date:
I wrote:
> Tatsuo Ishii <ishii@postgresql.org> writes:
>>> I have added comments about mule internal encoding by refreshing my
>>> memory and from old document found on
>>> web(http://mibai.tec.u-ryukyu.ac.jp/cgi-bin/info2www?%28mule%29Buffer%20and%20string).

>> Any objection to apply my patch?

> It needs a bit of copy-editing, and I think we need to do more than just
> add comments: the various byte values should have #defines so that you
> can grep for code usages.  I'll see what I can do with it.

I cleaned up the comments in pg_wchar.h some more, added #define
symbols for the LCPRVn marker codes, and committed it.

So far as I can see, the only LCPRVn marker code that is actually in
use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
that I can find.

I also read in the xemacs internals doc, at
http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
that XEmacs thinks the marker code for private single-byte charsets
is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
moreover they think 0x9a-0x9d are potential future official multibyte
charset codes.  I don't know how we got to the current state of using
0x9a-0x9d as private charset markers, but it seems pretty inconsistent
with XEmacs.

Since only 0x9d could possibly be on-disk anywhere at the moment (unless
I'm missing something), I think we would be well advised to redefine our
marker codes thus:
    LCPRV1    0x9e (only) (matches XEmacs spec)
    LCPRV2    0x9d (only) (doesn't match XEmacs, but too late to change)

This would simplify and speed up the IS_LCPRVn macros, simplify the
conversions that Alexander is worried about, and get us out from under
the risk that XEmacs will assign their next three official multibyte
encoding IDs.  We're still in trouble if they ever get to 0x9d, but
since that's the last code they have, I bet they will be in no hurry
to use it up.
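
For concreteness, the IS_LCPRVn tests would collapse to single comparisons, along these lines (a sketch of the proposal, not committed code):

    #define IS_LCPRV1(c)    ((unsigned char) (c) == 0x9e)
    #define IS_LCPRV2(c)    ((unsigned char) (c) == 0x9d)
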
        regards, tom lane


Re: Patch: add conversion from pg_wchar to multibyte

From
Tatsuo Ishii
Date:
> So far as I can see, the only LCPRVn marker code that is actually in
> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
> that I can find.
> 
> I also read in the xemacs internals doc, at
> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
> that XEmacs thinks the marker code for private single-byte charsets
> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
> moreover they think 0x9a-0x9d are potential future official multibyte
> charset codes.  I don't know how we got to the current state of using
> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent
> with XEmacs.

At the time when the mule internal code was introduced to PostgreSQL,
xemacs did not have multi-encoding capability, and mule (a patch to
emacs) was the only implementation that could use multiple encodings.
So I used the specification of mule documented in the URL I wrote.

> Since only 0x9d could possibly be on-disk anywhere at the moment (unless
> I'm missing something), I think we would be well advised to redefine our
> marker codes thus:
> 
>     LCPRV1    0x9e (only) (matches XEmacs spec)
>     LCPRV2    0x9d (only) (doesn't match XEmacs, but too late to change)
> 
> This would simplify and speed up the IS_LCPRVn macros, simplify the
> conversions that Alexander is worried about, and get us out from under
> the risk that XEmacs will assign their next three official multibyte
> encoding IDs.  We're still in trouble if they ever get to 0x9d, but
> since that's the last code they have, I bet they will be in no hurry
> to use it up.
> 
>             regards, tom lane


Re: Patch: add conversion from pg_wchar to multibyte

From
Robert Haas
Date:
On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> [ new patch ]

With the improved comments in pg_wchar.h, it seemed clear what needed
to be done here, so I fixed up the MULE conversion and committed this.
I'd appreciate it if someone would check my work, but I think it's
right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Patch: add conversion from pg_wchar to multibyte

From
Tatsuo Ishii
Date:
> On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> [ new patch ]
> 
> With the improved comments in pg_wchar.h, it seemed clear what needed
> to be done here, so I fixed up the MULE conversion and committed this.
>  I'd appreciate it if someone would check my work, but I think it's
> right.

For me your commit looks good.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


Re: Patch: add conversion from pg_wchar to multibyte

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> [ new patch ]

> With the improved comments in pg_wchar.h, it seemed clear what needed
> to be done here, so I fixed up the MULE conversion and committed this.
>  I'd appreciate it if someone would check my work, but I think it's
> right.

Hm, several of these routines seem to neglect to advance the "from"
pointer?
        regards, tom lane


Re: Patch: add conversion from pg_wchar to multibyte

From
Tom Lane
Date:
Tatsuo Ishii <ishii@postgresql.org> writes:
>> So far as I can see, the only LCPRVn marker code that is actually in
>> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
>> that I can find.
>> 
>> I also read in the xemacs internals doc, at
>> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
>> that XEmacs thinks the marker code for private single-byte charsets
>> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
>> moreover they think 0x9a-0x9d are potential future official multibyte
>> charset codes.  I don't know how we got to the current state of using
>> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent
>> with XEmacs.

> At the time the mule internal code was introduced to PostgreSQL,
> xemacs did not have multi-encoding capability, and mule (a patch to
> emacs) was the only implementation that supported multiple encodings.
> So I used the specification of mule documented in the URL I mentioned.

I see.  Given that upstream has decided that a simpler definition is
more appropriate, is there any reason not to follow their lead, to the
extent that we can do so without breaking existing on-disk data?
        regards, tom lane


Re: Patch: add conversion from pg_wchar to multibyte

From
Robert Haas
Date:
On Thu, Jul 5, 2012 at 7:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>>> [ new patch ]
>
>> With the improved comments in pg_wchar.h, it seemed clear what needed
>> to be done here, so I fixed up the MULE conversion and committed this.
>>  I'd appreciate it if someone would check my work, but I think it's
>> right.
>
> Hm, several of these routines seem to neglect to advance the "from"
> pointer?

Err... yeah.  That's not a bug I introduced, but I should have caught
it... and it does make me wonder how well this code was tested.

Does the attached look like an appropriate fix?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Patch: add conversion from pg_wchar to multibyte

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jul 5, 2012 at 7:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hm, several of these routines seem to neglect to advance the "from"
>> pointer?

> Err... yeah.  That's not a bug I introduced, but I should have caught
> it... and it does make me wonder how well this code was tested.

> Does the attached look like an appropriate fix?

I'd be inclined to put the from++ and len-- at the bottom of each loop,
and in that order every time, just for consistency and obviousness.
But yeah, that's basically what's needed.
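
For concreteness, a minimal sketch of the loop shape being suggested,
using a hypothetical single-byte conversion routine (illustrative, not
the exact committed code; pg_wchar comes from mb/pg_wchar.h):

static int
pg_wchar2single_with_len(const pg_wchar *from, unsigned char *to, int len)
{
    int         cnt = 0;

    while (len > 0 && *from)
    {
        *to++ = *from & 0xff;   /* single-byte encodings keep the low byte */
        cnt++;
        from++;                 /* the missing source-pointer advance ... */
        len--;                  /* ... and length decrement, at loop bottom */
    }
    *to = 0;
    return cnt;
}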
        regards, tom lane


Re: Patch: add conversion from pg_wchar to multibyte

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <ishii@postgresql.org> writes:
>>> So far as I can see, the only LCPRVn marker code that is actually in
>>> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
>>> that I can find.
>>> 
>>> I also read in the xemacs internals doc, at
>>> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
>>> that XEmacs thinks the marker code for private single-byte charsets
>>> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
>>> moreover they think 0x9a-0x9d are potential future official multibyte
>>> charset codes.  I don't know how we got to the current state of using
>>> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent
>>> with XEmacs.
> 
>> At the time the mule internal code was introduced to PostgreSQL,
>> xemacs did not have multi-encoding capability, and mule (a patch to
>> emacs) was the only implementation that supported multiple encodings.
>> So I used the specification of mule documented in the URL I mentioned.
> 
> I see.  Given that upstream has decided that a simpler definition is
> more appropriate, is there any reason not to follow their lead, to the
> extent that we can do so without breaking existing on-disk data?

Please let me spend the weekend understanding their latest spec.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


Re: Patch: add conversion from pg_wchar to multibyte

From
Robert Haas
Date:
On Thu, Jul 5, 2012 at 8:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Jul 5, 2012 at 7:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Hm, several of these routines seem to neglect to advance the "from"
>>> pointer?
>
>> Err... yeah.  That's not a bug I introduced, but I should have caught
>> it... and it does make me wonder how well this code was tested.
>
>> Does the attached look like an appropriate fix?
>
> I'd be inclined to put the from++ and len-- at the bottom of each loop,
> and in that order every time, just for consistency and obviousness.
> But yeah, that's basically what's needed.

OK, I've committed a slightly tweaked version of that patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Patch: add conversion from pg_wchar to multibyte

From
Tatsuo Ishii
Date:
>> Tatsuo Ishii <ishii@postgresql.org> writes:
>>>> So far as I can see, the only LCPRVn marker code that is actually in
>>>> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
>>>> that I can find.
>>>> 
>>>> I also read in the xemacs internals doc, at
>>>> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
>>>> that XEmacs thinks the marker code for private single-byte charsets
>>>> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
>>>> moreover they think 0x9a-0x9d are potential future official multibyte
>>>> charset codes.  I don't know how we got to the current state of using
>>>> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent
>>>> with XEmacs.
>> 
>>> At the time the mule internal code was introduced to PostgreSQL,
>>> xemacs did not have multi-encoding capability, and mule (a patch to
>>> emacs) was the only implementation that supported multiple encodings.
>>> So I used the specification of mule documented in the URL I mentioned.
>> 
>> I see.  Given that upstream has decided that a simpler definition is
>> more appropriate, is there any reason not to follow their lead, to the
>> extent that we can do so without breaking existing on-disk data?
> 
> Please let me spend the weekend understanding their latest spec.

This is an intermediate report on the internal multi-byte charset
implementation of the emacsen. I have read the link Tom showed. I also
made a quick scan of the xemacs-21.4.0 source code, especially
xemacs-21.4.0/src/mule-charset.h. It seems the web document is
essentially a copy of the comments in that file. I also looked into
other places in the xemacs code, and I think I can conclude that
xemacs 21.4's multi-byte implementation is based on the doc on the web.

Next I looked into the emacs 24.1 source code, because I could not find
any doc regarding emacs's (not xemacs's) implementation of the internal
multi-byte charset. I found the following in src/charset.h:

/* Leading-code followed by extended leading-code.    DIMENSION/COLUMN */
#define EMACS_MULE_LEADING_CODE_PRIVATE_11    0x9A /* 1/1 */
#define EMACS_MULE_LEADING_CODE_PRIVATE_12    0x9B /* 1/2 */
#define EMACS_MULE_LEADING_CODE_PRIVATE_21    0x9C /* 2/1 */
#define EMACS_MULE_LEADING_CODE_PRIVATE_22    0x9D /* 2/2 */

And these are used like this:

/* Read one non-ASCII character from INSTREAM.  The character is
   encoded in `emacs-mule' and the first byte is already read in
   C.  */

static int
read_emacs_mule_char (int c, int (*readbyte) (int, Lisp_Object), Lisp_Object readcharfun)
{
:
:
  else if (len == 3)
    {
      if (buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_11
          || buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_12)
        {
          charset = CHARSET_FROM_ID (emacs_mule_charset[buf[1]]);
          code = buf[2] & 0x7F;
        }
 

As far as I can tell, this is exactly the same way PostgreSQL handles
single-byte private character sets: they consist of 3 bytes, and the
leading byte is either 0x9a or 0x9b. Other examples regarding
single-byte/multi-byte private charsets can be seen in coding.c.
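
To make the correspondence concrete, this is roughly the branch of
PostgreSQL's pg_mule2wchar_with_len() (src/backend/utils/mb/wchar.c)
that consumes such a sequence (paraphrased, so treat the details as
approximate):

else if (IS_LCPRV1(*from) && len >= 3)
{
    from++;                     /* skip the 0x9a/0x9b marker byte */
    *to = *from++ << 16;        /* private charset ID into the third byte */
    *to |= *from++;             /* character code into the low byte */
    len -= 3;
}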

As far as I can tell, it seems emacs and xemacs employ different
implementations of multi-byte charsets regarding "private"
charsets. Emacs's is the same as PostgreSQL's, while xemacs's is not.
I am contacting the original Mule author to see if he knows anything
about this.

BTW, while looking into emacs's source code, I found their charset
definitions are in lisp/international/mule-conf.el. According to that
file, several new charsets have been added. Included is a patch to
follow their changes. This makes no change to current behavior, since
the patch just touches some comments and unsupported charsets.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 1bcdfbc..e44749e 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -108,7 +108,7 @@ typedef unsigned int pg_wchar;
 #define LC_KOI8_R            0x8b    /* Cyrillic KOI8-R */
 #define LC_ISO8859_5         0x8c    /* ISO8859 Cyrillic */
 #define LC_ISO8859_9         0x8d    /* ISO8859 Latin 5 (not supported yet) */
-/* #define FREE              0x8e    free (unused) */
+#define LC_ISO8859_15        0x8e    /* ISO8859 Latin 15 (not supported yet) */
 /* #define CONTROL_1         0x8f    control characters (unused) */
 
 /* Is a leading byte for "official" single byte encodings? */
@@ -119,14 +119,13 @@ typedef unsigned int pg_wchar;
  * 0x9a-0x9d are free. 0x9e and 0x9f are reserved.
  */
 #define LC_JISX0208_1978     0x90    /* Japanese Kanji, old JIS (not supported) */
-/* #define FREE              0x90    free (unused) */
 #define LC_GB2312_80         0x91    /* Chinese */
 #define LC_JISX0208          0x92    /* Japanese Kanji (JIS X 0208) */
 #define LC_KS5601            0x93    /* Korean */
 #define LC_JISX0212          0x94    /* Japanese Kanji (JIS X 0212) */
 #define LC_CNS11643_1        0x95    /* CNS 11643-1992 Plane 1 */
 #define LC_CNS11643_2        0x96    /* CNS 11643-1992 Plane 2 */
-/* #define FREE              0x97    free (unused) */
+#define LC_JISX0213_1        0x97    /* Japanese Kanji (JIS X 0213 Plane 1) (not supported) */
 #define LC_BIG5_1            0x98    /* Plane 1 Chinese traditional (not supported) */
 #define LC_BIG5_2            0x99    /* Plane 1 Chinese traditional (not supported) */
@@ -184,6 +183,12 @@ typedef unsigned int pg_wchar;
                                      * (not supported) */
 #define LC_TIBETAN_1_COLUMN  0xf1    /* Tibetan 1-column width glyphs
                                      * (not supported) */
+#define LC_UNICODE_SUBSET_2  0xf2    /* Unicode characters of the range U+2500..U+33FF.
+                                     * (not supported) */
+#define LC_UNICODE_SUBSET_3  0xf3    /* Unicode characters of the range U+E000..U+FFFF.
+                                     * (not supported) */
+#define LC_UNICODE_SUBSET    0xf4    /* Unicode characters of the range U+0100..U+24FF.
+                                     * (not supported) */
 #define LC_ETHIOPIC          0xf5    /* Ethiopic characters (not supported) */
 #define LC_CNS11643_3        0xf6    /* CNS 11643-1992 Plane 3 */
 #define LC_CNS11643_4        0xf7    /* CNS 11643-1992 Plane 4 */

Re: Patch: add conversion from pg_wchar to multibyte

From
Tatsuo Ishii
Date:
>>> Tatsuo Ishii <ishii@postgresql.org> writes:
>>>>> So far as I can see, the only LCPRVn marker code that is actually in
>>>>> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
>>>>> that I can find.
>>>>> 
>>>>> I also read in the xemacs internals doc, at
>>>>> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
>>>>> that XEmacs thinks the marker code for private single-byte charsets
>>>>> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
>>>>> moreover they think 0x9a-0x9d are potential future official multibyte
>>>>> charset codes.  I don't know how we got to the current state of using
>>>>> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent
>>>>> with XEmacs.
>>> 
>>>> At the time the mule internal code was introduced to PostgreSQL,
>>>> xemacs did not have multi-encoding capability, and mule (a patch to
>>>> emacs) was the only implementation that supported multiple encodings.
>>>> So I used the specification of mule documented in the URL I mentioned.
>>> 
>>> I see.  Given that upstream has decided that a simpler definition is
>>> more appropriate, is there any reason not to follow their lead, to the
>>> extent that we can do so without breaking existing on-disk data?
>> 
>> Please let me spend the weekend understanding their latest spec.
> 
> This is an intermediate report on the internal multi-byte charset
> implementation of the emacsen. I have read the link Tom showed. I also
> made a quick scan of the xemacs-21.4.0 source code, especially
> xemacs-21.4.0/src/mule-charset.h. It seems the web document is
> essentially a copy of the comments in that file. I also looked into
> other places in the xemacs code, and I think I can conclude that
> xemacs 21.4's multi-byte implementation is based on the doc on the web.
> 
> Next I looked into the emacs 24.1 source code, because I could not find
> any doc regarding emacs's (not xemacs's) implementation of the internal
> multi-byte charset. I found the following in src/charset.h:
> 
> /* Leading-code followed by extended leading-code.    DIMENSION/COLUMN */
> #define EMACS_MULE_LEADING_CODE_PRIVATE_11    0x9A /* 1/1 */
> #define EMACS_MULE_LEADING_CODE_PRIVATE_12    0x9B /* 1/2 */
> #define EMACS_MULE_LEADING_CODE_PRIVATE_21    0x9C /* 2/1 */
> #define EMACS_MULE_LEADING_CODE_PRIVATE_22    0x9D /* 2/2 */
> 
> And these are used like this:
> 
> /* Read one non-ASCII character from INSTREAM.  The character is
>    encoded in `emacs-mule' and the first byte is already read in
>    C.  */
> 
> static int
> read_emacs_mule_char (int c, int (*readbyte) (int, Lisp_Object), Lisp_Object readcharfun)
> {
> :
> :
>   else if (len == 3)
>     {
>       if (buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_11
>       || buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_12)
>     {
>       charset = CHARSET_FROM_ID (emacs_mule_charset[buf[1]]);
>       code = buf[2] & 0x7F;
>     }
> 
> As far as I can tell, this is exactly the same way PostgreSQL handles
> single-byte private character sets: they consist of 3 bytes, and the
> leading byte is either 0x9a or 0x9b. Other examples regarding
> single-byte/multi-byte private charsets can be seen in coding.c.
> 
> As far as I can tell, it seems emacs and xemacs employ different
> implementations of multi-byte charsets regarding "private"
> charsets. Emacs's is the same as PostgreSQL's, while xemacs's is not.
> I am contacting the original Mule author to see if he knows anything
> about this.

I got a reply from the Mule author, Kenichi Handa (the mail is in
Japanese, so I do not quote it here; if somebody wants to read the
original mail, please let me know). First of all, according to him my
understanding of emacs's implementation is correct. He did not know
about xemacs's implementation. Apparently the implementation of
xemacs was not led by the original mule author.

So which one of emacs/xemacs should we follow? My suggestion is not
to follow xemacs, and to leave the current treatment of private
leading bytes as it is, because emacs seems to be the more "right"
upstream compared with xemacs.

> BTW, while looking into emacs's source code, I found their charset
> definitions are in lisp/international/mule-conf.el. According to that
> file, several new charsets have been added. Included is a patch to
> follow their changes. This makes no change to current behavior, since
> the patch just touches some comments and unsupported charsets.

If there's no objection, I would like to commit this. Objection?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


Re: Patch: add conversion from pg_wchar to multibyte

From
Tatsuo Ishii
Date:
>>>> Tatsuo Ishii <ishii@postgresql.org> writes:
>>>>>> So far as I can see, the only LCPRVn marker code that is actually in
>>>>>> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
>>>>>> that I can find.
>>>>>> 
>>>>>> I also read in the xemacs internals doc, at
>>>>>> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
>>>>>> that XEmacs thinks the marker code for private single-byte charsets
>>>>>> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
>>>>>> moreover they think 0x9a-0x9d are potential future official multibyte
>>>>>> charset codes.  I don't know how we got to the current state of using
>>>>>> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent
>>>>>> with XEmacs.
>>>> 
>>>>> At the time the mule internal code was introduced to PostgreSQL,
>>>>> xemacs did not have multi-encoding capability, and mule (a patch to
>>>>> emacs) was the only implementation that supported multiple encodings.
>>>>> So I used the specification of mule documented in the URL I mentioned.
>>>> 
>>>> I see.  Given that upstream has decided that a simpler definition is
>>>> more appropriate, is there any reason not to follow their lead, to the
>>>> extent that we can do so without breaking existing on-disk data?
>>> 
>>> Please let me spend the weekend understanding their latest spec.
>> 
>> This is an intermediate report on the internal multi-byte charset
>> implementation of the emacsen. I have read the link Tom showed. I also
>> made a quick scan of the xemacs-21.4.0 source code, especially
>> xemacs-21.4.0/src/mule-charset.h. It seems the web document is
>> essentially a copy of the comments in that file. I also looked into
>> other places in the xemacs code, and I think I can conclude that
>> xemacs 21.4's multi-byte implementation is based on the doc on the web.
>> 
>> Next I looked into the emacs 24.1 source code, because I could not find
>> any doc regarding emacs's (not xemacs's) implementation of the internal
>> multi-byte charset. I found the following in src/charset.h:
>> 
>> /* Leading-code followed by extended leading-code.    DIMENSION/COLUMN */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_11    0x9A /* 1/1 */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_12    0x9B /* 1/2 */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_21    0x9C /* 2/1 */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_22    0x9D /* 2/2 */
>> 
>> And these are used like this:
>> 
>> /* Read one non-ASCII character from INSTREAM.  The character is
>>    encoded in `emacs-mule' and the first byte is already read in
>>    C.  */
>> 
>> static int
>> read_emacs_mule_char (int c, int (*readbyte) (int, Lisp_Object), Lisp_Object readcharfun)
>> {
>> :
>> :
>>   else if (len == 3)
>>     {
>>       if (buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_11
>>       || buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_12)
>>     {
>>       charset = CHARSET_FROM_ID (emacs_mule_charset[buf[1]]);
>>       code = buf[2] & 0x7F;
>>     }
>> 
>> As far as I can tell, this is exactly the same way PostgreSQL handles
>> single-byte private character sets: they consist of 3 bytes, and the
>> leading byte is either 0x9a or 0x9b. Other examples regarding
>> single-byte/multi-byte private charsets can be seen in coding.c.
>> 
>> As far as I can tell, it seems emacs and xemacs employ different
>> implementations of multi-byte charsets regarding "private"
>> charsets. Emacs's is the same as PostgreSQL's, while xemacs's is not.
>> I am contacting the original Mule author to see if he knows anything
>> about this.
> 
> I got a reply from the Mule author, Kenichi Handa (the mail is in
> Japanese, so I do not quote it here; if somebody wants to read the
> original mail, please let me know). First of all, according to him my
> understanding of emacs's implementation is correct. He did not know
> about xemacs's implementation. Apparently the implementation of
> xemacs was not led by the original mule author.
> 
> So which one of emacs/xemacs should we follow? My suggestion is not
> to follow xemacs, and to leave the current treatment of private
> leading bytes as it is, because emacs seems to be the more "right"
> upstream compared with xemacs.
> 
>> BTW, while looking into emacs's source code, I found their charset
>> definitions are in lisp/international/mule-conf.el. According to that
>> file, several new charsets have been added. Included is a patch to
>> follow their changes. This makes no change to current behavior, since
>> the patch just touches some comments and unsupported charsets.
> 
> If there's no objection, I would like to commit this. Objection?

Done, along with a comment that we follow emacs's implementation, not
xemacs's.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


Re: Patch: add conversion from pg_wchar to multibyte

From
Tom Lane
Date:
Tatsuo Ishii <ishii@postgresql.org> writes:
> Done, along with a comment that we follow emacs's implementation, not
> xemacs's.

Well, when the preceding comment block contains five references to
xemacs and the link for more information leads to www.xemacs.org,
I don't think it's real helpful to add one sentence saying "oh
by the way we're not actually following xemacs".

I continue to think that we'd be better off to follow the xemacs
spec, as the subdivisions the emacs spec is insisting on seem like
entirely useless complication.  The only possible reason for doing
it the emacs way is that it would provide room for twice as many
charset IDs ... but the present design for wchar conversion destroys
that advantage, because it requires the charset ID spaces to be
nonoverlapping anyhow.  Moreover, it's not apparent to me that
charset standards are still proliferating, so I doubt that we need
any more ID space.
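
Sketched, the reverse conversion has to decide what to emit from the
charset byte stored in the wchar alone, which is why official leading
bytes and private charset IDs cannot share values (illustrative only;
IS_PRIVATE_SB_ID and LCPRV1_MARKER are hypothetical names, not the
macros actually in wchar.c):

unsigned char lb = (*from >> 16) & 0xff;    /* charset byte saved in the wchar */

if (IS_LC1(lb))
{
    *to++ = lb;                 /* official single-byte charset: leading byte */
    *to++ = *from & 0xff;       /* ... then the character code */
}
else if (IS_PRIVATE_SB_ID(lb))  /* hypothetical: lb is a private 1-byte charset ID */
{
    *to++ = LCPRV1_MARKER;      /* hypothetical: re-emit the 0x9a/0x9b marker */
    *to++ = lb;
    *to++ = *from & 0xff;
}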
        regards, tom lane


Re: Patch: add conversion from pg_wchar to multibyte

From
Tatsuo Ishii
Date:
> Well, when the preceding comment block contains five references to
> xemacs and the link for more information leads to www.xemacs.org,
> I don't think it's real helpful to add one sentence saying "oh
> by the way we're not actually following xemacs".
> 
> I continue to think that we'd be better off to follow the xemacs
> spec, as the subdivisions the emacs spec is insisting on seem like
> entirely useless complication.  The only possible reason for doing
> it the emacs way is that it would provide room for twice as many
> charset IDs ... but the present design for wchar conversion destroys
> that advantage, because it requires the charset ID spaces to be
> nonoverlapping anyhow.  Moreover, it's not apparent to me that
> charset standards are still proliferating, so I doubt that we need
> any more ID space.

Well, we have been following the emacs spec, not the xemacs spec, from
day 0. I don't see any value in switching to the xemacs way at this
point, because I think the reason we support a particular encoding is
to keep on supporting existing user data, not to "enhance" our internal
architecture.

If you like xemacs's spec, I think you'd better add a new encoding
rather than break data compatibility.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp