Thread: Supporting SJIS as a database encoding

Supporting SJIS as a database encoding

From
"Tsunakawa, Takayuki"
Date:
Hello,

I'd like to propose adding SJIS as a database encoding.  You may wonder why SJIS is still necessary in the world of
Unicode.  The purpose is to achieve comparable performance when migrating legacy database systems from other DBMSs
with little modification of applications.
 

Recently, we failed to migrate a customer's legacy database from DBMS-X to PostgreSQL.  That customer wished for
PostgreSQL, but PostgreSQL couldn't meet the performance requirement.
 

The system uses DBMS-X with the database character set being SJIS.  The main applications are written in embedded SQL,
which require SJIS in their host variables.  They insisted they cannot use UTF8 for the host variables because that
would require large modifications to the applications due to character handling.  So no character set conversion is
necessary between the clients and the server.
 

On the other hand, PostgreSQL doesn't support SJIS as a database encoding.  Therefore, character set conversion from
UTF-8 to SJIS has to be performed.  The batch application runs millions of SELECTs, each of which retrieves more than
100 columns, and many of those columns are of character type.
 

If PostgreSQL supported SJIS, PostgreSQL would match or outperform DBMS-X with regard to these applications.  We
confirmed this by using psql to run a subset of the batch processing.  When the client encoding is SJIS, one FETCH of
10,000 rows took about 500ms.  When the client encoding is UTF8 (the same as the database encoding), the same FETCH
took 270ms.
 

Supporting SJIS may somewhat regain attention for PostgreSQL here in Japan, in the context of database migration.  BTW,
MySQL supports SJIS as a database encoding.  PostgreSQL used to be the most popular open source database in Japan, but
MySQL is now more popular.
 


But what I'm wondering is why PostgreSQL doesn't support SJIS.  Was there any technical difficulty?  Is there anything
you are worried about if adding SJIS?
 

I'd like to write a patch for adding SJIS if there's no strong objection.  I'd appreciate it if you could point me to
good design information for adding a server encoding (e.g. the URL of the most recent patch to add a new server
encoding).
 

Regards
Takayuki Tsunakawa





Re: Supporting SJIS as a database encoding

From
Tatsuo Ishii
Date:
> But what I'm wondering is why PostgreSQL doesn't support SJIS.  Was there any technical difficulty?  Is there
> anything you are worried about if adding SJIS?
 

Yes, there's a technical difficulty with the backend code. In many places
it is assumed that any string is "ASCII compatible", which means no
ASCII character is used as part of a multi-byte string. Here is one
random example from src/backend/utils/adt/varlena.c:
    /* Else, it's the traditional escaped style */
    for (bc = 0, tp = inputText; *tp != '\0'; bc++)
    {
        if (tp[0] != '\\')
            tp++;
 

SJIS sometimes uses '\' (0x5C) as the second byte of a two-byte character.
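
For example, here is a minimal sketch of the failure mode (not from
the patch): the SJIS katakana letter "so" is encoded as the two bytes
0x83 0x5C, and 0x5C is '\', so a byte-wise scan that ignores
multi-byte boundaries misreads the trail byte as a backslash.

    #include <stdio.h>

    int
    main(void)
    {
        const unsigned char sjis_so[] = {0x83, 0x5C, 0x00}; /* SJIS "so" */
        const unsigned char *p;

        for (p = sjis_so; *p != '\0'; p++)
        {
            if (*p == '\\')     /* fires on the character's second byte */
                printf("bogus backslash at offset %d\n", (int) (p - sjis_so));
        }
        return 0;
    }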

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



Re: Supporting SJIS as a database encoding

From
"Tsunakawa, Takayuki"
Date:
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tatsuo Ishii
> > But what I'm wondering is why PostgreSQL doesn't support SJIS.  Was there
> > any technical difficulty?  Is there anything you are worried about if adding
> > SJIS?
> 
> Yes, there's a technical difficulty with the backend code. In many places
> it is assumed that any string is "ASCII compatible", which means no ASCII
> character is used as part of a multi-byte string. Here is one random
> example from src/backend/utils/adt/varlena.c:
> 
>     /* Else, it's the traditional escaped style */
>     for (bc = 0, tp = inputText; *tp != '\0'; bc++)
>     {
>         if (tp[0] != '\\')
>             tp++;
> 
> SJIS sometimes uses '\' (0x5C) as the second byte of a two-byte character.

Thanks, I'll try to understand the seriousness of the problem, as I don't have good knowledge of character sets.  But
your example seems to tell everything about the difficulty...
 

Before digging into the problem, could you share your impression on whether PostgreSQL can support SJIS?  Would it be
hopeless?  Can't we find any direction to go?  Can I find the relevant source code by searching for specific words
like "ASCII", "HIGH_BIT", "\\", etc.?
 

Regards
Takayuki Tsunakawa





Re: Supporting SJIS as a database encoding

From
Tatsuo Ishii
Date:
> Before digging into the problem, could you share your impression on whether PostgreSQL can support SJIS?  Would it be
> hopeless?  Can't we find any direction to go?  Can I find the relevant source code by searching for specific words
> like "ASCII", "HIGH_BIT", "\\", etc.?
 

For starters, you could grep "multibyte".

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



Re: Supporting SJIS as a database encoding

From
Tom Lane
Date:
"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:
> Before digging into the problem, could you share your impression on
> whether PostgreSQL can support SJIS?  Would it be hopeless?

I think it's pretty much hopeless.  Even if we were willing to make every
bit of code that looks for '\' and other specific at-risk characters
multi-byte aware (with attendant speed penalties), we could expect that
third-party extensions would still contain vulnerable code.  More, we
could expect that new bugs of the same ilk would get introduced all the
time.  Many such bugs would amount to security problems.  So the amount of
effort and vigilance required seems out of proportion to the benefits.

Most of the recent discussion about allowed backend encodings has run
more in the other direction, ie, "why don't we disallow everything but
UTF8 and get rid of all the infrastructure for multiple backend
encodings?".  I'm not personally in favor of that, but there are very
few hackers who want to add any more overhead in this area.
        regards, tom lane



Re: Supporting SJIS as a database encoding

From
Heikki Linnakangas
Date:
On 09/05/2016 05:47 PM, Tom Lane wrote:
> "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:
>> Before digging into the problem, could you share your impression on
>> whether PostgreSQL can support SJIS?  Would it be hopeless?
>
> I think it's pretty much hopeless.

Agreed.

But one thing that would help a little would be to optimize the UTF-8 
-> SJIS conversion. It uses a very generic routine, with a binary search 
over a large array of mappings. I bet you could do better than that, 
maybe using a hash table or a radix tree instead of the large 
binary-searched array.

- Heikki




Re: Supporting SJIS as a database encoding

From
"Tsunakawa, Takayuki"
Date:
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:
> > Before digging into the problem, could you share your impression on
> > whether PostgreSQL can support SJIS?  Would it be hopeless?
> 
> I think it's pretty much hopeless.  Even if we were willing to make every
> bit of code that looks for '\' and other specific at-risk characters
> multi-byte aware (with attendant speed penalties), we could expect that
> third-party extensions would still contain vulnerable code.  More, we could
> expect that new bugs of the same ilk would get introduced all the time.
> Many such bugs would amount to security problems.  So the amount of effort
> and vigilance required seems out of proportion to the benefits.

Hmm, this sounds like a death sentence.  But as I don't have good knowledge of character set handling yet, I'm not
completely convinced about why PostgreSQL cannot support SJIS.  I wonder why and how other DBMSs support SJIS and
what's different about their implementations.  Would using multibyte functions like mb... to process characters solve
the problem?  Isn't the current implementation blocking the support of other character sets that have similar
characteristics?  I'll learn about character set handling...
 

> Most of the recent discussion about allowed backend encodings has run more
> in the other direction, ie, "why don't we disallow everything but
> UTF8 and get rid of all the infrastructure for multiple backend encodings?".
> I'm not personally in favor of that, but there are very few hackers who
> want to add any more overhead in this area.

Personally, I totally agree.  I want non-Unicode character sets to disappear from the world.  But real business
doesn't seem to forgive the lack of SJIS...
 

Regards
Takayuki Tsunakawa





Re: Supporting SJIS as a database encoding

From
"Tsunakawa, Takayuki"
Date:
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Heikki
> But one thing that would help a little would be to optimize the UTF-8
> -> SJIS conversion. It uses a very generic routine, with a binary search
> over a large array of mappings. I bet you could do better than that, maybe
> using a hash table or a radix tree instead of the large binary-searched
> array.

That sounds worth pursuing.  Thanks!


Regards
Takayuki Tsunakawa




Re: Supporting SJIS as a database encoding

From
Tom Lane
Date:
"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:
> Would using multibyte functions like mb... to process characters solve
> the problem?

Well, sure.  The problem is (1) finding all the places that need that
(I'd estimate dozens to hundreds of places in the core code, and then
there's the question of extensions); (2) preventing new
non-multibyte-aware code from being introduced after you've fixed those
places; and (3) the performance penalties you'd take, because a lot of
those places are bottlenecks and it's much cheaper to not worry about
character lengths in an inner loop.

> Isn't the current implementation blocking the support of
> other character sets that have similar characteristics?

Sure, SJIS is not the only encoding that we consider frontend-only.
See

https://www.postgresql.org/docs/devel/static/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED
        regards, tom lane



Re: Supporting SJIS as a database encoding

From
Kyotaro HORIGUCHI
Date:
Hello,

At Mon, 5 Sep 2016 19:38:33 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<529db688-72fc-1ca2-f898-b0b99e30076f@iki.fi>
> On 09/05/2016 05:47 PM, Tom Lane wrote:
> > "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:
> >> Before digging into the problem, could you share your impression on
> >> whether PostgreSQL can support SJIS?  Would it be hopeless?
> >
> > I think it's pretty much hopeless.
> 
> Agreed.

+1, even as a user of SJIS :)

> But one thing that would help a little, would be to optimize the UTF-8
> -> SJIS conversion. It uses a very generic routine, with a binary
> search over a large array of mappings. I bet you could do better than
> that, maybe using a hash table or a radix tree instead of the large
> binary-searched array.

I'm very impressed by the idea. The mean number of iterations for
binsearch on the current conversion table with 8000 characters is
about 13, and the table size is under 100 kBytes (maybe).

A three-level array with 2-byte values would take about 1.6~2MB of memory.

A radix tree for UTF-8 -> some-encoding conversion requires about,
or up to the following (using a 1-byte index to point to the next
level):

  (0x7f + 1)                                  single bytes
  + (0xdf - 0xc2 + 1) * (0xbf - 0x80 + 1)     2-byte sequences
  + (0xef - 0xe0 + 1) * (0xbf - 0x80 + 1)^2   3-byte sequences
  = 128 + 30 * 64 + 16 * 64^2 = 67 kBytes.

SJIS characters are at most 2 bytes long, so about 8000
characters take an extra 16 kBytes. And some padding space will be
added on top of that.

As a result, the radix tree seems promising because of its small
additional memory requirement and far fewer comparisons.  Big5
and other encodings including EUC-* will also benefit from
it.

Implementing the radix tree code, then redefining the format of the
mapping table to support the radix tree, then modifying the mapping
generator script are what's needed.

If no one opposes this, I'll do that.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Supporting SJIS as a database encoding

From
"Tsunakawa, Takayuki"
Date:
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kyotaro
> HORIGUCHI
> Implementing the radix tree code, then redefining the format of the mapping
> table to support the radix tree, then modifying the mapping generator script
> are what's needed.
> 
> If no one opposes this, I'll do that.

+100
Great analysis, and I admire your guts.  I very much appreciate your trial!

Regards
Takayuki Tsunakawa




Re: Supporting SJIS as a database encoding

From
Kyotaro HORIGUCHI
Date:
Hello,

At Tue, 6 Sep 2016 03:43:46 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in
<0A3221C70F24FB45833433255569204D1F5E66CE@G01JPEXMBYT05>
> > From: pgsql-hackers-owner@postgresql.org
> > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kyotaro
> > HORIGUCHI
> > Implementing the radix tree code, then redefining the format of the mapping
> > table to support the radix tree, then modifying the mapping generator script
> > are what's needed.
> > 
> > If no one opposes this, I'll do that.
> 
> +100
> Great analysis, and I admire your guts.  I very much appreciate your trial!

Thanks. By the way, there's another issue related to SJIS
conversion.  MS932 has several characters that have multiple code
points. Converting text in this encoding to and from Unicode
causes a round-trip problem. For example,

8754 (ROMAN NUMERAL I in NEC specials)
  => U+2160 (ROMAN NUMERAL I)
    => FA4A (ROMAN NUMERAL I in IBM extension)

By my count, 398 characters are affected by this kind of
replacement. In addition to that, "GAIJI" (private use area) is
not allowed. Does this meet your purpose?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Supporting SJIS as a database encoding

From
"Tsunakawa, Takayuki"
Date:
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kyotaro
> Thanks. By the way, there's another issue related to SJIS conversion.  MS932
> has several characters that have multiple code points. Converting text
> in this encoding to and from Unicode causes a round-trip problem. For
> example,
> 
> 8754 (ROMAN NUMERAL I in NEC specials)
>   => U+2160 (ROMAN NUMERAL I)
>     => FA4A (ROMAN NUMERAL I in IBM extension)
> 
> By my count, 398 characters are affected by this kind of replacement.
> In addition to that, "GAIJI" (private use area) is not allowed. Does this
> meet your purpose?

Supporting GAIJI is not a requirement as far as I know.  Thank you for sharing that information.

# I realize my lack of knowledge about character sets...

Regards
Takayuki Tsunakawa




Re: Supporting SJIS as a database encoding

From
Kyotaro HORIGUCHI
Date:
Hello,

At Wed, 07 Sep 2016 16:13:04 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20160907.161304.112519789.horiguchi.kyotaro@lab.ntt.co.jp>
> > > Implementing the radix tree code, then redefining the format of the
> > > mapping table to support the radix tree, then modifying the mapping
> > > generator script are what's needed.
> > > 
> > > If no one opposes this, I'll do that.

So, I did that as a PoC. The radix tree takes a little less than
100k bytes (far smaller than expected :) and it is definitely
faster than binsearch.


The attached patch does the following things.

- Defines a struct for a static radix tree (utf_radix_tree). Currently it supports up to 3-byte encodings.

- Adds a map generator script UCS_to_SJIS_radix.pl, which generates utf8_to_sjis_radix.map from utf8_to_sjis.map.

- Adds a new conversion function utf8_to_sjis_radix.

- Modifies UtfToLocal so as to allow map to be NULL.

- Modifies utf8_to_sjis to use the new conversion function instead of ULmapSJIS.
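
To illustrate the shape of the lookup, here is a rough sketch with
hypothetical names (the actual struct in the patch differs) of a
three-level lookup for a 3-byte UTF-8 sequence b1 b2 b3.  Each level
is an array indexed by one input byte, and 0 means "no mapping";
continuation bytes are 0x80-0xbf, hence the "- 0x80" and the 64-wide
rows.

    #include <stdint.h>

    typedef struct
    {
        const uint16_t *l1;   /* level 1, indexed by (b1 - 0xe0)          */
        const uint16_t *l2;   /* level 2, rows of 64, by (b2 - 0x80)      */
        const uint16_t *out;  /* output codes, rows of 64, by (b3 - 0x80) */
    } utf8_radix_tree;

    static uint16_t
    radix_lookup_3byte(const utf8_radix_tree *rt,
                       uint8_t b1, uint8_t b2, uint8_t b3)
    {
        uint16_t i1 = rt->l1[b1 - 0xe0];              /* pick a level-2 row */
        uint16_t i2 = rt->l2[i1 * 64 + (b2 - 0x80)];  /* pick an output row */
        return rt->out[i2 * 64 + (b3 - 0x80)];        /* SJIS code, or 0    */
    }

Each lookup is a constant number of array hops, versus the dozen or so
comparisons of the binary search.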


The following things remain to be done.

- utf8_to_sjis_radix could be more generic.

- SJIS->UTF8 is not implemented, but it would be easy to do since there's no difference in using the radix tree
mechanism (though the output character is currently assumed to be 2 bytes long).
 

- It doesn't support 4-byte codes, so this is not applicable to sjis_2004. Extending the radix tree to support 4-byte
codes wouldn't be hard.
 


The following is the result of a simple test.

=# create table t (a text); alter table t alter column a storage plain;
=# insert into t values ('... 7130 characters containing (I believe) all characters in SJIS encoding');
=# insert into t values ('... 7130 characters containing (I believe) all characters in SJIS encoding');

# Doing that twice is just my mistake.

$ export PGCLIENTENCODING=SJIS

$ psql -c '\encoding' postgres
SJIS

<Using radix tree>
$ time psql postgres -c 'select t.a from t, generate_series(0, 9999)' > /dev/null

real    0m22.696s
user    0m16.991s
sys    0m0.182s

Using binsearch, the result for the same operation was
real    0m35.296s
user    0m17.166s
sys    0m0.216s

Returning in UTF-8 bloats the result string by about 1.5 times so
it doesn't seem to make sense comparing with it. But it takes
real = 47.35s.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Supporting SJIS as a database encoding

From
"Tsunakawa, Takayuki"
Date:
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kyotaro
> HORIGUCHI
> <Using radix tree>
> $ time psql postgres -c 'select t.a from t, generate_series(0, 9999)' >
> /dev/null
> 
> real    0m22.696s
> user    0m16.991s
> sys    0m0.182s
> 
> Using binsearch the result for the same operation was
> real    0m35.296s
> user    0m17.166s
> sys    0m0.216s
> 
> Returning in UTF-8 bloats the result string by about 1.5 times so it doesn't
> seem to make sense comparing with it. But it takes real = 47.35s.

Cool, 36% speedup!  Does this difference vary depending on the actual characters used, e.g. would the speedup be
greater if most of the characters are ASCII?
 

Regards
Takayuki Tsunakawa





Re: Supporting SJIS as a database encoding

From
Kyotaro HORIGUCHI
Date:
At Thu, 8 Sep 2016 07:09:51 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in
<0A3221C70F24FB45833433255569204D1F5E7D4A@G01JPEXMBYT05>
> From: pgsql-hackers-owner@postgresql.org
> > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kyotaro
> > HORIGUCHI
> > <Using radix tree>
> > $ time psql postgres -c 'select t.a from t, generate_series(0, 9999)' >
> > /dev/null
> > 
> > real    0m22.696s
> > user    0m16.991s
> > sys    0m0.182s
> > 
> > Using binsearch the result for the same operation was
> > real    0m35.296s
> > user    0m17.166s
> > sys    0m0.216s
> > 
> > Returning in UTF-8 bloats the result string by about 1.5 times so it doesn't
> > seem to make sense comparing with it. But it takes real = 47.35s.
> 
> Cool, 36% speedup!  Does this difference vary depending on the actual characters used, e.g. would the speedup be
> greater if most of the characters are ASCII?
 

Binsearch on JIS X 0208 always needs about 10 comparison-and-bisect
steps, while the radix tree requires three hops on arrays for most
of the characters and two hops for some. In short, the effect won't
differ between 2- and 3-byte characters in UTF-8.

The translation speed of ASCII characters (U+20 - U+7f) is not
affected by the character conversion mechanism. They are just
copied without conversion.

As a result, there's no speedup if the output consists only of
ASCII characters, and maximum speedup when the output consists
only of 2-byte UTF-8 characters.
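
A sketch of that fast path (hypothetical code, not the actual
conversion routine, and assuming well-formed 3-byte input for
brevity): bytes below 0x80 are copied straight through and never
touch the mapping, whichever mechanism it uses.

    #include <stddef.h>
    #include <stdint.h>

    /* hypothetical per-character mapper: radix tree or binsearch */
    typedef uint16_t (*map_fn)(const uint8_t *mb);

    static size_t
    convert(const uint8_t *src, size_t len, uint8_t *dest, map_fn map)
    {
        size_t  written = 0;

        while (len > 0)
        {
            if (*src < 0x80)
            {
                dest[written++] = *src++;   /* ASCII: plain copy, no lookup */
                len--;
            }
            else
            {
                uint16_t c = map(src);      /* only multi-byte input pays */

                dest[written++] = (uint8_t) (c >> 8);
                dest[written++] = (uint8_t) c;
                src += 3;
                len -= 3;
            }
        }
        return written;
    }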

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Supporting SJIS as a database encoding

From
Heikki Linnakangas
Date:
On 09/08/2016 09:35 AM, Kyotaro HORIGUCHI wrote:
> Returning in UTF-8 bloats the result string by about 1.5 times so
> it doesn't seem to make sense comparing with it. But it takes
> real = 47.35s.

Nice!

I was hoping that this would also make the binaries smaller. A few dozen 
kB of storage is perhaps not a big deal these days, but still. And 
smaller tables would also consume less memory and CPU cache.

I removed the #include "../../Unicode/utf8_to_sjis.map" line, so that 
the old table isn't included anymore, compiled, and ran "strip 
utf8_and_sjis.so". Without this patch, it's 126 kB, and with it, it's 
160 kB. So the radix tree takes a little bit more space.

That's not too bad, and I'm sure we could live with that, but with a few 
simple tricks, we could do better. First, since all the values we store 
in the tree are < 0xffff, we could store them in int16 instead of int32, 
and halve the size of the table right off the bat. That won't work for 
all encodings, of course, but it might be worth it to have two versions 
of the code, one for int16 and another for int32.

Another trick is to eliminate redundancies in the tables. Many of the 
tables contain lots of zeros, as in:

>   /*   c3xx */{
>     /*   c380 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
>     /*   c388 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
>     /*   c390 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x817e,
>     /*   c398 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
>     /*   c3a0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
>     /*   c3a8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
>     /*   c3b0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x8180,
>     /*   c3b8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000
>   },

and

>   /* e388xx */{
>     /* e38880 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
>     /* e38888 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
>     /* e38890 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
>     /* e38898 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
>     /* e388a0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
>     /* e388a8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
>     /* e388b0 */ 0x0000, 0xfa58, 0x878b, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
>     /* e388b8 */ 0x0000, 0x878c, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000
>   },

You could overlay the last row of the first table, which is all zeros, 
with the first row of the second table, which is also all zeros. (Many 
of the tables have a lot more zero-rows than this example.)
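
As a toy illustration of the overlay (hypothetical names, 4-entry rows
instead of 64): if the rows live in one flat pool and each node stores
an offset into it, every all-zero row can point at the same storage.

    #include <stdint.h>

    static const uint16_t pool[] = {
        0x0000, 0x0000, 0x0000, 0x0000,   /* offset 0: shared all-zero row */
        0x0000, 0x817e, 0x0000, 0x8180,   /* offset 4: row with mappings   */
    };

    /* both tables reference the same zero row, which is stored only once */
    static const uint16_t *c3xx_last_row    = pool + 0;
    static const uint16_t *e388xx_first_row = pool + 0;
    static const uint16_t *c3xx_mapped_row  = pool + 4;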

But yes, this patch looks very promising in general. I think we should 
switch over to radix trees for all the encodings.

- Heikki




Re: Supporting SJIS as a database encoding

From
Kyotaro HORIGUCHI
Date:
Hello,

At Tue, 13 Sep 2016 11:44:01 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<7ff67a45-a53e-4d38-e25d-3a121afea47c@iki.fi>
> On 09/08/2016 09:35 AM, Kyotaro HORIGUCHI wrote:
> > Returning in UTF-8 bloats the result string by about 1.5 times so
> > it doesn't seem to make sense comparing with it. But it takes
> > real = 47.35s.
> 
> Nice!

Thanks!

> I was hoping that this would also make the binaries smaller. A few
> dozen kB of storage is perhaps not a big deal these days, but
> still. And smaller tables would also consume less memory and CPU
> cache.

Agreed.

> I removed the #include "../../Unicode/utf8_to_sjis.map" line, so that
> the old table isn't included anymore, compiled, and ran "strip
> utf8_and_sjis.so". Without this patch, it's 126 kB, and with it, it's
> 160 kB. So the radix tree takes a little bit more space.
> 
> That's not too bad, and I'm sure we could live with that, but with a
> few simple tricks, we could do better. First, since all the values we
> store in the tree are < 0xffff, we could store them in int16 instead
> of int32, and halve the size of the table right off the bat. That won't
> work for all encodings, of course, but it might be worth it to
> have two versions of the code, one for int16 and another for int32.

That's right. I used int imprudently. All of the characters in the
patch, and most characters in encodings other than the Unicode-related
ones, fit in 2 bytes. 3-byte characters can go in a separate
table in the struct for that case. Otherwise, two or more versions
of the struct are possible, since currently the radix struct is
utf8_and_sjis's own, in spite of the fact that it is in pg_wchar.h
in the patch.

> Another trick is to eliminate redundancies in the tables. Many of the
> tables contain lots of zeros, as in:
> 
> >   /*   c3xx */{
...
> >     /*   c390 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x817e,
> >     /*   c398 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> >     /*   c3a0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> >     /*   c3a8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> >     /*   c3b0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x8180,
> >     /*   c3b8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000
> >   },
> 
> and
> 
> >   /* e388xx */{
> >     /* e38880 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> >     /* e38888 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> >     /* e38890 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> >     /* e38898 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
> >     /* e388a0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
...
> >   },
> 
> You could overlay the last row of the first table, which is all zeros,
> with the first row of the second table, which is also all zeros. (Many
> of the tables have a lot more zero-rows than this example.)

Yes, the bunches of zeros were an annoyance. Several
compression techniques are available in exchange for some
additional CPU time, but the technique you suggested doesn't
need such a sacrifice, which sounds nice.

> But yes, this patch looks very promising in general. I think we should
> switch over to radix trees for all the encodings.

The result was more than I expected for a character set with
about 7000 characters. We can expect a certain amount of advantage
even for character sets that have fewer than a hundred
characters.


I'll work on this for the next CF.

Thanks.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Supporting SJIS as a database encoding

From
Kyotaro HORIGUCHI
Date:
Hello, I did this.

As a result, the radix tree is about 1.5 times faster and needs
half the memory.

At Wed, 21 Sep 2016 15:14:27 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20160921.151427.265121484.horiguchi.kyotaro@lab.ntt.co.jp>
> I'll work on this for the next CF.

The radix conversion function and the map conversion script became
more generic than before, so I could easily add
radix conversion for EUC_JP in addition to Shift JIS.

nm -S says that the size of the radix tree data for sjis->utf8
conversion is 34kB and that for utf8->sjis is 46kB (eucjp->utf8
57kB, utf8->eucjp 93kB). LUmapSJIS and ULmapSJIS were 62kB and
59kB, and LUmapEUC_JP and ULmapEUC_JP were 106kB and 105kB. If I'm
not missing something, the radix tree is faster and requires less
memory.

A simple test where 'select '7070 sjis chars' x 100' is run (I'm not
sure, but the size is 1404kB) on a local connection shows that this
is fast enough.

radix:  real 0m0.285s / user 0m0.199s / sys 0m0.006s
master: real 0m0.418s / user 0m0.180s / sys 0m0.004s

To make sure, the results of a test sending the same amount of
ASCII text (1404kB) with the SJIS and UTF8 (no-conversion) encodings
are as follows.

ascii/utf8-sjis: real 0m0.220s / user 0m0.176s / sys 0m0.011s
ascii/utf8-utf8: real 0m0.137s / user 0m0.111s / sys 0m0.008s

======
Random discussions -

Currently the tree structure is divided into several elements:
one for 2-byte codes, other ones for 3-byte and 4-byte codes, and the
output table. All but the last could logically and technically be
merged into a single table, but that would make the generator script
far more complex than it is now. I no longer want to play
hide-and-seek with complex perl objects..

It might be better to combine this into the core as a native
feature. Currently the helper function is in core, but that function
is given as conv_func when calling LocalToUtf.

The current implementation uses the *.map files of pg_utf_to_local as
input. That seems not good, but the radix tree files are completely
uneditable. Providing custom-made loading functions for every
source instead of load_chartable() would be the way to go.

# However, utf8_to_sjis.map, for example, doesn't seem to have been
# generated from the source mentioned in UCS_to_SJIS.pl

I'm not sure that compilers other than gcc accept the generated map
file content.

RADIXTREE.pm is in a rather old style but seems to be no problem.

I haven't tried this for charsets that contain 4-byte codes.

I haven't considered charsets with combined characters. I don't
think that is needed immediately.

Though I believe this is easily applied to other
conversions, I have tried it only with character sets that I know
about.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
This is a different topic from the original thread, so I renamed
the subject and reposted. Sorry for the duplicate posting.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Radix tree for character conversion

From
Heikki Linnakangas
Date:
On 10/07/2016 11:36 AM, Kyotaro HORIGUCHI wrote:
> The radix conversion function and the map conversion script became
> more generic than before, so I could easily add
> radix conversion for EUC_JP in addition to Shift JIS.
>
> nm -S says that the size of the radix tree data for sjis->utf8
> conversion is 34kB and that for utf8->sjis is 46kB (eucjp->utf8
> 57kB, utf8->eucjp 93kB). LUmapSJIS and ULmapSJIS were 62kB and
> 59kB, and LUmapEUC_JP and ULmapEUC_JP were 106kB and 105kB. If I'm
> not missing something, the radix tree is faster and requires less
> memory.

Cool!

> Currently the tree structure is divided into several elements:
> one for 2-byte codes, other ones for 3-byte and 4-byte codes, and the
> output table. All but the last could logically and technically be
> merged into a single table, but that would make the generator script
> far more complex than it is now. I no longer want to play
> hide-and-seek with complex perl objects..

I think that's OK. There isn't really anything to gain by merging them.

> It might be better to combine this into the core as a native
> feature. Currently the helper function is in core, but that function
> is given as conv_func when calling LocalToUtf.

Yeah, I think we want to completely replace the current binary-search 
based code with this. I would rather maintain just one mechanism.

> The current implementation uses the *.map files of pg_utf_to_local as
> input. That seems not good, but the radix tree files are completely
> uneditable. Providing custom-made loading functions for every
> source instead of load_chartable() would be the way to go.
>
> # However, utf8_to_sjis.map, for example, doesn't seem to have been
> # generated from the source mentioned in UCS_to_SJIS.pl

Ouch. We should find and document an authoritative source for all the 
mappings we have...

I think the next steps here are:

1. Find an authoritative source for all the existing mappings.
2. Generate the radix tree files directly from the authoritative 
sources, instead of the existing *.map files.
3. Completely replace the existing binary-search code with this.

- Heikki




Re: Radix tree for character conversion

From
Robert Haas
Date:
On Fri, Oct 7, 2016 at 6:46 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Ouch. We should find and document an authoritative source for all the
> mappings we have...
>
> I think the next steps here are:
>
> 1. Find an authoritative source for all the existing mappings.
> 2. Generate the radix tree files directly from the authoritative sources,
> instead of the existing *.map files.
> 3. Completely replace the existing binary-search code with this.

It might be best to convert using the existing map files, and then
update the mappings later.  Otherwise, when things break, you won't
know what to blame.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Radix tree for character conversion

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Oct 7, 2016 at 6:46 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Ouch. We should find and document an authoritative source for all the
>> mappings we have...
>> 
>> I think the next steps here are:
>> 
>> 1. Find an authoritative source for all the existing mappings.
>> 2. Generate the radix tree files directly from the authoritative sources,
>> instead of the existing *.map files.
>> 3. Completely replace the existing binary-search code with this.

> It might be best to convert using the existing map files, and then
> update the mappings later.  Otherwise, when things break, you won't
> know what to blame.

I think I went through this exercise last year or so, and updated the
notes about the authoritative sources where I was able to find one.
In the remaining cases, I believe that the maps have been intentionally
tweaked and we should be cautious about undoing that.  Tatsuo-san might
remember more about why they are the way they are.
        regards, tom lane



Re: Radix tree for character conversion

From
Heikki Linnakangas
Date:
On 10/07/2016 06:55 PM, Robert Haas wrote:
> On Fri, Oct 7, 2016 at 6:46 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Ouch. We should find and document an authoritative source for all the
>> mappings we have...
>>
>> I think the next steps here are:
>>
>> 1. Find an authoritative source for all the existing mappings.
>> 2. Generate the radix tree files directly from the authoritative sources,
>> instead of the existing *.map files.
>> 3. Completely replace the existing binary-search code with this.
>
> It might be best to convert using the existing map files, and then
> update the mappings later.  Otherwise, when things break, you won't
> know what to blame.

I was thinking that we keep the mappings unchanged, but figure out where 
we got the mappings we have. An authoritative source may well be "file X 
from unicode, with the following tweaks: ...". As long as we have some 
way of representing that, in text files, or in perl code, that's OK.

What I don't want is that the current *.map files are turned into the 
authoritative source files, that we modify by hand. There are no 
comments in them, for starters, which makes hand-editing cumbersome. It 
seems that we have edited some of them by hand already, but we should 
rectify that.

- Heikki




Re: Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Hello, this is a new version of the radix charconv patch.

At Sat, 8 Oct 2016 00:37:28 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<6d85d710-9554-a928-29ff-b2d3b80b01c9@iki.fi>
> What I don't want is that the current *.map files are turned into the
> authoritative source files, that we modify by hand. There are no
> comments in them, for starters, which makes hand-editing
> cumbersome. It seems that we have edited some of them by hand already,
> but we should rectify that.

Agreed. So, I identified the source files of each character for the
EUC_JP and SJIS conversions to clarify what has been done to them.

The SJIS conversion is made from CP932.TXT, plus 8 additional
conversion entries for UTF8->SJIS and none for SJIS->UTF8.

EUC_JP is made from CP932.TXT and JIS0212.TXT; JIS0201.TXT and
JIS0208.TXT are useless. It adds 83 or 86 (differing by
direction) conversion entries.

http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
http://unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0212.TXT

Now the generator scripts no longer use the *.map files as source and
in turn generate old-style map files as well as radix tree files.

For convenience, UCS_to_(SJIS|EUC_JP).pl takes the parameters --flat
and -v. The former generates the old-style flat map as well as the
radix map file, and adding -v adds a source description for each line
in the flat map file.

While working on this, I found that the EUC_JP map lacks some
conversions, but that is another issue.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Radix tree for character conversion

From
Heikki Linnakangas
Date:
On 10/21/2016 11:33 AM, Kyotaro HORIGUCHI wrote:
> Hello, this is new version of radix charconv.
>
> At Sat, 8 Oct 2016 00:37:28 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<6d85d710-9554-a928-29ff-b2d3b80b01c9@iki.fi>
>> What I don't want is that the current *.map files are turned into the
>> authoritative source files, that we modify by hand. There are no
>> comments in them, for starters, which makes hand-editing
>> cumbersome. It seems that we have edited some of them by hand already,
>> but we should rectify that.
>
> Agreed. So, I identified the source files of each character for the
> EUC_JP and SJIS conversions to clarify what has been done to them.
>
> The SJIS conversion is made from CP932.TXT, plus 8 additional
> conversion entries for UTF8->SJIS and none for SJIS->UTF8.
>
> EUC_JP is made from CP932.TXT and JIS0212.TXT; JIS0201.TXT and
> JIS0208.TXT are useless. It adds 83 or 86 (differing by
> direction) conversion entries.
>
> http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
> http://unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0212.TXT
>
> Now the generator scripts no longer use the *.map files as source and
> in turn generate old-style map files as well as radix tree files.
>
> For convenience, UCS_to_(SJIS|EUC_JP).pl takes the parameters --flat
> and -v. The former generates the old-style flat map as well as the
> radix map file, and adding -v adds a source description for each line
> in the flat map file.
>
> While working on this, I found that the EUC_JP map lacks some
> conversions, but that is another issue.

Thanks!

I'd really like to clean up all the current perl scripts, before we
start to do the radix tree stuff. I worked through the rest of the
conversions, and fixed/hacked the perl scripts so that they faithfully
reproduce the mapping tables that we have in the repository currently.
Whether those are the best mappings or not, or whether we should update
them based on some authoritative source is another question, but let's
try to nail down the process of creating the mapping tables.

Tom Lane looked into this in Nov 2015
(https://www.postgresql.org/message-id/28825.1449076551%40sss.pgh.pa.us).
This is a continuation of that, to actually fix the scripts. This patch
series doesn't change any of the mappings, only the way we produce the
mapping tables.

Our UHC conversion tables contained a lot more characters than the
CP949.TXT file it's supposedly based on. I rewrote the script to use
"windows-949-2000.xml" file, from the ICU project, as the source
instead. It's a much closer match to our mapping tables, containing all
but one of the additional characters. We were already using
gb-18030-2000.xml as the source in UCS_GB18030.pl, so parsing ICU's XML
files isn't a new thing.

The GB2312.TXT source file seems to have disappeared from the Unicode
consortium's FTP site. I changed the UCS_to_EUC_CN.pl script to use
gb-18030-2000.xml as the source instead. GB-18030 is an extension of
GB-2312, UCS_to_EUC_CN.pl filters out the additional characters that are
not in GB-2312.

This now forms a reasonable basis for switching to radix tree. Every
mapping table is now generated by the print_tables() perl function in
convutils.pm. To switch to a radix tree, you just need to swap that
function with one that produces a radix tree instead of the
current-format mapping tables.

The perl scripts are still quite messy. For example, I lost the checks
for duplicate mappings somewhere along the way - that ought to be put
back. My Perl skills are limited.


This is now an orthogonal discussion, and doesn't need to block the
radix tree work, but we should consider what we want to base our mapping
tables on. Perhaps we could use the XML files from ICU as the source for
all of the mappings?

ICU seems to use a BSD-like license, so we could even include the XML
files in our repository. Actually, looking at
http://www.unicode.org/copyright.html#License, I think we could include
the *.TXT files in our repository, too, if we wanted to. The *.TXT files
are found under www.unicode.org/Public/, so that license applies. I
think that has changed somewhat recently, because the comments in our
perl scripts claim that the license didn't allow that.

- Heikki



Re: Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Hello, thank you very much for the work. My work became much
easier with it.

At Tue, 25 Oct 2016 12:23:48 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<08e7892a-d55c-eefe-76e6-7910bc8dd1f3@iki.fi>
> I'd really like to clean up all the current perl scripts, before we
> start to do the radix tree stuff. I worked through the rest of the
> conversions, and fixed/hacked the perl scripts so that they faithfully
> reproduce the mapping tables that we have in the repository
> currently. Whether those are the best mappings or not, or whether we
> should update them based on some authoritative source is another
> question, but let's try to nail down the process of creating the
> mapping tables.
> 
> Tom Lane looked into this in Nov 2015
> (https://www.postgresql.org/message-id/28825.1449076551%40sss.pgh.pa.us). This
> is a continuation of that, to actually fix the scripts. This patch
> series doesn't change any of the mappings, only the way we produce the
> mapping tables.
> 
> Our UHC conversion tables contained a lot more characters than the
> CP949.TXT file it's supposedly based on. I rewrote the script to use
> "windows-949-2000.xml" file, from the ICU project, as the source
> instead. It's a much closer match to our mapping tables, containing
> all but one of the additional characters. We were already using
> gb-18030-2000.xml as the source in UCS_GB18030.pl, so parsing ICU's
> XML files isn't a new thing.
> 
> The GB2312.TXT source file seems to have disappeared from the Unicode
> consortium's FTP site. I changed the UCS_to_EUC_CN.pl script to use
> gb-18030-2000.xml as the source instead. GB-18030 is an extension of
> GB-2312, UCS_to_EUC_CN.pl filters out the additional characters that
> are not in GB-2312.
> 
> This now forms a reasonable basis for switching to radix tree. Every
> mapping table is now generated by the print_tables() perl function in
> convutils.pm. To switch to a radix tree, you just need to swap that
> function with one that produces a radix tree instead of the
> current-format mapping tables.

RADIXCONV.pm has been merged into convutils.pm, and the manner of
resolving references has been unified from $$x{} to $x->{} (subroutine
calls with '&' are not unified..). Now radix tree files are written
by a function with a similar interface:

print_radix_trees($script_name, $encoding, \@mapping);

> The perl scripts are still quite messy. For example, I lost the checks
> for duplicate mappings somewhere along the way - that ought to be put
> back. My Perl skills are limited.

Perl scripts are bound to be messy, I believe. Anyway, the duplicate
check has been built into the sub print_radix_trees. Maybe the
same check is needed by some plain map files, but it would be just
duplication for the maps that have a radix tree.

The attached patches apply on top of your patches and change all
possible conversions to use radix trees (combined characters are
still using the old method). In addition to that, because of the
difficult-to-verify nature of the radix-tree data, I added a
map checker (make mapcheck) to check them against the plain maps.

I have briefly checked with real characters for
SJIS/EUC-JP/BIG5/ISO8859-13 and radix conversion seems to work
correctly for them.

> This is now an orthogonal discussion, and doesn't need to block the
> radix tree work, but we should consider what we want to base our
> mapping tables on. Perhaps we could use the XML files from ICU as the
> source for all of the mappings?
> 
> ICU seems to use a BSD-like license, so we could even include the XML
> files in our repository. Actually, looking at
> http://www.unicode.org/copyright.html#License, I think we could
> include the *.TXT files in our repository, too, if we wanted to. The
> *.TXT files are found under www.unicode.org/Public/, so that license
> applies. I think that has changed somewhat recently, because the
> comments in our perl scripts claim that the license didn't allow that.

For convenience, all the required files can be downloaded by
typing 'make download-texts'.


In the following document,

http://unicode.org/Public/ReadMe.txt

| Terms of Use
|     http://www.unicode.org/copyright.html

http://www.unicode.org/copyright.html

| EXHIBIT 1
| UNICODE, INC. LICENSE AGREEMENT - DATA FILES AND SOFTWARE
| Unicode Data Files include all data files under the directories
| http://www.unicode.org/Public/, http://www.unicode.org/reports/,
...

| COPYRIGHT AND PERMISSION NOTICE
| 
| Copyright (c) 1991-2016 Unicode, Inc. All rights reserved.
| Distributed under the Terms of Use in http://www.unicode.org/copyright.html.
| 
| Permission is hereby granted, free of charge, to any person obtaining
| a copy of the Unicode data files and any associated documentation
| (the "Data Files") or Unicode software and any associated documentation
| (the "Software") to deal in the Data Files or Software
| without restriction, including without limitation the rights to use,
| copy, modify, merge, publish, distribute, and/or sell copies of
| the Data Files or Software, and to permit persons to whom the Data Files
| or Software are furnished to do so, provided that either
| (a) this copyright and permission notice appear with all copies
| of the Data Files or Software, or
| (b) this copyright and permission notice appear in associated
| Documentation.

Perhaps we can put the files into our repository by providing some
notices.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Radix tree for character conversion

From
Robert Haas
Date:
On Thu, Oct 27, 2016 at 3:23 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> | COPYRIGHT AND PERMISSION NOTICE
> |
> | Copyright (c) 1991-2016 Unicode, Inc. All rights reserved.
> | Distributed under the Terms of Use in http://www.unicode.org/copyright.html.
> |
> | Permission is hereby granted, free of charge, to any person obtaining
> | a copy of the Unicode data files and any associated documentation
> | (the "Data Files") or Unicode software and any associated documentation
> | (the "Software") to deal in the Data Files or Software
> | without restriction, including without limitation the rights to use,
> | copy, modify, merge, publish, distribute, and/or sell copies of
> | the Data Files or Software, and to permit persons to whom the Data Files
> | or Software are furnished to do so, provided that either
> | (a) this copyright and permission notice appear with all copies
> | of the Data Files or Software, or
> | (b) this copyright and permission notice appear in associated
> | Documentation.
>
> Perhaps we can put the files into our repository by providing some
> notices.

Uggh, I don't much like advertising clauses.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Radix tree for character conversion

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Oct 27, 2016 at 3:23 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> Perhaps we can put the files into our repository by providing some
>> notices.

> Uggh, I don't much like advertising clauses.

Even if the license were exactly compatible with ours, I'd be -1 on
bloating our tarballs with these files.  They're large and only a
tiny fraction of developers, let alone end users, will ever care
to look at them.

I think it's fine as long as we have a README file that explains
where to get them.  (I'm not even very thrilled with the proposed
auto-download script, as it makes undesirable assumptions about
which internet tools you use, not to mention that it won't work
at all on Windows.)

I'd actually vote for getting rid of the reference files we
have in the tree now (src/backend/utils/mb/Unicode/*txt), on
the same grounds.  That's 600K of stuff that does not need to
be in our tarballs.
        regards, tom lane



Re: Radix tree for character conversion

From
David Fetter
Date:
On Fri, Oct 28, 2016 at 09:18:08AM -0400, Robert Haas wrote:
> On Thu, Oct 27, 2016 at 3:23 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > | COPYRIGHT AND PERMISSION NOTICE
> > |
> > | Copyright (c) 1991-2016 Unicode, Inc. All rights reserved.
> > | Distributed under the Terms of Use in http://www.unicode.org/copyright.html.
> > |
> > | Permission is hereby granted, free of charge, to any person obtaining
> > | a copy of the Unicode data files and any associated documentation
> > | (the "Data Files") or Unicode software and any associated documentation
> > | (the "Software") to deal in the Data Files or Software
> > | without restriction, including without limitation the rights to use,
> > | copy, modify, merge, publish, distribute, and/or sell copies of
> > | the Data Files or Software, and to permit persons to whom the Data Files
> > | or Software are furnished to do so, provided that either
> > | (a) this copyright and permission notice appear with all copies
> > | of the Data Files or Software, or
> > | (b) this copyright and permission notice appear in associated
> > | Documentation.
> >
> > Perhaps we can put the files into our repository by providing some
> > notices.
> 
> Uggh, I don't much like advertising clauses.

Your dislike is pretty common.

Might it be worth reaching out to the Unicode consortium about this?
They may well have added that as boilerplate without really
considering the effects, and they even have a popup that specifically
addresses licensing.

http://www.unicode.org/reporting.html

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Hello,

At Fri, 28 Oct 2016 09:42:25 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <13049.1477662145@sss.pgh.pa.us>
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Thu, Oct 27, 2016 at 3:23 AM, Kyotaro HORIGUCHI
> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >> Perhaps we can put the files into our repository by providing some
> >> notices.
> 
> > Uggh, I don't much like advertising clauses.
> 
> Even if the license were exactly compatible with ours, I'd be -1 on
> bloating our tarballs with these files.  They're large and only a
> tiny fraction of developers, let alone end users, will ever care
> to look at them.

I understood the intention of Heikki's suggestion, that these
files might be included in PostgreSQL's repository, as looking
for a kind of stability, or consistency. The source files are not
revision-managed. In the case where the authority files get unwanted
changes or become unavailable, the .map files would have to be
edited independently of the authority files.
Actually, some map files have lost their authority file and other
map files have received several direct modifications. We would be free
from such disturbance by containing "frozen" authority files.

On the other hand, I also agree that the advertising clause and the
additional bloat of the source repository are a nuisance.

> I think it's fine as long as we have a README file that explains
> where to get them.  (I'm not even very thrilled with the proposed
> auto-download script, as it makes undesirable assumptions about
> which internet tools you use, not to mention that it won't work
> at all on Windows.)

Mmm. It would be a pain in the neck. Some of the files are
already stored in the "OBSOLETE" directory on the Unicode consortium
ftp site, and one of them has vanished and is available from
another place, a part of the ICU source tree. On the other hand, the
map files are assumed to be generated from the scripts, and direct
edits to them are discouraged. Radix map files are uneditable
and currently made from the authority files. If some authority
files are gone, the additional edits would have to be made directly
to the map files, and they would in turn become the authority for the
radix files.  (It's quite easy to change the authority to the current
map files, though.)

By the way, note the following phrase in the terms of the license.

http://www.unicode.org/copyright.html

| COPYRIGHT AND PERMISSION NOTICE
| 
| Copyright (c) 1991-2016 Unicode, Inc. All rights reserved.
| Distributed under the Terms of Use in http://www.unicode.org/copyright.html.
| 
| Permission is hereby granted, free of charge, to any person obtaining
| a copy of the Unicode data files and any associated documentation
| (the "Data Files") or Unicode software and any associated documentation
| (the "Software") to deal in the Data Files or Software
| without restriction, including without limitation the rights to use,
| copy, modify, merge, publish, distribute, and/or sell copies of
| the Data Files or Software, and to permit persons to whom the Data Files
| or Software are furnished to do so, provided that either
| (a) this copyright and permission notice appear with all copies
| of the Data Files or Software, or
| (b) this copyright and permission notice appear in associated
| Documentation.

I'm afraid that the map (and _radix.map) files are translations
of the "Data Files", and 'translation' is a part of 'modification'.

Whether the notice is necessary or not, if we decide to wipe the
'true' authority out of our source files, I'd like to make the
map files (preferably with comments) the second authority, with the
_radix.map files generated from them, since they are
not editable.

> I'd actually vote for getting rid of the reference files we
> have in the tree now (src/backend/utils/mb/Unicode/*txt), on
> the same grounds.  That's 600K of stuff that does not need to
> be in our tarballs.

Anyway, I'd like to register this as an item in this CF.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Radix tree for character conversion

From
Daniel Gustafsson
Date:
> On 27 Oct 2016, at 09:23, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> Hello, thank you very much for the work. My work became much
> easier with it.
>
> At Tue, 25 Oct 2016 12:23:48 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<08e7892a-d55c-eefe-76e6-7910bc8dd1f3@iki.fi>
>>
>> [..]
>> The perl scripts are still quite messy. For example, I lost the checks
>> for duplicate mappings somewhere along the way - that ought to be put
>> back. My Perl skills are limited.
>
> Perl scripts are to be messy, I believe. Anyway the duplicate
> check has been built into the sub print_radix_trees. Maybe the
> same check is needed by some plain map files but it would be just
> duplication for the maps having radix tree.

I took a small stab at doing some cleaning of the Perl scripts, mainly around
using the more modern (well, modern as in +15 years old) form for open(..),
avoiding global filehandles for passing scalar references and enforcing use
strict.  Some smaller typos and fixes were also included.  It seems my Perl has
become a bit rusty so I hope the changes make sense.  The produced files are
identical with these patches applied, they are merely doing cleaning as opposed
to bugfixing.

The attached patches are against the 0001-0006 patches from Heikki and you in
this series of emails, the separation is intended to make them easier to read.

cheers ./daniel


Attachment

Re: Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Thank you for looking at this.

At Mon, 31 Oct 2016 17:11:17 +0100, Daniel Gustafsson <daniel@yesql.se> wrote in
<3FC648B5-2B7F-4585-9615-207A44B730A9@yesql.se>
> > On 27 Oct 2016, at 09:23, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Perl scripts are to be messy, I believe. Anyway the duplicate
> > check has been built into the sub print_radix_trees. Maybe the
> > same check is needed by some plain map files but it would be just
> > duplication for the maps having radix tree.
> 
> I took a small stab at doing some cleaning of the Perl scripts, mainly around
> using the more modern (well, modern as in +15 years old) form for open(..),
> avoiding global filehandles for passing scalar references and enforcing use
> strict.  Some smaller typos and fixes were also included.  It seems my Perl has
> become a bit rusty so I hope the changes make sense.  The produced files are
> identical with these patches applied, they are merely doing cleaning as opposed
> to bugfixing.
> 
> The attached patches are against the 0001-0006 patches from Heikki and you in
> this series of emails, the separation is intended to make them easier to read.

I'm not sure where the discussion on this is going, but these
patches make me think about Perl coding style.

The distinction between the executable scripts and the library is
intentional, though on an obscure basis. The existing scripts
receive fewer modifications, while the library uses more restricted
scopes to avoid the troubles caused by using the global scope. But I
don't have a clear preference on that. The TAP test scripts use OO
notation, but I'm not sure convutils.pl would be better off in the
same notation. It will rarely be edited hereafter and won't grow any
more.

As far as I see the obvious bug fixes in the patchset are the
following,

- 0007: load_maptable forgets to close its input file.
- 0010: the comment for load_maptables is wrong.
- 0011: a hash reference is incorrectly dereferenced

All fixes other than the above three seem to be styling or
syntax-generation issues, and I don't know whether any
recommendation exists...


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Radix tree for character conversion

From
Daniel Gustafsson
Date:
> On 04 Nov 2016, at 08:34, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> Thank you for looking at this.

And thank you for taking the time to read my patches!

> At Mon, 31 Oct 2016 17:11:17 +0100, Daniel Gustafsson <daniel@yesql.se> wrote in
<3FC648B5-2B7F-4585-9615-207A44B730A9@yesql.se>
>>> On 27 Oct 2016, at 09:23, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> Perl scripts are to be messy, I believe. Anyway the duplicate
>>> check has been built into the sub print_radix_trees. Maybe the
>>> same check is needed by some plain map files but it would be just
>>> duplication for the maps having radix tree.
>>
>> I took a small stab at doing some cleaning of the Perl scripts, mainly around
>> using the more modern (well, modern as in +15 years old) form for open(..),
>> avoiding global filehandles for passing scalar references and enforcing use
>> strict.  Some smaller typos and fixes were also included.  It seems my Perl has
>> become a bit rusty so I hope the changes make sense.  The produced files are
>> identical with these patches applied, they are merely doing cleaning as opposed
>> to bugfixing.
>>
>> The attached patches are against the 0001-0006 patches from Heikki and you in
>> this series of emails, the separation is intended to make them easier to read.
>
> I'm not sure where the discussion on this is going, but these
> patches make me think about Perl coding style.

Some of this can absolutely be considered style and more or less down to
personal preference.  I haven’t seen any coding conventions for Perl so I
assume it’s down to consensus among the committers.  My rationale for these
patches in the first place was that I perceived this thread to partly want to
clean up the code and make it more modern Perl.

> The distinction between the executable scripts and the library is
> intentional, though on an obscure basis. The existing scripts
> receive fewer modifications, while the library uses more restricted
> scopes to avoid the troubles caused by using the global scope. But I
> don't have a clear preference on that. The TAP test scripts use OO
> notation, but I'm not sure convutils.pl would be better off in the
> same notation. It will rarely be edited hereafter and won't grow any
> more.

I think the current convutils module is fine and converting it to OO would be
overkill.

> As far as I see the obvious bug fixes in the patchset are the
> following,

Agreed, with some comments:

> - 0007: load_maptable forgets to close its input file.

An interesting note on this is that it’s not even a bug =) Since $in is a
scalar reference, there is no need to explicitly close() the filehandle since
the reference counter will close it on leaving scope, but there’s no harm in
doing it ourselves and it also makes for less confusion for anyone not familiar
with Perl internals.

> - 0010: the comment for load_maptables is wrong.

There is also a fix for a typo in make_mapchecker.pl

> - 0011: a hash reference is incorrectly dereferenced
>
> All fixes other than the above three seem to be styling or
> syntax-generation issues, and I don't know whether any
> recommendation exists…

I think there are some more fixes remaining that aren't just styling/syntax.  I’ll go
through the patches one by one:

0007 - While this might be considered styling/syntax, my $0.02 is that it’s not,
but instead a worthwhile change.  I’ll illustrate with an example from the
patch in question:

Using a bareword global variable in open() for the filehandle was replaced with
the three-part form in 5.6, and is now actively discouraged in the
Perl documentation (and has been since the 5.20 docs).  The problem is that
such filehandles are global and can thus easily clash; so easily, in fact, that the
0007 patch fixes one such occurrence:

print_radix_map() opens the file in the global filehandle OUT and passes it to
print_radix_table() with the typeglob *OUT; print_radix_table() in turn passes
the filehandle to print_segmented_table() which writes to the file using the
parameter $hd, except in one case where it uses the global OUT variable without
knowing it will be the right file.  This is where the hunk below in 0007 comes
in:

-               print OUT "$line\n";
+               print { $$hd } "$line\n";

In this case OUT references the right file and it produces the right result,
but it illustrates how easy it is to get wrong (which can cause very subtle
bugs).  So, when poking at this code we might as well, IMHO, use what is today
in Perl considered the right way to deal with filehandle references.
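
To make that concrete, here is a toy version (sub and file names invented, not
the actual script code) of passing a lexical filehandle down a call chain as a
scalar reference, the way the patched scripts do:

    use strict;
    use warnings;

    sub print_line
    {
        my ($hd, $line) = @_;

        # $hd is a reference to a lexical filehandle; dereferencing it
        # in a block makes print write to the intended file rather
        # than to whatever a global handle happens to point at.
        print { $$hd } "$line\n";
    }

    open(my $out, '>', 'table.map') or die "cannot open table.map: $!";
    print_line(\$out, 'static const uint32 table[] = {');
    print_line(\$out, '};');
    close($out);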

Using implicit filemodes can also introduce bugs when opening filenames passed
in from the outside, as we do in UCS_to_most.pl.  Considering the use case of
these scripts it’s obviously quite low on the list of risks, but still.
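
As an illustration of that risk (the file name is invented): with the
two-argument form, mode characters embedded in the name are honored.

    use strict;
    use warnings;

    my $file = '>important.map';    # e.g. taken from @ARGV

    # Two-argument open(): the leading '>' is parsed as the mode, so
    # this silently opens (and truncates!) important.map for writing.
    open(my $fh, $file) or die "cannot open $file: $!";

    # Three-argument open(): the mode is explicit, so the '>' is just
    # part of a (presumably nonexistent) file name and the open fails
    # cleanly instead.
    open(my $fh2, '<', $file) or die "cannot open $file: $!";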

0008 - I don’t think there are any recommendations on whether or not to use
"use strict;" in the codebase; there certainly are lots of scripts not doing it.
Personally I think it’s good hygiene to always use strict, but here it might
just be janitorial nitpicking (which I too am guilty of liking..  =)).

0009 - local $var; is to provide a temporary value of $var, where $var exists,
for the current scope (and was mostly used back in Perl 4).  Since we are
passing by value to ucs2utf(), and creating $utf inside it, using my to create
the variable is the right thing even though the end result is the same.
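
A tiny illustration of the difference (the names are arbitrary):

    use strict;
    use warnings;

    our $x = 'global';

    sub show { print "$x\n"; }

    sub with_local
    {
        local $x = 'temporary';    # dynamic scope: callees see it
        show();                    # prints "temporary"
    }

    sub with_my
    {
        my $x = 'lexical';         # lexical scope: callees don't
        show();                    # prints "global"
    }

    with_local();
    with_my();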

0010 and 0011 are already dealt with above.

So to summarize, I think there are a few more (though not all) hunks of
interest that aren’t just syntax/style, and that can serve to make the code
easier to read and work with down the line should we need to.

cheers ./daniel


Re: Radix tree for character conversion

From
Daniel Gustafsson
Date:
> On 07 Nov 2016, at 12:32, Daniel Gustafsson <daniel@yesql.se> wrote:
>
>> On 04 Nov 2016, at 08:34, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>
>> I'm not sure where the discussion on this is going, but these
>> patches make me think about Perl coding style.
>
> Some of this can absolutely be considered style and more or less down to
> personal preference.  I haven’t seen any coding conventions for Perl so I
> assume it’s down to consensus among the committers.

Actually, scratch that; there is of course a perltidy profile in the pgindent
directory.  I should avoid sending email before coffee..

cheers ./daniel


Re: Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Hello,

At Mon, 7 Nov 2016 12:32:55 +0100, Daniel Gustafsson <daniel@yesql.se> wrote in
<EE8775B6-BE30-459D-9DDB-F3D0B3FF573D@yesql.se>
> > On 04 Nov 2016, at 08:34, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > I'm not sure where the discussion on this is going, but these
> > patches make me think about Perl coding style.
> 
> Some of this can absolutely be considered style and more or less down to
> personal preference.  I haven’t seen any coding conventions for Perl so I
> assume it’s down to consensus among the committers.  My rationale for these
> patches in the first place was that I perceived this thread to partly want to
> clean up the code and make it more modern Perl.
> 
> > The distinction between the executable scripts and the library is
> > intentional, though on an obscure basis. The existing scripts
> > receive fewer modifications, while the library uses more restricted
> > scopes to avoid the troubles caused by using the global scope. But I
> > don't have a clear preference on that. The TAP test scripts use OO
> > notation, but I'm not sure convutils.pl would be better off in the
> > same notation. It will rarely be edited hereafter and won't grow any
> > more.
> 
> I think the current convutils module is fine and converting it to OO would be
> overkill.

Agreed.

> > As far as I see the obvious bug fixes in the patchset are the
> > following,
> 
> Agreed, with some comments:
> 
> > - 0007: load_maptable forgets to close its input file.
> 
> An interesting note on this is that it’s not even a bug =) Since $in is a
> scalar reference, there is no need to explicitly close() the filehandle since
> the reference counter will close it on leaving scope, but there’s no harm in
> doing it ourselves and it also makes for less confusion for anyone not familiar
> with Perl internals.

Wow. I didn't know that Perl had such a hidden OO
feature. Nevertheless, the implicit close is not friendly to those
who are not familiar with newer Perl.

Your comment led me to check the requirements for building PostgreSQL.

https://www.postgresql.org/docs/devel/static/install-requirements.html

| Perl 5.8 or later is needed to build from a Git checkout, or if
| you changed the input files for any of the build steps that use
| Perl scripts. If building on Windows you will need Perl in any
| case. Perl is also required to run some test suites.

So, we should assume Perl 5.8 (released in 2002!) at build
time. In practice it's 5.10 on RedHat 6.4 and 5.16 in my
environment (CentOS 7.2); the official docs are at 5.24, and
ActivePerl is at 5.24. According to this, we should use syntax that
is supported as of 5.8 and not obsolete as of 5.24, and otherwise
follow the latest conventions. But not OO. (Though I can't squeeze
a concrete syntax set out of these requirements :( )


> > - 0010: the comment for load_maptables is wrong.
> 
> There is also a fix for a typo in make_mapchecker.pl
> 
> > - 0011: a hash reference is incorrectly dereferenced
> > 
> > All fixes other than the above three seem to be styling or
> > syntax-generation issues, and I don't know whether any
> > recommendation exists…
> 
> I think there are some more fixes remaining that aren't just styling/syntax.  I’ll go
> through the patches one by one:
> 
> 0007 - While this might be considered styling/syntax, my $0.02 is that it’s not,
> but instead a worthwhile change.  I’ll illustrate with an example from the
> patch in question:
> 
> Using a bareword global variable in open() for the filehandle was replaced with
> the three-part form in 5.6, and is now actively discouraged in the
> Perl documentation (and has been since the 5.20 docs).  The problem is that
> such filehandles are global and can thus easily clash; so easily, in fact, that the
> 0007 patch fixes one such occurrence:

That's what should be adopted in the criteria above.
 - Don't use bareword globals.
 - Use open() with a separate MODE argument.

> print_radix_map() opens the file in the global filehandle OUT and passes it to
> print_radix_table() with the typeglob *OUT; print_radix_table() in turn passes
> the filehandle to print_segmented_table() which writes to the file using the
> parameter $hd, except in one case where it uses the global OUT variable without
> knowing it will be the right file.  This is where the hunk below in 0007 comes
> in:
> 
> -               print OUT "$line\n";
> +               print { $$hd } "$line\n";
> 
> In this case OUT references the right file and it produces the right result,
> but it illustrates how easy it is to get wrong (which can cause very subtle
> bugs).  So, when poking at this code we might as well, IMHO, use what is today
> in Perl considered the right way to deal with filehandle references.

Thanks for the details. OK, I'll change the style accordingly in the
next patch.

> Using implicit filemodes can also introduce bugs when opening filenames passed
> in from the outside as we do in UCS_to_most.pl.  Considering the use case of
> these scripts it’s obviously quite low on the list of risks but still.

Ok, I'll do that.

> 0008 - I don’t think there are any recommendations on whether or not to use
> "use strict;" in the codebase; there certainly are lots of scripts not doing it.
> Personally I think it’s good hygiene to always use strict, but here it might
> just be janitorial nitpicking (which I too am guilty of liking..  =)).

I use strict as an amulet (or armor) against typing incorrect
symbols. Breaking well-working scripts by adding strict is not
reasonable, but using it in new or heavily rewritten scripts is. I
have changed my mind and will use the newer style when rewriting the
existing scripts.

> 0009 - local $var; is to provide a temporary value of $var, where $var exists,
> for the current scope (and was mostly used back in Perl 4).  Since we are
> passing by value to ucs2utf(), and creating $utf inside it, using my to create
> the variable is the right thing even though the end result is the same.

Yes, you're right. My point was that the difference between
lexical and dynamic scoping makes no difference in this
context. But I'll rewrite it according to the new policy.

> 0010 and 0011 are already dealt with above.
> 
> So to summarize, I think there are a few more (though not all) hunks of
> interest that aren’t just syntax/style, and that can serve to make the code
> easier to read and work with down the line should we need to.

In addition to this, in the next patch I'll remove the existing
authority files and modify the radix generator so that it can read
plain map files.

Thank you for the valuable comments.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Hello,

At Mon, 7 Nov 2016 17:19:29 +0100, Daniel Gustafsson <daniel@yesql.se> wrote in
<39E295B9-7391-40B6-911D-FE852E4604BD@yesql.se>
> > On 07 Nov 2016, at 12:32, Daniel Gustafsson <daniel@yesql.se> wrote:
> > 
> >> On 04 Nov 2016, at 08:34, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >> 
> >> I'm not sure where the discussion on this is going, but these
> >> patches make me think about Perl coding style.
> > 
> > Some of this can absolutely be considered style and more or less down to
> > personal preference.  I haven’t seen any coding conventions for Perl so I
> > assume it’s down to consensus among the committers.
> 
> Actually, scratch that; there is of course a perltidy profile in the pgindent
> directory.  I should avoid sending email before coffee..

Hmm.  Somehow perl-mode in my Emacs produces incomprehensible
indentation, which I correct by hand, so it is highly probable
that the style of this patch is not compatible with the defined
style. Anyway, it is better if pgindent generates a smaller patch,
so I'll try it.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Hello, this is the revised patch; it applies on top of the previous
patch.

The differences in the map files are enormous but useless for
discussion, so they aren't included here. (They can be regenerated.)

This still doesn't remove the three .txt/.xml files, since that would
heavily bloat the patch. I plan to remove them in the final
shape. All authority files, including the removed ones, are
automatically downloaded by the Makefile in this patch.

At Tue, 08 Nov 2016 10:43:56 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20161108.104356.265607041.horiguchi.kyotaro@lab.ntt.co.jp>
> https://www.postgresql.org/docs/devel/static/install-requirements.html
> 
> | Perl 5.8 or later is needed to build from a Git checkout, or if
> | you changed the input files for any of the build steps that use
> | Perl scripts. If building on Windows you will need Perl in any
> | case. Perl is also required to run some test suites.
> 
> So, we should assume Perl 5.8 (released in 2002!) at build
> time. In practice it's 5.10 on RedHat 6.4 and 5.16 in my
> environment (CentOS 7.2); the official docs are at 5.24, and
> ActivePerl is at 5.24. According to this, we should use syntax that
> is supported as of 5.8 and not obsolete as of 5.24, and otherwise
> follow the latest conventions. But not OO. (Though I can't squeeze
> a concrete syntax set out of these requirements :( )
...(forget this for a while..)

Finally, the attached patch contains most of (virtually all of)
Daniel's suggestions, plus some modifications by pgperltidy.

> In addition to this, in the next patch I'll remove the existing
> authority files and modify the radix generator so that it can read
> plain map files.

So, I think the attached are in rather modern shape.

At Tue, 08 Nov 2016 11:02:58 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20161108.110258.59832499.horiguchi.kyotaro@lab.ntt.co.jp>
> Hmm.  Somehow perl-mode in my Emacs produces incomprehensible
> indentation, which I correct by hand, so it is highly probable
> that the style of this patch is not compatible with the defined
> style. Anyway, it is better if pgindent generates a smaller patch,
> so I'll try it.

The attached files have been run through pgperltidy. Several regions,
such as the additional character lists, are marked not to be edited.

One concern is what 'make distclean' and 'make maintainer-clean'
should leave behind. The former should remove the authority *.TXT
files, since they shouldn't be in the source archive. On the other
hand, it is more convenient if the latter leaves them. This seems
somewhat strange, but I can't come up with better behavior for now.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Radix tree for character conversion

From
Daniel Gustafsson
Date:
> On 08 Nov 2016, at 12:21, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> Hello, this is the revised patch; it applies on top of the previous
> patch.
>
> ...
>
> Finally, the attached patch contains most of (virtually all of)
> Daniel's suggestions, plus some modifications by pgperltidy.

Reading over this it looks good to me.  I did spot one thing I had missed
before though: the error message below should reference the scalar variable
'direction', since % does not interpolate in a double-quoted string (the
message would print the literal text "%direction"):

-    die "unacceptable direction : %direction"
+    die "unacceptable direction : $direction"      if ($direction ne "to_unicode" && $direction ne "from_unicode");

With this, I would consider this ready for committer.

>> Addition to this, I'll remove existing authority files and modify
>> radix generator so that it can read plain map files in the next
>> patch.
>
> So, I think the attached are in rather modern shape.

+1, nice work!

cheers ./daniel


Re: Radix tree for character conversion

From
Peter Eisentraut
Date:
On 10/31/16 12:11 PM, Daniel Gustafsson wrote:
> I took a small stab at doing some cleaning of the Perl scripts, mainly around
> using the more modern (well, modern as in +15 years old) form for open(..),
> avoiding global filehandles for passing scalar references and enforcing use
> strict.  Some smaller typos and fixes were also included.  It seems my Perl has
> become a bit rusty so I hope the changes make sense.  The produced files are
> identical with these patches applied, they are merely doing cleaning as opposed
> to bugfixing.
> 
> The attached patches are against the 0001-0006 patches from Heikki and you in
> this series of emails, the separation is intended to make them easier to read.

Cool.  See also here:
https://www.postgresql.org/message-id/55E52225.4040305%40gmx.net

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Radix tree for character conversion

From
Daniel Gustafsson
Date:
> On 08 Nov 2016, at 17:37, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>
> On 10/31/16 12:11 PM, Daniel Gustafsson wrote:
>> I took a small stab at doing some cleaning of the Perl scripts, mainly around
>> using the more modern (well, modern as in +15 years old) form for open(..),
>> avoiding global filehandles for passing scalar references and enforcing use
>> strict.  Some smaller typos and fixes were also included.  It seems my Perl has
>> become a bit rusty so I hope the changes make sense.  The produced files are
>> identical with these patches applied, they are merely doing cleaning as opposed
>> to bugfixing.
>>
>> The attached patches are against the 0001-0006 patches from Heikki and you in
>> this series of emails, the separation is intended to make them easier to read.
>
> Cool.  See also here:
> https://www.postgresql.org/message-id/55E52225.4040305%40gmx.net

Nice, not having hacked much Perl in quite a while I had all but forgotten
about perlcritic.

Running it on the current version of the patchset yields mostly warnings on
string values used in the require “convutils.pm” statement.  There were however
two more interesting reports: one more open() call not using the three
parameter form and an instance of map which alters the input value.  The latter
is not causing an issue since we don’t use the input list past the map but
fixing it seems like good form.
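
A minimal sketch (with made-up names) of why a substitution inside map has
that side effect:

    use strict;
    use warnings;

    my @files = ('a.map', 'b.map');

    # Inside map, $_ is an alias to the original element, so s/// here
    # rewrites @files itself: it becomes ('a', 'b').
    my @bad = map { s/\.map$//; $_ } @files;

    # Copying $_ first leaves the input list untouched.
    @files = ('a.map', 'b.map');
    my @good = map { my $f = $_; $f =~ s/\.map$//; $f } @files;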

Attached is a patch that addresses the perlcritic reports (running without any
special options).

cheers ./daniel


Attachment

Re: Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Hello, thank you for polishing this.

At Wed, 9 Nov 2016 02:19:01 +0100, Daniel Gustafsson <daniel@yesql.se> wrote in
<80F34F25-BF6D-4BCD-9C38-42ED10D3F453@yesql.se>
> > On 08 Nov 2016, at 17:37, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> > 
> > On 10/31/16 12:11 PM, Daniel Gustafsson wrote:
> >> I took a small stab at doing some cleaning of the Perl scripts, mainly around
> >> using the more modern (well, modern as in +15 years old) form for open(..),
> >> avoiding global filehandles for passing scalar references and enforcing use
> >> strict.  Some smaller typos and fixes were also included.  It seems my Perl has
> >> become a bit rusty so I hope the changes make sense.  The produced files are
> >> identical with these patches applied, they are merely doing cleaning as opposed
> >> to bugfixing.
> >> 
> >> The attached patches are against the 0001-0006 patches from Heikki and you in
> >> this series of emails, the separation is intended to make them easier to read.
> > 
> > Cool.  See also here:
> > https://www.postgresql.org/message-id/55E52225.4040305%40gmx.net

> Nice, not having hacked much Perl in quite a while I had all but forgotten
> about perlcritic.

I tried it on CentOS 7. Installation failed, saying that
Module::Build is too old. It was yum-installed, so I removed it and
installed it with CPAN. That again failed with many 'Could not create
MYMETA files' errors. Then I tried to install CPAN::Meta, and it
failed saying that CPAN::Meta::YAML is too *new*. That sucks.

So your patch is greatly helpful. Thank you.

| -my @mapnames = map { s/\.map//; $_ } values %plainmaps;
| +my @mapnames = map { my $m = $_; $m =~ s/\.map//; $m } values %plainmaps;

It surprised me that perlcritic catches such things.

> Running it on the current version of the patchset yields mostly warnings on
> string values used in the require “convutils.pm” statement.  There were however
> two more interesting reports: one more open() call not using the three
> parameter form and an instance of map which alters the input value. 

Sorry for overlooking it.

> The latter
> is not causing an issue since we don’t use the input list past the map but
> fixing it seems like good form.

Agreed.

> Attached is a patch that addresses the perlcritic reports (running without any
> special options).

Thanks. The attached patch contains the patch by perlcritic.

0001, 2 and 3 are Heikki's patches, unmodified since they were
first proposed. They're a bit too big, so I don't attach them to this
mail (again).

https://www.postgresql.org/message-id/08e7892a-d55c-eefe-76e6-7910bc8dd1f3@iki.fi

0004 is radix-tree stuff, applies on top of the three patches
above.

There's a hidden fifth patch, which is 20MB in size. But it can be
generated by running make in the Unicode directory.

[$(TOP)]$ ./configure ...
[$(TOP)]$ make
[Unicode]$ make
[Unicode]$ make distclean
[Unicode]$ git add .
[Unicode]$ git commit
=== COMMIT MESSAGE
Replace map files with radix tree files.

These encodings no longer use the former map files; they use the new radix
tree files. All existing authority files in this directory are removed.
===

regards,

Re: Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Hello.

I'll be offline until at least next Monday, so I'm moving this to
the next CF myself.

At Wed, 09 Nov 2016 17:38:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20161109.173853.77274443.horiguchi.kyotaro@lab.ntt.co.jp>
> Hello, thank you for polishing this.
> 
> At Wed, 9 Nov 2016 02:19:01 +0100, Daniel Gustafsson <daniel@yesql.se> wrote in
<80F34F25-BF6D-4BCD-9C38-42ED10D3F453@yesql.se>
> > > On 08 Nov 2016, at 17:37, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> > > 
> > > On 10/31/16 12:11 PM, Daniel Gustafsson wrote:
> > >> I took a small stab at doing some cleaning of the Perl scripts, mainly around
> > >> using the more modern (well, modern as in +15 years old) form for open(..),
> > >> avoiding global filehandles for passing scalar references and enforcing use
> > >> strict.  Some smaller typos and fixes were also included.  It seems my Perl has
> > >> become a bit rusty so I hope the changes make sense.  The produced files are
> > >> identical with these patches applied, they are merely doing cleaning as opposed
> > >> to bugfixing.
> > >> 
> > >> The attached patches are against the 0001-0006 patches from Heikki and you in
> > >> this series of emails, the separation is intended to make them easier to read.
> > > 
> > > Cool.  See also here:
> > > https://www.postgresql.org/message-id/55E52225.4040305%40gmx.net
> 
> > Nice, not having hacked much Perl in quite a while I had all but forgotten
> > about perlcritic.
> 
> I tried it on CentOS 7. Installation failed, saying that
> Module::Build is too old. It was yum-installed, so I removed it and
> installed it with CPAN. That again failed with many 'Could not create
> MYMETA files' errors. Then I tried to install CPAN::Meta, and it
> failed saying that CPAN::Meta::YAML is too *new*. That sucks.
> 
> So your patch is greatly helpful. Thank you.
> 
> | -my @mapnames = map { s/\.map//; $_ } values %plainmaps;
> | +my @mapnames = map { my $m = $_; $m =~ s/\.map//; $m } values %plainmaps;
> 
> It surprised me that perlcritic catches such things.
> 
> > Running it on the current version of the patchset yields mostly warnings on
> > string values used in the require “convutils.pm” statement.  There were however
> > two more interesting reports: one more open() call not using the three
> > parameter form and an instance of map which alters the input value. 
> 
> Sorry for overlooking it.
> 
> > The latter
> > is not causing an issue since we don’t use the input list past the map but
> > fixing it seems like good form.
> 
> Agreed.
> 
> > Attached is a patch that addresses the perlcritic reports (running without any
> > special options).
> 
> Thanks. The attached patch contains the patch by perlcritic.
> 
> 0001, 2 and 3 are Heikki's patches, unmodified since they were
> first proposed. They're a bit too big, so I don't attach them to this
> mail (again).
> 
> https://www.postgresql.org/message-id/08e7892a-d55c-eefe-76e6-7910bc8dd1f3@iki.fi
> 
> 0004 is radix-tree stuff, applies on top of the three patches
> above.
> 
> There's a hidden fifth patch, which is 20MB in size. But it can be
> generated by running make in the Unicode directory.
> 
> [$(TOP)]$ ./configure ...
> [$(TOP)]$ make
> [Unicode]$ make
> [Unicode]$ make distclean
> [Unicode]$ git add .
> [Unicode]$ git commit
> === COMMIT MESSAGE
> Replace map files with radix tree files.
> 
> These encodings no longer use the former map files; they use the new radix
> tree files. All existing authority files in this directory are removed.
> ===

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Radix tree for character conversion

From
Heikki Linnakangas
Date:
On 10/31/2016 06:11 PM, Daniel Gustafsson wrote:
>> On 27 Oct 2016, at 09:23, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>
>> At Tue, 25 Oct 2016 12:23:48 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<08e7892a-d55c-eefe-76e6-7910bc8dd1f3@iki.fi>
>>>
>>> [..]
>>> The perl scripts are still quite messy. For example, I lost the checks
>>> for duplicate mappings somewhere along the way - that ought to be put
>>> back. My Perl skills are limited.
>>
>> Perl scripts are to be messy, I believe. Anyway the duplicate
>> check has been built into the sub print_radix_trees. Maybe the
>> same check is needed by some plain map files but it would be just
>> duplication for the maps having radix tree.
>
> I took a small stab at doing some cleaning of the Perl scripts, mainly around
> using the more modern (well, modern as in +15 years old) form for open(..),
> avoiding global filehandles for passing scalar references and enforcing use
> strict.  Some smaller typos and fixes were also included.  It seems my Perl has
> become a bit rusty so I hope the changes make sense.  The produced files are
> identical with these patches applied, they are merely doing cleaning as opposed
> to bugfixing.
>
> The attached patches are against the 0001-0006 patches from Heikki and you in
> this series of emails, the separation is intended to make them easier to read.

Thanks! Patches 0001-0003 seem to have remained mostly unchanged through 
the later discussion and everyone seems to be happy with those patches, so I 
picked the parts of these cleanups of yours that applied to my patches 
0001-0003, and pushed those. I'll continue reviewing the rest..

- Heikki




Re: Radix tree for character conversion

From
Heikki Linnakangas
Date:
On 11/09/2016 10:38 AM, Kyotaro HORIGUCHI wrote:
> Thanks. The attached patch contains the patch by perlcritic.
>
> 0001, 2 and 3 are Heikki's patches, unmodified since they were
> first proposed. They're a bit too big, so I don't attach them to this
> mail (again).
>
> https://www.postgresql.org/message-id/08e7892a-d55c-eefe-76e6-7910bc8dd1f3@iki.fi

I've now pushed these preliminary patches, with the applicable fixes
from you and Daniel. The attached patch is now against git master.

> 0004 is radix-tree stuff, applies on top of the three patches
> above.

I've spent the last couple of days reviewing this. While trying to
understand how it works, I ended up dismantling, rewriting, and putting
back together most of the added perl code. Attached is a new version,
with more straightforward logic, making it more understandable. I find
it more understandable, anyway; I hope that's not only because I wrote it
myself :-). Let me know what you think.

In particular, I found the explanations of flat and segmented tables
really hard to understand. So in this version, the radix trees for a
conversion are stored completely in one large array. Leaf and
intermediate levels are all in the same array. When reading this
version, please note that I'm not sure if I mean the same thing with
"segment" that you did in your version.

I moved the "lower" and "upper" values in the structs. Also, there are
now also separate "lower" and "upper" values for the leaf levels of the
trees, for 1- 2-, 3- and 4-byte inputs. This made a huge difference to
the size of gb18030_to_utf8_radix.map, in particular: the source file
shrank from about 2 MB to 1/2 MB. In that conversion, the valid range
for the last byte of 2-byte inputs is 0x40-0xfe, and the valid range for
the last byte of 4-byte inputs is 0x30-0x39. With the old patch version,
the "chars" range was therefore 0x30-0xfe, to cover both of those, and
most of the array was filled with zeros. With this new patch version, we
store separate ranges for those, and can leave out most of the zeros.

There's a segment full of zeros at the beginning of each conversion
array now. The purpose of that is that when traversing the radix tree,
you don't need to check each intermediate value for 0. If you follow a 0
offset, it simply points to the dummy all-zeros segments in the
beginning. Seemed like a good idea to shave some cycles, although I'm
not sure if it made much difference in reality.
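
If I read the new layout correctly, a lookup just walks the flat array one
input byte at a time. Roughly, as a Perl sketch of a two-byte lookup (all
names and values here are invented; the real lookup is the C function
pg_mb_radix_conv, and the real segments are trimmed to their [lower, upper]
byte ranges rather than being full 256-entry blocks):

    use strict;
    use warnings;

    # One flat array holds both the intermediate and the leaf segments.
    # Indices 0..255 are the dummy all-zeros segment.
    my @tree = (0) x 768;

    # Intermediate segment at offset 256: first byte 0xa1 points at
    # the leaf segment stored at offset 512.
    $tree[256 + 0xa1] = 512;
    # Leaf segment: second byte 0xbc maps to code point 0x3042.
    $tree[512 + 0xbc] = 0x3042;

    sub lookup2
    {
        my ($b1, $b2) = @_;
        my $off = $tree[256 + $b1];    # 0 lands in the zero segment
        return $tree[$off + $b2];      # 0 means "no mapping"
    }

    printf "%#x\n", lookup2(0xa1, 0xbc);    # 0x3042
    printf "%#x\n", lookup2(0x80, 0xbc);    # 0: read from zero segment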

I optimized pg_mb_radix_conv() a bit, too. We could do more. For
example, I think it would save some cycles to have specialized versions
of UtfToLocal and LocalToUtf, moving the tests for whether a combined
character map and/or conversion callback is used, out of the loop. They
feel a bit ugly too, in their current form...

I need a break now, but I'll try to pick this up again some time next
week. Meanwhile, please have a look and tell me what you think.

- Heikki


Attachment

Re: Radix tree for character conversion

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:
> On 11/09/2016 10:38 AM, Kyotaro HORIGUCHI wrote:
> > Thanks. The attached patch contains the patch by perlcritic.
> > 
> > 0001, 2 and 3 are Heikki's patches, unmodified since they were
> > first proposed. They're a bit too big, so I don't attach them to this
> > mail (again).
> > 
> > https://www.postgresql.org/message-id/08e7892a-d55c-eefe-76e6-7910bc8dd1f3@iki.fi
> 
> I've now pushed these preliminary patches, with the applicable fixes from
> you and Daniel. The attached patch is now against git master.

Is this the Nov. 30th commit?  Because I don't see any other commits
from you.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Radix tree for character conversion

From
Heikki Linnakangas
Date:
On 12/02/2016 10:18 PM, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>> On 11/09/2016 10:38 AM, Kyotaro HORIGUCHI wrote:
>>> Thanks. The attached patch contains the patch by perlcritic.
>>>
>>> 0001, 2 and 3 are Heikki's patches, unmodified since they were
>>> first proposed. They're a bit too big, so I don't attach them to this
>>> mail (again).
>>>
>>> https://www.postgresql.org/message-id/08e7892a-d55c-eefe-76e6-7910bc8dd1f3@iki.fi
>>
>> I've now pushed these preliminary patches, with the applicable fixes from
>> you and Daniel. The attached patch is now against git master.
>
> Is this the Nov. 30th commit?  Because I don't see any other commits
> from you.

Yes. Sorry for the confusion.

- Heikki




Re: Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Hello, thank you for reviewing this.

I compared mine and yours. The new patch works fine and gives
smaller radix map files. It also seems more readable to me.

At Fri, 2 Dec 2016 22:07:07 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<da58a154-0b28-802e-5e82-5a205f53926e@iki.fi>
> On 11/09/2016 10:38 AM, Kyotaro HORIGUCHI wrote:
> > Thanks. The attached patch contains the patch by perlcritic.
> >
> > 0001, 2 and 3 are Heikki's patches, unmodified since they were
> > first proposed. They're a bit too big, so I don't attach them to this
> > mail (again).
> >
> > https://www.postgresql.org/message-id/08e7892a-d55c-eefe-76e6-7910bc8dd1f3@iki.fi
> 
> I've now pushed these preliminary patches, with the applicable fixes
> from you and Daniel. The attached patch is now against git master.

Thanks for committing them.

> > 0004 is radix-tree stuff, applies on top of the three patches
> > above.
> 
> I've spent the last couple of days reviewing this. While trying to
> understand how it works, I ended up dismantling, rewriting, and
> putting back together most of the added perl code. 

I might have been putting too much into one structure, and was a bit
too eager to conceal the lower levels from the upper level.

> Attached is a new
> version, with more straightforward logic, making it more
> understandable. I find it more understandable, anyway, I hope it's not
> only because I wrote it myself :-). Let me know what you think.

First, thank you for the refactoring(?).

At first, I didn't intend to replace all of the .map files with radix
files. In the end my patch removes every old-style map file, but I
hadn't noticed that. So removing the old bsearch code seems
reasonable. Avoiding redundant decomposition of multibyte
characters into bytes also seems reasonable from the viewpoint of
efficiency.

The new patch decomposes the structured pg_mb_radix_tree into a
series of (basically) plain member variables in a struct. I'm not
much in favor of that style, but a radix tree of at least four levels
is easily read that way, and the code handling it is arguably easier
to read as well. (So +1 for it.)

> In particular, I found the explanations of flat and segmented tables
> really hard to understand. So in this version, the radix trees for a
> conversion are stored completely in one large array. Leaf and
> intermediate levels are all in the same array. When reading this
> version, please note that I'm not sure if I mean the same thing with
> "segment" that you did in your version.
> I moved the "lower" and "upper" values in the structs. Also, there are
> now also separate "lower" and "upper" values for the leaf levels of
> the trees, for 1- 2-, 3- and 4-byte inputs. This made a huge

The "segment" there seems to mean definitely the same to
mine. Flattening the on-memory structure is fine from the same
reason to the above.

> difference to the size of gb18030_to_utf8_radix.map, in particular:
> the source file shrank from about 2 MB to 1/2 MB. In that conversion,
> the valid range for the last byte of 2-byte inputs is 0x40-0xfe, and
> the valid range for the last byte of 4-byte inputs is 0x30-0x39. With
> the old patch version, the "chars" range was therefore 0x30-0xfe, to
> cover both of those, and most of the array was filled with zeros. With
> this new patch version, we store separate ranges for those, and can
> leave out most of the zeros.

Great. I agree that the (logically) divided character table is
significantly space-efficient.

> There's a segment full of zeros at the beginning of each conversion
> array now. The purpose of that is that when traversing the radix tree,
> you don't need to check each intermediate value for 0. If you follow a
> 0 offset, it simply points to the dummy all-zeros segments in the
> beginning. Seemed like a good idea to shave some cycles, although I'm
> not sure if it made much difference in reality.

And I like the zero page.

> I optimized pg_mb_radix_conv() a bit, too. We could do more. For
> example, I think it would save some cycles to have specialized
> versions of UtfToLocal and LocalToUtf, moving the tests for whether a
> combined character map and/or conversion callback is used, out of the
> loop. They feel a bit ugly too, in their current form...

Hmm. Maybe decomposing iiso in pg_mb_radix_conv is faster than
pushing 3 (or 4) extra parameters onto the stack (or is that wrong
if they are passed in registers?), but I'm not sure.

> I need a break now, but I'll try to pick this up again some time next
> week. Meanwhile, please have a look and tell me what you think.

Thank you very much for the big effort on this patch.

Apart from the above, I have some trivial comments on the new
version.


1. If we decide not to use old-style maps, UtfToLocal no longer needs to take void * as map data. (Patch 0001)

2. "use Data::Dumper" doesn't seem necessary. (Patch 0002)

3. A comment contains a superfluous comma. (Patch 0002) (The last
   byte of the first line below)

   > ### The segments are written out physically to one big array in the final,
   > ### step, but logically, they form a radix tree. Or rather, four radix

4. The following code doesn't seem so perl'ish.

     for (my $i=0; $i <= 0xff; $i++)
     {
       my $val = $seg->{values}->{$i};
       if ($val)
       {
         $this_min = $i if (!defined $this_min || $i < $this_min);

   Refraining from proposing extreme Perl, the following would be
   reasonable as an equivalent. (Patch 0002)

     foreach my $i (keys %{$seg->{values}})
     {

   The 0002 patch contains the following change, but this might be
   kinda extreme..

   -    for (my $i=0; $i <= 0xff; $i++)
   +    while ((my $i, my $val) = each $map)

5. download_srctxts.sh is no longer needed. (No patch)


I'll give the new version more consideration and post another
version later.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Hello, I have looked at this more closely.

The attached is the revised version of this patch.

At Mon, 05 Dec 2016 19:29:54 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20161205.192954.121855559.horiguchi.kyotaro@lab.ntt.co.jp>
> Apart from the above, I have some trivial comments on the new
> version.
> 
> 
> 1. If we decide not to use old-style maps, UtfToLocal no longer
>   needs to take void * as map data. (Patch 0001)
> 2. "use Data::Dumper" doesn't seem necessary. (Patch 0002)
> 3. A comment contains a superfluous comma. (Patch 0002) (The last
>    byte of the first line below)
> 4. The following code doesn't seem so perl'ish.
> 5. download_srctxts.sh is no longer needed. (No patch)

6. Fixed some inconsistent indentation/folding.
7. Fixed the handling of $verbose.
8. Sorted segments by their leading bytes.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/utils/mb/Unicode/Makefile b/src/backend/utils/mb/Unicode/Makefile
index 0345a36..f184f65 100644
--- a/src/backend/utils/mb/Unicode/Makefile
+++ b/src/backend/utils/mb/Unicode/Makefile
@@ -158,9 +158,6 @@ gb-18030-2000.xml windows-949-2000.xml:
 euc-jis-2004-std.txt sjis-0213-2004-std.txt:
 	$(DOWNLOAD) http://x0213.org/codetable/$(@F)
 
-gb-18030-2000.xml windows-949-2000.xml:
-	$(DOWNLOAD) https://ssl.icu-project.org/repos/icu/data/trunk/charset/data/xml/$(@F)
-
 GB2312.TXT:
 	$(DOWNLOAD) 'http://trac.greenstone.org/browser/trunk/gsdl/unicode/MAPPINGS/EASTASIA/GB/GB2312.TXT?rev=1842&format=txt'
@@ -176,7 +173,7 @@ KOI8-R.TXT KOI8-U.TXT:
 $(ISO8859TEXTS):
 	$(DOWNLOAD) http://ftp.unicode.org/Public/MAPPINGS/ISO8859/$(@F)
 
-$(filter-out CP8%,$(WINTEXTS)) CP932.TXT CP950.TXT:
+$(filter-out CP8%,$(WINTEXTS)) $(filter CP9%, $(SPECIALTEXTS)):
 	$(DOWNLOAD) http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/$(@F)
 $(filter CP8%,$(WINTEXTS)):
diff --git a/src/backend/utils/mb/Unicode/UCS_to_BIG5.pl b/src/backend/utils/mb/Unicode/UCS_to_BIG5.pl
index 822ab28..62e5145 100755
--- a/src/backend/utils/mb/Unicode/UCS_to_BIG5.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_BIG5.pl
@@ -51,7 +51,9 @@ foreach my $i (@$cp950txt)
 		  { code      => $code,
 			ucs       => $ucs,
 			comment   => $i->{comment},
-			direction => "both" };
+			direction => "both",
+			f         => $i->{f},
+			l         => $i->{l} };
 	}
 }
@@ -70,6 +72,6 @@ foreach my $i (@$all)
 }
 
 # Output
-print_tables($this_script, "BIG5", $all);
+print_tables($this_script, "BIG5", $all, 1);
 print_radix_trees($this_script, "BIG5", $all);
diff --git a/src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl b/src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl
index a933c12..299beec 100755
--- a/src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl
@@ -72,9 +72,11 @@ while (<$in>)
 	push @mapping,
 	  { ucs       => $ucs,
 		code      => $code,
-		direction => 'both' };
+		direction => 'both',
+		f         => $in_file,
+		l         => $. };
 }
 close($in);
 
-print_tables($this_script, "EUC_CN", \@mapping);
+print_tables($this_script, "EUC_CN", \@mapping, 1);
 print_radix_trees($this_script, "EUC_CN", \@mapping);
diff --git a/src/backend/utils/mb/Unicode/UCS_to_EUC_JIS_2004.pl b/src/backend/utils/mb/Unicode/UCS_to_EUC_JIS_2004.pl
index 1bf7f2e..fea03df 100755
--- a/src/backend/utils/mb/Unicode/UCS_to_EUC_JIS_2004.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_EUC_JIS_2004.pl
@@ -31,12 +31,14 @@ while (my $line = <$in>)
 		my $ucs1 = hex($u1);
 		my $ucs2 = hex($u2);
 
-		push @all, { direction => 'both',
-					 ucs => $ucs1,
-					 ucs_second => $ucs2,
-					 code => $code,
-					 comment => $rest };
-		next;
+		push @all,
+		  { direction  => 'both',
+			ucs        => $ucs1,
+			ucs_second => $ucs2,
+			code       => $code,
+			comment    => $rest,
+			f          => $in_file,
+			l          => $. };
 	}
 	elsif ($line =~ /^0x(.*)[ \t]*U\+(.*)[ \t]*#(.*)$/)
 	{
@@ -47,7 +49,13 @@ while (my $line = <$in>)
 		next if ($code < 0x80 && $ucs < 0x80);
 
-		push @all, { direction => 'both', ucs => $ucs, code => $code, comment => $rest };
+		push @all,
+		  { direction => 'both',
+			ucs       => $ucs,
+			code      => $code,
+			comment   => $rest,
+			f         => $in_file,
+			l         => $. };
 	}
 }
 close($in);
diff --git a/src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl b/src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl
index 5ac3542..9dcb9e2 100755
--- a/src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl
@@ -108,98 +108,98 @@ foreach my $i (@mapping)
 }
 
 push @mapping, (
-     {direction => 'both', ucs => 0x4efc, code => 0x8ff4af, comment => '# CJK(4EFC)'},
-     {direction => 'both', ucs => 0x50f4, code => 0x8ff4b0, comment => '# CJK(50F4)'},
-     {direction => 'both', ucs => 0x51EC, code => 0x8ff4b1, comment => '# CJK(51EC)'},
-     {direction => 'both', ucs => 0x5307, code => 0x8ff4b2, comment => '# CJK(5307)'},
-     {direction => 'both', ucs => 0x5324, code => 0x8ff4b3, comment => '# CJK(5324)'},
-     {direction => 'both', ucs => 0x548A, code => 0x8ff4b5, comment => '# CJK(548A)'},
-     {direction => 'both', ucs => 0x5759, code => 0x8ff4b6, comment => '# CJK(5759)'},
-     {direction => 'both', ucs => 0x589E, code => 0x8ff4b9, comment => '# CJK(589E)'},
-     {direction => 'both', ucs => 0x5BEC, code => 0x8ff4ba, comment => '# CJK(5BEC)'},
-     {direction => 'both', ucs => 0x5CF5, code => 0x8ff4bb, comment => '# CJK(5CF5)'},
-     {direction => 'both', ucs => 0x5D53, code => 0x8ff4bc, comment => '# CJK(5D53)'},
-     {direction => 'both', ucs => 0x5FB7, code => 0x8ff4be, comment => '# CJK(5FB7)'},
-     {direction => 'both', ucs => 0x6085, code => 0x8ff4bf, comment => '# CJK(6085)'},
-     {direction => 'both', ucs => 0x6120, code => 0x8ff4c0, comment => '# CJK(6120)'},
-     {direction => 'both', ucs => 0x654E, code => 0x8ff4c1, comment => '# CJK(654E)'},
-     {direction => 'both', ucs => 0x663B, code => 0x8ff4c2, comment => '# CJK(663B)'},
-     {direction => 'both', ucs => 0x6665, code => 0x8ff4c3, comment => '# CJK(6665)'},
-     {direction => 'both', ucs => 0x6801, code => 0x8ff4c6, comment => '# CJK(6801)'},
-     {direction => 'both', ucs => 0x6A6B, code => 0x8ff4c9, comment => '# CJK(6A6B)'},
-     {direction => 'both', ucs => 0x6AE2, code => 0x8ff4ca, comment => '# CJK(6AE2)'},
-     {direction => 'both', ucs => 0x6DF2, code => 0x8ff4cc, comment => '# CJK(6DF2)'},
-     {direction => 'both', ucs => 0x6DF8, code => 0x8ff4cb, comment => '# CJK(6DF8)'},
-     {direction => 'both', ucs => 0x7028, code => 0x8ff4cd, comment => '# CJK(7028)'},
-     {direction => 'both', ucs => 0x70BB, code => 0x8ff4ae, comment => '# CJK(70BB)'},
-     {direction => 'both', ucs => 0x7501, code => 0x8ff4d0, comment => '# CJK(7501)'},
-     {direction => 'both', ucs => 0x7682, code => 0x8ff4d1, comment => '# CJK(7682)'},
-     {direction => 'both', ucs => 0x769E, code => 0x8ff4d2, comment => '# CJK(769E)'},
-     {direction => 'both', ucs => 0x7930, code => 0x8ff4d4, comment => '# CJK(7930)'},
-     {direction => 'both', ucs => 0x7AE7, code => 0x8ff4d9, comment => '# CJK(7AE7)'},
-     {direction => 'both', ucs => 0x7DA0, code => 0x8ff4dc, comment => '# CJK(7DA0)'},
-     {direction => 'both', ucs => 0x7DD6, code => 0x8ff4dd, comment => '# CJK(7DD6)'},
-     {direction => 'both', ucs => 0x8362, code => 0x8ff4df, comment => '# CJK(8362)'},
-     {direction => 'both', ucs => 0x85B0, code => 0x8ff4e1, comment => '# CJK(85B0)'},
-     {direction => 'both', ucs => 0x8807, code => 0x8ff4e4, comment => '# CJK(8807)'},
-     {direction => 'both', ucs => 0x8B7F, code => 0x8ff4e6, comment => '# CJK(8B7F)'},
-     {direction => 'both', ucs => 0x8CF4, code => 0x8ff4e7, comment => '# CJK(8CF4)'},
-     {direction => 'both', ucs => 0x8D76, code => 0x8ff4e8, comment => '# CJK(8D76)'},
-     {direction => 'both', ucs => 0x90DE, code => 0x8ff4ec, comment => '# CJK(90DE)'},
-     {direction => 'both', ucs => 0x9115, code => 0x8ff4ee, comment => '# CJK(9115)'},
-     {direction => 'both', ucs => 0x9592, code => 0x8ff4f1, comment => '# CJK(9592)'},
-     {direction => 'both', ucs => 0x973B, code => 0x8ff4f4, comment => '# CJK(973B)'},
-     {direction => 'both', ucs => 0x974D, code => 0x8ff4f5, comment => '# CJK(974D)'},
-     {direction => 'both', ucs => 0x9751, code => 0x8ff4f6, comment => '# CJK(9751)'},
-     {direction => 'both', ucs => 0x999E, code => 0x8ff4fa, comment => '# CJK(999E)'},
-     {direction => 'both', ucs => 0x9AD9, code => 0x8ff4fb, comment => '# CJK(9AD9)'},
-     {direction => 'both', ucs => 0x9B72, code => 0x8ff4fc, comment => '# CJK(9B72)'},
-     {direction => 'both', ucs => 0x9ED1, code => 0x8ff4fe, comment => '# CJK(9ED1)'},
-     {direction => 'both', ucs => 0xF929, code => 0x8ff4c5, comment => '# CJK COMPATIBILITY IDEOGRAPH-F929'},
-     {direction => 'both', ucs => 0xF9DC, code => 0x8ff4f2, comment => '# CJK COMPATIBILITY IDEOGRAPH-F9DC'},
-     {direction => 'both', ucs => 0xFA0E, code => 0x8ff4b4, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA0E'},
-     {direction => 'both', ucs => 0xFA0F, code => 0x8ff4b7, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA0F'},
-     {direction => 'both', ucs => 0xFA10, code => 0x8ff4b8, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA10'},
-     {direction => 'both', ucs => 0xFA11, code => 0x8ff4bd, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA11'},
-     {direction => 'both', ucs => 0xFA12, code => 0x8ff4c4, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA12'},
-     {direction => 'both', ucs => 0xFA13, code => 0x8ff4c7, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA13'},
-     {direction => 'both', ucs => 0xFA14, code => 0x8ff4c8, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA14'},
-     {direction => 'both', ucs => 0xFA15, code => 0x8ff4ce, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA15'},
-     {direction => 'both', ucs => 0xFA16, code => 0x8ff4cf, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA16'},
-     {direction => 'both', ucs => 0xFA17, code => 0x8ff4d3, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA17'},
-     {direction => 'both', ucs => 0xFA18, code => 0x8ff4d5, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA18'},
-     {direction => 'both', ucs => 0xFA19, code => 0x8ff4d6, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA19'},
-     {direction => 'both', ucs => 0xFA1A, code => 0x8ff4d7, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA1A'},
-     {direction => 'both', ucs => 0xFA1B, code => 0x8ff4d8, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA1B'},
-     {direction => 'both', ucs => 0xFA1C, code => 0x8ff4da, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA1C'},
-     {direction => 'both', ucs => 0xFA1D, code => 0x8ff4db, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA1D'},
-     {direction => 'both', ucs => 0xFA1E, code => 0x8ff4de, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA1E'},
-     {direction => 'both', ucs => 0xFA1F, code => 0x8ff4e0, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA1F'},
-     {direction => 'both', ucs => 0xFA20, code => 0x8ff4e2, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA20'},
-     {direction => 'both', ucs => 0xFA21, code => 0x8ff4e3, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA21'},
-     {direction => 'both', ucs => 0xFA22, code => 0x8ff4e5, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA22'},
-     {direction => 'both', ucs => 0xFA23, code => 0x8ff4e9, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA23'},
-     {direction => 'both', ucs => 0xFA24, code => 0x8ff4ea, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA24'},
-     {direction => 'both', ucs => 0xFA25, code => 0x8ff4eb, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA25'},
-     {direction => 'both', ucs => 0xFA26, code => 0x8ff4ed, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA26'},
-     {direction => 'both', ucs => 0xFA27, code => 0x8ff4ef, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA27'},
-     {direction => 'both', ucs => 0xFA28, code => 0x8ff4f0, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA28'},
-     {direction => 'both', ucs => 0xFA29, code => 0x8ff4f3, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA29'},
-     {direction => 'both', ucs => 0xFA2A, code => 0x8ff4f7, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA2A'},
-     {direction => 'both', ucs => 0xFA2B, code => 0x8ff4f8, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA2B'},
-     {direction => 'both', ucs => 0xFA2C, code => 0x8ff4f9, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA2C'},
-     {direction => 'both', ucs => 0xFA2D, code => 0x8ff4fd, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA2D'},
-     {direction => 'both', ucs => 0xFF07, code => 0x8ff4a9, comment => '# FULLWIDTH APOSTROPHE'},
-     {direction => 'both', ucs => 0xFFE4, code => 0x8fa2c3, comment => '# FULLWIDTH BROKEN BAR'},
+     {direction => 'both', ucs => 0x4efc, code => 0x8ff4af, comment => '# CJK(4EFC)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x50f4, code => 0x8ff4b0, comment => '# CJK(50F4)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x51EC, code => 0x8ff4b1, comment => '# CJK(51EC)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x5307, code => 0x8ff4b2, comment => '# CJK(5307)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x5324, code => 0x8ff4b3, comment => '# CJK(5324)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x548A, code => 0x8ff4b5, comment => '# CJK(548A)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x5759, code => 0x8ff4b6, comment => '# CJK(5759)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x589E, code => 0x8ff4b9, comment => '# CJK(589E)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x5BEC, code => 0x8ff4ba, comment => '# CJK(5BEC)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x5CF5, code => 0x8ff4bb, comment => '# CJK(5CF5)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x5D53, code => 0x8ff4bc, comment => '# CJK(5D53)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x5FB7, code => 0x8ff4be, comment => '# CJK(5FB7)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x6085, code => 0x8ff4bf, comment => '# CJK(6085)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x6120, code => 0x8ff4c0, comment => '# CJK(6120)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x654E, code => 0x8ff4c1, comment => '# CJK(654E)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x663B, code => 0x8ff4c2, comment => '# CJK(663B)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x6665, code => 0x8ff4c3, comment => '# CJK(6665)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x6801, code => 0x8ff4c6, comment => '# CJK(6801)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x6A6B, code => 0x8ff4c9, comment => '# CJK(6A6B)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x6AE2, code => 0x8ff4ca, comment => '# CJK(6AE2)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x6DF2, code => 0x8ff4cc, comment => '# CJK(6DF2)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x6DF8, code => 0x8ff4cb, comment => '# CJK(6DF8)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x7028, code => 0x8ff4cd, comment => '# CJK(7028)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x70BB, code => 0x8ff4ae, comment => '# CJK(70BB)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x7501, code => 0x8ff4d0, comment => '# CJK(7501)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x7682, code => 0x8ff4d1, comment => '# CJK(7682)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x769E, code => 0x8ff4d2, comment => '# CJK(769E)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x7930, code => 0x8ff4d4, comment => '# CJK(7930)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x7AE7, code => 0x8ff4d9, comment => '# CJK(7AE7)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x7DA0, code => 0x8ff4dc, comment => '# CJK(7DA0)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x7DD6, code => 0x8ff4dd, comment => '# CJK(7DD6)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x8362, code => 0x8ff4df, comment => '# CJK(8362)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x85B0, code => 0x8ff4e1, comment => '# CJK(85B0)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x8807, code => 0x8ff4e4, comment => '# CJK(8807)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x8B7F, code => 0x8ff4e6, comment => '# CJK(8B7F)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x8CF4, code => 0x8ff4e7, comment => '# CJK(8CF4)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x8D76, code => 0x8ff4e8, comment => '# CJK(8D76)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x90DE, code => 0x8ff4ec, comment => '# CJK(90DE)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x9115, code => 0x8ff4ee, comment => '# CJK(9115)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x9592, code => 0x8ff4f1, comment => '# CJK(9592)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x973B, code => 0x8ff4f4, comment => '# CJK(973B)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x974D, code => 0x8ff4f5, comment => '# CJK(974D)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x9751, code => 0x8ff4f6, comment => '# CJK(9751)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x999E, code => 0x8ff4fa, comment => '# CJK(999E)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x9AD9, code => 0x8ff4fb, comment => '# CJK(9AD9)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x9B72, code => 0x8ff4fc, comment => '# CJK(9B72)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0x9ED1, code => 0x8ff4fe, comment => '# CJK(9ED1)', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xF929, code => 0x8ff4c5, comment => '# CJK COMPATIBILITY IDEOGRAPH-F929', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xF9DC, code => 0x8ff4f2, comment => '# CJK COMPATIBILITY IDEOGRAPH-F9DC', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA0E, code => 0x8ff4b4, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA0E', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA0F, code => 0x8ff4b7, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA0F', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA10, code => 0x8ff4b8, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA10', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA11, code => 0x8ff4bd, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA11', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA12, code => 0x8ff4c4, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA12', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA13, code => 0x8ff4c7, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA13', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA14, code => 0x8ff4c8, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA14', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA15, code => 0x8ff4ce, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA15', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA16, code => 0x8ff4cf, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA16', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA17, code => 0x8ff4d3, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA17', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA18, code => 0x8ff4d5, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA18', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA19, code => 0x8ff4d6, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA19', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA1A, code => 0x8ff4d7, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA1A', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA1B, code => 0x8ff4d8, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA1B', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA1C, code => 0x8ff4da, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA1C', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA1D, code => 0x8ff4db, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA1D', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA1E, code => 0x8ff4de, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA1E', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA1F, code => 0x8ff4e0, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA1F', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA20, code => 0x8ff4e2, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA20', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA21, code => 0x8ff4e3, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA21', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA22, code => 0x8ff4e5, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA22', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA23, code => 0x8ff4e9, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA23', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA24, code => 0x8ff4ea, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA24', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA25, code => 0x8ff4eb, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA25', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA26, code => 0x8ff4ed, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA26', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA27, code => 0x8ff4ef, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA27', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA28, code => 0x8ff4f0, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA28', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA29, code => 0x8ff4f3, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA29', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA2A, code => 0x8ff4f7, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA2A', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA2B, code => 0x8ff4f8, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA2B', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA2C, code => 0x8ff4f9, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA2C', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFA2D, code => 0x8ff4fd, comment => '# CJK COMPATIBILITY IDEOGRAPH-FA2D', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFF07, code => 0x8ff4a9, comment => '# FULLWIDTH APOSTROPHE', f => $this_script, l => __LINE__},
+     {direction => 'both', ucs => 0xFFE4, code => 0x8fa2c3, comment => '# FULLWIDTH BROKEN BAR', f => $this_script, l => __LINE__},
 
      # additional conversions for EUC_JP -> UTF-8 conversion
-     {direction => 'to_unicode', ucs => 0x2116, code => 0x8ff4ac, comment => '# NUMERO SIGN'},
-     {direction => 'to_unicode', ucs => 0x2121, code => 0x8ff4ad, comment => '# TELEPHONE SIGN'},
-     {direction => 'to_unicode', ucs => 0x3231, code => 0x8ff4ab, comment => '# PARENTHESIZED IDEOGRAPH STOCK'}
+     {direction => 'to_unicode', ucs => 0x2116, code => 0x8ff4ac, comment => '# NUMERO SIGN', f => $this_script, l => __LINE__},
+     {direction => 'to_unicode', ucs => 0x2121, code => 0x8ff4ad, comment => '# TELEPHONE SIGN', f => $this_script, l => __LINE__},
+     {direction => 'to_unicode', ucs => 0x3231, code => 0x8ff4ab, comment => '# PARENTHESIZED IDEOGRAPH STOCK', f => $this_script, l => __LINE__}
     );
 #>>>
 
-print_tables($this_script, "EUC_JP", \@mapping);
+print_tables($this_script, "EUC_JP", \@mapping, 1);print_radix_trees($this_script, "EUC_JP", \@mapping);
diff --git a/src/backend/utils/mb/Unicode/UCS_to_EUC_KR.pl b/src/backend/utils/mb/Unicode/UCS_to_EUC_KR.pl
index d17d777..baa3f9c 100755
--- a/src/backend/utils/mb/Unicode/UCS_to_EUC_KR.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_EUC_KR.pl
@@ -33,11 +33,11 @@ foreach my $i (@$mapping)
 
 # Some extra characters that are not in KSX1001.TXT
 #<<< do not let perltidy touch this
 push @$mapping,(
-    {direction => 'both', ucs => 0x20AC, code => 0xa2e6, comment => '# EURO SIGN'},
-    {direction => 'both', ucs => 0x00AE, code => 0xa2e7, comment => '# REGISTERED SIGN'},
-    {direction => 'both', ucs => 0x327E, code => 0xa2e8, comment => '# CIRCLED HANGUL IEUNG U'}
+    {direction => 'both', ucs => 0x20AC, code => 0xa2e6, comment => '# EURO SIGN', f => $this_script, l => __LINE__},
+    {direction => 'both', ucs => 0x00AE, code => 0xa2e7, comment => '# REGISTERED SIGN', f => $this_script, l => __LINE__},
+    {direction => 'both', ucs => 0x327E, code => 0xa2e8, comment => '# CIRCLED HANGUL IEUNG U', f => $this_script, l => __LINE__}
     );
 #>>>
 
-print_tables($this_script, "EUC_KR", $mapping);
+print_tables($this_script, "EUC_KR", $mapping, 1);
 print_radix_trees($this_script, "EUC_KR", $mapping);
diff --git a/src/backend/utils/mb/Unicode/UCS_to_EUC_TW.pl b/src/backend/utils/mb/Unicode/UCS_to_EUC_TW.pl
index 603edc4..0407575 100755
--- a/src/backend/utils/mb/Unicode/UCS_to_EUC_TW.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_EUC_TW.pl
@@ -56,11 +56,13 @@ foreach my $i (@$mapping)
           { ucs       => $i->{ucs},
             code      => ($i->{code} + 0x8ea10000),
             rest      => $i->{rest},
-            direction => 'to_unicode' };
+            direction => 'to_unicode',
+            f         => $i->{f},
+            l         => $i->{l} };
     }
 }
 
 push @$mapping, @extras;
 
-print_tables($this_script, "EUC_TW", $mapping);
+print_tables($this_script, "EUC_TW", $mapping, 1);
 print_radix_trees($this_script, "EUC_TW", $mapping);
diff --git a/src/backend/utils/mb/Unicode/UCS_to_GB18030.pl b/src/backend/utils/mb/Unicode/UCS_to_GB18030.pl
index e20b4a8..922f206 100755
--- a/src/backend/utils/mb/Unicode/UCS_to_GB18030.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_GB18030.pl
@@ -38,10 +38,12 @@ while (<$in>)
         push @mapping,
           { ucs       => $ucs,
             code      => $code,
-            direction => 'both' };
+            direction => 'both',
+            f         => $in_file,
+            l         => $. };
     }
 }
 close($in);
 
-print_tables($this_script, "GB18030", \@mapping);
+print_tables($this_script, "GB18030", \@mapping, 1);
 print_radix_trees($this_script, "GB18030", \@mapping);
diff --git a/src/backend/utils/mb/Unicode/UCS_to_JOHAB.pl b/src/backend/utils/mb/Unicode/UCS_to_JOHAB.pl
old mode 100755
new mode 100644
index 2dc9fb3..ab6bebf
--- a/src/backend/utils/mb/Unicode/UCS_to_JOHAB.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_JOHAB.pl
@@ -27,11 +27,11 @@ my $mapping = &read_source("JOHAB.TXT");
 
 # Some extra characters that are not in JOHAB.TXT
 #<<< do not let perltidy touch this
 push @$mapping, (
-    {direction => 'both', ucs => 0x20AC, code => 0xd9e6, comment => '# EURO SIGN'},
-    {direction => 'both', ucs => 0x00AE, code => 0xd9e7, comment => '# REGISTERED SIGN'},
-    {direction => 'both', ucs => 0x327E, code => 0xd9e8, comment => '# CIRCLED HANGUL IEUNG U'}
+    {direction => 'both', ucs => 0x20AC, code => 0xd9e6, comment => '# EURO SIGN', f => $this_script, l => __LINE__},
+    {direction => 'both', ucs => 0x00AE, code => 0xd9e7, comment => '# REGISTERED SIGN', f => $this_script, l => __LINE__},
+    {direction => 'both', ucs => 0x327E, code => 0xd9e8, comment => '# CIRCLED HANGUL IEUNG U', f => $this_script, l => __LINE__}
     );
 #>>>
 
-print_tables($this_script, "JOHAB", $mapping);
+print_tables($this_script, "JOHAB", $mapping, 1);
 print_radix_trees($this_script, "JOHAB", $mapping);
diff --git a/src/backend/utils/mb/Unicode/UCS_to_SHIFT_JIS_2004.pl
b/src/backend/utils/mb/Unicode/UCS_to_SHIFT_JIS_2004.pl
index b1ab307..557fc62 100755
--- a/src/backend/utils/mb/Unicode/UCS_to_SHIFT_JIS_2004.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_SHIFT_JIS_2004.pl
@@ -24,7 +24,6 @@ while (my $line = <$in>)
 {
     if ($line =~ /^0x(.*)[ \t]*U\+(.*)\+(.*)[ \t]*#(.*)$/)
     {
-        # combined characters
         my ($c, $u1, $u2) = ($1, $2, $3);
         my $rest = "U+" . $u1 . "+" . $u2 . $4;
         my $code = hex($c);
 
@@ -36,15 +35,15 @@ while (my $line = <$in>)
             ucs        => $ucs1,
             ucs_second => $ucs2,
             comment    => $rest,
-            direction  => 'both' };
+            direction  => 'both',
+            f          => $in_file,
+            l          => $. };
         next;
     }
     elsif ($line =~ /^0x(.*)[ \t]*U\+(.*)[ \t]*#(.*)$/)
     {
-        # non-combined characters
         my ($c, $u, $rest) = ($1, $2, "U+" . $2 . $3);
-        my $ucs  = hex($u);
-        my $code = hex($c);
+        my ($ucs, $code) = (hex($u), hex($c));
         my $direction;
 
         if ($code < 0x80 && $ucs < 0x80)
 
@@ -64,12 +63,13 @@ while (my $line = <$in>)
             $direction = 'both';
         }
 
-        push @mapping, {
-            code => $code,
-            ucs => $ucs,
-            comment => $rest,
-            direction => $direction
-        };
+        push @mapping,
+          { code      => $code,
+            ucs       => $ucs,
+            comment   => $rest,
+            direction => $direction,
+            f         => $in_file,
+            l         => $. };
     }
 }
 close($in);
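
For anyone unfamiliar with the sjis-0213-2004-std.txt format the two regexes above distinguish: a "combined" line maps one SJIS-2004 code to a base character plus a combining mark, i.e. two Unicode code points. A rough sketch with an illustrative line (the sample pairing is mine; note the greedy (.*) captures can keep trailing whitespace, hence the trim):

    my $line = "0x82F5 U+304B+309A # <illustrative combined entry>";
    if ($line =~ /^0x(.*)[ \t]*U\+(.*)\+(.*)[ \t]*#(.*)$/)
    {
        my ($c, $u1, $u2) = ($1, $2, $3);
        s/\s+$// for ($c, $u1, $u2);    # strip whitespace kept by greedy (.*)
        print "code 0x$c -> U+$u1 plus U+$u2\n";
    }
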
diff --git a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
index ffeb65f..e1978f7 100755
--- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
@@ -40,16 +40,16 @@ foreach my $i (@$mapping)
 
 # Add these UTF8->SJIS pairs to the table.
 #<<< do not let perltidy touch this
 push @$mapping, (
-    {direction => "from_unicode", ucs => 0x00a2, code => 0x8191, comment => '# CENT SIGN'},
-    {direction => "from_unicode", ucs => 0x00a3, code => 0x8192, comment => '# POUND SIGN'},
-    {direction => "from_unicode", ucs => 0x00a5, code => 0x5c,   comment => '# YEN SIGN'},
-    {direction => "from_unicode", ucs => 0x00ac, code => 0x81ca, comment => '# NOT SIGN'},
-    {direction => "from_unicode", ucs => 0x2016, code => 0x8161, comment => '# DOUBLE VERTICAL LINE'},
-    {direction => "from_unicode", ucs => 0x203e, code => 0x7e,   comment => '# OVERLINE'},
-    {direction => "from_unicode", ucs => 0x2212, code => 0x817c, comment => '# MINUS SIGN'},
-    {direction => "from_unicode", ucs => 0x301c, code => 0x8160, comment => '# WAVE DASH'}
+    {direction => "from_unicode", ucs => 0x00a2, code => 0x8191, comment => '# CENT SIGN', f => $this_script, l =>
__LINE__},
 
+    {direction => "from_unicode", ucs => 0x00a3, code => 0x8192, comment => '# POUND SIGN', f => $this_script, l =>
__LINE__},
 
+    {direction => "from_unicode", ucs => 0x00a5, code => 0x5c,   comment => '# YEN SIGN', f => $this_script, l =>
__LINE__},
 
+    {direction => "from_unicode", ucs => 0x00ac, code => 0x81ca, comment => '# NOT SIGN', f => $this_script, l =>
__LINE__},
 
+    {direction => "from_unicode", ucs => 0x2016, code => 0x8161, comment => '# DOUBLE VERTICAL LINE', f =>
$this_script,l => __LINE__ },
 
+    {direction => "from_unicode", ucs => 0x203e, code => 0x7e,   comment => '# OVERLINE', f => $this_script, l =>
__LINE__},
 
+    {direction => "from_unicode", ucs => 0x2212, code => 0x817c, comment => '# MINUS SIGN', f => $this_script, l =>
__LINE__},
 
+    {direction => "from_unicode", ucs => 0x301c, code => 0x8160, comment => '# WAVE DASH', f => $this_script, l =>
__LINE__}    );#>>>
 
-print_tables($this_script, "SJIS", $mapping);
+print_tables($this_script, "SJIS", $mapping, 1);print_radix_trees($this_script, "SJIS", $mapping);
diff --git a/src/backend/utils/mb/Unicode/UCS_to_UHC.pl b/src/backend/utils/mb/Unicode/UCS_to_UHC.pl
old mode 100755
new mode 100644
index 2905b95..26cf5a2
--- a/src/backend/utils/mb/Unicode/UCS_to_UHC.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_UHC.pl
@@ -41,7 +41,9 @@ while (<$in>)
         push @mapping,
           { ucs       => $ucs,
             code      => $code,
-            direction => 'both' };
+            direction => 'both',
+            f         => $in_file,
+            l         => $. };
     }
 }
 close($in);
 
@@ -51,7 +53,9 @@ push @mapping,
   { direction => 'both',
     code      => 0xa2e8,
     ucs       => 0x327e,
-    comment   => 'CIRCLED HANGUL IEUNG U' };
+    comment   => 'CIRCLED HANGUL IEUNG U',
+    f         => $this_script,
+    l         => __LINE__ };
 
-print_tables($this_script, "UHC", \@mapping);
+print_tables($this_script, "UHC", \@mapping, 1);
 print_radix_trees($this_script, "UHC", \@mapping);
diff --git a/src/backend/utils/mb/Unicode/UCS_to_most.pl b/src/backend/utils/mb/Unicode/UCS_to_most.pl
index 55ef873..8cc3eb7 100755
--- a/src/backend/utils/mb/Unicode/UCS_to_most.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_most.pl
@@ -56,6 +56,6 @@ foreach my $charset (@charsets)
 {
     my $mapping = &read_source($filename{$charset});
 
-    print_tables($this_script, $charset, $mapping);
+    print_tables($this_script, $charset, $mapping, 1);
     print_radix_trees($this_script, $charset, $mapping);
 }
diff --git a/src/backend/utils/mb/Unicode/convutils.pm b/src/backend/utils/mb/Unicode/convutils.pm
index 3ab461b..f4e6917 100644
--- a/src/backend/utils/mb/Unicode/convutils.pm
+++ b/src/backend/utils/mb/Unicode/convutils.pm
@@ -19,20 +19,24 @@ sub ucs2utf
     }
     elsif ($ucs > 0x007f && $ucs <= 0x07ff)
     {
-        $utf = (($ucs & 0x003f) | 0x80) | ((($ucs >> 6) | 0xc0) << 8);
+        $utf =
+          (($ucs & 0x003f) | 0x80) |
+          ((($ucs >> 6) | 0xc0) << 8);
     }
     elsif ($ucs > 0x07ff && $ucs <= 0xffff)
     {
         $utf =
           ((($ucs >> 12) | 0xe0) << 16) |
-          (((($ucs & 0x0fc0) >> 6) | 0x80) << 8) | (($ucs & 0x003f) | 0x80);
+          (((($ucs & 0x0fc0) >> 6) | 0x80) << 8) |
+          (($ucs & 0x003f) | 0x80);
     }
     else
     {
         $utf =
           ((($ucs >> 18) | 0xf0) << 24) |
           (((($ucs & 0x3ffff) >> 12) | 0x80) << 16) |
-          (((($ucs & 0x0fc0) >> 6) | 0x80) << 8) | (($ucs & 0x003f) | 0x80);
+          (((($ucs & 0x0fc0) >> 6) | 0x80) << 8) |
+          (($ucs & 0x003f) | 0x80);
     }
     return ($utf);
 }
@@ -72,7 +76,9 @@ sub read_source
             code      => hex($1),
             ucs       => hex($2),
             comment   => $4,
-            direction => "both" };
+            direction => "both",
+            f         => $fname,
+            l         => $. };
 
         # Ignore pure ASCII mappings. PostgreSQL character conversion code
         # never even passes these to the conversion code.
@@ -134,8 +140,7 @@ sub print_tables
             {
                 push @to_unicode_combined, $entry;
             }
-            if (   $i->{direction} eq "both"
-                || $i->{direction} eq "from_unicode")
+            if ($i->{direction} eq "both" || $i->{direction} eq "from_unicode")
             {
                 push @from_unicode_combined, $entry;
             }
@@ -152,8 +157,7 @@ sub print_tables
             {
                 push @to_unicode, $entry;
             }
-            if (   $i->{direction} eq "both"
-                || $i->{direction} eq "from_unicode")
+            if ($i->{direction} eq "both" || $i->{direction} eq "from_unicode")
             {
                 push @from_unicode, $entry;
             }
@@ -183,15 +187,16 @@ sub print_from_utf8_map
     my $fname = lc("utf8_to_${charset}.map");
     print "- Writing UTF8=>${charset} conversion table: $fname\n";
     open(my $out, '>', $fname) || die "cannot open output file : $fname\n";
-    printf($out "/* src/backend/utils/mb/Unicode/$fname */\n\n".
-           "static const pg_utf_to_local ULmap${charset}[ %d ] = {",
-           scalar(@$table));
+    printf $out "/* src/backend/utils/mb/Unicode/$fname */\n"
+      . "/* This file is generated by $this_script */\n\n"
+      . "static const pg_utf_to_local ULmap${charset}[ %d ] = {",
+      scalar(@$table);
 
     my $first = 1;
     foreach my $i (sort { $a->{utf8} <=> $b->{utf8} } @$table)
     {
         print $out "," if (!$first);
         $first = 0;
 
-        print $out "\t/* $last_comment */" if ($verbose);
+        print $out "\t/* $last_comment */" if ($verbose && $last_comment ne "");
 
         printf $out "\n  {0x%04x, 0x%04x}", $i->{utf8}, $i->{code};
         if ($verbose >= 2)
 
@@ -199,12 +204,12 @@ sub print_from_utf8_map
         {
             $last_comment =
              sprintf("%s:%d %s", $i->{f}, $i->{l}, $i->{comment});
         }
-        else
+        elsif ($verbose >= 1)
        {
             $last_comment = $i->{comment};
        }
     }
-    print $out "\t/* $last_comment */" if ($verbose);
+    print $out "\t/* $last_comment */" if ($verbose && $last_comment ne "");
     print $out "\n};\n";
     close($out);
 }
@@ -218,21 +223,30 @@ sub print_from_utf8_combined_map
     my $fname = lc("utf8_to_${charset}_combined.map");
     print "- Writing UTF8=>${charset} conversion table: $fname\n";
     open(my $out, '>', $fname) || die "cannot open output file : $fname\n";
-    printf($out "/* src/backend/utils/mb/Unicode/$fname */\n\n".
-           "static const pg_utf_to_local_combined ULmap${charset}_combined[ %d ] = {",
-           scalar(@$table));
+    printf $out "/* src/backend/utils/mb/Unicode/$fname */\n"
+      . "/* This file is generated by $this_script */\n\n"
+      . "static const pg_utf_to_local_combined ULmap${charset}_combined[ %d ] = {",
+      scalar(@$table);
 
     my $first = 1;
     foreach my $i (sort { $a->{utf8} <=> $b->{utf8} } @$table)
     {
         print $out "," if (!$first);
         $first = 0;
 
-        print $out "\t/* $last_comment */" if ($verbose);
+        print $out "\t/* $last_comment */" if ($verbose && $last_comment ne "");
 
         printf $out "\n  {0x%08x, 0x%08x, 0x%04x}",
           $i->{utf8}, $i->{utf8_second}, $i->{code};
 
-        $last_comment = $i->{comment};
+        if ($verbose >= 2)
+        {
+            $last_comment =
+              sprintf("%s:%d %s", $i->{f}, $i->{l}, $i->{comment});
+        }
+        elsif ($verbose >= 1)
+        {
+            $last_comment = $i->{comment};
+        }
     }
-    print $out "\t/* $last_comment */" if ($verbose);
+    print $out "\t/* $last_comment */" if ($verbose && $last_comment ne "");
     print $out "\n};\n";
     close($out);
 }
@@ -247,15 +261,16 @@ sub print_to_utf8_map
     print "- Writing ${charset}=>UTF8 conversion table: $fname\n";
     open(my $out, '>', $fname) || die "cannot open output file : $fname\n";
 
-    printf($out "/* src/backend/utils/mb/Unicode/${fname} */\n\n".
-           "static const pg_local_to_utf LUmap${charset}[ %d ] = {",
-           scalar(@$table));
+    printf $out "/* src/backend/utils/mb/Unicode/$fname */\n"
+      . "/* This file is generated by $this_script */\n\n"
+      . "static const pg_local_to_utf LUmap${charset}[ %d ] = {",
+      scalar(@$table);
 
     my $first = 1;
     foreach my $i (sort { $a->{code} <=> $b->{code} } @$table)
     {
         print $out "," if (!$first);
         $first = 0;
 
-        print $out "\t/* $last_comment */" if ($verbose);
+        print $out "\t/* $last_comment */" if ($verbose && $last_comment ne "");
 
         printf $out "\n  {0x%04x, 0x%x}", $i->{code}, $i->{utf8};
         if ($verbose >= 2)
 
@@ -263,12 +278,12 @@ sub print_to_utf8_map
        {
             $last_comment =
              sprintf("%s:%d %s", $i->{f}, $i->{l}, $i->{comment});
        }
-        else
+        elsif ($verbose >= 1)
        {
             $last_comment = $i->{comment};
        }
     }
-    print $out "\t/* $last_comment */" if ($verbose);
+    print $out "\t/* $last_comment */" if ($verbose && $last_comment ne "");
     print $out "\n};\n";
     close($out);
 }
@@ -283,21 +298,31 @@ sub print_to_utf8_combined_map
     print "- Writing ${charset}=>UTF8 conversion table: $fname\n";
     open(my $out, '>', $fname) || die "cannot open output file : $fname\n";
 
-    printf($out "/* src/backend/utils/mb/Unicode/${fname} */\n\n".
-           "static const pg_local_to_utf_combined LUmap${charset}_combined[ %d ] = {",
-           scalar(@$table));
+    printf $out "/* src/backend/utils/mb/Unicode/$fname */\n"
+      . "/* This file is generated by $this_script */\n\n"
+      . "static const pg_local_to_utf_combined LUmap${charset}_combined[ %d ] = {",
+      scalar(@$table);
 
     my $first = 1;
     foreach my $i (sort { $a->{code} <=> $b->{code} } @$table)
     {
         print $out "," if (!$first);
         $first = 0;
 
-        print $out "\t/* $last_comment */" if ($verbose);
+        print $out "\t/* $last_comment */" if ($verbose && $last_comment ne "");
 
         printf $out "\n  {0x%04x, 0x%08x, 0x%08x}",
           $i->{code}, $i->{utf8}, $i->{utf8_second};
 
-        $last_comment = $i->{comment};
+        if ($verbose >= 2)
+        {
+            $last_comment =
+              sprintf("%s:%d %s", $i->{f}, $i->{l}, $i->{comment});
+        }
+        elsif ($verbose >= 1)
+        {
+            $last_comment = $i->{comment};
+        }
     }
-    print $out "\t/* $last_comment */" if ($verbose);
+    print $out "\t/* $last_comment */" if ($verbose && $last_comment ne "");
     print $out "\n};\n";
     close($out);
 }
@@ -305,711 +330,442 @@ sub print_to_utf8_combined_map
 
 #############################################################################
 # RADIX TREE STUFF
-# C struct type names : see wchar.h
-my $radix_type      = "pg_mb_radix_tree";
-my $radix_node_type = "pg_mb_radix_index";
-
-#########################################
-# read_maptable(<map file name>)
-#
-# extract data from map files and returns a character map table.
-# returns a reference to a hash <in code> => <out code>
-sub read_maptable
-{
-    my ($fname) = @_;
-    my %c;
-
-    open(my $in, '<', $fname) || die("cannot open $fname");
-
-    while (<$in>)
-    {
-        if (/^[ \t]*{0x([0-9a-f]+), *0x([0-9a-f]+)},?/)
-        {
-            $c{ hex($1) } = hex($2);
-        }
-    }
-
-    close($in);
-    return \%c;
-}
-#########################################
-# generate_index(<charmap hash ref>)
+# print_radix_table(<charmap hash ref>)
 #
-# generate a radix tree data from a character table
-# returns a hashref to an index data.
-# {
-#   csegs => <character segment index>
-#   b2idx => [<tree index of 1st byte of 2-byte code>]
-#   b3idx => [<idx for 1st byte for 3-byte code>, <2nd>]
-#   b4idx => [<idx for 1st byte for 4-byte code>, <2nd>, <3rd>]
-# }
+# Input: A hash, mapping an input character to an output character.
 #
-# Tables are in two forms, flat and segmented. a segmented table is
-# logically a two-dimentional table but physically a sequence of
-# segments, fixed length block of items. This structure allows us to
-# shrink table size by overlapping a shared sequence of zeros between
-# successive two segments. overlap_segments does that step.
+# Constructs a radix tree from the hash, and prints it out as a C-struct.
 #
-# A flat table is simple set of key and value pairs. The value is a
-# segment id of the next segmented table. The next table is referenced
-# using the segment id and the next byte of a code.
-#
-# flat table (b2idx, b3idx1, b4idx1)
-# {
-#    attr => {
-#       segmented => true(1) if this index is segmented>
-#       min  => <minimum value of index key>
-#       max  => <maximum value of index key>
-#        nextidx => <hash reference to the next level table>
-#    }
-#    i    => { # index data
-#       <byte> => <pointer value> # pointer to the next index
-#       ...
-#    }
-#
-# Each segments in segmented table is equivalent to a flat table
-# above.
-#
-# segmented table (csegs, b3idx2, b4idx2, b4idx3)
-# {
-#    attr => {
-#       segmented => true(1) if this index is segmented>
-#       min => <minimum value of index key>
-#       max => <maximum value of index key>
-#        width => <required hex width only for cseg table>
-#        is32bit => true if values are 32bit width, false means 16bit.
-#       has0page => only for cseg. true if 0 page is for single byte chars
-#        next => <hash reference to the next level table, if any>
-#    }
-#    i    => { # segment data
-#       <segid> => { # key for this segment
-#          lower   => <minimum value>
-#          upper   => <maximum value>
-#          offset  => <position of this segment in the whole table>
-#          label => <label string of this segment>
-#          d => { # segment data
-#            <byte> => { # pointer to the next index
-#               label     => <label string for this item>
-#               segid     => <target segid of next level>
-#               segoffset => <offset of the target segid>
-#             }
-#            ...
-#          }
-#        }
-#    }
-# }
-
-sub generate_index
-{
-    my ($c) = @_;
-    my (%csegs, %b2idx, %b3idx1, %b3idx2, %b4idx1, %b4idx2, %b4idx3);
-    my @all_tables =
-      (\%csegs, \%b2idx, \%b3idx1, \%b3idx2, \%b4idx1, \%b4idx2, \%b4idx3);
-    my $si;
-
-    # initialize attributes of index tables
-    #<<< do not let perltidy touch this
-    $csegs{attr} = {name => "csegs", chartbl => 1, segmented => 1,
-                    is32bit   => 0,    has0page  => 0};
-    #>>>
-    $csegs{attr} = {
-        name      => "csegs",
-        chartbl   => 1,
-        segmented => 1,
-        is32bit   => 0,
-        has0page  => 0 };
-    $b2idx{attr}  = { name => "b2idx",  segmented => 0, nextidx => \%csegs };
-    $b3idx1{attr} = { name => "b3idx1", segmented => 0, nextidx => \%b3idx2 };
-    $b3idx2{attr} = { name => "b3idx2", segmented => 1, nextidx => \%csegs };
-    $b4idx1{attr} = { name => "b4idx1", segmented => 0, nextidx => \%b4idx2 };
-    $b4idx2{attr} = { name => "b4idx2", segmented => 1, nextidx => \%b4idx3 };
-    $b4idx3{attr} = { name => "b4idx3", segmented => 1, nextidx => \%csegs };
+sub print_radix_table
+{
+    my ($out, $tblname, $c) = @_;
+
+    ###
+    ### Build radix trees in memory, for 1-, 2-, 3- and 4-byte inputs. Each
+    ### radix tree is represented as a nested hash, each hash indexed by
+    ### input byte
+    ###
+    my %b1map;
+    my %b2map;
+    my %b3map;
+    my %b4map;
 
     foreach my $in (keys %$c)
     {
+        my $out = $c->{$in};
+        if ($in < 0x100)
         {
-            my $b1 = $in;
-
-            # 1 byte code doesn't have index. the first segment #0 of
-            # character table stores them
-            $csegs{attr}{has0page} = 1;
-            $si = {
-                segid => 0,
-                off   => $in,
-                label => "1byte-",
-                char  => $c->{$in} };
+            $b1map{$in} = $out;
         }
         elsif ($in < 0x10000)
         {
-            # 2-byte code index consists of just one flat table
             my $b1     = $in >> 8;
             my $b2     = $in & 0xff;
 
-            my $csegid = $in >> 8;
-            if (!defined $b2idx{i}{$b1})
-            {
-                &set_min_max($b2idx{attr}, $b1);
-                $b2idx{i}{$b1}{segid} = $csegid;
-            }
-            $si = {
-                segid => $csegid,
-                off   => $b2,
-                label => sprintf("%02x", $b1),
-                char  => $c->{$in} };
+            $b2map{$b1}{$b2} = $out;
         }
         elsif ($in < 0x1000000)
         {
-            # 3-byte code index consists of one flat table and one
-            # segmented table
             my $b1     = $in >> 16;
             my $b2     = ($in >> 8) & 0xff;
             my $b3     = $in & 0xff;
 
-            my $l1id   = $in >> 16;
-            my $csegid = $in >> 8;
-
-            if (!defined $b3idx1{i}{$b1})
-            {
-                &set_min_max($b3idx1{attr}, $b1);
-                $b3idx1{i}{$b1}{segid} = $l1id;
-            }
-            if (!defined $b3idx2{i}{$l1id}{d}{$b2})
-            {
-                &set_min_max($b3idx2{attr}, $b2);
-                $b3idx2{i}{$l1id}{label} = sprintf("%02x", $b1);
-                $b3idx2{i}{$l1id}{d}{$b2} = {
-                    segid => $csegid,
-                    label => sprintf("%02x%02x", $b1, $b2) };
-            }
-            $si = {
-                segid => $csegid,
-                off   => $b3,
-                label => sprintf("%02x%02x", $b1, $b2),
-                char  => $c->{$in} };
+            $b3map{$b1}{$b2}{$b3} = $out;
         }
         elsif ($in < 0x100000000)
         {
-            # 4-byte code index consists of one flat table, and two
-            # segmented tables
             my $b1     = $in >> 24;
             my $b2     = ($in >> 16) & 0xff;
             my $b3     = ($in >> 8) & 0xff;
             my $b4     = $in & 0xff;
 
-            my $l1id   = $in >> 24;
-            my $l2id   = $in >> 16;
-            my $csegid = $in >> 8;
-
-            if (!defined $b4idx1{i}{$b1})
-            {
-                &set_min_max($b4idx1{attr}, $b1);
-                $b4idx1{i}{$b1}{segid} = $l1id;
-            }
-
-            if (!defined $b4idx2{i}{$l1id}{d}{$b2})
-            {
-                &set_min_max($b4idx2{attr}, $b2);
-                $b4idx2{i}{$l1id}{d}{$b2} = {
-                    segid => $l2id,
-                    label => sprintf("%02x", $b1) };
-            }
-            if (!defined $b4idx3{i}{$l2id}{d}{$b3})
-            {
-                &set_min_max($b4idx3{attr}, $b3);
-                $b4idx3{i}{$l2id}{d}{$b3} = {
-                    segid => $csegid,
-                    label => sprintf("%02x%02x", $b1, $b2) };
-            }
-            $si = {
-                segid => $csegid,
-                off   => $b4,
-                label => sprintf("%02x%02x%02x", $b1, $b2, $b3),
-                char  => $c->{$in} };
+            $b4map{$b1}{$b2}{$b3}{$b4} = $out;
         }
         else
         {
             die sprintf("up to 4 byte code is supported: %x", $in);
         }
 
-
-        &set_min_max($csegs{attr}, $si->{off});
-        $csegs{i}{ $si->{segid} }{d}{ $si->{off} } = $si->{char};
-        $csegs{i}{ $si->{segid} }{label} = $si->{label};
-        $csegs{attr}{is32bit} = 1 if ($si->{char} >= 0x10000);
-        &update_width($csegs{attr}, $si->{char});
-        if ($si->{char} >= 0x100000000)
-        {
-            die "character width is over 32bit. abort.";
-        }
     }
-    # calcualte segment attributes
-    foreach my $t (@all_tables)
-    {
-        next if (!defined $t->{i} || !$t->{attr}{segmented});
-
-        # segments are to be aligned in the numerical order of segment id
-        my @keylist = sort { $a <=> $b } keys $t->{i};
-        next if ($#keylist < 0);
-        my $offset  = 1;
-        my $segsize = $t->{attr}{max} - $t->{attr}{min} + 1;
-
-        for my $k (@keylist)
+    my @segments;
+
+    ###
+    ### Build a linear list of "segments", from the nested hashes.
+    ###
+    ### Each segment is a lookup table, keyed by the next byte in the input.
+    ### The segments are written out physically to one big array in the final
+    ### step, but logically, they form a radix tree. Or rather, four radix
+    ### trees: one for 1-byte inputs, another for 2-byte inputs, 3-byte
+    ### inputs, and 4-byte inputs.
+    ###
+    ### Each segment is represented by a hash with following fields:
+    ###
+    ### comment => <string to output as a comment>
+    ### label => <label that can be used to refer to this segment from elsewhere>
+    ### values => <a hash, keyed by byte, 0-0xff>
+    ###
+    ### Entries in 'values' can be integers (for leaf-level segments), or
+    ### string labels, pointing to a segment with that label. Any missing
+    ### values are treated as zeros. If 'values' hash is missing altogether,
+    ###  it's treated as all-zeros.
+    ###
+    ### Subsequent steps will enrich the segments with more fields.
+    ###
+
+    # Add the segments for the radix trees themselves.
+    push @segments, build_segments_from_tree("Single byte table", "1-byte", 1, \%b1map);
+    push @segments, build_segments_from_tree("Two byte table", "2-byte", 2, \%b2map);
+    push @segments, build_segments_from_tree("Three byte table", "3-byte", 3, \%b3map);
+    push @segments, build_segments_from_tree("Four byte table", "4-byte", 4, \%b4map);
+
+    ###
+    ### Find min and max index used in each level of each tree.
+    ###
+    ### These are stored separately, and we can then leave out the unused
+    ### parts of every segment. (When using the resulting tree, you must
+    ### check each input byte against the min and max.)
+    ###
+    my %min_idx;
+    my %max_idx;
+    foreach my $seg (@segments)
+    {
+        my $this_min = $min_idx{$seg->{depth}}->{$seg->{level}};
+        my $this_max = $max_idx{$seg->{depth}}->{$seg->{level}};
+
+        foreach my $i (keys $seg->{values})
+        {
-            my $seg = $t->{i}{$k};
-            $seg->{lower}  = $t->{attr}{min};
-            $seg->{upper}  = $t->{attr}{max};
-            $seg->{offset} = $offset;
-            $offset += $segsize;
+            $this_min = $i if (!defined $this_min || $i < $this_min);
+            $this_max = $i if (!defined $this_max || $i > $this_max);
+        }
-        # overlapping successive zeros between segments
-        &overlap_segments($t);
+        $min_idx{$seg->{depth}}{$seg->{level}} = $this_min;
+        $max_idx{$seg->{depth}}{$seg->{level}} = $this_max;
+    }
-
-    # make link among tables
-    foreach my $t (@all_tables)
+    # Copy the mins and max's back to every segment, for convenience
+    foreach my $seg (@segments)
+    {
-        &make_index_link($t, $t->{attr}{nextidx});
+        $seg->{min_idx} = $min_idx{$seg->{depth}}{$seg->{level}};
+        $seg->{max_idx} = $max_idx{$seg->{depth}}{$seg->{level}};
+    }
-    return {
-        name_prefix => "",
-        csegs       => \%csegs,
-        b2idx       => [ \%b2idx ],
-        b3idx       => [ \%b3idx1, \%b3idx2 ],
-        b4idx       => [ \%b4idx1, \%b4idx2, \%b4idx3 ],
-        all         => \@all_tables };
-}
-
-
-#########################################
-# set_min_max - internal routine to maintain min and max value of a table
-sub set_min_max
-{
-    my ($a, $v) = @_;
-
-    $a->{min} = $v if (!defined $a->{min} || $v < $a->{min});
-    $a->{max} = $v if (!defined $a->{max} || $v > $a->{max});
-}
-
-#########################################
-# set_maxval - internal routine to maintain mixval
-sub update_width
-{
-    my ($a, $v) = @_;
+    ###
+    ### Prepend a dummy all-zeros map to the beginning.
+    ###
+    ### A 0 is an invalid value anywhere in the table, and this allows us to
+    ### point to 0 offset anywhere else in the tables, to get a 0 result.
-    my $nnibbles = int((int(log($v) / log(16)) + 1) / 2) * 2;
-    if (!defined $a->{width} || $nnibbles > $a->{width})
+    # Find the max range between min and max indexes in any of the segments.
+    my $widest_range = 0;
+    foreach my $seg (@segments)
+    {
-        $a->{width} = $nnibbles;
+        my $this_range = $seg->{max_idx} - $seg->{min_idx};
+        $widest_range = $this_range if ($this_range > $widest_range);
+    }
-}
-
-#########################################
-# overlap_segments
-#
-# removes duplicate regeion between two successive segments.
-
-sub overlap_segments
-{
-    my ($h) = @_;
-    # don't touch if undefined
-    return if (!defined $h->{i} || !$h->{attr}{segmented});
-    my $index = $h->{i};
-    my ($min, $max) = ($h->{attr}{min}, $h->{attr}{max});
-    my ($prev, $first);
-    my @segids = sort { $a <=> $b } keys $index;
-    return if ($#segids < 1);
+    unshift @segments, {
+        header => "Dummy map, for invalid values",
+        min_idx => 0,
+        max_idx => $widest_range
+    };
-    $first = 1;
-    undef $prev;
-
-    for my $segid (@segids)
+    ###
+    ### Eliminate overlapping zeros
+    ###
+    ### For each segment, if there are zero values at the end of it, and there
+    ### are also zero values at the beginning of the next segment, we can
+    ### overlay the tail of this segment with the head of next segment, to
+    ### save space.
+    ###
+    ### To achieve that, we reduce the 'max_idx' of each segment by the
+    ### amount of zeros that can be overlaid.
+    ###
+    for (my $j = 0; $j < $#segments - 1; $j++)
+    {
-        my $seg = $index->{$segid};
-
-        # smin and smax is range excluded preceeding and trailing zeros
-        my @keys = sort { $a <=> $b } keys $seg->{d};
-        my $smin = $keys[0];
-        my $smax = $keys[-1];
+        my $seg = $segments[$j];
+        my $nextseg = $segments[$j + 1];
-        if ($first)
+        # Count the number of zero values at the end of this segment.
+        my $this_trail_zeros = 0;
+        for (my $i = $seg->{max_idx}; $i >= $seg->{min_idx} && !$seg->{values}->{$i}; $i--)
+        {
-            # first segment doesn't have a preceding segment
-            $seg->{offset} = 1;
-            $seg->{lower}  = $min;
-            $seg->{upper}  = $smax;
+            $this_trail_zeros++;
+        }
-        else
+
+        # Count the number of zeros at the beginning of next segment.
+        my $next_lead_zeros = 0;
+        for (my $i = $nextseg->{min_idx}; $i <= $nextseg->{max_idx} && !$nextseg->{values}->{$i}; $i++)
+        {
-            # calculate overlap and shift segment location
-            my $prefix      = $smin - $min;
-            my $postfix     = $max - $smax;
-            my $prevpostfix = $max - $prev->{upper};
-            my $overlap     = $prevpostfix < $prefix ? $prevpostfix : $prefix;
-
-            $seg->{lower}  = $min + $overlap;
-            $seg->{upper}  = $smax;
-            $seg->{offset} = $prev->{offset} + ($max - $min + 1) - $overlap;
-            $prev->{upper} = $max;
+            $next_lead_zeros++;
+        }
-        $prev  = $seg;
-        $first = 0;
-    }
-    return $h;
-}
+        # How many zeros in common?
+        my $overlaid_trail_zeros =
+            ($this_trail_zeros > $next_lead_zeros) ? $next_lead_zeros : $this_trail_zeros;
-######################################################
-# make_index_link(from_table, to_table)
-#
-# Fills out target pointers in non-leaf index tables.
-#
-# from_table - table to set links
-# to_table   - target table of from_table
+        $seg->{overlaid_trail_zeros} = $overlaid_trail_zeros;
+        $seg->{max_idx} = $seg->{max_idx} - $overlaid_trail_zeros;
+    }
-sub make_index_link
-{
-    my ($s, $t) = @_;
-    return if (!defined $s->{i} || !defined $t->{i});
+    ###
+    ### Replace label references with real offsets.
+    ###
+    ### So far, the non-leaf segments have referred to other segments by
+    ### their labels. Replace them with numerical offsets from the beginning
+    ### of the final array. You cannot move, add, or remove segments after
+    ### this step, as that would invalidate the offsets calculated here!
+    ###
+    my $flatoff = 0;
+    my %segmap;
-    my @tkeys = sort { $a <=> $b } keys $t->{i};
+    # First pass: assign offsets to each segment, and build hash
+    # of label => offset.
+    foreach my $seg (@segments)
+    {
+        $seg->{offset} = $flatoff;
+        $segmap{$seg->{label}} = $flatoff;
+        $flatoff += $seg->{max_idx} - $seg->{min_idx} + 1;
+    }
+    my $tblsize = $flatoff;
-    if ($s->{attr}{segmented})
+    # Second pass: look up the offset of each label reference in the hash.
+    foreach my $seg (@segments)
+    {
-        foreach my $k1 (keys $s->{i})
+        while (my ($i, $val) = each %{$seg->{values}})
+        {
-            foreach my $k2 (keys $s->{i}{$k1}{d})
+            if (!($val =~ /^[0-9,.E]+$/))
+            {
-                my $tsegid = $s->{i}{$k1}{d}{$k2}{segid};
-                if (!defined $tsegid)
+                my $segoff = $segmap{$val};
+                if ($segoff)
+                {
+                    $seg->{values}->{$i} = $segoff;
+                }
+                else
+                {
-                    die sprintf(
-                        "segid is not set in %s{i}{%x}{d}{%x}{segid}",
-                        $s->{attr}{name},
-                        $k1, $k2);
+                    die "no segment with label $val";                }
-                $s->{i}{$k1}{d}{$k2}{segoffset} = $t->{i}{$tsegid}{offset};            }        }    }
-    else
-    {
-        foreach my $k (keys $s->{i})
+
+    # Also look up the positions of the roots in the table.
+    my $b1root = $segmap{"1-byte"};
+    my $b2root = $segmap{"2-byte"};
+    my $b3root = $segmap{"3-byte"};
+    my $b4root = $segmap{"4-byte"};
+
+    # And the lower-upper values of each level in each radix tree.
+    my $b1_lower = $min_idx{1}{1};
+    my $b1_upper = $max_idx{1}{1};
+
+    my $b2_1_lower = $min_idx{2}{1};
+    my $b2_1_upper = $max_idx{2}{1};
+    my $b2_2_lower = $min_idx{2}{2};
+    my $b2_2_upper = $max_idx{2}{2};
+
+    my $b3_1_lower = $min_idx{3}{1};
+    my $b3_1_upper = $max_idx{3}{1};
+    my $b3_2_lower = $min_idx{3}{2};
+    my $b3_2_upper = $max_idx{3}{2};
+    my $b3_3_lower = $min_idx{3}{3};
+    my $b3_3_upper = $max_idx{3}{3};
+
+    my $b4_1_lower = $min_idx{4}{1};
+    my $b4_1_upper = $max_idx{4}{1};
+    my $b4_2_lower = $min_idx{4}{2};
+    my $b4_2_upper = $max_idx{4}{2};
+    my $b4_3_lower = $min_idx{4}{3};
+    my $b4_3_upper = $max_idx{4}{3};
+    my $b4_4_lower = $min_idx{4}{4};
+    my $b4_4_upper = $max_idx{4}{4};
+
+    ###
+    ### Find the maximum value in the whole table, to determine if we can
+    ### use uint16 or if we need to use uint32.
+    ###
+    my $max_val = 0;
+    foreach my $seg (@segments)
+    {
+        foreach my $val (values $seg->{values})
+        {
-            my $tsegid = $s->{i}{$k}{segid};
-            if (!defined $tsegid)
-            {
-                die sprintf("segid is not set in %s{i}{%x}{segid}",
-                    $s->{attr}{name}, $k);
-            }
-            $s->{i}{$k}{segoffset} = $t->{i}{$tsegid}{offset};
+            $max_val = $val if ($val > $max_val);
+        }
+    }
-}
-
-###############################################
-# print_radix_table - output index table as C-struct
-#
-# print_radix_table(hd, table, tblname, width)
-# returns 1 if the table is written
-#
-# hd      - file handle to write
-# table   - ref to an index table
-# tblname - C symbol name for the table
-# width   - width in characters of this table
-sub print_radix_table
-{
-    my ($hd, $table, $tblname, $width) = @_;
+    my $datatype = ($max_val <= 0xffff) ? "uint16" : "uint32";
-    return 0 if (!defined $table->{i});
+    # For formatting, determine how many values we can fit on a single
+    # line, and how wide each value needs to be to align nicely.
+    my $vals_per_line;
+    my $colwidth;
-    if ($table->{attr}{chartbl})
+    if ($max_val <= 0xffff)
     {
-        &print_chars_table($hd, $table, $tblname, $width);
+        $vals_per_line = 8;
+        $colwidth = 4;
     }
-    elsif ($table->{attr}{segmented})
+    elsif ($max_val <= 0xffffff)
     {
-        &print_segmented_table($hd, $table, $tblname, $width);
+        $vals_per_line = 4;
+        $colwidth = 6;
     }
     else
     {
-        &print_flat_table($hd, $table, $tblname, $width);
-    }
-    return 1;
-}
-
-#########################################
-# print_chars_table
-#
-# print_chars_table(hd, table, tblname, width)
-# this is usually called via writ_table
-#
-# hd      - file handle to write
-# table   - ref to an index table
-# tblname - C symbol name for the table
-# tblwidth- width in characters of this table
-
-sub print_chars_table
-{
-    my ($hd, $table, $tblname, $width) = @_;
-    my ($st, $ed) = ($table->{attr}{min}, $table->{attr}{max});
-    my ($type) = $table->{attr}{is32bit} ? "uint32" : "uint16";
-
-    printf $hd "static const %s %s[] =\n{", $type, $tblname;
-    printf $hd " /* chars content - index range = [%02x, %02x] */", $st, $ed;
-
-    # values in character table are written in fixedwidth
-    # hexadecimals.  calculate the number of columns in a line. 13 is
-    # the length of line header.
-
-    my $colwidth     = $table->{attr}{width};
-    my $colseplen    = 4;                       # the length of  ", 0x"
-    my $headerlength = 13;
-    my $colnum = int(($width - $headerlength) / ($colwidth + $colseplen));
-
-    # round down to multiples of 4. don't bother by too small table width
-    my $colnum = int($colnum / 4) * 4;
-    my $line   = "";
-    my $first0 = 1;
-
-    # output all segments in segment id order
-    foreach my $k (sort { $a <=> $b } keys $table->{i})
-    {
-        my $s = $table->{i}{$k};
-        if (!$first0)
-        {
-            $line =~ s/\s+$//;    # remove trailing space
-            print $hd $line, ",\n";
-            $line = "";
-        }
-        $first0 = 0;
-
-        # write segment header
-        printf $hd "\n  /*** %4sxx - offset 0x%05x ***/",
-          $s->{label}, $s->{offset};
-
-        # write segment content
-        my $first1 = 1;
-        my ($segstart, $segend) = ($s->{lower}, $s->{upper});
-        my ($xpos, $nocomma) = (0, 0);
-
-        foreach my $j (($segstart - ($segstart % $colnum)) .. $segend)
+        $vals_per_line = 4;
+        $colwidth = 8;
+    }
+
+    ###
+    ### Print the struct and array.
+    ###
+    printf $out "static const $datatype ${tblname}_table[];\n";
+    printf $out "\n";
+    printf $out "static const pg_mb_radix_tree $tblname =\n";
+    printf $out "{\n";
+    if ($datatype eq "uint16")
+    {
+        print $out "  ${tblname}_table,\n";
+        print $out "  NULL, /* 32-bit table not used */\n";
+    }
+    if ($datatype eq "uint32")
+    {
+        print $out "  NULL, /* 16-bit table not used */\n";
+        print $out "  ${tblname}_table,\n";
+    }
+    printf $out "\n";
+    printf $out "  0x%04x, /* offset of table for 1-byte inputs */\n", $b1root;
+    printf $out "  0x%02x, /* b1_lower */\n", $b1_lower;
+    printf $out "  0x%02x, /* b1_upper */\n", $b1_upper;
+    printf $out "\n";
+    printf $out "  0x%04x, /* offset of table for 2-byte inputs */\n", $b2root;
+    printf $out "  0x%02x, /* b2_1_lower */\n", $b2_1_lower;
+    printf $out "  0x%02x, /* b2_1_upper */\n", $b2_1_upper;
+    printf $out "  0x%02x, /* b2_2_lower */\n", $b2_2_lower;
+    printf $out "  0x%02x, /* b2_2_upper */\n", $b2_2_upper;
+    printf $out "\n";
+    printf $out "  0x%04x, /* offset of table for 3-byte inputs */\n", $b3root;
+    printf $out "  0x%02x, /* b3_1_lower */\n", $b3_1_lower;
+    printf $out "  0x%02x, /* b3_1_upper */\n", $b3_1_upper;
+    printf $out "  0x%02x, /* b3_2_lower */\n", $b3_2_lower;
+    printf $out "  0x%02x, /* b3_2_upper */\n", $b3_2_upper;
+    printf $out "  0x%02x, /* b3_3_lower */\n", $b3_3_lower;
+    printf $out "  0x%02x, /* b3_3_upper */\n", $b3_3_upper;
+    printf $out "\n";
+    printf $out "  0x%04x, /* offset of table for 3-byte inputs */\n", $b4root;
+    printf $out "  0x%02x, /* b4_1_lower */\n", $b4_1_lower;
+    printf $out "  0x%02x, /* b4_1_upper */\n", $b4_1_upper;
+    printf $out "  0x%02x, /* b4_2_lower */\n", $b4_2_lower;
+    printf $out "  0x%02x, /* b4_2_upper */\n", $b4_2_upper;
+    printf $out "  0x%02x, /* b4_3_lower */\n", $b4_3_lower;
+    printf $out "  0x%02x, /* b4_3_upper */\n", $b4_3_upper;
+    printf $out "  0x%02x, /* b4_4_lower */\n", $b4_4_lower;
+    printf $out "  0x%02x  /* b4_4_upper */\n", $b4_4_upper;
+    print $out "};\n";
+    print $out "\n";
+    print $out "static const $datatype ${tblname}_table[$tblsize] =\n";
+    print $out "{";
+    my $off = 0;
+    foreach my $seg (@segments)
+    {
+        printf $out "\n";
+        printf $out "  /*** %s - offset 0x%05x ***/\n", $seg->{header}, $off;
+        printf $out "\n";
+
+        for (my $i = $seg->{min_idx}; $i <= $seg->{max_idx};)
+        {
-            $line .= "," if (!$first1 && !$nocomma);
-
-            # write the previous line and put a line header for the
-            # new line if this is the first time or this line is full
-            if ($xpos >= $colnum || $first1)
-            {
-                $line =~ s/\s+$//;    # remove trailing space
-                print $hd $line, "\n";
-                $line = sprintf("  /* %02x */ ", $j);
-                $xpos = 0;
-            }
-            else
+            # Print the next line's worth of values.
+            # XXX pad to begin at a nice boundary
+            printf $out "  /* %02x */ ", $i;
+            for (my $j = 0; $j < $vals_per_line && $i <= $seg->{max_idx}; $j++)
+            {
-                $line .= " ";
-            }
-            $first1 = 0;
+                my $val = $seg->{values}->{$i};
-            # write each column
-            if ($j >= $segstart)
-            {
-                $line .= sprintf("0x%0*x", $colwidth, $s->{d}{$j});
-                $nocomma = 0;
-            }
-            else
-            {
-                # adjust column position
-                $line .= " " x ($colwidth + 3);
-                $nocomma = 1;
+                printf $out " 0x%0*x", $colwidth, $val;
+                $off++;
+                if ($off != $tblsize)
+                {
+                    print $out ",";
+                }
+                $i++;
+            }
-            $xpos++;
+            print $out "\n";
+        }
+        if ($seg->{overlaid_trail_zeros})
+        {
+            printf $out "    /* $seg->{overlaid_trail_zeros} trailing zero values shared with next segment */\n";
 }
 
-    }
-    $line =~ s/\s+$//;
-    print $hd $line, "\n};\n";
-}
+    # Sanity check.
+    if ($off != $tblsize) { die "table size didn't match!"; }
-######################################################
-# print_flat_table - output nonsegmented index table
-#
-# print_flat_table(hd, table, tblname, width)
-# this is usually called via writ_table
-#
-# hd      - file handle to write
-# table   - ref to an index table
-# tblname - C symbol name for the table
-# width   - width in characters of this table
+    print $out "};\n";
+}
-sub print_flat_table
+###
+sub build_segments_from_tree
+{
-    my ($hd, $table, $tblname, $width) = @_;
-    my ($st, $ed) = ($table->{attr}{min}, $table->{attr}{max});
-
-    print  $hd "static const $radix_node_type $tblname =\n{";
-    printf $hd "\n  0x%x, 0x%x, /* table range */\n", $st, $ed;
-    print  $hd "  {";
+    my ($header, $rootlabel, $depth, $map) = @_;
-    my $first = 1;
-    my $line  = "";
+    my @segments;
-    foreach my $i ($st .. $ed)
+    if (%{$map})
+    {
-        $line .= "," if (!$first);
-        my $newitem = sprintf("%d",
-            defined $table->{i}{$i} ? $table->{i}{$i}{segoffset} : 0);
+        @segments = build_segments_recurse($header, $rootlabel, "", 1, $depth, $map);
-        # flush current line and feed a line if the current line
-        # exceeds a limit
-        if ($first || length($line . $newitem) > $width)
-        {
-            $line =~ s/\s+$//;    # remove trailing space
-            print $hd "$line\n";
-            $line = "    ";
-        }
-        else
-        {
-            $line .= " ";
-        }
-        $line .= $newitem;
-        $first = 0;
+        # Sort the segments into "breadth-first" order. Not strictly required,
+        # but makes the maps nicer to read.
+        @segments = sort { $a->{level} cmp $b->{level} or
+                           $a->{path}  cmp $b->{path}}
+                         @segments;
+    }
-    print $hd $line;
-    print $hd "\n  }\n};\n";
-}
-######################################################
-# print_segmented_table - output segmented index table
-#
-# print_segmented_table(hd, table, tblname, width)
-# this is usually called via writ_table
-#
-# hd      - file handle to write
-# table   - ref to an index table
-# tblname - C symbol name for the table
-# width   - width in characters of this table
+    return @segments;
+}
-sub print_segmented_table
+###
+sub build_segments_recurse
+{
-    my ($hd, $table, $tblname, $width) = @_;
-    my ($st, $ed) = ($table->{attr}{min}, $table->{attr}{max});
+    my ($header, $label, $path, $level, $depth, $map) = @_;
-    # write the variable definition
-    print $hd "static const $radix_node_type $tblname =\n{";
-    printf $hd "\n  0x%02x, 0x%02x,        /*index range */\n  {", $st, $ed;
+    my @segments;
-    my $first0 = 1;
-    foreach my $k (sort { $a <=> $b } keys $table->{i})
+    if ($level == $depth)
+    {
-        print $hd ",\n" if (!$first0);
-        $first0 = 0;
-        printf $hd "\n  /*** %sxxxx - offset 0x%05x ****/",
-          $table->{i}{$k}{label}, $table->{i}{$k}{offset};
-
-        my $segstart = $table->{i}{$k}{lower};
-        my $segend   = $table->{i}{$k}{upper};
-
-        my $line    = "";
-        my $first1  = 1;
-        my $newitem = "";
+        push @segments, {
+            header => $header . ", leaf: ${path}xx",
+            label => $label,
+            level => $level,
+            depth => $depth,
+            path => $path,
+            values => $map
+        };
+    }
+    else
+    {
+        my %children;
-        foreach my $j ($segstart .. $segend)
+        while (my ($i, $val) = each $map)
+        {
-            $line .= "," if (!$first1);
-            $newitem = sprintf("%d",
-                  $table->{i}{$k}{d}{$j}
-                ? $table->{i}{$k}{d}{$j}{segoffset}
-                : 0);
+            my $childpath = $path . sprintf("%02x", $i);
+            my $childlabel = "$depth-level-$level-$childpath";
-            if ($first1 || length($line . $newitem) > $width)
-            {
-                $line =~ s/\s+$//;
-                print $hd "$line\n";
-                $line =
-                  sprintf("  /* %2s%02x */ ", $table->{i}{$k}{label}, $j);
-            }
-            else
-            {
-                $line .= " ";
-            }
-            $line .= $newitem;
-            $first1 = 0;
+            push @segments, build_segments_recurse($header, $childlabel, $childpath,
+                                                   $level + 1, $depth, $val);
+            $children{$i} = $childlabel;
+        }
-        print $hd $line;
-    }
-    print $hd "\n  }\n};\n";
-}
-
-#########################################
-# make_table_refname(table, prefix)
-#
-# internal routine to make C reference notation for tables
-
-sub make_table_refname
-{
-    my ($table, $prefix) = @_;
-
-    return "NULL" if (!defined $table->{i});
-    return "&" . $prefix . $table->{attr}{name};
-}
-
-#########################################
-# print_radix_main(hd, tblname, trie, name_prefix)
-#
-# write main radix tree table
-#
-# hd         - file handle to write this table
-# tblname    - variable name of this struct
-# trie       - ref to a radix tree
-# name_prefix- prefix for subtables.
-sub print_radix_main
-{
-    my ($hd, $tblname, $trie, $name_prefix) = @_;
-    my $ctblname = $name_prefix . $trie->{csegs}{attr}{name};
-    my ($ctbl16name, $ctbl32name);
-    if ($trie->{csegs}{attr}{is32bit})
-    {
-        $ctbl16name = "NULL";
-        $ctbl32name = $ctblname;
-    }
-    else
-    {
-        $ctbl16name = $ctblname;
-        $ctbl32name = "NULL";
+        push @segments, {
+            header => $header . ", byte #$level: ${path}xx",
+            label => $label,
+            level => $level,
+            depth => $depth,
+            path => $path,
+            values => \%children
+        };
+    }
-
-    my $b2iname  = make_table_refname($trie->{b2idx}[0], $name_prefix);
-    my $b3i1name = make_table_refname($trie->{b3idx}[0], $name_prefix);
-    my $b3i2name = make_table_refname($trie->{b3idx}[1], $name_prefix);
-    my $b4i1name = make_table_refname($trie->{b4idx}[0], $name_prefix);
-    my $b4i2name = make_table_refname($trie->{b4idx}[1], $name_prefix);
-    my $b4i3name = make_table_refname($trie->{b4idx}[2], $name_prefix);
-
-    #<<< do not let perltidy touch this
-    print  $hd "static const $radix_type $tblname =\n{\n";
-    print  $hd "    /* final character table offset and body */\n";
-    printf $hd "    0x%x, 0x%x, %s, %s, %s,\n",
-      $trie->{csegs}{attr}{min}, $trie->{csegs}{attr}{max},
-      $trie->{csegs}{attr}{has0page} ? 'true' : 'false',
-      $ctbl16name, $ctbl32name;
-
-    print  $hd "    /* 2-byte code table */\n";
-    print  $hd "    $b2iname,\n";
-    print  $hd "    /* 3-byte code tables */\n";
-    print  $hd "    {$b3i1name, $b3i2name},\n";
-    print  $hd "    /* 4-byte code table */\n";
-    print  $hd "    {$b4i1name, $b4i2name, $b4i3name},\n";
-    print  $hd "};\n";
-    #>>>
+    return @segments;
+}
 ######################################################
@@ -1078,7 +834,6 @@ sub print_radix_map
     my ($this_script, $csname, $direction, $charset, $tblwidth) = @_;
     my $charmap = &make_charmap($charset, $direction);
 
-    my $trie = &generate_index($charmap);
     my $fname =
       $direction eq "to_unicode"
      ? lc("${csname}_to_utf8_radix.map")
@@ -1101,17 +856,8 @@ sub print_radix_map
     print $out "/* src/backend/utils/mb/Unicode/$fname */\n"
       . "/* This file is generated by $this_script */\n\n";
 
-    foreach my $t (@{ $trie->{all} })
-    {
-        my $table_name = $name_prefix . $t->{attr}{name};
-
-        if (&print_radix_table($out, $t, $table_name, $tblwidth))
-        {
-            print $out "\n";
-        }
-    }
+    print_radix_table($out, $tblname, $charmap);
-    &print_radix_main($out, $tblname, $trie, $name_prefix);
     close($out);
 }
diff --git a/src/backend/utils/mb/Unicode/download_srctxts.sh b/src/backend/utils/mb/Unicode/download_srctxts.sh
deleted file mode 100755
index 572d57e..0000000
--- a/src/backend/utils/mb/Unicode/download_srctxts.sh
+++ /dev/null
@@ -1,127 +0,0 @@
-#! /bin/bash
-
-# This script downloads conversion source files from URLs as of 2016/10/27
-# These source files may removed or changed without notice
-if [ ! -e CP932.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
-fi
-if [ ! -e JIS0201.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT
-fi
-if [ ! -e JIS0208.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT
-fi
-if [ ! -e JIS0212.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0212.TXT
-fi
-if [ ! -e SHIFTJIS.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT
-fi
-if [ ! -e CP866.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP866.TXT
-fi
-if [ ! -e CP874.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP874.TXT
-fi
-if [ ! -e CP936.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT
-fi
-if [ ! -e CP950.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
-fi
-if [ ! -e CP1250.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1250.TXT
-fi
-if [ ! -e CP1251.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1251.TXT
-fi
-if [ ! -e CP1252.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
-fi
-if [ ! -e CP1253.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1253.TXT
-fi
-if [ ! -e CP1254.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1254.TXT
-fi
-if [ ! -e CP1255.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1255.TXT
-fi
-if [ ! -e CP1256.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1256.TXT
-fi
-if [ ! -e CP1257.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1257.TXT
-fi
-if [ ! -e CP1258.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1258.TXT
-fi
-if [ ! -e 8859-2.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT
-fi
-if [ ! -e 8859-3.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-3.TXT
-fi
-if [ ! -e 8859-4.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-4.TXT
-fi
-if [ ! -e 8859-5.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-5.TXT
-fi
-if [ ! -e 8859-6.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-6.TXT
-fi
-if [ ! -e 8859-7.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-7.TXT
-fi
-if [ ! -e 8859-8.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-8.TXT
-fi
-if [ ! -e 8859-9.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-9.TXT
-fi
-if [ ! -e 8859-10.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-10.TXT
-fi
-if [ ! -e 8859-13.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-13.TXT
-fi
-if [ ! -e 8859-14.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-14.TXT
-fi
-if [ ! -e 8859-15.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-15.TXT
-fi
-if [ ! -e 8859-16.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-16.TXT
-fi
-if [ ! -e KOI8-R.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT
-fi
-if [ ! -e KOI8-U.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-U.TXT
-fi
-if [ ! -e CNS11643.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT
-fi
-if [ ! -e KSX1001.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT
-fi
-if [ ! -e JOHAB.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/JOHAB.TXT
-fi
-if [ ! -e BIG5.TXT ]; then
-    wget ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
-fi
-if [ ! -e windows-949-2000.xml ]; then
-    wget http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/windows-949-2000.xml
-fi
-if [ ! -e gb-18030-2000.xml ]; then
-    wget http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
-fi
-if [ ! -e sjis-0213-2004-std.txt ]; then
-    wget http://x0213.org/codetable/sjis-0213-2004-std.txt
-fi
-if [ ! -e euc-jis-2004-std.txt ]; then
-    wget http://x0213.org/codetable/euc-jis-2004-std.txt
-fi
diff --git a/src/backend/utils/mb/Unicode/make_mapchecker.pl b/src/backend/utils/mb/Unicode/make_mapchecker.pl
old mode 100755
new mode 100644
diff --git a/src/backend/utils/mb/Unicode/map_checker.c b/src/backend/utils/mb/Unicode/map_checker.c
index 643ac10..65e33ea 100644
--- a/src/backend/utils/mb/Unicode/map_checker.c
+++ b/src/backend/utils/mb/Unicode/map_checker.c
@@ -9,98 +9,109 @@
  * radix tree conversion function - this should be identical to the function in
  * ../conv.c with the same name
  */
 
-const uint32 pg_mb_radix_conv(const pg_mb_radix_tree *rt, const uint32 c)
+static inline uint32
+pg_mb_radix_conv(const pg_mb_radix_tree *rt,
+                 int l,
+                 unsigned char b1,
+                 unsigned char b2,
+                 unsigned char b3,
+                 unsigned char b4)
 {
-    uint32 off = 0;
-    uint32 b1 = c >> 24;
-    uint32 b2 = (c >> 16) & 0xff;
-    uint32 b3 = (c >> 8) & 0xff;
-    uint32 b4 = c & 0xff;
-
-    if (b1 > 0)
+    if (l == 4)
     {
         /* 4-byte code */
-        uint32 idx;
-
-        /* check code validity - fist byte */
-        if (rt->b4idx[0] == NULL ||
-            b1 < rt->b4idx[0]->lower || b1 > rt->b4idx[0]->upper)
-            return 0;
-
-        idx = b1 - rt->b4idx[0]->lower;
-        off = rt->b4idx[0]->idx[idx];
-        if (off == 0)
-            return 0;
-        /* check code validity - second byte */
-        if (b2 < rt->b4idx[1]->lower || b2 > rt->b4idx[1]->upper)
+        /* check code validity */
+        if (b1 < rt->b4_1_lower || b1 > rt->b4_1_upper ||
+            b2 < rt->b4_2_lower || b2 > rt->b4_2_upper ||
+            b3 < rt->b4_3_lower || b3 > rt->b4_3_upper ||
+            b4 < rt->b4_4_lower || b4 > rt->b4_4_upper)
             return 0;
-        idx = b2 - rt->b4idx[1]->lower;
-        off = (rt->b4idx[1]->idx + off - 1)[idx];
-        if (off == 0)
-            return 0;
+        if (rt->chars32)
+        {
+            uint32        idx = rt->b4root;
-        /* check code validity - third byte */
-        if (b3 < rt->b4idx[2]->lower || b3 > rt->b4idx[2]->upper)
-            return 0;
+            idx = rt->chars32[b1 + idx - rt->b4_1_lower];
+            idx = rt->chars32[b2 + idx - rt->b4_2_lower];
+            idx = rt->chars32[b3 + idx - rt->b4_3_lower];
+            return rt->chars32[b4 + idx - rt->b4_4_lower];
+        }
+        else
+        {
+            uint16        idx = rt->b4root;
-        idx = b3 - rt->b4idx[2]->lower;
-        off = (rt->b4idx[2]->idx + off - 1)[idx];
+            idx = rt->chars16[b1 + idx - rt->b4_1_lower];
+            idx = rt->chars16[b2 + idx - rt->b4_2_lower];
+            idx = rt->chars16[b3 + idx - rt->b4_3_lower];
+            return rt->chars16[b4 + idx - rt->b4_4_lower];
+        }
     }
-    else if (b2 > 0)
+    else if (l == 3)
     {
         /* 3-byte code */
-        uint32 idx;
-
-        /* check code validity - first byte */
-        if (rt->b3idx[0] == NULL ||
-            b2 < rt->b3idx[0]->lower || b2 > rt->b3idx[0]->upper)
+        /* check code validity */
+        if (b2 < rt->b3_1_lower || b2 > rt->b3_1_upper ||
+            b3 < rt->b3_2_lower || b3 > rt->b3_2_upper ||
+            b4 < rt->b3_3_lower || b4 > rt->b3_3_upper)
             return 0;
-        idx = b2 - rt->b3idx[0]->lower;
-        off = rt->b3idx[0]->idx[idx];
-        if (off == 0)
-            return 0;
+        if (rt->chars32)
+        {
+            uint32        idx = rt->b3root;
-        /* check code validity - second byte */
-        if (b3 < rt->b3idx[1]->lower || b3 > rt->b3idx[1]->upper)
-            return 0;
+            idx = rt->chars32[b2 + idx - rt->b3_1_lower];
+            idx = rt->chars32[b3 + idx - rt->b3_2_lower];
+            return rt->chars32[b4 + idx - rt->b3_3_lower];
+        }
+        else
+        {
+            uint16        idx = rt->b3root;
-        idx = b3 - rt->b3idx[1]->lower;
-        off = (rt->b3idx[1]->idx + off - 1)[idx];
+            idx = rt->chars16[b2 + idx - rt->b3_1_lower];
+            idx = rt->chars16[b3 + idx - rt->b3_2_lower];
+            return rt->chars16[b4 + idx - rt->b3_3_lower];
+        }
     }
-    else if (b3 > 0)
+    else if (l == 2)
     {
         /* 2-byte code */
-        uint32 idx;
-
         /* check code validity - first byte */
-        if (rt->b2idx == NULL ||
-            b3 < rt->b2idx->lower || b3 > rt->b2idx->upper)
+        if (b3 < rt->b2_1_lower || b3 > rt->b2_1_upper ||
+            b4 < rt->b2_2_lower || b4 > rt->b2_2_upper)
             return 0;
-        idx = b3 - rt->b2idx->lower;
-        off = rt->b2idx->idx[idx];
+        if (rt->chars32)
+        {
+            uint32        idx = rt->b2root;
+
+            idx = rt->chars32[b3 + idx - rt->b2_1_lower];
+            return rt->chars32[b4 + idx - rt->b2_2_lower];
+        }
+        else
+        {
+            uint16        idx = rt->b2root;
+
+            idx = rt->chars16[b3 + idx - rt->b2_1_lower];
+            return rt->chars16[b4 + idx - rt->b2_2_lower];
+        }
     }
-    else
+    else if (l == 1)
     {
-        if (rt->single_byte)
-            off = 1;
-    }
-
-    if (off == 0)
-        return 0;
+        /* 1-byte code */
-    /* check code validity - last byte */
-    if (b4 < rt->chars_lower || b4 > rt->chars_upper)
-        return 0;
+        /* check code validity - first byte */
+        if (b4 < rt->b1_lower || b4 > rt->b1_upper)
+            return 0;
-    if (rt->chars32)
-        return (rt->chars32 + off - 1)[b4 - rt->chars_lower];
-    else
-        return (rt->chars16 + off - 1)[b4 - rt->chars_lower];
+        if (rt->chars32)
+            return rt->chars32[b4 + rt->b1root - rt->b1_lower];
+        else
+            return rt->chars16[b4 + rt->b1root - rt->b1_lower];
+    }
+    return 0; /* shouldn't happen */
 }
 
 int main(void)
@@ -116,6 +127,12 @@ int main(void)
         {
             uint32 s, c, d;
+            unsigned char b1;
+            unsigned char b2;
+            unsigned char b3;
+            unsigned char b4;
+            int l;
+            if (mp->ul)
            {
                s = mp->ul[i].utf;
@@ -132,7 +149,20 @@ int main(void)
                 exit(1);
             }
-            c = pg_mb_radix_conv(mp->rt, s);
+            b1 = s >> 24;
+            b2 = s >> 16;
+            b3 = s >> 8;
+            b4 = s;
+            if (b1 != 0)
+                l = 4;
+            else if (b2 != 0)
+                l = 3;
+            else if (b3 != 0)
+                l = 2;
+            else
+                l = 1;
+
+            c = pg_mb_radix_conv(mp->rt, l, b1, b2, b3, b4);
             if (c != d)
             {
diff --git a/src/backend/utils/mb/conv.c b/src/backend/utils/mb/conv.c
index d4fab1f..f850487 100644
--- a/src/backend/utils/mb/conv.c
+++ b/src/backend/utils/mb/conv.c
@@ -284,36 +284,6 @@ mic2latin_with_table(const unsigned char *mic,
 
 /*
  * comparison routine for bsearch()
- * this routine is intended for UTF8 -> local code
- */
-static int
-compare1(const void *p1, const void *p2)
-{
-    uint32        v1,
-                v2;
-
-    v1 = *(const uint32 *) p1;
-    v2 = ((const pg_utf_to_local *) p2)->utf;
-    return (v1 > v2) ? 1 : ((v1 == v2) ? 0 : -1);
-}
-
-/*
- * comparison routine for bsearch()
- * this routine is intended for local code -> UTF8
- */
-static int
-compare2(const void *p1, const void *p2)
-{
-    uint32        v1,
-                v2;
-
-    v1 = *(const uint32 *) p1;
-    v2 = ((const pg_local_to_utf *) p2)->code;
-    return (v1 > v2) ? 1 : ((v1 == v2) ? 0 : -1);
-}
-
-/*
- * comparison routine for bsearch()
  * this routine is intended for combined UTF8 -> local code
  */
 static int
@@ -366,98 +336,109 @@ store_coded_char(unsigned char *dest, uint32 code)
 
 /*
  * radix tree conversion function
  */
-const uint32 pg_mb_radix_conv(const pg_mb_radix_tree *rt, const uint32 c)
+static inline uint32
+pg_mb_radix_conv(const pg_mb_radix_tree *rt,
+                 int l,
+                 unsigned char b1,
+                 unsigned char b2,
+                 unsigned char b3,
+                 unsigned char b4)
 {
-    uint32 off = 0;
-    uint32 b1 = c >> 24;
-    uint32 b2 = (c >> 16) & 0xff;
-    uint32 b3 = (c >> 8) & 0xff;
-    uint32 b4 = c & 0xff;
-
-    if (b1 > 0)
+    if (l == 4)
     {
         /* 4-byte code */
-        uint32 idx;
-
-        /* check code validity - fist byte */
-        if (rt->b4idx[0] == NULL ||
-            b1 < rt->b4idx[0]->lower || b1 > rt->b4idx[0]->upper)
-            return 0;
-
-        idx = b1 - rt->b4idx[0]->lower;
-        off = rt->b4idx[0]->idx[idx];
-        if (off == 0)
-            return 0;
-        /* check code validity - second byte */
-        if (b2 < rt->b4idx[1]->lower || b2 > rt->b4idx[1]->upper)
+        /* check code validity */
+        if (b1 < rt->b4_1_lower || b1 > rt->b4_1_upper ||
+            b2 < rt->b4_2_lower || b2 > rt->b4_2_upper ||
+            b3 < rt->b4_3_lower || b3 > rt->b4_3_upper ||
+            b4 < rt->b4_4_lower || b4 > rt->b4_4_upper)
             return 0;
-        idx = b2 - rt->b4idx[1]->lower;
-        off = (rt->b4idx[1]->idx + off - 1)[idx];
-        if (off == 0)
-            return 0;
+        if (rt->chars32)
+        {
+            uint32        idx = rt->b4root;
-        /* check code validity - third byte */
-        if (b3 < rt->b4idx[2]->lower || b3 > rt->b4idx[2]->upper)
-            return 0;
+            idx = rt->chars32[b1 + idx - rt->b4_1_lower];
+            idx = rt->chars32[b2 + idx - rt->b4_2_lower];
+            idx = rt->chars32[b3 + idx - rt->b4_3_lower];
+            return rt->chars32[b4 + idx - rt->b4_4_lower];
+        }
+        else
+        {
+            uint16        idx = rt->b4root;
-        idx = b3 - rt->b4idx[2]->lower;
-        off = (rt->b4idx[2]->idx + off - 1)[idx];
+            idx = rt->chars16[b1 + idx - rt->b4_1_lower];
+            idx = rt->chars16[b2 + idx - rt->b4_2_lower];
+            idx = rt->chars16[b3 + idx - rt->b4_3_lower];
+            return rt->chars16[b4 + idx - rt->b4_4_lower];
+        }
     }
-    else if (b2 > 0)
+    else if (l == 3)
     {
         /* 3-byte code */
-        uint32 idx;
-
-        /* check code validity - first byte */
-        if (rt->b3idx[0] == NULL ||
-            b2 < rt->b3idx[0]->lower || b2 > rt->b3idx[0]->upper)
+        /* check code validity */
+        if (b2 < rt->b3_1_lower || b2 > rt->b3_1_upper ||
+            b3 < rt->b3_2_lower || b3 > rt->b3_2_upper ||
+            b4 < rt->b3_3_lower || b4 > rt->b3_3_upper)
             return 0;
-        idx = b2 - rt->b3idx[0]->lower;
-        off = rt->b3idx[0]->idx[idx];
-        if (off == 0)
-            return 0;
+        if (rt->chars32)
+        {
+            uint32        idx = rt->b3root;
-        /* check code validity - second byte */
-        if (b3 < rt->b3idx[1]->lower || b3 > rt->b3idx[1]->upper)
-            return 0;
+            idx = rt->chars32[b2 + idx - rt->b3_1_lower];
+            idx = rt->chars32[b3 + idx - rt->b3_2_lower];
+            return rt->chars32[b4 + idx - rt->b3_3_lower];
+        }
+        else
+        {
+            uint16        idx = rt->b3root;
-        idx = b3 - rt->b3idx[1]->lower;
-        off = (rt->b3idx[1]->idx + off - 1)[idx];
+            idx = rt->chars16[b2 + idx - rt->b3_1_lower];
+            idx = rt->chars16[b3 + idx - rt->b3_2_lower];
+            return rt->chars16[b4 + idx - rt->b3_3_lower];
+        }
     }
-    else if (b3 > 0)
+    else if (l == 2)
     {
         /* 2-byte code */
-        uint32 idx;
-
         /* check code validity - first byte */
-        if (rt->b2idx == NULL ||
-            b3 < rt->b2idx->lower || b3 > rt->b2idx->upper)
+        if (b3 < rt->b2_1_lower || b3 > rt->b2_1_upper ||
+            b4 < rt->b2_2_lower || b4 > rt->b2_2_upper)
             return 0;
-        idx = b3 - rt->b2idx->lower;
-        off = rt->b2idx->idx[idx];
+        if (rt->chars32)
+        {
+            uint32        idx = rt->b2root;
+
+            idx = rt->chars32[b3 + idx - rt->b2_1_lower];
+            return rt->chars32[b4 + idx - rt->b2_2_lower];
+        }
+        else
+        {
+            uint16        idx = rt->b2root;
+
+            idx = rt->chars16[b3 + idx - rt->b2_1_lower];
+            return rt->chars16[b4 + idx - rt->b2_2_lower];
+        }
     }
-    else
+    else if (l == 1)
     {
-        if (rt->single_byte)
-            off = 1;
-    }
+        /* 1-byte code */
-    if (off == 0)
-        return 0;
-
-    /* check code validity - last byte */
-    if (b4 < rt->chars_lower || b4 > rt->chars_upper)
-        return 0;
+        /* check code validity - first byte */
+        if (b4 < rt->b1_lower || b4 > rt->b1_upper)
+            return 0;
-    if (rt->chars32)
-        return (rt->chars32 + off - 1)[b4 - rt->chars_lower];
-    else
-        return (rt->chars16 + off - 1)[b4 - rt->chars_lower];
+        if (rt->chars32)
+            return rt->chars32[b4 + rt->b1root - rt->b1_lower];
+        else
+            return rt->chars16[b4 + rt->b1root - rt->b1_lower];
+    }
+    return 0; /* shouldn't happen */
 }
 
 /*
@@ -468,7 +449,6 @@ const uint32 pg_mb_radix_conv(const pg_mb_radix_tree *rt, const uint32 c)
  * iso: pointer to the output area (must be large enough!)
  *        (output string will be null-terminated)
  * map: conversion map for single characters
- * mapsize: number of entries in the conversion map
  * cmap: conversion map for combined characters
  *        (optional, pass NULL if none)
  * cmapsize: number of entries in the conversion map for combined characters
 
@@ -486,14 +466,13 @@ const uint32 pg_mb_radix_conv(const pg_mb_radix_tree *rt, const uint32 c)
 void
 UtfToLocal(const unsigned char *utf, int len,
            unsigned char *iso,
-           const void *map, int mapsize,
-           const void *cmap, int cmapsize,
+           const pg_mb_radix_tree *map,
+           const pg_utf_to_local *cmap, int cmapsize,
            utf_local_conversion_func conv_func,
            int encoding)
 {
     uint32        iutf;
     int            l;
-    const pg_utf_to_local *p;
     const pg_utf_to_local_combined *cp;
 
     if (!PG_VALID_ENCODING(encoding))
@@ -503,6 +482,11 @@ UtfToLocal(const unsigned char *utf, int len,
     for (; len > 0; len -= l)
     {
+        unsigned char b1 = 0;
+        unsigned char b2 = 0;
+        unsigned char b3 = 0;
+        unsigned char b4 = 0;
+        /* "break" cases all represent errors */        if (*utf == '\0')            break;
@@ -524,27 +508,28 @@ UtfToLocal(const unsigned char *utf, int len,
         /* collect coded char of length l */
         if (l == 2)
         {
 
-            iutf = *utf++ << 8;
-            iutf |= *utf++;
+            b3 = *utf++;
+            b4 = *utf++;
         }
         else if (l == 3)
         {
-            iutf = *utf++ << 16;
-            iutf |= *utf++ << 8;
-            iutf |= *utf++;
+            b2 = *utf++;
+            b3 = *utf++;
+            b4 = *utf++;
         }
         else if (l == 4)
         {
-            iutf = *utf++ << 24;
-            iutf |= *utf++ << 16;
-            iutf |= *utf++ << 8;
-            iutf |= *utf++;
+            b1 = *utf++;
+            b2 = *utf++;
+            b3 = *utf++;
+            b4 = *utf++;
         }
         else
         {
             elog(ERROR, "unsupported character length %d", l);
             iutf = 0;            /* keep compiler quiet */
         }
 
+        iutf = (b1 << 24 | b2 << 16 | b3 << 8 | b4);
 
         /* First, try with combined map if possible */
         if (cmap && len > l)
 
@@ -613,21 +598,9 @@ UtfToLocal(const unsigned char *utf, int len,
         }
 
         /* Now check ordinary map */
-        if (mapsize > 0)
-        {
-            p = bsearch(&iutf, map, mapsize,
-                        sizeof(pg_utf_to_local), compare1);
-
-            if (p)
-            {
-                iso = store_coded_char(iso, p->code);
-                continue;
-            }
-        }
-        else if (map)
+        if (map)
         {
-            uint32 converted = pg_mb_radix_conv((pg_mb_radix_tree *)map,
-                                                iutf);
+            uint32 converted = pg_mb_radix_conv(map, l, b1, b2, b3, b4);
             if (converted)
             {
       iso = store_coded_char(iso, converted);
 
@@ -667,7 +640,6 @@ UtfToLocal(const unsigned char *utf, int len,
  * utf: pointer to the output area (must be large enough!)
  *        (output string will be null-terminated)
  * map: conversion map for single characters
- * mapsize: number of entries in the conversion map
  * cmap: conversion map for combined characters
  *        (optional, pass NULL if none)
  * cmapsize: number of entries in the conversion map for combined characters
 
@@ -685,14 +657,13 @@ UtfToLocal(const unsigned char *utf, int len,
 void
 LocalToUtf(const unsigned char *iso, int len,
            unsigned char *utf,
-           const void *map, int mapsize,
-           const void *cmap, int cmapsize,
+           const pg_mb_radix_tree *map,
+           const pg_local_to_utf *cmap, int cmapsize,
            utf_local_conversion_func conv_func,
            int encoding)
 {
     uint32        iiso;
     int            l;
-    const pg_local_to_utf *p;
     const pg_local_to_utf_combined *cp;
 
     if (!PG_VALID_ENCODING(encoding))
@@ -702,6 +673,11 @@ LocalToUtf(const unsigned char *iso, int len,
     for (; len > 0; len -= l)
     {
+        unsigned char b1 = 0;
+        unsigned char b2 = 0;
+        unsigned char b3 = 0;
+        unsigned char b4 = 0;
+        /* "break" cases all represent errors */        if (*iso == '\0')            break;
@@ -720,40 +696,39 @@ LocalToUtf(const unsigned char *iso, int len,
         /* collect coded char of length l */
         if (l == 1)
 
-            iiso = *iso++;
+            b4 = *iso++;
         else if (l == 2)
         {
-            iiso = *iso++ << 8;
-            iiso |= *iso++;
+            b3 = *iso++;
+            b4 = *iso++;
         }
         else if (l == 3)
         {
-            iiso = *iso++ << 16;
-            iiso |= *iso++ << 8;
-            iiso |= *iso++;
+            b2 = *iso++;
+            b3 = *iso++;
+            b4 = *iso++;
         }
         else if (l == 4)
         {
-            iiso = *iso++ << 24;
-            iiso |= *iso++ << 16;
-            iiso |= *iso++ << 8;
-            iiso |= *iso++;
+            b1 = *iso++;
+            b2 = *iso++;
+            b3 = *iso++;
+            b4 = *iso++;
         }
         else
         {
             elog(ERROR, "unsupported character length %d", l);
             iiso = 0;            /* keep compiler quiet */
         }
 
+        iiso = (b1 << 24 | b2 << 16 | b3 << 8 | b4);
-        if (mapsize > 0)
+        if (map)
         {
-            /* First check ordinary map */
-            p = bsearch(&iiso, map, mapsize,
-                        sizeof(pg_local_to_utf), compare2);
+            uint32 converted = pg_mb_radix_conv(map, l, b1, b2, b3, b4);
-            if (p)
+            if (converted)
             {
-                utf = store_coded_char(utf, p->utf);
+                utf = store_coded_char(utf, converted);
                 continue;
             }
@@ -771,16 +746,6 @@ LocalToUtf(const unsigned char *iso, int len,
                 }
             }
         }
-        else if (map)
-        {
-            uint32 converted = pg_mb_radix_conv((pg_mb_radix_tree*)map, iiso);
-
-            if (converted)
-            {
-                utf = store_coded_char(utf, converted);
-                continue;
-            }
-        }
 
         /* if there's a conversion function, try that */
         if (conv_func)
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_big5/utf8_and_big5.c
b/src/backend/utils/mb/conversion_procs/utf8_and_big5/utf8_and_big5.c
index 2857228..6ca7191 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_big5/utf8_and_big5.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_big5/utf8_and_big5.c
@@ -42,7 +42,7 @@ big5_to_utf8(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_BIG5, PG_UTF8);
 
     LocalToUtf(src, len, dest,
-               &big5_to_unicode_tree, 0,
+               &big5_to_unicode_tree,
                NULL, 0,
                NULL,
                PG_BIG5);
@@ -60,7 +60,7 @@ utf8_to_big5(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_BIG5);
 
     UtfToLocal(src, len, dest,
-               &big5_from_unicode_tree, 0,
+               &big5_from_unicode_tree,
                NULL, 0,
                NULL,
                PG_BIG5);
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_cyrillic/utf8_and_cyrillic.c
b/src/backend/utils/mb/conversion_procs/utf8_and_cyrillic/utf8_and_cyrillic.c
index f61f86b..6580243 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_cyrillic/utf8_and_cyrillic.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_cyrillic/utf8_and_cyrillic.c
@@ -48,7 +48,7 @@ utf8_to_koi8r(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_KOI8R);
 
     UtfToLocal(src, len, dest,
-               &koi8r_from_unicode_tree, 0,
+               &koi8r_from_unicode_tree,
                NULL, 0,
                NULL,
                PG_KOI8R);
@@ -66,7 +66,7 @@ koi8r_to_utf8(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_KOI8R, PG_UTF8);
 
     LocalToUtf(src, len, dest,
-               &koi8r_to_unicode_tree, 0,
+               &koi8r_to_unicode_tree,
                NULL, 0,
                NULL,
                PG_KOI8R);
@@ -84,7 +84,7 @@ utf8_to_koi8u(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_KOI8U);
 
     UtfToLocal(src, len, dest,
-               &koi8u_from_unicode_tree, 0,
+               &koi8u_from_unicode_tree,
                NULL, 0,
                NULL,
                PG_KOI8U);
@@ -102,7 +102,7 @@ koi8u_to_utf8(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_KOI8U, PG_UTF8);
 
     LocalToUtf(src, len, dest,
-               &koi8u_to_unicode_tree, 0,
+               &koi8u_to_unicode_tree,
                NULL, 0,
                NULL,
                PG_KOI8U);
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_euc2004/utf8_and_euc2004.c
b/src/backend/utils/mb/conversion_procs/utf8_and_euc2004/utf8_and_euc2004.c
index 1ad3d03..8676618 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_euc2004/utf8_and_euc2004.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_euc2004/utf8_and_euc2004.c
@@ -44,7 +44,7 @@ euc_jis_2004_to_utf8(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_EUC_JIS_2004, PG_UTF8);
 
     LocalToUtf(src, len, dest,
-               &euc_jis_2004_to_unicode_tree, 0,
+               &euc_jis_2004_to_unicode_tree,
                LUmapEUC_JIS_2004_combined, lengthof(LUmapEUC_JIS_2004_combined),
                NULL,
                PG_EUC_JIS_2004);
@@ -62,7 +62,7 @@ utf8_to_euc_jis_2004(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_EUC_JIS_2004);
 
     UtfToLocal(src, len, dest,
-               &euc_jis_2004_from_unicode_tree, 0,
+               &euc_jis_2004_from_unicode_tree,
                ULmapEUC_JIS_2004_combined, lengthof(ULmapEUC_JIS_2004_combined),
                NULL,
                PG_EUC_JIS_2004);
 
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_euc_cn/utf8_and_euc_cn.c
b/src/backend/utils/mb/conversion_procs/utf8_and_euc_cn/utf8_and_euc_cn.c
index be1a036..1dea26e 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_euc_cn/utf8_and_euc_cn.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_euc_cn/utf8_and_euc_cn.c
@@ -42,7 +42,7 @@ euc_cn_to_utf8(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_EUC_CN, PG_UTF8);
 
     LocalToUtf(src, len, dest,
-               &euc_cn_to_unicode_tree, 0,
+               &euc_cn_to_unicode_tree,
                NULL, 0,
                NULL,
                PG_EUC_CN);
@@ -60,7 +60,7 @@ utf8_to_euc_cn(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_EUC_CN);
 
     UtfToLocal(src, len, dest,
-               &euc_cn_from_unicode_tree, 0,
+               &euc_cn_from_unicode_tree,
                NULL, 0,
                NULL,
                PG_EUC_CN);
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_euc_jp/utf8_and_euc_jp.c
b/src/backend/utils/mb/conversion_procs/utf8_and_euc_jp/utf8_and_euc_jp.c
index cc46003..0f65f44 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_euc_jp/utf8_and_euc_jp.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_euc_jp/utf8_and_euc_jp.c
@@ -42,7 +42,7 @@ euc_jp_to_utf8(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_EUC_JP, PG_UTF8);
 
     LocalToUtf(src, len, dest,
-               &euc_jp_to_unicode_tree, 0,
+               &euc_jp_to_unicode_tree,
                NULL, 0,
                NULL,
                PG_EUC_JP);
@@ -60,7 +60,7 @@ utf8_to_euc_jp(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_EUC_JP);
 
     UtfToLocal(src, len, dest,
-               &euc_jp_from_unicode_tree, 0,
+               &euc_jp_from_unicode_tree,
                NULL, 0,
                NULL,
                PG_EUC_JP);
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_euc_kr/utf8_and_euc_kr.c
b/src/backend/utils/mb/conversion_procs/utf8_and_euc_kr/utf8_and_euc_kr.c
index 5e83522..d7d2d78 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_euc_kr/utf8_and_euc_kr.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_euc_kr/utf8_and_euc_kr.c
@@ -42,7 +42,7 @@ euc_kr_to_utf8(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_EUC_KR, PG_UTF8);
 
     LocalToUtf(src, len, dest,
-               &euc_kr_to_unicode_tree, 0,
+               &euc_kr_to_unicode_tree,
                NULL, 0,
                NULL,
                PG_EUC_KR);
@@ -60,7 +60,7 @@ utf8_to_euc_kr(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_EUC_KR);
 
     UtfToLocal(src, len, dest,
-               &euc_kr_from_unicode_tree, 0,
+               &euc_kr_from_unicode_tree,
                NULL, 0,
                NULL,
                PG_EUC_KR);
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_euc_tw/utf8_and_euc_tw.c
b/src/backend/utils/mb/conversion_procs/utf8_and_euc_tw/utf8_and_euc_tw.c
index dd3d791..94d9bee 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_euc_tw/utf8_and_euc_tw.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_euc_tw/utf8_and_euc_tw.c
@@ -42,7 +42,7 @@ euc_tw_to_utf8(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_EUC_TW, PG_UTF8);
 
     LocalToUtf(src, len, dest,
-               &euc_tw_to_unicode_tree, 0,
+               &euc_tw_to_unicode_tree,
                NULL, 0,
                NULL,
                PG_EUC_TW);
@@ -60,7 +60,7 @@ utf8_to_euc_tw(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_EUC_TW);
 
     UtfToLocal(src, len, dest,
-               &euc_tw_from_unicode_tree, 0,
+               &euc_tw_from_unicode_tree,
                NULL, 0,
                NULL,
                PG_EUC_TW);
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_gb18030/utf8_and_gb18030.c
b/src/backend/utils/mb/conversion_procs/utf8_and_gb18030/utf8_and_gb18030.c
index 3e3c74d..0dca5eb 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_gb18030/utf8_and_gb18030.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_gb18030/utf8_and_gb18030.c
@@ -197,7 +197,7 @@ gb18030_to_utf8(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_GB18030, PG_UTF8);
 
     LocalToUtf(src, len, dest,
-               &gb18030_to_unicode_tree, 0,
+               &gb18030_to_unicode_tree,
                NULL, 0,
                conv_18030_to_utf8,
                PG_GB18030);
@@ -215,7 +215,7 @@ utf8_to_gb18030(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_GB18030);
 
     UtfToLocal(src, len, dest,
-               &gb18030_from_unicode_tree, 0,
+               &gb18030_from_unicode_tree,
                NULL, 0,
                conv_utf8_to_18030,
                PG_GB18030);
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_gbk/utf8_and_gbk.c
b/src/backend/utils/mb/conversion_procs/utf8_and_gbk/utf8_and_gbk.c
index 872f353..06234de 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_gbk/utf8_and_gbk.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_gbk/utf8_and_gbk.c
@@ -42,7 +42,7 @@ gbk_to_utf8(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_GBK, PG_UTF8);
 
     LocalToUtf(src, len, dest,
-               &gbk_to_unicode_tree, 0,
+               &gbk_to_unicode_tree,
                NULL, 0,
                NULL,
                PG_GBK);
@@ -60,7 +60,7 @@ utf8_to_gbk(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_GBK);
 
     UtfToLocal(src, len, dest,
-               &gbk_from_unicode_tree, 0,
+               &gbk_from_unicode_tree,
                NULL, 0,
                NULL,
                PG_GBK);
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_iso8859/utf8_and_iso8859.c
b/src/backend/utils/mb/conversion_procs/utf8_and_iso8859/utf8_and_iso8859.c
index 2361528..98cd3c7 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_iso8859/utf8_and_iso8859.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_iso8859/utf8_and_iso8859.c
@@ -109,7 +109,7 @@ iso8859_to_utf8(PG_FUNCTION_ARGS)
         if (encoding == maps[i].encoding)
         {
             LocalToUtf(src, len, dest,
-                       maps[i].map1, 0,
+                       maps[i].map1,
                        NULL, 0,
                        NULL,
                        encoding);
@@ -141,7 +141,7 @@ utf8_to_iso8859(PG_FUNCTION_ARGS)
         if (encoding == maps[i].encoding)
         {
             UtfToLocal(src, len, dest,
-                       maps[i].map2, 0,
+                       maps[i].map2,
                        NULL, 0,
                        NULL,
                        encoding);
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_johab/utf8_and_johab.c
b/src/backend/utils/mb/conversion_procs/utf8_and_johab/utf8_and_johab.c
index 2d8ca18..4036fd1 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_johab/utf8_and_johab.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_johab/utf8_and_johab.c
@@ -42,7 +42,7 @@ johab_to_utf8(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_JOHAB, PG_UTF8);
 
     LocalToUtf(src, len, dest,
-               &johab_to_unicode_tree, 0,
+               &johab_to_unicode_tree,
                NULL, 0,
                NULL,
                PG_JOHAB);
@@ -60,7 +60,7 @@ utf8_to_johab(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_JOHAB);
 
     UtfToLocal(src, len, dest,
-               &johab_from_unicode_tree, 0,
+               &johab_from_unicode_tree,
                NULL, 0,
                NULL,
                PG_JOHAB);
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_sjis/utf8_and_sjis.c
b/src/backend/utils/mb/conversion_procs/utf8_and_sjis/utf8_and_sjis.c
index 0a4802d..2a4245a 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_sjis/utf8_and_sjis.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_sjis/utf8_and_sjis.c
@@ -42,7 +42,7 @@ sjis_to_utf8(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_SJIS, PG_UTF8);
 
     LocalToUtf(src, len, dest,
-               &sjis_to_unicode_tree, 0,
+               &sjis_to_unicode_tree,
                NULL, 0,
                NULL,
                PG_SJIS);
@@ -60,7 +60,7 @@ utf8_to_sjis(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_SJIS);
 
     UtfToLocal(src, len, dest,
-               &sjis_from_unicode_tree, 0,
+               &sjis_from_unicode_tree,
                NULL, 0,
                NULL,
                PG_SJIS);
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_sjis2004/utf8_and_sjis2004.c
b/src/backend/utils/mb/conversion_procs/utf8_and_sjis2004/utf8_and_sjis2004.c
index 7160741..c83c5da 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_sjis2004/utf8_and_sjis2004.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_sjis2004/utf8_and_sjis2004.c
@@ -44,7 +44,7 @@ shift_jis_2004_to_utf8(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_SHIFT_JIS_2004, PG_UTF8);
 
     LocalToUtf(src, len, dest,
-               &shift_jis_2004_to_unicode_tree, 0,
+               &shift_jis_2004_to_unicode_tree,
                LUmapSHIFT_JIS_2004_combined, lengthof(LUmapSHIFT_JIS_2004_combined),
                NULL,
                PG_SHIFT_JIS_2004);
@@ -62,7 +62,7 @@ utf8_to_shift_jis_2004(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_SHIFT_JIS_2004);
 
     UtfToLocal(src, len, dest,
-               &shift_jis_2004_from_unicode_tree, 0,
+               &shift_jis_2004_from_unicode_tree,
                ULmapSHIFT_JIS_2004_combined, lengthof(ULmapSHIFT_JIS_2004_combined),
                NULL,
                PG_SHIFT_JIS_2004);
 
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_uhc/utf8_and_uhc.c
b/src/backend/utils/mb/conversion_procs/utf8_and_uhc/utf8_and_uhc.c
index fb66a8a..d06a19b 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_uhc/utf8_and_uhc.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_uhc/utf8_and_uhc.c
@@ -42,7 +42,7 @@ uhc_to_utf8(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UHC, PG_UTF8);
 
     LocalToUtf(src, len, dest,
-               &uhc_to_unicode_tree, 0,
+               &uhc_to_unicode_tree,
                NULL, 0,
                NULL,
                PG_UHC);
@@ -60,7 +60,7 @@ utf8_to_uhc(PG_FUNCTION_ARGS)
     CHECK_ENCODING_CONVERSION_ARGS(PG_UTF8, PG_UHC);
 
     UtfToLocal(src, len, dest,
-               &uhc_from_unicode_tree, 0,
+               &uhc_from_unicode_tree,
                NULL, 0,
                NULL,
                PG_UHC);
diff --git a/src/backend/utils/mb/conversion_procs/utf8_and_win/utf8_and_win.c
b/src/backend/utils/mb/conversion_procs/utf8_and_win/utf8_and_win.c
index d213927..9f55307 100644
--- a/src/backend/utils/mb/conversion_procs/utf8_and_win/utf8_and_win.c
+++ b/src/backend/utils/mb/conversion_procs/utf8_and_win/utf8_and_win.c
@@ -90,7 +90,7 @@ win_to_utf8(PG_FUNCTION_ARGS)
         if (encoding == maps[i].encoding)
         {
             LocalToUtf(src, len, dest,
-                       maps[i].map1, 0,
+                       maps[i].map1,
                        NULL, 0,
                        NULL,
                        encoding);
@@ -122,7 +122,7 @@ utf8_to_win(PG_FUNCTION_ARGS)
         if (encoding == maps[i].encoding)
        {
             UtfToLocal(src, len, dest,
-                       maps[i].map2, 0,
+                       maps[i].map2,
                        NULL, 0,
                        NULL,
                        encoding);
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 38edbff..7efa600 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -383,26 +383,56 @@ typedef struct
     uint32        code;            /* local code */
 } pg_utf_to_local;
-/*
- * radix tree structer for faster conversion
- */
 typedef struct pg_mb_radix_index
 {
-    uint8    lower, upper;                /* index range of b2idx */
-    uint32    idx[FLEXIBLE_ARRAY_MEMBER];    /* index body */
+    uint8        lower;
+    uint8        upper;                           /* index range of b2idx */
 } pg_mb_radix_index;
+/*
+ * Radix tree structs for faster conversion
+ */
 typedef struct
 {
-    const uint8    chars_lower, chars_upper;    /* index range of chars* */
-    const bool single_byte;                    /* true if the first segment is
-                                             * for single byte characters*/
-    const uint16 *chars16;                    /* 16 bit character table */
+    /*
+     * Array containing all the values. Only one of chars16 or chars32 is
+     * used, depending on how wide the values we need to represent are.
+     */
+    const uint16 *chars16;                    /* 16 bit */
     const uint32 *chars32;                    /* 32 bit character table */
 
-    const pg_mb_radix_index *b2idx;
-    const pg_mb_radix_index *b3idx[2];
-    const pg_mb_radix_index *b4idx[3];
+    /* Radix tree for 1-byte inputs */
+    uint32        b1root;        /* offset of table in the chars[16|32] array */
+    uint8        b1_lower;    /* min allowed value for a single byte input */
+    uint8        b1_upper;    /* max allowed value for a single byte input */
+
+    /* Radix tree for 2-byte inputs */
+    uint32        b2root;        /* offset of 1st byte's table */
+    uint8        b2_1_lower; /* min/max allowed value for 1st input byte */
+    uint8        b2_1_upper;
+    uint8        b2_2_lower; /* min/max allowed value for 2nd input byte */
+    uint8        b2_2_upper;
+
+    /* Radix tree for 3-byte inputs */
+    uint32        b3root;        /* offset of 1st byte's table */
+    uint8        b3_1_lower; /* min/max allowed value for 1st input byte */
+    uint8        b3_1_upper;
+    uint8        b3_2_lower; /* min/max allowed value for 2nd input byte */
+    uint8        b3_2_upper;
+    uint8        b3_3_lower; /* min/max allowed value for 3rd input byte */
+    uint8        b3_3_upper;
+
+    /* Radix tree for 4-byte inputs */
+    uint32        b4root;        /* offset of 1st byte's table */
+    uint8        b4_1_lower; /* min/max allowed value for 1st input byte */
+    uint8        b4_1_upper;
+    uint8        b4_2_lower; /* min/max allowed value for 2nd input byte */
+    uint8        b4_2_upper;
+    uint8        b4_3_lower; /* min/max allowed value for 3rd input byte */
+    uint8        b4_3_upper;
+    uint8        b4_4_lower; /* min/max allowed value for 4th input byte */
+    uint8        b4_4_upper;
+} pg_mb_radix_tree;
 
 /*
@@ -532,14 +562,14 @@ extern unsigned short CNStoBIG5(unsigned short cns, unsigned char lc);
 extern void UtfToLocal(const unsigned char *utf, int len,
            unsigned char *iso,
 
-           const void *map, int mapsize,
-           const void *combined_map, int cmapsize,
+           const pg_mb_radix_tree *map,
+           const pg_utf_to_local *cmap, int cmapsize,
            utf_local_conversion_func conv_func,
            int encoding);
 extern void LocalToUtf(const unsigned char *iso, int len,
            unsigned char *utf,
 
-           const void *map, int mapsize,
-           const void *combined_cmap, int cmapsize,
+           const pg_mb_radix_tree *map,
+           const pg_local_to_utf *cmap, int cmapsize,
            utf_local_conversion_func conv_func,
            int encoding);
@@ -573,7 +603,6 @@ extern void latin2mic_with_table(const unsigned char *l, unsigned char *p,
 extern void mic2latin_with_table(const unsigned char *mic, unsigned char *p,
                      int len, int lc, int encoding,
                      const unsigned char *tab);
 
-extern const uint32 pg_mb_radix_conv(const pg_mb_radix_tree *rt, const uint32 c);
 extern bool pg_utf8_islegal(const unsigned char *source, int length);

Re: [HACKERS] Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Hello, I found a bug in my portion while rebasing.

The attached patches apply on top of the current master HEAD, not
on Heikki's previous one, and are separated into 4 parts.

At Tue, 13 Dec 2016 15:11:03 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20161213.151103.157484378.horiguchi.kyotaro@lab.ntt.co.jp>
> > Apart from the aboves, I have some trivial comments on the new
> > version.
> > 
> > 
> > 1. If we decide not to use old-style maps, UtfToLocal no longer
> >   need to take void * as map data. (Patch 0001)

I changed the pointer type wrongly. Combined maps are of the type
*_combined.

> > 2. "use Data::Dumper" doesn't seem necessary. (Patch 0002)
> > 3. A comment contains a superfluous comma. (Patch 0002) (The last
> >    byte of the first line below)
> > 4. The following code doesn't seem so perl'ish.
> > 5. download_srctxts.sh is no longer needed. (No patch)
> 
> 6. Fixed some inconsistent indentation/folding.
> 7. Fix handling of $verbose.
> 8. Sort segments using leading bytes.

The attached files are the following. This patchset is not
complete: it is missing the changes to the map files. That change is
tremendously large but can be regenerated.

0001-Add-missing-semicolon.patch
 UCS_to_EUC_JP.pl has a line missing its terminating semicolon. This does no
 harm but is surely a syntax error. This patch fixes it. This might be better
 as a separate patch.
 

0002-Correct-reference-resolution-syntax.patch
 convutils.pm has lines with differing syntaxes for reference resolution. This
 unifies the syntax; a minimal illustration follows.
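 (For illustration only -- a hypothetical snippet, not code from convutils.pm;
 the variable and key here are made up. All three spellings read the same hash
 entry; the arrow form is the one the patch settles on.)

     use strict;
     use warnings;

     my $var = { foo => 42 };

     print $$var{foo}, "\n";      # terse, but easy to misread
     print ${$var}{foo}, "\n";    # explicit block dereference
     print $var->{foo}, "\n";     # arrow notation, generally the clearest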

0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch
 Before adding the radix tree stuff, applied pgperltidy and inserted a
 format-skipping pragma for the parts where perltidy seems to do too much.
 

0004-Use-radix-tree-for-character-conversion.patch
 Radix tree body.


The unattached fifth patch is generated by the following steps.

[$(TOP)]$ ./configure
[Unicode]$ make
[Unicode]$ make distclean
[Unicode]$ git add .
[Unicode]$ git commit
=== COMMIT MESSAGE
Replace map files with radix tree files.

These encodings no longer use the former map files and instead use the new
radix tree files.
===


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] Radix tree for character conversion

From
Ishii Ayumi
Date:
Hi,

I applied the 4-patch set and ran "make", but it failed.
Is this a bug or my mistake?
I'm sorry if I'm wrong.

[$(TOP)]$ patch -p1 < ../0001-Add-missing-semicolon.patch
[$(TOP)]$ patch -p1 < ../0002-Correct-reference-resolution-syntax.patch
[$(TOP)]$ patch -p1 <
../0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch
[$(TOP)]$ patch -p1 < ../0004-Use-radix-tree-for-character-conversion.patch
[$(TOP)]$ ./configure
[Unicode]$ make
'/usr/bin/perl' UCS_to_most.pl
Type of arg 1 to keys must be hash (not hash element) at convutils.pm
line 443, near "})       "
Type of arg 1 to values must be hash (not hash element) at
convutils.pm line 596, near "})       "
Type of arg 1 to each must be hash (not private variable) at
convutils.pm line 755, near "$map)       "
Compilation failed in require at UCS_to_most.pl line 19.
make: *** [iso8859_2_to_utf8.map] Error 255

Regards,
-- 
Ayumi Ishii



Re: Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Hello, thank you for looking this.

At Wed, 25 Jan 2017 19:18:26 +0900, Ishii Ayumi <ayumi.ishii.pg@gmail.com> wrote in
<CAOu5J714+w-TRSNHbsS+aBVE5LdsR3CEZ6w4QLQ=9NrAJNavTA@mail.gmail.com>
> I applied the 4-patch set and ran "make", but it failed.
> Is this a bug or my mistake?
> I'm sorry if I'm wrong.
> 
> [$(TOP)]$ patch -p1 < ../0001-Add-missing-semicolon.patch
> [$(TOP)]$ patch -p1 < ../0002-Correct-reference-resolution-syntax.patch
> [$(TOP)]$ patch -p1 <
> ../0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch
> [$(TOP)]$ patch -p1 < ../0004-Use-radix-tree-for-character-conversion.patch
> [$(TOP)]$ ./configure
> [Unicode]$ make

The directory src/backend/utils/mb/Unicode is not built as a part
of the top-level build, and it is preferable to remove the
preexisting map files first.

$ cd src/backend/utils/mb/Unicode
$ make distclean      # this would require ./configure
$ make maintainer-clean
$ cd ../../../../..   # go to top
$ make clean
(make'ing here will give you an error saying a .map file is not found)
$ cd  src/backend/utils/mb/Unicode  # again
$ make
$ cd ../../../../..   # go to top
$ make

These steps still succeed for me, even with the patches applied on the
current master.

The cause of the following errors seems to be something else.

> '/usr/bin/perl' UCS_to_most.pl
> Type of arg 1 to keys must be hash (not hash element) at convutils.pm
> line 443, near "})
>         "
> Type of arg 1 to values must be hash (not hash element) at
> convutils.pm line 596, near "})
>         "
> Type of arg 1 to each must be hash (not private variable) at
> convutils.pm line 755, near "$map)
>         "
> Compilation failed in require at UCS_to_most.pl line 19.
> make: *** [iso8859_2_to_utf8.map] Error 255

Indeed, perl 5.8.9 complained just as above but 5.16
doesn't. Google told me that at least 5.10 behaves the same
way. The *current* perl version required for the build is 5.8.

https://www.postgresql.org/docs/current/static/install-requirements.html

Fortunately, a change of only three lines suffices for 5.8, so I've
chosen to accommodate perl 5.8. A sketch of the incompatibility is below.
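(A minimal sketch of the incompatibility, using a made-up hash reference --
this is not the convutils.pm code itself. Perl 5.14 and later accept
keys/values/each directly on a hash reference; perl 5.8 rejects that at
compile time with exactly the errors quoted above. An explicit dereference
works on both.)

    use strict;
    use warnings;

    my $map = { 0x61 => 0x0061, 0x62 => 0x0062 };

    # while (my ($i, $val) = each $map) { ... }   # ok on 5.14+, syntax error on 5.8

    while (my ($i, $val) = each %{$map})          # works on 5.8 and later
    {
        printf "0x%02x => 0x%04x\n", $i, $val;
    }

    my @codes = sort keys %{$map};                # same rule for keys/values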

As a result, no changes have been made to 0001-0003, so I've
attached only 0004 to this mail.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] Radix tree for character conversion

From
Michael Paquier
Date:
On Tue, Jan 10, 2017 at 8:22 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> [...patch...]

Nobody has shown up yet to review this patch, so I am giving it a shot.

The patch file sizes are scary at first sight, but after having a look:
 36 files changed, 1411 insertions(+), 54398 deletions(-)
Yes that's a surprise, something like git diff --irreversible-delete
would have helped as most of the diffs are just caused by 3 files
being deleted in patch 0004, making 50k lines go to the abyss of
deletion.

> Hello, I found a bug in my portion while rebasing.

Right, that's 0001. Nice catch.

> The attached files are the following. This patchset is not
> complete, missing the changes to the map files. The change is tremendously
> large but generatable.
>
> 0001-Add-missing-semicolon.patch
>
>   UCS_to_EUC_JP.pl has a line missing a terminating semicolon. This
>   does no harm but is surely a syntax error. This patch fixes it.
>   This might need to be a separate patch.

This requires a back-patch. This makes me wonder how long this script
has actually not run...

> 0002-Correct-reference-resolution-syntax.patch
>
>   convutils.pm has lines with different syntax of reference
>   resolution. This unifies the syntax.

Yes that looks right to me. I am not the best perl guru on this list
but looking around $$var{foo} is bad, ${$var}{foo} is better, and
$var->{foo} is even better. This also generates no diffs when running
make in src/backend/utils/mb/Unicode/. So no objections to that.
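
Just to illustrate, the three forms are interchangeable; this tiny
sketch (hypothetical variable, not from the patch) prints 42 three
times:

    my $var = { foo => 42 };
    print $$var{foo}, "\n";      # terse, but easy to misread
    print ${$var}{foo}, "\n";    # explicit block dereference
    print $var->{foo}, "\n";     # arrow syntax, usually the clearest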

> 0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch
>
>   Before adding radix tree stuff, applied pgperltidy and inserted
>   format-skipping pragma for the parts where perltidy seems to do
>   too much.

Which version of perltidy did you use? Looking at the archives, the
perl code is cleaned up with a specific version, v20090616. See
https://www.postgresql.org/message-id/20151204054322.GA2070309@tornado.leadboat.com
for example on the matter. As perltidy changes over time, this may be
a sensitive change if done this way.

> 0004-Use-radix-tree-for-character-conversion.patch
>
>   Radix tree body.

Well, here a lot of diffs could have been saved.

> The unattached fifth patch is generated by the following steps.
>
> [$(TOP)]$ ./configure
> [Unicode]$ make
> [Unicode]$ make distclean
> [Unicode]$ git add .
> [Unicode]$ git commit
> === COMMIT MESSAGE
> Replace map files with radix tree files.
>
> These encodings no longer use the former map files and use new radix
> tree files.
> ===

OK, I can see that working, with 200k of maps generated.. So going
through the important bits of this jungle..

+/*
+ * radix tree conversion function - this should be identical to the function in
+ * ../conv.c with the same name
+ */
+static inline uint32
+pg_mb_radix_conv(const pg_mb_radix_tree *rt,
+                int l,
+                unsigned char b1,
+                unsigned char b2,
+                unsigned char b3,
+                unsigned char b4)
This is not nice. Having a duplication like that is a recipe to forget
about it as this patch introduces a dependency with conv.c and the
radix tree generation.

Having a .gitignore in Unicode/ would be nice, particularly to avoid
committing map_checker.

A README documenting things may be welcome, or at least comments at
the top of map_checker.c. Why is map_checker essential? What does it
do? There is no way to understand that easily, except that it includes
a "radix tree conversion function", and that it performs sanity checks
on the radix trees to be sure that they are in good shape. But this is
something that one would guess only after looking at your patch
and the code (at least I will sleep less stupid tonight after reading
this stuff).

--- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
 # Drop these SJIS codes from the source for UTF8=>SJIS conversion
 #<<< do not let perltidy touch this
-my @reject_sjis =(
+my @reject_sjis = (
    0xed40..0xeefc, 0x8754..0x875d, 0x878a, 0x8782,
-   0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
+   0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
    0x879a..0x879c
-);
+   );
This is not generated, it would be nice to drop the noise from the patch.

Here is another one:
-       $i->{code} = $jis | (
-           $jis < 0x100
-           ? 0x8e00
-           : ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
-
+#<<< do not let perltidy touch this
+       $i->{code} = $jis | ($jis < 0x100 ? 0x8e00:
+                            ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
+#>>>
        if (l == 2)
        {
-           iutf = *utf++ << 8;
-           iutf |= *utf++;
+           b3 = *utf++;
+           b4 = *utf++;
        }
Ah, OK. This conversion is important so that it performs a minimum of
bitwise operations. Yes, let's keep that. That's pretty cool to get a
faster operation.
-- 
Michael



Re: [HACKERS] Radix tree for character conversion

From
Michael Paquier
Date:
On Wed, Jan 25, 2017 at 7:18 PM, Ishii Ayumi <ayumi.ishii.pg@gmail.com> wrote:
> I applied the 4-patch set and ran "make", but it failed.
> Is this a bug or my mistake?
> I'm sorry if I'm wrong.
>
> [$(TOP)]$ patch -p1 < ../0001-Add-missing-semicolon.patch
> [$(TOP)]$ patch -p1 < ../0002-Correct-reference-resolution-syntax.patch
> [$(TOP)]$ patch -p1 <
> ../0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch
> [$(TOP)]$ patch -p1 < ../0004-Use-radix-tree-for-character-conversion.patch
> [$(TOP)]$ ./configure
> [Unicode]$ make
> '/usr/bin/perl' UCS_to_most.pl
> Type of arg 1 to keys must be hash (not hash element) at convutils.pm
> line 443, near "})
>         "
> Type of arg 1 to values must be hash (not hash element) at
> convutils.pm line 596, near "})
>         "
> Type of arg 1 to each must be hash (not private variable) at
> convutils.pm line 755, near "$map)
>         "
> Compilation failed in require at UCS_to_most.pl line 19.
> make: *** [iso8859_2_to_utf8.map] Error 255

Hm, I am not sure what you are missing. I was able to get things to build.
-- 
Michael



Re: [HACKERS] Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
At Thu, 26 Jan 2017 16:29:10 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqQd860tOC17O3Qs3+dzZTYbXrXxVD9Tfph0pJ9LAYZ=Ww@mail.gmail.com>
> On Wed, Jan 25, 2017 at 7:18 PM, Ishii Ayumi <ayumi.ishii.pg@gmail.com> wrote:
> > I applied the 4-patch set and ran "make", but it failed.
> > Is this a bug or my mistake?
> > I'm sorry if I'm wrong.
> >
> > [$(TOP)]$ patch -p1 < ../0001-Add-missing-semicolon.patch
> > [$(TOP)]$ patch -p1 < ../0002-Correct-reference-resolution-syntax.patch
> > [$(TOP)]$ patch -p1 <
> > ../0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch
> > [$(TOP)]$ patch -p1 < ../0004-Use-radix-tree-for-character-conversion.patch
> > [$(TOP)]$ ./configure
> > [Unicode]$ make
> > '/usr/bin/perl' UCS_to_most.pl
> > Type of arg 1 to keys must be hash (not hash element) at convutils.pm
> > line 443, near "})
> >         "
> > Type of arg 1 to values must be hash (not hash element) at
> > convutils.pm line 596, near "})
> >         "
> > Type of arg 1 to each must be hash (not private variable) at
> > convutils.pm line 755, near "$map)
> >         "
> > Compilation failed in require at UCS_to_most.pl line 19.
> > make: *** [iso8859_2_to_utf8.map] Error 255
> 
> Hm, I am not sure what you are missing. I was able to get things to build.

As I posted, it should be caused by an older Perl; at least 5.8
complains like that and 5.16 doesn't.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Thank you for looking at this.

At Thu, 26 Jan 2017 16:28:16 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqREL1fsDBGv4zRvaXY+UKtS0wzkamJcnYhX0--OZvpUUQ@mail.gmail.com>
> On Tue, Jan 10, 2017 at 8:22 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > [...patch...]
> 
> Nobody has shown up yet to review this patch, so I am giving it a shot.
> 
> The patch file sizes are scary at first sight, but after having a look:
>  36 files changed, 1411 insertions(+), 54398 deletions(-)
> Yes that's a surprise, something like git diff --irreversible-delete
> would have helped as most of the diffs are just caused by 3 files
> being deleted in patch 0004, making 50k lines go to the abyss of
> deletion.

Thank you. Good to hear that. I'll try that at the next chance.

> > Hello, I found a bug in my portion while rebasing.
> 
> Right, that's 0001. Nice catch.
> 
> > The attached files are the following. This patchset is not
> > complete, missing the changes to the map files. The change is tremendously
> > large but generatable.
> >
> > 0001-Add-missing-semicolon.patch
> >
> >   UCS_to_EUC_JP.pl has a line missing a terminating semicolon. This
> >   does no harm but is surely a syntax error. This patch fixes it.
> >   This might need to be a separate patch.
> 
> This requires a back-patch. This makes me wonder how long this script
> has actually not run...
> 
> > 0002-Correct-reference-resolution-syntax.patch
> >
> >   convutils.pm has lines with different syntax of reference
> >   resolution. This unifies the syntax.
> 
> Yes that looks right to me.

Yes, I thought that the three patches could be back-patched, as a kind
of bug fix.

> I am not the best perl guru on this list but
> looking around $$var{foo} is bad, ${$var}{foo} is better, and
> $var->{foo} is even better. This also generates no diffs when running
> make in src/backend/utils/mb/Unicode/. So no objections to that.

Thank you for the explanation. I think no '$$' is left now.

> > 0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch
> >
> >   Before adding radix tree stuff, applied pgperltidy and inserted
> >   format-skipping pragma for the parts where perltidy seems to do
> >   too much.
> 
> Which version of perltidy did you use? Looking at the archives, the
> perl code is cleaned up with a specific version, v20090616. See
> https://www.postgresql.org/message-id/20151204054322.GA2070309@tornado.leadboat.com
> for example on the matter. As perltidy changes over time, this may be
> a sensitive change if done this way.

Hmm. I will confirm that... tomorrow.

> > 0004-Use-radix-tree-for-character-conversion.patch
> >
> >   Radix tree body.
> 
> Well, here a lot of diffs could have been saved.
> 
> > The unattached fifth patch is generated by the following steps.
> >
> > [$(TOP)]$ ./configure
> > [Unicode]$ make
> > [Unicode]$ make distclean
> > [Unicode]$ git add .
> > [Unicode]$ git commit
> > === COMMIT MESSAGE
> > Replace map files with radix tree files.
> >
> > These encodings no longer use the former map files and use new radix
> > tree files.
> > ===
> 
> OK, I can see that working, with 200k of maps generated.. So going
> through the important bits of this jungle..

Many thanks for the exploration.

> +/*
> + * radix tree conversion function - this should be identical to the function in
> + * ../conv.c with the same name
> + */
> +static inline uint32
> +pg_mb_radix_conv(const pg_mb_radix_tree *rt,
> +                int l,
> +                unsigned char b1,
> +                unsigned char b2,
> +                unsigned char b3,
> +                unsigned char b4)
> This is not nice. Having a duplication like that is a recipe to forget
> about it as this patch introduces a dependency with conv.c and the
> radix tree generation.

Mmmmm. I agree with you, but conv.c contains unwanted references to
elog and other core stuff. Separating the function into a dedicated
source file named something like "../pg_mb_radix_conv.c" would
work. If that is not so bad, I'll do it in the next version.

> Having a .gitignore in Unicode/ would be nice, particularly to avoid
> committing map_checker.
> 
> A README documenting things may be welcome, or at least comments at
> the top of map_checker.c. Why is map_checker essential? What does it
> do? There is no way to understand that easily, except that it includes
> a "radix tree conversion function", and that it performs sanity checks
> on the radix trees to be sure that they are in good shape. But this is
> something that one would guess only after looking at your patch
> and the code (at least I will sleep less stupid tonight after reading
> this stuff).

Okay, I'll do that.

> --- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
> +++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
>  # Drop these SJIS codes from the source for UTF8=>SJIS conversion
>  #<<< do not let perltidy touch this
> -my @reject_sjis =(
> +my @reject_sjis = (
>     0xed40..0xeefc, 0x8754..0x875d, 0x878a, 0x8782,
> -   0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
> +   0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
>     0x879a..0x879c
> -);
> +   );
> This is not generated, it would be nice to drop the noise from the patch.

Mmm. I'm not sure how this was generated, but I'll take care of it.

> Here is another one:
> -       $i->{code} = $jis | (
> -           $jis < 0x100
> -           ? 0x8e00
> -           : ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
> -
> +#<<< do not let perltidy touch this
> +       $i->{code} = $jis | ($jis < 0x100 ? 0x8e00:
> +                            ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
> +#>>>

Ok. Will revert this.

>         if (l == 2)
>         {
> -           iutf = *utf++ << 8;
> -           iutf |= *utf++;
> +           b3 = *utf++;
> +           b4 = *utf++;
>         }
> Ah, OK. This conversion is important so that it performs a minimum of
> bitwise operations. Yes, let's keep that. That's pretty cool to get a
> faster operation.

It is Heikki's work:p

I'll address them and repost the next version soon.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Hi, this is an intermediate report without a patch.

At Thu, 26 Jan 2017 21:42:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170126.214212.111556326.horiguchi.kyotaro@lab.ntt.co.jp>
> > > 0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch
> > >
> > >   Before adding radix tree stuff, applied pgperltidy and inserted
> > >   format-skipping pragma for the parts where perltidy seems to do
> > >   too much.
> > 
> > Which version of perltidy did you use? Looking at the archives, the
> > perl code is cleaned up with a specific version, v20090616. See
> > https://www.postgresql.org/message-id/20151204054322.GA2070309@tornado.leadboat.com
> > for example on the matter. As perltidy changes over time, this may be
> > a sensitive change if done this way.
> 
> Hmm. I will confirm that... tomorrow.

My perltidy -v said "v20121207". Anyway, I gave up on applying
perltidy by myself. So I'll just drop 0003, and the new 0004 (renamed
to 0003) is made directly on top of 0002.

> > > 0004-Use-radix-tree-for-character-conversion.patch
> > >
> > >   Radix tree body.
> > 
> > Well, here a lot of diffs could have been saved.
> > 
> > > The unattached fifth patch is generated by the following steps.
> > >
> > > [$(TOP)]$ ./configure
> > > [Unicode]$ make
> > > [Unicode]$ make distclean
> > > [Unicode]$ git add .
> > > [Unicode]$ git commit
> > > === COMMIT MESSAGE
> > > Replace map files with radix tree files.
> > >
> > > These encodings no longer use the former map files and use new radix
> > > tree files.
> > > ===
> > 
> > OK, I can see that working, with 200k of maps generated.. So going
> > through the important bits of this jungle..
> 
> > Many thanks for the exploration.
> 
> > +/*
> > + * radix tree conversion function - this should be identical to the function in
> > + * ../conv.c with the same name
..
> > This is not nice. Having a duplication like that is a recipe to forget
> > about it as this patch introduces a dependency with conv.c and the
> > radix tree generation.

In the attached patch, mb/char_converter.c, which contains one
inline function, is created, and it is #include'd from mb/conv.c and
mb/Unicode/map_checker.c.

> > Having a .gitignore in Unicode/ would be nice, particularly to avoid
> > committing map_checker.

I missed this.  I added a .gitignore to ignore the map_checker stuff,
the authority files, and the old-style map files.

> > A README documenting things may be welcome, or at least comments at
> > the top of map_checker.c. Why is map_checker essential? What does it
> > do? There is no way to understand that easily, except that it includes
> > a "radix tree conversion function", and that it performs sanity checks
> > on the radix trees to be sure that they are in good shape. But this is
> > something that one would guess only after looking at your patch
> > and the code (at least I will sleep less stupid tonight after reading
> > this stuff).
> 
> Okay, I'll do that.

The patch is not attached yet; I'm going to put the
following comment just before main() in map_checker.c.

/*
 * The old-style plain map files were error-resistant due to their
 * straightforward generation from the authority files. In contrast, the
 * radix tree maps are generated by a rather complex calculation and have a
 * complex, hard-to-confirm format.
 *
 * This program runs a sanity check of the radix tree maps by confirming that
 * all characters in the plain map files are converted to the same code by the
 * corresponding radix tree map.
 *
 * All map files are included by map_checker.h, which is generated by the
 * script make_mapchecker.pl as the variable mappairs.
 */
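
To make the invariant concrete, the check amounts to something like the
following toy Perl sketch (conceptual only -- the real radix maps are
flat C arrays and these names are invented):

    # A plain map flattened as utf8-int => local code.
    my %plainmap = ( 0xc2a1 => 0x00a1, 0xc2a2 => 0x00a2 );

    # Build a per-byte tree; nested hashes stand in for the generated arrays.
    my %tree;
    for my $utf8 (keys %plainmap)
    {
        $tree{ ($utf8 >> 8) & 0xff }{ $utf8 & 0xff } = $plainmap{$utf8};
    }

    # Every plain-map entry must survive the byte-wise walk unchanged.
    for my $utf8 (keys %plainmap)
    {
        my $got = $tree{ ($utf8 >> 8) & 0xff }{ $utf8 & 0xff };
        die sprintf("mismatch at %x\n", $utf8) if $got != $plainmap{$utf8};
    }
    print "all entries match\n";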


I'll do the following things later.

> > --- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
> > +++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
> >  # Drop these SJIS codes from the source for UTF8=>SJIS conversion
> >  #<<< do not let perltidy touch this
> > -my @reject_sjis =(
> > +my @reject_sjis = (
> >     0xed40..0xeefc, 0x8754..0x875d, 0x878a, 0x8782,
> > -   0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
> > +   0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
> >     0x879a..0x879c
> > -);
> > +   );
> > This is not generated, it would be nice to drop the noise from the patch.
> 
> Mmm. I'm not sure how this was generated, but I'll take care of it.
> 
> > Here is another one:
> > -       $i->{code} = $jis | (
> > -           $jis < 0x100
> > -           ? 0x8e00
> > -           : ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
> > -
> > +#<<< do not let perltidy touch this
> > +       $i->{code} = $jis | ($jis < 0x100 ? 0x8e00:
> > +                            ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
> > +#>>>
> 
> Ok. Will revert this.
> 
> >         if (l == 2)
> >         {
> > -           iutf = *utf++ << 8;
> > -           iutf |= *utf++;
> > +           b3 = *utf++;
> > +           b4 = *utf++;
> >         }
> > Ah, OK. This conversion is important so that it performs a minimum of
> > bitwise operations. Yes, let's keep that. That's pretty cool to get a
> > faster operation.
> 
> It is Heikki's work:p
> 
> I'll address them and repost the next version soon.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Hello, this is the revised version of character conversion using radix tree.

At Fri, 27 Jan 2017 17:33:57 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170127.173357.221584433.horiguchi.kyotaro@lab.ntt.co.jp>
> Hi, this is an intermediate report without a patch.
> 
> At Thu, 26 Jan 2017 21:42:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
> <20170126.214212.111556326.horiguchi.kyotaro@lab.ntt.co.jp>
> > > > 0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch
> > > >
> > > >   Before adding radix tree stuff, applied pgperltidy and inserted
> > > >   format-skipping pragma for the parts where perltidy seems to do
> > > >   too much.
> > > 
> > > Which version of perltidy did you use? Looking at the archives, the
> > > perl code is cleaned up with a specific version, v20090616. See
> > > https://www.postgresql.org/message-id/20151204054322.GA2070309@tornado.leadboat.com
> > > for example on the matter. As perltidy changes over time, this may be
> > > a sensitive change if done this way.
> 
> My perltidy -v said "v20121207". Anyway, I gave up on applying
> perltidy by myself. So I'll just drop 0003, and the new 0004 (renamed
> to 0003) is made directly on top of 0002.

I'm not sure how to handle this, so I just removed the perltidy
stuff from this patchset.

> > > > 0004-Use-radix-tree-for-character-conversion.patch
> > > >
> > > >   Radix tree body.
> > > 
> > > Well, here a lot of diffs could have been saved.
> > > 
> > > > The unattached fifth patch is generated by the following steps.
> > > >
> > > > [$(TOP)]$ ./configure
> > > > [Unicode]$ make
> > > > [Unicode]$ make distclean
> > > > [Unicode]$ git add .
> > > > [Unicode]$ git commit
> > > > === COMMIT MESSAGE
> > > > Replace map files with radix tree files.
> > > >
> > > > These encodings no longer use the former map files and use new radix
> > > > tree files.
> > > > ===
> > > 
> > > OK, I can see that working, with 200k of maps generated.. So going
> > > through the important bits of this jungle..
> > 
> > Many thanks for the exploration.
> > 
> > > +/*
> > > + * radix tree conversion function - this should be identical to the function in
> > > + * ../conv.c with the same name
> ..
> > > This is not nice. Having a duplication like that is a recipe to forget
> > > about it as this patch introduces a dependency with conv.c and the
> > > radix tree generation.
> 
> In the attached patch, mb/char_converter.c, which contains one
> inline function, is created, and it is #include'd from mb/conv.c and
> mb/Unicode/map_checker.c.
> 
> > > Having a .gitignore in Unicode/ would be nice, particularly to avoid
> > > committing map_checker.
> 
> I missed this.  I added a .gitignore to ignore the map_checker stuff,
> the authority files, and the old-style map files.
> 
> > > A README documenting things may be welcome, or at least comments at
> > > the top of map_checker.c. Why is map_checker essential? What does it
> > > do? There is no way to understand that easily, except that it includes
> > > a "radix tree conversion function", and that it performs sanity checks
> > > on the radix trees to be sure that they are in good shape. But this is
> > > something that one would guess only after looking at your patch
> > > and the code (at least I will sleep less stupid tonight after reading
> > > this stuff).
> > 
> > Okay, I'll do that.
> 
> The patch is not attached yet; I'm going to put the
> following comment just before main() in map_checker.c.
> 
> /*
>  * The old-style plain map files were error-resistant due to their
>  * straightforward generation from the authority files. In contrast, the
>  * radix tree maps are generated by a rather complex calculation and have a
>  * complex, hard-to-confirm format.
>  *
>  * This program runs a sanity check of the radix tree maps by confirming that
>  * all characters in the plain map files are converted to the same code by the
>  * corresponding radix tree map.
>  *
>  * All map files are included by map_checker.h, which is generated by the
>  * script make_mapchecker.pl as the variable mappairs.
>  */
> 
> 
> I'll do the following things later.

The following is the continuation.

> > --- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
> > +++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
> >  # Drop these SJIS codes from the source for UTF8=>SJIS conversion
> >  #<<< do not let perltidy touch this
> > -my @reject_sjis =(
> > +my @reject_sjis = (
> >     0xed40..0xeefc, 0x8754..0x875d, 0x878a, 0x8782,
> > -   0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
> > +   0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
> >     0x879a..0x879c
> > -);
> > +   );
> > This is not generated, it would be nice to drop the noise from the patch.
> 
> Mmm. I'm not sure how this was generated, but I'll take care of it.

I still don't understand where the intermediate diff comes
from, but copy-and-pasting from master silenced it...

> > Here is another one:
> > -       $i->{code} = $jis | (
> > -           $jis < 0x100
> > -           ? 0x8e00
> > -           : ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
> > -
> > +#<<< do not let perltidy touch this
> > +       $i->{code} = $jis | ($jis < 0x100 ? 0x8e00:
> > +                            ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
> > +#>>>
> 
> Ok. Will revert this.

The "previous" code (prefixed with a minus sign) is "my"
perltidy's work. Preltidy step is just removed from the patchset.

> >         if (l == 2)
> >         {
> > -           iutf = *utf++ << 8;
> > -           iutf |= *utf++;
> > +           b3 = *utf++;
> > +           b4 = *utf++;
> >         }
> > Ah, OK. This conversion is important so that it performs a minimum of
> > bitwise operations. Yes, let's keep that. That's pretty cool to get a
> > faster operation.
> 
> It is Heikki's work:p
> 
> I'll address them and repost the next version soon.


Finally, the patchset had the following changes from the previous
shape.

- Avoid syntaxes perl 5.8 complains about

- The perltidy step has been removed.

- pg_mb_radix_conv is now a separate .c file (but included from other c files)

- Added Unicode/.gitignore. The line for [~#] might be needless.

- The patchset is made with --irreversible-delete. This is just what I wanted (but couldn't find by myself...)

- The semicolon-fix patch (0001) gets several additional fixes.



This patchset consists of four patches. The first two are bug
fixes back-patchable to older versions. The third one is the
patch that adds the radix tree feature. The fourth one is not attached
to this mail but is generatable.

0001-Add-missing-semicolon.patch
 Adds missing semicolons found in three files.

0002-Correct-reference-resolution-syntax.patch
 Changes the reference syntax to a more preferable style.

0003-Use-radix-tree-for-character-conversion.patch
 Radix tree conversion patch. The size has been reduced from 1.6MB to 91KB.

0004: Replace map files
 This is not attached but generatable. This shouldn't fail even in an
 environment with Perl 5.8.

 [$(TOP)]$ ./configure
 [$(TOP)]$ cd src/backend/utils/mb/Unicode
 [Unicode]$ make distclean maintainer-clean all
 [Unicode]$ make mapcheck
 ...
 All radix trees are perfect!
 [Unicode]$ make distclean
 [Unicode]$ git add .
 [Unicode]$ git commit

 The size of the fourth patch was about 7.6MB using --irreversible-delete.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] Radix tree for character conversion

From
Michael Paquier
Date:
On Mon, Jan 30, 2017 at 3:37 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello, this is the revised version of character conversion using radix tree.

Thanks for the new version, I'll look at it once I am done with the
cleanup of the current CF. For now I have moved it to the CF 2017-03.
-- 
Michael



Re: [HACKERS] Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
At Tue, 31 Jan 2017 12:25:46 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqSGWmHwZqaOA--EHCveG9m77pjWRzxZ6B5iUSJ7GKz-4w@mail.gmail.com>
> On Mon, Jan 30, 2017 at 3:37 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Hello, this is the revised version of character conversion using radix tree.
> 
> Thanks for the new version, I'll look at it once I am done with the
> cleanup of the current CF. For now I have moved it to the CF 2017-03.

Agreed. Thank you.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Thanks to Heikki having pushed the first two patches and a
part of the third, only one patch remains now.

# Sorry for not separating the KOI8 stuff.

At Tue, 31 Jan 2017 19:06:09 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170131.190609.254672218.horiguchi.kyotaro@lab.ntt.co.jp>
> > Thanks for the new version, I'll look at it once I am done with the
> > cleanup of the current CF. For now I have moved it to the CF 2017-03.
> 
> Agreed. Thank you.

Attached is the latest version on the current master (555494d).

Note: since this patch is created by git diff --irreversible-delete,
three files mb/Unicode/*.(txt|xml) that are to be deleted are left alone.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] Radix tree for character conversion

From
Michael Paquier
Date:
On Fri, Feb 3, 2017 at 1:18 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Thanks to Heikki having pushed the first two patches and a
> part of the third, only one patch remains now.
>
> # Sorry for not separating the KOI8 stuff.
>
> At Tue, 31 Jan 2017 19:06:09 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
> <20170131.190609.254672218.horiguchi.kyotaro@lab.ntt.co.jp>
>> > Thanks for the new version, I'll look at it once I am done with the
>> > cleanup of the current CF. For now I have moved it to the CF 2017-03.
>>
>> Agreed. Thank you.
>
> Attached is the latest version on the current master (555494d).
>
> Note: since this patch is created by git diff --irreversible-delete,
> three files mb/Unicode/*.(txt|xml) that are to be deleted are left alone.

Thanks for the rebase. I have been spending some time looking at this
patch. The new stuff in convutils.pm is by far the most interesting part
of the patch, where the building of the radix trees using a byte
structure looks to be in pretty good shape after eyeballing the logic
for a couple of hours.

+# ignore backup files of editors
+/*[~#]
+
This does not belong to Postgres core code. You could always set up
that in a global exclude file with core.excludesFile.

In order to conduct sanity checks on the shape of the radix tree maps
compared to the existing maps, having map_checker surely makes sense.
Now in the final result I don't think we need it. The existing map
files ought to be replaced by their radix versions at the end, and
map_checker should be removed. This leads to a couple of
simplifications, like Makefile, and reduces the maintenance to one
mechanism.

+sub print_radix_trees
+{
+   my ($this_script, $csname, $charset) = @_;
+
+   &print_radix_map($this_script, $csname, "from_unicode", $charset, 78);
+   &print_radix_map($this_script, $csname, "to_unicode",   $charset, 78);
+}
There is no need for the table width to be defined as a variable (5th
argument). Similarly, to_unicode/from_unicode require checks in
multiple places, this could be just a simple boolean flag. Or if you
want to go to the road of non-simple things, you could have two
arguments: an origin and a target. If one is UTF8 the other is the
mapping name.

+sub dump_charset
+{
+   my ($list, $filt) = @_;
+
+   foreach my $i (@$list)
+   {
+       next if (defined $filt && !&$filt($i));
+       if (!defined $i->{ucs}) { $i->{ucs} = &utf2ucs($i->{utf8}); }
+       printf "ucs=%x, code=%x, direction=%s %s:%d %s\n",
+         $i->{ucs}, $i->{code}, $i->{direction},
+         $i->{f},   $i->{l},    $i->{comment};
+   }
+}
This is used nowhere. Perhaps it was useful for debugging at some point?

+# make_charmap - convert charset table to charmap hash
+#     with checking duplicate source code
Maybe this should be "with checking of duplicated source codes".

+# print_radix_map($this_script, $csname, $direction, \%charset, $tblwidth)
+#
+# this_script - the name of the *caller script* of this feature
$this_script is not needed at all, you could just use basename($0) and
reduce the number of arguments of the different functions of the
stack.
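
Something along these lines should do (a sketch; File::Basename is a
core module):

    use File::Basename;

    # No need to thread $this_script through the call stack:
    my $this_script = basename($0);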

+   ### amount of zeros that can be ovarlaid.
s/ovarlaid/overlaid.

+# make_mapchecker.pl - Gerates map_checker.h file included by map_checker.c
s/gerates/generates/

+           if (s < 0x80)
+           {
+               fprintf(stderr, "\nASCII character ? (%x)", s);
+               exit(1);
+           }
Most likely a newline at the end of the error string is better here.

+           $charmap{ ucs2utf($src) } = $dst;
+       }
+
+   }
Unnecessary newline here.
-- 
Michael



Re: [HACKERS] Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Thank you for the comment.

At Wed, 22 Feb 2017 16:06:14 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqRTQ+7ZjxuPTbsr18MXvW7mTd29mN+91N7AG8fe5aCeAA@mail.gmail.com>
> Thanks for the rebase. I have been spending some time looking at this
> patch. The new stuff in convutils.pm is by far the most interesting part
> of the patch, where the building of the radix trees using a byte
> structure looks to be in pretty good shape after eyeballing the logic
> for a couple of hours.
> 
> +# ignore backup files of editors
> +/*[~#]
> +
> This does not belong to Postgres core code. You could always set up
> that in a global exclude file with core.excludesFile.

Thank you for letting me know about it. I removed that.

> In order to conduct sanity checks on the shape of the radix tree maps
> compared to the existing maps, having map_checker surely makes sense.
> Now in the final result I don't think we need it. The existing map
> files ought to be replaced by their radix versions at the end, and
> map_checker should be removed. This leads to a couple of
> simplifications, like Makefile, and reduces the maintenance to one
> mechanism.

Hmm.. Though I don't remember clearly what the radix map of the
first version looked like, the current radix map seems
human-readable to me. It might be due to practice or to the additional
comments in the map files. Anyway, I removed all of the stuff so as
not to generate the plain maps. But I didn't change the names of
the _radix.map files and just commented out the line that outputs the
plain maps in UCS_to_*.pl. Combined maps are still in the plain format,
so print_tables was changed to take character tables separately
for regular (non-combined) characters and combined characters.

> +sub print_radix_trees
> +{
> +   my ($this_script, $csname, $charset) = @_;
> +
> +   &print_radix_map($this_script, $csname, "from_unicode", $charset, 78);
> +   &print_radix_map($this_script, $csname, "to_unicode",   $charset, 78);
> +}
> There is no need for the table width to be defined as a variable (5th
> argument).

The table width was already useless.. Removed.

> Similarly, to_unicode/from_unicode require checks in
> multiple places, this could be just a simple boolean flag.

The direction is a tristate (to/from/both) variable, so it cannot be
replaced with a boolean. But I agree that comparing with a free-form
string is not so good. This is code already committed in master,
but it is changed in the attached patch.

# Perhaps it is easier to read in string form..

> Or if you want to go to the road of non-simple things, you
> could have two arguments: an origin and a target. If one is
> UTF8 the other is the mapping name.

Mmmm. It seems (even to me) to do more harm than good.  I can
think of two alternatives for this.

- Split the property {direction} into two boolean properties {to_unicode} and {from_unicode}.

- Make the {direction} property an integer and compare it with defined
  constants $BOTH, $TO_UNICODE and $FROM_UNICODE using the '=' operator.

I chose the former in this patch.
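
For the record, the chosen shape looks roughly like this (illustrative
entries only, not the real charset data):

    my @charset = (
        { ucs => 0x00a1, code => 0xa1a1, to_unicode => 1, from_unicode => 1 },
        { ucs => 0x00a4, code => 0xa1a7, to_unicode => 1, from_unicode => 0 },
    );
    my @to_unicode   = grep { $_->{to_unicode} }   @charset;
    my @from_unicode = grep { $_->{from_unicode} } @charset;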

> +sub dump_charset
> +{
> +   my ($list, $filt) = @_;
> +
> +   foreach my $i (@$list)
> +   {
> +       next if (defined $filt && !&$filt($i));
> +       if (!defined $i->{ucs}) { $i->{ucs} = &utf2ucs($i->{utf8}); }
> +       printf "ucs=%x, code=%x, direction=%s %s:%d %s\n",
> +         $i->{ucs}, $i->{code}, $i->{direction},
> +         $i->{f},   $i->{l},    $i->{comment};
> +   }
> +}
> This is used nowhere. Perhaps it was useful for debugging at some point?

Yes, it was quite useful. Removed.

> +# make_charmap - convert charset table to charmap hash
> +#     with checking duplicate source code
> Maybe this should be "with checking of duplicated source codes".

Even though I'm not a good English writer, 'duplicated codes' reads
as multiple copies of the original 'code' (to me, of course). And
'checking' is a (pure) verbal noun (that is, not a deverbal noun), so
'of' is not required. But, of course, I'm not sure which sounds more
natural as English.

This comment is not changed.

> +# print_radix_map($this_script, $csname, $direction, \%charset, $tblwidth)
> +#
> +# this_script - the name of the *caller script* of this feature
> $this_script is not needed at all, you could just use basename($0) and
> reduce the number of arguments of the different functions of the
> stack.

I avoided relying on global stuff by that, but I accept the
suggestion. Fixed in this version.

> +   ### amount of zeros that can be ovarlaid.
> s/ovarlaid/overlaid.
> 
> +# make_mapchecker.pl - Gerates map_checker.h file included by map_checker.c
> s/gerates/generates/

make_mapchecker.pl is a tool only for map_checker.c, so this file
is removed.

> +           if (s < 0x80)
> +           {
> +               fprintf(stderr, "\nASCII character ? (%x)", s);
> +               exit(1);
> +           }
> Most likely a newline at the end of the error string is better here.

map_checker.c is removed.

> +           $charmap{ ucs2utf($src) } = $dst;
> +       }
> +
> +   }
> Unnecessary newline here.

removed in convutils.pm.

Since the Makefile ignores the old .map files, the steps to generate
a patch for the map files have changed a bit.

$ rm *.map
$ make distclean maintainer-clean all
$ make distclean
$ git add .
$ git commit

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] Radix tree for character conversion

From
Robert Haas
Date:
On Mon, Feb 27, 2017 at 2:07 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> +# make_charmap - convert charset table to charmap hash
>> +#     with checking duplicate source code
>> Maybe this should be "with checking of duplicated source codes".
>
> Even though I'm not a good English writer, 'duplicated codes' reads
> as multiple copies of the original 'code' (to me, of course). And
> 'checking' is a (pure) verbal noun (that is, not a deverbal noun), so
> 'of' is not required. But, of course, I'm not sure which sounds more
> natural as English

The problem is that, because "checking" is a noun in this sentence, it
can't be followed by a direct object so you need "of" to connect
"checking" with the thing that is being checked.  However, what I
would do is rearrange this sentence slightly so as to use "checking" as a
verb, like this:

convert charset table to charmap hash, checking for duplicate source
codes along the way

While I don't think Michael's suggestion is wrong, I find the above a
little more natural.
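
For illustration, the check being talked about amounts to something
like this (a sketch with assumed field names, not the actual
make_charmap code):

    # Convert charset table to charmap hash, checking for duplicate
    # source codes along the way.
    sub make_charmap_sketch
    {
        my ($charset) = @_;
        my %charmap;
        foreach my $c (@$charset)
        {
            die sprintf("duplicate source code: %x\n", $c->{code})
              if exists $charmap{ $c->{code} };
            $charmap{ $c->{code} } = $c->{ucs};
        }
        return \%charmap;
    }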

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Hello,

At Tue, 28 Feb 2017 08:00:22 +0530, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoYheGx+knBQAMBm+nr8Cnr7e2RZm1BEwgK5AGMx4MKv9A@mail.gmail.com>
> On Mon, Feb 27, 2017 at 2:07 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >> +# make_charmap - convert charset table to charmap hash
> >> +#     with checking duplicate source code
> >> Maybe this should be "with checking of duplicated source codes".
> >
> > Even though I'm not a good English writer, 'duplicated codes' reads
> > as multiple copies of the original 'code' (to me, of course). And
> > 'checking' is a (pure) verbal noun (that is, not a deverbal noun), so
> > 'of' is not required. But, of course, I'm not sure which sounds more
> > natural as English
> 
> The problem is that, because "checking" is a noun in this sentence, it
> can't be followed by a direct object so you need "of" to connect
> "checking" with the thing that is being checked.  However, what I
> would do is rearrange this sentence slightly so as to use "checking" as a
> verb, like this:
> 
> convert charset table to charmap hash, checking for duplicate source
> codes along the way
> 
> While I don't think Michael's suggestion is wrong, I find the above a
> little more natural.

Thank you for the suggestion and explanation. I learned that,
maybe.  I'll send the version with the revised comment.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Radix tree for character conversion

From
Michael Paquier
Date:
On Mon, Feb 27, 2017 at 5:37 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Wed, 22 Feb 2017 16:06:14 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
> <CAB7nPqRTQ+7ZjxuPTbsr18MXvW7mTd29mN+91N7AG8fe5aCeAA@mail.gmail.com>
>> In order to conduct sanity checks on the shape of the radix tree maps
>> compared to the existing maps, having map_checker surely makes sense.
>> Now in the final result I don't think we need it. The existing map
>> files ought to be replaced by their radix versions at the end, and
>> map_checker should be removed. This leads to a couple of
>> simplifications, like Makefile, and reduces the maintenance to one
>> mechanism.
>
> Hmm.. Though I don't remember clearly what the radix map of the
> first version looked like, the current radix map seems
> human-readable to me. It might be due to practice or to the additional
> comments in the map files. Anyway, I removed all of the stuff so as
> not to generate the plain maps. But I didn't change the names of
> the _radix.map files and just commented out the line that outputs the
> plain maps in UCS_to_*.pl. Combined maps are still in the plain format,
> so print_tables was changed to take character tables separately
> for regular (non-combined) characters and combined characters.

Do others have thoughts to offer on the matter? I would think that the
new radix maps should just replace the old plain ones, and that the
only way to build the maps going forward is to use the new methods.
The radix trees are the only thing used in the backend code as well
(conv.c). We could keep the way to build the old maps, with the
map_checker, in a module out of the core code. FWIW, I am fine with
adding the old APIs to my plugin repository on github and having the
sanity checks in that as well. And of course also publishing on this
thread a module to do that.

>> Or if you want to go to the road of non-simple things, you
>> could have two arguments: an origin and a target. If one is
>> UTF8 the other is the mapping name.
>
> Mmmm. It seems (even to me) to do more harm than good.  I can
> think of two alternatives for this.
>
> - Split the property {direction} into two boolean properties
>   {to_unicode} and {from_unicode}.
>
> > - Make the {direction} property an integer and compare it with
>   defined constants $BOTH, $TO_UNICODE and $FROM_UNICODE using
>   the '=' operator.
>
> > I chose the former in this patch.

Fine for me.

>> +           $charmap{ ucs2utf($src) } = $dst;
>> +       }
>> +
>> +   }
>> Unnecessary newline here.
>
> removed in convutils.pm.
>
> Since the Makefile ignores the old .map files, the steps to generate
> a patch for the map files have changed a bit.
>
> $ rm *.map
> $ make distclean maintainer-clean all
> $ make distclean
> $ git add .
> $ git commit

+# ignore generated files
+/map_checker
+/map_checker.h
[...]
+map_checker.h: make_mapchecker.pl $(MAPS) $(RADIXMAPS)
+   $(PERL) $<
+
+map_checker.o: map_checker.c map_checker.h ../char_converter.c
+
+map_checker: map_checker.o
With map_checker out of the game, those things are not needed.

+++ b/src/backend/utils/mb/char_converter.c
@@ -0,0 +1,116 @@
+/*-------------------------------------------------------------------------
+ *
+ *   Character converter function using radix tree
In the simplified version of the patch, pg_mb_radix_conv() being only
needed in conv.c, I think that this could just be a static local
routine.

-#include "../../Unicode/utf8_to_koi8r.map"
-#include "../../Unicode/koi8r_to_utf8.map"
-#include "../../Unicode/utf8_to_koi8u.map"
-#include "../../Unicode/koi8u_to_utf8.map"
+#include "../../Unicode/utf8_to_koi8r_radix.map"
+#include "../../Unicode/koi8r_to_utf8_radix.map"
+#include "../../Unicode/utf8_to_koi8u_radix.map"
+#include "../../Unicode/koi8u_to_utf8_radix.map"
FWIW, I am fine to use those new names as include points.

-distclean: clean
+distclean:
    rm -f $(TEXTS)
-maintainer-clean: distclean
+# maintainer-clean intentionally leaves $(TEXTS)
+maintainer-clean:
Why is that? There is also a useless diff down that code block.

+conv.o: conv.c char_converter.c
This also can go away.

-print_tables("EUC_JIS_2004", \@all, 1);
+# print_tables("EUC_JIS_2004", \@regular, undef, 1);
+print_radix_trees("EUC_JIS_2004", \@regular);
+print_tables("EUC_JIS_2004", undef, \@combined, 1);
[...]
 sub print_tables
 {
-   my ($charset, $table, $verbose) = @_;
+   my ($charset, $regular, $combined, $verbose) = @_;
print_tables is only used for combined maps, you could remove $regular
from it and just keep $combined around, perhaps renaming print_tables
to print_combined_maps?
-- 
Michael



Re: [HACKERS] Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
At Tue, 28 Feb 2017 15:20:06 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqR49krGP6qaaKaL2v3HCnn+dnzv8Dq_ySGbDSr6b_ywrw@mail.gmail.com>
> On Mon, Feb 27, 2017 at 5:37 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > At Wed, 22 Feb 2017 16:06:14 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
> > <CAB7nPqRTQ+7ZjxuPTbsr18MXvW7mTd29mN+91N7AG8fe5aCeAA@mail.gmail.com>
> >> In order to conduct sanity checks on the shape of the radix tree maps
> >> compared to the existing maps, having map_checker surely makes sense.
> >> Now in the final result I don't think we need it. The existing map
> >> files ought to be replaced by their radix versions at the end, and
> >> map_checker should be removed. This leads to a couple of
> >> simplifications, like Makefile, and reduces the maintenance to one
> >> mechanism.
> >
> > Hmm.. Though I don't remember clearly what the radix map of the
> > first version looked like, the current radix map seems
> > human-readable to me. It might be due to practice or to the additional
> > comments in the map files. Anyway, I removed all of the stuff so as
> > not to generate the plain maps. But I didn't change the names of
> > the _radix.map files and just commented out the line that outputs the
> > plain maps in UCS_to_*.pl. Combined maps are still in the plain format,
> > so print_tables was changed to take character tables separately
> > for regular (non-combined) characters and combined characters.
> 
> Do others have thoughts to offer on the matter? I would think that the
> new radix maps should just replace the old plain ones, and that the
> only way to build the maps going forward is to use the new methods.
> The radix trees are the only thing used in the backend code as well
> (conv.c). We could keep the way to build the old maps, with the
> map_checker, in a module out of the core code. FWIW, I am fine with
> adding the old APIs to my plugin repository on github and having the
> sanity checks in that as well. And of course also publishing on this
> thread a module to do that.

I couldn't make up my mind to move to the radix tree completely, but
UtfToLocal/LocalToUtf no longer handle the "plain map"s for
non-combined characters, so they have lost their playground. Okay,
I think I removed all traces of the plain map era.

Every character in a mapping has a comment that describes what
the character is or where it is defined. This information is no
longer useful (the radix map doesn't have a place to show it), but
I left it for debug use. (This might just be justification...)

> > - Split the property {direction} into two boolean properties
> >   {to_unicode} and {from_unicode}.
> >
> > - Make the {direction} property an integer and compare it with
> >   defined constants $BOTH, $TO_UNICODE and $FROM_UNICODE using
> >   the '=' operator.
> >
> > I chose the former in this patch.
> 
> Fine for me.

Thanks.

> >> +           $charmap{ ucs2utf($src) } = $dst;
> >> +       }
> >> +
> >> +   }
> >> Unnecessary newline here.
> >
> > removed in convutils.pm.
> >
> > Since the Makefile ignores the old .map files, the steps to generate
> > a patch for the map files have changed a bit.
> >
> > $ rm *.map
> > $ make distclean maintainer-clean all
> > $ make distclean
> > $ git add .
> > $ git commit
> 
> +# ignore generated files
> +/map_checker
> +/map_checker.h
> [...]
> +map_checker.h: make_mapchecker.pl $(MAPS) $(RADIXMAPS)
> +   $(PERL) $<
> +
> +map_checker.o: map_checker.c map_checker.h ../char_converter.c
> +
> +map_checker: map_checker.o
> With map_checker out of the game, those things are not needed.

Ouch! Thanks for pointing it out. Removed.

> +++ b/src/backend/utils/mb/char_converter.c
> @@ -0,0 +1,116 @@
> +/*-------------------------------------------------------------------------
> + *
> + *   Character converter function using radix tree
> In the simplified version of the patch, pg_mb_radix_conv() being only
> needed in conv.c, I think that this could just be a static local
> routine.
> 
> -#include "../../Unicode/utf8_to_koi8r.map"
> -#include "../../Unicode/koi8r_to_utf8.map"
> -#include "../../Unicode/utf8_to_koi8u.map"
> -#include "../../Unicode/koi8u_to_utf8.map"
> +#include "../../Unicode/utf8_to_koi8r_radix.map"
> +#include "../../Unicode/koi8r_to_utf8_radix.map"
> +#include "../../Unicode/utf8_to_koi8u_radix.map"
> +#include "../../Unicode/koi8u_to_utf8_radix.map"
> FWIW, I am fine to use those new names as include points.
> 
> -distclean: clean
> +distclean:
>     rm -f $(TEXTS)
> -maintainer-clean: distclean
> +# maintainer-clean intentionally leaves $(TEXTS)
> +maintainer-clean:
> Why is that? There is also a useless diff down that code block.

It *was* for convenience, but now it is automatically downloaded,
so such a distinction doesn't offer anything good. Changed it to
remove $(TEXTS).

> +conv.o: conv.c char_converter.c
> This also can go away.

Touching char_converter.c would be ignored if that rule were removed. Did
you mistake it for map_checker?

> -print_tables("EUC_JIS_2004", \@all, 1);
> +# print_tables("EUC_JIS_2004", \@regular, undef, 1);
> +print_radix_trees("EUC_JIS_2004", \@regular);
> +print_tables("EUC_JIS_2004", undef, \@combined, 1);
> [...]
>  sub print_tables
>  {
> -   my ($charset, $table, $verbose) = @_;
> +   my ($charset, $regular, $combined, $verbose) = @_;
> print_tables is only used for combined maps, you could remove $regular
> from it and just keep $combined around, perhaps renaming print_tables
> to print_combined_maps?

Renamed to print_combined_maps.

And the code comment pointed out in the previous mail
is rewritten as Robert suggested.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] Radix tree for character conversion

From
Michael Paquier
Date:
On Tue, Feb 28, 2017 at 5:34 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Tue, 28 Feb 2017 15:20:06 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
> <CAB7nPqR49krGP6qaaKaL2v3HCnn+dnzv8Dq_ySGbDSr6b_ywrw@mail.gmail.com>
>> +conv.o: conv.c char_converter.c
>> This also can go away.
>
> Touching char_converter.c would be ignored if that rule were removed. Did
> you mistake it for map_checker?

That was not what I meant: as pg_mb_radix_conv() is only used in
conv.c, it may be better to just remove char_converter.c completely.

> And the code comment pointed out in the previous mail
> is rewritten as Robert suggested.

Fine for me.

-distclean: clean
+distclean:
    rm -f $(TEXTS)

-maintainer-clean: distclean
-   rm -f $(MAPS)
-
+maintainer-clean:
+   rm -f $(TEXTS) $(MAPS)
Well, I would have assumed that this should not change..

The last version of the patch looks to be in rather good shape to me, and
we are also sure that the old maps and the new maps match, per the
previous runs with map_checker. One thing that still
needs some extra opinions is what to do with the old maps:
1) Just remove them, replacing the old maps by the new radix tree maps.
2) Keep them around in the backend code, even if they are useless.
3) Use a GUC to be able to switch from one to the other, giving a
fallback method in case of emergency.
4) Use an extension module to store the old maps as well as the
previous build code, so that sanity checks can still be performed on the
new maps.

I would vote for 2), to reduce long term maintenance burdens and after
seeing all the sanity checks that have been done in previous versions.
-- 
Michael



Re: [HACKERS] Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
At Wed, 1 Mar 2017 14:34:23 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqQ_4n+FDWi5Xiueo68i=fTmdg1Wx+y6XWWX=8rAhKRtFw@mail.gmail.com>
> On Tue, Feb 28, 2017 at 5:34 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > At Tue, 28 Feb 2017 15:20:06 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
> > <CAB7nPqR49krGP6qaaKaL2v3HCnn+dnzv8Dq_ySGbDSr6b_ywrw@mail.gmail.com>
> >> +conv.o: conv.c char_converter.c
> >> This also can go away.
> >
> > Touching char_converter.c would be ignored if that rule were removed. Did
> > you mistake it for map_checker?
> 
> That was not what I meant: as pg_mb_radix_conv() is only used in
> conv.c, it may be better to just remove char_converter.c completely.

Ouch! You're right. Sorry for my short-sightedness. char_converter.c is
removed, and the related description in the Makefile is removed.

> > And the code comment pointed out in the previous mail
> > is rewritten as Robert suggested.
> 
> Fine for me.
> 
> -distclean: clean
> +distclean:
>     rm -f $(TEXTS)
> 
> -maintainer-clean: distclean
> -   rm -f $(MAPS)
> -
> +maintainer-clean:
> +   rm -f $(TEXTS) $(MAPS)
> Well, I would have assumed that this should not change..

I should have reverted it there, but the patch somehow did a
different thing... Surely it is reverted this time.

> The last version of the patch looks to be in rather good shape to me, and
> we are also sure that the old maps and the new maps match, per the
> previous runs with map_checker.

Agreed.

>  One thing that still
> needs some extra opinions is what to do with the old maps:
> 1) Just remove them, replacing the old maps by the new radix tree maps.
> 2) Keep them around in the backend code, even if they are useless.
> 3) Use a GUC to be able to switch from one to the other, giving a
> fallback method in case of emergency.
> 4) Use an extension module to store the old maps as well as the
> previous build code, so that sanity checks can still be performed on the
> new maps.
> 
> I would vote for 2), to reduce long term maintenance burdens and after
> seeing all the sanity checks that have been done in previous versions.

I don't vote for 3 or 4. And I did 1 in the last patch.

About 2, a change in the authority files happens rarely but is
possible. Even in that case, the plain map files can no longer be
generated (though they can with a small tweak of convutils.pm). If the
radix-tree file generator is under suspicion, the "plain" map
file generator (and the map_checker) or some other means of
sanity checking might be required.

That being said, when something bad occurs in the radix files, we can
find it in the radix file and then find the corresponding lines in
the authority file. The remaining problem is the case where some
substantial change in the authority files doesn't affect the radix
files. We can detect such a mistake by detecting changes in the
authority files. So I propose a 5th option.

5) Just remove the plain map files and all related code. In addition to
   that, the Makefile stores a hash digest of the authority files in
   Unicode/authority_hashes.txt or something like that, which is managed
   by git.

This digest may differ among platforms (typically from CR/NL
differences), but we can assume *nix for this usage.
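
What I have in mind is roughly this (a sketch assuming Digest::SHA,
which is a core module since 5.10, and the downloaded authority file
names):

    use Digest::SHA;

    # One line per authority file, to be compared with the stored digests.
    foreach my $f (sort glob("*.TXT *.xml"))
    {
        print Digest::SHA->new(256)->addfile($f)->hexdigest, "  $f\n";
    }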

I will send the next version after this discussion is settled.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Radix tree for character conversion

From
Michael Paquier
Date:
On Thu, Mar 2, 2017 at 2:20 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> 5) Just remove the plain map files and all related code. In addition to
>    that, the Makefile stores a hash digest of the authority files in
>    Unicode/authority_hashes.txt or something like that, which is managed
>    by git.

That may be an idea to check for differences across upstream versions.
But that sounds like a separate discussion to me.

> This digest may differ among platforms (typically from CR/NL
> differences), but we can assume *nix for this usage.
>
> I will send the next version after this discussion is settled.

Sure. There is not much point in moving on without at least Heikki's
opinion, or that of anybody else like Ishii-san or Tom who is familiar
with this code. I would think that Heikki would be the committer to pick
up this change, though.
-- 
Michael



Re: [HACKERS] Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Hello,

At Fri, 3 Mar 2017 12:53:04 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqSQaLozFNg+5Tf9s1TZs2pcE-GHhnMG31qnsusV9vMUOw@mail.gmail.com>
> On Thu, Mar 2, 2017 at 2:20 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > 5) Just remove the plain map files and all related code. In addition
> >    to that, the Makefile stores a hash digest of the authority files
> >    in Unicode/authority_hashes.txt or something similar that is
> >    managed by git.
> 
> That may be an idea to check for differences across upstream versions.
> But that sounds like a separate discussion to me.

Either way is fine with me.

> > This digest may differ among platforms (typically due to CR/LF
> > differences), but we can assume *nix for this usage.
> >
> > I will send the next version after this discussion is settled.
> 
> Sure. There is not much point in moving on without Heikki's opinion at
> least, or that of anybody else like Ishii-san or Tom who is familiar
> with this code. I would think that Heikki would be the committer to pick up
> this change though.

So, this is the latest version of this patch, in the shape of
option 1.


| needs some extra opinions is what to do with the old maps:
| 1) Just remove them, replacing the old maps by the new radix tree maps.
| 2) Keep them around in the backend code, even if they are useless.
| 3) Use a GUC to be able to switch from one to the other, giving a
| fallback method in case of emergency.
| 4) Use an extension module to store the old maps as well as the
| previous build code, so that sanity checks can still be performed on the
| new maps.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] Radix tree for character conversion

From
Heikki Linnakangas
Date:
On 03/06/2017 10:16 AM, Kyotaro HORIGUCHI wrote:
> At Fri, 3 Mar 2017 12:53:04 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
> <CAB7nPqSQaLozFNg+5Tf9s1TZs2pcE-GHhnMG31qnsusV9vMUOw@mail.gmail.com>
>> On Thu, Mar 2, 2017 at 2:20 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> 5) Just remove the plain map files and all related code. In addition
>>>    to that, the Makefile stores a hash digest of the authority files
>>>    in Unicode/authority_hashes.txt or something similar that is
>>>    managed by git.
>>
>> That may be an idea to check for differences across upstream versions.
>> But that sounds like a separate discussion to me.
>
> Either way is fine with me.

I did some more kibitzing here and there, and committed. Thanks everyone!

I agree the new maps should just replace the old maps altogether, so 
committed that way. I also moved the combined map files to the same .map 
files as the main radix trees. Seems more clear that way to me.

I changed the to/from_unicode properties back to a single direction 
property, with Perl constants BOTH, TO_UNICODE and FROM_UNICODE, per 
your alternative suggestion upthread. Seems more clear to me.

It would be nice to run the map_checker tool one more time, though, to 
verify that the mappings match those from PostgreSQL 9.6. Just to be 
sure, and after that the map checker can go to the dustbin.

- Heikki




Re: [HACKERS] Radix tree for character conversion

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> I did some more kibitzing here and there, and committed. Thanks everyone!

111 files changed, 147742 insertions(+), 367346 deletions(-)

Nice.

> It would be nice to run the map_checker tool one more time, though, to 
> verify that the mappings match those from PostgreSQL 9.6.

+1

> Just to be sure, and after that the map checker can go to the dustbin.

Hm, maybe we should keep it around for the next time somebody has a bright
idea in this area?
        regards, tom lane



Re: [HACKERS] Radix tree for character conversion

From
Heikki Linnakangas
Date:
On 03/13/2017 08:53 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> It would be nice to run the map_checker tool one more time, though, to
>> verify that the mappings match those from PostgreSQL 9.6.
>
> +1
>
>> Just to be sure, and after that the map checker can go to the dustbin.
>
> Hm, maybe we should keep it around for the next time somebody has a bright
> idea in this area?

The map checker compares old-style maps with the new radix maps. The 
next time 'round, we'll need something that compares the radix maps with 
the next great thing. Not sure how easy it would be to adapt.

Hmm. A somewhat different approach might be more suitable for testing 
across versions, though. We could modify the perl scripts slightly to 
print out SQL statements that exercise every mapping. For every 
supported conversion, the SQL script could:

1. create a database in the source encoding.
2. set client_encoding='<target encoding>'
3. SELECT a string that contains every character in the source encoding.

You could then run those SQL statements against old and new server 
version, and verify that you get the same results.
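
For example, a generated script for the UTF8 -> SJIS direction could
look roughly like this (just a sketch to illustrate the idea; the
database name and code points below are placeholders, and the real
script would emit one SELECT per mapped character):

   CREATE DATABASE convtest_utf8 ENCODING 'UTF8' TEMPLATE template0;
   \c convtest_utf8
   SET client_encoding = 'SJIS';
   SELECT E'\u4e00';  -- one SELECT per code point in the source encoding
   SELECT E'\u4e8c';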

- Heikki




Re: [HACKERS] Radix tree for character conversion

From
Michael Paquier
Date:
On Tue, Mar 14, 2017 at 4:07 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 03/13/2017 08:53 PM, Tom Lane wrote:
>> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>>>
>>> It would be nice to run the map_checker tool one more time, though, to
>>> verify that the mappings match those from PostgreSQL 9.6.
>>
>> +1

Nice to log in and see that committed!
-- 
Michael



Re: [HACKERS] Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Thank you for committing this.

At Mon, 13 Mar 2017 21:07:39 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<d5b70078-9f57-0f63-3462-1e564a57739f@iki.fi>
> On 03/13/2017 08:53 PM, Tom Lane wrote:
> > Heikki Linnakangas <hlinnaka@iki.fi> writes:
> >> It would be nice to run the map_checker tool one more time, though, to
> >> verify that the mappings match those from PostgreSQL 9.6.
> >
> > +1
> >
> >> Just to be sure, and after that the map checker can go to the dustbin.
> >
> > Hm, maybe we should keep it around for the next time somebody has a
> > bright
> > idea in this area?
> 
> The map checker compares old-style maps with the new radix maps. The
> next time 'round, we'll need something that compares the radix maps
> with the next great thing. Not sure how easy it would be to adapt.
> 
> Hmm. A somewhat different approach might be more suitable for testing
> across versions, though. We could modify the perl scripts slightly to
> print out SQL statements that exercise every mapping. For every
> supported conversion, the SQL script could:
> 
> 1. create a database in the source encoding.
> 2. set client_encoding='<target encoding>'
> 3. SELECT a string that contains every character in the source
> encoding.

There are many encodings that can be a client encoding but cannot
be a database encoding. And some encodings, such as UTF-8, have
several one-way conversions. If we do something like this, it
would look like the following.

1. Encoding test
1-1. create a database in UTF-8
1-2. set client_encoding='<source encoding>'
1-3. INSERT all characters defined in the source encoding.
1-4. set client_encoding='UTF-8'
1-5. SELECT a string that contains every character in UTF-8.
2. Decoding test

.... sucks!


I would like to use the convert() function. It can be a large
PL/pgSQL function or a series of "SELECT convert(...)"s. The
latter is doable on the fly (by not generating/storing the whole
script).

| -- Test for SJIS->UTF-8 conversion
| ...
| SELECT convert('\x0000', 'SJIS', 'UTF-8'); -- results in error
| ...
| SELECT convert('\x897e', 'SJIS', 'UTF-8');
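
The PL/pgSQL variant could be a simple loop along these lines (only a
rough sketch: the code range below is a placeholder, and the list of
caught errors is probably incomplete):

| DO $$
| DECLARE
|   code int;
| BEGIN
|   -- walk a candidate range of two-byte SJIS codes
|   FOR code IN x'8140'::int .. x'897e'::int LOOP
|     BEGIN
|       RAISE NOTICE '% -> %', to_hex(code),
|         convert(decode(lpad(to_hex(code), 4, '0'), 'hex'),
|                 'SJIS', 'UTF-8');
|     EXCEPTION WHEN untranslatable_character
|             OR character_not_in_repertoire THEN
|       NULL;  -- skip codes that are not valid characters
|     END;
|   END LOOP;
| END;
| $$;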

> You could then run those SQL statements against old and new server
> version, and verify that you get the same results.

Including the result files in the repository would make this easy
but would bloat it unacceptably. Put mb/Unicode/README.sanity_check?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Radix tree for character conversion

From
Heikki Linnakangas
Date:
On 03/17/2017 07:19 AM, Kyotaro HORIGUCHI wrote:
> At Mon, 13 Mar 2017 21:07:39 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
> <d5b70078-9f57-0f63-3462-1e564a57739f@iki.fi>
>> Hmm. A somewhat different approach might be more suitable for testing
>> across versions, though. We could modify the perl scripts slightly to
>> print out SQL statements that exercise every mapping. For every
>> supported conversion, the SQL script could:
>>
>> 1. create a database in the source encoding.
>> 2. set client_encoding='<target encoding>'
>> 3. SELECT a string that contains every character in the source
>> encoding.
>
> There are many encodings that can be a client encoding but cannot
> be a database encoding.

Good point.

> I would like to use the convert() function. It can be a large
> PL/pgSQL function or a series of "SELECT convert(...)"s. The
> latter is doable on the fly (by not generating/storing the whole
> script).
>
> | -- Test for SJIS->UTF-8 conversion
> | ...
> | SELECT convert('\x0000', 'SJIS', 'UTF-8'); -- results in error
> | ...
> | SELECT convert('\x897e', 'SJIS', 'UTF-8');

Makes sense.

>> You could then run those SQL statements against old and new server
>> version, and verify that you get the same results.
>
> Including the result files in the repository would make this easy
> but would bloat it unacceptably. Put mb/Unicode/README.sanity_check?

Yeah, a README with instructions on how to do it sounds good. No need to
include the results in the repository; you can run the script against an
older version when you need something to compare with.

- Heikki




Re: [HACKERS] Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Hello,

At Fri, 17 Mar 2017 13:03:35 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<01efd334-b839-0450-1b63-f2dea9326a7e@iki.fi>
> On 03/17/2017 07:19 AM, Kyotaro HORIGUCHI wrote:
> > I would like to use the convert() function. It can be a large
> > PL/pgSQL function or a series of "SELECT convert(...)"s. The
> > latter is doable on the fly (by not generating/storing the whole
> > script).
> >
> > | -- Test for SJIS->UTF-8 conversion
> > | ...
> > | SELECT convert('\x0000', 'SJIS', 'UTF-8'); -- results in error
> > | ...
> > | SELECT convert('\x897e', 'SJIS', 'UTF-8');
> 
> Makes sense.
> 
> >> You could then run those SQL statements against old and new server
> >> version, and verify that you get the same results.
> >
> > Including the result files in the repository would make this easy
> > but would bloat it unacceptably. Put mb/Unicode/README.sanity_check?
> 
> Yeah, a README with instructions on how to do it sounds good. No need to
> include the results in the repository; you can run the script against
> an older version when you need something to compare with.

Ok, I'll write a small script to generate a set of "conversion
dumps" and try to write README.sanity_check describing how to use
it.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
At Tue, 21 Mar 2017 13:10:48 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170321.131048.150321071.horiguchi.kyotaro@lab.ntt.co.jp>
> At Fri, 17 Mar 2017 13:03:35 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
> <01efd334-b839-0450-1b63-f2dea9326a7e@iki.fi>
> > On 03/17/2017 07:19 AM, Kyotaro HORIGUCHI wrote:
> > > I would like to use the convert() function. It can be a large
> > > PL/pgSQL function or a series of "SELECT convert(...)"s. The
> > > latter is doable on the fly (by not generating/storing the whole
> > > script).
> > >
> > > | -- Test for SJIS->UTF-8 conversion
> > > | ...
> > > | SELECT convert('\x0000', 'SJIS', 'UTF-8'); -- results in error
> > > | ...
> > > | SELECT convert('\x897e', 'SJIS', 'UTF-8');
> > 
> > Makes sense.
> > 
> > >> You could then run those SQL statements against old and new server
> > >> version, and verify that you get the same results.
> > >
> > > Including the result files in the repository would make this easy
> > > but would bloat it unacceptably. Put mb/Unicode/README.sanity_check?
> > 
> > Yeah, a README with instructions on how to do it sounds good. No need to
> > include the results in the repository; you can run the script against
> > an older version when you need something to compare with.
> 
> Ok, I'll write a small script to generate a set of "conversion
> dumps" and try to write README.sanity_check describing how to use
> it.

I found that there's no way to identify the character domain of a
conversion through the SQL interface. Unconditionally feeding everything
from 0 to 0xffffffff as a bytea string yields a bloated result containing
many bogus lines.  (If \x40 is a character, convert() also
accepts \x4040, \x404040 and \x40404040.)
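
For instance (a minimal illustration; results shown as comments):

| SELECT convert('\x40', 'SJIS', 'UTF-8');       -- \x40 ('@')
| SELECT convert('\x4040', 'SJIS', 'UTF-8');     -- \x4040, also accepted
| SELECT convert('\x40404040', 'SJIS', 'UTF-8'); -- \x40404040, likewise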

One more annoyance is the fact that mappings and conversion
procedures are not in one-to-one correspondence. The
correspondence is hidden in the conversion_procs/*.c files, so we
would have to extract it from them or provide it as built-in
knowledge. Neither seems good.

Finally, it seems that I have no choice but to resurrect
map_checker. Exactly the same tool no longer works, but a
map_dumper.c with almost the same structure will work.

If no one objects to adding map_dumper.c and
gen_mapdumper_header.pl (tentative names, of course), I'll make a
patch to do that.

Any suggestions?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Radix tree for character conversion

From
Kyotaro HORIGUCHI
Date:
Hmm, things are a bit different.

At Thu, 23 Mar 2017 12:13:07 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170323.121307.241436413.horiguchi.kyotaro@lab.ntt.co.jp>
> > Ok, I'll write a small script to generate a set of "conversion
> > dumps" and try to write README.sanity_check describing how to use
> > it.
> 
> I found that there's no way to identify the character domain of a
> conversion through the SQL interface. Unconditionally feeding everything
> from 0 to 0xffffffff as a bytea string yields a bloated result containing
> many bogus lines.  (If \x40 is a character, convert() also
> accepts \x4040, \x404040 and \x40404040.)
> 
> One more annoyance is the fact that mappings and conversion
> procedures are not in one-to-one correspondence. The
> correspondence is hidden in the conversion_procs/*.c files, so we
> would have to extract it from them or provide it as built-in
> knowledge. Neither seems good.
> 
> Finally, it seems that I have no choice but to resurrect
> map_checker. Exactly the same tool no longer works, but a
> map_dumper.c with almost the same structure will work.
> 
> If no one objects to adding map_dumper.c and
> gen_mapdumper_header.pl (tentative names, of course), I'll make a
> patch to do that.

The script or executable should be compatible across versions,
but pg_mb_radix_conv is not. On the other hand, the higher-level
API requires server-side infrastructure.

Finally I made an extension that dumps encoding conversions.

encoding_dumper('SJIS', 'UTF-8') or encoding_dumper(35, 6)

Then it returns output like the following, consisting of two BYTEAs.

 srccode | dstcode
---------+----------
 \x01    | \x01
 \x02    | \x02
 ...
 \xfc4a  | \xe9b899
 \xfc4b  | \xe9bb91
(7914 rows)

This returns in a very short time, but not when srccode
extends to 4 bytes. As an extreme example, the following

> =# select * from encoding_dumper('UTF-8', 'LATIN1');

takes over 2 minutes to return only 255 rows. We cannot determine
the exact domain without looking into the map data, so the function
can do nothing but loop through all the four-byte values.
Providing a function that gives the domain for a conversion was a
mess, especially for arithmetic conversions. The following query
took 94 minutes to produce 25M lines/125MB.  In short, that's
crap. (the first attached)

SELECT x.conname, y.srccode, y.dstcode
FROM
  (SELECT conname, conforencoding, contoencoding
   FROM pg_conversion c
   WHERE pg_char_to_encoding('UTF-8') IN (c.conforencoding, c.contoencoding)
     AND pg_char_to_encoding('SQL_ASCII')
         NOT IN (c.conforencoding, c.contoencoding)) AS x,
  LATERAL
  (SELECT srccode, dstcode
   FROM encoding_dumper(x.conforencoding, x.contoencoding)) AS y
ORDER BY x.conforencoding, x.contoencoding, y.srccode;


As another approach, I added a way to generate plain mapping
lists corresponding to the .map files (similar to the old maps but
simpler), and this finishes the work within a second.

$ make mapdumps

If we are not going to change the framework of map-based character
conversion any time soon, the dumper program may be useful, but I'm
not sure it is reasonable as a sanity check for future modifications.
In the PoC, pg_mb_radix_tree() is copied into map_checker.c, but this
needs to be a separate file again.  (the second attached)


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center