Re: Radix tree for character conversion - Mailing list pgsql-hackers

From	Kyotaro HORIGUCHI
Subject	Re: Radix tree for character conversion
Date	March 27, 2017 13:05:43
Msg-id	20170327.190543.137530765.horiguchi.kyotaro@lab.ntt.co.jp
In response to	Re: [HACKERS] Radix tree for character conversion (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
List	pgsql-hackers

Hmm, things are a bit different.

At Thu, 23 Mar 2017 12:13:07 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170323.121307.241436413.horiguchi.kyotaro@lab.ntt.co.jp>
> > Ok, I'll write a small script to generate a set of "conversion
> > dump" and try to write README.sanity_check describing how to use
> > it.
> 
> I found that there's no way to identify the character domain of a
> conversion through the SQL interface. Unconditionally feeding every
> value from 0 to 0xffffffff as a bytea string yields a heavily bloated
> result containing many bogus lines.  (If \x40 is a character,
> convert() also accepts \x4040, \x404040 and \x40404040.)
> 
> One more annoyance is the fact that mappings and conversion
> procedures are not in one-to-one correspondence. The
> correspondence is hidden in conversion_procs/*.c files, so we
> would have to extract it from them or supply it as prior
> knowledge. Neither seems good.
> 
> In the end, it seems that I have no choice but to resurrect
> map_checker. Exactly the same one no longer works, but a
> map_dumper.c with almost the same structure will.
> 
> If no one objects to adding map_dumper.c and
> gen_mapdumper_header.pl (tentative names, of course), I'll make a
> patch to do that.

The script or executable should be compatible between versions,
but pg_mb_radix_conv is not. On the other hand, a higher-level
API requires server stuff.

In the end I made an extension that dumps an encoding conversion.

encoding_dumper('SJIS', 'UTF-8') or encoding_dumper(35, 6)

It returns output like the following, consisting of two BYTEA columns.

 srccode | dstcode
---------+----------
 \x01    | \x01
 \x02    | \x02
...
 \xfc4a  | \xe9b899
 \xfc4b  | \xe9bb91
(7914 rows)

This returns in a very short time, but not when srccode
extends to 4 bytes. As an extreme example, the following

> =# select * from encoding_dumper('UTF-8', 'LATIN1');

takes over 2 minutes to return only 255 rows. We cannot determine
the exact domain without looking into the map data, so the function
can do nothing other than loop through all the four-byte values.
Providing a function that gives the domain for a conversion turned
out to be a mess, especially for arithmetic conversions. The
following query took 94 minutes to produce 25M lines/125MB.  In
short, that's crap. (the first attachment)

SELECT x.conname, y.srccode, y.dstcode
FROM (
    SELECT conname, conforencoding, contoencoding
    FROM pg_conversion c
    WHERE pg_char_to_encoding('UTF-8') IN (c.conforencoding, c.contoencoding)
      AND pg_char_to_encoding('SQL_ASCII') NOT IN (c.conforencoding, c.contoencoding)
) AS x,
LATERAL (
    SELECT srccode, dstcode
    FROM encoding_dumper(x.conforencoding, x.contoencoding)
) AS y
ORDER BY x.conforencoding, x.contoencoding, y.srccode;
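
To make the cost of the brute-force approach concrete, here is a
minimal sketch in Python rather than the extension's C. It uses
Python's built-in codecs as a stand-in for the PostgreSQL conversion
procedures, so the mappings are not guaranteed to match the .map
files. The names int_to_bytes and dump_conversion are hypothetical
and are not part of the attached patches. Like encoding_dumper, the
sketch knows nothing about the source domain: it tries every integer
below a limit and drops candidates that decode to more than one
character (the \x4040 case mentioned earlier).

```python
# Sketch only: Python codecs stand in for PostgreSQL's conversion procedures.

def int_to_bytes(value: int) -> bytes:
    """Render an integer as its minimal big-endian byte string."""
    length = max(1, (value.bit_length() + 7) // 8)
    return value.to_bytes(length, "big")


def dump_conversion(src: str, dst: str, limit: int):
    """Try every value in [1, limit) and keep single-character mappings."""
    rows = []
    # Start at 1: PostgreSQL text cannot contain a NUL byte.
    for value in range(1, limit):
        raw = int_to_bytes(value)
        try:
            text = raw.decode(src)
            if len(text) != 1:
                # b'\x40\x40' decodes fine, but as two characters: a bogus line.
                continue
            rows.append((raw, text.encode(dst)))
        except (UnicodeDecodeError, UnicodeEncodeError):
            # Not a valid source sequence, or no counterpart in the target.
            continue
    return rows


if __name__ == "__main__":
    rows = dump_conversion("utf-8", "latin-1", 0x10000)
    print(len(rows), "mappings found in", 0x10000 - 1, "candidates")
```

Under these assumptions, a limit of 0x10000 already finds every
UTF-8 to LATIN1 mapping; raising the limit to 0x100000000 multiplies
the number of candidates by 65536 without adding a row, which is the
shape of the problem described above.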


As another approach, I added a make target that generates plain
mapping lists corresponding to the .map files (similar to the old
maps but simpler), and this finishes the work within a second.

$ make mapdumps

If we are not going to change the framework of mapped character
conversion any time soon, the dumper program may be useful, but I'm
not sure it is reasonable as a sanity check for future
modifications.  In the PoC, pg_mb_radix_tree() is copied into
map_checker.c, but it needs to go back into a separate file.  (the
second attachment)


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
