Re: Radix tree for character conversion - Mailing list pgsql-hackers
From | Kyotaro HORIGUCHI |
---|---|
Subject | Re: Radix tree for character conversion |
Date | |
Msg-id | 20170327.190543.137530765.horiguchi.kyotaro@lab.ntt.co.jp |
In response to | Re: [HACKERS] Radix tree for character conversion (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
List | pgsql-hackers |
Hmm, things are a bit different.

At Thu, 23 Mar 2017 12:13:07 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170323.121307.241436413.horiguchi.kyotaro@lab.ntt.co.jp>
> > Ok, I'll write a small script to generate a set of "conversion
> > dump" and try to write README.sanity_check describing how to use
> > it.
>
> I found that there's no way to identify the character domain of a
> conversion on the SQL interface. Unconditionally giving from 0 to
> 0xffffffff as a bytea string yields a too-bloated result containing
> many bogus lines. (If \x40 is a character, convert() also
> accepts \x4040, \x404040 and \x40404040)
>
> One more annoyance is the fact that mappings and conversion
> procedures are not in one-to-one correspondence. The
> correspondence is hidden in conversion_procs/*.c files so we
> should extract it from them or provide it as knowledge. Neither
> seems good.
>
> Finally, it seems that I have no choice other than resurrecting
> map_checker. Exactly the same one no longer works but
> map_dumper.c with almost the same structure will work.
>
> If no one objects to adding map_dumper.c and
> gen_mapdumper_header.pl (tentative name, of course), I'll make a
> patch to do that.

The script or executable should be compatible between versions, but pg_mb_radix_conv is not. On the other hand, a higher-level API requires server stuff.

In the end I made an extension that dumps an encoding conversion.

encoding_dumper('SJIS', 'UTF-8') or encoding_dumper(35, 6)

It returns the following output, consisting of two BYTEA columns.

 srccode | dstcode
---------+----------
 \x01    | \x01
 \x02    | \x02
 ...
 \xfc4a  | \xe9b899
 \xfc4b  | \xe9bb91
(7914 rows)

This returns in a very short time, but not when srccode extends to 4 bytes. As an extreme example, the following

> =# select * from encoding_dumper('UTF-8', 'LATIN1');

takes over 2 minutes to return only 255 rows.
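To illustrate why the scan gets so slow once source codes reach four bytes, here is a rough Python sketch of the brute-force approach. It is not the encoding_dumper extension: Python's own codecs stand in for PostgreSQL's conversion procedures (their mappings may differ from PostgreSQL's), and the names `candidate_count` and `dump_conversion` are hypothetical.

```python
from itertools import product


def candidate_count(max_len):
    """Number of byte strings of length 1..max_len that a blind scan must try."""
    return sum(256 ** n for n in range(1, max_len + 1))


def dump_conversion(src_codec, dst_codec, max_len):
    """Try every byte string up to max_len bytes and keep those that convert
    as exactly one character. Python codecs are an assumption used only for
    illustration; PostgreSQL uses its own conversion procedures."""
    rows = []
    for n in range(1, max_len + 1):
        for combo in product(range(256), repeat=n):
            src = bytes(combo)
            try:
                text = src.decode(src_codec)
                # Skip strings such as b"\x40\x40" that are really two characters.
                if len(text) != 1:
                    continue
                rows.append((src, text.encode(dst_codec)))
            except UnicodeError:
                continue  # invalid in the source encoding or unmappable in the target
    return rows


if __name__ == "__main__":
    print(candidate_count(2))  # 65792
    print(candidate_count(4))  # 4311810304
    rows = dump_conversion("shift_jis", "utf-8", 2)
    print(len(rows), rows[:2])
```

Going from two-byte to four-byte candidates multiplies the search space by roughly 65,000, even though the number of valid rows may stay tiny, as with a LATIN1 target.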
We cannot determine the exact domain without looking into the map data, so the function can do nothing other than loop through all the four-byte values. Providing a function that gives the domain for a conversion was a mess, especially for arithmetic conversions. The following query took 94 minutes to give 25M lines/125MB. In short, that's crap. (the first attached)

SELECT x.conname, y.srccode, y.dstcode
FROM
  (SELECT conname, conforencoding, contoencoding
   FROM pg_conversion c
   WHERE pg_char_to_encoding('UTF-8') IN (c.conforencoding, c.contoencoding)
     AND pg_char_to_encoding('SQL_ASCII') NOT IN (c.conforencoding, c.contoencoding)) as x,
  LATERAL (
    SELECT srccode, dstcode
    FROM encoding_dumper(x.conforencoding, x.contoencoding)) as y
ORDER BY x.conforencoding, x.contoencoding, y.srccode;

As another way, I added a means to generate plain mapping lists corresponding to .map files (similar to the old maps but simpler), and this finishes the work within a second.

$ make mapdumps

If we will not shortly change the framework of mapped character conversion, the dumper program may be useful, but I'm not sure this is reasonable as a sanity check for future modifications. In the PoC, pg_mb_radix_tree() is copied into map_checker.c but this needs to be a separate file again. (the second attached)

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center