Radix tree for character conversion - Mailing list pgsql-hackers
From | Kyotaro HORIGUCHI |
---|---|
Subject | Radix tree for character conversion |
Date | |
Msg-id | 20161007.173606.217452136.horiguchi.kyotaro@lab.ntt.co.jp |
In response to | Re: Supporting SJIS as a database encoding (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
Responses | Re: Radix tree for character conversion |
List | pgsql-hackers |
This is a different topic from the original thread, so I renamed the subject and reposted. Sorry for the duplicate posting.

======================

Hello, I did this. As a result, the radix tree is about 1.5 times faster and needs half the memory.

At Wed, 21 Sep 2016 15:14:27 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160921.151427.265121484.horiguchi.kyotaro@lab.ntt.co.jp>
> I'll work on this for the next CF.

The radix conversion function and the map conversion script have become more generic than before, so I could easily add radix conversion for EUC_JP in addition to Shift JIS. nm -S says that the radix tree data for the sjis->utf8 conversion is 34kB and that for utf8->sjis is 46kB (eucjp->utf8 57kB, utf8->eucjp 93kB). LUmapSJIS and ULmapSJIS were 62kB and 59kB, and LUmapEUC_JP and ULmapEUC_JP were 106kB and 105kB. If I'm not missing something, the radix tree is faster and requires less memory. (A simplified sketch of the two lookup strategies is included at the end of this mail.)

A simple test, running "select '<7070 sjis chars>'" 100 times (I'm not sure, but the size is 1404kB) over a local connection, shows that this is fast enough.

radix:  real 0m0.285s / user 0m0.199s / sys 0m0.006s
master: real 0m0.418s / user 0m0.180s / sys 0m0.004s

To make sure, the results of sending the same amount of ASCII text (1404kB) under the SJIS and UTF8 (no-conversion) client encodings are as follows.

ascii/utf8-sjis: real 0m0.220s / user 0m0.176s / sys 0m0.011s
ascii/utf8-utf8: real 0m0.137s / user 0m0.111s / sys 0m0.008s

======

Random discussions

- Currently the tree structure is divided into several elements: one for 2-byte codes, others for 3-byte and 4-byte codes, and the output table. All of them except the last could logically and technically be merged into a single table, but that would make the generator script far more complex than it is now. I no longer want to play hide-and-seek with complex perl objects. (A sketch of this divided layout is also included at the end of this mail.)

- It might be better to make this a native feature of the core. Currently the helper function is in core, but it is passed as conv_func when calling LocalToUtf.

- The current implementation uses the *.map files of pg_utf_to_local as input. That doesn't seem good, but the radix tree files are completely uneditable. Providing a custom loading function for every source, instead of load_chartable(), would be the way to go.

  # However, utf8_to_sjis.map, for example, doesn't seem to have been
  # generated from the source mentioned in UCS_to_SJIS.pl.

- I'm not sure that compilers other than gcc accept the generated map file content.

- RADIXTREE.pm is written in a rather old style, but that seems to be no problem.

- I haven't tried this with charsets that contain 4-byte codes.

- I haven't considered charsets with combined characters; I don't think that is needed immediately.

- Though I believe this is easily applied to other conversions, I have tried it only with character sets that I know.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
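A minimal sketch of the two lookup strategies compared above, simplified to 2-byte codes. All names here are hypothetical, not the ones in the patch or in the existing conversion modules: the flat-map side does a binary search over sorted (code, ucs) pairs, as the LUmap/ULmap tables do, while the radix side spends one table load per input byte.

```c
#include <stdint.h>
#include <stdio.h>

/* --- flat map: binary search over sorted (code, ucs) pairs ------------- */

typedef struct
{
    uint32_t    code;           /* source code, e.g. an SJIS byte pair */
    uint32_t    ucs;            /* Unicode code point */
} map_entry;

static const map_entry flatmap[] = {
    {0x82a0, 0x3042},           /* SJIS 0x82 0xa0 -> U+3042 (hiragana A) */
    {0x82a2, 0x3044},           /* SJIS 0x82 0xa2 -> U+3044 (hiragana I) */
};

static uint32_t
flatmap_lookup(uint32_t code)
{
    int         lo = 0;
    int         hi = (int) (sizeof(flatmap) / sizeof(flatmap[0]));

    while (lo < hi)
    {
        int         mid = (lo + hi) / 2;

        if (flatmap[mid].code == code)
            return flatmap[mid].ucs;
        if (flatmap[mid].code < code)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;                   /* no mapping */
}

/* --- radix tree: one 256-slot row per byte, shared output table -------- */

#define ROWSZ 256

static const uint32_t output[] = {0, 0x3042, 0x3044};   /* slot 0 = miss */

static uint16_t b1rows[ROWSZ];          /* first byte -> row number */
static uint16_t b2rows[2 * ROWSZ];      /* row 0 = all-miss, row 1 = 0x82xx */

static uint32_t
radix_lookup(uint8_t b1, uint8_t b2)
{
    uint16_t    off = b2rows[b1rows[b1] * ROWSZ + b2];

    return off ? output[off] : 0;       /* two loads, no comparisons */
}

int
main(void)
{
    /* build the toy radix tree for the two mappings above */
    b1rows[0x82] = 1;
    b2rows[1 * ROWSZ + 0xa0] = 1;
    b2rows[1 * ROWSZ + 0xa2] = 2;

    printf("flat:  U+%04X\n", (unsigned) flatmap_lookup(0x82a0));   /* U+3042 */
    printf("radix: U+%04X\n", (unsigned) radix_lookup(0x82, 0xa0)); /* U+3042 */
    printf("radix: U+%04X\n", (unsigned) radix_lookup(0x82, 0xa1)); /* miss */
    return 0;
}
```

The constant cost of one load per input byte, against O(log n) comparisons over tens of thousands of entries, is consistent with the roughly 1.5x speedup measured above; the memory saving presumably comes from not storing the 32-bit source codes and from sharing identical rows in the tree.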
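And a rough sketch of the divided structure described in the first discussion item: separate index segments for 2-, 3- and 4-byte codes, all resolving into a single shared output table. The struct and the dispatch are hypothetical and intended only to show the shape; the actual tables are emitted by the generator script.

```c
#include <stdint.h>

#define ROWSZ 256

/*
 * Hypothetical container for the divided radix data: one index segment per
 * input length, plus a single shared output table.  Offset 0 in the output
 * table means "no mapping"; row 0 of each segment is an all-zero miss row
 * and row 1 is the root, so an unmapped byte falls through to a miss
 * without extra branching.
 */
typedef struct
{
    const uint16_t *seg2;       /* subtree for 2-byte codes */
    const uint16_t *seg3;       /* subtree for 3-byte codes */
    const uint16_t *seg4;       /* subtree for 4-byte codes */
    const uint32_t *output;     /* shared output table */
} radix_conv;

/*
 * Convert one multibyte character; the caller has already dealt with
 * single-byte (ASCII) input and knows the sequence length.
 */
static uint32_t
radix_convert(const radix_conv *t, const uint8_t *in, int len)
{
    const uint16_t *seg = (len == 2) ? t->seg2 :
                          (len == 3) ? t->seg3 : t->seg4;
    uint16_t    idx = 1;        /* start at the root row */
    int         i;

    for (i = 0; i < len; i++)
        idx = seg[idx * ROWSZ + in[i]]; /* last level yields output offset */

    return idx ? t->output[idx] : 0;
}
```

Keeping the segments separate costs one small dispatch per character but keeps the generator script simple, which matches the trade-off described above.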