Re: [HACKERS] Radix tree for character conversion - Mailing list pgsql-hackers
From | Kyotaro HORIGUCHI |
---|---|
Subject | Re: [HACKERS] Radix tree for character conversion |
Date | |
Msg-id | 20170130.153738.139030994.horiguchi.kyotaro@lab.ntt.co.jp Whole thread Raw |
In response to | Re: [HACKERS] Radix tree for character conversion (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
Responses |
Re: [HACKERS] Radix tree for character conversion
|
List | pgsql-hackers |
Hello, this is the revised version of character conversion using radix tree. At Fri, 27 Jan 2017 17:33:57 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170127.173357.221584433.horiguchi.kyotaro@lab.ntt.co.jp> > Hi, this is an intermediate report without a patch. > > At Thu, 26 Jan 2017 21:42:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in<20170126.214212.111556326.horiguchi.kyotaro@lab.ntt.co.jp> > > > > 0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch > > > > > > > > Before adding radix tree stuff, applied pgperltidy and inserted > > > > format-skipping pragma for the parts where perltidy seems to do > > > > too much. > > > > > > Which version of perltidy did you use? Looking at the archives, the > > > perl code is cleaned up with a specific version, v20090616. See > > > https://www.postgresql.org/message-id/20151204054322.GA2070309@tornado.leadboat.com > > > for example on the matter. As perltidy changes over time, this may be > > > a sensitive change if done this way. > > My perltidy -v said "v20121207'. Anyway, I gave up to apply > perltidy by myself. So I'll just drop 0003 and new 0004 (name > changed to 0003) is made immediately on 0002. I'm not sure what to handle this so I just removed the perltidy stuff from this patchset. > > > > 0004-Use-radix-tree-for-character-conversion.patch > > > > > > > > Radix tree body. > > > > > > Well, here a lot of diffs could have been saved. > > > > > > > The unattached fifth patch is generated by the following steps. > > > > > > > > [$(TOP)]$ ./configure > > > > [Unicode]$ make > > > > [Unicode]$ make distclean > > > > [Unicode]$ git add . > > > > [Unicode]$ commit > > > > === COMMITE MESSSAGE > > > > Replace map files with radix tree files. > > > > > > > > These encodings no longer uses the former map files and uses new radix > > > > tree files. > > > > === > > > > > > OK, I can see that working, with 200k of maps generated.. So going > > > through the important bits of this jungle.. > > > > Many thaks for the exploration. > > > > > +/* > > > + * radix tree conversion function - this should be identical to the function in > > > + * ../conv.c with the same name > .. > > > This is not nice. Having a duplication like that is a recipe to forget > > > about it as this patch introduces a dependency with conv.c and the > > > radix tree generation. > > In the attatched patch, mb/char_conveter.c which contains one > inline function is created and it is includ'ed from mb/conv.c and > mb/Unicode/map_checker.c. > > > > Having a .gitignore in Unicode/ would be nice, particularly to avoid > > > committing map_checker. > > I missed this. I added .gitignore to ignore map_checker stuff > and authority files and old-style map files. > > > > A README documenting things may be welcome, or at least comments at > > > the top of map_checker.c. Why is map_checker essential? What does it > > > do? There is no way to understand that easily, except that it includes > > > a "radix tree conversion function", and that it performs sanity checks > > > on the radix trees to be sure that they are on a good shape. But as > > > this something that one would guess only after looking at your patch > > > and the code (at least I will sleep less stupid tonight after reading > > > this stuff). > > > > Okay, I'll do that. > > The patch has not been provided yet, I'm going to put the > following comment just before the main() in map_checker.c. > > /* > * The old-style plain map files were error-resistant due to its > * straight-forward way for generation from authority files. In contrast the > * radix tree maps are generated by a rather complex calculation and have a > * complex, hard-to-confirm format. > * > * This program runs sanity check of the radix tree maps by confirming all > * characters in the plain map files to be converted to the same code by the > * corresponding radix tree map. > * > * All map files are included by map_checker.h that is generated by the script > * make_mapchecker.pl as the variable mappairs. > * > */ > > > I'll do the following things later. The following is the continuation. > > --- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl > > +++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl > > # Drop these SJIS codes from the source for UTF8=>SJIS conversion > > #<<< do not let perltidy touch this > > -my @reject_sjis =( > > +my @reject_sjis = ( > > 0xed40..0xeefc, 0x8754..0x875d, 0x878a, 0x8782, > > - 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797, > > + 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797, > > 0x879a..0x879c > > -); > > + ); > > This is not generated, it would be nice to drop the noise from the patch. > > Mmm. I'm not sure how this is generated but I'll care for that. I don't still understand what what the intermediate diff comes from but copy-n-pasting from master silenced it... > > Here is another one: > > - $i->{code} = $jis | ( > > - $jis < 0x100 > > - ? 0x8e00 > > - : ($sjis >= 0xeffd ? 0x8f8080 : 0x8080)); > > - > > +#<<< do not let perltidy touch this > > + $i->{code} = $jis | ($jis < 0x100 ? 0x8e00: > > + ($sjis >= 0xeffd ? 0x8f8080 : 0x8080)); > > +#>>> > > Ok. Will revert this. The "previous" code (prefixed with a minus sign) is "my" perltidy's work. Preltidy step is just removed from the patchset. > > if (l == 2) > > { > > - iutf = *utf++ << 8; > > - iutf |= *utf++; > > + b3 = *utf++; > > + b4 = *utf++; > > } > > Ah, OK. This conversion is important so as it performs a minimum of > > bitwise operations. Yes let's keep that. That's pretty cool to get a > > faster operation. > > It is Heikki's work:p > > I'll address them and repost the next version sooner. Finally, the patchset had the following changes from the previous shape. - Avoid syntaxes perl 5.8 complains about - The perltidy step has been removed. - pg_mb_radix_conv is now a separate .c file (but included from other c files) - Added Unicode/.gitignore. The line for [~#] might be needless. - The patchset are made with --irreversible-delete. This is just what I wanted (but counldn't find by myself..) - Semicolon-fix patch(0001) gets several additional fixes. This patchset consists of four patches. The first two are bug fixes back-patchable to older versions. The third one is the patch that adds radix tree feature. The forth one is not attached to this mail but generatable. 0001-Add-missing-semicolon.patch Adds missing semicolon found in three files. 0002-Correct-reference-resolution-syntax.patch Changes reference syntax to more preferable style. 0003-Use-radix-tree-for-character-conversion.patch Radix tree conversion patch. The size has been reduced from 1.6MB to 91KB. 0004: Replace map files This is not attached but generatable. This shouldn't fail even on the environment with perl 5.8. [$(TOP)]$ ./configure [$(TOP)]$ cd src/backend/utils/mb/Unicode [Unicode]$ make distclean maintainer-clean all [Unicode]$make mapcheck ... All radix trees are perfect! [Unicode]$ make distclean [Unicode]$ git add . [Unicode]$ git commit... The size of the forth patch was about 7.6MB using --irreversible-delete. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
pgsql-hackers by date: