Re: [HACKERS] Radix tree for character conversion - Mailing list pgsql-hackers

From Kyotaro HORIGUCHI
Subject Re: [HACKERS] Radix tree for character conversion
Date
Msg-id 20170130.153738.139030994.horiguchi.kyotaro@lab.ntt.co.jp
Whole thread Raw
In response to Re: [HACKERS] Radix tree for character conversion  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses Re: [HACKERS] Radix tree for character conversion
List pgsql-hackers
Hello, this is the revised version of character conversion using radix tree.

At Fri, 27 Jan 2017 17:33:57 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170127.173357.221584433.horiguchi.kyotaro@lab.ntt.co.jp>
> Hi, this is an intermediate report without a patch.
> 
> At Thu, 26 Jan 2017 21:42:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in<20170126.214212.111556326.horiguchi.kyotaro@lab.ntt.co.jp>
 
> > > > 0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch
> > > >
> > > >   Before adding radix tree stuff, applied pgperltidy and inserted
> > > >   format-skipping pragma for the parts where perltidy seems to do
> > > >   too much.
> > > 
> > > Which version of perltidy did you use? Looking at the archives, the
> > > perl code is cleaned up with a specific version, v20090616. See
> > > https://www.postgresql.org/message-id/20151204054322.GA2070309@tornado.leadboat.com
> > > for example on the matter. As perltidy changes over time, this may be
> > > a sensitive change if done this way.
> 
> My perltidy -v said "v20121207'. Anyway, I gave up to apply
> perltidy by myself. So I'll just drop 0003 and new 0004 (name
> changed to 0003) is made immediately on 0002.

I'm not sure what to handle this so I just removed the perltidy
stuff from this patchset.

> > > > 0004-Use-radix-tree-for-character-conversion.patch
> > > >
> > > >   Radix tree body.
> > > 
> > > Well, here a lot of diffs could have been saved.
> > > 
> > > > The unattached fifth patch is generated by the following steps.
> > > >
> > > > [$(TOP)]$ ./configure
> > > > [Unicode]$ make
> > > > [Unicode]$ make distclean
> > > > [Unicode]$ git add .
> > > > [Unicode]$ commit
> > > > === COMMITE MESSSAGE
> > > > Replace map files with radix tree files.
> > > >
> > > > These encodings no longer uses the former map files and uses new radix
> > > > tree files.
> > > > ===
> > > 
> > > OK, I can see that working, with 200k of maps generated.. So going
> > > through the important bits of this jungle..
> > 
> > Many thaks for the exploration.
> > 
> > > +/*
> > > + * radix tree conversion function - this should be identical to the function in
> > > + * ../conv.c with the same name
> ..
> > > This is not nice. Having a duplication like that is a recipe to forget
> > > about it as this patch introduces a dependency with conv.c and the
> > > radix tree generation.
> 
> In the attatched patch, mb/char_conveter.c which contains one
> inline function is created and it is includ'ed from mb/conv.c and
> mb/Unicode/map_checker.c.
> 
> > > Having a .gitignore in Unicode/ would be nice, particularly to avoid
> > > committing map_checker.
> 
> I missed this.  I added .gitignore to ignore map_checker stuff
> and authority files and old-style map files.
> 
> > > A README documenting things may be welcome, or at least comments at
> > > the top of map_checker.c. Why is map_checker essential? What does it
> > > do? There is no way to understand that easily, except that it includes
> > > a "radix tree conversion function", and that it performs sanity checks
> > > on the radix trees to be sure that they are on a good shape. But as
> > > this something that one would guess only after looking at your patch
> > > and the code (at least I will sleep less stupid tonight after reading
> > > this stuff).
> > 
> > Okay, I'll do that.
> 
> The patch has not been provided yet, I'm going to put the
> following comment just before the main() in map_checker.c.
> 
> /*
>  * The old-style plain map files were error-resistant due to its
>  * straight-forward way for generation from authority files. In contrast the
>  * radix tree maps are generated by a rather complex calculation and have a
>  * complex, hard-to-confirm format.
>  *
>  * This program runs sanity check of the radix tree maps by confirming all
>  * characters in the plain map files to be converted to the same code by the
>  * corresponding radix tree map.
>  *
>  * All map files are included by map_checker.h that is generated by the script
>  * make_mapchecker.pl as the variable mappairs.
>  *
>  */
> 
> 
> I'll do the following things later.

The following is the continuation.

> > --- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
> > +++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
> >  # Drop these SJIS codes from the source for UTF8=>SJIS conversion
> >  #<<< do not let perltidy touch this
> > -my @reject_sjis =(
> > +my @reject_sjis = (
> >     0xed40..0xeefc, 0x8754..0x875d, 0x878a, 0x8782,
> > -   0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
> > +   0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
> >     0x879a..0x879c
> > -);
> > +   );
> > This is not generated, it would be nice to drop the noise from the patch.
> 
> Mmm. I'm not sure how this is generated but I'll care for that.

I don't still understand what what the intermediate diff comes
from but copy-n-pasting from master silenced it...

> > Here is another one:
> > -       $i->{code} = $jis | (
> > -           $jis < 0x100
> > -           ? 0x8e00
> > -           : ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
> > -
> > +#<<< do not let perltidy touch this
> > +       $i->{code} = $jis | ($jis < 0x100 ? 0x8e00:
> > +                            ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
> > +#>>>
> 
> Ok. Will revert this.

The "previous" code (prefixed with a minus sign) is "my"
perltidy's work. Preltidy step is just removed from the patchset.

> >         if (l == 2)
> >         {
> > -           iutf = *utf++ << 8;
> > -           iutf |= *utf++;
> > +           b3 = *utf++;
> > +           b4 = *utf++;
> >         }
> > Ah, OK. This conversion is important so as it performs a minimum of
> > bitwise operations. Yes let's keep that. That's pretty cool to get a
> > faster operation.
> 
> It is Heikki's work:p
> 
> I'll address them and repost the next version sooner.


Finally, the patchset had the following changes from the previous
shape.

- Avoid syntaxes perl 5.8 complains about

- The perltidy step has been removed.

- pg_mb_radix_conv is now a separate .c file (but included from other c files)

- Added Unicode/.gitignore. The line for [~#] might be needless.

- The patchset are made with --irreversible-delete. This is just what I wanted (but counldn't find by myself..)

- Semicolon-fix patch(0001) gets several additional fixes.



This patchset consists of four patches. The first two are bug
fixes back-patchable to older versions. The third one is the
patch that adds radix tree feature. The forth one is not attached
to this mail but generatable.

0001-Add-missing-semicolon.patch
 Adds missing semicolon found in three files.

0002-Correct-reference-resolution-syntax.patch
 Changes reference syntax to more preferable style.

0003-Use-radix-tree-for-character-conversion.patch
 Radix tree conversion patch. The size has been reduced from 1.6MB to 91KB.

0004: Replace map files
 This is not attached but generatable. This shouldn't fail even on the environment with perl 5.8.
 [$(TOP)]$ ./configure [$(TOP)]$ cd src/backend/utils/mb/Unicode [Unicode]$ make distclean maintainer-clean all
[Unicode]$make mapcheck ... All radix trees are perfect! [Unicode]$ make distclean [Unicode]$ git add . [Unicode]$ git
commit...
 
 The size of the forth patch was about 7.6MB using --irreversible-delete.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

pgsql-hackers by date:

Previous
From: Rushabh Lathia
Date:
Subject: Re: [HACKERS] Query fails when SRFs are part of FROM clause (Commitid: 69f4b9c85f)
Next
From: Ashutosh Bapat
Date:
Subject: Re: [HACKERS] Create a separate test file for exercising system views