Re: [HACKERS] Radix tree for character conversion - Mailing list pgsql-hackers

From Kyotaro HORIGUCHI
Subject Re: [HACKERS] Radix tree for character conversion
Date
Msg-id 20170227.173733.18397866.horiguchi.kyotaro@lab.ntt.co.jp
Whole thread Raw
In response to Re: [HACKERS] Radix tree for character conversion  (Michael Paquier <michael.paquier@gmail.com>)
Responses Re: [HACKERS] Radix tree for character conversion
Re: [HACKERS] Radix tree for character conversion
List pgsql-hackers
Thank you for the comment.

At Wed, 22 Feb 2017 16:06:14 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqRTQ+7ZjxuPTbsr18MXvW7mTd29mN+91N7AG8fe5aCeAA@mail.gmail.com>
> Thanks for the rebase. I have been spending sore time looking at this
> patch. The new stuff in convutils.pm is by far the interesting part of
> the patch, where the building of the radix trees using a byte
> structure looks in pretty good shape after eyeballing the logic for a
> couple of hours.
> 
> +# ignore backup files of editors
> +/*[~#]
> +
> This does not belong to Postgres core code. You could always set up
> that in a global exclude file with core.excludefiles.

Thank you for letting me know about it. I removed that.

> In order to conduct sanity checks on the shape of the radix tree maps
> compared to the existing maps, having map_checker surely makes sense.
> Now in the final result I don't think we need it. The existing map
> files ought to be replaced by their radix versions at the end, and
> map_checker should be removed. This leads to a couple of
> simplifications, like Makefile, and reduces the maintenance to one
> mechanism.

Hmm.. Though I don't remember clearly what the radix map of the
first version looked like, the current radix map seems
human-readable for me. It might be by practice or by additional
comments in map files. Anyway I removed all of the stuff so as
not to generate the plain maps. But I didn't change the names of
_radix.map and just commented out the line to output the plain
maps in UCS_to_*.pl.  Combined maps are still in the plain format
so print_tables was changed to take character tables separately
for regular (non-combined) characters and combined characters.

> +sub print_radix_trees
> +{
> +   my ($this_script, $csname, $charset) = @_;
> +
> +   &print_radix_map($this_script, $csname, "from_unicode", $charset, 78);
> +   &print_radix_map($this_script, $csname, "to_unicode",   $charset, 78);
> +}
> There is no need for the table width to be defined as a variable (5th
> argument).

The table width was already useless.. Removed.

> Similarly, to_unicode/from_unicode require checks in
> multiple places, this could be just a simple boolean flag.

The direction is a tristate (to/from/both) variable so cannot be
replaced with a boolean. But I agree that comparing with free
string is not so good. This is a change already committed in the
master but it is changed in the attached patch.

# Perhaps it is easier to read in string form..

> Or if you want to go to the road of non-simple things, you
> could have two arguments: an origin and a target. If one is
> UTF8 the other is the mapping name.

Mmmm. It seems (even) to me to give more harm than good.  I can
guess two alternatives for this.

- Split the property {direction} into two boolean properties {to_unicode} and {from_unicode}.

- Make the {direction} property an integer and compared with defined constants $BOTH, $TO_UNICODE and $FROM_UNICODE
usingthe '=' operator.
 

I choosed the former in this patch.

> +sub dump_charset
> +{
> +   my ($list, $filt) = @_;
> +
> +   foreach my $i (@$list)
> +   {
> +       next if (defined $filt && !&$filt($i));
> +       if (!defined $i->{ucs}) { $i->{ucs} = &utf2ucs($i->{utf8}); }
> +       printf "ucs=%x, code=%x, direction=%s %s:%d %s\n",
> +         $i->{ucs}, $i->{code}, $i->{direction},
> +         $i->{f},   $i->{l},    $i->{comment};
> +   }
> +}
> This is used nowhere. Perhaps it was useful for debugging at some point?

Yes, it was quite useful. Removed.

> +# make_charmap - convert charset table to charmap hash
> +#     with checking duplicate source code
> Maybe this should be "with checking of duplicated source codes".

Even though I'm not good English writer, 'duplicated codes' looks
as multiple copies of the original 'code' (for me, of
course). And the 'checking' is a (pure) verbal noun (means not a
deverbal noun) so 'of' is not required. But, of course, I'm not
sure which sounds more natural as English.

This comment is not changed.

> +# print_radix_map($this_script, $csname, $direction, \%charset, $tblwidth)
> +#
> +# this_script - the name of the *caller script* of this feature
> $this_script is not needed at all, you could just use basename($0) and
> reduce the number of arguments of the different functions of the
> stack.

I avoided relying on global stuff by that but I accept the
suggestion. Fixed in this version.

> +   ### amount of zeros that can be ovarlaid.
> s/ovarlaid/overlaid.
> 
> +# make_mapchecker.pl - Gerates map_checker.h file included by map_checker.c
> s/gerates/generates/

make_mapcheker.pl is a tool only for map_checker.c so this files
is removed.

> +           if (s < 0x80)
> +           {
> +               fprintf(stderr, "\nASCII character ? (%x)", s);
> +               exit(1);
> +           }
> Most likely a newline at the end of the error string is better here.

map_checker.c is removed.

> +           $charmap{ ucs2utf($src) } = $dst;
> +       }
> +
> +   }
> Unnecessary newline here.

removed in convutils.pm.

Since Makefile ignores old .map files, the steps to generate a
patch for map files was a bit chaged.

$ rm *.map
$ make distclean maintainer-clean all
$ make distclean
$ git add .
$ git commit

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

pgsql-hackers by date:

Previous
From: Noah Misch
Date:
Subject: Re: [HACKERS] Recommend wrappers of PG_DETOAST_DATUM_PACKED()
Next
From: Artur Zakirov
Date:
Subject: Re: [HACKERS] Bug in to_timestamp().