Re: [HACKERS] Radix tree for character conversion - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: [HACKERS] Radix tree for character conversion
Date
Msg-id CAB7nPqR49krGP6qaaKaL2v3HCnn+dnzv8Dq_ySGbDSr6b_ywrw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Radix tree for character conversion  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses Re: [HACKERS] Radix tree for character conversion  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
List pgsql-hackers
On Mon, Feb 27, 2017 at 5:37 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Wed, 22 Feb 2017 16:06:14 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqRTQ+7ZjxuPTbsr18MXvW7mTd29mN+91N7AG8fe5aCeAA@mail.gmail.com>
>> In order to conduct sanity checks on the shape of the radix tree maps
>> compared to the existing maps, having map_checker surely makes sense.
>> Now in the final result I don't think we need it. The existing map
>> files ought to be replaced by their radix versions at the end, and
>> map_checker should be removed. This leads to a couple of
>> simplifications, like Makefile, and reduces the maintenance to one
>> mechanism.
>
> Hmm.. Though I don't remember clearly what the radix map of the
> first version looked like, the current radix map seems
> human-readable for me. It might be by practice or by additional
> comments in map files. Anyway I removed all of the stuff so as
> not to generate the plain maps. But I didn't change the names of
> _radix.map and just commented out the line to output the plain
> maps in UCS_to_*.pl.  Combined maps are still in the plain format
> so print_tables was changed to take character tables separately
> for regular (non-combined) characters and combined characters.

Do others have thoughts to offer on the matter? I would think that the
new radix maps should just replace by the old plain ones, and that the
only way to build the maps going forward is to use the new methods.
The radix trees is the only thing used in the backend code as well
(conv.c). We could keep the way to build the old maps, with the
map_checker in module out of the core code. FWIW, I am fine to add the
old APIs in my plugin repository on github and have the sanity checks
in that as well. And of course also publish on this thread a module to
do that.

>> Or if you want to go to the road of non-simple things, you
>> could have two arguments: an origin and a target. If one is
>> UTF8 the other is the mapping name.
>
> Mmmm. It seems (even) to me to give more harm than good.  I can
> guess two alternatives for this.
>
> - Split the property {direction} into two boolean properties
>   {to_unicode} and {from_unicode}.
>
> - Make the {direction} property an integer and compared with
>   defined constants $BOTH, $TO_UNICODE and $FROM_UNICODE using
>   the '=' operator.
>
> I choosed the former in this patch.

Fine for me.

>> +           $charmap{ ucs2utf($src) } = $dst;
>> +       }
>> +
>> +   }
>> Unnecessary newline here.
>
> removed in convutils.pm.
>
> Since Makefile ignores old .map files, the steps to generate a
> patch for map files was a bit chaged.
>
> $ rm *.map
> $ make distclean maintainer-clean all
> $ make distclean
> $ git add .
> $ git commit

+# ignore generated files
+/map_checker
+/map_checker.h
[...]
+map_checker.h: make_mapchecker.pl $(MAPS) $(RADIXMAPS)
+   $(PERL) $<
+
+map_checker.o: map_checker.c map_checker.h ../char_converter.c
+
+map_checker: map_checker.o
With map_checker out of the game, those things are not needed.

+++ b/src/backend/utils/mb/char_converter.c
@@ -0,0 +1,116 @@
+/*-------------------------------------------------------------------------
+ *
+ *   Character converter function using radix tree
In the simplified version of the patch, pg_mb_radix_conv() being only
needed in conv.c I think that this could just be a static local
routine.

-#include "../../Unicode/utf8_to_koi8r.map"
-#include "../../Unicode/koi8r_to_utf8.map"
-#include "../../Unicode/utf8_to_koi8u.map"
-#include "../../Unicode/koi8u_to_utf8.map"
+#include "../../Unicode/utf8_to_koi8r_radix.map"
+#include "../../Unicode/koi8r_to_utf8_radix.map"
+#include "../../Unicode/utf8_to_koi8u_radix.map"
+#include "../../Unicode/koi8u_to_utf8_radix.map"
FWIW, I am fine to use those new names as include points.

-distclean: clean
+distclean:   rm -f $(TEXTS)
-maintainer-clean: distclean
+# maintainer-clean intentionally leaves $(TEXTS)
+maintainer-clean:
Why is that? There is also a useless diff down that code block.

+conv.o: conv.c char_converter.c
This also can go away.

-print_tables("EUC_JIS_2004", \@all, 1);
+# print_tables("EUC_JIS_2004", \@regular, undef, 1);
+print_radix_trees("EUC_JIS_2004", \@regular);
+print_tables("EUC_JIS_2004", undef, \@combined, 1);
[...]sub print_tables{
-   my ($charset, $table, $verbose) = @_;
+   my ($charset, $regular, $combined, $verbose) = @_;
print_tables is only used for combined maps, you could remove $regular
from it and just keep $combined around, perhaps renaming print_tables
to print_combined_maps?
-- 
Michael



pgsql-hackers by date:

Previous
From: Amit Langote
Date:
Subject: Re: [HACKERS] Partitioned tables and relfilenode
Next
From: Michael Paquier
Date:
Subject: Re: [HACKERS] timeouts in PostgresNode::psql