Re: speed up unicode decomposition and recomposition - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: speed up unicode decomposition and recomposition
Date
Msg-id 20201016033208.GC1581@paquier.xyz
In response to Re: speed up unicode decomposition and recomposition  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: speed up unicode decomposition and recomposition
Re: speed up unicode decomposition and recomposition
List pgsql-hackers
On Thu, Oct 15, 2020 at 01:59:38PM -0400, John Naylor wrote:
> I think I've seen a trie recommended somewhere, maybe the official website.
> That said, I was able to get the hash working for recomposition (split into
> a separate patch, and both of them now leave frontend alone), and I'm
> pleased to say it's 50-75x faster than linear search in simple tests. I'd
> be curious how it compares to ICU now. Perhaps Daniel Verite would be
> interested in testing again? (CC'd)

Yeah, that would be interesting to compare.  The gains from this
patch are already a good step forward, so I don't think that this
should block a solution we have at hand; the numbers speak for
themselves here.  If something better gets proposed later, we can
always change the decomposition and recomposition logic as needed.

> select count(normalize(t, NFC)) from (
> select md5(i::text) as t from
> generate_series(1,100000) as i
> ) s;
>
> master     patch
> 18800ms    257ms

In my environment HEAD is a bit faster at about 15s, while the patch
gets "only" down to 290~300ms (compiled with -O2, as I guess you
did).  Nice.

+   # Then the second
+   return -1 if $a2 < $b2;
+   return 1 if $a2 > $b2;
Should this say "second code point" here?
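
For reference, the two-level ordering that the comment describes can be
sketched in C as a qsort() comparator.  This is only an illustration:
the struct and function names are hypothetical and do not come from the
patch, and it assumes the pairs are ordered by first code point, then
by second code point.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pair of code points; not a type from the patch. */
typedef struct
{
	uint32_t	first;
	uint32_t	second;
} codepoint_pair;

/* Order by the first code point, then by the second code point. */
static int
codepoint_pair_cmp(const void *pa, const void *pb)
{
	const codepoint_pair *a = (const codepoint_pair *) pa;
	const codepoint_pair *b = (const codepoint_pair *) pb;

	/* First code point decides if it differs */
	if (a->first < b->first)
		return -1;
	if (a->first > b->first)
		return 1;

	/* Then the second code point */
	if (a->second < b->second)
		return -1;
	if (a->second > b->second)
		return 1;

	return 0;
}
```

It would be used as qsort(pairs, npairs, sizeof(codepoint_pair),
codepoint_pair_cmp).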

+       hashkey = pg_hton64(((uint64) start << 32) | (uint64) code);
+       h = recompinfo.hash(&hashkey);
This choice should be documented, and most likely we should have
comments on the perl and C sides to keep track of the relationship
between the two.
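
To make the point concrete, here is a standalone sketch of what the key
construction does, as I read it: the starter goes into the high 32 bits,
the combining code point into the low 32 bits, and the result is laid
out most-significant byte first so that the hashed bytes do not depend
on host endianness.  The helper names are hypothetical, and
key_to_network_order() is only a stand-in for pg_hton64(), not the real
thing.  That the perl side hashes the same eight bytes is an assumption,
which is exactly what a comment should pin down.

```c
#include <stdint.h>

/*
 * Hypothetical helper: pack (start, code) into one 64-bit key, with
 * "start" in the high 32 bits and "code" in the low 32 bits.
 */
static uint64_t
make_recomp_key(uint32_t start, uint32_t code)
{
	return ((uint64_t) start << 32) | (uint64_t) code;
}

/*
 * Hypothetical stand-in for pg_hton64(): write the key to a byte
 * buffer, most-significant byte first (network byte order).
 */
static void
key_to_network_order(uint64_t key, unsigned char out[8])
{
	for (int i = 0; i < 8; i++)
		out[i] = (unsigned char) (key >> (56 - 8 * i));
}
```

With start = U+0041 and code = U+0300, the eight bytes are
00 00 00 41 00 00 03 00 on any host.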

The binary sizes of libpgcommon_shlib.a and libpgcommon.a change
because Decomp_hash_func() gets included, impacting libpq.
Structurally, wouldn't it be better to move this part into its own,
backend-only, header?  It would be possible to hide the difference
with some #ifdef FRONTEND of course, or to just keep things as they
are because this could be useful for some out-of-core frontend tool.
But if we keep that as a separate header then any C file can decide to
include it or not, so frontend tools could also make this choice.
Note that unicode_norm.c does not include unicode_normprops_table.h
for frontends, but it does include unicode_norm_table.h.
--
Michael
