Thread: speed up unicode decomposition and recomposition
With the optimization for the Unicode normalization quick check committed, Michael Paquier suggested that I might do the same for decomposition as well. I wrote:
> I'll
> do some performance testing soon. Note that a 25kB increase in size could
> be present in frontend binaries as well in this case. While looking at
> decomposition, I noticed that recomposition does a linear search through
> all 6600+ entries, although it seems only about 800 are valid for that.
> That could be optimized as well now, since with hashing we have more
> flexibility in the ordering and can put the recomp-valid entries in front.
> I'm not yet sure if it's worth the additional complexity. I'll take a look
> and start a new thread.
The attached patch uses a perfect hash for codepoint decomposition, and for recomposing reduces the linear search from 6604 entries to 942.
The performance is very nice, and if I'd known better I would have done this first, since the decomp array is as big as the two quick check arrays put together:
Normalize, decomp only
select count(normalize(t, NFD)) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;
master patch
887ms 231ms
select count(normalize(t, NFD)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
generate_series(1,100000) as i
) s;
master patch
1110ms 208ms
Normalize, decomp+recomp (note: 100x less data)
select count(normalize(t, NFC)) from (
select md5(i::text) as t from
generate_series(1,1000) as i
) s;
master patch
194ms 50.6ms
select count(normalize(t, NFC)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
generate_series(1,1000) as i
) s;
master patch
137ms 39.4ms
Quick check is another 2x faster on top of previous gains, since it tests canonical class via the decomposition array:
-- all chars are quickcheck YES
select count(*) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s
where t is NFC normalized;
master patch
296ms 131ms
Some other considerations:
- As I alluded above, this adds ~26kB to libpq because of SASLPrep. Since the decomp array was reordered to optimize linear search, it can no longer be used for binary search. It's possible to build two arrays, one for frontend and one for backend, but that's additional complexity. We could also force frontend to do a linear search all the time, but that seems foolish. I haven't checked if it's possible to exclude the hash from backend's libpq.
- I could split out the two approaches into separate patches, but it'd be rather messy.
I'll add a CF entry for this.
Attachment
John Naylor <john.naylor@enterprisedb.com> writes:
> Some other considerations:
> - As I alluded above, this adds ~26kB to libpq because of SASLPrep. Since
> the decomp array was reordered to optimize linear search, it can no longer
> be used for binary search. It's possible to build two arrays, one for
> frontend and one for backend, but that's additional complexity. We could
> also force frontend to do a linear search all the time, but that seems
> foolish. I haven't checked if it's possible to exclude the hash from
> backend's libpq.

IIUC, the only place libpq uses this is to process a password-sized string
or two during connection establishment. It seems quite silly to add 26kB
in order to make that faster. Seems like a nice speedup on the backend
side, but I'd vote for keeping the frontend as-is.

regards, tom lane
On Wed, Oct 14, 2020 at 01:06:40PM -0400, Tom Lane wrote:
> John Naylor <john.naylor@enterprisedb.com> writes:
>> Some other considerations:
>> - As I alluded above, this adds ~26kB to libpq because of SASLPrep. Since
>> the decomp array was reordered to optimize linear search, it can no longer
>> be used for binary search. It's possible to build two arrays, one for
>> frontend and one for backend, but that's additional complexity. We could
>> also force frontend to do a linear search all the time, but that seems
>> foolish. I haven't checked if it's possible to exclude the hash from
>> backend's libpq.
>
> IIUC, the only place libpq uses this is to process a password-sized string
> or two during connection establishment. It seems quite silly to add
> 26kB in order to make that faster. Seems like a nice speedup on the
> backend side, but I'd vote for keeping the frontend as-is.

Agreed. Let's only use the perfect hash in the backend. It would be nice
to avoid an extra generation of the decomposition table for that, and a
table ordered by codepoints is easier to look at. How much do you think
the performance impact would be if the linear search didn't use the
most-optimized decomposition table?
--
Michael
Attachment
On Wed, Oct 14, 2020 at 8:25 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Oct 14, 2020 at 01:06:40PM -0400, Tom Lane wrote:
> IIUC, the only place libpq uses this is to process a password-sized string
> or two during connection establishment. It seems quite silly to add
> 26kB in order to make that faster. Seems like a nice speedup on the
> backend side, but I'd vote for keeping the frontend as-is.
Agreed. Let's only use the perfect hash in the backend. It would be
nice to avoid an extra generation of the decomposition table for that,
and a table ordered by codepoints is easier to look at. How much do
you think would be the performance impact if we don't use for the
linear search the most-optimized decomposition table?
With those points in mind and thinking more broadly, I'd like to try harder on recomposition. Even several times faster, recomposition is still orders of magnitude slower than ICU, as measured by Daniel Verite [1]. I only did it this way because I couldn't think of how to do the inverse lookup with a hash. But I think if we constructed the hash key like
pg_hton64((code1 << 32) | code2)
and on the Perl side do something like
pack('N',$code1) . pack('N',$code2)
that might work. Or something that looks more like the C side. And make sure to use the lowest codepoint for the result. That way, we can still keep the large decomp array ordered, making it easier to keep the current implementation in the front end, and hopefully getting even better performance in the backend.
John Naylor <john.naylor@enterprisedb.com> writes:
> With those points in mind and thinking more broadly, I'd like to try harder
> on recomposition. Even several times faster, recomposition is still orders
> of magnitude slower than ICU, as measured by Daniel Verite [1].

Huh. Has anyone looked into how they do it?

regards, tom lane
At Wed, 14 Oct 2020 23:06:28 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
> John Naylor <john.naylor@enterprisedb.com> writes:
> > With those points in mind and thinking more broadly, I'd like to try harder
> > on recomposition. Even several times faster, recomposition is still orders
> > of magnitude slower than ICU, as measured by Daniel Verite [1].
>
> Huh. Has anyone looked into how they do it?

I'm not sure this is the reason, but it may well be: ICU uses separate
tables for decomposition and composition, pointed to from a trie. That
table is consulted after trying algorithmic decomposition/composition
for, for example, Hangul. I didn't look into it any further, but just
for information, icu4c/source/common/normalizer2impl.cpp seems to do
that. For example, icu4c/source/common/norm2_nfc_data.h defines the
static data, and icu4c/source/common/normalizer2impl.h:244 points to a
design document on normalization:

http://site.icu-project.org/design/normalization/custom

> Old and New Implementation Details
>
> The old normalization data format (unorm.icu, ca. 2001..2009) uses
> three data structures for normalization: A trie for looking up 32-bit
> values for every code point, a 16-bit-unit array with decompositions
> and some other data, and a composition table (16-bit-unit array,
> linear search list per starter). The data is combined for all 4
> standard normalization forms: NFC, NFD, NFKC and NFKD.

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Oct 15, 2020 at 1:30 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Wed, 14 Oct 2020 23:06:28 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
> John Naylor <john.naylor@enterprisedb.com> writes:
> > With those points in mind and thinking more broadly, I'd like to try harder
> > on recomposition. Even several times faster, recomposition is still orders
> > of magnitude slower than ICU, as measured by Daniel Verite [1].
>
> Huh. Has anyone looked into how they do it?
I'm not sure this is the reason, but it may well be: ICU uses separate
tables for decomposition and composition, pointed to from a trie?
I think I've seen a trie recommended somewhere, maybe the official website. That said, I was able to get the hash working for recomposition (split into a separate patch, and both of them now leave frontend alone), and I'm pleased to say it's 50-75x faster than linear search in simple tests. I'd be curious how it compares to ICU now. Perhaps Daniel Verite would be interested in testing again? (CC'd)
select count(normalize(t, NFC)) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;
master patch
18800ms 257ms
select count(normalize(t, NFC)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
generate_series(1,100000) as i
) s;
master patch
13000ms 254ms
Attachment
On Thu, Oct 15, 2020 at 01:59:38PM -0400, John Naylor wrote:
> I think I've seen a trie recommended somewhere, maybe the official website.
> That said, I was able to get the hash working for recomposition (split into
> a separate patch, and both of them now leave frontend alone), and I'm
> pleased to say it's 50-75x faster than linear search in simple tests. I'd
> be curious how it compares to ICU now. Perhaps Daniel Verite would be
> interested in testing again? (CC'd)

Yeah, that would be interesting to compare. The gains proposed by this
patch are already a good step forward, so I don't think a comparison
should be a blocker for the solution we have at hand, as the numbers
speak for themselves. If something better gets proposed, we can always
change the decomposition and recomposition logic as needed.

> select count(normalize(t, NFC)) from (
>   select md5(i::text) as t from
>   generate_series(1,100000) as i
> ) s;
>
> master patch
> 18800ms 257ms

My environment was showing HEAD as being a bit faster with 15s, while
the patch gets "only" down to 290~300ms (compiled with -O2, as I guess
you did). Nice.

+ # Then the second
+ return -1 if $a2 < $b2;
+ return 1 if $a2 > $b2;

Should this say "second code point"?

+ hashkey = pg_hton64(((uint64) start << 32) | (uint64) code);
+ h = recompinfo.hash(&hashkey);

This choice should be documented, and most likely we should have
comments on the Perl and C sides to keep track of the relationship
between the two.

The binary sizes of libpgcommon_shlib.a and libpgcommon.a change because
Decomp_hash_func() gets included, impacting libpq. Structurally,
wouldn't it be better to move this part into its own, backend-only,
header? It could be possible to paint the difference with some ifdef
FRONTEND of course, or just keep things as they are, because this could
be useful for some out-of-core frontend tool. But if we keep that as a
separate header, then any C file can decide whether to include it, so
frontend tools could also make this choice. Note that we don't include
unicode_normprops_table.h for frontends in unicode_norm.c, but
unicode_norm_table.h is included there.
--
Michael
Attachment
John Naylor wrote:
> I'd be curious how it compares to ICU now

I've made another run of the test in [1] with your v2 patches from this
thread against icu_ext built with ICU-67.1. The results show the times
in milliseconds to process about 10 million short strings:

 operation  | unpatched | patched | icu_ext
------------+-----------+---------+---------
 nfc check  |      7968 |    5989 |    4076
 nfc conv   |    700894 |   15163 |    6808
 nfd check  |     16399 |    9852 |    3849
 nfd conv   |     17391 |   10916 |    6758
 nfkc check |      8259 |    6092 |    4301
 nfkc conv  |    700241 |   15354 |    7034
 nfkd check |     16585 |   10049 |    4038
 nfkd conv  |     17587 |   11109 |    7086

The ICU implementation still wins by a large margin, but the
improvements brought by the patch are gorgeous, especially for the
conversion to NFC/NFKC. This test ran on a slower machine than what I
used for [1], which is why all queries take longer. For the two queries
upthread, I get this:

1)
select count(normalize(t, NFC)) from (
  select md5(i::text) as t from
  generate_series(1,100000) as i
) s;
 count
--------
 100000
(1 row)

Time: 371.043 ms

vs ICU:

select count(icu_normalize(t, 'NFC')) from (
  select md5(i::text) as t from
  generate_series(1,100000) as i
) s;
 count
--------
 100000
(1 row)

Time: 125.809 ms

2)
select count(normalize(t, NFC)) from (
  select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
  generate_series(1,100000) as i
) s;
 count
--------
 100000
(1 row)

Time: 428.214 ms

vs ICU:

select count(icu_normalize(t, 'NFC')) from (
  select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
  generate_series(1,100000) as i
) s;
 count
--------
 100000
(1 row)

Time: 147.713 ms

[1] https://www.postgresql.org/message-id/2c5e8df9-43b8-41fa-88e6-286e8634f00a%40manitou-mail.org

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite
On Thu, Oct 15, 2020 at 11:32 PM Michael Paquier <michael@paquier.xyz> wrote:
The binary sizes of libpgcommon_shlib.a and libpgcommon.a change
because Decomp_hash_func() gets included, impacting libpq.
I don't see any difference on gcc/Linux in those two files, nor in unicode_norm_shlib.o -- I do see a difference in unicode_norm_srv.o as expected. Could it depend on the compiler?
On Fri, Oct 16, 2020 at 2:08 PM Daniel Verite <daniel@manitou-mail.org> wrote:
John Naylor wrote:
> I'd be curious how it compares to ICU now
I've made another run of the test in [1] with your v2 patches
from this thread against icu_ext built with ICU-67.1.
The results show the times in milliseconds to process
about 10 million short strings:
Thanks for testing!
On Mon, Oct 19, 2020 at 10:34:33AM -0400, John Naylor wrote:
> I don't see any difference on gcc/Linux in those two files, nor in
> unicode_norm_shlib.o -- I do see a difference in unicode_norm_srv.o as
> expected. Could it depend on the compiler?

Hmm. My guess is that you don't have --enable-debug in your set of
configure options? It is not unusual to have this one enabled for GCC
even on production systems, and the size of the libs is impacted in
this case with your patch.
--
Michael
Attachment
On Tue, Oct 20, 2020 at 3:22 AM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Oct 19, 2020 at 10:34:33AM -0400, John Naylor wrote:
> I don't see any difference on gcc/Linux in those two files, nor in
> unicode_norm_shlib.o -- I do see a difference in unicode_norm_srv.o as
> expected. Could it depend on the compiler?
Hmm. My guess is that you don't have --enable-debug in your set of
configure options? It is not unusual to have this one enabled for GCC
even on production systems, and the size of the libs is impacted in
this case with your patch.
On Tue, Oct 20, 2020 at 08:03:12AM -0400, John Naylor wrote:
> I've confirmed that. How about a new header unicode_norm_hashfunc.h which
> would include unicode_norm_table.h at the top. In unicode.c, we can include
> one of these depending on frontend or backend.

Sounds good to me. Looking at the code, I would just generate the second
file within generate-unicode_norm_table.pl.
--
Michael
Attachment
Attached v3 addressing review points below:
On Thu, Oct 15, 2020 at 11:32 PM Michael Paquier <michael@paquier.xyz> wrote:
+ # Then the second
+ return -1 if $a2 < $b2;
+ return 1 if $a2 > $b2;
Should say "second code point" here?
Done. Also changed the tiebreaker to the composed codepoint. Beforehand, it was the index into DecompMain[], which is only equivalent if the list is in order (it is but we don't want correctness to depend on that), and not very clear.
+ hashkey = pg_hton64(((uint64) start << 32) | (uint64) code);
+ h = recompinfo.hash(&hashkey);
This choice should be documented, and most likely we should have
comments on the perl and C sides to keep track of the relationship
between the two.
Done.
<separate headers>
Other cosmetic changes:
- format recomp array comments like /* U+0045+032D -> U+1E18 */
- make sure to comment #endif's that are far from the #if
- small whitespace fixes
Note: for the new header I simply adapted from unicode_norm_table.h the choice of "There is deliberately not an #ifndef PG_UNICODE_NORM_TABLE_H here", although I must confess I'm not sure what the purpose of that is, in this case.
--
Attachment
There was a mistake in v3 with pgindent/exclude_file_patterns, fixed in v4.
Attachment
On Wed, Oct 21, 2020 at 07:35:12PM -0400, John Naylor wrote:
> There was a mistake in v3 with pgindent/exclude_file_patterns, fixed in v4.

Thanks for the updated version, that was fast. I have found a couple of
places that needed to be adjusted, like the comment at the top of
generate-unicode_norm_table.pl and some other comments, an incorrect
include in the new headers, and Perl indentation that was not right (we
use perltidy v20170521; see the README in src/tools/pgindent). Except
for that, this looks good to me. Attached is the updated version with
all my tweaks, which I would like to commit. If there are any comments,
please feel free of course.
--
Michael
Attachment
On Thu, Oct 22, 2020 at 12:34 AM Michael Paquier <michael@paquier.xyz> wrote:
Thanks for the updated version, that was fast. I have found a couple
of places that needed to be adjusted, like the comment at the top of
generate-unicode_norm_table.pl or some comments, an incorrect include
in the new headers and the indentation was not right in perl (we use
perltidy v20170521, see the README in src/tools/pgindent).
Except that, this looks good to me. Attached is the updated version
with all my tweaks, that I would like to commit. If there are any
comments, please feel free of course.
Looks good to me.
On Thu, Oct 22, 2020 at 05:50:52AM -0400, John Naylor wrote:
> Looks good to me.

Thanks. Committed, then. Great work!
--
Michael
Attachment
On Thu, Oct 22, 2020 at 10:11 PM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Oct 22, 2020 at 05:50:52AM -0400, John Naylor wrote:
> Looks good to me.
Thanks. Committed, then. Great work!
I chanced to do an --enable-coverage test run today, and I got this
weird message during "make coverage-html":

genhtml: WARNING: function data mismatch at /home/postgres/pgsql/src/common/unicode_norm.c:102

I've never seen anything like that before. I suppose it means that
something about 783f0cc64 is a bit fishy, but I don't know what.

The emitted coverage report looks fairly normal anyway. It says
unicode_norm.c has zero test coverage, which is very possibly correct
since I wasn't running in UTF8 encoding, but I'm not entirely sure of
that either.

This is with RHEL8's lcov-1.13-4.el8 package. I suppose the first
question is: does anybody else see that?

regards, tom lane
> On Oct 23, 2020, at 9:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I chanced to do an --enable-coverage test run today, and I got this
> weird message during "make coverage-html":
>
> genhtml: WARNING: function data mismatch at /home/postgres/pgsql/src/common/unicode_norm.c:102
>
> I've never seen anything like that before. I suppose it means that
> something about 783f0cc64 is a bit fishy, but I don't know what.
>
> The emitted coverage report looks fairly normal anyway. It says
> unicode_norm.c has zero test coverage, which is very possibly correct
> since I wasn't running in UTF8 encoding, but I'm not entirely sure of
> that either.
>
> This is with RHEL8's lcov-1.13-4.el8 package. I suppose the first
> question is does anybody else see that?

I don't see it on mac nor on ubuntu64. I get 70.6% coverage of lines
and 90.9% of functions on ubuntu.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Oct 23, 2020 at 04:18:13PM -0700, Mark Dilger wrote:
> On Oct 23, 2020, at 9:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> genhtml: WARNING: function data mismatch at /home/postgres/pgsql/src/common/unicode_norm.c:102
>>
>> I've never seen anything like that before. I suppose it means that
>> something about 783f0cc64 is a bit fishy, but I don't know what.
>>
>> The emitted coverage report looks fairly normal anyway. It says
>> unicode_norm.c has zero test coverage, which is very possibly correct
>> since I wasn't running in UTF8 encoding, but I'm not entirely sure of
>> that either.
>
> I don't see it on mac nor on ubuntu64. I get 70.6% coverage of
> lines and 90.9% of functions on ubuntu.

I can see the problem on Debian sid with lcov 1.14-2. This points to
the second declaration of get_code_entry(). I think that genhtml,
because it considers the code of unicode_norm.c as a whole without its
CFLAGS, gets confused when it finds the same function indexed as
defined twice. It expects only one definition, hence the warning. So I
think that this can lead to some incorrect data in the HTML report, and
the attached patch takes care of fixing that. Tom, does it take care of
the issue on your side?
--
Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes:
> I can see the problem on Debian sid with lcov 1.14-2. This points to
> the second declaration of get_code_entry(). I think that genhtml,
> because it considers the code of unicode_norm.c as a whole without its
> CFLAGS, gets confused when it finds the same function indexed as
> defined twice. It expects only one definition, hence the warning. So I
> think that this can lead to some incorrect data in the HTML report, and
> the attached patch takes care of fixing that. Tom, does it take care of
> the issue on your side?

Good catch! Yeah, that fixes it for me.

I'd advise not putting conv_compare() between get_code_entry() and the
header comment for get_code_entry(). Looks good otherwise.

regards, tom lane
On Fri, Oct 23, 2020 at 08:24:06PM -0400, Tom Lane wrote:
> I'd advise not putting conv_compare() between get_code_entry() and
> the header comment for get_code_entry(). Looks good otherwise.

Indeed. I have adjusted the position of the comment, and applied the
fix. Thanks for the report.
--
Michael
Attachment
There is a latent bug in the way code pairs for recomposition are sorted due to a copy-pasto on my part. Makes no difference now, but it could in the future. While looking, it seems pg_bswap.h should actually be backend-only. Both fixed in the attached.
Attachment
On Fri, Nov 06, 2020 at 06:20:00PM -0400, John Naylor wrote:
> There is a latent bug in the way code pairs for recomposition are sorted
> due to a copy-pasto on my part. Makes no difference now, but it could in
> the future. While looking, it seems pg_bswap.h should actually be
> backend-only. Both fixed in the attached.

Thanks John. Both look right to me. I'll apply both in a bit.
--
Michael
Attachment
On Sat, Nov 07, 2020 at 09:29:30AM +0900, Michael Paquier wrote:
> Thanks John. Both look right to me. I'll apply both in a bit.

Done that now. Just for the note: you forgot to run pgperltidy.
--
Michael