Thread: speed up unicode decomposition and recomposition
With the optimization for the Unicode normalization quick check committed, Michael Paquier suggested that I might do the same for decomposition as well. I wrote:
> I'll
> do some performance testing soon. Note that a 25kB increase in size could
> be present in frontend binaries as well in this case. While looking at
> decomposition, I noticed that recomposition does a linear search through
> all 6600+ entries, although it seems only about 800 are valid for that.
> That could be optimized as well now, since with hashing we have more
> flexibility in the ordering and can put the recomp-valid entries in front.
> I'm not yet sure if it's worth the additional complexity. I'll take a look
> and start a new thread.
The attached patch uses a perfect hash for codepoint decomposition, and for recomposing reduces the linear search from 6604 entries to 942.
The performance is very nice, and if I'd known better I would have done this first, since the decomp array is as big as the two quick check arrays put together:
Normalize, decomp only
select count(normalize(t, NFD)) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;
master patch
887ms 231ms
select count(normalize(t, NFD)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
generate_series(1,100000) as i
) s;
master patch
1110ms 208ms
Normalize, decomp+recomp (note: 100x less data)
select count(normalize(t, NFC)) from (
select md5(i::text) as t from
generate_series(1,1000) as i
) s;
master patch
194ms 50.6ms
select count(normalize(t, NFC)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
generate_series(1,1000) as i
) s;
master patch
137ms 39.4ms
Quick check is another 2x faster on top of previous gains, since it tests canonical class via the decomposition array:
-- all chars are quickcheck YES
select count(*) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s
where t is NFC normalized;
master patch
296ms 131ms
Some other considerations:
- As I alluded above, this adds ~26kB to libpq because of SASLPrep. Since the decomp array was reordered to optimize linear search, it can no longer be used for binary search. It's possible to build two arrays, one for frontend and one for backend, but that's additional complexity. We could also force frontend to do a linear search all the time, but that seems foolish. I haven't checked if it's possible to exclude the hash from backend's libpq.
- I could split out the two approaches into separate patches, but it'd be rather messy.
I'll add a CF entry for this.
Attachment
John Naylor <john.naylor@enterprisedb.com> writes:
> Some other considerations:
> - As I alluded above, this adds ~26kB to libpq because of SASLPrep. Since
> the decomp array was reordered to optimize linear search, it can no longer
> be used for binary search. It's possible to build two arrays, one for
> frontend and one for backend, but that's additional complexity. We could
> also force frontend to do a linear search all the time, but that seems
> foolish. I haven't checked if it's possible to exclude the hash from
> backend's libpq.

IIUC, the only place libpq uses this is to process a password-sized string
or two during connection establishment. It seems quite silly to add 26kB
in order to make that faster. Seems like a nice speedup on the backend
side, but I'd vote for keeping the frontend as-is.

regards, tom lane
On Wed, Oct 14, 2020 at 01:06:40PM -0400, Tom Lane wrote:
> John Naylor <john.naylor@enterprisedb.com> writes:
>> Some other considerations:
>> - As I alluded above, this adds ~26kB to libpq because of SASLPrep. Since
>> the decomp array was reordered to optimize linear search, it can no longer
>> be used for binary search. It's possible to build two arrays, one for
>> frontend and one for backend, but that's additional complexity. We could
>> also force frontend to do a linear search all the time, but that seems
>> foolish. I haven't checked if it's possible to exclude the hash from
>> backend's libpq.
>
> IIUC, the only place libpq uses this is to process a password-sized string
> or two during connection establishment. It seems quite silly to add
> 26kB in order to make that faster. Seems like a nice speedup on the
> backend side, but I'd vote for keeping the frontend as-is.

Agreed. Let's only use the perfect hash in the backend. It would be nice
to avoid an extra generation of the decomposition table for that, and a
table ordered by codepoints is easier to look at. How much do you think
the performance impact would be if the linear search didn't use the
most-optimized decomposition table?
--
Michael
Attachment
On Wed, Oct 14, 2020 at 8:25 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Oct 14, 2020 at 01:06:40PM -0400, Tom Lane wrote:
> IIUC, the only place libpq uses this is to process a password-sized string
> or two during connection establishment. It seems quite silly to add
> 26kB in order to make that faster. Seems like a nice speedup on the
> backend side, but I'd vote for keeping the frontend as-is.
Agreed. Let's only use the perfect hash in the backend. It would be
nice to avoid an extra generation of the decomposition table for that,
and a table ordered by codepoints is easier to look at. How much do
you think would be the performance impact if we don't use for the
linear search the most-optimized decomposition table?
With those points in mind and thinking more broadly, I'd like to try harder on recomposition. Even several times faster, recomposition is still orders of magnitude slower than ICU, as measured by Daniel Verite [1]. I only did it this way because I couldn't think of how to do the inverse lookup with a hash. But I think if we constructed the hash key like
pg_hton64((code1 << 32) | code2)
and on the Perl side do something like
pack('N',$code1) . pack('N',$code2)
that might work. Or something that looks more like the C side. And make sure to use the lowest codepoint for the result. That way, we can still keep the large decomp array ordered, making it easier to keep the current implementation in the front end, and hopefully getting even better performance in the backend.
John Naylor <john.naylor@enterprisedb.com> writes:
> With those points in mind and thinking more broadly, I'd like to try harder
> on recomposition. Even several times faster, recomposition is still orders
> of magnitude slower than ICU, as measured by Daniel Verite [1].

Huh. Has anyone looked into how they do it?

regards, tom lane
At Wed, 14 Oct 2020 23:06:28 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
> John Naylor <john.naylor@enterprisedb.com> writes:
> > With those points in mind and thinking more broadly, I'd like to try harder
> > on recomposition. Even several times faster, recomposition is still orders
> > of magnitude slower than ICU, as measured by Daniel Verite [1].
>
> Huh. Has anyone looked into how they do it?

I'm not sure this is the reason, but it may well be: ICU uses separate
tables for decomposition and composition, pointed to from a trie. That
table is consulted after trying algorithmic decomposition/composition
for, for example, Hangul. I didn't look into it any further, but just
for information, icu4c/source/common/normalizer2impl.cpp seems to do
that. For example, icu4c/source/common/norm2_nfc_data.h defines the
static data, and icu4c/source/common/normalizer2impl.h:244 points to a
design document on normalization:

http://site.icu-project.org/design/normalization/custom

> Old and New Implementation Details
>
> The old normalization data format (unorm.icu, ca. 2001..2009) uses
> three data structures for normalization: A trie for looking up 32-bit
> values for every code point, a 16-bit-unit array with decompositions
> and some other data, and a composition table (16-bit-unit array,
> linear search list per starter). The data is combined for all 4
> standard normalization forms: NFC, NFD, NFKC and NFKD.

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Oct 15, 2020 at 1:30 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Wed, 14 Oct 2020 23:06:28 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
> John Naylor <john.naylor@enterprisedb.com> writes:
> > With those points in mind and thinking more broadly, I'd like to try harder
> > on recomposition. Even several times faster, recomposition is still orders
> > of magnitude slower than ICU, as measured by Daniel Verite [1].
>
> Huh. Has anyone looked into how they do it?
I'm not sure this is the reason, but it may well be: ICU uses separate
tables for decomposition and composition, pointed to from a trie?
I think I've seen a trie recommended somewhere, maybe the official website. That said, I was able to get the hash working for recomposition (split into a separate patch, and both of them now leave frontend alone), and I'm pleased to say it's 50-75x faster than linear search in simple tests. I'd be curious how it compares to ICU now. Perhaps Daniel Verite would be interested in testing again? (CC'd)
select count(normalize(t, NFC)) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;
master patch
18800ms 257ms
select count(normalize(t, NFC)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
generate_series(1,100000) as i
) s;
master patch
13000ms 254ms
Attachment
On Thu, Oct 15, 2020 at 01:59:38PM -0400, John Naylor wrote:
> I think I've seen a trie recommended somewhere, maybe the official website.
> That said, I was able to get the hash working for recomposition (split into
> a separate patch, and both of them now leave frontend alone), and I'm
> pleased to say it's 50-75x faster than linear search in simple tests. I'd
> be curious how it compares to ICU now. Perhaps Daniel Verite would be
> interested in testing again? (CC'd)

Yeah, that would be interesting to compare. The gains proposed by this
patch are already a good step forward, so I don't think a comparison
should be a blocker for the solution we have at hand, as the numbers
speak for themselves. If something better gets proposed, we can always
change the decomposition and recomposition logic as needed.

> select count(normalize(t, NFC)) from (
>   select md5(i::text) as t from
>   generate_series(1,100000) as i
> ) s;
>
> master patch
> 18800ms 257ms

My environment was showing HEAD as being a bit faster with 15s, while
the patch gets "only" down to 290~300ms (compiled with -O2, as I guess
you did). Nice.

+ # Then the second
+ return -1 if $a2 < $b2;
+ return 1 if $a2 > $b2;

Should this say "second code point"?

+ hashkey = pg_hton64(((uint64) start << 32) | (uint64) code);
+ h = recompinfo.hash(&hashkey);

This choice should be documented, and most likely we should have
comments on the Perl and C sides to keep track of the relationship
between the two.

The binary sizes of libpgcommon_shlib.a and libpgcommon.a change because
Decomp_hash_func() gets included, impacting libpq. Structurally,
wouldn't it be better to move this part into its own, backend-only,
header? It could be possible to paint the difference with some ifdef
FRONTEND of course, or just keep things as they are, because this could
be useful for some out-of-core frontend tool. But if we keep that as a
separate header, then any C file can decide whether to include it, so
frontend tools could also make this choice. Note that we don't include
unicode_normprops_table.h for frontends in unicode_norm.c, but
unicode_norm_table.h is included there.
--
Michael
Attachment
John Naylor wrote:
> I'd be curious how it compares to ICU now

I've made another run of the test in [1] with your v2 patches from this
thread against icu_ext built with ICU-67.1. The results show the times
in milliseconds to process about 10 million short strings:

 operation  | unpatched | patched | icu_ext
------------+-----------+---------+---------
 nfc check  |      7968 |    5989 |    4076
 nfc conv   |    700894 |   15163 |    6808
 nfd check  |     16399 |    9852 |    3849
 nfd conv   |     17391 |   10916 |    6758
 nfkc check |      8259 |    6092 |    4301
 nfkc conv  |    700241 |   15354 |    7034
 nfkd check |     16585 |   10049 |    4038
 nfkd conv  |     17587 |   11109 |    7086

The ICU implementation still wins by a large margin, but the
improvements brought by the patch are gorgeous, especially for the
conversion to NFC/NFKC. This test ran on a slower machine than what I
used for [1], which is why all queries take longer. For the two queries
upthread, I get this:

1)
select count(normalize(t, NFC)) from (
  select md5(i::text) as t from
  generate_series(1,100000) as i
) s;
 count
--------
 100000
(1 row)

Time: 371.043 ms

vs ICU:

select count(icu_normalize(t, 'NFC')) from (
  select md5(i::text) as t from
  generate_series(1,100000) as i
) s;
 count
--------
 100000
(1 row)

Time: 125.809 ms

2)
select count(normalize(t, NFC)) from (
  select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
  generate_series(1,100000) as i
) s;
 count
--------
 100000
(1 row)

Time: 428.214 ms

vs ICU:

select count(icu_normalize(t, 'NFC')) from (
  select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
  generate_series(1,100000) as i
) s;
 count
--------
 100000
(1 row)

Time: 147.713 ms

[1] https://www.postgresql.org/message-id/2c5e8df9-43b8-41fa-88e6-286e8634f00a%40manitou-mail.org

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite
On Thu, Oct 15, 2020 at 11:32 PM Michael Paquier <michael@paquier.xyz> wrote:
The binary sizes of libpgcommon_shlib.a and libpgcommon.a change
because Decomp_hash_func() gets included, impacting libpq.
I don't see any difference on gcc/Linux in those two files, nor in unicode_norm_shlib.o -- I do see a difference in unicode_norm_srv.o as expected. Could it depend on the compiler?
On Fri, Oct 16, 2020 at 2:08 PM Daniel Verite <daniel@manitou-mail.org> wrote:
John Naylor wrote:
> I'd be curious how it compares to ICU now
I've made another run of the test in [1] with your v2 patches
from this thread against icu_ext built with ICU-67.1.
The results show the times in milliseconds to process
about 10 million short strings:
Thanks for testing!
On Mon, Oct 19, 2020 at 10:34:33AM -0400, John Naylor wrote:
> I don't see any difference on gcc/Linux in those two files, nor in
> unicode_norm_shlib.o -- I do see a difference in unicode_norm_srv.o as
> expected. Could it depend on the compiler?

Hmm. My guess is that you don't have --enable-debug in your set of
configure options? It is not unusual to have this one enabled for GCC
even on production systems, and the size of the libs is impacted in
this case with your patch.
--
Michael
Attachment
On Tue, Oct 20, 2020 at 3:22 AM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Oct 19, 2020 at 10:34:33AM -0400, John Naylor wrote:
> I don't see any difference on gcc/Linux in those two files, nor in
> unicode_norm_shlib.o -- I do see a difference in unicode_norm_srv.o as
> expected. Could it depend on the compiler?
Hmm. My guess is that you don't have --enable-debug in your set of
configure options? It is not unusual to have this one enabled for GCC
even on production systems, and the size of the libs is impacted in
this case with your patch.
On Tue, Oct 20, 2020 at 08:03:12AM -0400, John Naylor wrote:
> I've confirmed that. How about a new header unicode_norm_hashfunc.h which
> would include unicode_norm_table.h at the top. In unicode.c, we can include
> one of these depending on frontend or backend.

Sounds good to me. Looking at the code, I would just generate the second
file within generate-unicode_norm_table.pl.
--
Michael
Attachment
Attached v3 addressing review points below:
On Thu, Oct 15, 2020 at 11:32 PM Michael Paquier <michael@paquier.xyz> wrote:
+ # Then the second
+ return -1 if $a2 < $b2;
+ return 1 if $a2 > $b2;
Should say "second code point" here?
Done. Also changed the tiebreaker to the composed codepoint. Beforehand, it was the index into DecompMain[], which is only equivalent if the list is in order (it is but we don't want correctness to depend on that), and not very clear.
+ hashkey = pg_hton64(((uint64) start << 32) | (uint64) code);
+ h = recompinfo.hash(&hashkey);
This choice should be documented, and most likely we should have
comments on the perl and C sides to keep track of the relationship
between the two.
Done.
<separate headers>
Other cosmetic changes:
- format recomp array comments like /* U+0045+032D -> U+1E18 */
- make sure to comment #endif's that are far from the #if
- small whitespace fixes
Note: for the new header I simply adapted from unicode_norm_table.h the choice of "There is deliberately not an #ifndef PG_UNICODE_NORM_TABLE_H here", although I must confess I'm not sure what the purpose of that is, in this case.
--
Attachment
There was a mistake in v3 with pgindent/exclude_file_patterns, fixed in v4.
Attachment
On Wed, Oct 21, 2020 at 07:35:12PM -0400, John Naylor wrote:
> There was a mistake in v3 with pgindent/exclude_file_patterns, fixed in v4.

Thanks for the updated version, that was fast. I have found a couple of
places that needed to be adjusted, like the comment at the top of
generate-unicode_norm_table.pl and some other comments, an incorrect
include in the new headers, and Perl indentation that was not right (we
use perltidy v20170521; see the README in src/tools/pgindent). Except
for that, this looks good to me. Attached is the updated version with
all my tweaks, which I would like to commit. If there are any comments,
please feel free of course.
--
Michael
Attachment
On Thu, Oct 22, 2020 at 12:34 AM Michael Paquier <michael@paquier.xyz> wrote:
Thanks for the updated version, that was fast. I have found a couple
of places that needed to be adjusted, like the comment at the top of
generate-unicode_norm_table.pl or some comments, an incorrect include
in the new headers and the indentation was not right in perl (we use
perltidy v20170521, see the README in src/tools/pgindent).
Except that, this looks good to me. Attached is the updated version
with all my tweaks, that I would like to commit. If there are any
comments, please feel free of course.
Looks good to me.
On Thu, Oct 22, 2020 at 05:50:52AM -0400, John Naylor wrote:
> Looks good to me.

Thanks. Committed, then. Great work!
--
Michael
Attachment
On Thu, Oct 22, 2020 at 10:11 PM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Oct 22, 2020 at 05:50:52AM -0400, John Naylor wrote:
> Looks good to me.
Thanks. Committed, then. Great work!
I chanced to do an --enable-coverage test run today, and I got this
weird message during "make coverage-html":

genhtml: WARNING: function data mismatch at /home/postgres/pgsql/src/common/unicode_norm.c:102

I've never seen anything like that before. I suppose it means that
something about 783f0cc64 is a bit fishy, but I don't know what.

The emitted coverage report looks fairly normal anyway. It says
unicode_norm.c has zero test coverage, which is very possibly correct
since I wasn't running in UTF8 encoding, but I'm not entirely sure of
that either.

This is with RHEL8's lcov-1.13-4.el8 package. I suppose the first
question is: does anybody else see that?

regards, tom lane
> On Oct 23, 2020, at 9:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I chanced to do an --enable-coverage test run today, and I got this
> weird message during "make coverage-html":
>
> genhtml: WARNING: function data mismatch at /home/postgres/pgsql/src/common/unicode_norm.c:102
>
> I've never seen anything like that before. I suppose it means that
> something about 783f0cc64 is a bit fishy, but I don't know what.
>
> The emitted coverage report looks fairly normal anyway. It says
> unicode_norm.c has zero test coverage, which is very possibly correct
> since I wasn't running in UTF8 encoding, but I'm not entirely sure of
> that either.
>
> This is with RHEL8's lcov-1.13-4.el8 package. I suppose the first
> question is does anybody else see that?

I don't see it on mac nor on ubuntu64. I get 70.6% coverage of lines
and 90.9% of functions on ubuntu.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Oct 23, 2020 at 04:18:13PM -0700, Mark Dilger wrote:
> On Oct 23, 2020, at 9:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> genhtml: WARNING: function data mismatch at /home/postgres/pgsql/src/common/unicode_norm.c:102
>>
>> I've never seen anything like that before. I suppose it means that
>> something about 783f0cc64 is a bit fishy, but I don't know what.
>>
>> The emitted coverage report looks fairly normal anyway. It says
>> unicode_norm.c has zero test coverage, which is very possibly correct
>> since I wasn't running in UTF8 encoding, but I'm not entirely sure of
>> that either.
>
> I don't see it on mac nor on ubuntu64. I get 70.6% coverage of
> lines and 90.9% of functions on ubuntu.

I can see the problem on Debian sid with lcov 1.14-2. This points to
the second declaration of get_code_entry(). I think that genhtml,
because it considers the code of unicode_norm.c as a whole without its
CFLAGS, gets confused when it finds the same function indexed as
defined twice. It expects only one definition, hence the warning. So I
think that this can lead to some incorrect data in the HTML report, and
the attached patch takes care of fixing that. Tom, does it take care of
the issue on your side?
--
Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes:
> I can see the problem on Debian sid with lcov 1.14-2. This points to
> the second declaration of get_code_entry(). I think that genhtml,
> because it considers the code of unicode_norm.c as a whole without its
> CFLAGS, gets confused when it finds the same function indexed as
> defined twice. It expects only one definition, hence the warning. So I
> think that this can lead to some incorrect data in the HTML report, and
> the attached patch takes care of fixing that. Tom, does it take care of
> the issue on your side?

Good catch! Yeah, that fixes it for me.

I'd advise not putting conv_compare() between get_code_entry() and the
header comment for get_code_entry(). Looks good otherwise.

regards, tom lane
On Fri, Oct 23, 2020 at 08:24:06PM -0400, Tom Lane wrote:
> I'd advise not putting conv_compare() between get_code_entry() and
> the header comment for get_code_entry(). Looks good otherwise.

Indeed. I have adjusted the position of the comment, and applied the
fix. Thanks for the report.
--
Michael
Attachment
There is a latent bug in the way code pairs for recomposition are sorted due to a copy-pasto on my part. Makes no difference now, but it could in the future. While looking, it seems pg_bswap.h should actually be backend-only. Both fixed in the attached.
Attachment
On Fri, Nov 06, 2020 at 06:20:00PM -0400, John Naylor wrote:
> There is a latent bug in the way code pairs for recomposition are sorted
> due to a copy-pasto on my part. Makes no difference now, but it could in
> the future. While looking, it seems pg_bswap.h should actually be
> backend-only. Both fixed in the attached.

Thanks John. Both look right to me. I'll apply both in a bit.
--
Michael
Attachment
On Sat, Nov 07, 2020 at 09:29:30AM +0900, Michael Paquier wrote:
> Thanks John. Both look right to me. I'll apply both in a bit.

Done that now. Just for the note: you forgot to run pgperltidy.
--
Michael