Re: Improve the performance of Unicode Normalization Forms. - Mailing list pgsql-hackers

From Alexander Borisov
Subject Re: Improve the performance of Unicode Normalization Forms.
Date
Msg-id cfd504f7-1fc1-43df-9356-f68818f30921@gmail.com
In response to Re: Improve the performance of Unicode Normalization Forms.  (John Naylor <johncnaylorls@gmail.com>)
Responses Re: Improve the performance of Unicode Normalization Forms.
List pgsql-hackers
11.06.2025 10:13, John Naylor wrote:
> On Tue, Jun 3, 2025 at 1:51 PM Alexander Borisov <lex.borisov@gmail.com> wrote:
>> 5. The server part "lost weight" in the binary, but the frontend
>>      "gained weight" a little.
>>
>> I read the old commits, which say that the size of the frontend is very
>> important and that speed is not important
>> (speed is important on the server).
>> I'm not quite sure what to do if this is really the case. Perhaps
>> we should leave the slow version for the frontend.
> 
> In the "small" patch, the frontend files got a few kB bigger, but the
> backend got quite a bit smaller. If we decided to go with this patch,
> I'd say it's preferable to do it in a way that keeps both paths the
> same.

Okay, then I'll leave the frontend unchanged so that the size remains
the same. The changes will only affect the backend.

>> How was it tested?
>> Four files were created for each normalization form: NFC, NFD, NFKC,
>> and NFKD.
>> The files were sent via pgbench. The files contain all code points that
>> need to be normalized.
>> Unfortunately, the patches are already quite large, but if necessary,
>> I can send these files in a separate email or upload them somewhere.
> 
> What kind of workload do they present?
> Did you consider running the same tests from the thread that led to
> the current implementation?

I found the performance tests in this discussion:
https://www.postgresql.org/message-id/CAFBsxsHUuMFCt6-pU+oG-F1==CmEp8wR+O+bRouXWu6i8kXuqA@mail.gmail.com
Below are the results of running them.

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

1.

Normalize, decomp only

select count(normalize(t, NFD)) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;

Patch (big table): 279.858 ms
Patch (small table): 282.925 ms
Without: 444.118 ms


2.

select count(normalize(t, NFD)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
generate_series(1,100000) as i
) s;

Patch (big table): 219.858 ms
Patch (small table): 247.893 ms
Without: 376.906 ms
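
To illustrate why tests 1 and 2 stress the decomposition path differently, here is a minimal Python sketch. It is not part of the patch or of the benchmark; it only uses Python's standard unicodedata module to inspect the two kinds of input. The variable names are made up for illustration, and Python's Unicode tables may be a different Unicode version than the one PostgreSQL is built with.

```python
import hashlib
import unicodedata

# Test 1 input: md5 hex digests, which are pure ASCII.
# (Stand-in for md5(i::text) with i = 1.)
ascii_input = hashlib.md5(b"1").hexdigest()

# Test 2 input: the ten code points from the repeat(...) query.
mixed_input = "\u00E4\u00C5\u0958\u00F4\u1EBF\u3300\u1FE2\u3316\u2465\u322D"


def nfd(s: str) -> str:
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


for name, s in (("md5 text", ascii_input), ("mixed text", mixed_input)):
    out = nfd(s)
    # Report how many code points go in and how many come out.
    print(f"{name}: {len(s)} code points in, {len(out)} out, "
          f"changed={out != s}")
```

ASCII input passes through NFD unchanged, so test 1 mostly measures the per-code-point lookup cost, while in test 2 several code points expand into a base character plus combining marks.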


3.

Normalize, decomp+recomp

select count(normalize(t, NFC)) from (
select md5(i::text) as t from
generate_series(1,1000) as i
) s;

Patch (big table): 7.553 ms
Patch (small table): 7.876 ms
Without: 13.177 ms


4.

select count(normalize(t, NFC)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
generate_series(1,1000) as i
) s;

Patch (big table): 5.765 ms
Patch (small table): 6.782 ms
Without: 10.800 ms


5.

Quick check has not changed because these patches do not affect it:

-- all chars are quickcheck YES
select count(*) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;

Patch (big table): 29.477 ms
Patch (small table): 29.436 ms
Without: 29.378 ms


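From these tests, we see a speedup of roughly 1.5x to 1.9x on the
normalization queries (close to 2x in the best case), while quick check
is unchanged.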
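
For anyone who wants to recompute the ratios, the following small Python sketch derives the speedup for each test from the timings reported in this message. The timings are read as milliseconds with a fractional part, and the dictionary keys are just labels chosen for this example.

```python
# Timings in ms, copied from the results in this message:
# (patch with big table, patch with small table, without patch)
timings = {
    "1. NFD, md5 text": (279.858, 282.925, 444.118),
    "2. NFD, mixed text": (219.858, 247.893, 376.906),
    "3. NFC, md5 text": (7.553, 7.876, 13.177),
    "4. NFC, mixed text": (5.765, 6.782, 10.800),
    "5. quick check": (29.477, 29.436, 29.378),
}


def speedup(baseline_ms: float, patched_ms: float) -> float:
    """How many times faster the patched run is than the baseline."""
    return baseline_ms / patched_ms


for name, (big, small, without) in timings.items():
    print(f"{name}: big table {speedup(without, big):.2f}x, "
          f"small table {speedup(without, small):.2f}x")
```

A ratio near 1.00x, as for the quick check query, means the patch makes no measurable difference there.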


--
Best regards,
Alexander Borisov


