Thread: Improve the performance of Unicode Normalization Forms.

Improve the performance of Unicode Normalization Forms.

From: Alexander Borisov
Hi, hackers!

As promised, I continue to improve/speed up Unicode in Postgres.
Last time, we improved the lower(), upper(), and casefold()
functions. [1]
Now it's time for Unicode Normalization Forms, specifically
the normalize() function.

The attachment contains three patches:
1. Move the functions that build a range index into a separate
    Perl module.
2-small. In this patch, the mapping tables are smaller, but lookups
    need more branches to find the desired value.
2-big. In this patch, the tables are larger, but lookups need fewer
    branches to find the desired value.

In other words, we need to choose between the two variants of the
second patch: small or big.

What did we manage to achieve?
1. The uint32 field that stored the Unicode codepoint was removed from
    the main structure/table; it was only needed for perfect hashing,
    which the range index replaces (see the sketch below).
2. Thanks to the first point, we were able to get rid of duplicates
    and shrink the main table from 6843 to 3943 entries.
3. We also removed duplicates from the decomposition codepoint table,
    shrinking it from 5138 to 4304 uint32 entries.
4. These changes increase speed by up to 40% in some cases
    (benchmarks below).
5. The server part "lost weight" in the binary, but the frontend
    "gained weight" a little.

I read the old commits, which say that binary size is very important
for the frontend and that speed is not (speed matters on the server).
I'm not quite sure what to do if this is really the case. Perhaps
we should keep the slow version for the frontend.

Benchmarks.

How was it tested?
Four files were created, one for each normalization form: NFC, NFD,
NFKC, and NFKD.
The files were run through pgbench; they contain all the code points
that need to be normalized.
Unfortunately, the patches are already quite large, but if necessary,
I can send these files in a separate email or upload them somewhere.

Legend.
Patch (big table) — larger mapping tables, but fewer branches per
     lookup.
Patch (small table) — smaller mapping tables, but more branches per
     lookup.
Without — stock Postgres, without the patches.

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

NFC:
Patch (big table): tps = 5057.624509
Patch (small table): tps = 4727.158048
Without: tps = 3268.654867

NFD:
Patch (big table): tps = 1145.730247
Patch (small table): tps = 1073.084633
Without: tps = 689.627239

NFKC:
Patch (big table): tps = 4821.700702
Patch (small table): tps = 4662.640629
Without: tps = 3450.254581

NFKD:
Patch (big table): tps = 1142.644341
Patch (small table): tps = 951.405893
Without: tps = 636.161496

How the size of object files and libraries has changed (in bytes):
Patch (big table):
unicode_norm.o: 138896
unicode_norm_srv.o: 191336
libpq.a: 364782
libpq.so.5.18: 480944
libpgcommon.a: 584432
libpgcommon_shlib.a: 568724

Patch (small table):
unicode_norm.o: 86416
unicode_norm_srv.o: 138824
libpq.a: 364782
libpq.so.5.18: 423600
libpgcommon.a: 531952
libpgcommon_shlib.a: 516244

Without:
unicode_norm.o: 79488
unicode_norm_srv.o: 165504
libpq.a: 364782
libpq.so.5.18: 419288
libpgcommon.a: 525024
libpgcommon_shlib.a: 509316

[1] 
https://www.postgresql.org/message-id/flat/7cac7e66-9a3b-4e3f-a997-42aa0c401f80%40gmail.com

--
Best regards,
Alexander Borisov


Re: Improve the performance of Unicode Normalization Forms.

From: John Naylor
On Tue, Jun 3, 2025 at 1:51 PM Alexander Borisov <lex.borisov@gmail.com> wrote:
> 5. The server part "lost weight" in the binary, but the frontend
>     "gained weight" a little.
>
> I read the old commits, which say that binary size is very important
> for the frontend and that speed is not (speed matters on the server).
> I'm not quite sure what to do if this is really the case. Perhaps
> we should keep the slow version for the frontend.

In the "small" patch, the frontend files got a few kB bigger, but the
backend got quite a bit smaller. If we decided to go with this patch,
I'd say it's preferable to do it in a way that keeps both paths the
same.

> How was it tested?
> Four files were created, one for each normalization form: NFC, NFD,
> NFKC, and NFKD.
> The files were run through pgbench; they contain all the code points
> that need to be normalized.
> Unfortunately, the patches are already quite large, but if necessary,
> I can send these files in a separate email or upload them somewhere.

What kind of workload do they present?
Did you consider running the same tests from the thread that led to
the current implementation?

--
John Naylor
Amazon Web Services



Re: Improve the performance of Unicode Normalization Forms.

From: Alexander Borisov
11.06.2025 10:13, John Naylor wrote:
> On Tue, Jun 3, 2025 at 1:51 PM Alexander Borisov <lex.borisov@gmail.com> wrote:
>> 5. The server part "lost weight" in the binary, but the frontend
>>      "gained weight" a little.
>>
>> I read the old commits, which say that binary size is very important
>> for the frontend and that speed is not (speed matters on the server).
>> I'm not quite sure what to do if this is really the case. Perhaps
>> we should keep the slow version for the frontend.
> 
> In the "small" patch, the frontend files got a few kB bigger, but the
> backend got quite a bit smaller. If we decided to go with this patch,
> I'd say it's preferable to do it in a way that keeps both paths the
> same.

Okay, then I'll leave the frontend unchanged so that the size remains
the same. The changes will only affect the backend.

>> How was it tested?
>> Four files were created, one for each normalization form: NFC, NFD,
>> NFKC, and NFKD.
>> The files were run through pgbench; they contain all the code points
>> that need to be normalized.
>> Unfortunately, the patches are already quite large, but if necessary,
>> I can send these files in a separate email or upload them somewhere.
> 
> What kind of workload do they present?
> Did you consider running the same tests from the thread that led to
> the current implementation?

I found the performance tests in this discussion:
https://www.postgresql.org/message-id/CAFBsxsHUuMFCt6-pU+oG-F1==CmEp8wR+O+bRouXWu6i8kXuqA@mail.gmail.com
Below are the results.

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

1.

Normalize, decomp only

select count(normalize(t, NFD)) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;

Patch (big table): 279.858 ms
Patch (small table): 282.925 ms
Without: 444.118 ms


2.

select count(normalize(t, NFD)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D',
       i % 3 + 1) as t from
generate_series(1,100000) as i
) s;

Patch (big table): 219.858 ms
Patch (small table): 247.893 ms
Without: 376.906 ms


3.

Normalize, decomp+recomp

select count(normalize(t, NFC)) from (
select md5(i::text) as t from
generate_series(1,1000) as i
) s;

Patch (big table): 7.553 ms
Patch (small table): 7.876 ms
Without: 13.177 ms


4.

select count(normalize(t, NFC)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D',
       i % 3 + 1) as t from
generate_series(1,1000) as i
) s;

Patch (big table): 5.765 ms
Patch (small table): 6.782 ms
Without: 10.800 ms


5.

Quick check has not changed because these patches do not affect it:

-- all chars are quickcheck YES
select count(*) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s
where t is nfc normalized;

Patch (big table): 29.477 ms
Patch (small table): 29.436 ms
Without: 29.378 ms


From these tests, we see up to a 2x speedup in some cases.


--
Best regards,
Alexander Borisov



Re: Improve the performance of Unicode Normalization Forms.

From: John Naylor
On Wed, Jun 11, 2025 at 7:27 PM Alexander Borisov <lex.borisov@gmail.com> wrote:
>
> 11.06.2025 10:13, John Naylor wrote:
> > On Tue, Jun 3, 2025 at 1:51 PM Alexander Borisov <lex.borisov@gmail.com> wrote:
> >> 5. The server part "lost weight" in the binary, but the frontend
> >>      "gained weight" a little.
> >>
> >> I read the old commits, which say that binary size is very important
> >> for the frontend and that speed is not (speed matters on the server).
> >> I'm not quite sure what to do if this is really the case. Perhaps
> >> we should keep the slow version for the frontend.
> >
> > In the "small" patch, the frontend files got a few kB bigger, but the
> > backend got quite a bit smaller. If we decided to go with this patch,
> > I'd say it's preferable to do it in a way that keeps both paths the
> > same.
>
> Okay, then I'll leave the frontend unchanged so that the size remains
> the same. The changes will only affect the backend.

Sorry, by "both paths" I meant making the frontend and backend the
same, because it's good for testing. In the "small table" patch, libpq
etc. increase by about 1%, which is negligible. unicode_norm.o is only
bigger by 7kB -- that seems okay to me, especially considering
unicode_norm_srv.o is smaller by 27kB.

> From these tests, we see up to a 2x speedup in some cases.

Nice!

--
John Naylor
Amazon Web Services



Re: Improve the performance of Unicode Normalization Forms.

From: Jeff Davis
On Tue, 2025-06-03 at 00:51 +0300, Alexander Borisov wrote:
> As promised, I continue to improve/speed up Unicode in Postgres.
> Last time, we improved the lower(), upper(), and casefold()
> functions. [1]
> Now it's time for Unicode Normalization Forms, specifically
> the normalize() function.

Did you compare against other implementations, such as ICU's
normalization functions? There's also a rust crate here:

https://github.com/unicode-rs/unicode-normalization

that might have been optimized.

In addition to the lookups themselves, there are other opportunities
for optimization as well, such as:

* reducing the need for palloc and extra buffers, perhaps by using
buffers on the stack for small strings (see the sketch after this
list)

* operate more directly on UTF-8 data rather than decoding and
re-encoding the entire string
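
A minimal sketch of the first idea, with a hypothetical threshold and
names; the memcpy merely stands in for the real normalization work:

#include <stdlib.h>
#include <string.h>

#define NORM_STACKBUF_SIZE 128      /* hypothetical threshold */

/*
 * Short inputs are processed in an on-stack scratch buffer, so the
 * common case pays no allocation; longer inputs fall back to the heap.
 */
static int
normalize_small(const char *input, size_t len)
{
    char        stackbuf[NORM_STACKBUF_SIZE];
    char       *buf = stackbuf;

    if (len > sizeof(stackbuf))
    {
        buf = malloc(len);
        if (buf == NULL)
            return -1;
    }

    memcpy(buf, input, len);        /* placeholder for normalization */

    if (buf != stackbuf)
        free(buf);
    return 0;
}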

Regards,
    Jeff Davis




Re: Improve the performance of Unicode Normalization Forms.

From: Alexander Borisov
19.06.2025 20:41, Jeff Davis wrote:
> On Tue, 2025-06-03 at 00:51 +0300, Alexander Borisov wrote:
>> As promised, I continue to improve/speed up Unicode in Postgres.
>> Last time, we improved the lower(), upper(), and casefold()
>> functions. [1]
>> Now it's time for Unicode Normalization Forms, specifically
>> the normalize() function.
> 
> Did you compare against other implementations, such as ICU's
> normalization functions? There's also a rust crate here:
> 
> https://github.com/unicode-rs/unicode-normalization
> 
> that might have been optimized.

I don't quite see how this compares to the Rust implementation. At
the link provided, they use a perfect hash, which is exactly what I
got rid of to get the 2x boost.
As for the ICU implementation in C++, I have always found it slow, at
least when used from C code.
I may well run benchmarks comparing the performance of the Postgres
approach with ICU, but that is beyond the scope of the patches under
discussion.

I want to emphasize that the patches I posted don't change the
normalization code/logic.
We only change how the right codepoints are found in the tables, and
that is what gives us the performance boost.
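
For comparison, a rough sketch (illustrative, not the actual Postgres
source) of why the perfect-hash approach forced every entry to carry
its uint32 codepoint: the hash maps non-member codepoints to some slot
too, so each hit must be verified against a stored key.

#include <stdint.h>

typedef struct
{
    uint32_t    codepoint;      /* key kept only to verify the hit */
    /* ... decomposition data would follow here ... */
} hashed_entry;

/*
 * A perfect hash sends a codepoint that is NOT in the table to some
 * slot anyway, so the stored key has to be compared to reject it.
 * Dropping that key (and its duplicates) only becomes possible once
 * the lookup itself can prove membership, as the range index does.
 */
static const hashed_entry *
hash_lookup(const hashed_entry *table, int nentries,
            int (*hash) (uint32_t), uint32_t cp)
{
    int         h = hash(cp);

    if (h < 0 || h >= nentries)
        return NULL;
    if (table[h].codepoint != cp)   /* the unavoidable verification */
        return NULL;
    return &table[h];
}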

> In addition to the lookups themselves, there are other opportunities
> for optimization as well, such as:
> 
> * reducing the need for palloc and extra buffers, perhaps by using
> buffers on the stack for small strings
> 
> * operate more directly on UTF-8 data rather than decoding and re-
> encoding the entire string

I absolutely agree with you: the normalization code is very well
written, but far from optimized.
I didn't send changes to the normalization code itself, to avoid
lumping everything together and to make the review easier.
In keeping with my plan for optimizing the normalization forms, I
planned to discuss those C-code optimizations in the next
“Improve performance...” iteration.


-- 
Regards,
Alexander Borisov



Re: Improve the performance of Unicode Normalization Forms.

From: Jeff Davis
On Fri, 2025-06-20 at 11:31 -0500, Nico Williams wrote:
> In the slow path you only normalize the _current character_, so you
> only need enough buffer space for that.

That's a clear win for UTF8 data. Also, if there are no changes, then
you can just return the input buffer and not bother allocating an
output buffer.

> The really nice thing about form-insensitive/form-preserving
> functionality is that it is form-preserving rather than normalizing
> on create and lookup, and that makes the fast-path described above
> feasible.  Whereas if you normalize on create and lookup you have to
> heap allocate enough space for each string normalized.

Non-deterministic ICU collations are already insensitive to most
normalization differences. Some differences are not handled when it
involves too many combining marks, but you can use the "full
normalization" option to address that. See:

https://www.postgresql.org/docs/current/collation.html#ICU-COLLATION-SETTINGS-TABLE

> The other nice thing is that f-i/f-p behavior is a lot less
> surprising to users -- the input methods they use don't matter.

Postgres is already form-preserving; it does not auto-normalize. (I
have suggested that we might want to offer something like that, but
that would be a user choice.)

Currently, the non-deterministic collations (which offer form-
insensitivity) are not available at the database level, so you have to
explicitly specify the COLLATE clause on a column or query. In other
words, Postgres is not form-insensitive by default, though there is
work to make that possible.

> What motivated this f-i/f-p behavior was that HFS+ used NFD (well,
> something very close to NFD) but input methods (even on OS X)
> invariably produce NFC (well, something close to it), at least for
> Latin scripts.  This means that on OS X if you use filenames from
> directory listings those will be NFD while user inputs will be NFC,
> and so you can't just memcmp()/strcmp() those -- you have to
> normalize _or_ use form-insensitive string comparison, but nothing
> did that 20 years ago.  Thus doing the form-insensitivity in the
> filesystem seemed best, and if you do that you can be form-preserving
> to enable the optimization described above.

Databases have similar concerns as a filesystem in this respect.

Regards,
    Jeff Davis




Re: Improve the performance of Unicode Normalization Forms.

From: Jeff Davis
On Fri, 2025-06-20 at 17:51 +0300, Alexander Borisov wrote:
> I don't quite see how this compares to the Rust implementation. At
> the link provided, they use a perfect hash, which is exactly what I
> got rid of to get the 2x boost.
> As for the ICU implementation in C++, I have always found it slow, at
> least when used from C code.
> I may well run benchmarks comparing the performance of the Postgres
> approach with ICU, but that is beyond the scope of the patches under
> discussion.

Are you saying that, with these patches, Postgres will offer the
fastest open-source Unicode normalization? If so, that would be very
cool.

The reason I'm asking is because, if there are multiple open source
implementations, we should either have the best one, or just borrow
another one as long as it has a suitable license (perhaps translating
to C as necessary).


Regards,
    Jeff Davis




Re: Improve the performance of Unicode Normalization Forms.

From: Nico Williams
On Fri, Jun 20, 2025 at 10:15:47AM -0700, Jeff Davis wrote:
> On Fri, 2025-06-20 at 11:31 -0500, Nico Williams wrote:
> > In the slow path you only normalize the _current character_, so you
> > only need enough buffer space for that.
> 
> That's a clear win for UTF8 data. Also, if there are no changes, then
> you can just return the input buffer and not bother allocating an
> output buffer.

The latter is not relevant to string comparison or hashing, but, yeah,
if you have to produce a normalized string you can optimistically
assume it is already normalized and defer allocation until you know it
isn't.
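
A rough sketch of that optimistic fast path, with a trivial stand-in
for the real quick check (names here are hypothetical):

#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a real per-unit quick check. */
static int
unit_is_normalized(unsigned char c)
{
    return c < 0x80;            /* pretend plain ASCII never changes */
}

/*
 * Scan the input assuming it is already normalized and return it
 * unchanged if so; allocate an output buffer only once the first
 * offending position is found, keeping the verified prefix.  (In this
 * sketch the caller frees the result only when it differs from the
 * input.)
 */
static const char *
normalize_lazy(const char *input, size_t len)
{
    size_t      i;
    char       *out;

    for (i = 0; i < len; i++)
        if (!unit_is_normalized((unsigned char) input[i]))
            break;

    if (i == len)
        return input;           /* fast path: no allocation at all */

    out = malloc(len + 1);      /* real code would size for growth */
    if (out == NULL)
        return NULL;
    memcpy(out, input, i);      /* keep the already-verified prefix */
    memcpy(out + i, input + i, len - i);    /* placeholder for the
                                             * actual normalization */
    out[len] = '\0';
    return out;
}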

> Postgres is already form-preserving; it does not auto-normalize. (I
> have suggested that we might want to offer something like that, but
> that would be a user choice.)

Excellent, then I would advise looking into adding form-insensitive
string comparison and hashing to get f-i/f-p behavior.
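
A hedged sketch of what that could look like, with a hypothetical
normalizer standing in for the real one (shown as identity purely for
illustration):

#include <string.h>

/*
 * Hypothetical: writes the normalized form of (s, len) into out
 * (assumed large enough) and returns the normalized length.
 */
static size_t
to_normalized(const char *s, size_t len, char *out)
{
    memcpy(out, s, len);        /* identity stand-in */
    return len;
}

/*
 * Form-insensitive equality: two strings are equal iff they normalize
 * to the same bytes.  A production version would normalize
 * incrementally instead of materializing both full strings, and a
 * form-insensitive hash would hash the normalized bytes the same way.
 */
static int
fi_equal(const char *a, size_t alen, const char *b, size_t blen,
         char *scratch_a, char *scratch_b)
{
    size_t      na = to_normalized(a, alen, scratch_a);
    size_t      nb = to_normalized(b, blen, scratch_b);

    return na == nb && memcmp(scratch_a, scratch_b, na) == 0;
}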

> Currently, the non-deterministic collations (which offer form-
> insensitivity) are not available at the database level, so you have to
> explicitly specify the COLLATE clause on a column or query. In other
> words, Postgres is not form-insensitive by default, though there is
> work to make that possible.

TIL.  Thanks.

> Databases have similar concerns as a filesystem in this respect.

I figured :)



Re: Improve the performance of Unicode Normalization Forms.

From: Jeff Davis
On Tue, 2025-06-24 at 18:20 +0300, Alexander Borisov wrote:
> That's what we're aiming for - to implement the fastest approach.

Awesome! Thank you for clarifying this as a goal. Having the fastest
open-source Unicode normalization would be a great thing to highlight
when this is done.

Regards,
    Jeff Davis