Thread: Re: Optimization for lower(), upper(), casefold() functions.

Re: Optimization for lower(), upper(), casefold() functions.

From
Jeff Davis
Date:
On Tue, 2025-02-11 at 23:08 +0300, Alexander Borisov wrote:
> I tried the approach via a range table. The result was worse than
> without the table. With branching in a function, the result is
> better.
>
> Patch v3 — ranges binary search by branches.
> Patch v4 — ranges binary search by table.

Thoughts on v3:

It looks like the top 5 bits of the offset are unused. What if we used
those bits for flags to indicate:

   HAS_LOWER
   HAS_UPPER
   HAS_FOLD
   HAS_SPECIAL
   HAS_TITLE

That way, we only need to look in the corresponding table if it
actually has an entry other than the codepoint itself.

It doesn't leave a lot of room if the tables get larger, but if we are
worried about that, we could eliminate HAS_TITLE, because I don't think
the performance for INITCAP() is as important as LOWER/UPPER/CASEFOLD.

Regards,
    Jeff Davis




Re: Optimization for lower(), upper(), casefold() functions.

From
Alexander Borisov
Date:
19.02.2025 01:02, Jeff Davis пишет:
> On Tue, 2025-02-11 at 23:08 +0300, Alexander Borisov wrote:
>> I tried the approach via a range table. The result was worse than
>> without the table. With branching in a function, the result is
>> better.
>>
>> Patch v3 — ranges binary search by branches.
>> Patch v4 — ranges binary search by table.
> 
> Thoughts on v3:
> 
> It looks like the top 5 bits of the offset are unused. What if we used
> those bits for flags to indicate:
> 
>     HAS_LOWER
>     HAS_UPPER
>     HAS_FOLD
>     HAS_SPECIAL
>     HAS_TITLE
> 
> That way, we only need to look in the corresponding table if it
> actually has an entry other than the codepoint itself.
> 
> It doesn't leave a lot of room if the tables get larger, but if we are
> worried about that, we could eliminate HAS_TITLE, because I don't think
> the performance for INITCAP() is as important as LOWER/UPPER/CASEFOLD.
> 
> Regards,
>     Jeff Davis
> 

It seems to be micro-optimizations. In the sense that the code is very
clear and understandable now.
These micro-optimizations make the code more complex, and there will be
no performance gain.

In proposing the patch for v3, I struck a balance between improving
performance and reducing binary size, without sacrificing code clarity.

We can go into micro-optimizations now and lose the essence of
the proposed improvements. And ahead of us we have
Unicode Normalization Forms.

I'm willing to try different approaches, but there's confidence that
we've found a great optimization right now.


--
Alexander Borisov




Re: Optimization for lower(), upper(), casefold() functions.

From
Jeff Davis
Date:
On Wed, 2025-02-19 at 01:54 +0300, Alexander Borisov wrote:
> In proposing the patch for v3, I struck a balance between improving
> performance and reducing binary size, without sacrificing code
> clarity.

Fair enough. I will continue reviewing v3.

Regards,
    Jeff Davis




Re: Optimization for lower(), upper(), casefold() functions.

From
Alexander Borisov
Date:
19.02.2025 01:56, Jeff Davis пишет:
> On Wed, 2025-02-19 at 01:54 +0300, Alexander Borisov wrote:
>> In proposing the patch for v3, I struck a balance between improving
>> performance and reducing binary size, without sacrificing code
>> clarity.
> 
> Fair enough. I will continue reviewing v3.

Did you have a time for review this?

I'd like to continue improving Unicode in Postgres, as I previously
wrote, next in my plans are Normalization forms, and more.
But now I am blocked by this patch.


--
Alexander Borisov




Re: Optimization for lower(), upper(), casefold() functions.

From
Jeff Davis
Date:
On Sun, 2025-03-02 at 23:33 +0300, Alexander Borisov wrote:
> Did you have a time for review this?
>
> I'd like to continue improving Unicode in Postgres, as I previously
> wrote, next in my plans are Normalization forms, and more.
> But now I am blocked by this patch.

Hi,

I have refactored unicode_case.c a bit (v3j-0001) and rebased your v3
work on top of that (v3j-0002).

The refactoring is so that the optimizations do not need to modify
convert_case, which is already complex and I'd like to avoid adding
more to that function. Instead, I created a casemap() function, which
maps a single chracter, and convert_case() calls that.

I didn't test the refactoring for performance, but it looks as
optimizable as what was there before.

A couple questions:

* Is there a reason the fast-path for codepoints < 0x80 is in
unicode_case.c rather than unicode_case_func.h?

* Is there a reason you defined case_index() as static rather than
static inline?

* Is there a reason to have a new file unicode_case_func.h rather than
just add it to unicode_case_table.h?

I'm looking at a few more details, but this is a low-risk change
because there are exhaustive tests, so I intend to commit something
like this soon.

Regards,
    Jeff Davis


Attachment

Re: Optimization for lower(), upper(), casefold() functions.

From
Alexander Borisov
Date:
Hi Jeff,

12.03.2025 07:05, Jeff Davis wrote:
> On Sun, 2025-03-02 at 23:33 +0300, Alexander Borisov wrote:

[...]

> I have refactored unicode_case.c a bit (v3j-0001) and rebased your v3
> work on top of that (v3j-0002).
> 
> The refactoring is so that the optimizations do not need to modify
> convert_case, which is already complex and I'd like to avoid adding
> more to that function. Instead, I created a casemap() function, which
> maps a single chracter, and convert_case() calls that.
> 
> I didn't test the refactoring for performance, but it looks as
> optimizable as what was there before.

I like the proposed changes.
Here are some thoughts and small improvements:

I've modified patch v3j-0001 a bit.
Specifically:
1. Added static for casemap() function. Otherwise the compiler could not
optimize the code and the performance dropped significantly.
2. Added a fast path for codepoint < 0x80.

v3j-0002:
In the fast path for codepoints < 0x80, I added a premature return.
This avoided additional insertions, which increased performance.

> A couple questions:
> 
> * Is there a reason the fast-path for codepoints < 0x80 is in
> unicode_case.c rather than unicode_case_func.h?

Yes, this is an important optimization, below are benchmarks that
confirm it (I mean the lack of fast path in patch v3j-0001). Also we
share logic (0) case conversion (1) get/find value from table.

> * Is there a reason you defined case_index() as static rather than
> static inline?

There is no reason, but we can make it static inline, but that won't do
anything, the compiler will optimize it anyway. Perhaps for general
beauty it should be made static inline, I don't have a rigid position
here. Benchmark showed that there is no difference in performance.

> * Is there a reason to have a new file unicode_case_func.h rather than
> just add it to unicode_case_table.h?

I was purely based on existing approaches in Postgres, the
Normalization Forms have them separated into different headers. Just
trying to be consistent with existing approaches.

> I'm looking at a few more details, but this is a low-risk change
> because there are exhaustive tests, so I intend to commit something
> like this soon.

Thanks for the patch review!

Regards,
Alexander Borisov

Attachment

Re: Optimization for lower(), upper(), casefold() functions.

From
Alexander Borisov
Date:
12.03.2025 19:55, Alexander Borisov wrote:

[...]

>> A couple questions:
>>
>> * Is there a reason the fast-path for codepoints < 0x80 is in
>> unicode_case.c rather than unicode_case_func.h?
> 
> Yes, this is an important optimization, below are benchmarks that

[...]

I forgot to add the benchmark:

Benchmark

Anatation:
Patch v3j_0001: patch v3j_0001 without any changes.
Patch v3j_0001 + static: patch v3j_0001 with adding static to casemap().
Patch v3j_0001 + static + fast path: patch v3j_0001 with adding static
     to casemap() and fast path for codepoint < 0x80.
v4_0001: v3j_0001 + static + fast path
v4_0001 + v3j_0002: v3j_0002 patch unchanged, but with static for
     casemap() (inherited from v4_0001 patch)
v4_0001 + v4_0002: changed fast path for codepoint < 0x80. Made fast
     return to avoid unnecessary checks.
Without: current version of the code without any patches.

All results are in tps.

* macOS 15.1 (Apple M3 Pro) (Apple clang version 16.0.0)

ASCII:
Repeated characters (700kb) in the range from 0x20 to 0x7E.
Patch v3j_0001: 201.029609
Patch v3j_0001 + static: 247.155438
Patch v3j_0001 + static + fast path: 267.941047
Patch v4_0001 + v3j_0002: 260.737601
Patch v4_0001 + v4_0002: 268.913658
Without: 260.437508

Cyrillic:
Repeated characters (1MB) in the range from 0x0410 to 0x042F.
Patch v3j_0001: 48.130350
Patch v3j_0001 + static: 49.365897
Patch v3j_0001 + static + fast path: 48.891842
Patch v4_0001 + v3j_0002: 88.833061
Patch v4_0001 + v4_0002: 88.329603
Without: 47.869687

Unicode:
A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
(excluding 0xD800..0xDFFF).
Patch v3j_0001: 91.333557
Patch v3j_0001 + static: 92.464786
Patch v3j_0001 + static + fast path: 91.359428
Patch v4_0001 + v3j_0002: 103.390609
Patch v4_0001 + v4_0002: 102.819763
Without: 92.279652


Regards,
Alexander Borisov



Re: Optimization for lower(), upper(), casefold() functions.

From
Jeff Davis
Date:
On Wed, 2025-03-12 at 19:55 +0300, Alexander Borisov wrote:
> 1. Added static for casemap() function. Otherwise the compiler could
> not
> optimize the code and the performance dropped significantly.

Oops, it was static, but I made it external just to see what code it
generated. I didn't intend to publish it as an external function --
thank you for catching that!

> 2. Added a fast path for codepoint < 0x80.
>
> v3j-0002:
> In the fast path for codepoints < 0x80, I added a premature return.
> This avoided additional insertions, which increased performance.

What do you mean "additional insertions"?

Also, should we just compute the results in the fast path? We don't
even need a table. Rough patch attached to go on top of v4-0001.

Should we properly return CASEMAP_SELF when *simple == u1, or is it ok
to return CASEMAP_SIMPLE? It probably doesn't matter performance-wise,
but it feels more correct to return CASEMAP_SELF.

>
> Perhaps for general
> beauty it should be made static inline, I don't have a rigid position
> here.

We ordinarily use "static inline" if it's in a header file, and
"static" if it's in a .c file, so I'll do it that way.

> I was purely based on existing approaches in Postgres, the
> Normalization Forms have them separated into different headers. Just
> trying to be consistent with existing approaches.

I think that was done for normalization primarily because it's not used
#ifndef FRONTEND (see unicode_norm.c), and perhaps also because it's
just a more complex function worthy of its own file.

I looked into the history, and commit 783f0cc64d explains why perfect
hashing is not used in the frontend:

"The decomposition table remains the same, getting used for the binary
search in the frontend code, where we care more about the size of the
libraries like libpq over performance..."

>
Regards,
    Jeff Davis


Attachment

Re: Optimization for lower(), upper(), casefold() functions.

From
Alexander Borisov
Date:
12.03.2025 22:39, Jeff Davis wrote:

[...]

>> 2. Added a fast path for codepoint < 0x80.
>>
>> v3j-0002:
>> In the fast path for codepoints < 0x80, I added a premature return.
>> This avoided additional insertions, which increased performance.
> 
> What do you mean "additional insertions"?

Sorry for my English. I mean, we immediately do a return in the
if () condition. To avoid further branching/checking.

> Also, should we just compute the results in the fast path? We don't
> even need a table. Rough patch attached to go on top of v4-0001.
> 
> Should we properly return CASEMAP_SELF when *simple == u1, or is it ok
> to return CASEMAP_SIMPLE? It probably doesn't matter performance-wise,
> but it feels more correct to return CASEMAP_SELF.

It seems to disrupt the overall "beauty" of the approach. Thus, we will
copy code (bloat code), make optimizations that do not improve
performance but bloat code. I would refrain from such practices.
Especially since we'll be changing all that in the next patch (v4-0002).

>>
>> Perhaps for general
>> beauty it should be made static inline, I don't have a rigid position
>> here.
> 
> We ordinarily use "static inline" if it's in a header file, and
> "static" if it's in a .c file, so I'll do it that way.

Great, I've changed this place. Performance has not changed in any way.

>> I was purely based on existing approaches in Postgres, the
>> Normalization Forms have them separated into different headers. Just
>> trying to be consistent with existing approaches.
> 
> I think that was done for normalization primarily because it's not used
> #ifndef FRONTEND (see unicode_norm.c), and perhaps also because it's
> just a more complex function worthy of its own file.
> 
> I looked into the history, and commit 783f0cc64d explains why perfect
> hashing is not used in the frontend:
> 
> "The decomposition table remains the same, getting used for the binary
> search in the frontend code, where we care more about the size of the
> libraries like libpq over performance..."

I removed the extra file (unicode_case_func.h). You are right, we should
not create unnecessary clutter.

v5 attached.

Regards,
Alexander Borisov

Attachment

Re: Optimization for lower(), upper(), casefold() functions.

From
Jeff Davis
Date:
On Wed, 2025-03-12 at 23:39 +0300, Alexander Borisov wrote:
> v5 attached.

Attached v6j.

* marked arrays as "static const" rather than just "static"
* ran pgindent
* changed data types where appropriate (uint32->pg_wchar)
* modified perl code so that it produces code that's already pgindented
* cleanup of perl code, removing unnecessary subroutines and variables
* added a few comments
* ran pgperltidy

Some of the perl code working with ranges still needs further cleanup
and explanation, though.

Also, I ran some of my own simple tests (mostly ASCII) and it showed
over 10% speedup. That combined with the smaller table sizes makes this
well worth it.

Regards,
    Jeff Davis


Attachment

Re: Optimization for lower(), upper(), casefold() functions.

From
Alexander Borisov
Date:
14.03.2025 06:43, Jeff Davis wrote:
> On Wed, 2025-03-12 at 23:39 +0300, Alexander Borisov wrote:
>> v5 attached.
> 
> Attached v6j.

I like the current changes.
Some comments:

[...]
> * modified perl code so that it produces code that's already pgindented
> * cleanup of perl code, removing unnecessary subroutines and variables

Does it make sense to arrange table creation in this way (lower, title,
upper, fold, special), don't you think it complicates code perception,
it turns out to be just copy-past.
Previously, the creation of these tables was done in a loop, which
required little code and understandable.

I tried adding a loop to create tables, and everything looks fine (v7).
Also removed unnecessary (hanging) global variables.

> * added a few comments
> * ran pgperltidy
> 
> Some of the perl code working with ranges still needs further cleanup
> and explanation, though.

It's all about not getting too carried away. In my vision these
subroutines (user-defined functions) will have to be moved to perl
module (.pm) in the future, after the patch is committed, so that the
code can be used for Normalization Forms.

So it seems that removing the branch_function() sub, or rather
simplifying it and writing it in place, was unnecessary.
But it is not terrible.

v7 attached.


Regards,
Alexander Borisov

Attachment