Thread: Re: Optimization for lower(), upper(), casefold() functions.

Re: Optimization for lower(), upper(), casefold() functions.

From: Jeff Davis
On Tue, 2025-02-04 at 23:19 +0300, Alexander Borisov wrote:
> I've done many different experiments and everywhere the result is
> within
> the margin of the v2 patch result.

Great, thank you for working on this!

There doesn't appear to be a downside. Even though it's more complex,
we have exhaustive tests to compare with ICU, so that should catch any
correctness issues.

Heikki mentioned the radix tree, so I'd be interested to know what the
trade-offs there are. I don't have a strong opinion, but I'd like to be
able to explain why we use a radix tree for encoding conversions and
the generated branches approach in this patch for case mapping.

Also, I have a question: when there are deeply-nested "if" statements,
like in this patch, can that create a burden on the branch predictor
that affects code elsewhere?

Regards,
    Jeff Davis




Re: Optimization for lower(), upper(), casefold() functions.

From: Alexander Borisov
Hi Jeff,

On 06.02.2025 00:46, Jeff Davis wrote:
> On Tue, 2025-02-04 at 23:19 +0300, Alexander Borisov wrote:
>> I've done many different experiments and everywhere the result is
>> within
>> the margin of the v2 patch result.
> 
> Great, thank you for working on this!
> 
> There doesn't appear to be a downside. Even though it's more complex,
> we have exhaustive tests to compare with ICU, so that should catch any
> correctness issues.
> 
> Heikki mentioned the radix tree, so I'd be interested to know what the
> trade-offs there are. I don't have a strong opinion, but I'd like to be
> able to explain why we use a radix tree for encoding conversions and
> the generated branches approach in this patch for case mapping.

I'm not sure I understand you correctly, but I'll try to answer.

In Postgres I found several approaches to mapping:
1. Perfect Hash
2. Radix tree
3. Binary search

For Unicode Normalization, approaches 1 and 3 are used.
For Unicode Case, approach 3 is used.
For Encoding, approach 2 is used.

Since I started with improving Unicode Case, I used the same approach
there: essentially a binary search, only over ranges of values rather
than individual values.
This allowed me to get rid of storing the original codepoints and to
shrink the tables considerably.
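
Roughly, the generated case_index() turns into something like the
sketch below (simplified and hand-written; the real function is
generated by generate-unicode_case_table.pl, covers many more ranges,
and nests the comparisons like a binary search; the ranges and offsets
here are only illustrative):

    #include <stdint.h>

    /* Simplified sketch of the generated branches; illustrative ranges.
     * Each range of codepoints maps into a packed table by subtracting
     * an offset, so the codepoints themselves need not be stored. */
    static inline int
    case_index(uint32_t cp)
    {
        if (cp < 0x0588)
            return (int) cp;                /* dense block at the start */
        if (cp >= 0x10A0 && cp < 0x1100)
            return (int) (cp - (0x10A0 - 0x0588));
        if (cp >= 0x1E900 && cp < 0x1E944)
            return (int) (cp - (0x1E900 - 0x11D3));
        return -1;                          /* no case mapping */
    }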

I plan to use the same approach for Unicode Normalization, proceeding
step by step and suggesting improvements as I go.
Unfortunately, I'm new here and can't say why one place uses one
approach and another place uses a different one -- I simply don't know.


> 
> Also, I have a question: when there are deeply-nested "if" statements,
> like in this patch, can that create a burden on the branch predictor
> that affects code elsewhere?

We can do an experiment to radically reduce the number of branches in
the case_index() function.
In src/common/unicode/generate-unicode_case_table.pl we simply allow a
larger number of empty values within a range, raising the limit from
the current 500 to 5000:

    my $range = make_range(\@codepoints, 5000);

The table size will increase from 4k to 20k, but the object file will
still be much smaller than the original one.
The number of branches becomes 7, with a nesting depth of no more
than 2.
As a result, the benchmarks barely change: slightly better, but within
the margin of error.

You might think that although the number of branches has decreased,
the table has grown and no longer caches as well. But that is not the
case: the initial 4k table is not small either.
As I've already written, this is essentially a binary search, just over
ranges of values rather than individual values (and expanded directly
into if statements).


--
Alexander Borisov




Re: Optimization for lower(), upper(), casefold() functions.

From: Jeff Davis
On Thu, 2025-02-06 at 18:39 +0300, Alexander Borisov wrote:
> Since I started to improve Unicode Case, I used the same approach,
> essentially a binary search, only not by individual values, but by
> ranges.

I considered it a 4th approach because of the generated branches in
case_index(). Case_index() is essentially a part of the data structure
-- you can't navigate the table without it.

By the way, could we just represent those ranges as another tiny table
instead of representing them as branches?

In other words, a table something like:

    {{.start=0x0000, .end=0x0588, .offset=0},
     {.start=0x10A0, .end=0x1100, .offset=0x10A0 - 0x0588},
     ...
     {.start=0x1E900, .end=0x1E944, .offset=0x1E900 - 0x11D3}}

How would that perform?
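
For concreteness, the kind of lookup I have in mind is something like
this (just a rough sketch; the struct and the extra parameters are made
up for illustration):

    #include <stdint.h>

    typedef struct
    {
        uint32_t    start;      /* first codepoint in the range */
        uint32_t    end;        /* one past the last codepoint */
        uint32_t    offset;     /* cp - offset = index into the main table */
    } case_range;

    /* Rough sketch: binary search over the range table instead of
     * generated branches; returns -1 if the codepoint has no mapping. */
    static int
    case_index(uint32_t cp, const case_range *ranges, int nranges)
    {
        int     lo = 0;
        int     hi = nranges - 1;

        while (lo <= hi)
        {
            int     mid = (lo + hi) / 2;

            if (cp < ranges[mid].start)
                hi = mid - 1;
            else if (cp >= ranges[mid].end)
                lo = mid + 1;
            else
                return (int) (cp - ranges[mid].offset);
        }
        return -1;
    }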

> I plan to use the same approach for Unicode Normalization. Acting
> sequentially and suggesting improvements.
> Unfortunately I'm new here and I can't say why one place uses one
> approach and another another - I just don't know.

We have 3 or 4 different solutions (naive binary search, range binary
search, perfect hashing, and radix tree) to 3 or 4 similar problems
(normalization, character properties, case mapping, and encoding
conversion). We should consolidate these approaches somehow.

IIUC, out of the solutions, it seems that either your modified binary
search or the radix tree is best, because they are both fast and both
allow compact tables.

My questions are: is there a clear winner between radix tree and the
range binary search for all four problems? If not, what are the trade-
offs?

Regards,
    Jeff Davis




Re: Optimization for lower(), upper(), casefold() functions.

From: Alexander Borisov
On 06.02.2025 22:08, Jeff Davis wrote:
> On Thu, 2025-02-06 at 18:39 +0300, Alexander Borisov wrote:
>> Since I started to improve Unicode Case, I used the same approach,
>> essentially a binary search, only not by individual values, but by
>> ranges.
> 
> I considered it a 4th approach because of the generated branches in
> case_index(). Case_index() is essentially a part of the data structure
> -- you can't navigate the table without it.
> 
> By the way, could we just represent those ranges as another tiny table
> instead of representing them as branches?
> 
> In other words, a table something like:
> 
>      {{.start=0x0000, .end=0x0588, .offset=0},
>       {.start=0x10A0, .end=0x1100, .offset=0x10A0 - 0x0588},
>       ...
>       {.start=0x1E900, .end=0x1E944, .offset=0x1E900 - 0x11D3}}
> 
> How would that perform?

I'll try this approach, but it seems like the only hope here is compiler
optimizations.


>> I plan to use the same approach for Unicode Normalization. Acting
>> sequentially and suggesting improvements.
>> Unfortunately I'm new here and I can't say why one place uses one
>> approach and another another - I just don't know.
> 
> We have 3 or 4 different solutions (naive binary search, range binary
> search, perfect hashing, and radix tree) to 3 or 4 similar problems
> (normalization, character properties, case mapping, and encoding
> conversion). We should consolidate these approaches somehow.
> 
> IIUC, out of the solutions, it seems that either your modified binary
> search or the radix tree are best because they are both fast and both
> allow compact tables.
> 
> My questions are: is there a clear winner between radix tree and the
> range binary search for all four problems? If not, what are the trade-
> offs?

I think it is already clear that only one thing is left: to test how
the radix tree and the range binary search behave in both Encoding and
Unicode Case.
That is, implement both approaches for both problems.
For Encoding I'm thinking of taking big5 to utf-8 (the table there is
big) for testing.
The main thing is to figure out how to benchmark it correctly
(pgbench).
Okay, I'll do that.

But it's worth remembering that there is no such thing as a silver
bullet.


--
Alexander Borisov