Re: Optimization for lower(), upper(), casefold() functions. - Mailing list pgsql-hackers

From Alexander Borisov
Subject Re: Optimization for lower(), upper(), casefold() functions.
Date
Msg-id 5cf176fa-44b9-4019-b9b7-ee902352a844@gmail.com
Whole thread Raw
In response to Re: Optimization for lower(), upper(), casefold() functions.  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Optimization for lower(), upper(), casefold() functions.
List pgsql-hackers
Hi Jeff,

06.02.2025 00:46, Jeff Davis пишет:
> On Tue, 2025-02-04 at 23:19 +0300, Alexander Borisov wrote:
>> I've done many different experiments and everywhere the result is
>> within
>> the margin of the v2 patch result.
> 
> Great, thank you for working on this!
> 
> There doesn't appear to be a downside. Even though it's more complex,
> we have exhaustive tests to compare with ICU, so that should catch any
> correctness issues.
> 
> Heikki mentioned the radix tree, so I'd be interested to know what the
> trade-offs there are. I don't have a strong opinion, but I'd like to be
> able to explain why we use a radix tree for encoding conversions and
> the generated branches approach in this patch for case mapping.

I don't think I understand you correctly, but I'll try to answer.

In Postgres I found several approaches to mapping.
1. Perfect Hash
2. Radix tree
3. Binary search

For Unicode Normalization, approach 1 and 3 are used.
For Unicode Case, approach 3 is used.
For Encoding, approach 2 is used.

Since I started to improve Unicode Case, I used the same approach,
essentially a binary search, only not by individual values, but by
ranges.
This allowed me to get rid of storing the original codepoints and reduce
the tables considerably.

I plan to use the same approach for Unicode Normalization. Acting
sequentially and suggesting improvements.
Unfortunately I'm new here and I can't say why one place uses one
approach and another another - I just don't know.


> 
> Also, I have a question: when there are deeply-nested "if" statements,
> like in this patch, can that create a burden on the branch predictor
> that affects code elsewhere?

We can do an experiment to radically reduce the number of branches in
the case_index() function.
In src/common/unicode/generate-unicode_case_table.pl we simply specify
a large range of possible empty values, from the current 500 to 5000.
my $range = make_range(\@codepoints, 5000);

The table size will increase from 4k to 20k. The size of the object
file will still be much smaller than the original.
The number of branches will be 7, and the depth will be no more than 2.
As a result, the benchmarks won't change much, it will be better, but
everything is within the margin of error.

Yes, you might think that the number of branches has decreased, but
the table has grown and is not so cacheable. But it is not so.
The 4k table itself (initial) is not small.
As I've already written, it's essentially a binary search, but not by
value, but by range of values (well, and immediately expanded in if).


--
Alexander Borisov




pgsql-hackers by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: Better visualization of default values
Next
From: Daniel Gustafsson
Date:
Subject: Re: Remove unnecessary static specifier