On 30/01/2025 15:39, Alexander Borisov wrote:
> The code is fixed; the patch now passes all tests.
>
> Changes from the original patch (v1):
> Reduced the main table from 3003 to 1677 records (duplicates removed).
> Added records from 0x00 to 0x80 for the fast path.
> Renamed the get_case() function to pg_unicode_case_index().
>
> Benchmark numbers have changed a bit.
Nice results!
> Because of the modified approach to searching for records in the
> table, we managed to:
> 1. Remove the stored Unicode codepoints (unsigned int) from all tables.
> 2. Reduce the main table from 3003 to 1575 records (duplicates removed).
> 3. Replace the pointer (essentially a uint64_t) with a uint8_t in the
> main table.
> 4. Reduce the time to find a record in the table.
> 5. Reduce the size of the final object file.
>
> The approach is generally as follows:
> Group Unicode codepoints into ranges in which the difference between
> neighboring elements does not exceed the specified limit.
> For example, if there are numbers 1, 2, 3, 5, 6 and limit = 1, then
> there is a difference of 2 between 3 and 5, which is greater than 1,
> so there will be ranges 1-3 and 5-6.
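To make the grouping rule concrete, a minimal standalone sketch (purely
illustrative; in the patch this step happens in the table generator, not
in C):

#include <stdint.h>
#include <stdio.h>

/*
 * Split a sorted list of codepoints into ranges in which the gap
 * between neighboring elements does not exceed "limit".  With the
 * input {1, 2, 3, 5, 6} and limit = 1 this prints "1-3" and "5-6".
 */
static void
print_ranges(const uint32_t *cps, int n, uint32_t limit)
{
	int			start = 0;

	for (int i = 1; i <= n; i++)
	{
		if (i == n || cps[i] - cps[i - 1] > limit)
		{
			printf("%u-%u\n", cps[start], cps[i - 1]);
			start = i;
		}
	}
}

int
main(void)
{
	uint32_t	cps[] = {1, 2, 3, 5, 6};

	print_ranges(cps, 5, 1);
	return 0;
}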
>
> Then we form a table (let's call it an index table) by combining the
> obtained ranges. The table contains uint16_t indexes into the main
> table.
>
> Then, from the previously obtained ranges, we form a function
> (get_case()) to get the index into the main table. The function, in
> fact, contains only if/else if constructs imitating a binary search.
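In other words, the generated code presumably has roughly this shape
(range boundaries and table contents below are invented, only to
illustrate the structure):

#include <stdint.h>

/*
 * Toy version of the scheme: the index table holds uint16_t indexes
 * into the deduplicated main table, and the generated function maps a
 * codepoint to a slot in it with if/else if range checks (an unrolled
 * binary search).
 */
static const uint16_t toy_case_index[] = {
	/* range 0x0100-0x0103 */ 0x80, 0x81, 0x80, 0x81,
	/* range 0x0388-0x038A */ 0x82, 0x82, 0x83,
};

static inline uint16_t
toy_case_lookup(uint32_t cp)
{
	if (cp < 0x80)
		return (uint16_t) cp;	/* fast path: records 0x00-0x7F sit at
								 * the start of the main table */
	else if (cp >= 0x0100 && cp <= 0x0103)
		return toy_case_index[cp - 0x0100];
	else if (cp >= 0x0388 && cp <= 0x038A)
		return toy_case_index[cp - 0x0388 + 4];
	return 0;					/* fallthrough: "no mapping" entry */
}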
>
> Because we are not accessing the main data table directly, we can
> exclude duplicates from it, and they make up almost half of the
> records.
> Also, because get_case() contains all the information about the
> Unicode ranges, we don't need to store Unicode codepoints in the main
> table.
> This approach also allowed us to remove some checks, which improved
> performance even for the fast path (codepoints < 0x80).
Did you consider using a radix tree? We use that method in
src/backend/utils/mb/Unicode/convutils.pm. I'm not sure if that's better
or worse than what's proposed here, but it would seem like a more
standard technique at least. Or if this is clearly better, then maybe we
should switch to this technique in convutils.pm too. A perfect hash
would be another alternative, we use that in src/common/unicode_norm.c.
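For reference, a radix lookup replaces the branches with a chain of
table indexing; very roughly (this is not the layout convutils.pm
actually generates, just the general idea):

#include <stdint.h>

/*
 * Two-level radix lookup: the high bits of the codepoint select a
 * 256-entry leaf page, the low bits select an entry in it, and
 * identical leaf pages are shared between blocks.
 */
static const uint16_t leaf_empty[256] = {0};
static const uint16_t leaf_0100[256] = {[0x00] = 0x80, [0x01] = 0x81};

static const uint16_t *const level1[] = {
	leaf_empty,					/* codepoints 0x0000-0x00FF */
	leaf_0100,					/* codepoints 0x0100-0x01FF */
	/* ... one pointer per 256-codepoint block ... */
};

static inline uint16_t
radix_lookup(uint32_t cp)
{
	uint32_t	block = cp >> 8;

	if (block >= sizeof(level1) / sizeof(level1[0]))
		return 0;
	return level1[block][cp & 0xFF];
}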
Did you check that these optimizations still win with Unicode version 16
(https://www.postgresql.org/message-id/146349e4-4687-4321-91af-f235572490a8@eisentraut.org)?
We haven't updated to that yet, but sooner or later we will.
The way you're defining 'pg_unicode_case_index' as a function in the
header file won't work: every .c file that includes the header would get
its own copy of the definition, leading to multiple-definition errors at
link time. It needs to be a static inline function if it's in the
header. Or put it in a .c file.
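I.e. something like this in the header (the signature and body here are
placeholders, not taken from the patch):

/* unicode_case.h: if the definition stays in the header, it must be
 * static inline so that each including .c file gets an internal copy */
static inline uint16_t
pg_unicode_case_index(uint32_t cp)
{
	/* generated if/else if range checks would go here */
	return 0;
}

Otherwise, keep only a plain declaration in the header and move the body
to a .c file.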
Some ideas on how to squeeze this further:
- Instead of having one table that contains Lower/Title/Upper/Fold for
every character, it might be better to have four separate tables. I
think that would be more cache-friendly: you typically run one of the
functions for many different characters in a loop, rather than all of
the functions for the same character. You could deduplicate between the
tables too: for many ranges of characters, Title=Upper and Lower=Fold.
- The characters are stored as 4-byte integers, but the high byte is
always 0. Could squeeze those out. Not sure if that would be a win if it
makes the accesses unaligned, but you could benchmark that.
Alternatively, use that empty byte to store the 'special_case' index,
instead of having a separate field for it.
- Many characters that have a special case only need the special case
for some of the functions, not all. If you stored the special_case
separately for each function (as the high byte in the 'simplemap' field
perhaps, like I suggested in the previous point), you could avoid having
those "dummy" special cases. A rough sketch of that layout follows
below.
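Something like this, roughly (all names and values invented, just to
illustrate the layout):

#include <stdint.h>

/*
 * One array per operation (lower/title/upper/fold).  Each entry packs
 * the mapped codepoint into the low 24 bits and that operation's
 * special_case index into the high byte, so there is no separate
 * special_case field and no dummy special cases for operations that
 * don't need one.
 */
typedef uint32_t pg_case_entry;	/* 0xSSCCCCCC */

#define CASE_ENTRY(cp, special) (((uint32_t) (special) << 24) | (uint32_t) (cp))
#define ENTRY_CODEPOINT(e)		((e) & 0x00FFFFFF)
#define ENTRY_SPECIAL(e)		((e) >> 24)

/* Title/Upper (and often Lower/Fold) could share an array where the
 * mappings coincide. */
static const pg_case_entry lower_map[] = {
	CASE_ENTRY(0x0061, 0),		/* U+0041 -> U+0061, no special case */
	CASE_ENTRY(0x0069, 1),		/* U+0130 -> U+0069 U+0307, special case #1 */
	/* ... */
};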
--
Heikki Linnakangas
Neon (https://neon.tech)