Thread: Re: Optimization for lower(), upper(), casefold() functions.

Re: Optimization for lower(), upper(), casefold() functions.

From

Alexander Borisov

Date:

29 January, 23:52:59

Sorry, I made a mistake in the code. It's not worth watching this patch yet.

29.01.2025 23:23, Alexander Borisov пишет:
> Hi, hackers!
> 
> I propose to consider a simple optimization for Unicode case tables.
> The main changes affect the generate-unicode_case_table.pl file.
> 
> Because of the modified approach of record search by table we
> managed to:
> 1. Removed storing Unicode codepoints (unsigned int) in all tables.
> 2. Reduce the main table from 3003 to 1575 (duplicates removed) records.
> 3. Replace pointer (essentially uint64_t) with uin8_t in the main table.
> 4. Reduced the time to find a record in the table.
> 5. Reduce the size of the final object file.
> 
> The approach is generally as follows:
> Group Unicode codepoints into ranges in which the difference between
> neighboring elements does not exceed the specified limit.
> For example, if there are numbers 1, 2, 3, 5, 6 and limit = 1, then
> there is a difference of 2 between 3 and 5, which is greater than 1,
> so there will be ranges 1-3 and 5-6.
> 
> Then we form a table (let's call it an index table) by combining the
> obtained ranges. The table contains uint16_t index to the main table.
> 
> Then from the previously obtained diapasons we form a function
> (get_case()) to get the index to the main table. The function, in fact,
> contains only IF/ELSE IF constructs imitating binary search.
> 
> Because we are not directly accessing the main table with data, we can
> exclude duplicates from it, and there are almost half of them.
> Also, because get_case() contains all the information about Unicode
> ranges, we don't need to store Unicode codepoints in the main table.
> Also because of this approach some checks were removed, which allowed
> to increase performance even with fast path (codepoints < 0x80).
> 
> casefold() test.
> 
> * macOS 15.1 (Apple M3 Pro) (Apple clang version 16.0.0)
> 
> ASCII:
> Repeated characters (700kb) in the range from 0x20 to 0x7E.
> Patch: tps = 282.457745
> Without: tps = 263.749652
> 
> Cyrillic:
> Repeated characters (1MB) in the range from 0x0410 to 0x042F.
> Patch: tps = 82.399637
> Without: tps = 48.291034
> 
> Unicode:
> A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
> (excluding 0xD800..0xDFFF).
> Patch: tps = 120.703471
> Without: tps = 92.423490
> 
> * Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)
> 
> ASCII:
> Repeated characters (700kb) in the range from 0x20 to 0x7E.
> Patch: tps = 172.291972
> Without: tps = 111.592281
> 
> Cyrillic:
> Repeated characters (1MB) in the range from 0x0410 to 0x042F.
> Patch: tps = 36.487650
> Without: tps = 22.537515
> 
> Unicode:
> A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
> (excluding 0xD800..0xDFFF).
> Patch: tps = 55.190635
> Without: tps = 45.493104
> 
> 


-- 

Alexander Borisov

Re: Optimization for lower(), upper(), casefold() functions.

From

Heikki Linnakangas

Date:

31 January, 01:43:15

On 30/01/2025 15:39, Alexander Borisov wrote:
> The code is fixed, now the patch passes all tests.
> 
> Change from the original patch (v1):
> Reduce the main table from 3003 to 1677 (duplicates removed) records.
> Added records from 0x00 to 0x80 for fast path.
> Renamed get_case() function to pg_unicode_case_index().
> 
> Benchmark numbers have changed a bit.

Nice results!

> Because of the modified approach of record search by table we
> managed to:
> 1. Removed storing Unicode codepoints (unsigned int) in all tables.
> 2. Reduce the main table from 3003 to 1575 (duplicates removed) records.
> 3. Replace pointer (essentially uint64_t) with uin8_t in the main table.
> 4. Reduced the time to find a record in the table.
> 5. Reduce the size of the final object file.
> 
> The approach is generally as follows:
> Group Unicode codepoints into ranges in which the difference between
> neighboring elements does not exceed the specified limit.
> For example, if there are numbers 1, 2, 3, 5, 6 and limit = 1, then
> there is a difference of 2 between 3 and 5, which is greater than 1,
> so there will be ranges 1-3 and 5-6.
> 
> Then we form a table (let's call it an index table) by combining the
> obtained ranges. The table contains uint16_t index to the main table.
> 
> Then from the previously obtained diapasons we form a function
> (get_case()) to get the index to the main table. The function, in fact,
> contains only IF/ELSE IF constructs imitating binary search.
> 
> Because we are not directly accessing the main table with data, we can
> exclude duplicates from it, and there are almost half of them.
> Also, because get_case() contains all the information about Unicode
> ranges, we don't need to store Unicode codepoints in the main table.
> Also because of this approach some checks were removed, which allowed
> to increase performance even with fast path (codepoints < 0x80).

Did you consider using a radix tree? We use that method in 
src/backend/utils/mb/Unicode/convutils.pm. I'm not sure if that's better 
or worse than what's proposed here, but it would seem like a more 
standard technique at least. Or if this is clearly better, then maybe we 
should switch to this technique in convutils.pm too. A perfect hash 
would be another alternative, we use that in src/common/unicode_norm.c.

Did you check that these optimizations still win with Unicode version 16 
(https://www.postgresql.org/message-id/146349e4-4687-4321-91af-f235572490a8@eisentraut.org)? 
We haven't updated to that yet, but sooner or later we will.

The way you're defining 'pg_unicode_case_index' as a function in the 
header file won't work. It needs to be a static inline function if it's 
in the header. Or put it in a .c file.

Some ideas on how to squeeze this further:

- Instead of having one table that contains Lower/Title/Upper/Fold for 
every character, it might be better to have four separate tables. I 
think that would be more cache-friendly: you typically run one of the 
functions for many different characters in a loop, rather than all of 
the functions for the same character. You could deduplicate between the 
tables too: for many ranges of characters, Title=Upper and Lower=Fold.

- The characters are stored as 4-byte integers, but the high byte is 
always 0. Could squeeze those out. Not sure if that would be a win if it 
makes the accesses unaligned, but you could benchmark that. 
Alternatively, use that empty byte to store the 'special_case' index, 
instead of having a separate field for it.

- Many characters that have a special case only need the special case 
for some of the functions, not all. If you stored the special_case 
separately for each function (as the high byte in the 'simplemap' field 
perhaps, like I suggested on previous point), you could avoid having 
those "dummy" special cases.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Re: Optimization for lower(), upper(), casefold() functions.

From

Heikki Linnakangas

Date:

14 March, 14:16:42

On 14/03/2025 05:43, Jeff Davis wrote:
> On Wed, 2025-03-12 at 23:39 +0300, Alexander Borisov wrote:
>> v5 attached.
> 
> Attached v6j.
> 
> * marked arrays as "static const" rather than just "static"
> * ran pgindent
> * changed data types where appropriate (uint32->pg_wchar)
> * modified perl code so that it produces code that's already pgindented
> * cleanup of perl code, removing unnecessary subroutines and variables
> * added a few comments
> * ran pgperltidy
> 
> Some of the perl code working with ranges still needs further cleanup
> and explanation, though.
> 
> Also, I ran some of my own simple tests (mostly ASCII) and it showed
> over 10% speedup. That combined with the smaller table sizes makes this
> well worth it.

Looks good overall.

> static const pg_wchar case_map_lower[1677] =
> {
>     0x000000,                    /* U+000000 */
>     0x000000,                    /* U+000000 */
>     0x000001,                    /* U+000001 */
>     0x000002,                    /* U+000002 */

The duplicated 0x000000 looks wrong. I understand that the 0'th entry is 
reserved, and the actual codepoints start at index 1, but the /* 
U+000000 */ comment on the 0'th entry is misleading.

> static const uint8 case_map_special[1677] =
> {
>     0x000000,                    /* U+000000 */
>     0x000000,                    /* U+000000 */
>     ...

0x000000 implies an 24-bit integer, but these are uint8's. Let's use 
plain base-10 decimals here rather than hex, like in 'case_map'.

Attached are fixes for those and some other minor things.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Optimization for lower(), upper(), casefold() functions.

From

Jeff Davis

Date:

14 March, 20:55:33

On Fri, 2025-03-14 at 13:16 +0200, Heikki Linnakangas wrote:
> Attached are fixes for those and some other minor things.

Thank you, I agree and I have applied your changes.

Regards,
    Jeff Davis