Sorry, I made a mistake in the code. The patch is not worth reviewing yet.
29.01.2025 23:23, Alexander Borisov wrote:
> Hi, hackers!
>
> I propose to consider a simple optimization for Unicode case tables.
> The main changes affect the generate-unicode_case_table.pl file.
>
> By changing how records are looked up in the tables, we managed to:
> 1. Stop storing Unicode codepoints (unsigned int) in all tables.
> 2. Reduce the main table from 3003 to 1575 records (duplicates removed).
> 3. Replace the pointer (essentially uint64_t) with uint8_t in the main table.
> 4. Reduce the time needed to find a record in the table.
> 5. Reduce the size of the final object file.
>
> The approach is generally as follows:
> Group Unicode codepoints into ranges in which the difference between
> neighboring elements does not exceed the specified limit.
> For example, if there are numbers 1, 2, 3, 5, 6 and limit = 1, then
> there is a difference of 2 between 3 and 5, which is greater than 1,
> so there will be ranges 1-3 and 5-6.
>
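
To illustrate the grouping step described above, here is a minimal sketch in
C. It is not the code from the patch (the generator itself is a Perl
script); the type cp_range and the function group_ranges() are hypothetical
names used only for this example, and the input is assumed to be sorted and
free of duplicates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for illustration; not taken from the patch. */
typedef struct
{
	uint32_t	first;
	uint32_t	last;
} cp_range;

/*
 * Split a sorted, duplicate-free array of codepoints into ranges.  A new
 * range starts whenever the gap to the previous codepoint exceeds "limit".
 * Stores at most max_out ranges and returns the number of ranges found.
 */
static size_t
group_ranges(const uint32_t *cps, size_t n, uint32_t limit,
			 cp_range *out, size_t max_out)
{
	size_t		count = 0;

	for (size_t i = 0; i < n; i++)
	{
		/* The input must be strictly ascending. */
		assert(i == 0 || cps[i] > cps[i - 1]);

		if (i == 0 || cps[i] - cps[i - 1] > limit)
		{
			/* First element, or the gap is too large: open a new range. */
			if (count < max_out)
			{
				out[count].first = cps[i];
				out[count].last = cps[i];
			}
			count++;
		}
		else if (count <= max_out)
		{
			/* The gap is within the limit: extend the current range. */
			out[count - 1].last = cps[i];
		}
	}

	return count;
}
```

With the input 1, 2, 3, 5, 6 and limit = 1 this yields the two ranges 1-3
and 5-6, as in the example above.
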
> Then we form a table (let's call it an index table) by combining the
> obtained ranges. The table contains uint16_t indexes into the main table.
>
> Then, from the previously obtained ranges, we generate a function
> (get_case()) that returns the index into the main table. The function
> consists only of IF/ELSE IF constructs that imitate a binary search.
>
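
A rough sketch of what such a lookup could look like, assuming just two tiny
ranges (0x0041..0x0043 and 0x0410..0x0411). The tables, the record layout
(sketch_record) and sketch_tolower() are invented for this example, and the
body of get_case() is only my guess at the general shape; the generated code
in the patch covers many more ranges and may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record layout; entry 0 is reserved for "no mapping". */
typedef struct
{
	uint32_t	lower;
} sketch_record;

static const sketch_record main_table[] = {
	{0}, {0x0061}, {0x0062}, {0x0063}, {0x0430}, {0x0431}
};

/*
 * Index table: the ranges are stored back to back.  Range 0x0041..0x0043
 * starts at offset 0, range 0x0410..0x0411 starts at offset 3.
 */
static const uint16_t index_table[] = {1, 2, 3, 4, 5};

/* Nested comparisons on range boundaries, imitating a binary search. */
static uint16_t
get_case(uint32_t cp)
{
	if (cp < 0x0410)
	{
		if (cp >= 0x0041 && cp <= 0x0043)
			return index_table[cp - 0x0041];
	}
	else if (cp <= 0x0411)
		return index_table[cp - 0x0410 + 3];

	return 0;					/* codepoint is not in any range */
}

/* Hypothetical caller: map to lowercase, or return cp unchanged. */
static uint32_t
sketch_tolower(uint32_t cp)
{
	uint16_t	idx = get_case(cp);

	assert(idx < sizeof(main_table) / sizeof(main_table[0]));
	return idx != 0 ? main_table[idx].lower : cp;
}
```

Note that the codepoint itself is never stored: the range boundaries live in
the comparisons, and the offset within a range selects the index table slot.
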
> Because we no longer access the main table directly, we can remove
> duplicates from it, and almost half of its records are duplicates.
> Also, because get_case() contains all the information about Unicode
> ranges, we don't need to store Unicode codepoints in the main table.
> This approach also made it possible to remove some checks, which
> improved performance even on the fast path (codepoints < 0x80).
>
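
The deduplication can be sketched as follows. This is an illustration only:
map_record and dedup_records() are hypothetical names, the two-field record
is a simplification, and the quadratic search is chosen for brevity rather
than speed. The example assumes that the records for an uppercase letter and
its lowercase counterpart carry the same mappings, so they collapse into one
entry once the codepoint itself is no longer part of the record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified record: no codepoint field, only the mappings. */
typedef struct
{
	uint32_t	lower;
	uint32_t	upper;
} map_record;

/*
 * Copy the distinct records from "in" to "uniq" and, for every input
 * position i, store in index[i] the position of its record in "uniq".
 * Both output arrays must have room for n entries.  Returns the number of
 * distinct records.
 */
static size_t
dedup_records(const map_record *in, size_t n,
			  map_record *uniq, uint16_t *index)
{
	size_t		nuniq = 0;

	for (size_t i = 0; i < n; i++)
	{
		size_t		j;

		/* Look for an identical record that was already emitted. */
		for (j = 0; j < nuniq; j++)
		{
			if (uniq[j].lower == in[i].lower &&
				uniq[j].upper == in[i].upper)
				break;
		}

		/* Not found: append it as a new distinct record. */
		if (j == nuniq)
			uniq[nuniq++] = in[i];

		/* The index must fit into the uint16_t index table. */
		assert(j <= UINT16_MAX);
		index[i] = (uint16_t) j;
	}

	return nuniq;
}
```

For the four records belonging to 'A', 'B', 'a' and 'b' this leaves two
distinct records, and the index array becomes 0, 1, 0, 1.
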
> casefold() test.
>
> * macOS 15.1 (Apple M3 Pro) (Apple clang version 16.0.0)
>
> ASCII:
> Repeated characters (700kb) in the range from 0x20 to 0x7E.
> Patch: tps = 282.457745
> Without: tps = 263.749652
>
> Cyrillic:
> Repeated characters (1MB) in the range from 0x0410 to 0x042F.
> Patch: tps = 82.399637
> Without: tps = 48.291034
>
> Unicode:
> A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
> (excluding 0xD800..0xDFFF).
> Patch: tps = 120.703471
> Without: tps = 92.423490
>
> * Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)
>
> ASCII:
> Repeated characters (700kb) in the range from 0x20 to 0x7E.
> Patch: tps = 172.291972
> Without: tps = 111.592281
>
> Cyrillic:
> Repeated characters (1MB) in the range from 0x0410 to 0x042F.
> Patch: tps = 36.487650
> Without: tps = 22.537515
>
> Unicode:
> A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
> (excluding 0xD800..0xDFFF).
> Patch: tps = 55.190635
> Without: tps = 45.493104
>
>
--
Alexander Borisov