Thread: Re: Optimization for lower(), upper(), casefold() functions.
On Tue, 2025-02-11 at 23:08 +0300, Alexander Borisov wrote: > I tried the approach via a range table. The result was worse than > without the table. With branching in a function, the result is > better. > > Patch v3 — ranges binary search by branches. > Patch v4 — ranges binary search by table. Thoughts on v3: It looks like the top 5 bits of the offset are unused. What if we used those bits for flags to indicate: HAS_LOWER HAS_UPPER HAS_FOLD HAS_SPECIAL HAS_TITLE That way, we only need to look in the corresponding table if it actually has an entry other than the codepoint itself. It doesn't leave a lot of room if the tables get larger, but if we are worried about that, we could eliminate HAS_TITLE, because I don't think the performance for INITCAP() is as important as LOWER/UPPER/CASEFOLD. Regards, Jeff Davis
19.02.2025 01:02, Jeff Davis пишет: > On Tue, 2025-02-11 at 23:08 +0300, Alexander Borisov wrote: >> I tried the approach via a range table. The result was worse than >> without the table. With branching in a function, the result is >> better. >> >> Patch v3 — ranges binary search by branches. >> Patch v4 — ranges binary search by table. > > Thoughts on v3: > > It looks like the top 5 bits of the offset are unused. What if we used > those bits for flags to indicate: > > HAS_LOWER > HAS_UPPER > HAS_FOLD > HAS_SPECIAL > HAS_TITLE > > That way, we only need to look in the corresponding table if it > actually has an entry other than the codepoint itself. > > It doesn't leave a lot of room if the tables get larger, but if we are > worried about that, we could eliminate HAS_TITLE, because I don't think > the performance for INITCAP() is as important as LOWER/UPPER/CASEFOLD. > > Regards, > Jeff Davis > It seems to be micro-optimizations. In the sense that the code is very clear and understandable now. These micro-optimizations make the code more complex, and there will be no performance gain. In proposing the patch for v3, I struck a balance between improving performance and reducing binary size, without sacrificing code clarity. We can go into micro-optimizations now and lose the essence of the proposed improvements. And ahead of us we have Unicode Normalization Forms. I'm willing to try different approaches, but there's confidence that we've found a great optimization right now. -- Alexander Borisov
On Wed, 2025-02-19 at 01:54 +0300, Alexander Borisov wrote: > In proposing the patch for v3, I struck a balance between improving > performance and reducing binary size, without sacrificing code > clarity. Fair enough. I will continue reviewing v3. Regards, Jeff Davis
19.02.2025 01:56, Jeff Davis пишет: > On Wed, 2025-02-19 at 01:54 +0300, Alexander Borisov wrote: >> In proposing the patch for v3, I struck a balance between improving >> performance and reducing binary size, without sacrificing code >> clarity. > > Fair enough. I will continue reviewing v3. Did you have a time for review this? I'd like to continue improving Unicode in Postgres, as I previously wrote, next in my plans are Normalization forms, and more. But now I am blocked by this patch. -- Alexander Borisov
On Sun, 2025-03-02 at 23:33 +0300, Alexander Borisov wrote: > Did you have a time for review this? > > I'd like to continue improving Unicode in Postgres, as I previously > wrote, next in my plans are Normalization forms, and more. > But now I am blocked by this patch. Hi, I have refactored unicode_case.c a bit (v3j-0001) and rebased your v3 work on top of that (v3j-0002). The refactoring is so that the optimizations do not need to modify convert_case, which is already complex and I'd like to avoid adding more to that function. Instead, I created a casemap() function, which maps a single chracter, and convert_case() calls that. I didn't test the refactoring for performance, but it looks as optimizable as what was there before. A couple questions: * Is there a reason the fast-path for codepoints < 0x80 is in unicode_case.c rather than unicode_case_func.h? * Is there a reason you defined case_index() as static rather than static inline? * Is there a reason to have a new file unicode_case_func.h rather than just add it to unicode_case_table.h? I'm looking at a few more details, but this is a low-risk change because there are exhaustive tests, so I intend to commit something like this soon. Regards, Jeff Davis
Attachment
Hi Jeff, 12.03.2025 07:05, Jeff Davis wrote: > On Sun, 2025-03-02 at 23:33 +0300, Alexander Borisov wrote: [...] > I have refactored unicode_case.c a bit (v3j-0001) and rebased your v3 > work on top of that (v3j-0002). > > The refactoring is so that the optimizations do not need to modify > convert_case, which is already complex and I'd like to avoid adding > more to that function. Instead, I created a casemap() function, which > maps a single chracter, and convert_case() calls that. > > I didn't test the refactoring for performance, but it looks as > optimizable as what was there before. I like the proposed changes. Here are some thoughts and small improvements: I've modified patch v3j-0001 a bit. Specifically: 1. Added static for casemap() function. Otherwise the compiler could not optimize the code and the performance dropped significantly. 2. Added a fast path for codepoint < 0x80. v3j-0002: In the fast path for codepoints < 0x80, I added a premature return. This avoided additional insertions, which increased performance. > A couple questions: > > * Is there a reason the fast-path for codepoints < 0x80 is in > unicode_case.c rather than unicode_case_func.h? Yes, this is an important optimization, below are benchmarks that confirm it (I mean the lack of fast path in patch v3j-0001). Also we share logic (0) case conversion (1) get/find value from table. > * Is there a reason you defined case_index() as static rather than > static inline? There is no reason, but we can make it static inline, but that won't do anything, the compiler will optimize it anyway. Perhaps for general beauty it should be made static inline, I don't have a rigid position here. Benchmark showed that there is no difference in performance. > * Is there a reason to have a new file unicode_case_func.h rather than > just add it to unicode_case_table.h? I was purely based on existing approaches in Postgres, the Normalization Forms have them separated into different headers. Just trying to be consistent with existing approaches. > I'm looking at a few more details, but this is a low-risk change > because there are exhaustive tests, so I intend to commit something > like this soon. Thanks for the patch review! Regards, Alexander Borisov
Attachment
12.03.2025 19:55, Alexander Borisov wrote: [...] >> A couple questions: >> >> * Is there a reason the fast-path for codepoints < 0x80 is in >> unicode_case.c rather than unicode_case_func.h? > > Yes, this is an important optimization, below are benchmarks that [...] I forgot to add the benchmark: Benchmark Anatation: Patch v3j_0001: patch v3j_0001 without any changes. Patch v3j_0001 + static: patch v3j_0001 with adding static to casemap(). Patch v3j_0001 + static + fast path: patch v3j_0001 with adding static to casemap() and fast path for codepoint < 0x80. v4_0001: v3j_0001 + static + fast path v4_0001 + v3j_0002: v3j_0002 patch unchanged, but with static for casemap() (inherited from v4_0001 patch) v4_0001 + v4_0002: changed fast path for codepoint < 0x80. Made fast return to avoid unnecessary checks. Without: current version of the code without any patches. All results are in tps. * macOS 15.1 (Apple M3 Pro) (Apple clang version 16.0.0) ASCII: Repeated characters (700kb) in the range from 0x20 to 0x7E. Patch v3j_0001: 201.029609 Patch v3j_0001 + static: 247.155438 Patch v3j_0001 + static + fast path: 267.941047 Patch v4_0001 + v3j_0002: 260.737601 Patch v4_0001 + v4_0002: 268.913658 Without: 260.437508 Cyrillic: Repeated characters (1MB) in the range from 0x0410 to 0x042F. Patch v3j_0001: 48.130350 Patch v3j_0001 + static: 49.365897 Patch v3j_0001 + static + fast path: 48.891842 Patch v4_0001 + v3j_0002: 88.833061 Patch v4_0001 + v4_0002: 88.329603 Without: 47.869687 Unicode: A query consisting of all Unicode characters from 0xA0 to 0x2FA1D (excluding 0xD800..0xDFFF). Patch v3j_0001: 91.333557 Patch v3j_0001 + static: 92.464786 Patch v3j_0001 + static + fast path: 91.359428 Patch v4_0001 + v3j_0002: 103.390609 Patch v4_0001 + v4_0002: 102.819763 Without: 92.279652 Regards, Alexander Borisov
On Wed, 2025-03-12 at 19:55 +0300, Alexander Borisov wrote: > 1. Added static for casemap() function. Otherwise the compiler could > not > optimize the code and the performance dropped significantly. Oops, it was static, but I made it external just to see what code it generated. I didn't intend to publish it as an external function -- thank you for catching that! > 2. Added a fast path for codepoint < 0x80. > > v3j-0002: > In the fast path for codepoints < 0x80, I added a premature return. > This avoided additional insertions, which increased performance. What do you mean "additional insertions"? Also, should we just compute the results in the fast path? We don't even need a table. Rough patch attached to go on top of v4-0001. Should we properly return CASEMAP_SELF when *simple == u1, or is it ok to return CASEMAP_SIMPLE? It probably doesn't matter performance-wise, but it feels more correct to return CASEMAP_SELF. > > Perhaps for general > beauty it should be made static inline, I don't have a rigid position > here. We ordinarily use "static inline" if it's in a header file, and "static" if it's in a .c file, so I'll do it that way. > I was purely based on existing approaches in Postgres, the > Normalization Forms have them separated into different headers. Just > trying to be consistent with existing approaches. I think that was done for normalization primarily because it's not used #ifndef FRONTEND (see unicode_norm.c), and perhaps also because it's just a more complex function worthy of its own file. I looked into the history, and commit 783f0cc64d explains why perfect hashing is not used in the frontend: "The decomposition table remains the same, getting used for the binary search in the frontend code, where we care more about the size of the libraries like libpq over performance..." > Regards, Jeff Davis
Attachment
12.03.2025 22:39, Jeff Davis wrote: [...] >> 2. Added a fast path for codepoint < 0x80. >> >> v3j-0002: >> In the fast path for codepoints < 0x80, I added a premature return. >> This avoided additional insertions, which increased performance. > > What do you mean "additional insertions"? Sorry for my English. I mean, we immediately do a return in the if () condition. To avoid further branching/checking. > Also, should we just compute the results in the fast path? We don't > even need a table. Rough patch attached to go on top of v4-0001. > > Should we properly return CASEMAP_SELF when *simple == u1, or is it ok > to return CASEMAP_SIMPLE? It probably doesn't matter performance-wise, > but it feels more correct to return CASEMAP_SELF. It seems to disrupt the overall "beauty" of the approach. Thus, we will copy code (bloat code), make optimizations that do not improve performance but bloat code. I would refrain from such practices. Especially since we'll be changing all that in the next patch (v4-0002). >> >> Perhaps for general >> beauty it should be made static inline, I don't have a rigid position >> here. > > We ordinarily use "static inline" if it's in a header file, and > "static" if it's in a .c file, so I'll do it that way. Great, I've changed this place. Performance has not changed in any way. >> I was purely based on existing approaches in Postgres, the >> Normalization Forms have them separated into different headers. Just >> trying to be consistent with existing approaches. > > I think that was done for normalization primarily because it's not used > #ifndef FRONTEND (see unicode_norm.c), and perhaps also because it's > just a more complex function worthy of its own file. > > I looked into the history, and commit 783f0cc64d explains why perfect > hashing is not used in the frontend: > > "The decomposition table remains the same, getting used for the binary > search in the frontend code, where we care more about the size of the > libraries like libpq over performance..." I removed the extra file (unicode_case_func.h). You are right, we should not create unnecessary clutter. v5 attached. Regards, Alexander Borisov
Attachment
On Wed, 2025-03-12 at 23:39 +0300, Alexander Borisov wrote: > v5 attached. Attached v6j. * marked arrays as "static const" rather than just "static" * ran pgindent * changed data types where appropriate (uint32->pg_wchar) * modified perl code so that it produces code that's already pgindented * cleanup of perl code, removing unnecessary subroutines and variables * added a few comments * ran pgperltidy Some of the perl code working with ranges still needs further cleanup and explanation, though. Also, I ran some of my own simple tests (mostly ASCII) and it showed over 10% speedup. That combined with the smaller table sizes makes this well worth it. Regards, Jeff Davis
Attachment
14.03.2025 06:43, Jeff Davis wrote: > On Wed, 2025-03-12 at 23:39 +0300, Alexander Borisov wrote: >> v5 attached. > > Attached v6j. I like the current changes. Some comments: [...] > * modified perl code so that it produces code that's already pgindented > * cleanup of perl code, removing unnecessary subroutines and variables Does it make sense to arrange table creation in this way (lower, title, upper, fold, special), don't you think it complicates code perception, it turns out to be just copy-past. Previously, the creation of these tables was done in a loop, which required little code and understandable. I tried adding a loop to create tables, and everything looks fine (v7). Also removed unnecessary (hanging) global variables. > * added a few comments > * ran pgperltidy > > Some of the perl code working with ranges still needs further cleanup > and explanation, though. It's all about not getting too carried away. In my vision these subroutines (user-defined functions) will have to be moved to perl module (.pm) in the future, after the patch is committed, so that the code can be used for Normalization Forms. So it seems that removing the branch_function() sub, or rather simplifying it and writing it in place, was unnecessary. But it is not terrible. v7 attached. Regards, Alexander Borisov