Re: Optimization for lower(), upper(), casefold() functions. - Mailing list pgsql-hackers
From | Alexander Borisov |
---|---|
Subject | Re: Optimization for lower(), upper(), casefold() functions. |
Date | |
Msg-id | 12956299-1496-4455-b3ab-299cc0a87d7c@gmail.com Whole thread Raw |
In response to | Re: Optimization for lower(), upper(), casefold() functions. (Heikki Linnakangas <hlinnaka@iki.fi>) |
List | pgsql-hackers |
31.01.2025 01:43, Heikki Linnakangas пишет: Hi Heikki, > Did you consider using a radix tree? We use that method in src/backend/ > utils/mb/Unicode/convutils.pm. I'm not sure if that's better or worse > than what's proposed here, but it would seem like a more standard > technique at least. Or if this is clearly better, then maybe we should > switch to this technique in convutils.pm too. A perfect hash would be > another alternative, we use that in src/common/unicode_norm.c. I looked at the radix tree implementation, and according to number of branches and mathematical operations I think radix tree will not be faster than the proposed approach. About the perfect hash. The problem with the perfect hash is that it requires a Unicode codepoint to be stored for matching. Originally I started to optimize Unicode Normalization Form in Postgres. But I decided to “practice” on a case so as not to scare anyone with a big patch at once. Actually, I want to do Unicode in Postgres, optimizations and improvements. > Did you check that these optimizations still win with Unicode version 16 > (https://www.postgresql.org/message-id/146349e4-4687-4321-91af- > f235572490a8@eisentraut.org)? We haven't updated to that yet, but sooner > or later we will. Yes, everything works just as well with Unicode version 16 data. > The way you're defining 'pg_unicode_case_index' as a function in the > header file won't work. It needs to be a static inline function if it's > in the header. Or put it in a .c file. I agree, it needs to be moved to a .c file. > Some ideas on how to squeeze this further: > > - Instead of having one table that contains Lower/Title/Upper/Fold for > every character, it might be better to have four separate tables. I > think that would be more cache-friendly: you typically run one of the > functions for many different characters in a loop, rather than all of > the functions for the same character. You could deduplicate between the > tables too: for many ranges of characters, Title=Upper and Lower=Fold. I'll try to experiment with that. Theoretically the performance should increase. > - The characters are stored as 4-byte integers, but the high byte is > always 0. Could squeeze those out. Not sure if that would be a win if it > makes the accesses unaligned, but you could benchmark that. > Alternatively, use that empty byte to store the 'special_case' index, > instead of having a separate field for it. I thought about it, but it seems that it will make the code much more complicated, and we won't gain much. > > - Many characters that have a special case only need the special case > for some of the functions, not all. If you stored the special_case > separately for each function (as the high byte in the 'simplemap' field > perhaps, like I suggested on previous point), you could avoid having > those "dummy" special cases. That's a good idea. Then the main table will be reduced by uint8*n. Thanks, after the weekend I'll send an updated patch that takes into account the comments/advice. -- SberTech Alexander Borisov
pgsql-hackers by date: