Re: Speeding up text_position_next with multibyte encodings - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Speeding up text_position_next with multibyte encodings |
Date | |
Msg-id | 30dc0ed3-cc30-abda-427d-ac6e8305525b@2ndquadrant.com Whole thread Raw |
In response to | Re: Speeding up text_position_next with multibyte encodings (Heikki Linnakangas <hlinnaka@iki.fi>) |
Responses |
Re: Speeding up text_position_next with multibyte encodings
|
List | pgsql-hackers |
On 12/23/18 1:26 AM, Heikki Linnakangas wrote: > On 14/12/2018 20:20, John Naylor wrote: >> I signed up to be a reviewer, and I will be busy next month, so I went >> ahead and fixed the typo in the patch that broke assert-enabled >> builds. While at it, I standardized on the spelling "start_ptr" in a >> few places to match the rest of the file. It's a bit concerning that >> it wouldn't compile with asserts, but the patch was written by a >> committer and seems to work. > > Thanks for fixing that! > >> On 10/19/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >>> This is a win in most cases. One case is slower: calling position() with >>> a large haystack input, where the match is near the end of the string. >> >> I wanted to test this case in detail, so I ran the attached script, ... > > I'm afraid that script doesn't work as a performance test. The > position() function is immutable, so the result gets cached in the plan > cache. All you're measuring is the speed to get the constant from the > plan cache :-(. > > I rewrote the same tests with a little C helper function, attached, to > fix that, and to eliminate any PL/pgSQL overhead. And I adjusted the > loop counts so that the runtimes are reasonable, now that it actually > does some work. > > UTF-8 > > length master patch > short 2.7s 10.9s > med 2.6s 11.4s > long 3.9s 9.9s > > EUC_KR > > length master patch > short 2.2s 7.5s > med 2.2s 3.5s > long 3.4s 3.6s > > This doesn't look very flattering for the patch. Although, this is > deliberately testing the worst case. > > You chose interesting characters for the UTF-8 test. The haystack is a > repeating pattern of byte sequence EC 99 84, and the needle is a > repeating pattern of EC 84 B1. In the 'long' test, the lengths in the > skip table are '2', '1' and '250'. But the search bounces between the > '2' and '1' cases, and never hits the byte that would allow it to skip > 250 bytes. Interesting case, I had not realized that that can happen. > But I don't think we need to put much weight on that, you could come up > with other scenarios where the current code has skip table collisions, too. > > So overall, I think it's still a worthwhile tradeoff, given that that is > a worst case scenario. If you choose less pathological UTF-8 codepoints, > or there is no match or the match is closer to the beginning of the > string, the patch wins. > So, what is the expected speedup in a "good/average" case? Do we have some reasonable real-world workload mixing these cases that could be used as a realistic benchmark? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: