Re: Speeding up text_position_next with multibyte encodings - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Speeding up text_position_next with multibyte encodings
Date
Msg-id 05d95c5c-1fb7-29ea-1c5d-e7e941a0a14d@iki.fi
Whole thread Raw
In response to Re: Speeding up text_position_next with multibyte encodings  (John Naylor <jcnaylor@gmail.com>)
Responses Re: Speeding up text_position_next with multibyte encodings
Re: Speeding up text_position_next with multibyte encodings
List pgsql-hackers
On 14/12/2018 20:20, John Naylor wrote:
> I signed up to be a reviewer, and I will be busy next month, so I went
> ahead and fixed the typo in the patch that broke assert-enabled
> builds. While at it, I standardized on the spelling "start_ptr" in a
> few places to match the rest of the file. It's a bit concerning that
> it wouldn't compile with asserts, but the patch was written by a
> committer and seems to work.

Thanks for fixing that!

> On 10/19/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> This is a win in most cases. One case is slower: calling position() with
>> a large haystack input, where the match is near the end of the string.
> 
> I wanted to test this case in detail, so I ran the attached script, ...

I'm afraid that script doesn't work as a performance test. The 
position() function is immutable, so the result gets cached in the plan 
cache. All you're measuring is the speed to get the constant from the 
plan cache :-(.

I rewrote the same tests with a little C helper function, attached, to 
fix that, and to eliminate any PL/pgSQL overhead. And I adjusted the 
loop counts so that the runtimes are reasonable, now that it actually 
does some work.

UTF-8

length        master        patch
short        2.7s        10.9s
med        2.6s        11.4s
long        3.9s        9.9s

EUC_KR

length        master        patch
short        2.2s        7.5s
med        2.2s        3.5s
long        3.4s        3.6s

This doesn't look very flattering for the patch. Although, this is 
deliberately testing the worst case.

You chose interesting characters for the UTF-8 test. The haystack is a 
repeating pattern of byte sequence EC 99 84, and the needle is a 
repeating pattern of EC 84 B1. In the 'long' test, the lengths in the 
skip table are '2', '1' and '250'. But the search bounces between the 
'2' and '1' cases, and never hits the byte that would allow it to skip 
250 bytes. Interesting case, I had not realized that that can happen. 
But I don't think we need to put much weight on that, you could come up 
with other scenarios where the current code has skip table collisions, too.

So overall, I think it's still a worthwhile tradeoff, given that that is 
a worst case scenario. If you choose less pathological UTF-8 codepoints, 
or there is no match or the match is closer to the beginning of the 
string, the patch wins.

- Heikki

Attachment

pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: Joins on TID
Next
From: Heikki Linnakangas
Date:
Subject: Re: Speeding up text_position_next with multibyte encodings