Thread: Speeding up text_position_next with multibyte encodings

Speeding up text_position_next with multibyte encodings

From: Heikki Linnakangas

Attached is a patch to speed up text_position_setup/next(), in some 
common cases with multibyte encodings.

text_position_next() uses the Boyer-Moore-Horspool search algorithm, 
with a skip table. Currently, with a multi-byte encoding, we first 
convert both input strings to arrays of wchars, and run the algorithm on 
those arrays.
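
For illustration, the byte-wise search can be sketched roughly like
this (a simplified Boyer-Moore-Horspool over raw bytes; the function
name is hypothetical and the real code in varlena.c differs in its
details):

```c
#include <stddef.h>

/* Byte-wise Boyer-Moore-Horspool search: returns the byte offset of
 * the first match of needle in haystack, or -1 if there is none. */
static long
bmh_search_bytes(const unsigned char *haystack, size_t hlen,
                 const unsigned char *needle, size_t nlen)
{
    size_t      skip[256];
    size_t      i;

    if (nlen == 0 || nlen > hlen)
        return -1;

    /* A byte that does not occur in the needle lets us skip nlen bytes. */
    for (i = 0; i < 256; i++)
        skip[i] = nlen;
    /* A byte in the needle skips to align its last occurrence. */
    for (i = 0; i < nlen - 1; i++)
        skip[needle[i]] = nlen - 1 - i;

    for (i = 0; i + nlen <= hlen;)
    {
        size_t      j = nlen;

        /* Compare the window against the needle, from the end. */
        while (j > 0 && haystack[i + j - 1] == needle[j - 1])
            j--;
        if (j == 0)
            return (long) i;
        /* Advance by the skip distance of the window's last byte. */
        i += skip[haystack[i + nlen - 1]];
    }
    return -1;
}
```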

This patch removes the mb->wchar conversion, and instead runs the 
single-byte version of the algorithm directly on the inputs, even with 
multi-byte encodings. That avoids the mb->wchar conversion altogether, 
when there is no match. Even when there is a match, we don't need to 
convert the whole input string to wchars. It's enough to count the 
characters up to the match, using pg_mblen(). And when 
text_position_setup/next() are used as part of split_part() or replace() 
functions, we're not actually even interested in the character position, 
so we can skip that too.

Avoiding the large allocation is nice, too. That was actually why I 
started to look into this: a customer complained that text_position 
fails with strings larger than 256 MB.

Using the byte-oriented search on multibyte strings might return false 
positives, though. To make that work, after finding a match, we verify 
that the match doesn't fall in the middle of a multi-byte character. 
However, as an important special case, that cannot happen with UTF-8, 
because it has the property that the byte sequence of one character 
never appears as part of another character. I think some other encodings 
might have the same property, but I wasn't sure.
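
The UTF-8 property comes from lead bytes and continuation bytes
occupying disjoint ranges: continuation bytes are always of the form
10xxxxxx. As a small illustration (hypothetical helper, not the
actual verification code):

```c
#include <stddef.h>

/* In UTF-8, a byte starts a character iff it is not a 10xxxxxx
 * continuation byte; this locality is what makes byte-wise matches
 * safe to verify there. */
static int
utf8_is_char_boundary(const unsigned char *s, size_t pos)
{
    return (s[pos] & 0xC0) != 0x80;
}
```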

This is a win in most cases. One case is slower: calling position() with 
a large haystack input, where the match is near the end of the string. 
Counting the characters up to that point is slower than the mb->wchar 
conversion. I think we could avoid that regression too, if we had a more 
optimized function to count characters. Currently, the code calls 
pg_mblen() in a loop, whereas in pg_mb2wchar_with_len(), the similar 
loop through all the characters is a tight loop within the function.
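
Counting the characters up to the match with a pg_mblen()-style loop
might look like the following sketch (UTF-8 only, assuming valid
input; the real pg_mblen() dispatches on the database encoding, which
is part of why the per-character call is not cheap):

```c
#include <stddef.h>

/* Count characters in the first nbytes bytes of a valid UTF-8 string
 * by stepping over each character, the way a pg_mblen() loop would. */
static size_t
utf8_char_count(const unsigned char *s, size_t nbytes)
{
    size_t      pos = 0;
    size_t      chars = 0;

    while (pos < nbytes)
    {
        unsigned char lead = s[pos];

        if (lead < 0x80)
            pos += 1;           /* 0xxxxxxx: ASCII */
        else if (lead < 0xE0)
            pos += 2;           /* 110xxxxx: 2-byte sequence */
        else if (lead < 0xF0)
            pos += 3;           /* 1110xxxx: 3-byte sequence */
        else
            pos += 4;           /* 11110xxx: 4-byte sequence */
        chars++;
    }
    return chars;
}
```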

Thoughts?

- Heikki

Attachment

Re: Speeding up text_position_next with multibyte encodings

From: Kyotaro HORIGUCHI

Hello.

At Fri, 19 Oct 2018 15:52:59 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<3173d989-bc1c-fc8a-3b69-f24246f73876@iki.fi>
> Attached is a patch to speed up text_position_setup/next(), in some
> common cases with multibyte encodings.
> 
> text_position_next() uses the Boyer-Moore-Horspool search algorithm,
> with a skip table. Currently, with a multi-byte encoding, we first
> convert both input strings to arrays of wchars, and run the algorithm
> on those arrays.
> 
> This patch removes the mb->wchar conversion, and instead runs the
> single-byte version of the algorithm directly on the inputs, even with
> multi-byte encodings. That avoids the mb->wchar conversion altogether,
> when there is no match. Even when there is a match, we don't need to
> convert the whole input string to wchars. It's enough to count the
> characters up to the match, using pg_mblen(). And when
> text_position_setup/next() are used as part of split_part() or
> replace() functions, we're not actually even interested in the
> character position, so we can skip that too.

Sounds reasonable. A partially matching character is simply ignored, 
so the conversion is useless.

> Avoiding the large allocation is nice, too. That was actually why I
> stated to look into this: a customer complained that text_position
> fails with strings larger than 256 MB.
> 
> Using the byte-oriented search on multibyte strings might return false
> positives, though. To make that work, after finding a match, we verify
> that the match doesn't fall in the middle of a multi-byte
> character. However, as an important special case, that cannot happen
> with UTF-8, because it has the property that the byte sequence of one
> character never appears as part of another character. I think some
> other encodings might have the same property, but I wasn't sure.

Yes. That can happen for at least EUC_JP, where the same byte can
appear at an arbitrary position. Maybe we can hard-code the
restriction to UTF-8 for the first step. (I'm not sure the second
step will ever happen.)

> This is a win in most cases. One case is slower: calling position()
> with a large haystack input, where the match is near the end of the
> string. Counting the characters up to that point is slower than the
> mb->wchar conversion. I think we could avoid that regression too, if
> we had a more optimized function to count characters. Currently, the
> code calls pg_mblen() in a loop, whereas in pg_mb2wchar_with_len(),
> the similar loop through all the characters is a tight loop within the
> function.
> 
> Thoughts?

Couldn't we count the characters while searching for a match?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Speeding up text_position_next with multibyte encodings

From: Dmitry Dolgov

> At Fri, 19 Oct 2018 15:52:59 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote
> Attached is a patch to speed up text_position_setup/next(), in some
> common cases with multibyte encodings.

Hi,

Unfortunately, the patch doesn't compile anymore due to:

varlena.c: In function ‘text_position_next_internal’:
varlena.c:1337:13: error: ‘start_ptr’ undeclared (first use in this function)
  Assert(start_ptr >= haystack && start_ptr <= haystack_end);

Could you please send an updated version? For now I'm moving it to the next CF.


Re: Speeding up text_position_next with multibyte encodings

From: John Naylor

On 11/30/18, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Unfortunately, patch doesn't compile anymore due:
>
> varlena.c: In function ‘text_position_next_internal’:
> varlena.c:1337:13: error: ‘start_ptr’ undeclared (first use in this
> function)
>   Assert(start_ptr >= haystack && start_ptr <= haystack_end);
>
> Could you please send an updated version? For now I'm moving it to the next
> CF.

I signed up to be a reviewer, and I will be busy next month, so I went
ahead and fixed the typo in the patch that broke assert-enabled
builds. While at it, I standardized on the spelling "start_ptr" in a
few places to match the rest of the file. It's a bit concerning that
it wouldn't compile with asserts, but the patch was written by a
committer and seems to work.

On 10/19/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> This is a win in most cases. One case is slower: calling position() with
> a large haystack input, where the match is near the end of the string.

I wanted to test this case in detail, so I ran the attached script,
which runs position() in three scenarios:
short:   1 character needle at the end of a ~1000 character haystack,
repeated 1 million times
medium: 50 character needle at the end of a ~1 million character
haystack, repeated 1 million times
long:  250 character needle at the end of an ~80 million character
haystack (~230MB, comfortably below the 256MB limit Heikki reported),
done just one time

I took the average of 5 runs using Korean characters in both UTF-8
(3-byte) and EUC-KR (2-byte) encodings.

UTF-8
length  master  patch
short   2.26s   2.51s
med     2.28s   2.54s
long    3.07s   3.11s

EUC-KR
length  master  patch
short   2.29s   2.37s
med     2.29s   2.36s
long    1.75s   1.71s

With UTF-8, the patch is 11-12% slower on short and medium strings,
and about the same on long strings. With EUC-KR, the patch is about 3%
slower on short and medium strings, and 2-3% faster on long strings.
It seems the worst case is not that bad, and could be optimized, as
Heikki said.

-John Naylor

Attachment

Re: Speeding up text_position_next with multibyte encodings

From: John Naylor

On 12/14/18, John Naylor <jcnaylor@gmail.com> wrote:
> I signed up to be a reviewer, and I will be busy next month, so I went
> ahead and fixed the typo in the patch that broke assert-enabled
> builds. While at it, I standardized on the spelling "start_ptr" in a
> few places to match the rest of the file. It's a bit concerning that
> it wouldn't compile with asserts, but the patch was written by a
> committer and seems to work.

I just noticed that the contrib/citext test fails. I've set the status
to waiting on author.

-John Naylor


Re: Speeding up text_position_next with multibyte encodings

From: Heikki Linnakangas

On 14/12/2018 20:20, John Naylor wrote:
> I signed up to be a reviewer, and I will be busy next month, so I went
> ahead and fixed the typo in the patch that broke assert-enabled
> builds. While at it, I standardized on the spelling "start_ptr" in a
> few places to match the rest of the file. It's a bit concerning that
> it wouldn't compile with asserts, but the patch was written by a
> committer and seems to work.

Thanks for fixing that!

> On 10/19/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> This is a win in most cases. One case is slower: calling position() with
>> a large haystack input, where the match is near the end of the string.
> 
> I wanted to test this case in detail, so I ran the attached script, ...

I'm afraid that script doesn't work as a performance test. The 
position() function is immutable, so the result gets cached in the plan 
cache. All you're measuring is the speed to get the constant from the 
plan cache :-(.

I rewrote the same tests with a little C helper function, attached, to 
fix that, and to eliminate any PL/pgSQL overhead. And I adjusted the 
loop counts so that the runtimes are reasonable, now that it actually 
does some work.

UTF-8

length  master  patch
short   2.7s    10.9s
med     2.6s    11.4s
long    3.9s    9.9s

EUC_KR

length  master  patch
short   2.2s    7.5s
med     2.2s    3.5s
long    3.4s    3.6s

This doesn't look very flattering for the patch. Although, this is 
deliberately testing the worst case.

You chose interesting characters for the UTF-8 test. The haystack is a 
repeating pattern of byte sequence EC 99 84, and the needle is a 
repeating pattern of EC 84 B1. In the 'long' test, the lengths in the 
skip table are '2', '1' and '250'. But the search bounces between the 
'2' and '1' cases, and never hits the byte that would allow it to skip 
250 bytes. Interesting case, I had not realized that that can happen. 
But I don't think we need to put much weight on that, you could come up 
with other scenarios where the current code has skip table collisions, too.

So overall, I think it's still a worthwhile tradeoff, given that this 
is a worst-case scenario. If you choose less pathological UTF-8 code 
points, or there is no match, or the match is closer to the beginning 
of the string, the patch wins.

- Heikki

Attachment

Re: Speeding up text_position_next with multibyte encodings

From: Heikki Linnakangas

On 14/12/2018 23:40, John Naylor wrote:
> I just noticed that the contrib/citext test fails. I've set the status
> to waiting on author.

Hmm, it works for me. What failure did you see?

- Heikki


Re: Speeding up text_position_next with multibyte encodings

From: Heikki Linnakangas

On 23/12/2018 02:28, Heikki Linnakangas wrote:
> On 14/12/2018 23:40, John Naylor wrote:
>> I just noticed that the contrib/citext test fails. I've set the status
>> to waiting on author.
> 
> Hmm, it works for me. What failure did you see?

Never mind, I'm seeing it now, with assertions enabled. Thanks, I'll 
investigate!

- Heikki


Re: Speeding up text_position_next with multibyte encodings

From: Heikki Linnakangas

On 23/12/2018 02:32, Heikki Linnakangas wrote:
> On 23/12/2018 02:28, Heikki Linnakangas wrote:
>> On 14/12/2018 23:40, John Naylor wrote:
>>> I just noticed that the contrib/citext test fails. I've set the status
>>> to waiting on author.
>>
>> Hmm, it works for me. What failure did you see?
> 
> Never mind, I'm seeing it now, with assertions enabled. Thanks, I'll
> investigate!

The bug was in the handling of empty inputs. text_position_setup 
assumed and asserted that neither the needle nor the haystack is 
empty, expecting the callers to have handled those special cases 
already, but not all callers did. Here is a fixed version.
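
For illustration, the special cases the callers (or the setup code)
need to cover might look like this sketch, following SQL position()
semantics (1-based result, 0 for no match; the function name is
hypothetical, and a naive scan stands in for the real search):

```c
#include <stddef.h>
#include <string.h>

/* position() with the empty-input special cases handled up front. */
static long
position_with_guards(const unsigned char *haystack, size_t hlen,
                     const unsigned char *needle, size_t nlen)
{
    size_t      i;

    if (nlen == 0)
        return 1;               /* empty needle matches at position 1 */
    if (hlen == 0 || nlen > hlen)
        return 0;               /* no match possible */

    /* Naive scan for illustration; the real code runs B-M-H here. */
    for (i = 0; i + nlen <= hlen; i++)
        if (memcmp(haystack + i, needle, nlen) == 0)
            return (long) (i + 1);      /* 1-based (in bytes, here) */
    return 0;
}
```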

- Heikki

Attachment

Re: Speeding up text_position_next with multibyte encodings

From: Tomas Vondra

On 12/23/18 1:26 AM, Heikki Linnakangas wrote:
> On 14/12/2018 20:20, John Naylor wrote:
>> I signed up to be a reviewer, and I will be busy next month, so I went
>> ahead and fixed the typo in the patch that broke assert-enabled
>> builds. While at it, I standardized on the spelling "start_ptr" in a
>> few places to match the rest of the file. It's a bit concerning that
>> it wouldn't compile with asserts, but the patch was written by a
>> committer and seems to work.
> 
> Thanks for fixing that!
> 
>> On 10/19/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> This is a win in most cases. One case is slower: calling position() with
>>> a large haystack input, where the match is near the end of the string.
>>
>> I wanted to test this case in detail, so I ran the attached script, ...
> 
> I'm afraid that script doesn't work as a performance test. The
> position() function is immutable, so the result gets cached in the plan
> cache. All you're measuring is the speed to get the constant from the
> plan cache :-(.
> 
> I rewrote the same tests with a little C helper function, attached, to
> fix that, and to eliminate any PL/pgSQL overhead. And I adjusted the
> loop counts so that the runtimes are reasonable, now that it actually
> does some work.
> 
> UTF-8
> 
> length  master  patch
> short   2.7s    10.9s
> med     2.6s    11.4s
> long    3.9s    9.9s
> 
> EUC_KR
> 
> length  master  patch
> short   2.2s    7.5s
> med     2.2s    3.5s
> long    3.4s    3.6s
> 
> This doesn't look very flattering for the patch. Although, this is
> deliberately testing the worst case.
> 
> You chose interesting characters for the UTF-8 test. The haystack is a
> repeating pattern of byte sequence EC 99 84, and the needle is a
> repeating pattern of EC 84 B1. In the 'long' test, the lengths in the
> skip table are '2', '1' and '250'. But the search bounces between the
> '2' and '1' cases, and never hits the byte that would allow it to skip
> 250 bytes. Interesting case, I had not realized that that can happen.
> But I don't think we need to put much weight on that, you could come up
> with other scenarios where the current code has skip table collisions, too.
> 
> So overall, I think it's still a worthwhile tradeoff, given that that is
> a worst case scenario. If you choose less pathological UTF-8 codepoints,
> or there is no match or the match is closer to the beginning of the
> string, the patch wins.
> 

So, what is the expected speedup in a "good/average" case? Do we have
some reasonable real-world workload mixing these cases that could be
used as a realistic benchmark?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Speeding up text_position_next with multibyte encodings

From: John Naylor

On 12/22/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 14/12/2018 20:20, John Naylor wrote:
> I'm afraid that script doesn't work as a performance test. The
> position() function is immutable, so the result gets cached in the plan
> cache. All you're measuring is the speed to get the constant from the
> plan cache :-(.

That makes perfect sense now. I should have been more skeptical about
the small and medium sizes having similar times. :/

> I rewrote the same tests with a little C helper function, attached, to
> fix that, and to eliminate any PL/pgSQL overhead.

Thanks for that, I'll probably have occasion to do something like this
for other tests.

> You chose interesting characters for the UTF-8 test. The haystack is a
> repeating pattern of byte sequence EC 99 84, and the needle is a
> repeating pattern of EC 84 B1. In the 'long' test, the lengths in the
> skip table are '2', '1' and '250'. But the search bounces between the
> '2' and '1' cases, and never hits the byte that would allow it to skip
> 250 bytes. Interesting case, I had not realized that that can happen.

Me neither, that was unintentional.

> But I don't think we need to put much weight on that, you could come up
> with other scenarios where the current code has skip table collisions, too.

Okay.

> So overall, I think it's still a worthwhile tradeoff, given that that is
> a worst case scenario. If you choose less pathological UTF-8 codepoints,
> or there is no match or the match is closer to the beginning of the
> string, the patch wins.

On 12/23/18, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> So, what is the expected speedup in a "good/average" case? Do we have
> some reasonable real-world workload mixing these cases that could be
> used as a realistic benchmark?

I'll investigate some "better" cases.

-John Naylor


Re: Speeding up text_position_next with multibyte encodings

From: John Naylor

On Sun, Dec 23, 2018 at 9:33 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> So, what is the expected speedup in a "good/average" case? Do we have
> some reasonable real-world workload mixing these cases that could be
> used as a realistic benchmark?

Not sure about a realistic mix, but I investigated the tradeoffs.
First, using the test function Heikki posted upthread [1], I tried a 2
character needle needle with different haystack sizes, and looked for
the position where master and patch were roughly equal (average of 3,
within 1%). Beyond this point, the patch regresses. To keep it simple
I used UTF-8 only.

haystack  position
<=23      (patch always faster)
30        23
100       58
1000      ~550
1000000   ~550000

For very short haystacks, the patch is always faster, or regresses
only when the needle is close to the end. For longer haystacks, the
patch will be faster if the position searched for is less than ~55%
of the way to the end, and slower after that. Next, I tested finding
the position of a 2-character needle in a couple of positions towards
the front of a 1000-character haystack, plus not present and at the
end for comparison. As seen earlier [1], the worst-case regression
for this haystack length was 4.4x slower for UTF-8, but that had a
skip table collision, and this time I used characters with no bytes
in common. Average of 3, with a loop count of 1,000,000:

UTF-8
pos.  master    patch     diff
*     3880ms    144ms     -96%
100   4440ms    1410ms    -68%
333   5700ms    4280ms    -25%
999   9370ms    12500ms   +34%

EUC-KR
pos.  master    patch     diff
*     3900ms    112ms     -97%
100   4410ms    1289ms    -71%
333   5680ms    3970ms    -30%
999   9370ms    11600ms   +24%

The patch is much faster than master when the needle is near the front
of a large haystack or not there at all.

The majority of cases are measurably faster, and the best case is at
least 20x faster. On the whole I'd say this patch is a performance win
even without further optimization. I'm marking it ready for committer.

[1] https://www.postgresql.org/message-id/05d95c5c-1fb7-29ea-1c5d-e7e941a0a14d%40iki.fi
--

-John Naylor


Re: Speeding up text_position_next with multibyte encodings

From: Heikki Linnakangas

On 15/01/2019 02:52, John Naylor wrote:
> The majority of cases are measurably faster, and the best case is at
> least 20x faster. On the whole I'd say this patch is a performance win
> even without further optimization. I'm marking it ready for committer.

I read through the patch one more time, tweaked the comments a little 
bit, and committed. Thanks for the review!

I did a little profiling of the worst case, where this is slower than 
the old approach. There's a lot of function call overhead coming from 
walking the string with pg_mblen(). That could be improved. If we 
inlined pg_mblen() into the loop, it would become much faster, and I 
think this code would be faster even in the worst case. (Except for 
the very worst cases, where the skip table with the new code happens 
to have a collision at a different point than before, but that 
doesn't seem like a fair comparison.)

I think this is good enough as it is, but if I have the time, I'm going 
to try optimizing the pg_mblen() loop, as well as similar loops e.g. in 
pg_mbstrlen(). Or if someone else wants to give that a go, feel free.
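
One way to cut the per-character call overhead would be a lead-byte
lookup table that the counting loop can use inline, similar in spirit
to pg_utf_mblen() (a sketch with hypothetical names, assuming valid
UTF-8 input):

```c
/* Sequence length of a UTF-8 character, indexed by the lead byte's
 * top nibble. Continuation bytes (0x80-0xBF) are not valid leads and
 * are treated as length 1. */
static const unsigned char utf8_seq_len[16] = {
    1, 1, 1, 1, 1, 1, 1, 1,     /* 0xxxxxxx: ASCII */
    1, 1, 1, 1,                 /* 10xxxxxx: invalid lead byte */
    2, 2,                       /* 110xxxxx: 2-byte sequence */
    3,                          /* 1110xxxx: 3-byte sequence */
    4                           /* 11110xxx: 4-byte sequence */
};

static inline int
inlined_utf8_mblen(unsigned char lead)
{
    return utf8_seq_len[lead >> 4];
}
```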

- Heikki


Re: Speeding up text_position_next with multibyte encodings

From: Bruce Momjian

On Fri, Jan 25, 2019 at 04:33:54PM +0200, Heikki Linnakangas wrote:
> On 15/01/2019 02:52, John Naylor wrote:
> >The majority of cases are measurably faster, and the best case is at
> >least 20x faster. On the whole I'd say this patch is a performance win
> >even without further optimization. I'm marking it ready for committer.
> 
> I read through the patch one more time, tweaked the comments a little bit,
> and committed. Thanks for the review!
> 
> I did a little profiling of the worst case, where this is slower than the
> old approach. There's a lot of function call overhead coming from walking
> the string with pg_mblen(). That could be improved. If we inlined pg_mblen()
> into loop, it becomes much faster, and I think this code would be faster
> even in the worst case. (Except for the very worst cases, where hash table
> with the new code happens to have a collision at a different point than
> before, but that doesn't seem like a fair comparison.)
> 
> I think this is good enough as it is, but if I have the time, I'm going to
> try optimizing the pg_mblen() loop, as well as similar loops e.g. in
> pg_mbstrlen(). Or if someone else wants to give that a go, feel free.

It might be valuable to just inline the UTF8 case.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +