From: Heikki Linnakangas
Subject: Speeding up text_position_next with multibyte encodings
Msg-id: 3173d989-bc1c-fc8a-3b69-f24246f73876@iki.fi
Responses: Re: Speeding up text_position_next with multibyte encodings (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
           Re: Speeding up text_position_next with multibyte encodings (John Naylor <jcnaylor@gmail.com>)
List: pgsql-hackers
Attached is a patch to speed up text_position_setup/next() in some 
common cases with multibyte encodings.

text_position_next() uses the Boyer-Moore-Horspool search algorithm, 
with a skip table. Currently, with a multi-byte encoding, we first 
convert both input strings to arrays of wchars, and run the algorithm on 
those arrays.
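
For reference, here's a minimal, self-contained sketch of the 
byte-oriented Horspool search in its textbook form (this is not the 
actual PostgreSQL code, just the shape of the algorithm):

#include <stddef.h>
#include <string.h>

/*
 * Boyer-Moore-Horspool: return the byte offset of the first occurrence
 * of needle in haystack, or -1 if there is none. The skip table says,
 * for each possible byte aligned with the end of the search window,
 * how far the window can be shifted forward.
 */
static ptrdiff_t
bmh_search(const unsigned char *haystack, size_t hlen,
           const unsigned char *needle, size_t nlen)
{
    size_t      skip[256];
    size_t      i;

    if (nlen == 0)
        return 0;
    if (nlen > hlen)
        return -1;

    /* Default shift: the whole needle length. */
    for (i = 0; i < 256; i++)
        skip[i] = nlen;
    /* Bytes that occur in the needle (except its last byte) shift less. */
    for (i = 0; i < nlen - 1; i++)
        skip[needle[i]] = nlen - 1 - i;

    i = 0;
    while (i <= hlen - nlen)
    {
        if (memcmp(haystack + i, needle, nlen) == 0)
            return (ptrdiff_t) i;
        i += skip[haystack[i + nlen - 1]];
    }
    return -1;
}

The point of the patch is that this loop can run on the raw bytes 
regardless of encoding; the multibyte-specific work only happens 
around it.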

This patch removes the mb->wchar conversion, and instead runs the 
single-byte version of the algorithm directly on the inputs, even with 
multi-byte encodings. When there is no match, that avoids the mb->wchar 
conversion altogether. Even when there is a match, we don't need to 
convert the whole input string to wchars: it's enough to count the 
characters up to the match, using pg_mblen(). And when 
text_position_setup/next() are used as part of the split_part() or 
replace() functions, we're not actually interested in the character 
position at all, so we can skip that step too.
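
For illustration, the counting step could look roughly like this, 
where utf8_mblen() is a hypothetical stand-in for pg_mblen(), assuming 
valid UTF-8 input:

#include <stddef.h>

/* Hypothetical stand-in for pg_mblen(), assuming valid UTF-8. */
static int
utf8_mblen(const unsigned char *s)
{
    if (*s < 0x80)
        return 1;
    if ((*s & 0xE0) == 0xC0)
        return 2;
    if ((*s & 0xF0) == 0xE0)
        return 3;
    return 4;
}

/*
 * Count the characters that precede byte offset match_off, stepping
 * through the string one character at a time.
 */
static size_t
chars_before_match(const unsigned char *s, size_t match_off)
{
    size_t      off = 0;
    size_t      nchars = 0;

    while (off < match_off)
    {
        off += (size_t) utf8_mblen(s + off);
        nchars++;
    }
    return nchars;
}

(position() itself is 1-based, so its result would be this count plus 
one.)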

Avoiding the large allocation is nice, too. That was actually why I 
started to look into this: a customer complained that text_position() 
fails on strings larger than 256 MB.

Using the byte-oriented search on multibyte strings might return false 
positives, though. To make that work, after finding a match, we verify 
that the match doesn't fall in the middle of a multi-byte character. 
However, as an important special case, that cannot happen with UTF-8, 
because it has the property that the byte sequence of one character 
never appears as part of another character. I think some other encodings 
might have the same property, but I wasn't sure.
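
Sketched out, that verification can be a walk from the start of the 
haystack with the encoding's character-length function (pg_mblen() in 
the real code), accepting the match only if the walk lands exactly on 
the match offset:

#include <stdbool.h>
#include <stddef.h>

/*
 * Does the byte-level match at offset match_off start on a character
 * boundary? mblen returns the length in bytes of the character at a
 * given position, e.g. the utf8_mblen() sketch above.
 */
static bool
match_on_char_boundary(const unsigned char *s, size_t match_off,
                       int (*mblen)(const unsigned char *))
{
    size_t      off = 0;

    while (off < match_off)
        off += (size_t) mblen(s + off);

    return off == match_off;
}

With UTF-8, as noted, this check can be skipped entirely.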

This is a win in most cases. One case is slower: calling position() with 
a large haystack input, where the match is near the end of the string. 
Counting the characters up to that point is slower than the mb->wchar 
conversion. I think we could avoid that regression too, if we had a more 
optimized function to count characters. Currently, the code calls 
pg_mblen() in a loop, one function call per character, whereas 
pg_mb2wchar_with_len() does the equivalent walk in a single tight loop 
inside the function.
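
For UTF-8, for instance, a dedicated counter could avoid the 
per-character function call altogether, because every byte except the 
10xxxxxx continuation bytes starts a character. A sketch (not what the 
code currently does):

#include <stddef.h>

/*
 * Count UTF-8 characters in a single tight loop: a byte starts a
 * character unless its top two bits are 10.
 */
static size_t
utf8_count_chars(const unsigned char *s, size_t len)
{
    size_t      i;
    size_t      nchars = 0;

    for (i = 0; i < len; i++)
    {
        if ((s[i] & 0xC0) != 0x80)
            nchars++;
    }
    return nchars;
}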

Thoughts?

- Heikki

