pgsql: Use single-byte Boyer-Moore-Horspool search even withmultibyte - Mailing list pgsql-committers

From Heikki Linnakangas
Subject pgsql: Use single-byte Boyer-Moore-Horspool search even withmultibyte
Date
Msg-id E1gn2Tv-0007AD-L3@gemulon.postgresql.org
Whole thread Raw
List pgsql-committers
Use single-byte Boyer-Moore-Horspool search even with multibyte encodings.

The old implementation first converted the input strings to arrays of
wchars, and performed the conversion on those. However, the conversion is
expensive, and for a large input string, consumes a lot of memory.
Allocating the large arrays also meant that these functions could not be
used on strings larger 1 GB / pg_encoding_max_length() (256 MB for UTF-8).

Avoid the conversion, and instead use the single-byte algorithm even with
multibyte encodings. That can get fooled, if there is a matching byte
sequence in the middle of a multi-byte character, so to eliminate false
positives like that, we verify any matches by walking the string character
by character with pg_mblen(). Also, if the caller needs the position of
the match, as a character-offset, we also need to walk the string to count
the characters.

Performance testing shows that walking the whole string with pg_mblen() is
somewhat slower than converting the whole string to wchars. It's still
often a win, though, because we don't need to do it if there is no match,
and even when there is, we only need to walk up to the point where the
match is, not the whole string. Even in the worst case, there would be
room for optimization: Much of the CPU time in the current loop with
pg_mblen() is function call overhead, and could be improved by inlining
pg_mblen() and/or the encoding-specific mblen() functions. But I didn't
attempt to do that as part of this patch.

Most of the callers of text_position_setup/next functions were actually
not interested in the position of the match, counted in characters. To
cater for them, refactor the text_position_next() interface into two
parts: searching for the next match (text_position_next()), and returning
the current match's position as a pointer (text_position_get_match_ptr())
or as a character offset (text_position_get_match_pos()). Getting the
pointer to the match is a more convenient API for many callers, and with
UTF-8, it allows skipping the character-walking step altogether, because
UTF-8 can't have false matches even when treated like raw byte strings.

Reviewed-by: John Naylor
Discussion: https://www.postgresql.org/message-id/3173d989-bc1c-fc8a-3b69-f24246f73876%40iki.fi

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/9556aa01c69a26ca726d8dda8e395acc7c1e30fc

Modified Files
--------------
src/backend/utils/adt/varlena.c | 483 +++++++++++++++++++++-------------------
1 file changed, 257 insertions(+), 226 deletions(-)


pgsql-committers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: pgsql: Fix comments that claimed that mblen() only looks at firstbyte.
Next
From: Tom Lane
Date:
Subject: pgsql: Fix possibly-uninitialized-variable warning from commit9556aa01