speed up text_position() for utf-8 - Mailing list pgsql-hackers

From John Naylor
Subject speed up text_position() for utf-8
Date
Msg-id CAFBsxsH1Yutrmu+6LLHKK8iXY+vG--Do6zN+2900spHXQNNQKQ@mail.gmail.com
Whole thread Raw
Responses Re: speed up text_position() for utf-8
List pgsql-hackers
Hi,

Commit 9556aa01c69 (Use single-byte Boyer-Moore-Horspool search even
with multibyte encodings), was a speed improvement for the majority of
cases, but when the match was toward the end of the string, the
slowdown in text_position_get_match_pos() was noticeable. It was found
that there was a lot of overhead in pg_mblen(). [1]

The attached exploratory PoC improves this for utf-8. It applies on
top v25 of my utf-8 verification patch in [2], since one approach
relies on the DFA from it. The other three approaches are:
- a version of pg_utf_mblen() that uses a lookup table [3]
- an inlined copy of pg_utf_mblen()
- an ascii fast path with a fallback to the inlined copy of pg_utf_mblen()

The test is attached and the test function is part of the patch. It's
based on the test used in the commit above. The test searches for a
string that's at the end of a ~1 million byte string. This is on gcc
11 with 2-3 runs to ensure repeatability, but I didn't bother with
statistics because the differences are pretty big:

                  patch                  | no match | ascii | mulitbyte
-----------------------------------------+----------+-------+-----------
 PG11                                    |     1120 |  1100 |       900
 master                                  |      381 |  2350 |      1900
 DFA                                     |      386 |  1640 |      1640
 branchless utf mblen                    |      387 |  4100 |      2600
 inline pg_utf_mblen()                   |      380 |  1080 |       920
 inline pg_utf_mblen() + ascii fast path |      382 |   470 |       918

Neither of the branchless approaches worked well. The DFA can't work
as well here as in verification because it must do additional work.
Inlining pg_utf_mblen() restores worst-case performance to PG11
levels. The ascii fast path is a nice improvement on top of that. A
similar approach could work for pg_mbstrlen() as well, but I haven't
looked into that yet. There are other callers of pg_mblen(), but I
haven't looked into whether they are performance-sensitive. A more
general application would be preferable to a targeted one.

[1] https://www.postgresql.org/message-id/b65df3d8-1f59-3bd7-ebbe-68b81d5a76a4%40iki.fi
[2] https://www.postgresql.org/message-id/CAFBsxsHG%3Dg6W8Mie%2B_NO8dV6O0pO2stxrnS%3Dme5ZmGqk--fd5g%40mail.gmail.com
[3] https://github.com/skeeto/branchless-utf8/blob/master/utf8.h

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Remove pg_strtouint64(), use strtoull() directly
Next
From: Tomas Vondra
Date:
Subject: Re: daitch_mokotoff module