Re: Speeding up text_position_next with multibyte encodings - Mailing list pgsql-hackers

From John Naylor
Subject Re: Speeding up text_position_next with multibyte encodings
Date
Msg-id CAJVSVGWKEShqDeKiQi8jdBBEpXfsM7p7DaML=bEDsK3cPWm=gA@mail.gmail.com
Whole thread Raw
In response to Re: Speeding up text_position_next with multibyte encodings  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Speeding up text_position_next with multibyte encodings  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
On Sun, Dec 23, 2018 at 9:33 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> So, what is the expected speedup in a "good/average" case? Do we have
> some reasonable real-world workload mixing these cases that could be
> used as a realistic benchmark?

Not sure about a realistic mix, but I investigated the tradeoffs.
First, using the test function Heikki posted upthread [1], I tried a 2
character needle needle with different haystack sizes, and looked for
the position where master and patch were roughly equal (average of 3,
within 1%). Beyond this point, the patch regresses. To keep it simple
I used UTF-8 only.

haystack    position
<=23       (patch always faster)
30          23
100         58
1000       ~550
1000000    ~550000

For very short haystacks, the patch is always faster or regresses only
when the needle is close to the end. For longer haystacks, the patch
will be faster if the position searched for is less than ~55% of the
way to the end, slower if after. Next, I tested finding the position
of a 2 character needle in a couple positions towards the front of a
1000 character haystack, plus not present and at the end for
comparison. As seen earlier [1], the worst-case regression for this
haystack length was 4.4x slower for UTF-8, but that had a skip table
collision, and this time I used characters with no bytes in common.
Average of 3, with a loop count of 1,000,000:

UTF-8
pos.  master    patch      diff
*     3880ms     144ms    -96%
100   4440ms    1410ms    -68%
333   5700ms    4280ms    -25%
999   9370ms   12500ms     34%

EUC-KR
pos.  master     patch    diff
*     3900ms     112ms    -97%
100   4410ms    1289ms    -71%
333   5680ms    3970ms    -30%
999   9370ms   11600ms     24%

The patch is much faster than master when the needle is near the front
of a large haystack or not there at all.

The majority of cases are measurably faster, and the best case is at
least 20x faster. On the whole I'd say this patch is a performance win
even without further optimization. I'm marking it ready for committer.

[1] https://www.postgresql.org/message-id/05d95c5c-1fb7-29ea-1c5d-e7e941a0a14d%40iki.fi
--

-John Naylor


pgsql-hackers by date:

Previous
From: "Tsunakawa, Takayuki"
Date:
Subject: RE: O_DIRECT for relations and SLRUs (Prototype)
Next
From: David Rowley
Date:
Subject: Re: "SELECT ... FROM DUAL" is not quite as silly as it appears