Re: Speeding up text_position_next with multibyte encodings - Mailing list pgsql-hackers

From John Naylor
Subject Re: Speeding up text_position_next with multibyte encodings
Date
Msg-id CAJVSVGUVe9G6yUb3F-YrC_myrj_nqhagYrpmP3OT8sUjzvAuaA@mail.gmail.com
Whole thread Raw
In response to Speeding up text_position_next with multibyte encodings  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Speeding up text_position_next with multibyte encodings
Re: Speeding up text_position_next with multibyte encodings
List pgsql-hackers
On 11/30/18, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Unfortunately, patch doesn't compile anymore due:
>
> varlena.c: In function ‘text_position_next_internal’:
> varlena.c:1337:13: error: ‘start_ptr’ undeclared (first use in this
> function)
>   Assert(start_ptr >= haystack && start_ptr <= haystack_end);
>
> Could you please send an updated version? For now I'm moving it to the next
> CF.

I signed up to be a reviewer, and I will be busy next month, so I went
ahead and fixed the typo in the patch that broke assert-enabled
builds. While at it, I standardized on the spelling "start_ptr" in a
few places to match the rest of the file. It's a bit concerning that
it wouldn't compile with asserts, but the patch was written by a
committer and seems to work.

On 10/19/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> This is a win in most cases. One case is slower: calling position() with
> a large haystack input, where the match is near the end of the string.

I wanted to test this case in detail, so I ran the attached script,
which runs position() in three scenarios:
short:   1 character needle at the end of a ~1000 character haystack,
repeated 1 million times
medium: 50 character needle at the end of a ~1 million character
haystack, repeated 1 million times
long:  250 character needle at the end of an ~80 million character
haystack (~230MB, comfortably below the 256MB limit Heikki reported),
done just one time

I took the average of 5 runs using Korean characters in both UTF-8
(3-byte) and EUC-KR (2-byte) encodings.

UTF-8
length        master        patch
short        2.26s        2.51s
med            2.28s        2.54s
long        3.07s        3.11s

EUC-KR
length        master        patch
short        2.29s        2.37s
med            2.29s        2.36s
long        1.75s        1.71s

With UTF-8, the patch is 11-12% slower on short and medium strings,
and about the same on long strings. With EUC-KR, the patch is about 3%
slower on short and medium strings, and 2-3% faster on long strings.
It seems the worst case is not that bad, and could be optimized, as
Heikki said.

-John Naylor

Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: removal of dangling temp tables
Next
From: Tom Lane
Date:
Subject: Re: Ryu floating point output patch