Home > mailing lists

Re: Speeding up text_position_next with multibyte encodings - Mailing list pgsql-hackers

From	John Naylor
Subject	Re: Speeding up text_position_next with multibyte encodings
Date	December 14, 2018 18:20:46
Msg-id	CAJVSVGUVe9G6yUb3F-YrC_myrj_nqhagYrpmP3OT8sUjzvAuaA@mail.gmail.com Whole thread
In response to	Speeding up text_position_next with multibyte encodings (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses	Re: Speeding up text_position_next with multibyte encodings Re: Speeding up text_position_next with multibyte encodings
List	pgsql-hackers

Tree view

On 11/30/18, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Unfortunately, patch doesn't compile anymore due:
>
> varlena.c: In function ‘text_position_next_internal’:
> varlena.c:1337:13: error: ‘start_ptr’ undeclared (first use in this
> function)
>   Assert(start_ptr >= haystack && start_ptr <= haystack_end);
>
> Could you please send an updated version? For now I'm moving it to the next
> CF.

I signed up to be a reviewer, and I will be busy next month, so I went
ahead and fixed the typo in the patch that broke assert-enabled
builds. While at it, I standardized on the spelling "start_ptr" in a
few places to match the rest of the file. It's a bit concerning that
it wouldn't compile with asserts, but the patch was written by a
committer and seems to work.

On 10/19/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> This is a win in most cases. One case is slower: calling position() with
> a large haystack input, where the match is near the end of the string.

I wanted to test this case in detail, so I ran the attached script,
which runs position() in three scenarios:
short:   1 character needle at the end of a ~1000 character haystack,
repeated 1 million times
medium: 50 character needle at the end of a ~1 million character
haystack, repeated 1 million times
long:  250 character needle at the end of an ~80 million character
haystack (~230MB, comfortably below the 256MB limit Heikki reported),
done just one time

I took the average of 5 runs using Korean characters in both UTF-8
(3-byte) and EUC-KR (2-byte) encodings.

UTF-8
length        master        patch
short        2.26s        2.51s
med            2.28s        2.54s
long        3.07s        3.11s

EUC-KR
length        master        patch
short        2.29s        2.37s
med            2.29s        2.36s
long        1.75s        1.71s

With UTF-8, the patch is 11-12% slower on short and medium strings,
and about the same on long strings. With EUC-KR, the patch is about 3%
slower on short and medium strings, and 2-3% faster on long strings.
It seems the worst case is not that bad, and could be optimized, as
Heikki said.

-John Naylor

Attachment

pgsql-hackers by date:

From: Robert Haas
Date: 14 December 2018, 18:16:30
Subject: Re: removal of dangling temp tables

From: Tom Lane
Date: 14 December 2018, 18:25:29
Subject: Re: Ryu floating point output patch

Re: Speeding up text_position_next with multibyte encodings - Mailing list pgsql-hackers

Attachment

Previous

Next