Re: speed up verifying UTF-8 - Mailing list pgsql-hackers

From	Heikki Linnakangas
Subject	Re: speed up verifying UTF-8
Date	June 9, 2021 14:02:02
Msg-id	e7729297-53e8-6e17-7334-7227043ce716@iki.fi
In response to	Re: speed up verifying UTF-8 (John Naylor <john.naylor@enterprisedb.com>)
Responses	Re: speed up verifying UTF-8
List	pgsql-hackers

On 07/06/2021 15:39, John Naylor wrote:
> On Mon, Jun 7, 2021 at 8:24 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>  >
>  > On 03/06/2021 21:58, John Naylor wrote:
>  > > The microbenchmark is the same one you attached to [1], which I
>  > > extended with a 95% multibyte case.
>  >
>  > Could you share the exact test you're using? I'd like to test this on my
>  > old raspberry pi, out of curiosity.
> 
> Sure, attached.
> 
> --
> John Naylor
> EDB: http://www.enterprisedb.com
> 
Results from chipmunk, my first generation Raspberry Pi:

Master:

 chinese | mixed | ascii
---------+-------+-------
   25392 | 16287 | 10295
(1 row)

v11-0001-Rewrite-pg_utf8_verifystr-for-speed.patch:

 chinese | mixed | ascii
---------+-------+-------
   17739 | 10854 |  4121
(1 row)

So that's good.

What is the worst-case scenario for this algorithm? It would be input 
where the new fast ASCII check never helps, but which the old code 
handles as fast as possible. To test that, I added a repeating pattern 
of '123456789012345ä' to the test set as "mixed2" (these results are 
from my Intel laptop, not the Raspberry Pi):

Master:

 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
    1333 |   757 |   410 |    573
(1 row)

v11-0001-Rewrite-pg_utf8_verifystr-for-speed.patch:

 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     942 |   470 |    66 |   1249
(1 row)
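
To illustrate why the '123456789012345ä' pattern defeats a 16-byte ASCII check, here is a minimal sketch. It assumes the text is UTF-8, so 'ä' is the two bytes C3 A4 and one repetition is 17 bytes, and it assumes the fast path looks at 16 bytes at a time (the two 64-bit words). The names fill_mixed2 and every_window_has_non_ascii are made up for this illustration; this is not the benchmark code from the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* 15 ASCII digits followed by U+00E4, encoded in UTF-8 as C3 A4. */
static const char mixed2_unit[] = "123456789012345\xC3\xA4";
#define MIXED2_UNIT_LEN (sizeof(mixed2_unit) - 1)	/* 17 bytes */

/* Fill buf with as many whole copies of the unit as fit; return bytes written. */
static size_t
fill_mixed2(char *buf, size_t buflen)
{
	size_t		n = 0;

	assert(buf != NULL);
	while (n + MIXED2_UNIT_LEN <= buflen)
	{
		memcpy(buf + n, mixed2_unit, MIXED2_UNIT_LEN);
		n += MIXED2_UNIT_LEN;
	}
	return n;
}

/* True if any byte in s[0..len) has its high bit set, i.e. is not ASCII. */
static bool
has_high_bit(const char *s, size_t len)
{
	for (size_t i = 0; i < len; i++)
	{
		if ((unsigned char) s[i] & 0x80)
			return true;
	}
	return false;
}

/* True if every run of `window` consecutive bytes contains a non-ASCII byte. */
static bool
every_window_has_non_ascii(const char *s, size_t len, size_t window)
{
	for (size_t off = 0; off + window <= len; off++)
	{
		if (!has_high_bit(s + off, window))
			return false;
	}
	return true;
}
```

Because the pattern repeats every 17 bytes and two of those bytes are non-ASCII, a 16-byte window can skip at most one of them, so a 16-byte all-ASCII test fails at every offset. A 15-byte window, by contrast, can land entirely on the digits.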

So there's a regression with that input. Maybe that's acceptable; this 
is the worst case, after all. Or you could tweak check_ascii for a 
different performance tradeoff, by checking the two 64-bit words 
separately and returning "8" if the failure happens in the second word. 
And I haven't tried the SSE patch yet; maybe that compensates for this.
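
A rough sketch of that check_ascii tweak, to make the idea concrete. The function name check_ascii_split, its signature, and the HIGH_BITS macro are assumptions for illustration only; this is not the code from the v11 patch, and it leaves out anything else the real function may need to check.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mask selecting the high bit of each of the eight bytes in a 64-bit word. */
#define HIGH_BITS UINT64_C(0x8080808080808080)

/*
 * Hypothetical variant: return how many leading bytes of s are known to be
 * ASCII, in steps of one 64-bit word: 16, 8, or 0.
 */
static inline int
check_ascii_split(const unsigned char *s, size_t len)
{
	uint64_t	half1;
	uint64_t	half2;

	/* Not enough input for two words: leave it all to the slow path. */
	if (len < 2 * sizeof(uint64_t))
		return 0;

	/* memcpy avoids unaligned loads; compilers usually turn it into a move. */
	memcpy(&half1, s, sizeof(half1));
	if (half1 & HIGH_BITS)
		return 0;				/* non-ASCII byte in the first word */

	memcpy(&half2, s + sizeof(half1), sizeof(half2));
	if (half2 & HIGH_BITS)
		return 8;				/* first word is clean, second is not */

	return 16;					/* both words are pure ASCII */
}
```

With this shape, the caller could skip 8 bytes whenever only the second word contains a multibyte character, instead of falling back to the slow path for the whole 16-byte chunk. Whether the extra branch pays off on the other inputs would have to be measured.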

- Heikki
