Re: speed up verifying UTF-8 - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: speed up verifying UTF-8
Msg-id e7729297-53e8-6e17-7334-7227043ce716@iki.fi
In response to Re: speed up verifying UTF-8  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: speed up verifying UTF-8
List pgsql-hackers
On 07/06/2021 15:39, John Naylor wrote:
> On Mon, Jun 7, 2021 at 8:24 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>  >
>  > On 03/06/2021 21:58, John Naylor wrote:
>  > > The microbenchmark is the same one you attached to [1], which I
>  > > extended with a 95% multibyte case.
>  >
>  > Could you share the exact test you're using? I'd like to test this on my
>  > old raspberry pi, out of curiosity.
> 
> Sure, attached.
> 
> --
> John Naylor
> EDB: http://www.enterprisedb.com
> 
Results from chipmunk, my first generation Raspberry Pi:

Master:

 chinese | mixed | ascii
---------+-------+-------
   25392 | 16287 | 10295
(1 row)

v11-0001-Rewrite-pg_utf8_verifystr-for-speed.patch:

 chinese | mixed | ascii
---------+-------+-------
   17739 | 10854 |  4121
(1 row)

So that's good: the patch is faster on all three inputs, and the ascii 
case is about 2.5x faster (10295 -> 4121).

What is the worst-case scenario for this algorithm? It would be an input 
where the new fast ASCII check never helps, but which the old code 
handles as fast as possible. To test that, I added a repeating pattern 
of '123456789012345ä' to the test set (these results are from my Intel 
laptop, not the Raspberry Pi):

Master:

 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
    1333 |   757 |   410 |    573
(1 row)

v11-0001-Rewrite-pg_utf8_verifystr-for-speed.patch:

 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     942 |   470 |    66 |   1249
(1 row)
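
For reference, here is a minimal sketch of why the mixed2 pattern is hard
on the fast path, assuming the ASCII check looks at 16 bytes at a time
and that 'ä' is encoded as the two UTF-8 bytes 0xC3 0xA4. The pattern is
then 17 bytes long, so any 16-byte window contains at least one
non-ASCII byte. The helper names fill_pattern and count_ascii_windows
are made up for this illustration and are not part of the patch or the
test script:

```c
#include <stddef.h>
#include <string.h>

/* "123456789012345" followed by U+00E4 encoded as UTF-8: 17 bytes. */
static const char pattern[] = "123456789012345\xC3\xA4";

/*
 * Fill buf with as many whole copies of the pattern as fit.
 * Returns the number of bytes written.
 */
static size_t
fill_pattern(unsigned char *buf, size_t bufsize)
{
    const size_t plen = sizeof(pattern) - 1;    /* drop the trailing NUL */
    size_t       n = 0;

    while (n + plen <= bufsize)
    {
        memcpy(buf + n, pattern, plen);
        n += plen;
    }
    return n;
}

/*
 * Count the 16-byte windows, starting at every byte offset, that
 * contain only ASCII bytes (no byte with the high bit set).
 */
static size_t
count_ascii_windows(const unsigned char *buf, size_t len)
{
    size_t count = 0;

    for (size_t off = 0; off + 16 <= len; off++)
    {
        int all_ascii = 1;

        for (size_t i = 0; i < 16; i++)
        {
            if (buf[off + i] & 0x80)
            {
                all_ascii = 0;
                break;
            }
        }
        count += all_ascii;
    }
    return count;
}
```

With a buffer holding 100 copies of the pattern (1700 bytes),
count_ascii_windows returns 0, whatever offset the window starts at.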

So there's a regression with that input: mixed2 goes from 573 to 1249, 
a bit more than twice as slow. Maybe that's acceptable; this is the 
worst case, after all. Or you could tweak check_ascii for a different 
performance tradeoff, by checking the two 64-bit words separately and 
returning "8" if the failure happens in the second word. And I haven't 
tried the SSE patch yet; maybe that compensates for this.
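
A rough sketch of that tweak, not taken from the patch: the function
name check_ascii_split, its signature, and the HIGH_BITS macro are
assumptions for illustration, and it only tests the high bit of each
byte, leaving out anything else the real check_ascii does. It returns
how many leading bytes are known to be ASCII (16, 8, or 0), so the
caller could skip 8 bytes when only the second word contains a
multibyte character:

```c
#include <stdint.h>
#include <string.h>

/* One high bit per byte of a 64-bit word. */
#define HIGH_BITS UINT64_C(0x8080808080808080)

/*
 * Hypothetical variant of the fast path. Looks at the first 16 bytes
 * of s as two 64-bit words and returns the number of leading bytes
 * verified as ASCII: 16, 8, or 0.
 */
static int
check_ascii_split(const unsigned char *s, size_t len)
{
    uint64_t half1;
    uint64_t half2;

    /* Too short for the fast path; let the caller use the slow path. */
    if (len < 16)
        return 0;

    /* memcpy avoids unaligned access; compilers turn it into plain loads. */
    memcpy(&half1, s, sizeof(half1));
    memcpy(&half2, s + sizeof(half1), sizeof(half2));

    /* Non-ASCII byte in the first word: nothing can be skipped. */
    if (half1 & HIGH_BITS)
        return 0;

    /* First word is clean, second is not: the first 8 bytes are ASCII. */
    if (half2 & HIGH_BITS)
        return 8;

    return 16;
}
```

On the '123456789012345ä' pattern, the first call returns 8 rather than
0, since the first byte of the multibyte character sits in the second
word. Whether the extra branch pays off on the other inputs would need
measuring.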

- Heikki


