Re: speed up verifying UTF-8 - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: speed up verifying UTF-8
Msg-id 2f95e70d-4623-87d4-9f24-ca534155f179@iki.fi
In response to Re: speed up verifying UTF-8  (John Naylor <john.naylor@enterprisedb.com>)
List pgsql-hackers
On 29/06/2021 14:20, John Naylor wrote:
> I still wasn't quite happy with the churn in the regression tests, so 
> for v13 I gave up on using both the existing utf8 table and my new one 
> for the "padded input" tests, and instead just copied the NUL byte test 
> into the new table. Also added a primary key to make sure the padded 
> test won't give weird results if a new entry has a duplicate description.
> 
> I came up with "highbit_carry" as a more descriptive variable name than 
> "x", but that doesn't matter a whole lot.
> 
> It also occurred to me that if we're going to check one 8-byte chunk at 
> a time (like v12 does), maybe it's only worth it to load 8 bytes at a 
> time. An earlier version did this, but without the recent tweaks. The 
> worst-case scenario now might be different from the one with 16-bytes, 
> but for now just tested the previous worst case (mixed2).

I tested the new worst case scenario on my laptop:

gcc master:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1311 |   758 |   405 |     583 |    725


gcc v13:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     956 |   472 |   160 |     572 |    939


mixed16 is the same as "mixed2" in the previous rounds, with 
'123456789012345ä' as the repeating string, and mixed8 uses '1234567ä', 
which I believe is the worst case for patch v13. So v13 is somewhat 
slower than master in the worst case (939 vs. 725 on mixed8).
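
To illustrate why '1234567ä' is a plausible worst case for a validator that tries to skip ASCII 8 bytes at a time: 'ä' is two bytes in UTF-8 (0xC3 0xA4), so the repeating unit is 9 bytes long and every 8-byte window of the repeated string contains at least one non-ASCII byte. The sketch below just counts all-ASCII windows at every byte offset. It is not code from the patch; the function names are made up for illustration, and the real code does not necessarily look at every offset.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* True if none of the 8 bytes at p has its high bit set (all ASCII). */
static bool
chunk_is_ascii(const unsigned char *p)
{
    uint64_t    chunk;

    memcpy(&chunk, p, sizeof(chunk));
    return (chunk & UINT64_C(0x8080808080808080)) == 0;
}

/*
 * Hypothetical helper: repeat 'unit' 'reps' times and count how many 8-byte
 * windows, taken at every byte offset, are pure ASCII.
 * Returns -1 on allocation failure.
 */
static long
count_ascii_windows(const char *unit, size_t reps)
{
    size_t      unit_len = strlen(unit);
    size_t      total = unit_len * reps;
    unsigned char *buf = malloc(total);
    long        count = 0;

    if (buf == NULL)
        return -1;
    for (size_t i = 0; i < reps; i++)
        memcpy(buf + i * unit_len, unit, unit_len);
    for (size_t off = 0; off + 8 <= total; off++)
        if (chunk_is_ascii(buf + off))
            count++;
    free(buf);
    return count;
}
```

With 100 repetitions, count_ascii_windows("1234567\xc3\xa4", 100) returns 0, while the 17-byte mixed16 unit "123456789012345\xc3\xa4" returns 800, i.e. 8 all-ASCII windows per repetition.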

Hmm, there's one more simple trick we can do: We can have a separate 
fast-path version of the loop when there are at least 8 bytes of input 
left, skipping all the length checks. With that:

gcc v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     737 |   412 |    94 |     476 |    725
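
For readers without the attachment, the fast-path idea can be sketched roughly as below. This is a simplified illustration, not the v14 patch: all function names are hypothetical, NUL bytes are treated as invalid, and the per-character check is a plain well-formedness test. The point is only the loop structure: while at least 8 bytes remain, a character (at most 4 bytes) cannot run past the end of the input, so the first loop needs no length checks, and only the short tail pays for them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IS_CONT(b) (((b) & 0xC0) == 0x80)
#define HIGHBITS UINT64_C(0x8080808080808080)
#define LOW7BITS UINT64_C(0x7F7F7F7F7F7F7F7F)

/* Sequence length implied by the lead byte alone (no validation). */
static size_t
utf8_expected_len(unsigned char b0)
{
    return b0 < 0x80 ? 1 : b0 < 0xE0 ? 2 : b0 < 0xF0 ? 3 : 4;
}

/*
 * Length (1-4) of the well-formed character at s, or 0 if invalid.
 * No end-of-input checks: the caller guarantees that
 * utf8_expected_len(s[0]) bytes are readable.
 */
static size_t
utf8_len_unchecked(const unsigned char *s)
{
    unsigned char b0 = s[0];

    if (b0 == 0)
        return 0;               /* NUL is rejected in this sketch */
    if (b0 < 0x80)
        return 1;
    if (b0 >= 0xC2 && b0 <= 0xDF)
        return IS_CONT(s[1]) ? 2 : 0;
    if (b0 >= 0xE0 && b0 <= 0xEF)
    {
        if (!IS_CONT(s[1]) || !IS_CONT(s[2]))
            return 0;
        if ((b0 == 0xE0 && s[1] < 0xA0) ||  /* overlong */
            (b0 == 0xED && s[1] > 0x9F))    /* surrogate */
            return 0;
        return 3;
    }
    if (b0 >= 0xF0 && b0 <= 0xF4)
    {
        if (!IS_CONT(s[1]) || !IS_CONT(s[2]) || !IS_CONT(s[3]))
            return 0;
        if ((b0 == 0xF0 && s[1] < 0x90) ||  /* overlong */
            (b0 == 0xF4 && s[1] > 0x8F))    /* above U+10FFFF */
            return 0;
        return 4;
    }
    return 0;                   /* stray continuation, 0xC0/0xC1, 0xF5+ */
}

/* Returns the number of leading bytes of s that are valid. */
static size_t
utf8_verify_sketch(const unsigned char *s, size_t len)
{
    const unsigned char *start = s;
    size_t      l;

    /* Fast path: at least 8 bytes left, so no length checks. */
    while (len >= 8)
    {
        uint64_t    chunk;

        memcpy(&chunk, s, sizeof(chunk));
        /* No high bits set, and adding 0x7F carries into every high bit
         * only if no byte is zero: the chunk is 8 non-NUL ASCII bytes. */
        if ((chunk & HIGHBITS) == 0 &&
            ((chunk + LOW7BITS) & HIGHBITS) == HIGHBITS)
        {
            s += 8;
            len -= 8;
            continue;
        }
        l = utf8_len_unchecked(s);
        if (l == 0)
            return (size_t) (s - start);
        s += l;
        len -= l;
    }

    /* Tail: fewer than 8 bytes left, check length before each character. */
    while (len > 0)
    {
        if (utf8_expected_len(s[0]) > len)
            break;              /* truncated sequence */
        l = utf8_len_unchecked(s);
        if (l == 0)
            break;
        s += l;
        len -= l;
    }
    return (size_t) (s - start);
}
```

A caller would compare the return value against len; anything shorter means the input has an invalid or truncated sequence at that offset.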


All the above numbers were with gcc 10.2.1. For completeness, with clang 
11.0.1-2 I got:

clang master:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1044 |   724 |   403 |     930 |    603
(1 row)

clang v13:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     596 |   445 |    79 |     417 |    715
(1 row)


clang v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     600 |   337 |    93 |     318 |    511

Attached is patch v14 with that optimization. It needs some cleanup; I 
just hacked it up quickly for performance testing.

- Heikki
