Re: speed up verifying UTF-8 - Mailing list pgsql-hackers

From John Naylor
Subject Re: speed up verifying UTF-8
Msg-id CAFBsxsEVoO-cGN7Q7H+ytuExSfnm0xm19CMbjs2Q5a+7LXX_rw@mail.gmail.com
In response to Re: speed up verifying UTF-8  (Amit Khandekar <amitdkhan.pg@gmail.com>)
Responses Re: speed up verifying UTF-8  (John Naylor <john.naylor@enterprisedb.com>)
List pgsql-hackers
On Thu, Jul 15, 2021 at 1:10 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

> - check_ascii() seems to be used only for 64-bit chunks. So why not
> remove the len argument and the len <= sizeof(int64) checks inside the
> function. We can rename it to check_ascii64() for clarity.

Thanks for taking a look!

Well yes, but there's nothing so intrinsic to 64 bits that the name needs to reflect that. Earlier versions worked on 16 bytes at a time. The compiler will optimize away the len check, but we could replace it with an assert instead.
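
For readers following along, here is a minimal sketch of what a chunk-wise ASCII check with an assertion in place of the runtime length test might look like. It is not the code from any version of the patch: the name is_ascii_chunk, the bool return type, and the exact bit tricks are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not the patch's check_ascii): returns true if the
 * 8 bytes at s are all non-zero ASCII. The caller must pass exactly
 * sizeof(uint64_t) bytes; that is asserted rather than tested at runtime.
 */
static inline bool
is_ascii_chunk(const unsigned char *s, size_t len)
{
    uint64_t    chunk;

    assert(len == sizeof(chunk));
    (void) len;                 /* silence unused warning when NDEBUG is set */

    /* memcpy avoids alignment problems; compilers turn it into one load */
    memcpy(&chunk, s, sizeof(chunk));

    /* any byte with the high bit set is not ASCII */
    if (chunk & UINT64_C(0x8080808080808080))
        return false;

    /*
     * All bytes are now <= 0x7F, so adding 0x7F to each cannot carry into
     * the next byte. The high bit ends up set in every byte except those
     * that were zero.
     */
    chunk += UINT64_C(0x7f7f7f7f7f7f7f7f);
    return (chunk & UINT64_C(0x8080808080808080)) ==
        UINT64_C(0x8080808080808080);
}
```

With assertions disabled (NDEBUG), the length check costs nothing at all, which matches the expectation that the compiler would have removed it anyway when the caller passes a constant.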

> - I was thinking, why not have a pg_utf8_verify64() that processes
> 64-bit chunks (or a 32-bit version). In check_ascii(), we anyway
> extract a 64-bit chunk from the string. We can use the same chunk to
> extract the required bits from a two byte char or a 4 byte char. This
> way we can avoid extraction of separate bytes like b1 = *s; b2 = s[1]
> etc.

Loading bytes from L1 cache is really fast -- I wouldn't even call it "extraction".

> More importantly, we can avoid the separate continuation-char
> checks for each individual byte.

On a pipelined superscalar CPU, I wouldn't expect it to matter in the slightest.
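
To make the discussion concrete, the byte-at-a-time style being defended can be sketched as below for the two-byte case. This is only an illustration under stated assumptions: verify_two_byte and IS_CONTINUATION_BYTE are hypothetical names, not identifiers from the patch or from PostgreSQL, and the real verifier handles all sequence lengths.

```c
#include <stddef.h>

/* Hypothetical macro: a continuation byte has the form 10xxxxxx */
#define IS_CONTINUATION_BYTE(c) (((c) & 0xC0) == 0x80)

/*
 * Hypothetical helper: returns 2 if s starts with a valid two-byte UTF-8
 * sequence, or -1 otherwise. Each byte is loaded and checked separately.
 */
static inline int
verify_two_byte(const unsigned char *s, size_t len)
{
    unsigned char b1;
    unsigned char b2;

    if (len < 2)
        return -1;

    b1 = s[0];
    b2 = s[1];

    /* the lead byte of a two-byte sequence has the form 110xxxxx */
    if ((b1 & 0xE0) != 0xC0)
        return -1;

    /* separate continuation check for the second byte */
    if (!IS_CONTINUATION_BYTE(b2))
        return -1;

    /* 0xC0 and 0xC1 can only encode overlong forms of ASCII */
    if (b1 < 0xC2)
        return -1;

    return 2;
}
```

The individual tests are independent of each other, so a superscalar core can overlap them; that is the reason for expecting little gain from folding them into operations on a single wide chunk.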

> Additionally, we can try to simplify
> the subsequent overlong or surrogate char checks. Something like this

My recent experience with itemptrs has made me skeptical of this kind of thing, but the idea was interesting enough that I couldn't resist trying it out. I have two attempts, which are attached as v16*.txt and apply independently. They are rough, and some comments are now lies. To simplify the constants, I do shift down to uint32, and I didn't bother working around that. v16alpha regressed on worst-case input, so for v16beta I went back to the earlier coding for the one-byte ASCII check. That helped, but it's still slower than v14.

That was not unexpected, but I was mildly shocked to find out that v15 is also slower than the v14 that Heikki posted. The only non-cosmetic difference is using pg_utf8_verifychar_internal within pg_utf8_verifychar. I'm not sure why it would make such a big difference here. The numbers on Power8 / gcc 4.8 (little endian):

HEAD:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2951 |  1521 |   871 |    1474 |   1508

v14:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     885 |   607 |   179 |     774 |   1325

v15:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1085 |   671 |   180 |    1032 |   1799

v16alpha:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1268 |   822 |   180 |    1410 |   2518

v16beta:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1096 |   654 |   182 |     814 |   1403


As it stands now, for v17 I'm inclined to go back to v15, but without the attempt at being clever that seems to have slowed it down from v14.

Any interest in testing on 64-bit Arm?

--
John Naylor
EDB: http://www.enterprisedb.com