Re: speed up verifying UTF-8 - Mailing list pgsql-hackers

From John Naylor
Subject Re: speed up verifying UTF-8
Date
Msg-id CAFBsxsGZ_ssdVmOK5qbcO5on87ByyDvW3APRohR=kCfb8Z3XVA@mail.gmail.com
In response to Re: speed up verifying UTF-8  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: speed up verifying UTF-8
List pgsql-hackers
On Wed, Jun 30, 2021 at 7:18 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> Hmm, there's one more simple trick we can do: We can have a separate
> fast-path version of the loop when there are at least 8 bytes of input
> left, skipping all the length checks. With that:

Good idea, and the numbers look good on Power8 / gcc 4.8 as well:

master:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2951 |  1521 |   871 |    1473 |   1508

v13:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     949 |   642 |   203 |    1046 |   1818

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     887 |   607 |   179 |     776 |   1325
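
For anyone following along, here is a rough sketch of the fast-path structure described in the quoted text. It is not the code from the v14 patch: every sketch_* name is made up for illustration, and PostgreSQL-specific details (such as how zero bytes are treated) are deliberately left out. It only shows a loop that skips length checks while at least 8 bytes remain, followed by a length-checked loop for the tail.

```c
#define SKETCH_IS_CONT(b) (((b) & 0xC0) == 0x80)

/* Hypothetical helper: sequence length implied by the lead byte, 0 if invalid. */
static inline int
sketch_seq_len(unsigned char b1)
{
    if (b1 < 0x80)
        return 1;
    if (b1 >= 0xC2 && b1 <= 0xDF)
        return 2;
    if (b1 >= 0xE0 && b1 <= 0xEF)
        return 3;
    if (b1 >= 0xF0 && b1 <= 0xF4)
        return 4;
    return 0;
}

/* No length checks: the caller guarantees at least 4 readable bytes at s. */
static inline int
sketch_verify_one_unchecked(const unsigned char *s)
{
    unsigned char b1 = s[0];
    unsigned char lo = 0x80;
    unsigned char hi = 0xBF;
    int         l = sketch_seq_len(b1);

    if (l <= 1)
        return l;

    /* Narrow the second-byte range to reject overlongs, surrogates, > U+10FFFF. */
    if (b1 == 0xE0)
        lo = 0xA0;
    else if (b1 == 0xED)
        hi = 0x9F;
    else if (b1 == 0xF0)
        lo = 0x90;
    else if (b1 == 0xF4)
        hi = 0x8F;

    if (s[1] < lo || s[1] > hi)
        return 0;
    if (l >= 3 && !SKETCH_IS_CONT(s[2]))
        return 0;
    if (l == 4 && !SKETCH_IS_CONT(s[3]))
        return 0;
    return l;
}

/* Length-checked variant for the tail of the input. */
static inline int
sketch_verify_one(const unsigned char *s, int len)
{
    int         l;

    if (len <= 0)
        return 0;
    l = sketch_seq_len(s[0]);
    if (l == 0 || l > len)
        return 0;
    return sketch_verify_one_unchecked(s);
}

/* Returns the number of leading bytes that form valid UTF-8. */
static int
sketch_verifystr(const unsigned char *s, int len)
{
    const unsigned char *start = s;
    int         l;

    /* Fast path: with >= 8 bytes left, no sequence can run past the end. */
    while (len >= 8)
    {
        l = sketch_verify_one_unchecked(s);
        if (l == 0)
            return (int) (s - start);
        s += l;
        len -= l;
    }

    /* Slow path: fewer than 8 bytes left, so check lengths. */
    while (len > 0)
    {
        l = sketch_verify_one(s, len);
        if (l == 0)
            break;
        s += l;
        len -= l;
    }
    return (int) (s - start);
}
```

The threshold of 8 is taken from the quoted message. In this sketch the unchecked helper reads at most 4 bytes, so any threshold of 4 or more would be safe here.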


I don't think the new structuring will pose any challenges for rebasing 0002, either. This might need some experimentation, though:

+ * Subroutine of pg_utf8_verifystr() to check one char. Returns the length of the
+ * character at *s in bytes, or 0 on invalid input or premature end of input.
+ *
+ * XXX: could this be combined with pg_utf8_verifychar above?
+ */
+static inline int
+pg_utf8_verify_one(const unsigned char *s, int len)

It seems like it would be easy to put pg_utf8_verify_one in my proposed pg_utf8.h header and replace the body of pg_utf8_verifychar with a call to it.
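
To make that concrete, something along these lines is what I have in mind. This is only a sketch under assumptions: the sketch_* names are placeholders rather than the real functions, the body of the helper is my own illustration and not the patch's, and I am assuming the wrapper keeps a convention of returning -1 for an invalid character.

```c
/*
 * Hypothetical stand-in for a shared header helper: returns the length in
 * bytes of the character at s, or 0 on invalid or truncated input.
 */
static inline int
sketch_utf8_verify_one(const unsigned char *s, int len)
{
    unsigned char lo = 0x80;
    unsigned char hi = 0xBF;
    int         l;
    int         i;

    if (len < 1)
        return 0;

    if (s[0] < 0x80)
        return 1;
    else if (s[0] >= 0xC2 && s[0] <= 0xDF)
        l = 2;
    else if (s[0] >= 0xE0 && s[0] <= 0xEF)
        l = 3;
    else if (s[0] >= 0xF0 && s[0] <= 0xF4)
        l = 4;
    else
        return 0;

    if (len < l)
        return 0;               /* premature end of input */

    /* Second byte has a narrower range for some lead bytes. */
    if (s[0] == 0xE0)
        lo = 0xA0;
    else if (s[0] == 0xED)
        hi = 0x9F;
    else if (s[0] == 0xF0)
        lo = 0x90;
    else if (s[0] == 0xF4)
        hi = 0x8F;
    if (s[1] < lo || s[1] > hi)
        return 0;

    /* Remaining bytes must be plain continuation bytes. */
    for (i = 2; i < l; i++)
    {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
    }
    return l;
}

/* The per-character entry point reduced to a thin wrapper (assumed -1 on error). */
static int
sketch_utf8_verifychar(const unsigned char *s, int len)
{
    int         l = sketch_utf8_verify_one(s, len);

    return (l > 0) ? l : -1;
}
```

With this shape the validation rules live in one place, and the only difference between the two callers is how they report failure.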

--
John Naylor
EDB: http://www.enterprisedb.com