Re: speed up verifying UTF-8 - Mailing list pgsql-hackers

From John Naylor
Subject Re: speed up verifying UTF-8
Msg-id CAFBsxsGB=dSBee2M+5-OntnkLgh_LajmW4P+dXhesnmbijfQLg@mail.gmail.com
In response to Re: speed up verifying UTF-8  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: speed up verifying UTF-8  (Amit Khandekar <amitdkhan.pg@gmail.com>)
List pgsql-hackers
I wrote:

> I don't think the new structuring will pose any challenges for rebasing 0002, either. This might need some experimentation, though:
>
> + * Subroutine of pg_utf8_verifystr() to check on char. Returns the length of the
> + * character at *s in bytes, or 0 on invalid input or premature end of input.
> + *
> + * XXX: could this be combined with pg_utf8_verifychar above?
> + */
> +static inline int
> +pg_utf8_verify_one(const unsigned char *s, int len)
>
> It seems like it would be easy to have pg_utf8_verify_one in my proposed pg_utf8.h header and replace the body of pg_utf8_verifychar with it.

0001: I went ahead and tried this for v15, and also attempted some clean-up:

- Rename pg_utf8_verify_one to pg_utf8_verifychar_internal.
- Have pg_utf8_verifychar_internal return -1 for invalid input to match other functions in the file. We could also do this for check_ascii, but it's not quite the same thing, because the string could still have valid bytes in it, just not enough to advance the pointer by the stride length.
- Remove hard-coded numbers (not wedded to this).

- Use a call to pg_utf8_verifychar in the slow path.
- Reduce pg_utf8_verifychar to a thin wrapper around pg_utf8_verifychar_internal.

The last two aren't strictly necessary, but they prevent bloating the binary in the slow path and aid readability. For 0002, this required putting pg_utf8_verifychar* in src/port. (While writing this I noticed I neglected to explain that with a comment, though.)
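
To make the structure above concrete, here is a rough, self-contained sketch of the shape I have in mind. It is not the code in the attached patch: all sketch_* names and the stride of 8 are made up for illustration, and the ASCII check is a plain byte loop rather than anything clever. It only shows the -1 return convention for the per-character check, the 0 return from the ASCII check meaning "couldn't advance by a full stride" rather than "invalid", and the thin non-inline wrapper being what the slow path calls.

```c
#include <stddef.h>

/* Hypothetical stride for the ASCII fast path; chosen only for illustration. */
#define SKETCH_STRIDE 8

/*
 * Returns the length in bytes of the UTF-8 character at *s,
 * or -1 on invalid input or premature end of input.
 */
static inline int
sketch_verifychar_internal(const unsigned char *s, int len)
{
	int			l;
	int			i;
	unsigned char b1;
	unsigned char b2;

	if (len < 1)
		return -1;
	b1 = s[0];

	if (b1 < 0x80)
		return (b1 == '\0') ? -1 : 1;	/* embedded NUL is rejected */
	else if ((b1 & 0xe0) == 0xc0)
		l = 2;
	else if ((b1 & 0xf0) == 0xe0)
		l = 3;
	else if ((b1 & 0xf8) == 0xf0)
		l = 4;
	else
		return -1;				/* stray continuation byte or 0xf8..0xff */

	if (l > len)
		return -1;				/* character is cut off */

	for (i = 1; i < l; i++)
		if ((s[i] & 0xc0) != 0x80)
			return -1;			/* not a continuation byte */

	b2 = s[1];
	switch (l)
	{
		case 2:
			if (b1 < 0xc2)
				return -1;		/* overlong */
			break;
		case 3:
			if (b1 == 0xe0 && b2 < 0xa0)
				return -1;		/* overlong */
			if (b1 == 0xed && b2 > 0x9f)
				return -1;		/* UTF-16 surrogate range */
			break;
		case 4:
			if (b1 == 0xf0 && b2 < 0x90)
				return -1;		/* overlong */
			if (b1 > 0xf4 || (b1 == 0xf4 && b2 > 0x8f))
				return -1;		/* beyond U+10FFFF */
			break;
	}
	return l;
}

/* Thin non-inline wrapper, so callers don't each get a copy of the body. */
static int
sketch_verifychar(const unsigned char *s, int len)
{
	return sketch_verifychar_internal(s, len);
}

/*
 * Returns SKETCH_STRIDE if the next stride is all non-zero ASCII, else 0.
 * Zero does not mean "invalid": the bytes may be fine, there just aren't
 * enough pure-ASCII bytes to advance the pointer by a full stride.
 */
static inline int
sketch_check_ascii(const unsigned char *s, int len)
{
	int			i;

	if (len < SKETCH_STRIDE)
		return 0;
	for (i = 0; i < SKETCH_STRIDE; i++)
		if (s[i] == '\0' || s[i] >= 0x80)
			return 0;
	return SKETCH_STRIDE;
}

/* Returns the number of bytes at the start of s that verified successfully. */
static int
sketch_verifystr(const unsigned char *s, int len)
{
	const unsigned char *start = s;

	while (len > 0)
	{
		int			l = sketch_check_ascii(s, len);	/* fast path */

		if (l == 0)
		{
			l = sketch_verifychar(s, len);	/* slow path: call the wrapper */
			if (l == -1)
				break;
		}
		s += l;
		len -= l;
	}
	return (int) (s - start);
}
```

In this sketch the slow path goes through the wrapper rather than the inline function, which is the part that keeps the loop body small; whether that is worth it is exactly the kind of thing I'd like feedback on.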

Feedback welcome on any of the above.

Since by now it hardly resembles the simdjson (or Fuchsia for that matter) fallback that it took inspiration from, I've removed that mention from the commit message.

0002: Just a rebase to work with the above. One possible review point: We don't really need to have separate control over whether to use special instructions for CRC and UTF-8. It should probably be just one configure knob, but having them separate is perhaps easier to review.

--
John Naylor
EDB: http://www.enterprisedb.com