[POC] verifying UTF-8 using SIMD instructions - Mailing list pgsql-hackers

From John Naylor
Subject [POC] verifying UTF-8 using SIMD instructions
Msg-id CAFBsxsEV_SzH+OLyCiyon=iwggSyMh_eF6A3LU2tiWf3Cy2ZQg@mail.gmail.com
Responses Re: [POC] verifying UTF-8 using SIMD instructions
List pgsql-hackers
Hi,

As of b80e10638e3, there is a new API for validating the encoding of strings, and one of the side effects is that we have a wider choice of algorithms. For UTF-8, it has been demonstrated that SIMD is much faster at decoding [1] and validation [2] than the standard approach we use.

It makes sense to start with the ascii subset of UTF-8 for a couple of reasons. First, ascii is very widespread in database content, particularly in bulk loads. Second, ascii can be validated using the simple SSE2 intrinsics that come with (I believe) any x86-64 chip, and I'm guessing we can detect that at compile time and not mess with runtime checks. The examples above using SSE for the general case are much more complicated and involve SSE 4.2 or AVX.
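To illustrate the idea (this is a minimal sketch, not the code from the attached patch): a 16-byte chunk is pure ascii if none of its bytes has the high bit set, and SSE2 can test that with one load and one movemask. The function name ascii_prefix_len is hypothetical, and the compile-time check on __SSE2__ is an assumption about how detection could be done. The sketch also ignores any special treatment of zero bytes that a real verifier might need.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper, for illustration only.
 *
 * Returns the number of leading bytes of s[0..len) that are plain ascii
 * (high bit clear). With SSE2 available at compile time, the input is
 * examined 16 bytes at a time; the scalar loop handles the tail and the
 * chunk that contains the first non-ascii byte.
 */
static size_t
ascii_prefix_len(const unsigned char *s, size_t len)
{
    size_t      i = 0;

    assert(s != NULL || len == 0);

#if defined(__SSE2__)
    while (len - i >= 16)
    {
        /* unaligned load of 16 input bytes */
        __m128i     chunk = _mm_loadu_si128((const __m128i *) (s + i));

        /* movemask collects the high bit of every byte into a 16-bit mask */
        if (_mm_movemask_epi8(chunk) != 0)
            break;              /* some byte in this chunk is >= 0x80 */
        i += 16;
    }
#endif

    /* scalar fallback: also pinpoints the first non-ascii byte */
    while (i < len && s[i] < 0x80)
        i++;

    return i;
}
```

A caller would hand everything after the returned offset to the regular multibyte verifier.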

Here are some numbers on my laptop (macOS/clang 10 -- if the concept is okay, I'll do Linux/gcc and add more inputs). The test is the same as Heikki shared in [3], but I added a case with >95% Chinese characters just to show how that compares to the mixed ascii/multibyte case.

master:

 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366

patch:

 chinese | mixed | ascii
---------+-------+-------
    1103 |   498 |    51

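The speedup in the pure ascii case is nice: about 7x (366 -> 51). The mixed case improves by about a third (761 -> 498), and the chinese case is about 2% slower (1081 -> 1103).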

In the attached POC, I just have a pro forma portability stub, and left full portability detection for later. The fast path is inlined inside pg_utf8_verifystr(). I imagine the ascii fast path could be abstracted into a separate function that takes a function pointer for full encoding validation. That would allow other encodings with strict ascii subsets to use this as well, but coding that abstraction might be a little messy, and b80e10638e3 already gives a performance boost over PG13.
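As a rough sketch of what that abstraction might look like (all names here are hypothetical, and none of this is in the attached patch): a wrapper skips the leading ascii bytes and passes the remainder to an encoding-specific verifier through a function pointer. The ascii check below is a scalar stand-in for the SIMD fast path, and toy_verify is a made-up encoding used only to exercise the wrapper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature: returns the number of valid leading bytes. */
typedef size_t (*full_verifier) (const unsigned char *s, size_t len);

/* Scalar stand-in for the ascii fast path. */
static size_t
skip_ascii(const unsigned char *s, size_t len)
{
    size_t      i = 0;

    while (i < len && s[i] < 0x80)
        i++;
    return i;
}

/*
 * Skip the ascii prefix, then let the encoding-specific verifier handle
 * whatever is left. Returns the number of valid leading bytes overall.
 */
static size_t
verify_with_ascii_fast_path(const unsigned char *s, size_t len,
                            full_verifier verify_full)
{
    size_t      n;

    assert(verify_full != NULL);

    n = skip_ascii(s, len);
    if (n < len)
        n += verify_full(s + n, len - n);
    return n;
}

/*
 * Toy verifier for a made-up encoding: ascii bytes stand alone, and
 * high-bit bytes must come in pairs.
 */
static size_t
toy_verify(const unsigned char *s, size_t len)
{
    size_t      i = 0;

    while (i < len)
    {
        if (s[i] < 0x80)
            i++;
        else if (i + 1 < len && s[i + 1] >= 0x80)
            i += 2;
        else
            break;              /* lone high-bit byte: invalid */
    }
    return i;
}
```

This version only accelerates the leading ascii run; re-entering the fast path after each multibyte sequence is where the messiness mentioned above would likely come in.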

I also took a shot at doing full UTF-8 recognition using a DFA, but so far that has made performance worse. If I ever have more success with that, I'll add it to the mix.

[1] https://woboq.com/blog/utf-8-processing-using-simd.html
[2] https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/
[3] https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5a2f@iki.fi

--
John Naylor
EDB: http://www.enterprisedb.com