Thread: [POC] verifying UTF-8 using SIMD instructions

[POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
Hi,

As of b80e10638e3, there is a new API for validating the encoding of strings, and one of the side effects is that we have a wider choice of algorithms. For UTF-8, it has been demonstrated that SIMD is much faster at decoding [1] and validation [2] than the standard approach we use.

It makes sense to start with the ascii subset of UTF-8 for a couple reasons. First, ascii is very widespread in database content, particularly in bulk loads. Second, ascii can be validated using the simple SSE2 intrinsics that come with (I believe) any x86-64 chip, and I'm guessing we can detect that at compile time and not mess with runtime checks. The examples above using SSE for the general case are much more complicated and involve SSE 4.2 or AVX.
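For illustration, the SSE2-only ascii check can be very small. This is a sketch, not the POC's actual code; the helper name is ours. Ascii bytes have the high bit clear, and _mm_movemask_epi8() gathers exactly the high bit of each byte, so a pure-ascii 16-byte chunk yields a zero mask:

```c
#include <emmintrin.h>			/* SSE2 */
#include <stdbool.h>

/* Hypothetical helper: true if all 16 bytes at 's' are ASCII. */
static bool
chunk_is_ascii_sse2(const unsigned char *s)
{
	/* unaligned load of 16 bytes */
	__m128i		chunk = _mm_loadu_si128((const __m128i *) s);

	/* movemask collects the high bit of each byte; ASCII => all zero */
	return _mm_movemask_epi8(chunk) == 0;
}
```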

Here are some numbers on my laptop (MacOS/clang 10 -- if the concept is okay, I'll do Linux/gcc and add more inputs). The test is the same as Heikki shared in [3], but I added a case with >95% Chinese characters just to show how that compares to the mixed ascii/multibyte case.

master:

 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366

patch:

 chinese | mixed | ascii
---------+-------+-------
    1103 |   498 |    51

The speedup in the pure ascii case is nice.

In the attached POC, I just have a pro forma portability stub, and left full portability detection for later. The fast path is inlined inside pg_utf8_verifystr(). I imagine the ascii fast path could be abstracted into a separate function that takes a function pointer for full encoding validation. That would allow other encodings with strict ascii subsets to use this as well, but coding that abstraction might be a little messy, and b80e10638e3 already gives a performance boost over PG13.

I also gave a shot at doing full UTF-8 recognition using a DFA, but so far that has made performance worse. If I ever have more success with that, I'll add that in the mix.

[1] https://woboq.com/blog/utf-8-processing-using-simd.html
[2] https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/
[3] https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5a2f@iki.fi

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
Heikki Linnakangas
Date:
On 01/02/2021 19:32, John Naylor wrote:
> It makes sense to start with the ascii subset of UTF-8 for a couple 
> reasons. First, ascii is very widespread in database content, 
> particularly in bulk loads. Second, ascii can be validated using the 
> simple SSE2 intrinsics that come with (I believe) any x86-64 chip, and 
> I'm guessing we can detect that at compile time and not mess with 
> runtime checks. The examples above using SSE for the general case are 
> much more complicated and involve SSE 4.2 or AVX.

I wonder how using SSE compares with dealing with 64 or 32-bit words at 
a time, using regular instructions? That would be more portable.
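Such a word-at-a-time check might be sketched like this (a guess at the shape, not code from any patch in this thread):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Sketch: test the high bit of all eight bytes in one 64-bit operation.
 * ASCII bytes never have the high bit set.  memcpy keeps the load safe
 * on strict-alignment platforms. */
static bool
chunk_is_ascii_word(const char *s)
{
	uint64_t	chunk;

	memcpy(&chunk, s, sizeof(chunk));
	return (chunk & UINT64_C(0x8080808080808080)) == 0;
}
```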

> Here are some numbers on my laptop (MacOS/clang 10 -- if the concept is 
> okay, I'll do Linux/gcc and add more inputs). The test is the same as 
> Heikki shared in [3], but I added a case with >95% Chinese characters 
> just to show how that compares to the mixed ascii/multibyte case.
> 
> master:
> 
>   chinese | mixed | ascii
> ---------+-------+-------
>      1081 |   761 |   366
> 
> patch:
> 
>   chinese | mixed | ascii
> ---------+-------+-------
>      1103 |   498 |    51
> 
> The speedup in the pure ascii case is nice.

Yep.

> In the attached POC, I just have a pro forma portability stub, and left 
> full portability detection for later. The fast path is inlined inside 
> pg_utf8_verifystr(). I imagine the ascii fast path could be abstracted 
> into a separate function that takes a function pointer for full 
> encoding validation. That would allow other encodings with strict ascii 
> subsets to use this as well, but coding that abstraction might be a 
> little messy, and b80e10638e3 already gives a performance boost over PG13.

All supported encodings are ASCII subsets. Might be best to put the 
ASCII check into a static inline function and use it in all the verify 
functions. I presume it's only a few instructions, and these functions 
can be pretty performance sensitive.

> I also gave a shot at doing full UTF-8 recognition using a DFA, but so 
> far that has made performance worse. If I ever have more success with 
> that, I'll add that in the mix.

That's disappointing. Perhaps the SIMD algorithms have higher startup 
costs, so that you need longer inputs to benefit? In that case, it might 
make sense to check the length of the input and only use the SIMD 
algorithm if the input is long enough.

- Heikki



Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
On Mon, Feb 1, 2021 at 2:01 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 01/02/2021 19:32, John Naylor wrote:
> > It makes sense to start with the ascii subset of UTF-8 for a couple
> > reasons. First, ascii is very widespread in database content,
> > particularly in bulk loads. Second, ascii can be validated using the
> > simple SSE2 intrinsics that come with (I believe) any x86-64 chip, and
> > I'm guessing we can detect that at compile time and not mess with
> > runtime checks. The examples above using SSE for the general case are
> > much more complicated and involve SSE 4.2 or AVX.
>
> I wonder how using SSE compares with dealing with 64 or 32-bit words at
> a time, using regular instructions? That would be more portable.

I gave that a shot, and it's actually pretty good. According to this paper [1], 16 bytes is best and gives a good apples-to-apples comparison to SSE registers, so I tried both 16 and 8 bytes.

> All supported encodings are ASCII subsets. Might be best to put the
> ASCII check into a static inline function and use it in all the verify
> functions. I presume it's only a few instructions, and these functions
> can be pretty performance sensitive.

I tried both the static inline function and also putting the whole optimized utf-8 loop in a separate function to which the caller passes a pointer to the appropriate pg_*_verifychar().

In the table below, "inline" refers to coding directly inside pg_utf8_verifystr(). Both C and SSE are in the same patch, with an #ifdef. I didn't bother splitting them out because for other encodings, we want one of the other approaches above. For those, "C retail" refers to putting the contents of the inner loop into a static inline function, if I understood your suggestion correctly. This needs more boilerplate in each function, so I don't prefer it. "C func pointer" refers to the pointer approach I just mentioned. That is the cleanest-looking way to generalize it, so I only tested that version with different strides: 8 and 16 bytes.

This is the same test I used earlier, which is the test in [2] but adding an almost-pure multibyte Chinese text of about the same size.

x86-64 Linux gcc 8.4.0:

      build       | chinese | mixed | ascii
------------------+---------+-------+-------
 master           |    1480 |   848 |   428
 inline SSE       |    1617 |   634 |    63
 inline C         |    1481 |   843 |    50
 C retail         |    1493 |   838 |    49
 C func pointer   |    1467 |   851 |    49
 C func pointer 8 |    1518 |   757 |    56

x86-64 MacOS clang 10.0.0:

      build       | chinese | mixed | ascii
------------------+---------+-------+-------
 master           |    1086 |   760 |   374
 inline SSE       |    1081 |   529 |    70
 inline C         |    1093 |   649 |    49
 C retail         |    1132 |   695 |   152
 C func pointer   |    1085 |   609 |    59
 C func pointer 8 |    1099 |   571 |    71

PowerPC-LE Linux gcc 4.8.5:

      build       | chinese | mixed | ascii
------------------+---------+-------+-------
 master           |    2961 |  1525 |   871
 inline SSE       |   (n/a) | (n/a) | (n/a)
 inline C         |    2911 |  1329 |    80
 C retail         |    2838 |  1311 |   102
 C func pointer   |    2828 |  1314 |    80
 C func pointer 8 |    3143 |  1249 |   133

Looking at the results, the main advantage of SSE here is that it's more robust for mixed inputs. If a 16-byte chunk is not ascii-only but contains a block of ascii at the front, we can skip those bytes with a single CPU instruction, but in C, we have to verify the whole chunk using the slow path.
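The single-instruction skip works roughly like this (helper name is ours; __builtin_ctz is a GCC/Clang builtin): the bitmask from _mm_movemask_epi8 pinpoints the first non-ascii byte, so its offset falls out directly.

```c
#include <emmintrin.h>			/* SSE2 */

/* Hypothetical helper: offset of the first non-ASCII byte in a 16-byte
 * chunk, or 16 if the chunk is pure ASCII. */
static int
first_non_ascii(const unsigned char *s)
{
	__m128i		chunk = _mm_loadu_si128((const __m128i *) s);
	int			mask = _mm_movemask_epi8(chunk);	/* one bit per byte */

	/* count trailing zero bits to locate the first set bit */
	return mask == 0 ? 16 : __builtin_ctz(mask);
}
```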

The "C func pointer approach" seems to win out over the "C retail" approach (static inline function).

Using an 8-byte stride is slightly better for mixed inputs on all platforms tested, but regresses on pure ascii and also seems to regress on pure multibyte. The difference in the multibyte case is small enough that it could be random, but it happens on two platforms, so I'd say it's real. On the other hand, pure multibyte is not as common as mixed text.

Overall, I think the function pointer approach with an 8-byte stride is the best balance. If that's agreeable, next I plan to test with short inputs, because I think we'll want a guard if-statement to only loop through the fast path if the string is long enough to justify that.
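To make the function pointer approach concrete, here is a minimal sketch of the shape being discussed; all names are ours, not the patch's, and the toy per-character verifier merely stands in for pg_utf8_verifychar() and friends:

```c
#include <stdint.h>
#include <string.h>

/* Signature of an encoding-specific per-character verifier: returns the
 * length of the verified character, or -1 on invalid input. */
typedef int (*verify_char_fn) (const unsigned char *s, int len);

/* Toy stand-in for a real verifier: accepts one ASCII byte at a time. */
static int
ascii_only_verifychar(const unsigned char *s, int len)
{
	(void) len;
	return (*s < 0x80) ? 1 : -1;
}

/* Shared ASCII fast path: stride 8 bytes at a time, handing anything
 * non-ASCII to the per-character verifier.  Returns the number of
 * leading bytes successfully verified. */
static int
verify_with_ascii_fastpath(const unsigned char *s, int len,
						   verify_char_fn verify_char)
{
	const unsigned char *start = s;

	while (len > 0)
	{
		if (len >= 8)
		{
			uint64_t	chunk;

			memcpy(&chunk, s, sizeof(chunk));	/* alignment-safe load */
			if ((chunk & UINT64_C(0x8080808080808080)) == 0)
			{
				/* whole chunk is ASCII: skip it */
				s += 8;
				len -= 8;
				continue;
			}
		}

		/* slow path: verify one (possibly multibyte) character */
		{
			int			l = verify_char(s, len);

			if (l <= 0)
				break;			/* invalid byte sequence */
			s += l;
			len -= l;
		}
	}
	return (int) (s - start);
}
```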

> > I also gave a shot at doing full UTF-8 recognition using a DFA, but so
> > far that has made performance worse. If I ever have more success with
> > that, I'll add that in the mix.
>
> That's disappointing. Perhaps the SIMD algorithms have higher startup
> costs, so that you need longer inputs to benefit? In that case, it might
> make sense to check the length of the input and only use the SIMD
> algorithm if the input is long enough.

I changed topics a bit quickly, but here I'm talking about using a table-driven state machine to verify the multibyte case. It's possible I did something wrong, since my model implementation decodes, and having to keep track of how many bytes got verified might be the culprit. I'd like to try again to speed up multibyte, but that might be a PG15 project.
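For readers unfamiliar with the approach, here is a deliberately simplified table-driven validator showing the shape of the DFA idea. This is our sketch, not the code being discussed: it checks only the lead/continuation structure of UTF-8 and does NOT reject overlong sequences or surrogates, which is where the extra states (and the difficulty) come in.

```c
#include <stdbool.h>

/* DFA states: S_START means "at a character boundary". */
enum state
{
	S_START, S_CONT1, S_CONT2, S_CONT3, S_ERR, NSTATES
};

#define NCLASSES 6

/* transition[state][byte class] */
static const unsigned char transition[NSTATES][NCLASSES] = {
	/* ascii  cont     lead2    lead3    lead4    invalid */
	{S_START, S_ERR, S_CONT1, S_CONT2, S_CONT3, S_ERR},	/* S_START */
	{S_ERR, S_START, S_ERR, S_ERR, S_ERR, S_ERR},		/* S_CONT1 */
	{S_ERR, S_CONT1, S_ERR, S_ERR, S_ERR, S_ERR},		/* S_CONT2 */
	{S_ERR, S_CONT2, S_ERR, S_ERR, S_ERR, S_ERR},		/* S_CONT3 */
	{S_ERR, S_ERR, S_ERR, S_ERR, S_ERR, S_ERR},			/* S_ERR */
};

static int
byte_class(unsigned char b)
{
	if (b < 0x80)
		return 0;				/* ASCII */
	if (b < 0xC0)
		return 1;				/* continuation byte */
	if (b < 0xE0)
		return 2;				/* 2-byte lead */
	if (b < 0xF0)
		return 3;				/* 3-byte lead */
	if (b < 0xF8)
		return 4;				/* 4-byte lead */
	return 5;					/* never valid in UTF-8 */
}

/* Structurally valid iff we end at a character boundary with no error. */
static bool
utf8_structurally_valid(const unsigned char *s, int len)
{
	int			state = S_START;

	for (int i = 0; i < len; i++)
		state = transition[state][byte_class(s[i])];
	return state == S_START;
}
```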

[1] https://arxiv.org/abs/2010.03090
[2] https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5a2f@iki.fi

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
Here is a more polished version of the function pointer approach, now adapted to all multibyte encodings. Using the not-yet-committed tests from [1], I found a thinko that caused the test for nul bytes not only to be wrong, but probably also to be elided by the compiler. Doing it correctly is noticeably slower on pure ascii, but still several times faster than before, so the conclusions haven't changed. I'll run full measurements later this week, but I'll share the patch now for review.

[1] https://www.postgresql.org/message-id/11d39e63-b80a-5f8d-8043-fff04201fadc@iki.fi

--
Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
Heikki Linnakangas
Date:
On 07/02/2021 22:24, John Naylor wrote:
> Here is a more polished version of the function pointer approach, now 
> adapted to all multibyte encodings. Using the not-yet-committed tests 
> from [1], I found a thinko that caused the test for nul bytes not only 
> to be wrong, but probably also to be elided by the compiler. Doing it 
> correctly is noticeably slower on pure ascii, but still several times 
> faster than before, so the conclusions haven't changed. I'll run 
> full measurements later this week, but I'll share the patch now for review.

As a quick test, I hacked up pg_utf8_verifystr() to use Lemire's 
algorithm from the simdjson library [1], see attached patch. I 
microbenchmarked it using the same test I used before [2].

These results are with "gcc -O2" using "gcc (Debian 10.2.1-6) 10.2.1 
20210110"

unpatched master:

postgres=# \i mbverifystr-speed.sql
CREATE FUNCTION
  mixed | ascii
-------+-------
    728 |   393
(1 row)

v1-0001-Add-an-ASCII-fast-path-to-multibyte-encoding-veri.patch:

  mixed | ascii
-------+-------
    759 |    98
(1 row)

simdjson-utf8-hack.patch:

  mixed | ascii
-------+-------
     53 |    31
(1 row)

So clearly that algorithm is fast. Not sure if it has a high startup 
cost, or large code size, or other tradeoffs that we don't want. At 
least it depends on SIMD instructions, so it requires more code for the 
architecture-specific implementations and autoconf logic and all that. 
Nevertheless I think it deserves a closer look, I'm a bit reluctant to 
put in half-way measures, when there's a clearly superior algorithm out 
there.

I also tested the fallback implementation from the simdjson library 
(included in the patch, if you uncomment it in simdjson-glue.c):

  mixed | ascii
-------+-------
    447 |    46
(1 row)

I think we should at least try to adopt that. At a high level, it looks 
pretty similar to your patch: you load the data 8 bytes at a time and check 
whether they are all ASCII. If there are any non-ASCII chars, you check the 
bytes one by one; otherwise you load the next 8 bytes. Your patch should 
be able to achieve the same performance, if done right. I don't think 
the simdjson code forbids \0 bytes, so that will add a few cycles, but 
still.

[1] https://github.com/simdjson/simdjson
[2] 
https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5a2f@iki.fi

- Heikki

PS. Your patch as it stands isn't safe on systems with strict alignment, 
the string passed to the verify function isn't guaranteed to be 8 bytes 
aligned. Use memcpy to fetch the next 8-byte chunk to fix.


Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> As a quick test, I hacked up pg_utf8_verifystr() to use Lemire's
> algorithm from the simdjson library [1], see attached patch. I
> microbenchmarked it using the same test I used before [2].

I've been looking at various iterations of Lemire's utf8 code, and trying it out was next on my list, so thanks for doing that!

> These results are with "gcc -O2" using "gcc (Debian 10.2.1-6) 10.2.1
> 20210110"
>
> unpatched master:
>
> postgres=# \i mbverifystr-speed.sql
> CREATE FUNCTION
>   mixed | ascii
> -------+-------
>     728 |   393
> (1 row)
>
> v1-0001-Add-an-ASCII-fast-path-to-multibyte-encoding-veri.patch:
>
>   mixed | ascii
> -------+-------
>     759 |    98
> (1 row)

Hmm, the mixed case got worse -- I haven't seen that in any of my tests.

> simdjson-utf8-hack.patch:
>
>   mixed | ascii
> -------+-------
>      53 |    31
> (1 row)
>
> So clearly that algorithm is fast. Not sure if it has a high startup
> cost, or large code size, or other tradeoffs that we don't want.

The simdjson lib uses everything up through AVX512 depending on what hardware is available. I seem to remember reading that high start-up cost is more relevant to floating point than to integer ops, but I could be wrong. Just the utf8 portion is surely tiny also.

> At
> least it depends on SIMD instructions, so it requires more code for the
> architecture-specific implementations and autoconf logic and all that.

One of his earlier demos [1] (in simdutf8check.h) had a version that used mostly SSE2 with just three intrinsics from SSSE3. That's widely available by now. He measured that at 0.7 cycles per byte, which is still good compared to AVX2 0.45 cycles per byte [2].

Testing for three SSSE3 intrinsics in autoconf is pretty easy. I would assume that if that check (and the corresponding runtime check) passes, we can assume SSE2. That code has three licenses to choose from -- Apache 2, Boost, and MIT. Something like that might be straightforward to start from. I think the only obstacles to worry about are license and getting it to fit into our codebase. Adding more than zero high-level comments with a good description of how it works in detail is also a bit of a challenge.

> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
>   mixed | ascii
> -------+-------
>     447 |    46
> (1 row)
>
> I think we should at least try to adopt that. At a high level, it looks
> pretty similar to your patch: you load the data 8 bytes at a time and check
> whether they are all ASCII. If there are any non-ASCII chars, you check the
> bytes one by one; otherwise you load the next 8 bytes. Your patch should
> be able to achieve the same performance, if done right. I don't think
> the simdjson code forbids \0 bytes, so that will add a few cycles, but
> still.

Okay, I'll look into that.

> PS. Your patch as it stands isn't safe on systems with strict alignment,
> the string passed to the verify function isn't guaranteed to be 8 bytes
> aligned. Use memcpy to fetch the next 8-byte chunk to fix.

Will do.

[1] https://github.com/lemire/fastvalidate-utf-8/tree/master/include
[2] https://lemire.me/blog/2018/10/19/validating-utf-8-bytes-using-only-0-45-cycles-per-byte-avx-edition/

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:


On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
>   mixed | ascii
> -------+-------
>     447 |    46
> (1 row)
>
> I think we should at least try to adopt that. At a high level, it looks
> pretty similar to your patch: you load the data 8 bytes at a time and check
> whether they are all ASCII. If there are any non-ASCII chars, you check the
> bytes one by one; otherwise you load the next 8 bytes. Your patch should
> be able to achieve the same performance, if done right. I don't think
> the simdjson code forbids \0 bytes, so that will add a few cycles, but
> still.

That fallback is very similar to my "inline C" case upthread, and they both actually check 16 bytes at a time (the comment is wrong in the patch you shared). I can work back and show how the performance changes with each difference (just MacOS, clang 10 here):

master

 mixed | ascii
-------+-------
   757 |   366

v1, but using memcpy()

 mixed | ascii
-------+-------
   601 |   129

remove zero-byte check:

 mixed | ascii
-------+-------
   588 |    93

inline ascii fastpath into pg_utf8_verifystr()

 mixed | ascii
-------+-------
   595 |    71

use 16-byte stride

 mixed | ascii
-------+-------
   652 |    49

With this cpu/compiler, v1 is fastest on the mixed input, all else being equal.

Maybe there's a smarter way to check for zeros in C. Or maybe be more careful about cache -- running memchr() on the whole input first might not be the best thing to do. 

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [POC] verifying UTF-8 using SIMD instructions

From
Heikki Linnakangas
Date:
On 09/02/2021 22:08, John Naylor wrote:
> Maybe there's a smarter way to check for zeros in C. Or maybe be more 
> careful about cache -- running memchr() on the whole input first might 
> not be the best thing to do.

The usual trick is the haszero() macro here: 
https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord. That's 
how memchr() is typically implemented, too.
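For reference, the macro in question looks like this (taken in spirit from that page; the 64-bit constants here are ours):

```c
#include <stdint.h>

/* haszero(): nonzero iff the 64-bit word 'v' contains a zero byte.
 * Subtracting 0x01 from each byte borrows into the high bit only where
 * the byte was zero; "& ~(v)" filters out bytes whose high bit was
 * already set, so there are no false positives. */
#define haszero(v) \
	(((v) - UINT64_C(0x0101010101010101)) & \
	 ~(v) & UINT64_C(0x8080808080808080))
```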

- Heikki



Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:


I wrote:
>
> On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> One of his earlier demos [1] (in simdutf8check.h) had a version that used mostly SSE2 with just three intrinsics from SSSE3. That's widely available by now. He measured that at 0.7 cycles per byte, which is still good compared to AVX2 0.45 cycles per byte [2].
>
> Testing for three SSSE3 intrinsics in autoconf is pretty easy. I would assume that if that check (and the corresponding runtime check) passes, we can assume SSE2. That code has three licenses to choose from -- Apache 2, Boost, and MIT. Something like that might be straightforward to start from. I think the only obstacles to worry about are license and getting it to fit into our codebase. Adding more than zero high-level comments with a good description of how it works in detail is also a bit of a challenge.

I double checked, and it's actually two SSSE3 intrinsics and one SSE4.1, but the 4.1 one can be emulated with a few SSE2 intrinsics. But we could probably fold all three into the SSE4.2 CRC check and have a single symbol to save on boilerplate.

I hacked that demo [1] into wchar.c (very ugly patch attached), and got the following:

master

 mixed | ascii
-------+-------
   757 |   366

Lemire demo:

 mixed | ascii
-------+-------
   172 |   168

This one lacks an ascii fast path, but the AVX2 version in the same file has one that could probably be easily adapted. With that, I think this would be worth adapting to our codebase and license. Thoughts?

The advantage of this demo is that it's not buried in a mountain of modern C++.
 
Simdjson can use AVX -- do you happen to know which target it got compiled to? AVX vectors are 256 bits wide, and that requires OS support. The OSes we care most about were updated 8-12 years ago, but that would still be something to check, in addition to more configure checks.

Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:


On Tue, Feb 9, 2021 at 4:22 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 09/02/2021 22:08, John Naylor wrote:
> > Maybe there's a smarter way to check for zeros in C. Or maybe be more
> > careful about cache -- running memchr() on the whole input first might
> > not be the best thing to do.
>
> The usual trick is the haszero() macro here:
> https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord. That's
> how memchr() is typically implemented, too.

Thanks for that. Checking with that macro each loop iteration gives a small boost:

v1, but using memcpy()

 mixed | ascii
-------+-------
   601 |   129

with haszero()

 mixed | ascii
-------+-------
   583 |   105

remove zero-byte check:

 mixed | ascii
-------+-------
   588 |    93

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
>   mixed | ascii
> -------+-------
>     447 |    46
> (1 row)
>
> I think we should at least try to adopt that. At a high level, it looks
> pretty similar to your patch: you load the data 8 bytes at a time and check
> whether they are all ASCII. If there are any non-ASCII chars, you check the
> bytes one by one; otherwise you load the next 8 bytes. Your patch should
> be able to achieve the same performance, if done right. I don't think
> the simdjson code forbids \0 bytes, so that will add a few cycles, but
> still.

Attached is a patch that does roughly what simdjson fallback did, except I use straight tests on the bytes and only calculate code points in assertion builds. In the course of doing this, I found that my earlier concerns about putting the ascii check in a static inline function were due to my suboptimal loop implementation. I had assumed that if the chunked ascii check failed, it had to check all those bytes one at a time. As it turns out, that's a waste of the branch predictor. In the v2 patch, we do the chunked ascii check every time we loop. With that, I can also confirm the claim in the Lemire paper that it's better to do the check on 16-byte chunks:

(MacOS, Clang 10)

master:

 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366

v2 patch, with 16-byte stride:

 chinese | mixed | ascii
---------+-------+-------
     806 |   474 |    83

patch but with 8-byte stride:

 chinese | mixed | ascii
---------+-------+-------
     792 |   490 |   105

I also included the fast path in all other multibyte encodings, and that is also pretty good performance-wise. It regresses from master on pure multibyte input, but that case is still faster than PG13, which I simulated by reverting 6c5576075b0f9 and b80e10638e3:

~PG13:

 chinese | mixed | ascii
---------+-------+-------
    1565 |   848 |   365

ascii fast-path plus pg_*_verifychar():

 chinese | mixed | ascii
---------+-------+-------
    1279 |   656 |    94


v2 has a rough start to having multiple implementations in src/backend/port. Next steps are:

1. Add more tests for utf-8 coverage (in addition to the ones to be added by the noError argument patch)
2. Add SSE4 validator -- it turns out the demo I referred to earlier doesn't match the algorithm in the paper. I plan to only copy the lookup tables from simdjson verbatim, but the code will basically be written from scratch, using simdjson as a hint.
3. Adjust configure.ac

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
Heikki Linnakangas
Date:
On 13/02/2021 03:31, John Naylor wrote:
> On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinnaka@iki.fi 
> <mailto:hlinnaka@iki.fi>> wrote:
>  >
>  > I also tested the fallback implementation from the simdjson library
>  > (included in the patch, if you uncomment it in simdjson-glue.c):
>  >
>  >   mixed | ascii
>  > -------+-------
>  >     447 |    46
>  > (1 row)
>  >
>  > I think we should at least try to adopt that. At a high level, it looks
>  > pretty similar to your patch: you load the data 8 bytes at a time and check
>  > whether they are all ASCII. If there are any non-ASCII chars, you check the
>  > bytes one by one; otherwise you load the next 8 bytes. Your patch should
>  > be able to achieve the same performance, if done right. I don't think
>  > the simdjson code forbids \0 bytes, so that will add a few cycles, but
>  > still.
> 
> Attached is a patch that does roughly what simdjson fallback did, except 
> I use straight tests on the bytes and only calculate code points in 
> assertion builds. In the course of doing this, I found that my earlier 
> concerns about putting the ascii check in a static inline function were 
> due to my suboptimal loop implementation. I had assumed that if the 
> chunked ascii check failed, it had to check all those bytes one at a 
> time. As it turns out, that's a waste of the branch predictor. In the v2 
> patch, we do the chunked ascii check every time we loop. With that, I 
> can also confirm the claim in the Lemire paper that it's better to do 
> the check on 16-byte chunks:
> 
> (MacOS, Clang 10)
> 
> master:
> 
>   chinese | mixed | ascii
> ---------+-------+-------
>      1081 |   761 |   366
> 
> v2 patch, with 16-byte stride:
> 
>   chinese | mixed | ascii
> ---------+-------+-------
>       806 |   474 |    83
> 
> patch but with 8-byte stride:
> 
>   chinese | mixed | ascii
> ---------+-------+-------
>       792 |   490 |   105
> 
> I also included the fast path in all other multibyte encodings, and that 
> is also pretty good performance-wise.

Cool.

> It regresses from master on pure 
> multibyte input, but that case is still faster than PG13, which I 
> simulated by reverting 6c5576075b0f9 and b80e10638e3:

I thought the "chinese" numbers above are pure multibyte input, and it 
seems to do well on that. Where does it regress? In multibyte encodings 
other than UTF-8? How bad is the regression?

I tested this on my first generation Raspberry Pi (chipmunk). I had to 
tweak it a bit to make it compile, since the SSE autodetection code was 
not finished yet. And I used generate_series(1, 1000) instead of 
generate_series(1, 10000) in the test script (mbverifystr-speed.sql) 
because this system is so slow.

master:

  mixed | ascii
-------+-------
   1310 |  1041
(1 row)

v2-add-portability-stub-and-new-fallback.patch:

  mixed | ascii
-------+-------
   2979 |   910
(1 row)

I'm guessing that's because the unaligned access in check_ascii() is 
expensive on this platform.

- Heikki



Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
On Mon, Feb 15, 2021 at 9:18 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>

Attached is the first attempt at using SSE4 to do the validation, but first I'll answer your questions about the fallback.

I should mention that v2 had a correctness bug for 4-byte characters that I found when I was writing regression tests. It shouldn't materially affect performance, however.

> I thought the "chinese" numbers above are pure multibyte input, and it
> seems to do well on that. Where does it regress? In multibyte encodings
> other than UTF-8?

Yes, the second set of measurements was intended to represent multibyte encodings other than UTF-8. But instead of using one of those encodings, I simulated non-UTF-8 by copying the pattern used for those: in the loop, check for ascii then either advance or verify one character. It was a quick way to use the same test.

> How bad is the regression?

I'll copy the measurements here together with master so it's easier to compare:

~= PG13 (revert 6c5576075b0f9 and b80e10638e3):

 chinese | mixed | ascii
---------+-------+-------
    1565 |   848 |   365

master:

 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366

ascii fast-path plus pg_*_verifychar():

 chinese | mixed | ascii
---------+-------+-------
    1279 |   656 |    94

As I mentioned upthread, pure multibyte is still faster than PG13. Reducing the ascii check to 8 bytes at a time might alleviate the regression.

> I tested this on my first generation Raspberry Pi (chipmunk). I had to
> tweak it a bit to make it compile, since the SSE autodetection code was
> not finished yet. And I used generate_series(1, 1000) instead of
> generate_series(1, 10000) in the test script (mbverifystr-speed.sql)
> because this system is so slow.
>
> master:
>
>   mixed | ascii
> -------+-------
>    1310 |  1041
> (1 row)
>
> v2-add-portability-stub-and-new-fallback.patch:
>
>   mixed | ascii
> -------+-------
>    2979 |   910
> (1 row)
>
> I'm guessing that's because the unaligned access in check_ascii() is
> expensive on this platform.

Hmm, I used memcpy() as suggested. Is that still slow on that platform? That's 32-bit, right? Some possible remedies:

1) For the COPY FROM case, we should align the allocation on a cacheline -- we already have examples of that idiom elsewhere. I was actually going to suggest doing this anyway, since unaligned SIMD loads are often slower, too.

2) As the simdjson fallback was based on Fuchsia (the Lemire paper implies it was tested carefully on Arm and I have no reason to doubt that), I could try to follow that example more faithfully by computing the actual codepoints. It's more computation and just as many branches as far as I can tell, but it's not a lot of work. I can add that alternative fallback to the patch set. I have no Arm machines, but I can test on a POWER8 machine.

3) #ifdef out the ascii check for 32-bit platforms.

4) Same as the non-UTF8 case -- only check for ascii 8 bytes at a time. I'll probably try this first.

Now, I'm pleased to report that I got the SSE4 validator working. It still needs some stress testing to find any corner-case bugs, but it shouldn't be too early to share some numbers on Clang 10 / MacOS:

master:

 chinese | mixed | ascii
---------+-------+-------
    1082 |   751 |   364

v3 with SSE4.1:

 chinese | mixed | ascii
---------+-------+-------
     127 |   128 |   126

Some caveats and notes:

- It takes almost no recognizable code from simdjson, but it does take the magic constants lookup tables almost verbatim. The main body of the code has no intrinsics at all (I think). They're all hidden inside static inline helper functions. I reused some cryptic variable names from simdjson. It's a bit messy but not terrible.

- It diffs against the noError conversion patch and adds additional tests.

- It's not smart enough to stop at the last valid character boundary -- it's either all-valid or it must start over with the fallback. That will have to change in order to work with the proposed noError conversions. It shouldn't be very hard, but needs thought as to the clearest and safest way to code it.

- There is no ascii fast-path yet. With this algorithm we have to be a bit more careful since a valid ascii chunk could be preceded by an incomplete sequence at the end of the previous chunk. Not too hard, just a bit more work.

- This is my first time hacking autoconf, and it still seems slightly broken, yet functional on my machine at least.

- It only needs SSE4.1, but I didn't want to create a whole new CFLAGS, so it just reuses SSE4.2 for the runtime check and the macro names. Also, it doesn't test for SSE2; it just insists on 64-bit for the runtime check. I imagine it would refuse to build on 32-bit machines if you passed it -msse42.

- There is a placeholder for Windows support, but it's not developed.

- I had to add a large number of casts to get rid of warnings in the magic constants macros. That needs some polish.

I also attached a C file that visually demonstrates every step of the algorithm, following the example in Table 9 of the paper. It contains the skeleton coding I started with and abandoned early, so it may differ from the actual patch.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
I wrote:

> [v3]
> - It's not smart enough to stop at the last valid character boundary -- it's either all-valid or it must start over with the fallback. That will have to change in order to work with the proposed noError conversions. It shouldn't be very hard, but needs thought as to the clearest and safest way to code it.

In v4, it should be able to return an accurate count of valid bytes even when the end crosses a character boundary.
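The boundary logic amounts to something like the following sketch (a hypothetical helper, not the actual v4 code): back up over trailing continuation bytes and check whether the final lead byte's sequence is complete.

```c
#include <assert.h>

/*
 * Illustrative only: given 'len' bytes that passed a coarse chunk
 * check but may end mid-character, return the length of the longest
 * prefix ending on a character boundary. Invalid sequences are
 * assumed to be caught elsewhere.
 */
static int
utf8_complete_prefix_len(const unsigned char *s, int len)
{
	int			i = len;

	/* back up over continuation bytes (10xxxxxx) */
	while (i > 0 && (s[i - 1] & 0xC0) == 0x80)
		i--;

	if (i > 0 && (s[i - 1] & 0x80) != 0)
	{
		/* s[i - 1] starts a multibyte sequence; is it complete? */
		unsigned char lead = s[i - 1];
		int			seqlen;

		if ((lead & 0xE0) == 0xC0)
			seqlen = 2;
		else if ((lead & 0xF0) == 0xE0)
			seqlen = 3;
		else
			seqlen = 4;

		if (i - 1 + seqlen > len)
			return i - 1;		/* incomplete: cut before the lead byte */
	}
	return len;
}
```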

> - This is my first time hacking autoconf, and it still seems slightly broken, yet functional on my machine at least.

It was actually completely broken if you tried to pass the special flags to configure. I redesigned this part and it seems to work now. 

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
On Mon, Feb 15, 2021 at 9:32 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Mon, Feb 15, 2021 at 9:18 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > I'm guessing that's because the unaligned access in check_ascii() is
> > expensive on this platform.

> Some possible remedies:

> 3) #ifdef out the ascii check for 32-bit platforms.

> 4) Same as the non-UTF8 case -- only check for ascii 8 bytes at a time. I'll probably try this first.

I've attached a couple patches to try on top of v4; maybe they'll help the Arm32 regression. 01 reduces the stride to 8 bytes, and 02 applies on top of v1 to disable the fallback fast path entirely on 32-bit platforms. A bit of a heavy hammer, but it'll confirm (or not) your theory about unaligned loads.

Also, I've included patches to explain more fully how I modeled non-UTF-8 performance while still using the UTF-8 tests. I think it was a useful thing to do, and I have a theory that might predict how a non-UTF8 encoding will perform with the fast path.

03A and 03B are alternatives that conflict with each other, but both apply on top of v4 (02 is not needed). Both replace the v4 fallback with the ascii fast path + pg_utf8_verifychar() in the loop, similar to utf-8 on master. 03A has a local static copy of pg_utf8_islegal(), and 03B uses the existing global function. (On x86, you can disable SSE4 by passing USE_FALLBACK_UTF8=1 to configure.)

While Clang 10 regressed for me on pure multibyte in a similar test upthread, on Linux gcc 8.4 there isn't a regression at all. IIRC, gcc wasn't as good as Clang when the API changed a few weeks ago, so its regression from v4 is still faster than master. Clang only regressed with my changes because it somehow handled master much better to begin with.

x86-64 Linux gcc 8.4

master

 chinese | mixed | ascii
---------+-------+-------
    1453 |   857 |   428

v4 (fallback verifier written as a single function)

 chinese | mixed | ascii
---------+-------+-------
     815 |   514 |    82

v4 plus addendum 03A -- emulate non-utf-8 using a copy of pg_utf8_islegal() as a static function

 chinese | mixed | ascii
---------+-------+-------
    1115 |   547 |    87

v4 plus addendum 03B -- emulate non-utf-8 using pg_utf8_islegal() as a global function

 chinese | mixed | ascii
---------+-------+-------
    1279 |   604 |    82

(I also tried the same on ppc64le Linux, gcc 4.8.5 and while not great, it never got worse than master either on pure multibyte.)

This is supposed to model the performance of a non-utf8 encoding, where we don't have a bespoke function written from scratch. Here's my theory: if an encoding's pg_*_verifychar() calls a global function such as pg_*_mblen(), it seems it won't benefit as much from an ascii fast path as one whose pg_*_verifychar() makes no function calls. I'm not sure whether a compiler can inline a global function's body into call sites in the unit where it's defined. (I haven't looked at the assembly.) But recall that you didn't commit 0002 from the earlier encoding change because it didn't perform well. I looked at that patch again, and while it inlined the pg_utf8_verifychar() call, it still called the global function pg_utf8_islegal().

If the above is anything to go by, on gcc at least, I don't think we need to worry about a regression when adding an ascii fast path to non-utf-8 multibyte encodings.

Regarding SSE, I've added an ascii fast path in my local branch, but it's not going to be as big a difference because 1) the check is more expensive in terms of branches than the C case, and 2) because the general case is so fast already, it's hard to improve upon. I just need to do some testing and cleanup on the whole thing, and that'll be ready to share.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
I wrote:

> Thanks for testing! Good, the speedup is about as much as I can hope for using plain C. In the next patch I'll go ahead and squash in the ascii fast path, using 16-byte stride, unless there are objections. I claim we can live with the regression Heikki found on an old 32-bit Arm platform since it doesn't seem to be true of Arm in general.

In v8, I've squashed the 16-byte stride into 0002. I also removed the sole holdout of hard-coded intrinsics, by putting _mm_setr_epi8 inside a variadic macro, and also did some reordering of the one-line function definitions. (As before, 0001 is not my patch, but parts of it are a prerequisite to my regression tests.)

Over in [1], I tested in situ in a COPY FROM test and found a 10% speedup with mixed ascii and multibyte in the copy code, i.e. with buffer and storage taken completely out of the picture.

Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
v9 is just a rebase.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

speed up verifying UTF-8

From
John Naylor
Date:
For v10, I've split the patch up into two parts. 0001 uses pure C everywhere. This is much smaller and easier to review, and gets us the most bang for the buck. 

One concern Heikki raised upthread is that platforms with poor unaligned-memory access will see a regression. We could easily add an #ifdef to take care of that, but I haven't done so here.

To recap: On ascii-only input with storage taken out of the picture, profiles of COPY FROM show a reduction from nearly 10% down to just over 1%. In microbenchmarks found earlier in this thread, this works out to about 7 times faster. On multibyte/mixed input, 0001 is a bit faster, but not really enough to make a difference in copy performance.

0002 adds the SSE4 implementation on x86-64, and is equally fast on all input, at the cost of greater complexity.

To reflect the split, I've changed the thread subject and the commitfest title.
--
Attachment

Re: speed up verifying UTF-8

From
Heikki Linnakangas
Date:
On 02/06/2021 19:26, John Naylor wrote:
> For v10, I've split the patch up into two parts. 0001 uses pure C 
> everywhere. This is much smaller and easier to review, and gets us the 
> most bang for the buck.
> 
> One concern Heikki raised upthread is that platforms with poor 
> unaligned-memory access will see a regression. We could easily add an 
> #ifdef to take care of that, but I haven't done so here.
> 
> To recap: On ascii-only input with storage taken out of the picture, 
> profiles of COPY FROM show a reduction from nearly 10% down to just over 
> 1%. In microbenchmarks found earlier in this thread, this works out to 
> about 7 times faster. On multibyte/mixed input, 0001 is a bit faster, 
> but not really enough to make a difference in copy performance.

Nice!

This kind of bit-twiddling is fun, so I couldn't resist tinkering with 
it, to see if we can shave some more instructions from it:

> +/* from https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord */
> +#define HAS_ZERO(chunk) ( \
> +    ((chunk) - UINT64CONST(0x0101010101010101)) & \
> +     ~(chunk) & \
> +     UINT64CONST(0x8080808080808080))
> +
> +/* Verify a chunk of bytes for valid ASCII including a zero-byte check. */
> +static inline int
> +check_ascii(const unsigned char *s, int len)
> +{
> +    uint64        half1,
> +                half2,
> +                highbits_set;
> +
> +    if (len >= 2 * sizeof(uint64))
> +    {
> +        memcpy(&half1, s, sizeof(uint64));
> +        memcpy(&half2, s + sizeof(uint64), sizeof(uint64));
> +
> +        /* If there are zero bytes, bail and let the slow path handle it. */
> +        if (HAS_ZERO(half1) || HAS_ZERO(half2))
> +            return 0;
> +
> +        /* Check if any bytes in this chunk have the high bit set. */
> +        highbits_set = ((half1 | half2) & UINT64CONST(0x8080808080808080));
> +
> +        if (!highbits_set)
> +            return 2 * sizeof(uint64);
> +        else
> +            return 0;
> +    }
> +    else
> +        return 0;
> +}

Some ideas:

1. Better to check if any high bits are set first. We care more about 
the speed of that than of detecting zero bytes, because input with high 
bits is valid but zeros are an error.

2. Since we check that there are no high bits, we can do the zero-checks 
with fewer instructions like this:

/* NB: this is only correct if 'chunk' doesn't have any high bits set */
#define HAS_ZERO(chunk) ( \
   (((chunk) + \
    UINT64CONST(0x7f7f7f7f7f7f7f7f)) & \
    UINT64CONST(0x8080808080808080)) != UINT64CONST(0x8080808080808080))

3. It's probably cheaper to perform the HAS_ZERO check just once on (half1 
| half2). We have to compute (half1 | half2) anyway.


Putting all that together:

/* Verify a chunk of bytes for valid ASCII including a zero-byte check. */
static inline int
check_ascii(const unsigned char *s, int len)
{
    uint64        half1,
                half2,
                highbits_set;
    uint64        x;

    if (len >= 2 * sizeof(uint64))
    {
        memcpy(&half1, s, sizeof(uint64));
        memcpy(&half2, s + sizeof(uint64), sizeof(uint64));

        /* Check if any bytes in this chunk have the high bit set. */
        highbits_set = ((half1 | half2) & UINT64CONST(0x8080808080808080));
        if (highbits_set)
            return 0;

        /*
         * Check if there are any zero bytes in this chunk. This is only correct
         * if there are no high bits set, but we checked that already.
         */
        x = (half1 | half2) + UINT64CONST(0x7f7f7f7f7f7f7f7f);
        x &= UINT64CONST(0x8080808080808080);
        if (x != UINT64CONST(0x8080808080808080))
            return 0;

        return 2 * sizeof(uint64);
    }
    else
        return 0;
}

In quick testing, that indeed compiles into fewer instructions. With 
GCC, there's no measurable difference in performance. But with clang, 
this version is much faster than the original, because the original 
version is much slower than when compiled with GCC. In other words, this 
version seems to avoid some clang misoptimization. I tested only with 
ASCII input, I haven't tried other cases.

What test set have you been using for performance testing this? I'd like 
to know how this version compares, and I could also try running it on my 
old raspberry pi, which is more strict about alignment.

> 0002 adds the SSE4 implementation on x86-64, and is equally fast on all 
> input, at the cost of greater complexity.

Didn't look closely, but seems reasonable at a quick glance.

- Heikki



Re: speed up verifying UTF-8

From
Greg Stark
Date:
> 3. It's probably cheaper to perform the HAS_ZERO check just once on (half1
> | half2). We have to compute (half1 | half2) anyway.

Wouldn't you have to check (half1 & half2)?



Re: speed up verifying UTF-8

From
Greg Stark
Date:
I haven't looked at the surrounding code. Are we processing all the
COPY data in one long stream, or processing each field individually? If
we're processing much more than 128 bits and are happy to detect NUL
errors only at the end, after wasting some work, then you could hoist
that has_zero check entirely out of the loop (removing the branch,
though it's probably a correctly predicted branch anyway).

Do something like:

zero_accumulator = zero_accumulator & next_chunk

in the loop and then only at the very end check for zeros in that.
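Concretely, that hoisted check might look like this in plain C (sketch only; all names are hypothetical, and note that AND-ing chunks can produce false positives, so a hit would mean "rescan byte-wise", not "error"):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * AND each 8-byte chunk into an accumulator inside the loop, and test
 * for (possible) zero bytes only once at the end. A byte that is zero
 * in any chunk stays zero in the accumulator; two nonzero bytes can
 * also AND to zero, hence "may".
 */
static int
chunks_may_contain_zero(const unsigned char *s, int len)
{
	uint64_t	zero_accumulator = ~UINT64_C(0);
	uint64_t	chunk;

	for (int i = 0; i + 8 <= len; i += 8)
	{
		memcpy(&chunk, s + i, 8);
		zero_accumulator &= chunk;
	}

	/* zero-in-word test from the bithacks page */
	return ((zero_accumulator - UINT64_C(0x0101010101010101)) &
			~zero_accumulator & UINT64_C(0x8080808080808080)) != 0;
}
```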



Re: speed up verifying UTF-8

From
John Naylor
Date:

On Thu, Jun 3, 2021 at 10:42 AM Greg Stark <stark@mit.edu> wrote:
>
> I haven't looked at the surrounding code. Are we processing all the
> COPY data in one long stream or processing each field individually? 

It happens on 64kB chunks.

> If
> we're processing much more than 128 bits and happy to detect NUL
> errors only at the end after wasting some work then you could hoist
> that has_zero check entirely out of the loop (removing the branch
> though it's probably a correctly predicted branch anyways).
>
> Do something like:
>
> zero_accumulator = zero_accumulator & next_chunk
>
> in the loop and then only at the very end check for zeros in that.

That's the approach taken in the SSE4 patch, and in fact that's the logical way to do it there. I hadn't considered doing it that way in the pure C case, but I think it's worth trying.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
John Naylor
Date:


I wrote:

> On Thu, Jun 3, 2021 at 10:42 AM Greg Stark <stark@mit.edu> wrote:
> >

> > If
> > we're processing much more than 128 bits and happy to detect NUL
> > errors only at the end after wasting some work then you could hoist
> > that has_zero check entirely out of the loop (removing the branch
> > though it's probably a correctly predicted branch anyways).
> >
> > Do something like:
> >
> > zero_accumulator = zero_accumulator & next_chunk
> >
> > in the loop and then only at the very end check for zeros in that.
>
> That's the approach taken in the SSE4 patch, and in fact that's the logical way to do it there. I hadn't considered doing it that way in the pure C case, but I think it's worth trying.

Actually, I spoke too quickly. We can't have an error accumulator in the C case because we need to return how many bytes were valid. In fact, in the SSE case, it checks the error vector at the end and then reruns with the fallback case to count the valid bytes.
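That two-phase pattern can be sketched like this (the predicate here is a trivial scalar stand-in, not the real SSE or UTF-8 check, and every name is hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in fast check: true iff the whole buffer is "valid"
 * (here: all ascii, no zero bytes). Only accumulates pass/fail. */
static bool
fast_check(const unsigned char *s, int len)
{
	for (int i = 0; i < len; i++)
		if (s[i] == 0 || s[i] >= 0x80)
			return false;
	return true;
}

/* Byte-wise fallback that counts the valid prefix. */
static int
slow_count(const unsigned char *s, int len)
{
	int			i = 0;

	while (i < len && s[i] != 0 && s[i] < 0x80)
		i++;
	return i;
}

/*
 * The fast path only knows whether an error occurred somewhere, so on
 * failure we rerun the fallback from the start to learn how many
 * bytes were valid.
 */
static int
verify(const unsigned char *s, int len)
{
	if (fast_check(s, len))
		return len;
	return slow_count(s, len);
}
```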

--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
John Naylor
Date:
On Thu, Jun 3, 2021 at 9:16 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> Some ideas:
>
> 1. Better to check if any high bits are set first. We care more about
> the speed of that than of detecting zero bytes, because input with high
> bits is valid but zeros are an error.
>
> 2. Since we check that there are no high bits, we can do the zero-checks
> with fewer instructions like this:

Both ideas make sense, and I like the shortcut we can take with the zero check. I think Greg is right that the zero check needs “half1 & half2”, so I tested with that (updated patches attached).

> What test set have you been using for performance testing this? I'd like

The microbenchmark is the same one you attached to [1], which I extended with a 95% multibyte case. With the new zero check:

clang 12.0.5 / MacOS:

master:

 chinese | mixed | ascii
---------+-------+-------
     981 |   688 |   371

0001:

 chinese | mixed | ascii
---------+-------+-------
     932 |   548 |   110

plus optimized zero check:

 chinese | mixed | ascii
---------+-------+-------
     689 |   573 |    59

It makes sense that the Chinese text case is faster since the zero check is skipped.

gcc 4.8.5 / Linux:

master:

 chinese | mixed | ascii
---------+-------+-------
    2561 |  1493 |   825

0001:

 chinese | mixed | ascii
---------+-------+-------
    2968 |  1035 |   158

plus optimized zero check:

 chinese | mixed | ascii
---------+-------+-------
    2413 |  1078 |   137

The second machine is a bit older and has an old compiler, but there is still a small speed increase. In fact, without Heikki's tweaks, 0001 regresses on multibyte.

(Note: I'm not seeing the 7x improvement I claimed for 0001 here, but that was from memory and I think that was a different machine and newer gcc. We can report a range of results as we proceed.)

[1] https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5a2f@iki.fi

--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
Heikki Linnakangas
Date:
On 03/06/2021 17:33, Greg Stark wrote:
>> 3. It's probably cheaper to perform the HAS_ZERO check just once on (half1
> | half2). We have to compute (half1 | half2) anyway.
> 
> Wouldn't you have to check (half1 & half2) ?

Ah, you're right, of course. But & is not quite right either; it will 
give false positives. That's OK from a correctness point of view here, 
because we then fall back to checking byte by byte, but I don't think 
it's a good tradeoff.

I think this works, however:

/* Verify a chunk of bytes for valid ASCII including a zero-byte check. */
static inline int
check_ascii(const unsigned char *s, int len)
{
    uint64        half1,
                half2,
                highbits_set;
    uint64        x1,
                x2;
    uint64        x;

    if (len >= 2 * sizeof(uint64))
    {
        memcpy(&half1, s, sizeof(uint64));
        memcpy(&half2, s + sizeof(uint64), sizeof(uint64));

        /* Check if any bytes in this chunk have the high bit set. */
        highbits_set = ((half1 | half2) & UINT64CONST(0x8080808080808080));
        if (highbits_set)
            return 0;

        /*
         * Check if there are any zero bytes in this chunk.
         *
         * First, add 0x7f to each byte. This sets the high bit in each byte,
         * unless it was a zero. We already checked that none of the bytes had
         * the high bit set previously, so the max value each byte can have
         * after the addition is 0x7f + 0x7f = 0xfe, and we don't need to
         * worry about carrying over to the next byte.
         */
        x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
        x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);

        /* then check that the high bit is set in each byte. */
        x = (x1 | x2);
        x &= UINT64CONST(0x8080808080808080);
        if (x != UINT64CONST(0x8080808080808080))
            return 0;

        return 2 * sizeof(uint64);
    }
    else
        return 0;
}

- Heikki



Re: speed up verifying UTF-8

From
John Naylor
Date:


On Thu, Jun 3, 2021 at 3:08 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 03/06/2021 17:33, Greg Stark wrote:
> >> 3. It's probably cheaper to perform the HAS_ZERO check just once on (half1
> > | half2). We have to compute (half1 | half2) anyway.
> >
> > Wouldn't you have to check (half1 & half2) ?
>
> Ah, you're right of course. But & is not quite right either, it will
> give false positives. That's ok from a correctness point of view here,
> because we then fall back to checking byte by byte, but I don't think
> it's a good tradeoff.

Ah, of course.

>                 /*
>                  * Check if there are any zero bytes in this chunk.
>                  *
>                  * First, add 0x7f to each byte. This sets the high bit in each byte,
>                  * unless it was a zero. We already checked that none of the bytes had
>                  * the high bit set previously, so the max value each byte can have
>                  * after the addition is 0x7f + 0x7f = 0xfe, and we don't need to
>                  * worry about carrying over to the next byte.
>                  */
>                 x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
>                 x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
>
>                 /* then check that the high bit is set in each byte. */
>                 x = (x1 | x2);
>                 x &= UINT64CONST(0x8080808080808080);
>                 if (x != UINT64CONST(0x8080808080808080))
>                         return 0;

That seems right, I'll try that and update the patch. (Forgot to attach earlier anyway)

--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
Heikki Linnakangas
Date:
On 03/06/2021 22:10, John Naylor wrote:
> On Thu, Jun 3, 2021 at 3:08 PM Heikki Linnakangas <hlinnaka@iki.fi 
> <mailto:hlinnaka@iki.fi>> wrote:
>  >                 x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
>  >                 x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
>  >
>  >                 /* then check that the high bit is set in each byte. */
>  >                 x = (x1 | x2);
>  >                 x &= UINT64CONST(0x8080808080808080);
>  >                 if (x != UINT64CONST(0x8080808080808080))
>  >                         return 0;
> 
> That seems right, I'll try that and update the patch. (Forgot to attach 
> earlier anyway)

Ugh, actually that has the same issue as before. If a byte in one half 
is zero, but the corresponding byte in the other half is not, this 
fails to detect it. Sorry for the noise.

- Heikki



Re: speed up verifying UTF-8

From
Heikki Linnakangas
Date:
On 03/06/2021 22:16, Heikki Linnakangas wrote:
> On 03/06/2021 22:10, John Naylor wrote:
>> On Thu, Jun 3, 2021 at 3:08 PM Heikki Linnakangas <hlinnaka@iki.fi
>> <mailto:hlinnaka@iki.fi>> wrote:
>>   >                 x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
>>   >                 x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
>>   >
>>   >                 /* then check that the high bit is set in each byte. */
>>   >                 x = (x1 | x2);
>>   >                 x &= UINT64CONST(0x8080808080808080);
>>   >                 if (x != UINT64CONST(0x8080808080808080))
>>   >                         return 0;
>>
>> That seems right, I'll try that and update the patch. (Forgot to attach
>> earlier anyway)
> 
> Ugh, actually that has the same issue as before. If a byte in one half
> is zero, but the corresponding byte in the other half is not, this
> fails to detect it. Sorry for the noise.

If you replace (x1 | x2) with (x1 & x2) above, I think it's correct.

- Heikki



Re: speed up verifying UTF-8

From
John Naylor
Date:
On Thu, Jun 3, 2021 at 3:22 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 03/06/2021 22:16, Heikki Linnakangas wrote:
> > On 03/06/2021 22:10, John Naylor wrote:
> >> On Thu, Jun 3, 2021 at 3:08 PM Heikki Linnakangas <hlinnaka@iki.fi
> >> <mailto:hlinnaka@iki.fi>> wrote:
> >>   >                 x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
> >>   >                 x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
> >>   >
> >>   >                 /* then check that the high bit is set in each byte. */
> >>   >                 x = (x1 | x2);
> >>   >                 x &= UINT64CONST(0x8080808080808080);
> >>   >                 if (x != UINT64CONST(0x8080808080808080))
> >>   >                         return 0;

> If you replace (x1 | x2) with (x1 & x2) above, I think it's correct.

After looking at it again with fresh eyes, I agree this is correct. I modified the regression tests to pad the input bytes with ascii so that the code path that works on 16 bytes at a time is tested. I use both UTF-8 input tables for some of the additional tests. There is a de facto requirement that the descriptions be unique across both of the input tables. That could be done more elegantly, but I wanted to keep things simple for now.

v11-0001 is an improvement over v10:

clang 12.0.5 / MacOS:

master:

 chinese | mixed | ascii
---------+-------+-------
     975 |   686 |   369

v10-0001:

 chinese | mixed | ascii
---------+-------+-------
     930 |   549 |   109

v11-0001:

 chinese | mixed | ascii
---------+-------+-------
     687 |   440 |    64


gcc 4.8.5 / Linux (older machine)

master:

 chinese | mixed | ascii
---------+-------+-------
    2559 |  1495 |   825

v10-0001:

 chinese | mixed | ascii
---------+-------+-------
    2966 |  1034 |   156

v11-0001:

 chinese | mixed | ascii
---------+-------+-------
    2242 |   824 |   140

Previous testing on POWER8 and Arm64 leads me to expect similar results there as well.

I also looked again at 0002 and decided I wasn't quite happy with the test coverage. Previously, the code padded out a short input with ascii so that the 16-bytes-at-a-time code path was always exercised. However, that required some finicky complexity and still wasn't adequate. For v11, I ripped that out and put the responsibility on the regression tests to make sure the various code paths are exercised.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
Heikki Linnakangas
Date:
On 03/06/2021 21:58, John Naylor wrote:
> 
>  > What test set have you been using for performance testing this? I'd like
> 
> The microbenchmark is the same one you attached to [1], which I extended 
> with a 95% multibyte case.

Could you share the exact test you're using? I'd like to test this on my 
old raspberry pi, out of curiosity.

- Heikki



Re: speed up verifying UTF-8

From
John Naylor
Date:
On Mon, Jun 7, 2021 at 8:24 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 03/06/2021 21:58, John Naylor wrote:
> > The microbenchmark is the same one you attached to [1], which I extended
> > with a 95% multibyte case.
>
> Could you share the exact test you're using? I'd like to test this on my
> old raspberry pi, out of curiosity.

Sure, attached.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment

Re: speed up verifying UTF-8

From
Heikki Linnakangas
Date:
On 07/06/2021 15:39, John Naylor wrote:
> On Mon, Jun 7, 2021 at 8:24 AM Heikki Linnakangas <hlinnaka@iki.fi 
> <mailto:hlinnaka@iki.fi>> wrote:
>  >
>  > On 03/06/2021 21:58, John Naylor wrote:
>  > > The microbenchmark is the same one you attached to [1], which I 
> extended
>  > > with a 95% multibyte case.
>  >
>  > Could you share the exact test you're using? I'd like to test this on my
>  > old raspberry pi, out of curiosity.
> 
> Sure, attached.
> 
> --
> John Naylor
> EDB: http://www.enterprisedb.com <http://www.enterprisedb.com>
> 
Results from chipmunk, my first generation Raspberry Pi:

Master:

  chinese | mixed | ascii
---------+-------+-------
    25392 | 16287 | 10295
(1 row)

v11-0001-Rewrite-pg_utf8_verifystr-for-speed.patch:

  chinese | mixed | ascii
---------+-------+-------
    17739 | 10854 |  4121
(1 row)

So that's good.

What is the worst-case scenario for this algorithm? Something where the 
new fast ASCII check never helps, but which is as fast as possible with 
the old code. For that, I added a repeating pattern of '123456789012345ä' to 
the test set (these results are from my Intel laptop, not the raspberry pi):

Master:

  chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     1333 |   757 |   410 |    573
(1 row)

v11-0001-Rewrite-pg_utf8_verifystr-for-speed.patch:

  chinese | mixed | ascii | mixed2
---------+-------+-------+--------
      942 |   470 |    66 |   1249
(1 row)

So there's a regression with that input. Maybe that's acceptable, this 
is the worst case, after all. Or you could tweak check_ascii for a 
different performance tradeoff, by checking the two 64-bit words 
separately and returning "8" if the failure happens in the second word. 
And I haven't tried the SSE patch yet, maybe that compensates for this.

- Heikki



Re: speed up verifying UTF-8

From
John Naylor
Date:

On Wed, Jun 9, 2021 at 7:02 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> What is the worst case scenario for this algorithm? Something where the
> new fast ASCII check never helps, but is as fast as possible with the
> old code. For that, I added a repeating pattern of '123456789012345ä' to
> the test set (these results are from my Intel laptop, not the raspberry pi):
>
> Master:
>
>   chinese | mixed | ascii | mixed2
> ---------+-------+-------+--------
>      1333 |   757 |   410 |    573
> (1 row)
>
> v11-0001-Rewrite-pg_utf8_verifystr-for-speed.patch:
>
>   chinese | mixed | ascii | mixed2
> ---------+-------+-------+--------
>       942 |   470 |    66 |   1249
> (1 row)

I get a much smaller regression on my laptop with clang 12:

master:

 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     978 |   685 |   370 |    452

v11-0001:

 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     686 |   438 |    64 |    595

> So there's a regression with that input. Maybe that's acceptable, this
> is the worst case, after all. Or you could tweak check_ascii for a
> different performance tradeoff, by checking the two 64-bit words
> separately and returning "8" if the failure happens in the second word.

For v12 (unformatted and without 0002 rebased) I tried the following:
--
highbits_set = (half1) & UINT64CONST(0x8080808080808080);
if (highbits_set)
     return 0;

x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
x1 &= UINT64CONST(0x8080808080808080);
if (x1 != UINT64CONST(0x8080808080808080))
     return 0;

/* now we know we have at least 8 bytes of valid ascii, so if either of the following tests fails, return that much */

highbits_set = (half2) & UINT64CONST(0x8080808080808080);
if (highbits_set)
     return sizeof(uint64);

x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
x2 &= UINT64CONST(0x8080808080808080);
if (x2 != UINT64CONST(0x8080808080808080))
     return sizeof(uint64);

return 2 * sizeof(uint64);
--
and got this:

 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     674 |   499 |   170 |    421

Pure ascii is significantly slower, but the regression is gone.

I used the string repeat('123456789012345ä', 3647) to match the ~62000 bytes in the other strings (62000 / 17 ≈ 3647).

> And I haven't tried the SSE patch yet, maybe that compensates for this.

I would expect this case to be identical to all-multibyte. The worst case for SSE might be alternating 16-byte chunks of ascii-only and chunks of multibyte, since that's one of the few places it branches. In simdjson, they check ascii 64 bytes at a time ((c1 | c2) | (c3 | c4)) and check only the previous block's "chunk 4" for incomplete sequences at the end. It's a bit messier, so I haven't done it, but it's an option.
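As an illustration of that combine-then-branch idea in scalar form (uint64 loads standing in for simdjson's SSE vectors; the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Scalar sketch of simdjson's ((c1 | c2) | (c3 | c4)) ascii test:
 * OR the chunks together and branch only once per 32-byte block,
 * since a high bit anywhere survives the ORs.
 */
static int
is_ascii_32(const unsigned char *s)
{
	uint64_t	c1, c2, c3, c4;

	memcpy(&c1, s, 8);
	memcpy(&c2, s + 8, 8);
	memcpy(&c3, s + 16, 8);
	memcpy(&c4, s + 24, 8);

	return (((c1 | c2) | (c3 | c4)) &
			UINT64_C(0x8080808080808080)) == 0;
}
```

The per-block branch count is what matters here: one test per block instead of one per chunk.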

Also, if SSE is accepted into the tree, then the C fallback is only important on platforms like PowerPC64 and Arm64, so we can make the tradeoff by testing those more carefully. I'll test on PowerPC soon.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
John Naylor
Date:
I wrote:

> Also, if SSE is accepted into the tree, then the C fallback is only important on platforms like PowerPC64 and Arm64, so we can make the tradeoff by testing those more carefully. I'll test on PowerPC soon.

I got around to testing on POWER8 / Linux / gcc 4.8.5 and found a regression in the mixed2 case in v11. v12 improves that, at the cost of giving back some of the ascii-case improvement (5x over master, down from 8x).

master:
 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
    2966 |  1525 |   871 |   1474

v11-0001:
 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
    1030 |   644 |   102 |   1760

v12-0001:
 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     977 |   632 |   168 |   1113

--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
John Naylor
Date:
I still wasn't quite happy with the churn in the regression tests, so for v13 I gave up on using both the existing utf8 table and my new one for the "padded input" tests, and instead just copied the NUL byte test into the new table. Also added a primary key to make sure the padded test won't give weird results if a new entry has a duplicate description.

I came up with "highbit_carry" as a more descriptive variable name than "x", but that doesn't matter a whole lot.

It also occurred to me that if we're going to check one 8-byte chunk at a time (like v12 does), maybe it's only worth loading 8 bytes at a time. An earlier version did this, but without the recent tweaks. The worst-case scenario now might be different from the one with 16 bytes, but for now I just tested the previous worst case (mixed2). I only tested on ppc64le, since I'm hoping x86 will get the SIMD algorithm (I'm holding off rebasing 0002 until 0001 settles down).

Power8, Linux, gcc 4.8

master:
 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
    2952 |  1520 |   871 |   1473

v11:
 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
    1015 |   641 |   102 |   1636

v12:
 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     964 |   629 |   168 |   1069

v13:
 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     954 |   643 |   202 |   1046

v13 is not that much different from v12, but has the nice property of simpler code. Neither is as nice as v11 for ascii, but they don't regress on v11's worst case. I'm leaning towards v13 for the fallback.

--
Attachment

Re: speed up verifying UTF-8

From
Heikki Linnakangas
Date:
On 29/06/2021 14:20, John Naylor wrote:
> I still wasn't quite happy with the churn in the regression tests, so 
> for v13 I gave up on using both the existing utf8 table and my new one 
> for the "padded input" tests, and instead just copied the NUL byte test 
> into the new table. Also added a primary key to make sure the padded 
> test won't give weird results if a new entry has a duplicate description.
> 
> I came up with "highbit_carry" as a more descriptive variable name than 
> "x", but that doesn't matter a whole lot.
> 
> It also occurred to me that if we're going to check one 8-byte chunk at 
> a time (like v12 does), maybe it's only worth it to load 8 bytes at a 
> time. An earlier version did this, but without the recent tweaks. The 
> worst-case scenario now might be different from the one with 16-bytes, 
> but for now just tested the previous worst case (mixed2).

I tested the new worst case scenario on my laptop:

gcc master:

  chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     1311 |   758 |   405 |     583 |    725


gcc v13:

  chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
      956 |   472 |   160 |     572 |    939


mixed16 is the same as "mixed2" in the previous rounds, with 
'123456789012345ä' as the repeating string, and mixed8 uses '1234567ä', 
which I believe is the worst case for patch v13. So v13 is somewhat 
slower than master in the worst case.

Hmm, there's one more simple trick we can do: We can have a separate 
fast-path version of the loop when there are at least 8 bytes of input 
left, skipping all the length checks. With that:

gcc v14:
  chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
      737 |   412 |    94 |     476 |    725


All the above numbers were with gcc 10.2.1. For completeness, with clang 
11.0.1-2 I got:

clang master:
  chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     1044 |   724 |   403 |     930 |    603
(1 row)

clang v13:
  chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
      596 |   445 |    79 |     417 |    715
(1 row)


clang v14:
  chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
      600 |   337 |    93 |     318 |    511

Attached is patch v14 with that optimization. It needs some cleanup, I 
just hacked it up quickly for performance testing.

- Heikki

Attachment

Re: speed up verifying UTF-8

From
John Naylor
Date:
On Wed, Jun 30, 2021 at 7:18 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> Hmm, there's one more simple trick we can do: We can have a separate
> fast-path version of the loop when there are at least 8 bytes of input
> left, skipping all the length checks. With that:

Good idea, and the numbers look good on Power8 / gcc 4.8 as well:

master:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2951 |  1521 |   871 |    1473 |   1508

v13:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     949 |   642 |   203 |    1046 |   1818

v14:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     887 |   607 |   179 |     776 |   1325


I don't think the new structuring will pose any challenges for rebasing 0002, either. This might need some experimentation, though:

+ * Subroutine of pg_utf8_verifystr() to check one char. Returns the length of the
+ * character at *s in bytes, or 0 on invalid input or premature end of input.
+ *
+ * XXX: could this be combined with pg_utf8_verifychar above?
+ */
+static inline int
+pg_utf8_verify_one(const unsigned char *s, int len)

It seems like it would be easy to have pg_utf8_verify_one in my proposed pg_utf8.h header and replace the body of pg_utf8_verifychar with it.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
John Naylor
Date:
I wrote:

> I don't think the new structuring will pose any challenges for rebasing 0002, either. This might need some experimentation, though:
>
> + * Subroutine of pg_utf8_verifystr() to check one char. Returns the length of the
> + * character at *s in bytes, or 0 on invalid input or premature end of input.
> + *
> + * XXX: could this be combined with pg_utf8_verifychar above?
> + */
> +static inline int
> +pg_utf8_verify_one(const unsigned char *s, int len)
>
> It seems like it would be easy to have pg_utf8_verify_one in my proposed pg_utf8.h header and replace the body of pg_utf8_verifychar with it.

0001: I went ahead and tried this for v15, and also attempted some clean-up:

- Rename pg_utf8_verify_one to pg_utf8_verifychar_internal.
- Have pg_utf8_verifychar_internal return -1 for invalid input to match other functions in the file. We could also do this for check_ascii, but it's not quite the same thing, because the string could still have valid bytes in it, just not enough to advance the pointer by the stride length.
- Remove hard-coded numbers (not wedded to this).

- Use a call to pg_utf8_verifychar in the slow path.
- Reduce pg_utf8_verifychar to thin wrapper around pg_utf8_verifychar_internal.

The last two aren't strictly necessary, but it prevents bloating the binary in the slow path, and aids readability. For 0002, this required putting pg_utf8_verifychar* in src/port. (While writing this I noticed I neglected to explain that with a comment, though)

Feedback welcome on any of the above.

Since by now it hardly resembles the simdjson (or Fuchsia for that matter) fallback that it took inspiration from, I've removed that mention from the commit message.

0002: Just a rebase to work with the above. One possible review point: We don't really need to have separate control over whether to use special instructions for CRC and UTF-8. It should probably be just one configure knob, but having them separate is perhaps easier to review.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
Amit Khandekar
Date:
On Tue, 13 Jul 2021 at 01:15, John Naylor <john.naylor@enterprisedb.com> wrote:
> > It seems like it would be easy to have pg_utf8_verify_one in my proposed pg_utf8.h header and replace the body of pg_utf8_verifychar with it.
>
> 0001: I went ahead and tried this for v15, and also attempted some clean-up:
>
> - Rename pg_utf8_verify_one to pg_utf8_verifychar_internal.
> - Have pg_utf8_verifychar_internal return -1 for invalid input to match other functions in the file. We could also do this for check_ascii, but it's not quite the same thing, because the string could still have valid bytes in it, just not enough to advance the pointer by the stride length.
> - Remove hard-coded numbers (not wedded to this).
>
> - Use a call to pg_utf8_verifychar in the slow path.
> - Reduce pg_utf8_verifychar to thin wrapper around pg_utf8_verifychar_internal.

- check_ascii() seems to be used only for 64-bit chunks. So why not
remove the len argument and the len <= sizeof(int64) checks inside the
function. We can rename it to check_ascii64() for clarity.
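A 64-bit-only version along these lines might look like the following (my sketch, based on the carry trick in the fallback patch; like check_ascii, it also rejects embedded zero bytes):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of a 64-bit-only check_ascii ("check_ascii64"): true iff all
 * eight bytes are ASCII and none is a zero byte. The zero-byte test uses
 * the carry trick: adding 0x7F to a byte carries into its high bit
 * exactly when the byte is nonzero.
 */
static bool
check_ascii64(uint64_t chunk)
{
	if (chunk & UINT64_C(0x8080808080808080))
		return false;			/* some byte has its high bit set */

	if (((chunk + UINT64_C(0x7F7F7F7F7F7F7F7F)) &
		 UINT64_C(0x8080808080808080)) != UINT64_C(0x8080808080808080))
		return false;			/* some byte is zero */

	return true;
}
```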

- I was thinking, why not have a pg_utf8_verify64() that processes
64-bit chunks (or a 32-bit version). In check_ascii(), we anyway
extract a 64-bit chunk from the string. We can use the same chunk to
extract the required bits from a two byte char or a 4 byte char. This
way we can avoid extraction of separate bytes like b1 = *s; b2 = s[1]
etc. More importantly, we can avoid the separate continuation-char
checks for each individual byte. Additionally, we can try to simplify
the subsequent overlong or surrogate char checks. Something like this
:

int pg_utf8_verifychar_32(uint32 chunk)
{
    int        len, l;

    /* l is the byte length of the character at the top of the chunk */
    for (len = sizeof(chunk); len > 0; len -= l, chunk <<= l * BITS_PER_BYTE)
    {
        /* Is 2-byte lead: 110x.xxxx */
        if ((chunk & 0xE0000000) == 0xC0000000)
        {
            l = 2;
            /* .......  .......  */
        }
        /* Is 3-byte lead: 1110.xxxx */
        else if ((chunk & 0xF0000000) == 0xE0000000)
        {
            l = 3;
            if (len < l)
                break;

            /* b2 and b3 should be continuation bytes */
            if ((chunk & 0x00C0C000) != 0x00808000)
                return sizeof(chunk) - len;

            switch (chunk & 0xFF200000)
            {
                /*
                 * check 3-byte overlong: 1110.0000 1001.xxxx 10xx.xxxx, i.e.
                 * (b1 == 0xE0 && b2 < 0xA0). We already know b2 is of the
                 * form 10xx.xxxx since it's a continuation char. The
                 * additional condition b2 <= 0x9F means it is of the form
                 * 100x.xxxx, i.e. either 1000.xxxx or 1001.xxxx. So just
                 * verify that it is xx0x.xxxx.
                 */
                case 0xE0000000:
                    return sizeof(chunk) - len;

                /*
                 * check surrogate: 1110.1101 101x.xxxx 10xx.xxxx, i.e.
                 * (b1 == 0xED && b2 > 0x9F). Here, > 0x9F means either
                 * 1010.xxxx, 1011.xxxx, 1100.xxxx, or 1110.xxxx. The last
                 * two are not possible because b2 is a continuation char,
                 * so it has to be one of the first two. So just verify
                 * that it is xx1x.xxxx.
                 */
                case 0xED200000:
                    return sizeof(chunk) - len;
                default:
                    ;
            }
        }
        /* Is 4-byte lead: 1111.0xxx */
        else if ((chunk & 0xF8000000) == 0xF0000000)
        {
            /* .........  */
            l = 4;
        }
        else
            return sizeof(chunk) - len;
    }
    return sizeof(chunk) - len;
}



Re: speed up verifying UTF-8

From
John Naylor
Date:
On Thu, Jul 15, 2021 at 1:10 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

> - check_ascii() seems to be used only for 64-bit chunks. So why not
> remove the len argument and the len <= sizeof(int64) checks inside the
> function. We can rename it to check_ascii64() for clarity.

Thanks for taking a look!

Well yes, but there's nothing so intrinsic to 64 bits that the name needs to reflect that. Earlier versions worked on 16 bytes at a time. The compiler will optimize away the len check, but we could replace it with an assert instead.

> - I was thinking, why not have a pg_utf8_verify64() that processes
> 64-bit chunks (or a 32-bit version). In check_ascii(), we anyway
> extract a 64-bit chunk from the string. We can use the same chunk to
> extract the required bits from a two byte char or a 4 byte char. This
> way we can avoid extraction of separate bytes like b1 = *s; b2 = s[1]
> etc.

Loading bytes from L1 is really fast -- I wouldn't even call it "extraction".

> More importantly, we can avoid the separate continuation-char
> checks for each individual byte.

On a pipelined superscalar CPU, I wouldn't expect it to matter in the slightest.

> Additionally, we can try to simplify
> the subsequent overlong or surrogate char checks. Something like this

My recent experience with itemptrs has made me skeptical of this kind of thing, but the idea was interesting enough that I couldn't resist trying it out. I have two attempts, which are attached as v16*.txt and apply independently. They are rough, and some comments are now lies. To simplify the constants, I do shift down to uint32, and I didn't bother working around that. v16alpha regressed on worst-case input, so for v16beta I went back to earlier coding for the one-byte ascii check. That helped, but it's still slower than v14.

That was not unexpected, but I was mildly shocked to find out that v15 is also slower than the v14 that Heikki posted. The only non-cosmetic difference is using pg_utf8_verifychar_internal within pg_utf8_verifychar. I'm not sure why it would make such a big difference here. The numbers on Power8 / gcc 4.8 (little endian):

HEAD:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2951 |  1521 |   871 |    1474 |   1508

v14:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     885 |   607 |   179 |     774 |   1325

v15:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1085 |   671 |   180 |    1032 |   1799

v16alpha:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1268 |   822 |   180 |    1410 |   2518

v16beta:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1096 |   654 |   182 |     814 |   1403


As it stands now, for v17 I'm inclined to go back to v15, but without the attempt at being clever that seems to have slowed it down from v14.

Any interest in testing on 64-bit Arm?

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
John Naylor
Date:
I wrote:

> To simplify the constants, I do shift down to uint32, and I didn't bother working around that. v16alpha regressed on worst-case input, so for v16beta I went back to earlier coding for the one-byte ascii check. That helped, but it's still slower than v14.

It occurred to me that I could rewrite the switch test into simple comparisons, like I already had for the 2- and 4-byte lead cases. While at it, I folded the leading byte and continuation tests into a single operation, like this:

/* 3-byte lead with two continuation bytes */
else if ((chunk & 0xF0C0C00000000000) == 0xE080800000000000)

...and also tried using 64-bit constants to avoid shifting. Still didn't quite beat v14, but got pretty close:

> The numbers on Power8 / gcc 4.8 (little endian):
>
> HEAD:
>
>  chinese | mixed | ascii | mixed16 | mixed8
> ---------+-------+-------+---------+--------
>     2951 |  1521 |   871 |    1474 |   1508
>
> v14:
>
>  chinese | mixed | ascii | mixed16 | mixed8
> ---------+-------+-------+---------+--------
>      885 |   607 |   179 |     774 |   1325

v16gamma:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     952 |   632 |   180 |     800 |   1333

A big-endian 64-bit platform just might shave enough cycles to beat v14 this way... or not.
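The fused test generalizes to the other sequence lengths. For illustration, here are my guesses at the full set of masks (not from the patch; the chunk holds the next 8 input bytes with the sequence starting at the most significant byte, and the overlong/surrogate checks are omitted):

```c
#include <stdint.h>

/*
 * Sketch: return the claimed length of the multibyte sequence at the top
 * of the chunk, folding lead-byte and continuation tests into a single
 * mask-and-compare per length. Returns 0 for ASCII or invalid input.
 * Overlong and surrogate checks are intentionally omitted.
 */
static int
lead_len_fused(uint64_t chunk)
{
	/* 2-byte lead 110x.xxxx + one continuation 10xx.xxxx */
	if ((chunk & UINT64_C(0xE0C0000000000000)) ==
		UINT64_C(0xC080000000000000))
		return 2;
	/* 3-byte lead 1110.xxxx + two continuations */
	if ((chunk & UINT64_C(0xF0C0C00000000000)) ==
		UINT64_C(0xE080800000000000))
		return 3;
	/* 4-byte lead 1111.0xxx + three continuations */
	if ((chunk & UINT64_C(0xF8C0C0C000000000)) ==
		UINT64_C(0xF080808000000000))
		return 4;
	return 0;
}
```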

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
Vladimir Sitnikov
Date:
Have you considered shift-based DFA for a portable implementation https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 ?
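For readers new to the technique, here is a toy two-state version (my illustration of the mechanism, not the UTF-8 table from the gist): each input class packs all of its state transitions into one 64-bit word, so one DFA step is just a table load plus a variable shift, with no branches on the state.

```c
#include <stdint.h>

/*
 * Toy shift-based DFA: states are encoded as bit offsets (S0 = 0, S1 = 6).
 * This machine is in state S1 exactly when the last byte seen was 'b'.
 */
enum
{
	S0 = 0,
	S1 = 6,
	DFA_MASK = 63
};

static int
ends_with_b(const char *s)
{
	/* row for 'a': S0 -> S0, S1 -> S0; row for 'b': S0 -> S1, S1 -> S1 */
	const uint64_t row_a = ((uint64_t) S0 << S0) | ((uint64_t) S0 << S1);
	const uint64_t row_b = ((uint64_t) S1 << S0) | ((uint64_t) S1 << S1);
	uint64_t	state = S0;

	for (; *s; s++)
	{
		uint64_t	row = (*s == 'b') ? row_b : row_a;

		/* one transition: shift this class's row by the current state */
		state = row >> (state & DFA_MASK);
	}
	return (state & DFA_MASK) == S1;
}
```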

Vladimir

Re: speed up verifying UTF-8

From
John Naylor
Date:
On Fri, Jul 16, 2021 at 1:44 AM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
>
> Have you considered shift-based DFA for a portable implementation https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 ?

I did consider some kind of DFA a while back and it was too slow.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
John Naylor
Date:
My v16 experimental patches were a bit messy, so I've organized an experimental series that applies cumulatively, to try to trace the effects of various things.

v17-0001 is the same as v14. 0002 is a stripped-down implementation of Amit's chunk idea for multibyte, and it's pretty good on x86. On Power8, not so much. 0003 and 0004 are shot-in-the-dark guesses to improve it on Power8, with some success, but end up making x86 weirdly slow, so I'm afraid that could happen on other platforms as well.

v14 still looks like the safe bet for now. It also has the advantage of using the same function both in and out of the fastpath, which will come in handy when moving it to src/port as the fallback for SSE.

Power8, gcc 4.8:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v17-0001:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     888 |   607 |   179 |     777 |   1328

v17-0002:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1017 |   718 |   156 |    1213 |   2138

v17-0003:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1205 |   662 |   180 |     767 |   1256

v17-0004:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1085 |   660 |   224 |     868 |   1369


Macbook x86, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v17-0001:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     674 |   346 |    78 |     309 |    504

v17-0002:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     516 |   324 |    78 |     331 |    544

v17-0003:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     621 |   537 |   323 |     413 |    602

v17-0004:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     576 |   439 |   154 |     557 |    915

--

Re: speed up verifying UTF-8

From
John Naylor
Date:
I wrote:

> On Fri, Jul 16, 2021 at 1:44 AM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
> >
> > Have you considered shift-based DFA for a portable implementation https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 ?
>
> I did consider some kind of DFA a while back and it was too slow.

I took a closer look at this "shift-based DFA", and it seemed pretty straightforward to implement this on top of my DFA attempt from some months ago. The DFA technique is not a great fit with our API, since we need to return how many bytes we found valid. On x86 (not our target for the fallback, but convenient to test) all my attempts were either worse than HEAD in multiple cases, or showed no improvement for the important ASCII case. On Power8, it's more compelling, and competitive with v14, so I'll characterize it on that platform as I describe the patch series:

0001 is a pure DFA, and has decent performance on multibyte, but terrible on ascii.
0002 dispatches on the leading byte category, unrolls the DFA loop according to how many valid bytes we need, and only checks the DFA state afterwards. It's good on multibyte (3-byte, at least) but still terrible on ascii.
0003 adds a 1-byte ascii fast path -- while robust on all inputs, it still regresses a bit on ascii.
0004 uses the same 8-byte ascii check as previous patches do.
0005 and 0006 use combinations of 1- and 8-byte ascii checks similar to in v17.

0005 seems the best on Power8, and is very close to v14. FWIW, v14's measurements seem lucky and fragile -- if I change any little thing, even

- return -1;
+ return 0;

it easily loses 100-200ms on non-pure-ascii tests. That said, v14 still seems the logical choice, unless there is some further tweak on top of v17 or v18 that gives some non-x86 platform a significant boost.

Power8, gcc 4.8:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v18-0001:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1257 |  1681 |  1385 |    1744 |   2018

v18-0002:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     951 |  1381 |  1217 |    1469 |   1172

v18-0003:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     911 |  1111 |   942 |    1112 |    865

v18-0004:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     987 |   730 |   222 |    1325 |   2306

v18-0005:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     962 |   664 |   180 |     928 |   1179

v18-0006:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     908 |   663 |   244 |    1026 |   1464

and for comparison,

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     888 |   607 |   179 |     777 |   1328

v17-0003:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1205 |   662 |   180 |     767 |   1256


Macbook, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v18-0001:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1334 |  2713 |  2802 |    2665 |   2541

v18-0002:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     733 |  1212 |  1064 |    1034 |   1007

v18-0003:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     653 |   560 |   370 |     420 |    465

v18-0004:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     574 |   402 |    88 |     584 |   1033

v18-0005:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1345 |   730 |   334 |     578 |    909

v18-0006:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     674 |   485 |   153 |     594 |    989

and for comparison,

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     674 |   346 |    78 |     309 |    504

v17-0002:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     516 |   324 |    78 |     331 |    544

--
Attachment

Re: speed up verifying UTF-8

From
Amit Khandekar
Date:
On Sat, 17 Jul 2021 at 04:48, John Naylor <john.naylor@enterprisedb.com> wrote:
> v17-0001 is the same as v14. 0002 is a stripped-down implementation of Amit's
> chunk idea for multibyte, and it's pretty good on x86. On Power8, not so
> much. 0003 and 0004 are shot-in-the-dark guesses to improve it on Power8,
> with some success, but end up making x86 weirdly slow, so I'm afraid that
> could happen on other platforms as well.

Thanks for trying the chunk approach. I tested your v17 versions on
Arm64. For the chinese characters, v17-0002 gave some improvement over
v14. But for all the other character sets, there was around 10%
degradation w.r.t. v14. I thought maybe the hton64 call and memcpy()
for each mb character might be the culprit, so I tried iterating over
all the characters in the chunk within the same pg_utf8_verify_one()
function by left-shifting the bits. But that worsened the figures. So
I gave up that idea.

Here are the numbers on Arm64 :

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1781 |  1095 |   628 |     944 |   1151

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     852 |   484 |   144 |     584 |    971


v17-0001+2:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     731 |   520 |   152 |     645 |   1118


Haven't looked at your v18 patch set yet.



Re: speed up verifying UTF-8

From
Vladimir Sitnikov
Date:
Thank you,

It looks like it is important to have shrx on x86, which appears only when -march=x86-64-v3 is used (see https://github.com/golang/go/issues/47120#issuecomment-877629712 ).
Just in case: I know x86 would not use the fallback implementation; however, the sole purpose of shift-based DFA is to fold all the data-dependent ops into a single instruction.

An alternative idea: should we optimize for validation of **valid** inputs rather than optimizing the worst case?
In other words, what if the implementation processes all characters always and uses a slower method in case of validation failure?
I would guess it is more important to be faster with accepting valid input rather than "faster to reject invalid input".

In shift-DFA approach, it would mean the validation loop would be simpler with fewer branches (see https://godbolt.org/z/hhMxhT6cf ):

static inline int
pg_is_valid_utf8(const unsigned char *s, const unsigned char *end) {
    uint64 class;
    uint64 state = BGN;
    while (s < end) { // clang unrolls the loop
        class = ByteCategory[*s++];
        state = class >> (state & DFA_MASK); // <-- note that AND is fused into the shift operation
    }
    return (state & DFA_MASK) != ERR;
}

Note: GCC does not seem to unroll "while(s<end)" loop by default, so manual unroll might be worth trying:

static inline int
pg_is_valid_utf8(const unsigned char *s, const unsigned char *end) {
    uint64 class;
    uint64 state = BGN;
    while (s + 4 <= end) {
        for(int i = 0; i < 4; i++) {
            class = ByteCategory[*s++];
            state = class >> (state & DFA_MASK);
        }
    }
    while(s < end) {
        class = ByteCategory[*s++];
        state = class >> (state & DFA_MASK);
    }
    return (state & DFA_MASK) != ERR;
}

----

static int pg_utf8_verifystr2(const unsigned char *s, int len) {
    if (pg_is_valid_utf8(s, s + len)) { // fast path: the whole string is valid, just accept it
        return len;
    }
    // slow path: the string is not valid, perform a slower analysis
    return ....;
}

Vladimir

Re: speed up verifying UTF-8

From
John Naylor
Date:
On Mon, Jul 19, 2021 at 9:43 AM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:

> It looks like it is important to have shrx for x86 which appears only when -march=x86-64-v3 is used (see https://github.com/golang/go/issues/47120#issuecomment-877629712 ).
> Just in case: I know x86 wound not use fallback implementation, however, the sole purpose of shift-based DFA is to fold all the data-dependent ops into a single instruction.

I saw mention of that instruction, but didn't understand how important it was, thanks.

> An alternative idea: should we optimize for validation of **valid** inputs rather than optimizing the worst case?
> In other words, what if the implementation processes all characters always and uses a slower method in case of validation failure?
> I would guess it is more important to be faster with accepting valid input rather than "faster to reject invalid input".

> static int pg_utf8_verifystr2(const unsigned char *s, int len) {
>     if (pg_is_valid_utf8(s, s+len)) { // fast path: if string is valid, then just accept it
>         return s + len;
>     }
>     // slow path: the string is not valid, perform a slower analysis
>     return s + ....;
> }

That might be workable. We have to be careful because in COPY FROM, validation is performed on 64kB chunks, and the boundary could fall in the middle of a multibyte sequence. In the SSE version, there is this comment:

+ /*
+ * NB: This check must be strictly greater-than, otherwise an invalid byte
+ * at the end might not get detected.
+ */
+ while (len > sizeof(__m128i))

...which should have more detail on this.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
John Naylor
Date:
> On Mon, Jul 19, 2021 at 9:43 AM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:

> > An alternative idea: should we optimize for validation of **valid** inputs rather than optimizing the worst case?
> > In other words, what if the implementation processes all characters always and uses a slower method in case of validation failure?
> > I would guess it is more important to be faster with accepting valid input rather than "faster to reject invalid input".
>
> > static int pg_utf8_verifystr2(const unsigned char *s, int len) {
> >     if (pg_is_valid_utf8(s, s+len)) { // fast path: if string is valid, then just accept it
> >         return s + len;
> >     }
> >     // slow path: the string is not valid, perform a slower analysis
> >     return s + ....;
> > }

This turned out to be a really good idea (v19 attached):

Power8, gcc 4.8:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     888 |   607 |   179 |     777 |   1328

v19:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     809 |   472 |   223 |     558 |    805

x86 Macbook, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     674 |   346 |    78 |     309 |    504

v19:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     379 |   181 |    94 |     219 |    376

Note that the branchy code's worst case (mixed8) is here the same speed as multibyte. With Vladimir's idea * , we call check_ascii only every 8 bytes of input, not every time we verify one multibyte character. Also, we only have to check the DFA state every time we loop over 8 bytes, not every time we step through the DFA. That means we have to walk backwards at the end to find the last leading byte, but the SSE code already knew how to do that, so I used that logic here in the caller, which will allow some simplification of how the SSE code returns.
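The backward walk at the end amounts to something like this (a sketch; the function name is mine, and the SSE code's version may differ in details):

```c
/*
 * Step back from s, which may point into the middle of a multibyte
 * sequence, to the nearest byte that is not a continuation byte
 * (10xx.xxxx), so the caller can re-verify from the last lead byte.
 */
static const unsigned char *
backtrack_to_lead(const unsigned char *start, const unsigned char *s)
{
	while (s > start && (*s & 0xC0) == 0x80)
		s--;
	return s;
}
```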

The state check is likely why the ascii case is slightly slower than v14. We could go back to checking ascii 16 bytes at a time, since there's little penalty for doing so.

* (Greg was thinking the same thing upthread, but I don't think the branchy code I posted at the time could have taken advantage of this)

I'm pretty confident this improvement is architecture-independent. Next month I'll clean this up and rebase the SSE patch over this.

I wrote:

> + /*
> + * NB: This check must be strictly greater-than, otherwise an invalid byte
> + * at the end might not get detected.
> + */
> + while (len > sizeof(__m128i))

Note to self: I actually think this isn't needed anymore since I changed how the SSE code deals with remainder sequences at the end.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
Thomas Munro
Date:
On Sat, Mar 13, 2021 at 4:37 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
> On Fri, Mar 12, 2021 at 9:14 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > I was not thinking about auto-vectorizing the code in
> > pg_validate_utf8_sse42(). Rather, I was considering auto-vectorization
> > inside the individual helper functions that you wrote, such as
> > _mm_setr_epi8(), shift_right(), bitwise_and(), prev1(), splat(),
>
> If the PhD holders who came up with this algorithm thought it possible to do it that way, I'm sure they would have.
> In reality, simdjson has different files for SSE4, AVX, AVX512, NEON, and Altivec. We can incorporate any of those as
> needed. That's a PG15 project, though, and I'm not volunteering.

Just for fun/experimentation, here's a quick (and probably too naive)
translation of those helper functions to NEON, on top of the v15
patch.

Attachment

Re: speed up verifying UTF-8

From
Vladimir Sitnikov
Date:
>I'm pretty confident this improvement is architecture-independent.

Thanks for testing it with different architectures.

It looks like the same utf8_advance function is good for both fast-path and for the slow path.
Then pg_utf8_verifychar could be removed altogether along with the corresponding IS_*_BYTE_LEAD macros.

Vladimir

Re: speed up verifying UTF-8

From
John Naylor
Date:

On Wed, Jul 21, 2021 at 12:13 PM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
> It looks like the same utf8_advance function is good for both fast-path and for the slow path.
> Then pg_utf8_verifychar could be removed altogether along with the corresponding IS_*_BYTE_LEAD macros.

pg_utf8_verifychar() is a public function usually called through pg_wchar_table[], so it needs to remain in any case.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:

On Wed, Jul 21, 2021 at 11:29 AM Thomas Munro <thomas.munro@gmail.com> wrote:

> Just for fun/experimentation, here's a quick (and probably too naive)
> translation of those helper functions to NEON, on top of the v15
> patch.

Neat! It's good to make it more architecture-agnostic, and I'm sure we can use quite a bit of this. I don't know enough about NEON to comment intelligently, but a quick glance through the simdjson source show a couple differences that might be worth a look:

 to_bool(const pg_u8x16_t v)
 {
+#if defined(USE_NEON)
+ return vmaxvq_u32((uint32x4_t) v) != 0;

--> return vmaxvq_u8(*this) != 0;

 vzero()
 {
+#if defined(USE_NEON)
+ return vmovq_n_u8(0);

--> return vdupq_n_u8(0); // or equivalently, splat(0)

is_highbit_set(const pg_u8x16_t v)
 {
+#if defined(USE_NEON)
+ return to_bool(bitwise_and(v, vmovq_n_u8(0x80)));

--> return vmaxq_u8(v) > 0x7F

(Technically, their convention is: is_ascii(v) { return vmaxq_u8(v) < 0x80; } , but same effect)

+#if defined(USE_NEON)
+static pg_attribute_always_inline pg_u8x16_t
+vset(uint8 v0, uint8 v1, uint8 v2, uint8 v3,
+ uint8 v4, uint8 v5, uint8 v6, uint8 v7,
+ uint8 v8, uint8 v9, uint8 v10, uint8 v11,
+ uint8 v12, uint8 v13, uint8 v14, uint8 v15)
+{
+ uint8 pg_attribute_aligned(16) values[16] = {
+ v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15
+ };
+ return vld1q_u8(values);
+}

--> They have this strange beast instead:

  // Doing a load like so end ups generating worse code.
  // uint8_t array[16] = {x1, x2, x3, x4, x5, x6, x7, x8,
  //                     x9, x10,x11,x12,x13,x14,x15,x16};
  // return vld1q_u8(array);
  uint8x16_t x{};
  // incredibly, Visual Studio does not allow x[0] = x1
  x = vsetq_lane_u8(x1, x, 0);
  x = vsetq_lane_u8(x2, x, 1);
  x = vsetq_lane_u8(x3, x, 2);
... 
  x = vsetq_lane_u8(x15, x, 14);
  x = vsetq_lane_u8(x16, x, 15);
  return x;

Since you aligned the array, that might not have the problem alluded to above, and it looks nicer.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [POC] verifying UTF-8 using SIMD instructions

From
Thomas Munro
Date:
On Thu, Jul 22, 2021 at 6:16 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
> Neat! It's good to make it more architecture-agnostic, and I'm sure we can use quite a bit of this.

One question is whether this "one size fits all" approach will be
extensible to wider SIMD.

>  to_bool(const pg_u8x16_t v)
>  {
> +#if defined(USE_NEON)
> + return vmaxvq_u32((uint32x4_t) v) != 0;
>
> --> return vmaxvq_u8(*this) != 0;

I chose that lane width because I saw an unsubstantiated claim
somewhere that it might be faster, but I have no idea if it matters.
The u8 code looks more natural anyway.  Changed.

>  vzero()
>  {
> +#if defined(USE_NEON)
> + return vmovq_n_u8(0);
>
> --> return vdupq_n_u8(0); // or equivalently, splat(0)

I guess it doesn't make a difference which builtin you use here, but I
was influenced by the ARM manual which says the vdupq form is
generated for immediate values.

> is_highbit_set(const pg_u8x16_t v)
>  {
> +#if defined(USE_NEON)
> + return to_bool(bitwise_and(v, vmovq_n_u8(0x80)));
>
> --> return vmaxq_u8(v) > 0x7F

Ah, of course.  Much nicer!

> +#if defined(USE_NEON)
> +static pg_attribute_always_inline pg_u8x16_t
> +vset(uint8 v0, uint8 v1, uint8 v2, uint8 v3,
> + uint8 v4, uint8 v5, uint8 v6, uint8 v7,
> + uint8 v8, uint8 v9, uint8 v10, uint8 v11,
> + uint8 v12, uint8 v13, uint8 v14, uint8 v15)
> +{
> + uint8 pg_attribute_aligned(16) values[16] = {
> + v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15
> + };
> + return vld1q_u8(values);
> +}
>
> --> They have this strange beast instead:
>
>   // Doing a load like so end ups generating worse code.
>   // uint8_t array[16] = {x1, x2, x3, x4, x5, x6, x7, x8,
>   //                     x9, x10,x11,x12,x13,x14,x15,x16};
>   // return vld1q_u8(array);
>   uint8x16_t x{};
>   // incredibly, Visual Studio does not allow x[0] = x1
>   x = vsetq_lane_u8(x1, x, 0);
>   x = vsetq_lane_u8(x2, x, 1);
>   x = vsetq_lane_u8(x3, x, 2);
> ...
>   x = vsetq_lane_u8(x15, x, 14);
>   x = vsetq_lane_u8(x16, x, 15);
>   return x;
>
> Since you aligned the array, that might not have the problem alluded to above, and it looks nicer.

Strange indeed.  We should probably poke around in the assembler and
see... it might be that MSVC doesn't like it, and I was just
cargo-culting the alignment.  I don't expect the generated code to
really "load" anything of course, it should ideally be some kind of
immediate mov...

FWIW here are some performance results from my humble RPI4:

master:

 chinese | mixed | ascii
---------+-------+-------
    4172 |  2763 |  1823
(1 row)

Your v15 patch:

 chinese | mixed | ascii
---------+-------+-------
    2267 |  1248 |   399
(1 row)

Your v15 patch set + the NEON patch, configured with USE_UTF8_SIMD=1:

 chinese | mixed | ascii
---------+-------+-------
     909 |   620 |   318
(1 row)

It's so good I wonder if it's producing incorrect results :-)

I also tried to do a quick and dirty AltiVec patch to see if it could
fit into the same code "shape", with less immediate success: it works
out slower than the fallback code on the POWER7 machine I scrounged an
account on.  I'm not sure what's wrong there, but maybe it's a useful
start (I'm probably confused about endianness, or the encoding of
boolean vectors which may be different (is true 0x01 or 0xff, does it
matter?), or something else, and it's falling back on errors all the
time?).

Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:

On Wed, Jul 21, 2021 at 8:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Jul 22, 2021 at 6:16 AM John Naylor

> One question is whether this "one size fits all" approach will be
> extensible to wider SIMD.

Sure, it'll just take a little more work and complexity. For one, 16-byte SIMD can operate on 32-byte chunks with a bit of repetition:

-       __m128i         input;
+       __m128i         input1;
+       __m128i         input2;

-#define SIMD_STRIDE_LENGTH (sizeof(__m128i))
+#define SIMD_STRIDE_LENGTH 32

        while (len >= SIMD_STRIDE_LENGTH)
        {
-               input = vload(s);
+               input1 = vload(s);
+               input2 = vload(s + sizeof(input1));

-               check_for_zeros(input, &error);
+               check_for_zeros(input1, &error);
+               check_for_zeros(input2, &error);

                /*
                 * If the chunk is all ASCII, we can skip the full UTF-8 check, but we
@@ -460,17 +463,18 @@ pg_validate_utf8_sse42(const unsigned char *s, int len)
                 * sequences at the end. We only update prev_incomplete if the chunk
                 * contains non-ASCII, since the error is cumulative.
                 */
-               if (is_highbit_set(input))
+               if (is_highbit_set(bitwise_or(input1, input2)))
                {
-                       check_utf8_bytes(prev, input, &error);
-                       prev_incomplete = is_incomplete(input);
+                       check_utf8_bytes(prev, input1, &error);
+                       check_utf8_bytes(input1, input2, &error);
+                       prev_incomplete = is_incomplete(input2);
                }
                else
                {
                        error = bitwise_or(error, prev_incomplete);
                }

-               prev = input;
+               prev = input2;
                s += SIMD_STRIDE_LENGTH;
                len -= SIMD_STRIDE_LENGTH;
        }

So with a few #ifdefs, we can accommodate two sizes if we like. 

For another, the prevN() functions would need to change, at least on x86 -- that would require replacing _mm_alignr_epi8() with _mm256_alignr_epi8() plus _mm256_permute2x128_si256(). Also, we might have to do something with the vector typedef.

That said, I think we can punt on that until we have an application that's much more compute-intensive. As it is with SSE4, COPY FROM WHERE <selective predicate> already pushes the utf8 validation way down in profiles.

> FWIW here are some performance results from my humble RPI4:
>
> master:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>     4172 |  2763 |  1823
> (1 row)
>
> Your v15 patch:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>     2267 |  1248 |   399
> (1 row)
>
> Your v15 patch set + the NEON patch, configured with USE_UTF8_SIMD=1:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>      909 |   620 |   318
> (1 row)
>
> It's so good I wonder if it's producing incorrect results :-)

Nice! If it passes regression tests, it *should* be fine, but stress testing would be welcome on any platform.

> I also tried to do a quick and dirty AltiVec patch to see if it could
> fit into the same code "shape", with less immediate success: it works
> out slower than the fallback code on the POWER7 machine I scrounged an
> account on.  I'm not sure what's wrong there, but maybe it's a useful
> start (I'm probably confused about endianness, or the encoding of
> boolean vectors which may be different (is true 0x01 or 0xff, does it
> matter?), or something else, and it's falling back on errors all the
> time?).

Hmm, I have access to a power8 machine to play with, but I also don't mind having some type of server-class hardware that relies on the recent nifty DFA fallback, which performs even better on powerpc64le than v15.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
John Naylor
Date:
Attached is v20, which has a number of improvements:

1. Cleaned up and explained DFA coding.
2. Adjusted check_ascii to return bool (now called is_valid_ascii) and to produce an optimized loop, using branch-free accumulators. That way, it doesn't need to be rewritten for different input lengths. I also think it's a bit easier to understand this way.
3. Put SSE helper functions in their own file.
4. Mostly-cosmetic edits to the configure detection.
5. Draft commit message.

With #2 above in place, I wanted to try different strides for the DFA, so more measurements (hopefully not much more of these):

Power8, gcc 4.8

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v20, 8-byte stride:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1189 |   550 |   246 |     600 |    936

v20, 16-byte stride (in the actual patch):
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     981 |   440 |   134 |     791 |    820

v20, 32-byte stride:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     857 |   481 |   141 |     834 |    839

Based on the above, I decided that 16 bytes had the best overall balance. Other platforms may differ, but I don't expect it to make a huge amount of difference.

Just for fun, I was also a bit curious about what Vladimir mentioned upthread about x86-64-v3 offering a different shift instruction. Somehow, clang 12 refused to build with that target, even though the release notes say it can, but gcc 11 was fine:

x86 Macbook, gcc 11, USE_FALLBACK_UTF8=1:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1200 |   728 |   370 |     544 |    637

v20:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     459 |   243 |    77 |     424 |    440

v20, CFLAGS="-march=x86-64-v3 -O2" :
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     390 |   215 |    77 |     303 |    323

And, gcc does generate the desired shift here:

objdump -S src/port/pg_utf8_fallback.o | grep shrx
      53: c4 e2 eb f7 d1               shrxq %rdx, %rcx, %rdx

While it looks good, clang can do about as well by simply unrolling all 16 shifts in the loop, which gcc won't do. To be clear, it's irrelevant, since x86-64-v3 includes AVX2, and if we had that we would just use it with the SIMD algorithm.

Macbook x86, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v20, USE_FALLBACK_UTF8=1:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     351 |   172 |    88 |     349 |    350

v20, with SSE4:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     142 |    92 |    59 |     141 |    141

I'm pretty happy with the patch at this point.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
Vladimir Sitnikov
Date:
Just wondering, do you have the code in a GitHub/Gitlab branch?

>+ utf8_advance(s, state, len);
>+
>+ /*
>+ * If we saw an error during the loop, let the caller handle it. We treat
>+ * all other states as success.
>+ */
>+ if (state == ERR)
>+ return 0;

Did you mean state = utf8_advance(s, state, len); there? (reassign state variable)

>I wanted to try different strides for the DFA

Does that (and "len >= 32" condition) mean the patch does not improve validation of the shorter strings (the ones less than 32 bytes)?
It would probably be nice to cover them as well (e.g. with 4 or 8-byte strides)

Vladimir

Re: speed up verifying UTF-8

From
John Naylor
Date:

On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
>
> Just wondering, do you have the code in a GitHub/Gitlab branch?
>
> >+ utf8_advance(s, state, len);
> >+
> >+ /*
> >+ * If we saw an error during the loop, let the caller handle it. We treat
> >+ * all other states as success.
> >+ */
> >+ if (state == ERR)
> >+ return 0;
>
> Did you mean state = utf8_advance(s, state, len); there? (reassign state variable)

Yep, that's a bug, thanks for catching!

> >I wanted to try different strides for the DFA
>
> Does that (and "len >= 32" condition) mean the patch does not improve validation of the shorter strings (the ones less than 32 bytes)?

Right. Also, the 32 byte threshold was just a temporary need for testing 32-byte stride -- testing different thresholds wouldn't hurt.  I'm not terribly concerned about short strings, though, as long as we don't regress.  That said, Heikki had something in his v14 [1] that we could use:

+/*
+ * Subroutine of pg_utf8_verifystr() to check one char. Returns the length of the
+ * character at *s in bytes, or 0 on invalid input or premature end of input.
+ *
+ * XXX: could this be combined with pg_utf8_verifychar above?
+ */
+static inline int
+pg_utf8_verify_one(const unsigned char *s, int len)

It would be easy to replace pg_utf8_verifychar with this. It might even speed up the SQL function length_in_encoding() -- that would be a better reason to do it.

[1] https://www.postgresql.org/message-id/2f95e70d-4623-87d4-9f24-ca534155f179%40iki.fi
--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
John Naylor
Date:

On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
>
> Just wondering, do you have the code in a GitHub/Gitlab branch?

Sorry, I didn't see this earlier. No, I don't.
--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
John Naylor
Date:

I wrote:

> On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
> >
> > >+ utf8_advance(s, state, len);
> > >+
> > >+ /*
> > >+ * If we saw an error during the loop, let the caller handle it. We treat
> > >+ * all other states as success.
> > >+ */
> > >+ if (state == ERR)
> > >+ return 0;
> >
> > Did you mean state = utf8_advance(s, state, len); there? (reassign state variable)
>
> Yep, that's a bug, thanks for catching!

Fixed in v21, with a regression test added. Also, utf8_advance() now directly changes state by a passed pointer rather than returning a value. Some cosmetic changes:

s/valid_bytes/non_error_bytes/ since the former is kind of misleading now.

Some other var name and symbol changes. In my first DFA experiment, ASC conflicted with the parser or scanner somehow, but it doesn't here, so it's clearer to use this.

Rewrote a lot of comments about the state machine and regression tests.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
John Naylor
Date:

On Mon, Jul 26, 2021 at 8:56 AM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> >
> > Does that (and "len >= 32" condition) mean the patch does not improve validation of the shorter strings (the ones less than 32 bytes)?
>
> Right. Also, the 32 byte threshold was just a temporary need for testing 32-byte stride -- testing different thresholds wouldn't hurt.  I'm not terribly concerned about short strings, though, as long as we don't regress.  

I put together the attached quick test to try to rationalize the fast-path threshold. (In case it isn't obvious, it must be at least 16 on all builds, since wchar.c doesn't know which implementation it's calling, and SSE register width sets the lower bound.) I changed the threshold first to 16, and then 100000, which will force using the byte-at-a-time code.

If we have only 16 bytes in the input, it still seems to be faster to use SSE, even though it's called through a function pointer on x86. I didn't test the DFA path, but I don't think the conclusion would be different. I'll include the 16 threshold next time I need to update the patch.

Macbook x86, clang 12:

master + use 16:
 asc16 | asc32 | asc64 | mb16 | mb32 | mb64
-------+-------+-------+------+------+------
   270 |   279 |   282 |  291 |  296 |  304

force byte-at-a-time:
 asc16 | asc32 | asc64 | mb16 | mb32 | mb64
-------+-------+-------+------+------+------
   277 |   292 |   310 |  296 |  317 |  362

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
John Naylor
Date:
I wrote:
> If we have only 16 bytes in the input, it still seems to be faster to use SSE, even though it's called through a function pointer on x86. I didn't test the DFA path, but I don't think the conclusion would be different. I'll include the 16 threshold next time I need to update the patch.

v22 attached, which changes the threshold to 16, with a few other cosmetic adjustments, mostly in the comments.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
John Naylor
Date:
Naively, the shift-based DFA requires 64-bit integers to encode the transitions, but I recently came across an idea from Dougall Johnson of using the Z3 SMT solver to pack the transitions into 32-bit integers [1]. That halves the size of the transition table for free. I adapted that effort to the existing conventions in v22 and arrived at the attached python script. Running the script outputs the following:

$ python dfa-pack-pg.py
offsets: [0, 11, 16, 1, 5, 6, 20, 25, 30]
transitions:
00000000000000000000000000000000 0x0
00000000000000000101100000000000 0x5800
00000000000000001000000000000000 0x8000
00000000000000000000100000000000 0x800
00000000000000000010100000000000 0x2800
00000000000000000011000000000000 0x3000
00000000000000001010000000000000 0xa000
00000000000000001100100000000000 0xc800
00000000000000001111000000000000 0xf000
01000001000010110000000000100000 0x410b0020
00000011000010110000000000100000 0x30b0020
00000010000010110000010000100000 0x20b0420

I'll include something like the attached text file diff in the next patch. Some comments are now outdated, but this is good enough for demonstration.

[1] https://gist.github.com/dougallj/166e326de6ad4cf2c94be97a204c025f
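As an aside for readers, the shift-based mechanism itself is easy to demonstrate with a toy automaton. This is illustrative only: the states, offsets, and input classes below are invented and far simpler than the UTF-8 table above. Each state is encoded as a bit offset into a 32-bit transition word, and the next state falls out of a shift and a mask:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy shift-based DFA: accept strings of '0'/'1' containing an even number
 * of '1' bits.  States are bit offsets into a 32-bit transition word; the
 * next state is (transition(byte) >> state) & 31.
 */
enum
{
	S_ERR = 0,					/* bits 0-4 of every word are zero, so ERR is absorbing */
	S_EVEN = 6,
	S_ODD = 12
};

uint32_t
transition(unsigned char c)
{
	if (c == '0')				/* parity unchanged */
		return ((uint32_t) S_EVEN << S_EVEN) | ((uint32_t) S_ODD << S_ODD);
	if (c == '1')				/* parity flips */
		return ((uint32_t) S_ODD << S_EVEN) | ((uint32_t) S_EVEN << S_ODD);
	return 0;					/* anything else -> S_ERR */
}

bool
even_ones(const char *s)
{
	uint32_t	state = S_EVEN;

	/*
	 * The word loaded depends only on the input byte, and the shift amount
	 * only on the previous state, so there are no per-byte branches to
	 * mispredict and the loads pipeline well.
	 */
	for (; *s; s++)
		state = (transition((unsigned char) *s) >> state) & 31;
	return state == S_EVEN;
}
```

Packing the UTF-8 table into 32 bits is exactly this game played with 9 states instead of 3, which is why a solver was needed to find a conflict-free set of offsets.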
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
John Naylor
Date:

I wrote:

> Naively, the shift-based DFA requires 64-bit integers to encode the transitions, but I recently came across an idea from Dougall Johnson of using the Z3 SMT solver to pack the transitions into 32-bit integers [1]. That halves the size of the transition table for free. I adapted that effort to the existing conventions in v22 and arrived at the attached python script.
> [...]
> I'll include something like the attached text file diff in the next patch. Some comments are now outdated, but this is good enough for demonstration.

Attached is v23 incorporating the 32-bit transition table, with the necessary comment adjustments.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
Vladimir Sitnikov
Date:
>Attached is v23 incorporating the 32-bit transition table, with the necessary comment adjustments

32bit table is nice.


Would you please replace https://github.com/BobSteagall/utf_utils/blob/master/src/utf_utils.cpp URL with
https://github.com/BobSteagall/utf_utils/blob/6b7a465265de2f5fa6133d653df0c9bdd73bbcf8/src/utf_utils.cpp
in the header of src/port/pg_utf8_fallback.c?

It would make the URL more stable in case the file gets renamed.

Vladimir

Re: speed up verifying UTF-8

From
John Naylor
Date:
I've decided I'm not quite comfortable with the additional complexity in the build system introduced by the SIMD portion of the previous patches. It would make more sense if the pure C portion were unchanged, but with the shift-based DFA plus the bitwise ASCII check, we have a portable implementation that's still a substantial improvement over the current validator. In v24, I've included only that much, and the diff is only about 1/3 as many lines. If future improvements to COPY FROM put additional pressure on this path, we can always add SIMD support later.

One thing not in this patch is a possible improvement to pg_utf8_verifychar() that Heikki and I worked on upthread as part of earlier attempts to rewrite pg_utf8_verifystr(). That's worth looking into separately.

On Thu, Aug 26, 2021 at 12:09 PM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
>
> >Attached is v23 incorporating the 32-bit transition table, with the necessary comment adjustments
>
> 32bit table is nice.

Thanks for taking a look!

> Would you please replace https://github.com/BobSteagall/utf_utils/blob/master/src/utf_utils.cpp URL with
> https://github.com/BobSteagall/utf_utils/blob/6b7a465265de2f5fa6133d653df0c9bdd73bbcf8/src/utf_utils.cpp
> in the header of src/port/pg_utf8_fallback.c?
>
> It would make the URL more stable in case the file gets renamed.
>
> Vladimir
>

Makes sense, so done that way.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
John Naylor
Date:
It occurred to me that the DFA + ascii quick check approach could also
be adapted to speed up some cases where we currently walk a string
counting characters, like this snippet in
text_position_get_match_pos():

/* Convert the byte position to char position. */
while (state->refpoint < state->last_match)
{
    state->refpoint += pg_mblen(state->refpoint);
    state->refpos++;
}

This coding changed in 9556aa01c69 (Use single-byte
Boyer-Moore-Horspool search even with multibyte encodings), in which I
found the majority of cases were faster, but some were slower. It
would be nice to regain the speed lost and do even better.

In the case of UTF-8, we could just run it through the DFA,
incrementing a count of the states found. The number of END states
should be the number of characters. The ascii quick check would still
be applicable as well. I think all that is needed is to export some
symbols and add the counting function. That wouldn't materially affect
the current patch for input verification, and would be separate, but
it would be nice to get the symbol visibility right up front. I've set
this to waiting on author while I experiment with that.
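For reference, the quantity being computed is just a character count, and on already-validated UTF-8 it reduces to counting non-continuation bytes. The sketch below is hypothetical and much simpler than the proposal above: a DFA-based version would instead count arrivals at the END state, with the ascii quick check adding 8 to the count per chunk:

```c
#include <stddef.h>

/*
 * Count characters in an already-validated UTF-8 buffer.  Every byte that
 * is not a continuation byte (10xxxxxx) begins a character, so character
 * starts can be counted without decoding.
 */
size_t
utf8_char_count(const unsigned char *s, size_t len)
{
	size_t		count = 0;

	for (size_t i = 0; i < len; i++)
		count += (s[i] & 0xc0) != 0x80;
	return count;
}
```

Note this only works on input known to be valid, which is the situation in text_position_get_match_pos() where the string has already passed verification.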

--
John Naylor
EDB: http://www.enterprisedb.com



Re: speed up verifying UTF-8

From
Heikki Linnakangas
Date:
On 20/10/2021 00:42, John Naylor wrote:
> I've decided I'm not quite comfortable with the additional complexity in 
> the build system introduced by the SIMD portion of the previous patches. 
> It would make more sense if the pure C portion were unchanged, but with 
> the shift-based DFA plus the bitwise ASCII check, we have a portable 
> implementation that's still a substantial improvement over the current 
> validator. In v24, I've included only that much, and the diff is only 
> about 1/3 as many lines. If future improvements to COPY FROM put 
> additional pressure on this path, we can always add SIMD support later.

+1.

I had another look at this now. Looks good, just a few minor comments below:

> +/*
> + * Verify a chunk of bytes for valid ASCII, including a zero-byte check.
> + */
> +static inline bool
> +is_valid_ascii(const unsigned char *s, int len)
> +{
> +    uint64        chunk,
> +                highbit_cum = UINT64CONST(0),
> +                zero_cum = UINT64CONST(0x8080808080808080);
> +
> +    Assert(len % sizeof(chunk) == 0);
> +
> +    while (len >= sizeof(chunk))
> +    {
> +        memcpy(&chunk, s, sizeof(chunk));
> +
> +        /*
> +         * Capture any zero bytes in this chunk.
> +         *
> +         * First, add 0x7f to each byte. This sets the high bit in each byte,
> +         * unless it was a zero. We will check later that none of the bytes in
> +         * the chunk had the high bit set, in which case the max value each
> +         * byte can have after the addition is 0x7f + 0x7f = 0xfe, and we
> +         * don't need to worry about carrying over to the next byte.
> +         *
> +         * If any resulting high bits are zero, the corresponding high bits in
> +         * the zero accumulator will be cleared.
> +         */
> +        zero_cum &= (chunk + UINT64CONST(0x7f7f7f7f7f7f7f7f));
> +
> +        /* Capture any set bits in this chunk. */
> +        highbit_cum |= chunk;
> +
> +        s += sizeof(chunk);
> +        len -= sizeof(chunk);
> +    }

This function assumes that the input len is a multiple of 8. There's an 
assertion for that, but it would be good to also mention it in the 
function comment. It took me a moment to realize that.

Given that assumption, I wonder if "while (len >= 0)" would be marginally
faster. Or compute "s_end = s + len" first, and check for "while (s < 
s_end)", so that you don't need to update 'len' in the loop.

Also would be good to mention what exactly the return value means. I.e 
"returns false if the input contains any bytes with the high-bit set, or 
zeros".

> +    /*
> +     * Check if any high bits in the zero accumulator got cleared.
> +     *
> +     * XXX: As noted above, the zero check is only valid if the chunk had no
> +     * high bits set. However, the compiler may perform these two checks in
> +     * any order. That's okay because if any high bits were set, we would
> +     * return false regardless, so invalid results from the zero check don't
> +     * matter.
> +     */
> +    if (zero_cum != UINT64CONST(0x8080808080808080))
> +        return false;

I don't understand the "the compiler may perform these checks in any 
order" comment. We trust the compiler to do the right thing, and only 
reorder things when it's safe to do so. What is special here, why is it 
worth mentioning here?

> @@ -1721,7 +1777,7 @@ pg_gb18030_verifystr(const unsigned char *s, int len)
>      return s - start;
>  }
>  
> -static int
> +static pg_noinline int
>  pg_utf8_verifychar(const unsigned char *s, int len)
>  {
>      int            l;

Why force it to not be inlined?

> + * In a shift-based DFA, the input byte is an index into array of integers
> + * whose bit pattern encodes the state transitions. To compute the current
> + * state, we simply right-shift the integer by the current state and apply a
> + * mask. In this scheme, the address of the transition only depends on the
> + * input byte, so there is better pipelining.

Should be "To compute the *next* state, ...", I think.

The way the state transition table works is pretty inscrutable. That's 
understandable, because the values were found by an SMT solver, so I'm 
not sure if anything can be done about it.

- Heikki



RE: [EXTERNAL] Re: speed up verifying UTF-8

From
"Godfrin, Philippe E"
Date:
>-----Original Message-----
>From: Heikki Linnakangas <hlinnaka@iki.fi> 
>Sent: Friday, December 10, 2021 12:34 PM
>To: John Naylor <john.naylor@enterprisedb.com>; Vladimir Sitnikov <sitnikov.vladimir@gmail.com>
>Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; Amit Khandekar <amitdkhan.pg@gmail.com>; Thomas Munro
<thomas.munro@gmail.com>; Greg Stark <stark@mit.edu>
 
>Subject: [EXTERNAL] Re: speed up verifying UTF-8
>
>On 20/10/2021 00:42, John Naylor wrote:
>> I've decided I'm not quite comfortable with the additional complexity 
>> in the build system introduced by the SIMD portion of the previous patches.
>> It would make more sense if the pure C portion were unchanged, but 
>> with the shift-based DFA plus the bitwise ASCII check, we have a 
>> portable implementation that's still a substantial improvement over 
>> the current validator. In v24, I've included only that much, and the 
>> diff is only about 1/3 as many lines. If future improvements to COPY 
>> FROM put additional pressure on this path, we can always add SIMD support later.
>
>+1.
>
>I had another look at this now. Looks good, just a few minor comments below:
>
>> +/*
>> + * Verify a chunk of bytes for valid ASCII, including a zero-byte check.
>> + */
>> +static inline bool
>> +is_valid_ascii(const unsigned char *s, int len) {
>> +    uint64        chunk,
>> +                highbit_cum = UINT64CONST(0),
>> +                zero_cum = UINT64CONST(0x8080808080808080);
>> +
>> +    Assert(len % sizeof(chunk) == 0);
>> +
>> +    while (len >= sizeof(chunk))
>> +    {
>> +        memcpy(&chunk, s, sizeof(chunk));
>> +
>> +        /*
>> +         * Capture any zero bytes in this chunk.
>> +         *
>> +         * First, add 0x7f to each byte. This sets the high bit in each byte,
>> +         * unless it was a zero. We will check later that none of the bytes in
>> +         * the chunk had the high bit set, in which case the max value each
>> +         * byte can have after the addition is 0x7f + 0x7f = 0xfe, and we
>> +         * don't need to worry about carrying over to the next byte.
>> +         *
>> +         * If any resulting high bits are zero, the corresponding high bits in
>> +         * the zero accumulator will be cleared.
>> +         */
>> +        zero_cum &= (chunk + UINT64CONST(0x7f7f7f7f7f7f7f7f));
>> +
>> +        /* Capture any set bits in this chunk. */
>> +        highbit_cum |= chunk;
>> +
>> +        s += sizeof(chunk);
>> +        len -= sizeof(chunk);
>> +    }
>
>This function assumes that the input len is a multiple of 8. There's an assertion for that, but it would be good to also mention it in the function comment. It took me a moment to realize that.
 
>
>Given that assumption, I wonder if "while (len >= 0)" would be marginally faster. Or compute "s_end = s + len" first, and check for "while (s < s_end)", so that you don't need to update 'len' in the loop.
 
>
>Also would be good to mention what exactly the return value means. I.e. "returns false if the input contains any bytes with the high-bit set, or zeros".
 
>
>> +    /*
>> +     * Check if any high bits in the zero accumulator got cleared.
>> +     *
>> +     * XXX: As noted above, the zero check is only valid if the chunk had no
>> +     * high bits set. However, the compiler may perform these two checks in
>> +     * any order. That's okay because if any high bits were set, we would
>> +     * return false regardless, so invalid results from the zero check don't
>> +     * matter.
>> +     */
>> +    if (zero_cum != UINT64CONST(0x8080808080808080))
>> +        return false;
>
>I don't understand the "the compiler may perform these checks in any order" comment. We trust the compiler to do the right thing, and only reorder things when it's safe to do so. What is special here, why is it worth mentioning here?
 
>
>> @@ -1721,7 +1777,7 @@ pg_gb18030_verifystr(const unsigned char *s, int len)
>>      return s - start;
>>  }
>>  
>> -static int
>> +static pg_noinline int
>>  pg_utf8_verifychar(const unsigned char *s, int len)
>>  {
>>      int            l;
>
>Why force it to not be inlined?
>
>> + * In a shift-based DFA, the input byte is an index into array of integers
>> + * whose bit pattern encodes the state transitions. To compute the current
>> + * state, we simply right-shift the integer by the current state and apply a
>> + * mask. In this scheme, the address of the transition only depends on the
>> + * input byte, so there is better pipelining.
>
>Should be "To compute the *next* state, ...", I think.
>
>The way the state transition table works is pretty inscrutable. That's understandable, because the values were found by an SMT solver, so I'm not sure if anything can be done about it.
 
>
>- Heikki
>

If I remember correctly, the shift instruction is very fast...

Re: speed up verifying UTF-8

From
John Naylor
Date:
On Fri, Dec 10, 2021 at 2:33 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> I had another look at this now. Looks good, just a few minor comments below:

Thanks for reviewing! I've attached v25 to address your points.

> This function assumes that the input len is a multiple of 8. There's an
> assertion for that, but it would be good to also mention it in the
> function comment. It took me a moment to realize that.

Done.

> Given that assumption, I wonder if "while (len >= 0)" would be marginally
> faster. Or compute "s_end = s + len" first, and check for "while (s <
> s_end)", so that you don't need to update 'len' in the loop.

With two chunks, gcc 4.8.5/11.2 and clang 12 will unroll the inner
loop, so it doesn't matter:

L51:
        mov     rdx, QWORD PTR [rdi]
        mov     rsi, QWORD PTR [rdi+8]
        lea     rax, [rdx+rbx]
        lea     rbp, [rsi+rbx]
        and     rax, rbp
        and     rax, r11
        cmp     rax, r11
        jne     .L66
        or      rdx, rsi
        test    rdx, r11
        jne     .L66
        sub     r8d, 16          ; refers to "len" in the caller pg_utf8_verifystr()
        add     rdi, 16
        cmp     r8d, 15
        jg      .L51

I *think* these are the same instructions as from your version from
some time ago that handled two integers explicitly -- I rewrote it
like this to test different chunk sizes.

(Aside on 32-byte strides: Four chunks was within the noise level of
two chunks on the platform I tested. With 32 bytes, that increases the
chance that a mixed input would have non-ascii and defeat this
optimization, so should be significantly faster to make up for that.
Along those lines, in the future we could consider SSE2 (unrolled 2 x
16 bytes) for this path. Since it's part of the spec for x86-64, we
wouldn't need a runtime check -- just #ifdef it inline. And we could
piggy-back on the CRC SSE4.2 configure test for intrinsic support, so
that would avoid adding a bunch of complexity.)

That said, I think your suggestions are better on code clarity
grounds. I'm on the fence about "while(s < s_end)", so I went with
"while (len > 0)" because it matches the style in wchar.c.

> Also would be good to mention what exactly the return value means. I.e
> "returns false if the input contains any bytes with the high-bit set, or
> zeros".

Done.

> > +     /*
> > +      * Check if any high bits in the zero accumulator got cleared.
> > +      *
> > +      * XXX: As noted above, the zero check is only valid if the chunk had no
> > +      * high bits set. However, the compiler may perform these two checks in
> > +      * any order. That's okay because if any high bits were set, we would
> > +      * return false regardless, so invalid results from the zero check don't
> > +      * matter.
> > +      */
> > +     if (zero_cum != UINT64CONST(0x8080808080808080))
> > +             return false;

> I don't understand the "the compiler may perform these checks in any
> order" comment. We trust the compiler to do the right thing, and only
> reorder things when it's safe to do so. What is special here, why is it
> worth mentioning here?

Ah, that's a good question, and now that you mention it, the comment
is silly. When looking at the assembly output a while back, I was a
bit astonished that it didn't match my mental model of what was
happening, so I made this note. I've removed the whole XXX comment
here and expanded the first comment in the loop to:

/*
 * Capture any zero bytes in this chunk.
 *
 * First, add 0x7f to each byte. This sets the high bit in each byte,
 * unless it was a zero. If any resulting high bits are zero, the
 * corresponding high bits in the zero accumulator will be cleared.
 *
 * If none of the bytes in the chunk had the high bit set, the max
 * value each byte can have after the addition is 0x7f + 0x7f = 0xfe,
 * and we don't need to worry about carrying over to the next byte. If
 * any input bytes did have the high bit set, it doesn't matter
 * because we check for those separately.
 */

> > @@ -1721,7 +1777,7 @@ pg_gb18030_verifystr(const unsigned char *s, int len)
> >       return s - start;
> >  }
> >
> > -static int
> > +static pg_noinline int
> >  pg_utf8_verifychar(const unsigned char *s, int len)
> >  {
> >       int                     l;
>
> Why force it to not be inlined?

Since the only direct caller is now only using it for small inputs, I
thought about saving space, but it's not enough to matter, so I'll go
ahead and leave it out. While at it, I removed the unnecessary
"inline" declaration for utf8_advance(), since the compiler can do
that anyway.

> > + * In a shift-based DFA, the input byte is an index into array of integers
> > + * whose bit pattern encodes the state transitions. To compute the current
> > + * state, we simply right-shift the integer by the current state and apply a
> > + * mask. In this scheme, the address of the transition only depends on the
> > + * input byte, so there is better pipelining.
>
> Should be "To compute the *next* state, ...", I think.

Fixed.

> The way the state transition table works is pretty inscrutable. That's
> understandable, because the values were found by an SMT solver, so I'm
> not sure if anything can be done about it.

Do you mean in general, or just the state values?

Like any state machine, the code is simple, and the complexity is
hidden in the data. Hopefully the first link I included in the comment
is helpful.

The SMT solver was only needed to allow 32-bit (instead of 64-bit)
entries in the transition table, so it's not strictly necessary. A
lookup table that fits in 1kB is nice from a cache perspective,
however.

With 64-bit, the state values are less weird-looking but they're still
just arbitrary numbers. As long as ERR = 0 and the largest is at most
9, it doesn't matter what they are, so I'm not sure it's much less
mysterious. You can see the difference between 32-bit and 64-bit in
[1].

--
In addition to Heikki's review points, I've made a couple of small
additional changes from v24: I rewrote this part, so we don't need
these macros anymore:

-                       if (!IS_HIGHBIT_SET(*s) ||
-                               IS_UTF8_2B_LEAD(*s) ||
-                               IS_UTF8_3B_LEAD(*s) ||
-                               IS_UTF8_4B_LEAD(*s))
+                       if (!IS_HIGHBIT_SET(*s) || pg_utf_mblen(s) > 1)

And I moved is_valid_ascii() to pg_wchar.h so it can be used
elsewhere. I'm not sure there's a better place to put it. I tried
using this for text_position(), for which I'll start a new thread.

[1] https://www.postgresql.org/message-id/attachment/125672/v22-addendum-32-bit-transitions.txt



--
John Naylor
EDB: http://www.enterprisedb.com



Attachment

Re: speed up verifying UTF-8

From
John Naylor
Date:
I plan to push v25 early next week, unless there are further comments.

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: speed up verifying UTF-8

From
John Naylor
Date:
On Fri, Dec 17, 2021 at 9:29 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> I plan to push v25 early next week, unless there are further comments.

Pushed, thanks everyone!

-- 
John Naylor
EDB: http://www.enterprisedb.com