Thread: [POC] verifying UTF-8 using SIMD instructions

[POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
Hi,

As of b80e10638e3, there is a new API for validating the encoding of strings, and one of the side effects is that we have a wider choice of algorithms. For UTF-8, it has been demonstrated that SIMD is much faster at decoding [1] and validation [2] than the standard approach we use.

It makes sense to start with the ascii subset of UTF-8 for a couple reasons. First, ascii is very widespread in database content, particularly in bulk loads. Second, ascii can be validated using the simple SSE2 intrinsics that come with (I believe) any x86-64 chip, and I'm guessing we can detect that at compile time and not mess with runtime checks. The examples above using SSE for the general case are much more complicated and involve SSE 4.2 or AVX.
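For illustration, the SSE2-only ascii check can be very small. This is a sketch, not the POC's actual code; the helper name is ours. Ascii bytes have the high bit clear, and _mm_movemask_epi8() gathers exactly the high bit of each byte, so a pure-ascii 16-byte chunk yields a zero mask:

```c
#include <emmintrin.h>			/* SSE2 */
#include <stdbool.h>

/* Hypothetical helper: true if all 16 bytes at 's' are ASCII. */
static bool
chunk_is_ascii_sse2(const unsigned char *s)
{
	/* unaligned load of 16 bytes */
	__m128i		chunk = _mm_loadu_si128((const __m128i *) s);

	/* movemask collects the high bit of each byte; ASCII => all zero */
	return _mm_movemask_epi8(chunk) == 0;
}
```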

Here are some numbers on my laptop (MacOS/clang 10 -- if the concept is okay, I'll do Linux/gcc and add more inputs). The test is the same as Heikki shared in [3], but I added a case with >95% Chinese characters just to show how that compares to the mixed ascii/multibyte case.

master:

 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366

patch:

 chinese | mixed | ascii
---------+-------+-------
    1103 |   498 |    51

The speedup in the pure ascii case is nice.

In the attached POC, I just have a pro forma portability stub, and left full portability detection for later. The fast path is inlined inside pg_utf8_verifystr(). I imagine the ascii fast path could be abstracted into a separate function that takes a function pointer for full encoding validation. That would allow other encodings with strict ascii subsets to use this as well, but coding that abstraction might be a little messy, and b80e10638e3 already gives a performance boost over PG13.

I also gave a shot at doing full UTF-8 recognition using a DFA, but so far that has made performance worse. If I ever have more success with that, I'll add that in the mix.

[1] https://woboq.com/blog/utf-8-processing-using-simd.html
[2] https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/
[3] https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5a2f@iki.fi

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
Heikki Linnakangas
Date:
On 01/02/2021 19:32, John Naylor wrote:
> It makes sense to start with the ascii subset of UTF-8 for a couple 
> reasons. First, ascii is very widespread in database content, 
> particularly in bulk loads. Second, ascii can be validated using the 
> simple SSE2 intrinsics that come with (I believe) any x86-64 chip, and 
> I'm guessing we can detect that at compile time and not mess with 
> runtime checks. The examples above using SSE for the general case are 
> much more complicated and involve SSE 4.2 or AVX.

I wonder how using SSE compares with dealing with 64 or 32-bit words at 
a time, using regular instructions? That would be more portable.
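Such a word-at-a-time check might be sketched like this (a guess at the shape, not code from any patch in this thread):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Sketch: test the high bit of all eight bytes in one 64-bit operation.
 * ASCII bytes never have the high bit set.  memcpy keeps the load safe
 * on strict-alignment platforms. */
static bool
chunk_is_ascii_word(const char *s)
{
	uint64_t	chunk;

	memcpy(&chunk, s, sizeof(chunk));
	return (chunk & UINT64_C(0x8080808080808080)) == 0;
}
```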

> Here are some numbers on my laptop (MacOS/clang 10 -- if the concept is 
> okay, I'll do Linux/gcc and add more inputs). The test is the same as 
> Heikki shared in [3], but I added a case with >95% Chinese characters 
> just to show how that compares to the mixed ascii/multibyte case.
> 
> master:
> 
>   chinese | mixed | ascii
> ---------+-------+-------
>      1081 |   761 |   366
> 
> patch:
> 
>   chinese | mixed | ascii
> ---------+-------+-------
>      1103 |   498 |    51
> 
> The speedup in the pure ascii case is nice.

Yep.

> In the attached POC, I just have a pro forma portability stub, and left 
> full portability detection for later. The fast path is inlined inside 
> pg_utf8_verifystr(). I imagine the ascii fast path could be abstracted 
> into a separate function that takes a function pointer for full 
> encoding validation. That would allow other encodings with strict ascii 
> subsets to use this as well, but coding that abstraction might be a 
> little messy, and b80e10638e3 already gives a performance boost over PG13.

All supported encodings are ASCII subsets. Might be best to put the 
ASCII check into a static inline function and use it in all the verify 
functions. I presume it's only a few instructions, and these functions 
can be pretty performance sensitive.

> I also gave a shot at doing full UTF-8 recognition using a DFA, but so 
> far that has made performance worse. If I ever have more success with 
> that, I'll add that in the mix.

That's disappointing. Perhaps the SIMD algorithms have higher startup 
costs, so that you need longer inputs to benefit? In that case, it might 
make sense to check the length of the input and only use the SIMD 
algorithm if the input is long enough.

- Heikki



Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
On Mon, Feb 1, 2021 at 2:01 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 01/02/2021 19:32, John Naylor wrote:
> > It makes sense to start with the ascii subset of UTF-8 for a couple
> > reasons. First, ascii is very widespread in database content,
> > particularly in bulk loads. Second, ascii can be validated using the
> > simple SSE2 intrinsics that come with (I believe) any x86-64 chip, and
> > I'm guessing we can detect that at compile time and not mess with
> > runtime checks. The examples above using SSE for the general case are
> > much more complicated and involve SSE 4.2 or AVX.
>
> I wonder how using SSE compares with dealing with 64 or 32-bit words at
> a time, using regular instructions? That would be more portable.

I gave that a shot, and it's actually pretty good. According to this paper [1], 16 bytes is best and gives a good apples-to-apples comparison to SSE registers, so I tried both 16 and 8 bytes.

> All supported encodings are ASCII subsets. Might be best to put the
> ASCII check into a static inline function and use it in all the verify
> functions. I presume it's only a few instructions, and these functions
> can be pretty performance sensitive.

I tried both the static inline function and also putting the whole optimized utf-8 loop in a separate function to which the caller passes a pointer to the appropriate pg_*_verifychar().

In the table below, "inline" refers to coding directly inside pg_utf8_verifystr(). Both C and SSE are in the same patch, with an #ifdef. I didn't bother splitting them out because for other encodings, we want one of the other approaches above. For those, "C retail" refers to putting the contents of the inner loop into a static inline function, if I understood your suggestion correctly. This needs more boilerplate in each function, so I don't prefer it. "C func pointer" refers to the pointer approach I just mentioned. That is the cleanest-looking way to generalize it, so I only tested that version with different strides: 8 and 16 bytes.

This is the same test I used earlier, which is the test in [2] but adding an almost-pure multibyte Chinese text of about the same size.

x86-64 Linux gcc 8.4.0:

      build       | chinese | mixed | ascii
------------------+---------+-------+-------
 master           |    1480 |   848 |   428
 inline SSE       |    1617 |   634 |    63
 inline C         |    1481 |   843 |    50
 C retail         |    1493 |   838 |    49
 C func pointer   |    1467 |   851 |    49
 C func pointer 8 |    1518 |   757 |    56

x86-64 MacOS clang 10.0.0:

      build       | chinese | mixed | ascii
------------------+---------+-------+-------
 master           |    1086 |   760 |   374
 inline SSE       |    1081 |   529 |    70
 inline C         |    1093 |   649 |    49
 C retail         |    1132 |   695 |   152
 C func pointer   |    1085 |   609 |    59
 C func pointer 8 |    1099 |   571 |    71

PowerPC-LE Linux gcc 4.8.5:

      build       | chinese | mixed | ascii
------------------+---------+-------+-------
 master           |    2961 |  1525 |   871
 inline SSE       |   (n/a) | (n/a) | (n/a)
 inline C         |    2911 |  1329 |    80
 C retail         |    2838 |  1311 |   102
 C func pointer   |    2828 |  1314 |    80
 C func pointer 8 |    3143 |  1249 |   133

Looking at the results, the main advantage of SSE here is that it's more robust for mixed inputs. If a 16-byte chunk is not ascii-only but contains a block of ascii at the front, we can skip those bytes with a single CPU instruction, but in C, we have to verify the whole chunk using the slow path.
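The single-instruction skip works roughly like this (helper name is ours; __builtin_ctz is a GCC/Clang builtin): the bitmask from _mm_movemask_epi8 pinpoints the first non-ascii byte, so its offset falls out directly.

```c
#include <emmintrin.h>			/* SSE2 */

/* Hypothetical helper: offset of the first non-ASCII byte in a 16-byte
 * chunk, or 16 if the chunk is pure ASCII. */
static int
first_non_ascii(const unsigned char *s)
{
	__m128i		chunk = _mm_loadu_si128((const __m128i *) s);
	int			mask = _mm_movemask_epi8(chunk);	/* one bit per byte */

	/* count trailing zero bits to locate the first set bit */
	return mask == 0 ? 16 : __builtin_ctz(mask);
}
```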

The "C func pointer approach" seems to win out over the "C retail" approach (static inline function).

Using an 8-byte stride is slightly better for mixed inputs on all platforms tested, but regresses on pure ascii and also seems to regress on pure multibyte. The difference in the multibyte case is small enough that it could be random, but it happens on two platforms, so I'd say it's real. On the other hand, pure multibyte is not as common as mixed text.

Overall, I think the function pointer approach with an 8-byte stride is the best balance. If that's agreeable, next I plan to test with short inputs, because I think we'll want a guard if-statement to only loop through the fast path if the string is long enough to justify that.
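To make the function pointer approach concrete, here is a minimal sketch of the shape being discussed; all names are ours, not the patch's, and the toy per-character verifier merely stands in for pg_utf8_verifychar() and friends:

```c
#include <stdint.h>
#include <string.h>

/* Signature of an encoding-specific per-character verifier: returns the
 * length of the verified character, or -1 on invalid input. */
typedef int (*verify_char_fn) (const unsigned char *s, int len);

/* Toy stand-in for a real verifier: accepts one ASCII byte at a time. */
static int
ascii_only_verifychar(const unsigned char *s, int len)
{
	(void) len;
	return (*s < 0x80) ? 1 : -1;
}

/* Shared ASCII fast path: stride 8 bytes at a time, handing anything
 * non-ASCII to the per-character verifier.  Returns the number of
 * leading bytes successfully verified. */
static int
verify_with_ascii_fastpath(const unsigned char *s, int len,
						   verify_char_fn verify_char)
{
	const unsigned char *start = s;

	while (len > 0)
	{
		if (len >= 8)
		{
			uint64_t	chunk;

			memcpy(&chunk, s, sizeof(chunk));	/* alignment-safe load */
			if ((chunk & UINT64_C(0x8080808080808080)) == 0)
			{
				/* whole chunk is ASCII: skip it */
				s += 8;
				len -= 8;
				continue;
			}
		}

		/* slow path: verify one (possibly multibyte) character */
		{
			int			l = verify_char(s, len);

			if (l <= 0)
				break;			/* invalid byte sequence */
			s += l;
			len -= l;
		}
	}
	return (int) (s - start);
}
```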

> > I also gave a shot at doing full UTF-8 recognition using a DFA, but so
> > far that has made performance worse. If I ever have more success with
> > that, I'll add that in the mix.
>
> That's disappointing. Perhaps the SIMD algorithms have higher startup
> costs, so that you need longer inputs to benefit? In that case, it might
> make sense to check the length of the input and only use the SIMD
> algorithm if the input is long enough.

I changed topics a bit quickly, but here I'm talking about using a table-driven state machine to verify the multibyte case. It's possible I did something wrong, since my model implementation decodes, and having to keep track of how many bytes got verified might be the culprit. I'd like to try again to speed up multibyte, but that might be a PG15 project.
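For readers unfamiliar with the approach, here is a deliberately simplified table-driven validator showing the shape of the DFA idea. This is our sketch, not the code being discussed: it checks only the lead/continuation structure of UTF-8 and does NOT reject overlong sequences or surrogates, which is where the extra states (and the difficulty) come in.

```c
#include <stdbool.h>

/* DFA states: S_START means "at a character boundary". */
enum state
{
	S_START, S_CONT1, S_CONT2, S_CONT3, S_ERR, NSTATES
};

#define NCLASSES 6

/* transition[state][byte class] */
static const unsigned char transition[NSTATES][NCLASSES] = {
	/* ascii  cont     lead2    lead3    lead4    invalid */
	{S_START, S_ERR, S_CONT1, S_CONT2, S_CONT3, S_ERR},	/* S_START */
	{S_ERR, S_START, S_ERR, S_ERR, S_ERR, S_ERR},		/* S_CONT1 */
	{S_ERR, S_CONT1, S_ERR, S_ERR, S_ERR, S_ERR},		/* S_CONT2 */
	{S_ERR, S_CONT2, S_ERR, S_ERR, S_ERR, S_ERR},		/* S_CONT3 */
	{S_ERR, S_ERR, S_ERR, S_ERR, S_ERR, S_ERR},			/* S_ERR */
};

static int
byte_class(unsigned char b)
{
	if (b < 0x80)
		return 0;				/* ASCII */
	if (b < 0xC0)
		return 1;				/* continuation byte */
	if (b < 0xE0)
		return 2;				/* 2-byte lead */
	if (b < 0xF0)
		return 3;				/* 3-byte lead */
	if (b < 0xF8)
		return 4;				/* 4-byte lead */
	return 5;					/* never valid in UTF-8 */
}

/* Structurally valid iff we end at a character boundary with no error. */
static bool
utf8_structurally_valid(const unsigned char *s, int len)
{
	int			state = S_START;

	for (int i = 0; i < len; i++)
		state = transition[state][byte_class(s[i])];
	return state == S_START;
}
```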

[1] https://arxiv.org/abs/2010.03090
[2] https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5a2f@iki.fi

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
Here is a more polished version of the function pointer approach, now adapted to all multibyte encodings. Using the not-yet-committed tests from [1], I found a thinko that caused the test for nul bytes not only to be wrong, but probably also to be elided by the compiler. Doing it correctly is noticeably slower on pure ascii, but still several times faster than before, so the conclusions haven't changed. I'll run full measurements later this week, but I'll share the patch now for review.

[1] https://www.postgresql.org/message-id/11d39e63-b80a-5f8d-8043-fff04201fadc@iki.fi

--
Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
Heikki Linnakangas
Date:
On 07/02/2021 22:24, John Naylor wrote:
> Here is a more polished version of the function pointer approach, now 
> adapted to all multibyte encodings. Using the not-yet-committed tests 
> from [1], I found a thinko that caused the test for nul bytes not only 
> to be wrong, but probably also to be elided by the compiler. Doing it 
> correctly is noticeably slower on pure ascii, but still several times 
> faster than before, so the conclusions haven't changed. I'll run 
> full measurements later this week, but I'll share the patch now for review.

As a quick test, I hacked up pg_utf8_verifystr() to use Lemire's 
algorithm from the simdjson library [1], see attached patch. I 
microbenchmarked it using the same test I used before [2].

These results are with "gcc -O2" using "gcc (Debian 10.2.1-6) 10.2.1 
20210110"

unpatched master:

postgres=# \i mbverifystr-speed.sql
CREATE FUNCTION
  mixed | ascii
-------+-------
    728 |   393
(1 row)

v1-0001-Add-an-ASCII-fast-path-to-multibyte-encoding-veri.patch:

  mixed | ascii
-------+-------
    759 |    98
(1 row)

simdjson-utf8-hack.patch:

  mixed | ascii
-------+-------
     53 |    31
(1 row)

So clearly that algorithm is fast. Not sure if it has a high startup 
cost, or large code size, or other tradeoffs that we don't want. At 
least it depends on SIMD instructions, so it requires more code for the 
architecture-specific implementations and autoconf logic and all that. 
Nevertheless I think it deserves a closer look, I'm a bit reluctant to 
put in half-way measures, when there's a clearly superior algorithm out 
there.

I also tested the fallback implementation from the simdjson library 
(included in the patch, if you uncomment it in simdjson-glue.c):

  mixed | ascii
-------+-------
    447 |    46
(1 row)

I think we should at least try to adopt that. At a high level, it looks 
pretty similar to your patch: you load the data 8 bytes at a time and check 
whether they are all ASCII. If there are any non-ASCII chars, you check the 
bytes one by one; otherwise you load the next 8 bytes. Your patch should 
be able to achieve the same performance, if done right. I don't think 
the simdjson code forbids \0 bytes, so that will add a few cycles, but 
still.

[1] https://github.com/simdjson/simdjson
[2] 
https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5a2f@iki.fi

- Heikki

PS. Your patch as it stands isn't safe on systems with strict alignment, 
the string passed to the verify function isn't guaranteed to be 8 bytes 
aligned. Use memcpy to fetch the next 8-byte chunk to fix.


Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> As a quick test, I hacked up pg_utf8_verifystr() to use Lemire's
> algorithm from the simdjson library [1], see attached patch. I
> microbenchmarked it using the same test I used before [2].

I've been looking at various iterations of Lemire's utf8 code, and trying it out was next on my list, so thanks for doing that!

> These results are with "gcc -O2" using "gcc (Debian 10.2.1-6) 10.2.1
> 20210110"
>
> unpatched master:
>
> postgres=# \i mbverifystr-speed.sql
> CREATE FUNCTION
>   mixed | ascii
> -------+-------
>     728 |   393
> (1 row)
>
> v1-0001-Add-an-ASCII-fast-path-to-multibyte-encoding-veri.patch:
>
>   mixed | ascii
> -------+-------
>     759 |    98
> (1 row)

Hmm, the mixed case got worse -- I haven't seen that in any of my tests.

> simdjson-utf8-hack.patch:
>
>   mixed | ascii
> -------+-------
>      53 |    31
> (1 row)
>
> So clearly that algorithm is fast. Not sure if it has a high startup
> cost, or large code size, or other tradeoffs that we don't want.

The simdjson lib uses everything up through AVX512 depending on what hardware is available. I seem to remember reading that high start-up cost is more relevant to floating point than to integer ops, but I could be wrong. Just the utf8 portion is surely tiny also.

> At
> least it depends on SIMD instructions, so it requires more code for the
> architecture-specific implementations and autoconf logic and all that.

One of his earlier demos [1] (in simdutf8check.h) had a version that used mostly SSE2 with just three intrinsics from SSSE3. That's widely available by now. He measured that at 0.7 cycles per byte, which is still good compared to AVX2 0.45 cycles per byte [2].

Testing for three SSSE3 intrinsics in autoconf is pretty easy. I would assume that if that check (and the corresponding runtime check) passes, we can assume SSE2. That code has three licenses to choose from -- Apache 2, Boost, and MIT. Something like that might be straightforward to start from. I think the only obstacles to worry about are license and getting it to fit into our codebase. Adding more than zero high-level comments with a good description of how it works in detail is also a bit of a challenge.

> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
>   mixed | ascii
> -------+-------
>     447 |    46
> (1 row)
>
> I think we should at least try to adopt that. At a high level, it looks
> pretty similar to your patch: you load the data 8 bytes at a time and check
> whether they are all ASCII. If there are any non-ASCII chars, you check the
> bytes one by one; otherwise you load the next 8 bytes. Your patch should
> be able to achieve the same performance, if done right. I don't think
> the simdjson code forbids \0 bytes, so that will add a few cycles, but
> still.

Okay, I'll look into that.

> PS. Your patch as it stands isn't safe on systems with strict alignment,
> the string passed to the verify function isn't guaranteed to be 8 bytes
> aligned. Use memcpy to fetch the next 8-byte chunk to fix.

Will do.

[1] https://github.com/lemire/fastvalidate-utf-8/tree/master/include
[2] https://lemire.me/blog/2018/10/19/validating-utf-8-bytes-using-only-0-45-cycles-per-byte-avx-edition/

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:


On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
>   mixed | ascii
> -------+-------
>     447 |    46
> (1 row)
>
> I think we should at least try to adopt that. At a high level, it looks
> pretty similar to your patch: you load the data 8 bytes at a time and check
> whether they are all ASCII. If there are any non-ASCII chars, you check the
> bytes one by one; otherwise you load the next 8 bytes. Your patch should
> be able to achieve the same performance, if done right. I don't think
> the simdjson code forbids \0 bytes, so that will add a few cycles, but
> still.

That fallback is very similar to my "inline C" case upthread, and they both actually check 16 bytes at a time (the comment is wrong in the patch you shared). I can work back and show how the performance changes with each difference (just MacOS, clang 10 here):

master

 mixed | ascii
-------+-------
   757 |   366

v1, but using memcpy()

 mixed | ascii
-------+-------
   601 |   129

remove zero-byte check:

 mixed | ascii
-------+-------
   588 |    93

inline ascii fastpath into pg_utf8_verifystr()

 mixed | ascii
-------+-------
   595 |    71

use 16-byte stride

 mixed | ascii
-------+-------
   652 |    49

With this cpu/compiler, v1 is fastest on the mixed input, all else being equal.

Maybe there's a smarter way to check for zeros in C. Or maybe be more careful about cache -- running memchr() on the whole input first might not be the best thing to do. 

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [POC] verifying UTF-8 using SIMD instructions

From
Heikki Linnakangas
Date:
On 09/02/2021 22:08, John Naylor wrote:
> Maybe there's a smarter way to check for zeros in C. Or maybe be more 
> careful about cache -- running memchr() on the whole input first might 
> not be the best thing to do.

The usual trick is the haszero() macro here: 
https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord. That's 
how memchr() is typically implemented, too.
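For reference, the macro in question looks like this (taken in spirit from that page; the 64-bit constants here are ours):

```c
#include <stdint.h>

/* haszero(): nonzero iff the 64-bit word 'v' contains a zero byte.
 * Subtracting 0x01 from each byte borrows into the high bit only where
 * the byte was zero; "& ~(v)" filters out bytes whose high bit was
 * already set, so there are no false positives. */
#define haszero(v) \
	(((v) - UINT64_C(0x0101010101010101)) & \
	 ~(v) & UINT64_C(0x8080808080808080))
```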

- Heikki



Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:


I wrote:
>
> On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> One of his earlier demos [1] (in simdutf8check.h) had a version that used mostly SSE2 with just three intrinsics from SSSE3. That's widely available by now. He measured that at 0.7 cycles per byte, which is still good compared to AVX2 0.45 cycles per byte [2].
>
> Testing for three SSSE3 intrinsics in autoconf is pretty easy. I would assume that if that check (and the corresponding runtime check) passes, we can assume SSE2. That code has three licenses to choose from -- Apache 2, Boost, and MIT. Something like that might be straightforward to start from. I think the only obstacles to worry about are license and getting it to fit into our codebase. Adding more than zero high-level comments with a good description of how it works in detail is also a bit of a challenge.

I double checked, and it's actually two SSSE3 intrinsics and one SSE4.1, but the 4.1 one can be emulated with a few SSE2 intrinsics. But we could probably fold all three into the SSE4.2 CRC check and have a single symbol to save on boilerplate.

I hacked that demo [1] into wchar.c (very ugly patch attached), and got the following:

master

 mixed | ascii
-------+-------
   757 |   366

Lemire demo:

 mixed | ascii
-------+-------
   172 |   168

This one lacks an ascii fast path, but the AVX2 version in the same file has one that could probably be easily adapted. With that, I think this would be worth adapting to our codebase and license. Thoughts?

The advantage of this demo is that it's not buried in a mountain of modern C++.
 
Simdjson can use AVX -- do you happen to know which target it got compiled to? AVX vectors are 256 bits wide, and that requires OS support. The OSes we care most about were updated 8-12 years ago, but that would still be something to check, in addition to more configure checks.

Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:


On Tue, Feb 9, 2021 at 4:22 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 09/02/2021 22:08, John Naylor wrote:
> > Maybe there's a smarter way to check for zeros in C. Or maybe be more
> > careful about cache -- running memchr() on the whole input first might
> > not be the best thing to do.
>
> The usual trick is the haszero() macro here:
> https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord. That's
> how memchr() is typically implemented, too.

Thanks for that. Checking with that macro each loop iteration gives a small boost:

v1, but using memcpy()

 mixed | ascii
-------+-------
   601 |   129

with haszero()

 mixed | ascii
-------+-------
   583 |   105

remove zero-byte check:

 mixed | ascii
-------+-------
   588 |    93

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
>   mixed | ascii
> -------+-------
>     447 |    46
> (1 row)
>
> I think we should at least try to adopt that. At a high level, it looks
> pretty similar to your patch: you load the data 8 bytes at a time and check
> whether they are all ASCII. If there are any non-ASCII chars, you check the
> bytes one by one; otherwise you load the next 8 bytes. Your patch should
> be able to achieve the same performance, if done right. I don't think
> the simdjson code forbids \0 bytes, so that will add a few cycles, but
> still.

Attached is a patch that does roughly what simdjson fallback did, except I use straight tests on the bytes and only calculate code points in assertion builds. In the course of doing this, I found that my earlier concerns about putting the ascii check in a static inline function were due to my suboptimal loop implementation. I had assumed that if the chunked ascii check failed, it had to check all those bytes one at a time. As it turns out, that's a waste of the branch predictor. In the v2 patch, we do the chunked ascii check every time we loop. With that, I can also confirm the claim in the Lemire paper that it's better to do the check on 16-byte chunks:

(MacOS, Clang 10)

master:

 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366

v2 patch, with 16-byte stride:

 chinese | mixed | ascii
---------+-------+-------
     806 |   474 |    83

patch but with 8-byte stride:

 chinese | mixed | ascii
---------+-------+-------
     792 |   490 |   105

I also included the fast path in all other multibyte encodings, and that is also pretty good performance-wise. It regresses from master on pure multibyte input, but that case is still faster than PG13, which I simulated by reverting 6c5576075b0f9 and b80e10638e3:

~PG13:

 chinese | mixed | ascii
---------+-------+-------
    1565 |   848 |   365

ascii fast-path plus pg_*_verifychar():

 chinese | mixed | ascii
---------+-------+-------
    1279 |   656 |    94


v2 has a rough start to having multiple implementations in src/backend/port. Next steps are:

1. Add more tests for utf-8 coverage (in addition to the ones to be added by the noError argument patch)
2. Add SSE4 validator -- it turns out the demo I referred to earlier doesn't match the algorithm in the paper. I plan to only copy the lookup tables from simdjson verbatim, but the code will basically be written from scratch, using simdjson as a hint.
3. Adjust configure.ac

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
Heikki Linnakangas
Date:
On 13/02/2021 03:31, John Naylor wrote:
> On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinnaka@iki.fi 
> <mailto:hlinnaka@iki.fi>> wrote:
>  >
>  > I also tested the fallback implementation from the simdjson library
>  > (included in the patch, if you uncomment it in simdjson-glue.c):
>  >
>  >   mixed | ascii
>  > -------+-------
>  >     447 |    46
>  > (1 row)
>  >
>  > I think we should at least try to adopt that. At a high level, it looks
>  > pretty similar to your patch: you load the data 8 bytes at a time and check
>  > whether they are all ASCII. If there are any non-ASCII chars, you check the
>  > bytes one by one; otherwise you load the next 8 bytes. Your patch should
>  > be able to achieve the same performance, if done right. I don't think
>  > the simdjson code forbids \0 bytes, so that will add a few cycles, but
>  > still.
> 
> Attached is a patch that does roughly what simdjson fallback did, except 
> I use straight tests on the bytes and only calculate code points in 
> assertion builds. In the course of doing this, I found that my earlier 
> concerns about putting the ascii check in a static inline function were 
> due to my suboptimal loop implementation. I had assumed that if the 
> chunked ascii check failed, it had to check all those bytes one at a 
> time. As it turns out, that's a waste of the branch predictor. In the v2 
> patch, we do the chunked ascii check every time we loop. With that, I 
> can also confirm the claim in the Lemire paper that it's better to do 
> the check on 16-byte chunks:
> 
> (MacOS, Clang 10)
> 
> master:
> 
>   chinese | mixed | ascii
> ---------+-------+-------
>      1081 |   761 |   366
> 
> v2 patch, with 16-byte stride:
> 
>   chinese | mixed | ascii
> ---------+-------+-------
>       806 |   474 |    83
> 
> patch but with 8-byte stride:
> 
>   chinese | mixed | ascii
> ---------+-------+-------
>       792 |   490 |   105
> 
> I also included the fast path in all other multibyte encodings, and that 
> is also pretty good performance-wise.

Cool.

> It regresses from master on pure 
> multibyte input, but that case is still faster than PG13, which I 
> simulated by reverting 6c5576075b0f9 and b80e10638e3:

I thought the "chinese" numbers above are pure multibyte input, and it 
seems to do well on that. Where does it regress? In multibyte encodings 
other than UTF-8? How bad is the regression?

I tested this on my first generation Raspberry Pi (chipmunk). I had to 
tweak it a bit to make it compile, since the SSE autodetection code was 
not finished yet. And I used generate_series(1, 1000) instead of 
generate_series(1, 10000) in the test script (mbverifystr-speed.sql) 
because this system is so slow.

master:

  mixed | ascii
-------+-------
   1310 |  1041
(1 row)

v2-add-portability-stub-and-new-fallback.patch:

  mixed | ascii
-------+-------
   2979 |   910
(1 row)

I'm guessing that's because the unaligned access in check_ascii() is 
expensive on this platform.

- Heikki



Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
On Mon, Feb 15, 2021 at 9:18 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>

Attached is the first attempt at using SSE4 to do the validation, but first I'll answer your questions about the fallback.

I should mention that v2 had a correctness bug for 4-byte characters that I found when I was writing regression tests. It shouldn't materially affect performance, however.

> I thought the "chinese" numbers above are pure multibyte input, and it
> seems to do well on that. Where does it regress? In multibyte encodings
> other than UTF-8?

Yes, the second set of measurements was intended to represent multibyte encodings other than UTF-8. But instead of using one of those encodings, I simulated non-UTF-8 by copying the pattern used for those: in the loop, check for ascii then either advance or verify one character. It was a quick way to use the same test.

> How bad is the regression?

I'll copy the measurements here together with master so it's easier to compare:

~= PG13 (revert 6c5576075b0f9 and b80e10638e3):

 chinese | mixed | ascii
---------+-------+-------
    1565 |   848 |   365

master:

 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366

ascii fast-path plus pg_*_verifychar():

 chinese | mixed | ascii
---------+-------+-------
    1279 |   656 |    94

As I mentioned upthread, pure multibyte is still faster than PG13. Reducing the ascii check to 8 bytes at a time might alleviate the regression.

> I tested this on my first generation Raspberry Pi (chipmunk). I had to
> tweak it a bit to make it compile, since the SSE autodetection code was
> not finished yet. And I used generate_series(1, 1000) instead of
> generate_series(1, 10000) in the test script (mbverifystr-speed.sql)
> because this system is so slow.
>
> master:
>
>   mixed | ascii
> -------+-------
>    1310 |  1041
> (1 row)
>
> v2-add-portability-stub-and-new-fallback.patch:
>
>   mixed | ascii
> -------+-------
>    2979 |   910
> (1 row)
>
> I'm guessing that's because the unaligned access in check_ascii() is
> expensive on this platform.

Hmm, I used memcpy() as suggested. Is that still slow on that platform? That's 32-bit, right? Some possible remedies:

1) For the COPY FROM case, we should align the allocation on a cacheline -- we already have examples of that idiom elsewhere. I was actually going to suggest doing this anyway, since unaligned SIMD loads are often slower, too.

2) As the simdjson fallback was based on Fuchsia (the Lemire paper implies it was tested carefully on Arm and I have no reason to doubt that), I could try to follow that example more faithfully by computing the actual codepoints. It's more computation and just as many branches as far as I can tell, but it's not a lot of work. I can add that alternative fallback to the patch set. I have no Arm machines, but I can test on a POWER8 machine.

3) #ifdef out the ascii check for 32-bit platforms.

4) Same as the non-UTF8 case -- only check for ascii 8 bytes at a time. I'll probably try this first.

Now, I'm pleased to report that I got the SSE4 validator working. It still needs some stress testing to find any corner-case bugs, but it shouldn't be too early to share some numbers on Clang 10 / MacOS:

master:

 chinese | mixed | ascii
---------+-------+-------
    1082 |   751 |   364

v3 with SSE4.1:

 chinese | mixed | ascii
---------+-------+-------
     127 |   128 |   126

Some caveats and notes:

- It takes almost no recognizable code from simdjson, but it does take the magic constants lookup tables almost verbatim. The main body of the code has no intrinsics at all (I think). They're all hidden inside static inline helper functions. I reused some cryptic variable names from simdjson. It's a bit messy but not terrible.

- It diffs against the noError conversion patch and adds additional tests.

- It's not smart enough to stop at the last valid character boundary -- it's either all-valid or it must start over with the fallback. That will have to change in order to work with the proposed noError conversions. It shouldn't be very hard, but needs thought as to the clearest and safest way to code it.

- There is no ascii fast-path yet. With this algorithm we have to be a bit more careful since a valid ascii chunk could be preceded by an incomplete sequence at the end of the previous chunk. Not too hard, just a bit more work.

- This is my first time hacking autoconf, and it still seems slightly broken, yet functional on my machine at least.

- It only needs SSE4.1, but I didn't want to create a whole new CFLAGS, so it just reuses SSE4.2 for the runtime check and the macro names. Also, it doesn't test for SSE2; it just insists on 64-bit for the runtime check. I imagine it would refuse to build on 32-bit machines if you passed it -msse42.

- There is a placeholder for Windows support, but it's not developed.

- I had to add a large number of casts to get rid of warnings in the magic constants macros. That needs some polish.

I also attached a C file that visually demonstrates every step of the algorithm, following the example in Table 9 of the paper. It contains the skeleton coding I started with and abandoned early, so it may differ from the actual patch.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
I wrote:

> [v3]
> - It's not smart enough to stop at the last valid character boundary -- it's either all-valid or it must start over with the fallback. That will have to change in order to work with the proposed noError conversions. It shouldn't be very hard, but needs thought as to the clearest and safest way to code it.

In v4, it should be able to return an accurate count of valid bytes even when the end crosses a character boundary.
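The boundary logic amounts to something like the following sketch (a hypothetical helper, not the actual v4 code): back up over trailing continuation bytes and check whether the final lead byte's sequence is complete.

```c
#include <assert.h>

/*
 * Illustrative only: given 'len' bytes that passed a coarse chunk
 * check but may end mid-character, return the length of the longest
 * prefix ending on a character boundary. Invalid sequences are
 * assumed to be caught elsewhere.
 */
static int
utf8_complete_prefix_len(const unsigned char *s, int len)
{
	int			i = len;

	/* back up over continuation bytes (10xxxxxx) */
	while (i > 0 && (s[i - 1] & 0xC0) == 0x80)
		i--;

	if (i > 0 && (s[i - 1] & 0x80) != 0)
	{
		/* s[i - 1] starts a multibyte sequence; is it complete? */
		unsigned char lead = s[i - 1];
		int			seqlen;

		if ((lead & 0xE0) == 0xC0)
			seqlen = 2;
		else if ((lead & 0xF0) == 0xE0)
			seqlen = 3;
		else
			seqlen = 4;

		if (i - 1 + seqlen > len)
			return i - 1;		/* incomplete: cut before the lead byte */
	}
	return len;
}
```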

> - This is my first time hacking autoconf, and it still seems slightly broken, yet functional on my machine at least.

It was actually completely broken if you tried to pass the special flags to configure. I redesigned this part and it seems to work now. 

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
On Mon, Feb 15, 2021 at 9:32 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Mon, Feb 15, 2021 at 9:18 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > I'm guessing that's because the unaligned access in check_ascii() is
> > expensive on this platform.

> Some possible remedies:

> 3) #ifdef out the ascii check for 32-bit platforms.

> 4) Same as the non-UTF8 case -- only check for ascii 8 bytes at a time. I'll probably try this first.

I've attached a couple patches to try on top of v4; maybe they'll help the Arm32 regression. 01 reduces the stride to 8 bytes, and 02 applies on top of v1 to disable the fallback fast path entirely on 32-bit platforms. A bit of a heavy hammer, but it'll confirm (or not) your theory about unaligned loads.

Also, I've included patches to explain more fully how I modeled non-UTF-8 performance while still using the UTF-8 tests. I think it was a useful thing to do, and I have a theory that might predict how a non-UTF8 encoding will perform with the fast path.

03A and 03B are alternatives that conflict with each other, but both apply on top of v4 (02 is not needed). Both replace the v4 fallback with the ascii fast path + pg_utf8_verifychar() in the loop, similar to utf-8 on master. 03A has a local static copy of pg_utf8_islegal(), and 03B uses the existing global function. (On x86, you can disable SSE4 by passing USE_FALLBACK_UTF8=1 to configure.)

While Clang 10 regressed for me on pure multibyte in a similar test upthread, on Linux gcc 8.4 there isn't a regression at all. IIRC, gcc wasn't as good as Clang when the API changed a few weeks ago, so its regression from v4 is still faster than master. Clang only regressed with my changes because it somehow handled master much better to begin with.

x86-64 Linux gcc 8.4

master

 chinese | mixed | ascii
---------+-------+-------
    1453 |   857 |   428

v4 (fallback verifier written as a single function)

 chinese | mixed | ascii
---------+-------+-------
     815 |   514 |    82

v4 plus addendum 03A -- emulate non-utf-8 using a copy of pg_utf8_islegal() as a static function

 chinese | mixed | ascii
---------+-------+-------
    1115 |   547 |    87

v4 plus addendum 03B -- emulate non-utf-8 using pg_utf8_islegal() as a global function

 chinese | mixed | ascii
---------+-------+-------
    1279 |   604 |    82

(I also tried the same on ppc64le Linux, gcc 4.8.5 and while not great, it never got worse than master either on pure multibyte.)

This is supposed to model the performance of a non-utf8 encoding, where we don't have a bespoke function written from scratch. Here's my theory: if an encoding's pg_*_verifychar() calls a global function such as pg_*_mblen(), it seems it won't benefit as much from an ascii fast path as one whose pg_*_verifychar() makes no function calls. I'm not sure whether a compiler can inline a global function's body into call sites in the unit where it's defined. (I haven't looked at the assembly.) But recall that you didn't commit 0002 from the earlier encoding change because it didn't perform well. I looked at that patch again, and while it inlined the pg_utf8_verifychar() call, it still called the global function pg_utf8_islegal().

If the above is anything to go by, on gcc at least, I don't think we need to worry about a regression when adding an ascii fast path to non-utf-8 multibyte encodings.

Regarding SSE, I've added an ascii fast path in my local branch, but it's not going to be as big a difference because 1) the check is more expensive in terms of branches than the C case, and 2) because the general case is so fast already, it's hard to improve upon. I just need to do some testing and cleanup on the whole thing, and that'll be ready to share.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
I wrote:

> Thanks for testing! Good, the speedup is about as much as I can hope for using plain C. In the next patch I'll go ahead and squash in the ascii fast path, using 16-byte stride, unless there are objections. I claim we can live with the regression Heikki found on an old 32-bit Arm platform since it doesn't seem to be true of Arm in general.

In v8, I've squashed the 16-byte stride into 0002. I also removed the sole holdout of hard-coded intrinsics, by putting _mm_setr_epi8 inside a variadic macro, and also did some reordering of the one-line function definitions. (As before, 0001 is not my patch, but parts of it are a prerequisite to my regression tests.)

Over in [1], I tested in situ in a COPY FROM test and found a 10% speedup with mixed ascii and multibyte in the copy code, i.e. with buffer and storage taken completely out of the picture.

Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:
v9 is just a rebase.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

speed up verifying UTF-8

From
John Naylor
Date:
For v10, I've split the patch up into two parts. 0001 uses pure C everywhere. This is much smaller and easier to review, and gets us the most bang for the buck. 

One concern Heikki raised upthread is that platforms with poor unaligned-memory access will see a regression. We could easily add an #ifdef to take care of that, but I haven't done so here.

To recap: On ascii-only input with storage taken out of the picture, profiles of COPY FROM show a reduction from nearly 10% down to just over 1%. In microbenchmarks found earlier in this thread, this works out to about 7 times faster. On multibyte/mixed input, 0001 is a bit faster, but not really enough to make a difference in copy performance.

0002 adds the SSE4 implementation on x86-64, and is equally fast on all input, at the cost of greater complexity.

To reflect the split, I've changed the thread subject and the commitfest title.
--
Attachment

Re: speed up verifying UTF-8

From
Heikki Linnakangas
Date:
On 02/06/2021 19:26, John Naylor wrote:
> For v10, I've split the patch up into two parts. 0001 uses pure C 
> everywhere. This is much smaller and easier to review, and gets us the 
> most bang for the buck.
> 
> One concern Heikki raised upthread is that platforms with poor 
> unaligned-memory access will see a regression. We could easily add an 
> #ifdef to take care of that, but I haven't done so here.
> 
> To recap: On ascii-only input with storage taken out of the picture, 
> profiles of COPY FROM show a reduction from nearly 10% down to just over 
> 1%. In microbenchmarks found earlier in this thread, this works out to 
> about 7 times faster. On multibyte/mixed input, 0001 is a bit faster, 
> but not really enough to make a difference in copy performance.

Nice!

This kind of bit-twiddling is fun, so I couldn't resist tinkering with 
it, to see if we can shave some more instructions from it:

> +/* from https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord */
> +#define HAS_ZERO(chunk) ( \
> +    ((chunk) - UINT64CONST(0x0101010101010101)) & \
> +     ~(chunk) & \
> +     UINT64CONST(0x8080808080808080))
> +
> +/* Verify a chunk of bytes for valid ASCII including a zero-byte check. */
> +static inline int
> +check_ascii(const unsigned char *s, int len)
> +{
> +    uint64        half1,
> +                half2,
> +                highbits_set;
> +
> +    if (len >= 2 * sizeof(uint64))
> +    {
> +        memcpy(&half1, s, sizeof(uint64));
> +        memcpy(&half2, s + sizeof(uint64), sizeof(uint64));
> +
> +        /* If there are zero bytes, bail and let the slow path handle it. */
> +        if (HAS_ZERO(half1) || HAS_ZERO(half2))
> +            return 0;
> +
> +        /* Check if any bytes in this chunk have the high bit set. */
> +        highbits_set = ((half1 | half2) & UINT64CONST(0x8080808080808080));
> +
> +        if (!highbits_set)
> +            return 2 * sizeof(uint64);
> +        else
> +            return 0;
> +    }
> +    else
> +        return 0;
> +}

Some ideas:

1. Better to check if any high bits are set first. We care more about 
the speed of that than of detecting zero bytes, because input with high 
bits is valid but zeros are an error.

2. Since we check that there are no high bits, we can do the zero-checks 
with fewer instructions like this:

/* NB: this is only correct if 'chunk' doesn't have any high bits set */
#define HAS_ZERO(chunk) ( \
   (((chunk) + \
    UINT64CONST(0x7f7f7f7f7f7f7f7f)) & \
    UINT64CONST(0x8080808080808080)) != UINT64CONST(0x8080808080808080))

3. It's probably cheaper to perform the HAS_ZERO check just once on (half1 
| half2). We have to compute (half1 | half2) anyway.


Putting all that together:

/* Verify a chunk of bytes for valid ASCII including a zero-byte check. */
static inline int
check_ascii(const unsigned char *s, int len)
{
    uint64        half1,
                half2,
                highbits_set;
    uint64        x;

    if (len >= 2 * sizeof(uint64))
    {
        memcpy(&half1, s, sizeof(uint64));
        memcpy(&half2, s + sizeof(uint64), sizeof(uint64));

        /* Check if any bytes in this chunk have the high bit set. */
        highbits_set = ((half1 | half2) & UINT64CONST(0x8080808080808080));
        if (highbits_set)
            return 0;

        /*
         * Check if there are any zero bytes in this chunk. This is only correct
         * if there are no high bits set, but we checked that already.
         */
        x = (half1 | half2) + UINT64CONST(0x7f7f7f7f7f7f7f7f);
        x &= UINT64CONST(0x8080808080808080);
        if (x != UINT64CONST(0x8080808080808080))
            return 0;

        return 2 * sizeof(uint64);
    }
    else
        return 0;
}

In quick testing, that indeed compiles into fewer instructions. With 
GCC, there's no measurable difference in performance. But with clang, 
this version is much faster than the original, because the original 
version is much slower than when compiled with GCC. In other words, this 
version seems to avoid some clang misoptimization. I tested only with 
ASCII input, I haven't tried other cases.

What test set have you been using for performance testing this? I'd like 
to know how this version compares, and I could also try running it on my 
old raspberry pi, which is more strict about alignment.

> 0002 adds the SSE4 implementation on x86-64, and is equally fast on all 
> input, at the cost of greater complexity.

Didn't look closely, but seems reasonable at a quick glance.

- Heikki



Re: speed up verifying UTF-8

From
Greg Stark
Date:
> 3. It's probably cheaper to perform the HAS_ZERO check just once on (half1
> | half2). We have to compute (half1 | half2) anyway.

Wouldn't you have to check (half1 & half2)?



Re: speed up verifying UTF-8

From
Greg Stark
Date:
I haven't looked at the surrounding code. Are we processing all the
COPY data in one long stream, or processing each field individually? If
we're processing much more than 128 bits and are happy to detect NUL
errors only at the end, after wasting some work, then you could hoist
that has_zero check entirely out of the loop (removing the branch,
though it's probably a correctly predicted branch anyway).

Do something like:

zero_accumulator = zero_accumulator & next_chunk

in the loop and then only at the very end check for zeros in that.
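Concretely, that hoisted check might look like this in plain C (sketch only; all names are hypothetical, and note that AND-ing chunks can produce false positives, so a hit would mean "rescan byte-wise", not "error"):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * AND each 8-byte chunk into an accumulator inside the loop, and test
 * for (possible) zero bytes only once at the end. A byte that is zero
 * in any chunk stays zero in the accumulator; two nonzero bytes can
 * also AND to zero, hence "may".
 */
static int
chunks_may_contain_zero(const unsigned char *s, int len)
{
	uint64_t	zero_accumulator = ~UINT64_C(0);
	uint64_t	chunk;

	for (int i = 0; i + 8 <= len; i += 8)
	{
		memcpy(&chunk, s + i, 8);
		zero_accumulator &= chunk;
	}

	/* zero-in-word test from the bithacks page */
	return ((zero_accumulator - UINT64_C(0x0101010101010101)) &
			~zero_accumulator & UINT64_C(0x8080808080808080)) != 0;
}
```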



Re: speed up verifying UTF-8

From
John Naylor
Date:

On Thu, Jun 3, 2021 at 10:42 AM Greg Stark <stark@mit.edu> wrote:
>
> I haven't looked at the surrounding code. Are we processing all the
> COPY data in one long stream or processing each field individually? 

It happens on 64kB chunks.

> If
> we're processing much more than 128 bits and happy to detect NUL
> errors only at the end after wasting some work then you could hoist
> that has_zero check entirely out of the loop (removing the branch
> though it's probably a correctly predicted branch anyways).
>
> Do something like:
>
> zero_accumulator = zero_accumulator & next_chunk
>
> in the loop and then only at the very end check for zeros in that.

That's the approach taken in the SSE4 patch, and in fact that's the logical way to do it there. I hadn't considered doing it that way in the pure C case, but I think it's worth trying.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
John Naylor
Date:


I wrote:

> On Thu, Jun 3, 2021 at 10:42 AM Greg Stark <stark@mit.edu> wrote:
> >

> > If
> > we're processing much more than 128 bits and happy to detect NUL
> > errors only at the end after wasting some work then you could hoist
> > that has_zero check entirely out of the loop (removing the branch
> > though it's probably a correctly predicted branch anyways).
> >
> > Do something like:
> >
> > zero_accumulator = zero_accumulator & next_chunk
> >
> > in the loop and then only at the very end check for zeros in that.
>
> That's the approach taken in the SSE4 patch, and in fact that's the logical way to do it there. I hadn't considered doing it that way in the pure C case, but I think it's worth trying.

Actually, I spoke too quickly. We can't have an error accumulator in the C case because we need to return how many bytes were valid. In fact, in the SSE case, it checks the error vector at the end and then reruns with the fallback case to count the valid bytes.
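That two-phase pattern can be sketched like this (the predicate here is a trivial scalar stand-in, not the real SSE or UTF-8 check, and every name is hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in fast check: true iff the whole buffer is "valid"
 * (here: all ascii, no zero bytes). Only accumulates pass/fail. */
static bool
fast_check(const unsigned char *s, int len)
{
	for (int i = 0; i < len; i++)
		if (s[i] == 0 || s[i] >= 0x80)
			return false;
	return true;
}

/* Byte-wise fallback that counts the valid prefix. */
static int
slow_count(const unsigned char *s, int len)
{
	int			i = 0;

	while (i < len && s[i] != 0 && s[i] < 0x80)
		i++;
	return i;
}

/*
 * The fast path only knows whether an error occurred somewhere, so on
 * failure we rerun the fallback from the start to learn how many
 * bytes were valid.
 */
static int
verify(const unsigned char *s, int len)
{
	if (fast_check(s, len))
		return len;
	return slow_count(s, len);
}
```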

--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
John Naylor
Date:
On Thu, Jun 3, 2021 at 9:16 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> Some ideas:
>
> 1. Better to check if any high bits are set first. We care more about
> the speed of that than of detecting zero bytes, because input with high
> bits is valid but zeros are an error.
>
> 2. Since we check that there are no high bits, we can do the zero-checks
> with fewer instructions like this:

Both ideas make sense, and I like the shortcut we can take with the zero check. I think Greg is right that the zero check needs “half1 & half2”, so I tested with that (updated patches attached).

> What test set have you been using for performance testing this? I'd like

The microbenchmark is the same one you attached to [1], which I extended with a 95% multibyte case. With the new zero check:

clang 12.0.5 / MacOS:

master:

 chinese | mixed | ascii
---------+-------+-------
     981 |   688 |   371

0001:

 chinese | mixed | ascii
---------+-------+-------
     932 |   548 |   110

plus optimized zero check:

 chinese | mixed | ascii
---------+-------+-------
     689 |   573 |    59

It makes sense that the Chinese text case is faster since the zero check is skipped.

gcc 4.8.5 / Linux:

master:

 chinese | mixed | ascii
---------+-------+-------
    2561 |  1493 |   825

0001:

 chinese | mixed | ascii
---------+-------+-------
    2968 |  1035 |   158

plus optimized zero check:

 chinese | mixed | ascii
---------+-------+-------
    2413 |  1078 |   137

The second machine is a bit older and has an old compiler, but there is still a small speed increase. In fact, without Heikki's tweaks, 0001 regresses on multibyte.

(Note: I'm not seeing the 7x improvement I claimed for 0001 here, but that was from memory and I think that was a different machine and newer gcc. We can report a range of results as we proceed.)

[1] https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5a2f@iki.fi

--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
Heikki Linnakangas
Date:
On 03/06/2021 17:33, Greg Stark wrote:
>> 3. It's probably cheaper to perform the HAS_ZERO check just once on (half1
> | half2). We have to compute (half1 | half2) anyway.
> 
> Wouldn't you have to check (half1 & half2) ?

Ah, you're right, of course. But & is not quite right either; it will 
give false positives. That's OK from a correctness point of view here, 
because we then fall back to checking byte by byte, but I don't think 
it's a good tradeoff.

I think this works, however:

/* Verify a chunk of bytes for valid ASCII including a zero-byte check. */
static inline int
check_ascii(const unsigned char *s, int len)
{
    uint64        half1,
                half2,
                highbits_set;
    uint64        x1,
                x2;
    uint64        x;

    if (len >= 2 * sizeof(uint64))
    {
        memcpy(&half1, s, sizeof(uint64));
        memcpy(&half2, s + sizeof(uint64), sizeof(uint64));

        /* Check if any bytes in this chunk have the high bit set. */
        highbits_set = ((half1 | half2) & UINT64CONST(0x8080808080808080));
        if (highbits_set)
            return 0;

        /*
         * Check if there are any zero bytes in this chunk.
         *
         * First, add 0x7f to each byte. This sets the high bit in each byte,
         * unless it was a zero. We already checked that none of the bytes had
         * the high bit set previously, so the max value each byte can have
         * after the addition is 0x7f + 0x7f = 0xfe, and we don't need to
         * worry about carrying over to the next byte.
         */
        x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
        x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);

        /* then check that the high bit is set in each byte. */
        x = (x1 | x2);
        x &= UINT64CONST(0x8080808080808080);
        if (x != UINT64CONST(0x8080808080808080))
            return 0;

        return 2 * sizeof(uint64);
    }
    else
        return 0;
}

- Heikki



Re: speed up verifying UTF-8

From
John Naylor
Date:


On Thu, Jun 3, 2021 at 3:08 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 03/06/2021 17:33, Greg Stark wrote:
> >> 3. It's probably cheaper to perform the HAS_ZERO check just once on (half1
> > | half2). We have to compute (half1 | half2) anyway.
> >
> > Wouldn't you have to check (half1 & half2) ?
>
> Ah, you're right of course. But & is not quite right either, it will
> give false positives. That's ok from a correctness point of view here,
> because we then fall back to checking byte by byte, but I don't think
> it's a good tradeoff.

Ah, of course.

>                 /*
>                  * Check if there are any zero bytes in this chunk.
>                  *
>                  * First, add 0x7f to each byte. This sets the high bit in each byte,
>                  * unless it was a zero. We already checked that none of the bytes had
>                  * the high bit set previously, so the max value each byte can have
>                  * after the addition is 0x7f + 0x7f = 0xfe, and we don't need to
>                  * worry about carrying over to the next byte.
>                  */
>                 x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
>                 x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
>
>                 /* then check that the high bit is set in each byte. */
>                 x = (x1 | x2);
>                 x &= UINT64CONST(0x8080808080808080);
>                 if (x != UINT64CONST(0x8080808080808080))
>                         return 0;

That seems right, I'll try that and update the patch. (Forgot to attach earlier anyway)

--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
Heikki Linnakangas
Date:
On 03/06/2021 22:10, John Naylor wrote:
> On Thu, Jun 3, 2021 at 3:08 PM Heikki Linnakangas <hlinnaka@iki.fi 
> <mailto:hlinnaka@iki.fi>> wrote:
>  >                 x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
>  >                 x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
>  >
>  >                 /* then check that the high bit is set in each byte. */
>  >                 x = (x1 | x2);
>  >                 x &= UINT64CONST(0x8080808080808080);
>  >                 if (x != UINT64CONST(0x8080808080808080))
>  >                         return 0;
> 
> That seems right, I'll try that and update the patch. (Forgot to attach 
> earlier anyway)

Ugh, actually that has the same issue as before. If a byte in one half 
is zero, but the corresponding byte in the other half is not, this 
fails to detect it. Sorry for the noise.

- Heikki



Re: speed up verifying UTF-8

From
Heikki Linnakangas
Date:
On 03/06/2021 22:16, Heikki Linnakangas wrote:
> On 03/06/2021 22:10, John Naylor wrote:
>> On Thu, Jun 3, 2021 at 3:08 PM Heikki Linnakangas <hlinnaka@iki.fi
>> <mailto:hlinnaka@iki.fi>> wrote:
>>   >                 x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
>>   >                 x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
>>   >
>>   >                 /* then check that the high bit is set in each byte. */
>>   >                 x = (x1 | x2);
>>   >                 x &= UINT64CONST(0x8080808080808080);
>>   >                 if (x != UINT64CONST(0x8080808080808080))
>>   >                         return 0;
>>
>> That seems right, I'll try that and update the patch. (Forgot to attach
>> earlier anyway)
> 
> Ugh, actually that has the same issue as before. If a byte in one half
> is zero, but the corresponding byte in the other half is not, this
> fails to detect it. Sorry for the noise.

If you replace (x1 | x2) with (x1 & x2) above, I think it's correct.

- Heikki



Re: speed up verifying UTF-8

From
John Naylor
Date:
On Thu, Jun 3, 2021 at 3:22 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 03/06/2021 22:16, Heikki Linnakangas wrote:
> > On 03/06/2021 22:10, John Naylor wrote:
> >> On Thu, Jun 3, 2021 at 3:08 PM Heikki Linnakangas <hlinnaka@iki.fi
> >> <mailto:hlinnaka@iki.fi>> wrote:
> >>   >                 x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
> >>   >                 x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
> >>   >
> >>   >                 /* then check that the high bit is set in each byte. */
> >>   >                 x = (x1 | x2);
> >>   >                 x &= UINT64CONST(0x8080808080808080);
> >>   >                 if (x != UINT64CONST(0x8080808080808080))
> >>   >                         return 0;

> If you replace (x1 | x2) with (x1 & x2) above, I think it's correct.

After looking at it again with fresh eyes, I agree this is correct. I modified the regression tests to pad the input bytes with ascii so that the code path that works on 16 bytes at a time is tested. I use both UTF-8 input tables for some of the additional tests. There is a de facto requirement that the descriptions be unique across both of the input tables. That could be done more elegantly, but I wanted to keep things simple for now.

v11-0001 is an improvement over v10:

clang 12.0.5 / MacOS:

master:

 chinese | mixed | ascii
---------+-------+-------
     975 |   686 |   369

v10-0001:

 chinese | mixed | ascii
---------+-------+-------
     930 |   549 |   109

v11-0001:

 chinese | mixed | ascii
---------+-------+-------
     687 |   440 |    64


gcc 4.8.5 / Linux (older machine)

master:

 chinese | mixed | ascii
---------+-------+-------
    2559 |  1495 |   825

v10-0001:

 chinese | mixed | ascii
---------+-------+-------
    2966 |  1034 |   156

v11-0001:

 chinese | mixed | ascii
---------+-------+-------
    2242 |   824 |   140

Previous testing on POWER8 and Arm64 leads me to expect similar results there as well.

I also looked again at 0002 and decided I wasn't quite happy with the test coverage. Previously, the code padded out a short input with ascii so that the 16-bytes-at-a-time code path was always exercised. However, that required some finicky complexity and still wasn't adequate. For v11, I ripped that out and put the responsibility on the regression tests to make sure the various code paths are exercised.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
Heikki Linnakangas
Date:
On 03/06/2021 21:58, John Naylor wrote:
> 
>  > What test set have you been using for performance testing this? I'd like
> 
> The microbenchmark is the same one you attached to [1], which I extended 
> with a 95% multibyte case.

Could you share the exact test you're using? I'd like to test this on my 
old raspberry pi, out of curiosity.

- Heikki



Re: speed up verifying UTF-8

From
John Naylor
Date:
On Mon, Jun 7, 2021 at 8:24 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 03/06/2021 21:58, John Naylor wrote:
> > The microbenchmark is the same one you attached to [1], which I extended
> > with a 95% multibyte case.
>
> Could you share the exact test you're using? I'd like to test this on my
> old raspberry pi, out of curiosity.

Sure, attached.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment

Re: speed up verifying UTF-8

From
Heikki Linnakangas
Date:
On 07/06/2021 15:39, John Naylor wrote:
> On Mon, Jun 7, 2021 at 8:24 AM Heikki Linnakangas <hlinnaka@iki.fi 
> <mailto:hlinnaka@iki.fi>> wrote:
>  >
>  > On 03/06/2021 21:58, John Naylor wrote:
>  > > The microbenchmark is the same one you attached to [1], which I 
> extended
>  > > with a 95% multibyte case.
>  >
>  > Could you share the exact test you're using? I'd like to test this on my
>  > old raspberry pi, out of curiosity.
> 
> Sure, attached.
> 
> --
> John Naylor
> EDB: http://www.enterprisedb.com <http://www.enterprisedb.com>
> 
Results from chipmunk, my first generation Raspberry Pi:

Master:

  chinese | mixed | ascii
---------+-------+-------
    25392 | 16287 | 10295
(1 row)

v11-0001-Rewrite-pg_utf8_verifystr-for-speed.patch:

  chinese | mixed | ascii
---------+-------+-------
    17739 | 10854 |  4121
(1 row)

So that's good.

What is the worst-case scenario for this algorithm? Something where the 
new fast ASCII check never helps, but which is as fast as possible with 
the old code. For that, I added a repeating pattern of '123456789012345ä' to 
the test set (these results are from my Intel laptop, not the raspberry pi):

Master:

  chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     1333 |   757 |   410 |    573
(1 row)

v11-0001-Rewrite-pg_utf8_verifystr-for-speed.patch:

  chinese | mixed | ascii | mixed2
---------+-------+-------+--------
      942 |   470 |    66 |   1249
(1 row)

So there's a regression with that input. Maybe that's acceptable, this 
is the worst case, after all. Or you could tweak check_ascii for a 
different performance tradeoff, by checking the two 64-bit words 
separately and returning "8" if the failure happens in the second word. 
And I haven't tried the SSE patch yet, maybe that compensates for this.

- Heikki



Re: speed up verifying UTF-8

From
John Naylor
Date:

On Wed, Jun 9, 2021 at 7:02 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> What is the worst case scenario for this algorithm? Something where the
> new fast ASCII check never helps, but is as fast as possible with the
> old code. For that, I added a repeating pattern of '123456789012345ä' to
> the test set (these results are from my Intel laptop, not the raspberry pi):
>
> Master:
>
>   chinese | mixed | ascii | mixed2
> ---------+-------+-------+--------
>      1333 |   757 |   410 |    573
> (1 row)
>
> v11-0001-Rewrite-pg_utf8_verifystr-for-speed.patch:
>
>   chinese | mixed | ascii | mixed2
> ---------+-------+-------+--------
>       942 |   470 |    66 |   1249
> (1 row)

I get a much smaller regression on my laptop with clang 12:

master:

 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     978 |   685 |   370 |    452

v11-0001:

 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     686 |   438 |    64 |    595

> So there's a regression with that input. Maybe that's acceptable, this
> is the worst case, after all. Or you could tweak check_ascii for a
> different performance tradeoff, by checking the two 64-bit words
> separately and returning "8" if the failure happens in the second word.

For v12 (unformatted and without 0002 rebased) I tried the following:
--
highbits_set = (half1) & UINT64CONST(0x8080808080808080);
if (highbits_set)
     return 0;

x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
x1 &= UINT64CONST(0x8080808080808080);
if (x1 != UINT64CONST(0x8080808080808080))
     return 0;

/* now we know we have at least 8 bytes of valid ascii, so if either of the following tests fails, return that much */

highbits_set = (half2) & UINT64CONST(0x8080808080808080);
if (highbits_set)
     return sizeof(uint64);

x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
x2 &= UINT64CONST(0x8080808080808080);
if (x2 != UINT64CONST(0x8080808080808080))
     return sizeof(uint64);

return 2 * sizeof(uint64);
--
and got this:

 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     674 |   499 |   170 |    421

Pure ascii is significantly slower, but the regression is gone.

I used the string repeat('123456789012345ä', 3647) to match the ~62000 bytes in the other strings (62000 / 17 ≈ 3647).

> And I haven't tried the SSE patch yet, maybe that compensates for this.

I would expect this case to be identical to all-multibyte. The worst case for SSE might be alternating 16-byte chunks of ascii-only and chunks of multibyte, since that's one of the few places it branches. In simdjson, they check ascii 64 bytes at a time ((c1 | c2) | (c3 | c4)) and check only the previous block's "chunk 4" for incomplete sequences at the end. It's a bit messier, so I haven't done it, but it's an option.
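As an illustration of that combine-then-branch idea in scalar form (uint64 loads standing in for simdjson's SSE vectors; the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Scalar sketch of simdjson's ((c1 | c2) | (c3 | c4)) ascii test:
 * OR the chunks together and branch only once per 32-byte block,
 * since a high bit anywhere survives the ORs.
 */
static int
is_ascii_32(const unsigned char *s)
{
	uint64_t	c1, c2, c3, c4;

	memcpy(&c1, s, 8);
	memcpy(&c2, s + 8, 8);
	memcpy(&c3, s + 16, 8);
	memcpy(&c4, s + 24, 8);

	return (((c1 | c2) | (c3 | c4)) &
			UINT64_C(0x8080808080808080)) == 0;
}
```

The per-block branch count is what matters here: one test per block instead of one per chunk.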

Also, if SSE is accepted into the tree, then the C fallback is only important on platforms like PowerPC64 and Arm64, so we can make the tradeoff by testing those more carefully. I'll test on PowerPC soon.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
John Naylor
Date:
I wrote:

> Also, if SSE is accepted into the tree, then the C fallback is only important on platforms like PowerPC64 and Arm64, so we can make the tradeoff by testing those more carefully. I'll test on PowerPC soon.

I got around to testing on POWER8 / Linux / gcc 4.8.5 and found a regression in the mixed2 case in v11. v12 improves that, at the cost of giving back some of the ascii-case improvement (5x over master, down from 8x).

master:
 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
    2966 |  1525 |   871 |   1474

v11-0001:
 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
    1030 |   644 |   102 |   1760

v12-0001:
 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     977 |   632 |   168 |   1113

--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
John Naylor
Date:
I still wasn't quite happy with the churn in the regression tests, so for v13 I gave up on using both the existing utf8 table and my new one for the "padded input" tests, and instead just copied the NUL byte test into the new table. Also added a primary key to make sure the padded test won't give weird results if a new entry has a duplicate description.

I came up with "highbit_carry" as a more descriptive variable name than "x", but that doesn't matter a whole lot.

It also occurred to me that if we're going to check one 8-byte chunk at a time (like v12 does), maybe it's only worth loading 8 bytes at a time. An earlier version did this, but without the recent tweaks. The worst-case scenario now might be different from the one with 16 bytes, but for now I just tested the previous worst case (mixed2). I only tested on ppc64le, since I'm hoping x86 will get the SIMD algorithm (I'm holding off rebasing 0002 until 0001 settles down).

Power8, Linux, gcc 4.8

master:
 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
    2952 |  1520 |   871 |   1473

v11:
 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
    1015 |   641 |   102 |   1636

v12:
 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     964 |   629 |   168 |   1069

v13:
 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     954 |   643 |   202 |   1046

v13 is not that much different from v12, but has the nice property of simpler code. Neither is as nice as v11 for ascii, but they don't regress on v11's worst case. I'm leaning towards v13 for the fallback.

--
Attachment

Re: speed up verifying UTF-8

From
Heikki Linnakangas
Date:
On 29/06/2021 14:20, John Naylor wrote:
> I still wasn't quite happy with the churn in the regression tests, so 
> for v13 I gave up on using both the existing utf8 table and my new one 
> for the "padded input" tests, and instead just copied the NUL byte test 
> into the new table. Also added a primary key to make sure the padded 
> test won't give weird results if a new entry has a duplicate description.
> 
> I came up with "highbit_carry" as a more descriptive variable name than 
> "x", but that doesn't matter a whole lot.
> 
> It also occurred to me that if we're going to check one 8-byte chunk at 
> a time (like v12 does), maybe it's only worth it to load 8 bytes at a 
> time. An earlier version did this, but without the recent tweaks. The 
> worst-case scenario now might be different from the one with 16-bytes, 
> but for now just tested the previous worst case (mixed2).

I tested the new worst case scenario on my laptop:

gcc master:

  chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     1311 |   758 |   405 |     583 |    725


gcc v13:

  chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
      956 |   472 |   160 |     572 |    939


mixed16 is the same as "mixed2" in the previous rounds, with 
'123456789012345ä' as the repeating string, and mixed8 uses '1234567ä', 
which I believe is the worst case for patch v13. So v13 is somewhat 
slower than master in the worst case.

Hmm, there's one more simple trick we can do: We can have a separate 
fast-path version of the loop when there are at least 8 bytes of input 
left, skipping all the length checks. With that:

gcc v14:
  chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
      737 |   412 |    94 |     476 |    725


All the above numbers were with gcc 10.2.1. For completeness, with clang 
11.0.1-2 I got:

clang master:
  chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     1044 |   724 |   403 |     930 |    603
(1 row)

clang v13:
  chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
      596 |   445 |    79 |     417 |    715
(1 row)


clang v14:
  chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
      600 |   337 |    93 |     318 |    511

Attached is patch v14 with that optimization. It needs some cleanup, I 
just hacked it up quickly for performance testing.

- Heikki

Attachment

Re: speed up verifying UTF-8

From
John Naylor
Date:
On Wed, Jun 30, 2021 at 7:18 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> Hmm, there's one more simple trick we can do: We can have a separate
> fast-path version of the loop when there are at least 8 bytes of input
> left, skipping all the length checks. With that:

Good idea, and the numbers look good on Power8 / gcc 4.8 as well:

master:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2951 |  1521 |   871 |    1473 |   1508

v13:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     949 |   642 |   203 |    1046 |   1818

v14:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     887 |   607 |   179 |     776 |   1325


I don't think the new structuring will pose any challenges for rebasing 0002, either. This might need some experimentation, though:

+ * Subroutine of pg_utf8_verifystr() to check one char. Returns the length of the
+ * character at *s in bytes, or 0 on invalid input or premature end of input.
+ *
+ * XXX: could this be combined with pg_utf8_verifychar above?
+ */
+static inline int
+pg_utf8_verify_one(const unsigned char *s, int len)

It seems like it would be easy to have pg_utf8_verify_one in my proposed pg_utf8.h header and replace the body of pg_utf8_verifychar with it.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
John Naylor
Date:
I wrote:

> I don't think the new structuring will pose any challenges for rebasing 0002, either. This might need some experimentation, though:
>
> + * Subroutine of pg_utf8_verifystr() to check one char. Returns the length of the
> + * character at *s in bytes, or 0 on invalid input or premature end of input.
> + *
> + * XXX: could this be combined with pg_utf8_verifychar above?
> + */
> +static inline int
> +pg_utf8_verify_one(const unsigned char *s, int len)
>
> It seems like it would be easy to have pg_utf8_verify_one in my proposed pg_utf8.h header and replace the body of pg_utf8_verifychar with it.

0001: I went ahead and tried this for v15, and also attempted some clean-up:

- Rename pg_utf8_verify_one to pg_utf8_verifychar_internal.
- Have pg_utf8_verifychar_internal return -1 for invalid input to match other functions in the file. We could also do this for check_ascii, but it's not quite the same thing, because the string could still have valid bytes in it, just not enough to advance the pointer by the stride length.
- Remove hard-coded numbers (not wedded to this).

- Use a call to pg_utf8_verifychar in the slow path.
- Reduce pg_utf8_verifychar to thin wrapper around pg_utf8_verifychar_internal.

The last two aren't strictly necessary, but it prevents bloating the binary in the slow path, and aids readability. For 0002, this required putting pg_utf8_verifychar* in src/port. (While writing this I noticed I neglected to explain that with a comment, though)

Feedback welcome on any of the above.

Since by now it hardly resembles the simdjson (or Fuchsia for that matter) fallback that it took inspiration from, I've removed that mention from the commit message.

0002: Just a rebase to work with the above. One possible review point: We don't really need to have separate control over whether to use special instructions for CRC and UTF-8. It should probably be just one configure knob, but having them separate is perhaps easier to review.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
Amit Khandekar
Date:
On Tue, 13 Jul 2021 at 01:15, John Naylor <john.naylor@enterprisedb.com> wrote:
> > It seems like it would be easy to have pg_utf8_verify_one in my proposed pg_utf8.h header and replace the body of pg_utf8_verifychar with it.
>
> 0001: I went ahead and tried this for v15, and also attempted some clean-up:
>
> - Rename pg_utf8_verify_one to pg_utf8_verifychar_internal.
> - Have pg_utf8_verifychar_internal return -1 for invalid input to match other functions in the file. We could also do this for check_ascii, but it's not quite the same thing, because the string could still have valid bytes in it, just not enough to advance the pointer by the stride length.
> - Remove hard-coded numbers (not wedded to this).
>
> - Use a call to pg_utf8_verifychar in the slow path.
> - Reduce pg_utf8_verifychar to thin wrapper around pg_utf8_verifychar_internal.

- check_ascii() seems to be used only for 64-bit chunks. So why not
remove the len argument and the len <= sizeof(int64) checks inside the
function. We can rename it to check_ascii64() for clarity.
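A 64-bit-only version along these lines might look like the following (my sketch, based on the carry trick in the fallback patch; like check_ascii, it also rejects embedded zero bytes):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of a 64-bit-only check_ascii ("check_ascii64"): true iff all
 * eight bytes are ASCII and none is a zero byte. The zero-byte test uses
 * the carry trick: adding 0x7F to a byte carries into its high bit
 * exactly when the byte is nonzero.
 */
static bool
check_ascii64(uint64_t chunk)
{
	if (chunk & UINT64_C(0x8080808080808080))
		return false;			/* some byte has its high bit set */

	if (((chunk + UINT64_C(0x7F7F7F7F7F7F7F7F)) &
		 UINT64_C(0x8080808080808080)) != UINT64_C(0x8080808080808080))
		return false;			/* some byte is zero */

	return true;
}
```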

- I was thinking, why not have a pg_utf8_verify64() that processes
64-bit chunks (or a 32-bit version). In check_ascii(), we anyway
extract a 64-bit chunk from the string. We can use the same chunk to
extract the required bits from a two byte char or a 4 byte char. This
way we can avoid extraction of separate bytes like b1 = *s; b2 = s[1]
etc. More importantly, we can avoid the separate continuation-char
checks for each individual byte. Additionally, we can try to simplify
the subsequent overlong or surrogate char checks. Something like this
:

int pg_utf8_verifychar_32(uint32 chunk)
{
    int        len, l;

    /* l is the byte length of the character at the top of the chunk */
    for (len = sizeof(chunk); len > 0; len -= l, chunk <<= l * BITS_PER_BYTE)
    {
        /* Is 2-byte lead: 110x.xxxx */
        if ((chunk & 0xE0000000) == 0xC0000000)
        {
            l = 2;
            /* .......  .......  */
        }
        /* Is 3-byte lead: 1110.xxxx */
        else if ((chunk & 0xF0000000) == 0xE0000000)
        {
            l = 3;
            if (len < l)
                break;

            /* b2 and b3 should be continuation bytes */
            if ((chunk & 0x00C0C000) != 0x00808000)
                return sizeof(chunk) - len;

            switch (chunk & 0xFF200000)
            {
                /*
                 * check 3-byte overlong: 1110.0000 1001.xxxx 10xx.xxxx, i.e.
                 * (b1 == 0xE0 && b2 < 0xA0). We already know b2 is of the
                 * form 10xx.xxxx since it's a continuation char. The
                 * additional condition b2 <= 0x9F means it is of the form
                 * 100x.xxxx, i.e. either 1000.xxxx or 1001.xxxx. So just
                 * verify that it is xx0x.xxxx.
                 */
                case 0xE0000000:
                    return sizeof(chunk) - len;

                /*
                 * check surrogate: 1110.1101 101x.xxxx 10xx.xxxx, i.e.
                 * (b1 == 0xED && b2 > 0x9F). Here, > 0x9F means either
                 * 1010.xxxx, 1011.xxxx, 1100.xxxx, or 1110.xxxx. The last
                 * two are not possible because b2 is a continuation char,
                 * so it has to be one of the first two. So just verify
                 * that it is xx1x.xxxx.
                 */
                case 0xED200000:
                    return sizeof(chunk) - len;
                default:
                    ;
            }
        }
        /* Is 4-byte lead: 1111.0xxx */
        else if ((chunk & 0xF8000000) == 0xF0000000)
        {
            /* .........  */
            l = 4;
        }
        else
            return sizeof(chunk) - len;
    }
    return sizeof(chunk) - len;
}



Re: speed up verifying UTF-8

From
John Naylor
Date:
On Thu, Jul 15, 2021 at 1:10 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

> - check_ascii() seems to be used only for 64-bit chunks. So why not
> remove the len argument and the len <= sizeof(int64) checks inside the
> function. We can rename it to check_ascii64() for clarity.

Thanks for taking a look!

Well yes, but there's nothing so intrinsic to 64 bits that the name needs to reflect that. Earlier versions worked on 16 bytes at a time. The compiler will optimize away the len check, but we could replace it with an assert instead.

> - I was thinking, why not have a pg_utf8_verify64() that processes
> 64-bit chunks (or a 32-bit version). In check_ascii(), we anyway
> extract a 64-bit chunk from the string. We can use the same chunk to
> extract the required bits from a two byte char or a 4 byte char. This
> way we can avoid extraction of separate bytes like b1 = *s; b2 = s[1]
> etc.

Loading bytes from L1 is really fast -- I wouldn't even call it "extraction".

> More importantly, we can avoid the separate continuation-char
> checks for each individual byte.

On a pipelined superscalar CPU, I wouldn't expect it to matter in the slightest.

> Additionally, we can try to simplify
> the subsequent overlong or surrogate char checks. Something like this

My recent experience with itemptrs has made me skeptical of this kind of thing, but the idea was interesting enough that I couldn't resist trying it out. I have two attempts, which are attached as v16*.txt and apply independently. They are rough, and some comments are now lies. To simplify the constants, I do shift down to uint32, and I didn't bother working around that. v16alpha regressed on worst-case input, so for v16beta I went back to earlier coding for the one-byte ascii check. That helped, but it's still slower than v14.

That was not unexpected, but I was mildly shocked to find out that v15 is also slower than the v14 that Heikki posted. The only non-cosmetic difference is using pg_utf8_verifychar_internal within pg_utf8_verifychar. I'm not sure why it would make such a big difference here. The numbers on Power8 / gcc 4.8 (little endian):

HEAD:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2951 |  1521 |   871 |    1474 |   1508

v14:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     885 |   607 |   179 |     774 |   1325

v15:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1085 |   671 |   180 |    1032 |   1799

v16alpha:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1268 |   822 |   180 |    1410 |   2518

v16beta:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1096 |   654 |   182 |     814 |   1403


As it stands now, for v17 I'm inclined to go back to v15, but without the attempt at being clever that seems to have slowed it down from v14.

Any interest in testing on 64-bit Arm?

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
John Naylor
Date:
I wrote:

> To simplify the constants, I do shift down to uint32, and I didn't bother working around that. v16alpha regressed on worst-case input, so for v16beta I went back to earlier coding for the one-byte ascii check. That helped, but it's still slower than v14.

It occurred to me that I could rewrite the switch test into simple comparisons, like I already had for the 2- and 4-byte lead cases. While at it, I folded the leading byte and continuation tests into a single operation, like this:

/* 3-byte lead with two continuation bytes */
else if ((chunk & 0xF0C0C00000000000) == 0xE080800000000000)

...and also tried using 64-bit constants to avoid shifting. Still didn't quite beat v14, but got pretty close:

> The numbers on Power8 / gcc 4.8 (little endian):
>
> HEAD:
>
>  chinese | mixed | ascii | mixed16 | mixed8
> ---------+-------+-------+---------+--------
>     2951 |  1521 |   871 |    1474 |   1508
>
> v14:
>
>  chinese | mixed | ascii | mixed16 | mixed8
> ---------+-------+-------+---------+--------
>      885 |   607 |   179 |     774 |   1325

v16gamma:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     952 |   632 |   180 |     800 |   1333

A big-endian 64-bit platform just might shave enough cycles to beat v14 this way... or not.
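The fused test generalizes to the other sequence lengths. For illustration, here are my guesses at the full set of masks (not from the patch; the chunk holds the next 8 input bytes with the sequence starting at the most significant byte, and the overlong/surrogate checks are omitted):

```c
#include <stdint.h>

/*
 * Sketch: return the claimed length of the multibyte sequence at the top
 * of the chunk, folding lead-byte and continuation tests into a single
 * mask-and-compare per length. Returns 0 for ASCII or invalid input.
 * Overlong and surrogate checks are intentionally omitted.
 */
static int
lead_len_fused(uint64_t chunk)
{
	/* 2-byte lead 110x.xxxx + one continuation 10xx.xxxx */
	if ((chunk & UINT64_C(0xE0C0000000000000)) ==
		UINT64_C(0xC080000000000000))
		return 2;
	/* 3-byte lead 1110.xxxx + two continuations */
	if ((chunk & UINT64_C(0xF0C0C00000000000)) ==
		UINT64_C(0xE080800000000000))
		return 3;
	/* 4-byte lead 1111.0xxx + three continuations */
	if ((chunk & UINT64_C(0xF8C0C0C000000000)) ==
		UINT64_C(0xF080808000000000))
		return 4;
	return 0;
}
```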

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
Vladimir Sitnikov
Date:
Have you considered shift-based DFA for a portable implementation https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 ?
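For readers new to the technique, here is a toy two-state version (my illustration of the mechanism, not the UTF-8 table from the gist): each input class packs all of its state transitions into one 64-bit word, so one DFA step is just a table load plus a variable shift, with no branches on the state.

```c
#include <stdint.h>

/*
 * Toy shift-based DFA: states are encoded as bit offsets (S0 = 0, S1 = 6).
 * This machine is in state S1 exactly when the last byte seen was 'b'.
 */
enum
{
	S0 = 0,
	S1 = 6,
	DFA_MASK = 63
};

static int
ends_with_b(const char *s)
{
	/* row for 'a': S0 -> S0, S1 -> S0; row for 'b': S0 -> S1, S1 -> S1 */
	const uint64_t row_a = ((uint64_t) S0 << S0) | ((uint64_t) S0 << S1);
	const uint64_t row_b = ((uint64_t) S1 << S0) | ((uint64_t) S1 << S1);
	uint64_t	state = S0;

	for (; *s; s++)
	{
		uint64_t	row = (*s == 'b') ? row_b : row_a;

		/* one transition: shift this class's row by the current state */
		state = row >> (state & DFA_MASK);
	}
	return (state & DFA_MASK) == S1;
}
```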

Vladimir

Re: speed up verifying UTF-8

From
John Naylor
Date:
On Fri, Jul 16, 2021 at 1:44 AM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
>
> Have you considered shift-based DFA for a portable implementation https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 ?

I did consider some kind of DFA a while back and it was too slow.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
John Naylor
Date:
My v16 experimental patches were a bit messy, so I've organized an experimental series that applies cumulatively, to try to trace the effects of various things.

v17-0001 is the same as v14. 0002 is a stripped-down implementation of Amit's chunk idea for multibyte, and it's pretty good on x86. On Power8, not so much. 0003 and 0004 are shot-in-the-dark guesses to improve it on Power8, with some success, but end up making x86 weirdly slow, so I'm afraid that could happen on other platforms as well.

v14 still looks like the safe bet for now. It also has the advantage of using the same function both in and out of the fastpath, which will come in handy when moving it to src/port as the fallback for SSE.

Power8, gcc 4.8:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v17-0001:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     888 |   607 |   179 |     777 |   1328

v17-0002:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1017 |   718 |   156 |    1213 |   2138

v17-0003:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1205 |   662 |   180 |     767 |   1256

v17-0004:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1085 |   660 |   224 |     868 |   1369


Macbook x86, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v17-0001:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     674 |   346 |    78 |     309 |    504

v17-0002:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     516 |   324 |    78 |     331 |    544

v17-0003:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     621 |   537 |   323 |     413 |    602

v17-0004:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     576 |   439 |   154 |     557 |    915

--

Re: speed up verifying UTF-8

From
John Naylor
Date:
I wrote:

> On Fri, Jul 16, 2021 at 1:44 AM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
> >
> > Have you considered shift-based DFA for a portable implementation https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 ?
>
> I did consider some kind of DFA a while back and it was too slow.

I took a closer look at this "shift-based DFA", and it seemed pretty straightforward to implement this on top of my DFA attempt from some months ago. The DFA technique is not a great fit with our API, since we need to return how many bytes we found valid. On x86 (not our target for the fallback, but convenient to test) all my attempts were either worse than HEAD in multiple cases, or showed no improvement for the important ASCII case. On Power8, it's more compelling, and competitive with v14, so I'll characterize it on that platform as I describe the patch series:

0001 is a pure DFA, and has decent performance on multibyte, but terrible on ascii.
0002 dispatches on the leading byte category, unrolls the DFA loop according to how many valid bytes we need, and only checks the DFA state afterwards. It's good on multibyte (3-byte, at least) but still terrible on ascii.
0003 adds a 1-byte ascii fast path -- while robust on all inputs, it still regresses a bit on ascii.
0004 uses the same 8-byte ascii check as previous patches do.
0005 and 0006 use combinations of 1- and 8-byte ascii checks similar to in v17.

0005 seems the best on Power8, and is very close to v14. FWIW, v14's measurements seem lucky and fragile -- if I change any little thing, even

- return -1;
+ return 0;

it easily loses 100-200ms on non-pure-ascii tests. That said, v14 still seems the logical choice, unless there is some further tweak on top of v17 or v18 that gives some non-x86 platform a significant boost.

Power8, gcc 4.8:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v18-0001:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1257 |  1681 |  1385 |    1744 |   2018

v18-0002:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     951 |  1381 |  1217 |    1469 |   1172

v18-0003:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     911 |  1111 |   942 |    1112 |    865

v18-0004:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     987 |   730 |   222 |    1325 |   2306

v18-0005:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     962 |   664 |   180 |     928 |   1179

v18-0006:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     908 |   663 |   244 |    1026 |   1464

and for comparison,

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     888 |   607 |   179 |     777 |   1328

v17-0003:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1205 |   662 |   180 |     767 |   1256


Macbook, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v18-0001:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1334 |  2713 |  2802 |    2665 |   2541

v18-0002:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     733 |  1212 |  1064 |    1034 |   1007

v18-0003:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     653 |   560 |   370 |     420 |    465

v18-0004:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     574 |   402 |    88 |     584 |   1033

v18-0005:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1345 |   730 |   334 |     578 |    909

v18-0006:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     674 |   485 |   153 |     594 |    989

and for comparison,

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     674 |   346 |    78 |     309 |    504

v17-0002:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     516 |   324 |    78 |     331 |    544

--
Attachment

Re: speed up verifying UTF-8

From
Amit Khandekar
Date:
On Sat, 17 Jul 2021 at 04:48, John Naylor <john.naylor@enterprisedb.com> wrote:
> v17-0001 is the same as v14. 0002 is a stripped-down implementation of Amit's
> chunk idea for multibyte, and it's pretty good on x86. On Power8, not so
> much. 0003 and 0004 are shot-in-the-dark guesses to improve it on Power8,
> with some success, but end up making x86 weirdly slow, so I'm afraid that
> could happen on other platforms as well.

Thanks for trying the chunk approach. I tested your v17 versions on
Arm64. For the chinese characters, v17-0002 gave some improvement over
v14. But for all the other character sets, there was around 10%
degradation w.r.t. v14. I thought maybe the hton64 call and memcpy()
for each mb character might be the culprit, so I tried iterating over
all the characters in the chunk within the same pg_utf8_verify_one()
function by left-shifting the bits. But that worsened the figures. So
I gave up that idea.

Here are the numbers on Arm64 :

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1781 |  1095 |   628 |     944 |   1151

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     852 |   484 |   144 |     584 |    971


v17-0001+2:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     731 |   520 |   152 |     645 |   1118


Haven't looked at your v18 patch set yet.



Re: speed up verifying UTF-8

From
Vladimir Sitnikov
Date:
Thank you,

It looks like it is important to have shrx on x86, which appears only when -march=x86-64-v3 is used (see https://github.com/golang/go/issues/47120#issuecomment-877629712 ).
Just in case: I know x86 would not use the fallback implementation; however, the sole purpose of shift-based DFA is to fold all the data-dependent ops into a single instruction.

An alternative idea: should we optimize for validation of **valid** inputs rather than optimizing the worst case?
In other words, what if the implementation processes all characters always and uses a slower method in case of validation failure?
I would guess it is more important to be faster with accepting valid input rather than "faster to reject invalid input".

In shift-DFA approach, it would mean the validation loop would be simpler with fewer branches (see https://godbolt.org/z/hhMxhT6cf ):

static inline int
pg_is_valid_utf8(const unsigned char *s, const unsigned char *end) {
    uint64 class;
    uint64 state = BGN;
    while (s < end) { // clang unrolls the loop
        class = ByteCategory[*s++];
        state = class >> (state & DFA_MASK); // <-- note that AND is fused into the shift operation
    }
    return (state & DFA_MASK) != ERR;
}

Note: GCC does not seem to unroll "while(s<end)" loop by default, so manual unroll might be worth trying:

static inline int
pg_is_valid_utf8(const unsigned char *s, const unsigned char *end) {
    uint64 class;
    uint64 state = BGN;
    while (s + 4 <= end) {
        for(int i = 0; i < 4; i++) {
            class = ByteCategory[*s++];
            state = class >> (state & DFA_MASK);
        }
    }
    while(s < end) {
        class = ByteCategory[*s++];
        state = class >> (state & DFA_MASK);
    }
    return (state & DFA_MASK) != ERR;
}

----

static int pg_utf8_verifystr2(const unsigned char *s, int len) {
    if (pg_is_valid_utf8(s, s + len)) { // fast path: the whole string is valid, just accept it
        return len;
    }
    // slow path: the string is not valid, perform a slower analysis
    return ....;
}

Vladimir

Re: speed up verifying UTF-8

From
John Naylor
Date:
On Mon, Jul 19, 2021 at 9:43 AM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:

> It looks like it is important to have shrx for x86 which appears only when -march=x86-64-v3 is used (see https://github.com/golang/go/issues/47120#issuecomment-877629712 ).
> Just in case: I know x86 wound not use fallback implementation, however, the sole purpose of shift-based DFA is to fold all the data-dependent ops into a single instruction.

I saw mention of that instruction, but didn't understand how important it was, thanks.

> An alternative idea: should we optimize for validation of **valid** inputs rather than optimizing the worst case?
> In other words, what if the implementation processes all characters always and uses a slower method in case of validation failure?
> I would guess it is more important to be faster with accepting valid input rather than "faster to reject invalid input".

> static int pg_utf8_verifystr2(const unsigned char *s, int len) {
>     if (pg_is_valid_utf8(s, s+len)) { // fast path: if string is valid, then just accept it
>         return s + len;
>     }
>     // slow path: the string is not valid, perform a slower analysis
>     return s + ....;
> }

That might be workable. We have to be careful because in COPY FROM, validation is performed on 64kB chunks, and the boundary could fall in the middle of a multibyte sequence. In the SSE version, there is this comment:

+ /*
+ * NB: This check must be strictly greater-than, otherwise an invalid byte
+ * at the end might not get detected.
+ */
+ while (len > sizeof(__m128i))

...which should have more detail on this.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
John Naylor
Date:
> On Mon, Jul 19, 2021 at 9:43 AM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:

> > An alternative idea: should we optimize for validation of **valid** inputs rather than optimizing the worst case?
> > In other words, what if the implementation processes all characters always and uses a slower method in case of validation failure?
> > I would guess it is more important to be faster with accepting valid input rather than "faster to reject invalid input".
>
> > static int pg_utf8_verifystr2(const unsigned char *s, int len) {
> >     if (pg_is_valid_utf8(s, s+len)) { // fast path: if string is valid, then just accept it
> >         return s + len;
> >     }
> >     // slow path: the string is not valid, perform a slower analysis
> >     return s + ....;
> > }

This turned out to be a really good idea (v19 attached):

Power8, gcc 4.8:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     888 |   607 |   179 |     777 |   1328

v19:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     809 |   472 |   223 |     558 |    805

x86 Macbook, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     674 |   346 |    78 |     309 |    504

v19:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     379 |   181 |    94 |     219 |    376

Note that the branchy code's worst case (mixed8) is here the same speed as multibyte. With Vladimir's idea * , we call check_ascii only every 8 bytes of input, not every time we verify one multibyte character. Also, we only have to check the DFA state every time we loop over 8 bytes, not every time we step through the DFA. That means we have to walk backwards at the end to find the last leading byte, but the SSE code already knew how to do that, so I used that logic here in the caller, which will allow some simplification of how the SSE code returns.
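The backward walk at the end amounts to something like this (a sketch; the function name is mine, and the SSE code's version may differ in details):

```c
/*
 * Step back from s, which may point into the middle of a multibyte
 * sequence, to the nearest byte that is not a continuation byte
 * (10xx.xxxx), so the caller can re-verify from the last lead byte.
 */
static const unsigned char *
backtrack_to_lead(const unsigned char *start, const unsigned char *s)
{
	while (s > start && (*s & 0xC0) == 0x80)
		s--;
	return s;
}
```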

The state check is likely why the ascii case is slightly slower than v14. We could go back to checking ascii 16 bytes at a time, since there's little penalty for doing so.

* (Greg was thinking the same thing upthread, but I don't think the branchy code I posted at the time could have taken advantage of this)

I'm pretty confident this improvement is architecture-independent. Next month I'll clean this up and rebase the SSE patch over this.

I wrote:

> + /*
> + * NB: This check must be strictly greater-than, otherwise an invalid byte
> + * at the end might not get detected.
> + */
> + while (len > sizeof(__m128i))

Note to self: I actually think this isn't needed anymore since I changed how the SSE code deals with remainder sequences at the end.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
Thomas Munro
Date:
On Sat, Mar 13, 2021 at 4:37 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
> On Fri, Mar 12, 2021 at 9:14 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > I was not thinking about auto-vectorizing the code in
> > pg_validate_utf8_sse42(). Rather, I was considering auto-vectorization
> > inside the individual helper functions that you wrote, such as
> > _mm_setr_epi8(), shift_right(), bitwise_and(), prev1(), splat(),
>
> If the PhD holders who came up with this algorithm thought it possible to do it that way, I'm sure they would have.
> In reality, simdjson has different files for SSE4, AVX, AVX512, NEON, and Altivec. We can incorporate any of those as
> needed. That's a PG15 project, though, and I'm not volunteering.

Just for fun/experimentation, here's a quick (and probably too naive)
translation of those helper functions to NEON, on top of the v15
patch.

Attachment

Re: speed up verifying UTF-8

From
Vladimir Sitnikov
Date:
>I'm pretty confident this improvement is architecture-independent.

Thanks for testing it with different architectures.

It looks like the same utf8_advance function is good for both fast-path and for the slow path.
Then pg_utf8_verifychar could be removed altogether along with the corresponding IS_*_BYTE_LEAD macros.

Vladimir

Re: speed up verifying UTF-8

From
John Naylor
Date:

On Wed, Jul 21, 2021 at 12:13 PM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
> It looks like the same utf8_advance function is good for both fast-path and for the slow path.
> Then pg_utf8_verifychar could be removed altogether along with the corresponding IS_*_BYTE_LEAD macros.

pg_utf8_verifychar() is a public function usually called through pg_wchar_table[], so it needs to remain in any case.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:

On Wed, Jul 21, 2021 at 11:29 AM Thomas Munro <thomas.munro@gmail.com> wrote:

> Just for fun/experimentation, here's a quick (and probably too naive)
> translation of those helper functions to NEON, on top of the v15
> patch.

Neat! It's good to make it more architecture-agnostic, and I'm sure we can use quite a bit of this. I don't know enough about NEON to comment intelligently, but a quick glance through the simdjson source show a couple differences that might be worth a look:

 to_bool(const pg_u8x16_t v)
 {
+#if defined(USE_NEON)
+ return vmaxvq_u32((uint32x4_t) v) != 0;

--> return vmaxvq_u8(*this) != 0;

 vzero()
 {
+#if defined(USE_NEON)
+ return vmovq_n_u8(0);

--> return vdupq_n_u8(0); // or equivalently, splat(0)

is_highbit_set(const pg_u8x16_t v)
 {
+#if defined(USE_NEON)
+ return to_bool(bitwise_and(v, vmovq_n_u8(0x80)));

--> return vmaxq_u8(v) > 0x7F

(Technically, their convention is: is_ascii(v) { return vmaxq_u8(v) < 0x80; } , but same effect)

+#if defined(USE_NEON)
+static pg_attribute_always_inline pg_u8x16_t
+vset(uint8 v0, uint8 v1, uint8 v2, uint8 v3,
+ uint8 v4, uint8 v5, uint8 v6, uint8 v7,
+ uint8 v8, uint8 v9, uint8 v10, uint8 v11,
+ uint8 v12, uint8 v13, uint8 v14, uint8 v15)
+{
+ uint8 pg_attribute_aligned(16) values[16] = {
+ v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15
+ };
+ return vld1q_u8(values);
+}

--> They have this strange beast instead:

  // Doing a load like so end ups generating worse code.
  // uint8_t array[16] = {x1, x2, x3, x4, x5, x6, x7, x8,
  //                     x9, x10,x11,x12,x13,x14,x15,x16};
  // return vld1q_u8(array);
  uint8x16_t x{};
  // incredibly, Visual Studio does not allow x[0] = x1
  x = vsetq_lane_u8(x1, x, 0);
  x = vsetq_lane_u8(x2, x, 1);
  x = vsetq_lane_u8(x3, x, 2);
... 
  x = vsetq_lane_u8(x15, x, 14);
  x = vsetq_lane_u8(x16, x, 15);
  return x;

Since you aligned the array, that might not have the problem alluded to above, and it looks nicer.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: [POC] verifying UTF-8 using SIMD instructions

From
Thomas Munro
Date:
On Thu, Jul 22, 2021 at 6:16 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
> Neat! It's good to make it more architecture-agnostic, and I'm sure we can use quite a bit of this.

One question is whether this "one size fits all" approach will be
extensible to wider SIMD.

>  to_bool(const pg_u8x16_t v)
>  {
> +#if defined(USE_NEON)
> + return vmaxvq_u32((uint32x4_t) v) != 0;
>
> --> return vmaxvq_u8(*this) != 0;

I chose that lane width because I saw an unsubstantiated claim
somewhere that it might be faster, but I have no idea if it matters.
The u8 code looks more natural anyway.  Changed.

>  vzero()
>  {
> +#if defined(USE_NEON)
> + return vmovq_n_u8(0);
>
> --> return vdupq_n_u8(0); // or equivalently, splat(0)

I guess it doesn't make a difference which builtin you use here, but I
was influenced by the ARM manual which says the vdupq form is
generated for immediate values.

> is_highbit_set(const pg_u8x16_t v)
>  {
> +#if defined(USE_NEON)
> + return to_bool(bitwise_and(v, vmovq_n_u8(0x80)));
>
> --> return vmaxq_u8(v) > 0x7F

Ah, of course.  Much nicer!

> +#if defined(USE_NEON)
> +static pg_attribute_always_inline pg_u8x16_t
> +vset(uint8 v0, uint8 v1, uint8 v2, uint8 v3,
> + uint8 v4, uint8 v5, uint8 v6, uint8 v7,
> + uint8 v8, uint8 v9, uint8 v10, uint8 v11,
> + uint8 v12, uint8 v13, uint8 v14, uint8 v15)
> +{
> + uint8 pg_attribute_aligned(16) values[16] = {
> + v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15
> + };
> + return vld1q_u8(values);
> +}
>
> --> They have this strange beast instead:
>
>   // Doing a load like so end ups generating worse code.
>   // uint8_t array[16] = {x1, x2, x3, x4, x5, x6, x7, x8,
>   //                     x9, x10,x11,x12,x13,x14,x15,x16};
>   // return vld1q_u8(array);
>   uint8x16_t x{};
>   // incredibly, Visual Studio does not allow x[0] = x1
>   x = vsetq_lane_u8(x1, x, 0);
>   x = vsetq_lane_u8(x2, x, 1);
>   x = vsetq_lane_u8(x3, x, 2);
> ...
>   x = vsetq_lane_u8(x15, x, 14);
>   x = vsetq_lane_u8(x16, x, 15);
>   return x;
>
> Since you aligned the array, that might not have the problem alluded to above, and it looks nicer.

Strange indeed.  We should probably poke around in the assembler and
see... it might be that MSVC doesn't like it, and I was just
cargo-culting the alignment.  I don't expect the generated code to
really "load" anything of course, it should ideally be some kind of
immediate mov...

FWIW here are some performance results from my humble RPI4:

master:

 chinese | mixed | ascii
---------+-------+-------
    4172 |  2763 |  1823
(1 row)

Your v15 patch:

 chinese | mixed | ascii
---------+-------+-------
    2267 |  1248 |   399
(1 row)

Your v15 patch set + the NEON patch, configured with USE_UTF8_SIMD=1:

 chinese | mixed | ascii
---------+-------+-------
     909 |   620 |   318
(1 row)

It's so good I wonder if it's producing incorrect results :-)

I also tried to do a quick and dirty AltiVec patch to see if it could
fit into the same code "shape", with less immediate success: it works
out slower than the fallback code on the POWER7 machine I scrounged an
account on.  I'm not sure what's wrong there, but maybe it's a useful
start (I'm probably confused about endianness, or the encoding of
boolean vectors which may be different (is true 0x01 or 0xff, does it
matter?), or something else, and it's falling back on errors all the
time?).

Attachment

Re: [POC] verifying UTF-8 using SIMD instructions

From
John Naylor
Date:

On Wed, Jul 21, 2021 at 8:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Jul 22, 2021 at 6:16 AM John Naylor

> One question is whether this "one size fits all" approach will be
> extensible to wider SIMD.

Sure, it'll just take a little more work and complexity. For one, 16-byte SIMD can operate on 32-byte chunks with a bit of repetition:

-       __m128i         input;
+       __m128i         input1;
+       __m128i         input2;

-#define SIMD_STRIDE_LENGTH (sizeof(__m128i))
+#define SIMD_STRIDE_LENGTH 32

        while (len >= SIMD_STRIDE_LENGTH)
        {
-               input = vload(s);
+               input1 = vload(s);
+               input2 = vload(s + sizeof(input1));

-               check_for_zeros(input, &error);
+               check_for_zeros(input1, &error);
+               check_for_zeros(input2, &error);

                /*
                 * If the chunk is all ASCII, we can skip the full UTF-8 check, but we
@@ -460,17 +463,18 @@ pg_validate_utf8_sse42(const unsigned char *s, int len)
                 * sequences at the end. We only update prev_incomplete if the chunk
                 * contains non-ASCII, since the error is cumulative.
                 */
-               if (is_highbit_set(input))
+               if (is_highbit_set(bitwise_or(input1, input2)))
                {
-                       check_utf8_bytes(prev, input, &error);
-                       prev_incomplete = is_incomplete(input);
+                       check_utf8_bytes(prev, input1, &error);
+                       check_utf8_bytes(input1, input2, &error);
+                       prev_incomplete = is_incomplete(input2);
                }
                else
                {
                        error = bitwise_or(error, prev_incomplete);
                }

-               prev = input;
+               prev = input2;
                s += SIMD_STRIDE_LENGTH;
                len -= SIMD_STRIDE_LENGTH;
        }

So with a few #ifdefs, we can accommodate two sizes if we like. 

For another, the prevN() functions would need to change, at least on x86 -- that would require replacing _mm_alignr_epi8() with _mm256_alignr_epi8() plus _mm256_permute2x128_si256(). Also, we might have to do something with the vector typedef.

That said, I think we can punt on that until we have an application that's much more compute-intensive. As it is with SSE4, COPY FROM WHERE <selective predicate> already pushes the utf8 validation way down in profiles.

> FWIW here are some performance results from my humble RPI4:
>
> master:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>     4172 |  2763 |  1823
> (1 row)
>
> Your v15 patch:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>     2267 |  1248 |   399
> (1 row)
>
> Your v15 patch set + the NEON patch, configured with USE_UTF8_SIMD=1:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>      909 |   620 |   318
> (1 row)
>
> It's so good I wonder if it's producing incorrect results :-)

Nice! If it passes regression tests, it *should* be fine, but stress testing would be welcome on any platform.

> I also tried to do a quick and dirty AltiVec patch to see if it could
> fit into the same code "shape", with less immediate success: it works
> out slower than the fallback code on the POWER7 machine I scrounged an
> account on.  I'm not sure what's wrong there, but maybe it's a useful
> start (I'm probably confused about endianness, or the encoding of
> boolean vectors which may be different (is true 0x01 or 0xff, does it
> matter?), or something else, and it's falling back on errors all the
> time?).

Hmm, I have access to a power8 machine to play with, but I also don't mind having some type of server-class hardware that relies on the recent nifty DFA fallback, which performs even better on powerpc64le than v15.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
John Naylor
Date:
Attached is v20, which has a number of improvements:

1. Cleaned up and explained DFA coding.
2. Adjusted check_ascii to return bool (now called is_valid_ascii) and to produce an optimized loop, using branch-free accumulators. That way, it doesn't need to be rewritten for different input lengths. I also think it's a bit easier to understand this way.
3. Put SSE helper functions in their own file.
4. Mostly-cosmetic edits to the configure detection.
5. Draft commit message.

With #2 above in place, I wanted to try different strides for the DFA, so more measurements (hopefully not much more of these):

Power8, gcc 4.8

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v20, 8-byte stride:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1189 |   550 |   246 |     600 |    936

v20, 16-byte stride (in the actual patch):
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     981 |   440 |   134 |     791 |    820

v20, 32-byte stride:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     857 |   481 |   141 |     834 |    839

Based on the above, I decided that 16 bytes had the best overall balance. Other platforms may differ, but I don't expect it to make a huge amount of difference.

Just for fun, I was also a bit curious about what Vladimir mentioned upthread about x86-64-v3 offering a different shift instruction. Somehow, clang 12 refused to build with that target, even though the release notes say it can, but gcc 11 was fine:

x86 Macbook, gcc 11, USE_FALLBACK_UTF8=1:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1200 |   728 |   370 |     544 |    637

v20:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     459 |   243 |    77 |     424 |    440

v20, CFLAGS="-march=x86-64-v3 -O2" :
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     390 |   215 |    77 |     303 |    323

And, gcc does generate the desired shift here:

objdump -S src/port/pg_utf8_fallback.o | grep shrx
      53: c4 e2 eb f7 d1               shrxq %rdx, %rcx, %rdx

While it looks good, clang can do about as well by simply unrolling all 16 shifts in the loop, which gcc won't do. To be clear, it's irrelevant, since x86-64-v3 includes AVX2, and if we had that we would just use it with the SIMD algorithm.

Macbook x86, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v20, USE_FALLBACK_UTF8=1:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     351 |   172 |    88 |     349 |    350

v20, with SSE4:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     142 |    92 |    59 |     141 |    141

I'm pretty happy with the patch at this point.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
Vladimir Sitnikov
Date:
Just wondering, do you have the code in a GitHub/Gitlab branch?

>+ utf8_advance(s, state, len);
>+
>+ /*
>+ * If we saw an error during the loop, let the caller handle it. We treat
>+ * all other states as success.
>+ */
>+ if (state == ERR)
>+ return 0;

Did you mean state = utf8_advance(s, state, len); there? (reassign state variable)

>I wanted to try different strides for the DFA

Does that (and "len >= 32" condition) mean the patch does not improve validation of the shorter strings (the ones less than 32 bytes)?
It would probably be nice to cover them as well (e.g. with 4 or 8-byte strides)

Vladimir

Re: speed up verifying UTF-8

From
John Naylor
Date:

On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
>
> Just wondering, do you have the code in a GitHub/Gitlab branch?
>
> >+ utf8_advance(s, state, len);
> >+
> >+ /*
> >+ * If we saw an error during the loop, let the caller handle it. We treat
> >+ * all other states as success.
> >+ */
> >+ if (state == ERR)
> >+ return 0;
>
> Did you mean state = utf8_advance(s, state, len); there? (reassign state variable)

Yep, that's a bug, thanks for catching!

> >I wanted to try different strides for the DFA
>
> Does that (and "len >= 32" condition) mean the patch does not improve validation of the shorter strings (the ones less than 32 bytes)?

Right. Also, the 32 byte threshold was just a temporary need for testing 32-byte stride -- testing different thresholds wouldn't hurt.  I'm not terribly concerned about short strings, though, as long as we don't regress.  That said, Heikki had something in his v14 [1] that we could use:

+/*
+ * Subroutine of pg_utf8_verifystr() to check one char. Returns the length of the
+ * character at *s in bytes, or 0 on invalid input or premature end of input.
+ *
+ * XXX: could this be combined with pg_utf8_verifychar above?
+ */
+static inline int
+pg_utf8_verify_one(const unsigned char *s, int len)

It would be easy to replace pg_utf8_verifychar with this. It might even speed up the SQL function length_in_encoding() -- that would be a better reason to do it.

[1] https://www.postgresql.org/message-id/2f95e70d-4623-87d4-9f24-ca534155f179%40iki.fi
--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
John Naylor
Date:

On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
>
> Just wondering, do you have the code in a GitHub/Gitlab branch?

Sorry, I didn't see this earlier. No, I don't.
--
John Naylor
EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

From
John Naylor
Date:

I wrote:

> On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
> >
> > >+ utf8_advance(s, state, len);
> > >+
> > >+ /*
> > >+ * If we saw an error during the loop, let the caller handle it. We treat
> > >+ * all other states as success.
> > >+ */
> > >+ if (state == ERR)
> > >+ return 0;
> >
> > Did you mean state = utf8_advance(s, state, len); there? (reassign state variable)
>
> Yep, that's a bug, thanks for catching!

Fixed in v21, with a regression test added. Also, utf8_advance() now directly changes state by a passed pointer rather than returning a value. Some cosmetic changes:

s/valid_bytes/non_error_bytes/ since the former is kind of misleading now.

Some other var name and symbol changes. In my first DFA experiment, ASC conflicted with the parser or scanner somehow, but it doesn't here, so it's clearer to use this.

Rewrote a lot of comments about the state machine and regression tests.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
John Naylor
Date:

On Mon, Jul 26, 2021 at 8:56 AM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> >
> > Does that (and "len >= 32" condition) mean the patch does not improve validation of the shorter strings (the ones less than 32 bytes)?
>
> Right. Also, the 32 byte threshold was just a temporary need for testing 32-byte stride -- testing different thresholds wouldn't hurt.  I'm not terribly concerned about short strings, though, as long as we don't regress.  

I put together the attached quick test to try to rationalize the fast-path threshold. (In case it isn't obvious, it must be at least 16 on all builds, since wchar.c doesn't know which implementation it's calling, and SSE register width sets the lower bound.) I changed the threshold first to 16, and then 100000, which will force using the byte-at-a-time code.

If we have only 16 bytes in the input, it still seems to be faster to use SSE, even though it's called through a function pointer on x86. I didn't test the DFA path, but I don't think the conclusion would be different. I'll include the 16 threshold next time I need to update the patch.

Macbook x86, clang 12:

master + use 16:
 asc16 | asc32 | asc64 | mb16 | mb32 | mb64
-------+-------+-------+------+------+------
   270 |   279 |   282 |  291 |  296 |  304

force byte-at-a-time:
 asc16 | asc32 | asc64 | mb16 | mb32 | mb64
-------+-------+-------+------+------+------
   277 |   292 |   310 |  296 |  317 |  362

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
John Naylor
Date:
I wrote:
> If we have only 16 bytes in the input, it still seems to be faster to use SSE, even though it's called through a function pointer on x86. I didn't test the DFA path, but I don't think the conclusion would be different. I'll include the 16 threshold next time I need to update the patch.

v22 attached, which changes the threshold to 16, with a few other cosmetic adjustments, mostly in the comments.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
John Naylor
Date:
Naively, the shift-based DFA requires 64-bit integers to encode the transitions, but I recently came across an idea from Dougall Johnson of using the Z3 SMT solver to pack the transitions into 32-bit integers [1]. That halves the size of the transition table for free. I adapted that effort to the existing conventions in v22 and arrived at the attached python script. Running the script outputs the following:

$ python dfa-pack-pg.py
offsets: [0, 11, 16, 1, 5, 6, 20, 25, 30]
transitions:
00000000000000000000000000000000 0x0
00000000000000000101100000000000 0x5800
00000000000000001000000000000000 0x8000
00000000000000000000100000000000 0x800
00000000000000000010100000000000 0x2800
00000000000000000011000000000000 0x3000
00000000000000001010000000000000 0xa000
00000000000000001100100000000000 0xc800
00000000000000001111000000000000 0xf000
01000001000010110000000000100000 0x410b0020
00000011000010110000000000100000 0x30b0020
00000010000010110000010000100000 0x20b0420

I'll include something like the attached text file diff in the next patch. Some comments are now outdated, but this is good enough for demonstration.

[1] https://gist.github.com/dougallj/166e326de6ad4cf2c94be97a204c025f
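As an aside for readers, the shift-based mechanism itself is easy to demonstrate with a toy automaton. This is illustrative only: the states, offsets, and input classes below are invented and far simpler than the UTF-8 table above. Each state is encoded as a bit offset into a 32-bit transition word, and the next state falls out of a shift and a mask:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy shift-based DFA: accept strings of '0'/'1' containing an even number
 * of '1' bits.  States are bit offsets into a 32-bit transition word; the
 * next state is (transition(byte) >> state) & 31.
 */
enum
{
	S_ERR = 0,					/* bits 0-4 of every word are zero, so ERR is absorbing */
	S_EVEN = 6,
	S_ODD = 12
};

uint32_t
transition(unsigned char c)
{
	if (c == '0')				/* parity unchanged */
		return ((uint32_t) S_EVEN << S_EVEN) | ((uint32_t) S_ODD << S_ODD);
	if (c == '1')				/* parity flips */
		return ((uint32_t) S_ODD << S_EVEN) | ((uint32_t) S_EVEN << S_ODD);
	return 0;					/* anything else -> S_ERR */
}

bool
even_ones(const char *s)
{
	uint32_t	state = S_EVEN;

	/*
	 * The word loaded depends only on the input byte, and the shift amount
	 * only on the previous state, so there are no per-byte branches to
	 * mispredict and the loads pipeline well.
	 */
	for (; *s; s++)
		state = (transition((unsigned char) *s) >> state) & 31;
	return state == S_EVEN;
}
```

Packing the UTF-8 table into 32 bits is exactly this game played with 9 states instead of 3, which is why a solver was needed to find a conflict-free set of offsets.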
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
John Naylor
Date:

I wrote:

> Naively, the shift-based DFA requires 64-bit integers to encode the transitions, but I recently came across an idea from Dougall Johnson of using the Z3 SMT solver to pack the transitions into 32-bit integers [1]. That halves the size of the transition table for free. I adapted that effort to the existing conventions in v22 and arrived at the attached python script.
> [...]
> I'll include something like the attached text file diff in the next patch. Some comments are now outdated, but this is good enough for demonstration.

Attached is v23 incorporating the 32-bit transition table, with the necessary comment adjustments.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
Vladimir Sitnikov
Date:
>Attached is v23 incorporating the 32-bit transition table, with the necessary comment adjustments

32bit table is nice.


Would you please replace https://github.com/BobSteagall/utf_utils/blob/master/src/utf_utils.cpp URL with
https://github.com/BobSteagall/utf_utils/blob/6b7a465265de2f5fa6133d653df0c9bdd73bbcf8/src/utf_utils.cpp
in the header of src/port/pg_utf8_fallback.c?

It would make the URL more stable in case the file gets renamed.

Vladimir

Re: speed up verifying UTF-8

From
John Naylor
Date:
I've decided I'm not quite comfortable with the additional complexity in the build system introduced by the SIMD portion of the previous patches. It would make more sense if the pure C portion were unchanged, but with the shift-based DFA plus the bitwise ASCII check, we have a portable implementation that's still a substantial improvement over the current validator. In v24, I've included only that much, and the diff is only about 1/3 as many lines. If future improvements to COPY FROM put additional pressure on this path, we can always add SIMD support later.

One thing not in this patch is a possible improvement to pg_utf8_verifychar() that Heikki and I worked on upthread as part of earlier attempts to rewrite pg_utf8_verifystr(). That's worth looking into separately.

On Thu, Aug 26, 2021 at 12:09 PM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
>
> >Attached is v23 incorporating the 32-bit transition table, with the necessary comment adjustments
>
> 32bit table is nice.

Thanks for taking a look!

> Would you please replace https://github.com/BobSteagall/utf_utils/blob/master/src/utf_utils.cpp URL with
> https://github.com/BobSteagall/utf_utils/blob/6b7a465265de2f5fa6133d653df0c9bdd73bbcf8/src/utf_utils.cpp
> in the header of src/port/pg_utf8_fallback.c?
>
> It would make the URL more stable in case the file gets renamed.
>
> Vladimir
>

Makes sense, so done that way.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

Re: speed up verifying UTF-8

From
John Naylor
Date:
It occurred to me that the DFA + ascii quick check approach could also
be adapted to speed up some cases where we currently walk a string
counting characters, like this snippet in
text_position_get_match_pos():

/* Convert the byte position to char position. */
while (state->refpoint < state->last_match)
{
    state->refpoint += pg_mblen(state->refpoint);
    state->refpos++;
}

This coding changed in 9556aa01c69 (Use single-byte
Boyer-Moore-Horspool search even with multibyte encodings), in which I
found the majority of cases were faster, but some were slower. It
would be nice to regain the speed lost and do even better.

In the case of UTF-8, we could just run it through the DFA,
incrementing a count of the states found. The number of END states
should be the number of characters. The ascii quick check would still
be applicable as well. I think all that is needed is to export some
symbols and add the counting function. That wouldn't materially affect
the current patch for input verification, and would be separate, but
it would be nice to get the symbol visibility right up front. I've set
this to waiting on author while I experiment with that.
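For reference, the quantity being computed is just a character count, and on already-validated UTF-8 it reduces to counting non-continuation bytes. The sketch below is hypothetical and much simpler than the proposal above: a DFA-based version would instead count arrivals at the END state, with the ascii quick check adding 8 to the count per chunk:

```c
#include <stddef.h>

/*
 * Count characters in an already-validated UTF-8 buffer.  Every byte that
 * is not a continuation byte (10xxxxxx) begins a character, so character
 * starts can be counted without decoding.
 */
size_t
utf8_char_count(const unsigned char *s, size_t len)
{
	size_t		count = 0;

	for (size_t i = 0; i < len; i++)
		count += (s[i] & 0xc0) != 0x80;
	return count;
}
```

Note this only works on input known to be valid, which is the situation in text_position_get_match_pos() where the string has already passed verification.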

--
John Naylor
EDB: http://www.enterprisedb.com



Re: speed up verifying UTF-8

From
Heikki Linnakangas
Date:
On 20/10/2021 00:42, John Naylor wrote:
> I've decided I'm not quite comfortable with the additional complexity in 
> the build system introduced by the SIMD portion of the previous patches. 
> It would make more sense if the pure C portion were unchanged, but with 
> the shift-based DFA plus the bitwise ASCII check, we have a portable 
> implementation that's still a substantial improvement over the current 
> validator. In v24, I've included only that much, and the diff is only 
> about 1/3 as many lines. If future improvements to COPY FROM put 
> additional pressure on this path, we can always add SIMD support later.

+1.

I had another look at this now. Looks good, just a few minor comments below:

> +/*
> + * Verify a chunk of bytes for valid ASCII, including a zero-byte check.
> + */
> +static inline bool
> +is_valid_ascii(const unsigned char *s, int len)
> +{
> +    uint64        chunk,
> +                highbit_cum = UINT64CONST(0),
> +                zero_cum = UINT64CONST(0x8080808080808080);
> +
> +    Assert(len % sizeof(chunk) == 0);
> +
> +    while (len >= sizeof(chunk))
> +    {
> +        memcpy(&chunk, s, sizeof(chunk));
> +
> +        /*
> +         * Capture any zero bytes in this chunk.
> +         *
> +         * First, add 0x7f to each byte. This sets the high bit in each byte,
> +         * unless it was a zero. We will check later that none of the bytes in
> +         * the chunk had the high bit set, in which case the max value each
> +         * byte can have after the addition is 0x7f + 0x7f = 0xfe, and we
> +         * don't need to worry about carrying over to the next byte.
> +         *
> +         * If any resulting high bits are zero, the corresponding high bits in
> +         * the zero accumulator will be cleared.
> +         */
> +        zero_cum &= (chunk + UINT64CONST(0x7f7f7f7f7f7f7f7f));
> +
> +        /* Capture any set bits in this chunk. */
> +        highbit_cum |= chunk;
> +
> +        s += sizeof(chunk);
> +        len -= sizeof(chunk);
> +    }

This function assumes that the input len is a multiple of 8. There's an 
assertion for that, but it would be good to also mention it in the 
function comment. It took me a moment to realize that.

Given that assumption, I wonder if "while (len >= 0)" would be marginally
faster. Or compute "s_end = s + len" first, and check for "while (s < 
s_end)", so that you don't need to update 'len' in the loop.

Also would be good to mention what exactly the return value means. I.e 
"returns false if the input contains any bytes with the high-bit set, or 
zeros".

> +    /*
> +     * Check if any high bits in the zero accumulator got cleared.
> +     *
> +     * XXX: As noted above, the zero check is only valid if the chunk had no
> +     * high bits set. However, the compiler may perform these two checks in
> +     * any order. That's okay because if any high bits were set, we would
> +     * return false regardless, so invalid results from the zero check don't
> +     * matter.
> +     */
> +    if (zero_cum != UINT64CONST(0x8080808080808080))
> +        return false;

I don't understand the "the compiler may perform these checks in any 
order" comment. We trust the compiler to do the right thing, and only 
reorder things when it's safe to do so. What is special here, why is it 
worth mentioning here?

> @@ -1721,7 +1777,7 @@ pg_gb18030_verifystr(const unsigned char *s, int len)
>      return s - start;
>  }
>  
> -static int
> +static pg_noinline int
>  pg_utf8_verifychar(const unsigned char *s, int len)
>  {
>      int            l;

Why force it to not be inlined?

> + * In a shift-based DFA, the input byte is an index into array of integers
> + * whose bit pattern encodes the state transitions. To compute the current
> + * state, we simply right-shift the integer by the current state and apply a
> + * mask. In this scheme, the address of the transition only depends on the
> + * input byte, so there is better pipelining.

Should be "To compute the *next* state, ...", I think.

The way the state transition table works is pretty inscrutable. That's 
understandable, because the values were found by an SMT solver, so I'm 
not sure if anything can be done about it.

- Heikki



RE: [EXTERNAL] Re: speed up verifying UTF-8

From
"Godfrin, Philippe E"
Date:
>-----Original Message-----
>From: Heikki Linnakangas <hlinnaka@iki.fi> 
>Sent: Friday, December 10, 2021 12:34 PM
>To: John Naylor <john.naylor@enterprisedb.com>; Vladimir Sitnikov <sitnikov.vladimir@gmail.com>
>Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; Amit Khandekar <amitdkhan.pg@gmail.com>; Thomas Munro
<thomas.munro@gmail.com>; Greg Stark <stark@mit.edu>
 
>Subject: [EXTERNAL] Re: speed up verifying UTF-8
>
>On 20/10/2021 00:42, John Naylor wrote:
>> I've decided I'm not quite comfortable with the additional complexity 
>> in the build system introduced by the SIMD portion of the previous patches.
>> It would make more sense if the pure C portion were unchanged, but 
>> with the shift-based DFA plus the bitwise ASCII check, we have a 
>> portable implementation that's still a substantial improvement over 
>> the current validator. In v24, I've included only that much, and the 
>> diff is only about 1/3 as many lines. If future improvements to COPY 
>> FROM put additional pressure on this path, we can always add SIMD support later.
>
>+1.
>
>I had another look at this now. Looks good, just a few minor comments below:
>
>> +/*
>> + * Verify a chunk of bytes for valid ASCII, including a zero-byte check.
>> + */
>> +static inline bool
>> +is_valid_ascii(const unsigned char *s, int len) {
>> +    uint64        chunk,
>> +                highbit_cum = UINT64CONST(0),
>> +                zero_cum = UINT64CONST(0x8080808080808080);
>> +
>> +    Assert(len % sizeof(chunk) == 0);
>> +
>> +    while (len >= sizeof(chunk))
>> +    {
>> +        memcpy(&chunk, s, sizeof(chunk));
>> +
>> +        /*
>> +         * Capture any zero bytes in this chunk.
>> +         *
>> +         * First, add 0x7f to each byte. This sets the high bit in each byte,
>> +         * unless it was a zero. We will check later that none of the bytes in
>> +         * the chunk had the high bit set, in which case the max value each
>> +         * byte can have after the addition is 0x7f + 0x7f = 0xfe, and we
>> +         * don't need to worry about carrying over to the next byte.
>> +         *
>> +         * If any resulting high bits are zero, the corresponding high bits in
>> +         * the zero accumulator will be cleared.
>> +         */
>> +        zero_cum &= (chunk + UINT64CONST(0x7f7f7f7f7f7f7f7f));
>> +
>> +        /* Capture any set bits in this chunk. */
>> +        highbit_cum |= chunk;
>> +
>> +        s += sizeof(chunk);
>> +        len -= sizeof(chunk);
>> +    }
>
>This function assumes that the input len is a multiple of 8. There's an assertion for that, but it would be good to also mention it in the function comment. It took me a moment to realize that.
 
>
>Given that assumption, I wonder if "while (len >= 0)" would be marginally faster. Or compute "s_end = s + len" first, and check for "while (s < s_end)", so that you don't need to update 'len' in the loop.
 
>
>Also would be good to mention what exactly the return value means. I.e. "returns false if the input contains any bytes with the high-bit set, or zeros".
 
>
>> +    /*
>> +     * Check if any high bits in the zero accumulator got cleared.
>> +     *
>> +     * XXX: As noted above, the zero check is only valid if the chunk had no
>> +     * high bits set. However, the compiler may perform these two checks in
>> +     * any order. That's okay because if any high bits were set, we would
>> +     * return false regardless, so invalid results from the zero check don't
>> +     * matter.
>> +     */
>> +    if (zero_cum != UINT64CONST(0x8080808080808080))
>> +        return false;
>
>I don't understand the "the compiler may perform these checks in any order" comment. We trust the compiler to do the right thing, and only reorder things when it's safe to do so. What is special here, why is it worth mentioning here?
 
>
>> @@ -1721,7 +1777,7 @@ pg_gb18030_verifystr(const unsigned char *s, int len)
>>      return s - start;
>>  }
>>  
>> -static int
>> +static pg_noinline int
>>  pg_utf8_verifychar(const unsigned char *s, int len)
>>  {
>>      int            l;
>
>Why force it to not be inlined?
>
>> + * In a shift-based DFA, the input byte is an index into array of integers
>> + * whose bit pattern encodes the state transitions. To compute the current
>> + * state, we simply right-shift the integer by the current state and apply a
>> + * mask. In this scheme, the address of the transition only depends on the
>> + * input byte, so there is better pipelining.
>
>Should be "To compute the *next* state, ...", I think.
>
>The way the state transition table works is pretty inscrutable. That's understandable, because the values were found by an SMT solver, so I'm not sure if anything can be done about it.
 
>
>- Heikki
>

If I remember correctly, the shift instruction is very fast...

Re: speed up verifying UTF-8

From
John Naylor
Date:
On Fri, Dec 10, 2021 at 2:33 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> I had another look at this now. Looks good, just a few minor comments below:

Thanks for reviewing! I've attached v25 to address your points.

> This function assumes that the input len is a multiple of 8. There's an
> assertion for that, but it would be good to also mention it in the
> function comment. It took me a moment to realize that.

Done.

> Given that assumption, I wonder if "while (len >= 0)" would be marginally
> faster. Or compute "s_end = s + len" first, and check for "while (s <
> s_end)", so that you don't need to update 'len' in the loop.

With two chunks, gcc 4.8.5/11.2 and clang 12 will unroll the inner
loop, so it doesn't matter:

L51:
        mov     rdx, QWORD PTR [rdi]
        mov     rsi, QWORD PTR [rdi+8]
        lea     rax, [rdx+rbx]
        lea     rbp, [rsi+rbx]
        and     rax, rbp
        and     rax, r11
        cmp     rax, r11
        jne     .L66
        or      rdx, rsi
        test    rdx, r11
        jne     .L66
        sub     r8d, 16          ; refers to "len" in the caller pg_utf8_verifystr()
        add     rdi, 16
        cmp     r8d, 15
        jg      .L51

I *think* these are the same instructions as from your version from
some time ago that handled two integers explicitly -- I rewrote it
like this to test different chunk sizes.

(Aside on 32-byte strides: Four chunks was within the noise level of
two chunks on the platform I tested. With 32 bytes, that increases the
chance that a mixed input would have non-ascii and defeat this
optimization, so should be significantly faster to make up for that.
Along those lines, in the future we could consider SSE2 (unrolled 2 x
16 bytes) for this path. Since it's part of the spec for x86-64, we
wouldn't need a runtime check -- just #ifdef it inline. And we could
piggy-back on the CRC SSE4.2 configure test for intrinsic support, so
that would avoid adding a bunch of complexity.)

That said, I think your suggestions are better on code clarity
grounds. I'm on the fence about "while(s < s_end)", so I went with
"while (len > 0)" because it matches the style in wchar.c.

> Also would be good to mention what exactly the return value means. I.e
> "returns false if the input contains any bytes with the high-bit set, or
> zeros".

Done.

> > +     /*
> > +      * Check if any high bits in the zero accumulator got cleared.
> > +      *
> > +      * XXX: As noted above, the zero check is only valid if the chunk had no
> > +      * high bits set. However, the compiler may perform these two checks in
> > +      * any order. That's okay because if any high bits were set, we would
> > +      * return false regardless, so invalid results from the zero check don't
> > +      * matter.
> > +      */
> > +     if (zero_cum != UINT64CONST(0x8080808080808080))
> > +             return false;

> I don't understand the "the compiler may perform these checks in any
> order" comment. We trust the compiler to do the right thing, and only
> reorder things when it's safe to do so. What is special here, why is it
> worth mentioning here?

Ah, that's a good question, and now that you mention it, the comment
is silly. When looking at the assembly output a while back, I was a
bit astonished that it didn't match my mental model of what was
happening, so I made this note. I've removed the whole XXX comment
here and expanded the first comment in the loop to:

/*
 * Capture any zero bytes in this chunk.
 *
 * First, add 0x7f to each byte. This sets the high bit in each byte,
 * unless it was a zero. If any resulting high bits are zero, the
 * corresponding high bits in the zero accumulator will be cleared.
 *
 * If none of the bytes in the chunk had the high bit set, the max
 * value each byte can have after the addition is 0x7f + 0x7f = 0xfe,
 * and we don't need to worry about carrying over to the next byte. If
 * any input bytes did have the high bit set, it doesn't matter
 * because we check for those separately.
 */

> > @@ -1721,7 +1777,7 @@ pg_gb18030_verifystr(const unsigned char *s, int len)
> >       return s - start;
> >  }
> >
> > -static int
> > +static pg_noinline int
> >  pg_utf8_verifychar(const unsigned char *s, int len)
> >  {
> >       int                     l;
>
> Why force it to not be inlined?

Since the only direct caller is now only using it for small inputs, I
thought about saving space, but it's not enough to matter, so I'll go
ahead and leave it out. While at it, I removed the unnecessary
"inline" declaration for utf8_advance(), since the compiler can do
that anyway.

> > + * In a shift-based DFA, the input byte is an index into array of integers
> > + * whose bit pattern encodes the state transitions. To compute the current
> > + * state, we simply right-shift the integer by the current state and apply a
> > + * mask. In this scheme, the address of the transition only depends on the
> > + * input byte, so there is better pipelining.
>
> Should be "To compute the *next* state, ...", I think.

Fixed.

> The way the state transition table works is pretty inscrutable. That's
> understandable, because the values were found by an SMT solver, so I'm
> not sure if anything can be done about it.

Do you mean in general, or just the state values?

Like any state machine, the code is simple, and the complexity is
hidden in the data. Hopefully the first link I included in the comment
is helpful.

The SMT solver was only needed to allow 32-bit (instead of 64-bit)
entries in the transition table, so it's not strictly necessary. A
lookup table that fits in 1kB is nice from a cache perspective,
however.

With 64-bit, the state values are less weird-looking but they're still
just arbitrary numbers. As long as ERR = 0 and the largest is at most
9, it doesn't matter what they are, so I'm not sure it's much less
mysterious. You can see the difference between 32-bit and 64-bit in
[1].

--
In addition to Heikki's review points, I've made a couple of small
additional changes from v24: I rewrote this part, so we don't need
these macros anymore:

-                       if (!IS_HIGHBIT_SET(*s) ||
-                               IS_UTF8_2B_LEAD(*s) ||
-                               IS_UTF8_3B_LEAD(*s) ||
-                               IS_UTF8_4B_LEAD(*s))
+                       if (!IS_HIGHBIT_SET(*s) || pg_utf_mblen(s) > 1)

And I moved is_valid_ascii() to pg_wchar.h so it can be used
elsewhere. I'm not sure there's a better place to put it. I tried
using this for text_position(), for which I'll start a new thread.

[1] https://www.postgresql.org/message-id/attachment/125672/v22-addendum-32-bit-transitions.txt



--
John Naylor
EDB: http://www.enterprisedb.com



Attachment

Re: speed up verifying UTF-8

From
John Naylor
Date:
I plan to push v25 early next week, unless there are further comments.

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: speed up verifying UTF-8

From
John Naylor
Date:
On Fri, Dec 17, 2021 at 9:29 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> I plan to push v25 early next week, unless there are further comments.

Pushed, thanks everyone!

-- 
John Naylor
EDB: http://www.enterprisedb.com