Re: [POC] verifying UTF-8 using SIMD instructions - Mailing list pgsql-hackers
From | John Naylor |
---|---|
Subject | Re: [POC] verifying UTF-8 using SIMD instructions |
Date | |
Msg-id | CAFBsxsGKo53PC06KKBkOeD4eEA=hsG-pqu7HQpKZeNBx3ubibw@mail.gmail.com Whole thread Raw |
In response to | Re: [POC] verifying UTF-8 using SIMD instructions (Heikki Linnakangas <hlinnaka@iki.fi>) |
Responses |
Re: [POC] verifying UTF-8 using SIMD instructions
|
List | pgsql-hackers |
On Mon, Feb 1, 2021 at 2:01 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 01/02/2021 19:32, John Naylor wrote:
> > It makes sense to start with the ascii subset of UTF-8 for a couple
> > reasons. First, ascii is very widespread in database content,
> > particularly in bulk loads. Second, ascii can be validated using the
> > simple SSE2 intrinsics that come with (I believe) any x64-64 chip, and
> > I'm guessing we can detect that at compile time and not mess with
> > runtime checks. The examples above using SSE for the general case are
> > much more complicated and involve SSE 4.2 or AVX.
>
> I wonder how using SSE compares with dealing with 64 or 32-bit words at
> a time, using regular instructions? That would be more portable.
I gave that a shot, and it's actually pretty good. According to this paper, [1], 16 bytes was best and gives a good apples-to-apples comparison to SSE registers, so I tried both 16 and 8 bytes.
> All supported encodings are ASCII subsets. Might be best to putt the
> ASCII-check into a static inline function and use it in all the verify
> functions. I presume it's only a few instructions, and these functions
> can be pretty performance sensitive.
I tried both the static inline function and also putting the whole optimized utf-8 loop in a separate function to which the caller passes a pointer to the appropriate pg_*_verifychar().
In the table below, "inline" refers to coding directly inside pg_utf8_verifystr(). Both C and SSE are in the same patch, with an #ifdef. I didn't bother splitting them out because for other encodings, we want one of the other approaches above. For those, "C retail" refers to a static inline function to code the contents of the inner loop, if I understood your suggestion correctly. This needs more boilerplate in each function, so I don't prefer this. "C func pointer" refers to the pointer approach I just mentioned. That is the cleanest looking way to generalize it, so I only tested that version with different strides -- 8- and 16-bytes
This is the same test I used earlier, which is the test in [2] but adding an almost-pure multibyte Chinese text of about the same size.
x64-64 Linux gcc 8.4.0:
build | chinese | mixed | ascii
------------------+---------+-------+-------
master | 1480 | 848 | 428
inline SSE | 1617 | 634 | 63
inline C | 1481 | 843 | 50
C retail | 1493 | 838 | 49
C func pointer | 1467 | 851 | 49
C func pointer 8 | 1518 | 757 | 56
x64-64 MacOS clang 10.0.0:
build | chinese | mixed | ascii
------------------+---------+-------+-------
master | 1086 | 760 | 374
inline SSE | 1081 | 529 | 70
inline C | 1093 | 649 | 49
C retail | 1132 | 695 | 152
C func pointer | 1085 | 609 | 59
C func pointer 8 | 1099 | 571 | 71
PowerPC-LE Linux gcc 4.8.5:
build | chinese | mixed | ascii
------------------+---------+-------+-------
master | 2961 | 1525 | 871
inline SSE | (n/a) | (n/a) | (n/a)
inline C | 2911 | 1329 | 80
C retail | 2838 | 1311 | 102
C func pointer | 2828 | 1314 | 80
C func pointer 8 | 3143 | 1249 | 133
Looking at the results, the main advantage of SSE here is it's more robust for mixed inputs. If a 16-byte chunk is not ascii-only but contains a block of ascii at the front, we can skip those with a single CPU instruction, but in C, we have to verify the whole chunk using the slow path.
The "C func pointer approach" seems to win out over the "C retail" approach (static inline function).
Using an 8-byte stride is slightly better for mixed inputs on all platforms tested, but regresses on pure ascii and also seems to regress on pure multibyte. The difference in the multibyte caes is small enough that it could be random, but it happens on two platforms, so I'd say it's real. On the other hand, pure multibyte is not as common as mixed text.
Overall, I think the function pointer approach with an 8-byte stride is the best balance. If that's agreeable, next I plan to test with short inputs, because I think we'll want a guard if-statement to only loop through the fast path if the string is long enough to justify that.
> > I also gave a shot at doing full UTF-8 recognition using a DFA, but so
> > far that has made performance worse. If I ever have more success with
> > that, I'll add that in the mix.
>
> That's disappointing. Perhaps the SIMD algorithms have higher startup
> costs, so that you need longer inputs to benefit? In that case, it might
> make sense to check the length of the input and only use the SIMD
> algorithm if the input is long enough.
I changed topics a bit quickly, but here I'm talking about using a table-driven state machine to verify the multibyte case. It's possible I did something wrong, since my model implementation decodes, and having to keep track of how many bytes got verified might be the culprit. I'd like to try again to speed up multibyte, but that might be a PG15 project.
[1] https://arxiv.org/abs/2010.03090
[2] https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5a2f@iki.fi
--
John Naylor
EDB: http://www.enterprisedb.com
>
> On 01/02/2021 19:32, John Naylor wrote:
> > It makes sense to start with the ascii subset of UTF-8 for a couple
> > reasons. First, ascii is very widespread in database content,
> > particularly in bulk loads. Second, ascii can be validated using the
> > simple SSE2 intrinsics that come with (I believe) any x64-64 chip, and
> > I'm guessing we can detect that at compile time and not mess with
> > runtime checks. The examples above using SSE for the general case are
> > much more complicated and involve SSE 4.2 or AVX.
>
> I wonder how using SSE compares with dealing with 64 or 32-bit words at
> a time, using regular instructions? That would be more portable.
I gave that a shot, and it's actually pretty good. According to this paper, [1], 16 bytes was best and gives a good apples-to-apples comparison to SSE registers, so I tried both 16 and 8 bytes.
> All supported encodings are ASCII subsets. Might be best to putt the
> ASCII-check into a static inline function and use it in all the verify
> functions. I presume it's only a few instructions, and these functions
> can be pretty performance sensitive.
I tried both the static inline function and also putting the whole optimized utf-8 loop in a separate function to which the caller passes a pointer to the appropriate pg_*_verifychar().
In the table below, "inline" refers to coding directly inside pg_utf8_verifystr(). Both C and SSE are in the same patch, with an #ifdef. I didn't bother splitting them out because for other encodings, we want one of the other approaches above. For those, "C retail" refers to a static inline function to code the contents of the inner loop, if I understood your suggestion correctly. This needs more boilerplate in each function, so I don't prefer this. "C func pointer" refers to the pointer approach I just mentioned. That is the cleanest looking way to generalize it, so I only tested that version with different strides -- 8- and 16-bytes
This is the same test I used earlier, which is the test in [2] but adding an almost-pure multibyte Chinese text of about the same size.
x64-64 Linux gcc 8.4.0:
build | chinese | mixed | ascii
------------------+---------+-------+-------
master | 1480 | 848 | 428
inline SSE | 1617 | 634 | 63
inline C | 1481 | 843 | 50
C retail | 1493 | 838 | 49
C func pointer | 1467 | 851 | 49
C func pointer 8 | 1518 | 757 | 56
x64-64 MacOS clang 10.0.0:
build | chinese | mixed | ascii
------------------+---------+-------+-------
master | 1086 | 760 | 374
inline SSE | 1081 | 529 | 70
inline C | 1093 | 649 | 49
C retail | 1132 | 695 | 152
C func pointer | 1085 | 609 | 59
C func pointer 8 | 1099 | 571 | 71
PowerPC-LE Linux gcc 4.8.5:
build | chinese | mixed | ascii
------------------+---------+-------+-------
master | 2961 | 1525 | 871
inline SSE | (n/a) | (n/a) | (n/a)
inline C | 2911 | 1329 | 80
C retail | 2838 | 1311 | 102
C func pointer | 2828 | 1314 | 80
C func pointer 8 | 3143 | 1249 | 133
Looking at the results, the main advantage of SSE here is it's more robust for mixed inputs. If a 16-byte chunk is not ascii-only but contains a block of ascii at the front, we can skip those with a single CPU instruction, but in C, we have to verify the whole chunk using the slow path.
The "C func pointer approach" seems to win out over the "C retail" approach (static inline function).
Using an 8-byte stride is slightly better for mixed inputs on all platforms tested, but regresses on pure ascii and also seems to regress on pure multibyte. The difference in the multibyte caes is small enough that it could be random, but it happens on two platforms, so I'd say it's real. On the other hand, pure multibyte is not as common as mixed text.
Overall, I think the function pointer approach with an 8-byte stride is the best balance. If that's agreeable, next I plan to test with short inputs, because I think we'll want a guard if-statement to only loop through the fast path if the string is long enough to justify that.
> > I also gave a shot at doing full UTF-8 recognition using a DFA, but so
> > far that has made performance worse. If I ever have more success with
> > that, I'll add that in the mix.
>
> That's disappointing. Perhaps the SIMD algorithms have higher startup
> costs, so that you need longer inputs to benefit? In that case, it might
> make sense to check the length of the input and only use the SIMD
> algorithm if the input is long enough.
I changed topics a bit quickly, but here I'm talking about using a table-driven state machine to verify the multibyte case. It's possible I did something wrong, since my model implementation decodes, and having to keep track of how many bytes got verified might be the culprit. I'd like to try again to speed up multibyte, but that might be a PG15 project.
[1] https://arxiv.org/abs/2010.03090
[2] https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5a2f@iki.fi
--
John Naylor
EDB: http://www.enterprisedb.com
pgsql-hackers by date: