Thread: use SSE2 for is_valid_ascii
new thread [was: WIP Patch: Add a function that returns binary JSONB as a bytea] > I wrote: > > We can also shave a > > few percent by having pg_utf8_verifystr use SSE2 for the ascii path. I > > can look into this. > > Here's a patch for that. If the input is mostly ascii, I'd expect that > part of the flame graph to shrink by 40-50% and give a small boost > overall. Here is an updated patch using the new USE_SSE2 symbol. The style is different from the last one in that each stanza has platform-specific code. I wanted to try it this way because is_valid_ascii() is already written in SIMD-ish style using general purpose registers and bit twiddling, so it seemed natural to see the two side-by-side. Sometimes they can share the same comment. If we think this is bad for readability, I can go back to one block each, but that way leads to duplication of code and it's difficult to see what's different for each platform, IMO. -- John Naylor EDB: http://www.enterprisedb.com
Attachment
On Wed, Aug 10, 2022 at 01:50:14PM +0700, John Naylor wrote: > Here is an updated patch using the new USE_SSE2 symbol. The style is > different from the last one in that each stanza has platform-specific > code. I wanted to try it this way because is_valid_ascii() is already > written in SIMD-ish style using general purpose registers and bit > twiddling, so it seemed natural to see the two side-by-side. Sometimes > they can share the same comment. If we think this is bad for > readability, I can go back to one block each, but that way leads to > duplication of code and it's difficult to see what's different for > each platform, IMO. This is a neat patch. I don't know that we need an entirely separate code block for the USE_SSE2 path, but I do think that a little bit of extra commentary would improve the readability. IMO the existing comment for the zero accumulator has the right amount of detail. + /* + * Set all bits in each lane of the error accumulator where input + * bytes are zero. + */ + error_cum = _mm_or_si128(error_cum, + _mm_cmpeq_epi8(chunk, _mm_setzero_si128())); I wonder if reusing a zero vector (instead of creating a new one every time) has any noticeable effect on performance. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, Aug 11, 2022 at 5:31 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > This is a neat patch. I don't know that we need an entirely separate code > block for the USE_SSE2 path, but I do think that a little bit of extra > commentary would improve the readability. IMO the existing comment for the > zero accumulator has the right amount of detail. > > + /* > + * Set all bits in each lane of the error accumulator where input > + * bytes are zero. > + */ > + error_cum = _mm_or_si128(error_cum, > + _mm_cmpeq_epi8(chunk, _mm_setzero_si128())); Okay, I will think about the comments, thanks for looking. > I wonder if reusing a zero vector (instead of creating a new one every > time) has any noticeable effect on performance. Creating a zeroed register is just FOO PXOR FOO, which should get hoisted out of the (unrolled in this case) loop, and which a recent CPU will just map to a hard-coded zero in the register file, in which case the execution latency is 0 cycles. :-) -- John Naylor EDB: http://www.enterprisedb.com
On Thu, Aug 11, 2022 at 11:10:34AM +0700, John Naylor wrote: >> I wonder if reusing a zero vector (instead of creating a new one every >> time) has any noticeable effect on performance. > > Creating a zeroed register is just FOO PXOR FOO, which should get > hoisted out of the (unrolled in this case) loop, and which a recent > CPU will just map to a hard-coded zero in the register file, in which > case the execution latency is 0 cycles. :-) Ah, indeed. At -O2, my compiler seems to zero out two registers before the loop with either approach: pxor %xmm0, %xmm0 ; accumulator pxor %xmm2, %xmm2 ; always zeros And within the loop, I see the following: movdqu (%rdi), %xmm1 movdqu (%rdi), %xmm3 addq $16, %rdi pcmpeqb %xmm2, %xmm1 ; check for zeros por %xmm3, %xmm0 ; OR data into accumulator por %xmm1, %xmm0 ; OR zero check results into accumulator cmpq %rdi, %rsi So the call to _mm_setzero_si128() within the loop is fine. Apologies for the noise. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
v3 applies on top of the v9 json_lex_string patch in [1] and adds a bit more to that, resulting in a simpler patch that is more amenable to additional SIMD-capable platforms. [1] https://www.postgresql.org/message-id/CAFBsxsFV4v802idV0-Bo%3DV7wLMHRbOZ4er0hgposhyGCikmVGA%40mail.gmail.com -- John Naylor EDB: http://www.enterprisedb.com
Attachment
On Thu, Aug 25, 2022 at 04:41:53PM +0700, John Naylor wrote: > v3 applies on top of the v9 json_lex_string patch in [1] and adds a > bit more to that, resulting in a simpler patch that is more amenable > to additional SIMD-capable platforms. LGTM -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Fri, Aug 26, 2022 at 10:26 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Thu, Aug 25, 2022 at 04:41:53PM +0700, John Naylor wrote: > > v3 applies on top of the v9 json_lex_string patch in [1] and adds a > > bit more to that, resulting in a simpler patch that is more amenable > > to additional SIMD-capable platforms. > > LGTM Thanks for looking, pushed with some rearrangements. -- John Naylor EDB: http://www.enterprisedb.com