Thread: use SSE2 for is_valid_ascii

use SSE2 for is_valid_ascii

From

John Naylor

Date:

10 August 2022, 06:50:14

new thread [was: WIP Patch: Add a function that returns binary JSONB as a bytea]

> I wrote:
> > We can also shave a
> > few percent by having pg_utf8_verifystr use SSE2 for the ascii path. I
> > can look into this.
>
> Here's a patch for that. If the input is mostly ascii, I'd expect that
> part of the flame graph to shrink by 40-50% and give a small boost
> overall.

Here is an updated patch using the new USE_SSE2 symbol. The style is
different from the last one in that each stanza has platform-specific
code. I wanted to try it this way because is_valid_ascii() is already
written in SIMD-ish style using general purpose registers and bit
twiddling, so it seemed natural to see the two side-by-side. Sometimes
they can share the same comment. If we think this is bad for
readability, I can go back to one block each, but that way leads to
duplication of code and it's difficult to see what's different for
each platform, IMO.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment

v2-0001-Use-SSE2-in-is_valid_ascii-where-available.patch

Re: use SSE2 for is_valid_ascii

From

Nathan Bossart

Date:

10 August 2022, 22:31:20

On Wed, Aug 10, 2022 at 01:50:14PM +0700, John Naylor wrote:
> Here is an updated patch using the new USE_SSE2 symbol. The style is
> different from the last one in that each stanza has platform-specific
> code. I wanted to try it this way because is_valid_ascii() is already
> written in SIMD-ish style using general purpose registers and bit
> twiddling, so it seemed natural to see the two side-by-side. Sometimes
> they can share the same comment. If we think this is bad for
> readability, I can go back to one block each, but that way leads to
> duplication of code and it's difficult to see what's different for
> each platform, IMO.

This is a neat patch.  I don't know that we need an entirely separate code
block for the USE_SSE2 path, but I do think that a little bit of extra
commentary would improve the readability.  IMO the existing comment for the
zero accumulator has the right amount of detail.

+        /*
+         * Set all bits in each lane of the error accumulator where input
+         * bytes are zero.
+         */
+        error_cum = _mm_or_si128(error_cum,
+                                 _mm_cmpeq_epi8(chunk, _mm_setzero_si128()));

I wonder if reusing a zero vector (instead of creating a new one every
time) has any noticeable effect on performance.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: use SSE2 for is_valid_ascii

From

John Naylor

Date:

11 August 2022, 04:10:34

On Thu, Aug 11, 2022 at 5:31 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> This is a neat patch.  I don't know that we need an entirely separate code
> block for the USE_SSE2 path, but I do think that a little bit of extra
> commentary would improve the readability.  IMO the existing comment for the
> zero accumulator has the right amount of detail.
>
> +               /*
> +                * Set all bits in each lane of the error accumulator where input
> +                * bytes are zero.
> +                */
> +               error_cum = _mm_or_si128(error_cum,
> +                                                                _mm_cmpeq_epi8(chunk, _mm_setzero_si128()));

Okay, I will think about the comments, thanks for looking.

> I wonder if reusing a zero vector (instead of creating a new one every
> time) has any noticeable effect on performance.

Creating a zeroed register is just FOO PXOR FOO, which should get
hoisted out of the (unrolled in this case) loop, and which a recent
CPU will just map to a hard-coded zero in the register file, in which
case the execution latency is 0 cycles. :-)

-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: use SSE2 for is_valid_ascii

From

Nathan Bossart

Date:

11 August 2022, 05:35:30

On Thu, Aug 11, 2022 at 11:10:34AM +0700, John Naylor wrote:
>> I wonder if reusing a zero vector (instead of creating a new one every
>> time) has any noticeable effect on performance.
> 
> Creating a zeroed register is just FOO PXOR FOO, which should get
> hoisted out of the (unrolled in this case) loop, and which a recent
> CPU will just map to a hard-coded zero in the register file, in which
> case the execution latency is 0 cycles. :-)

Ah, indeed.  At -O2, my compiler seems to zero out two registers before the
loop with either approach:

    pxor    %xmm0, %xmm0    ; accumulator
    pxor    %xmm2, %xmm2    ; always zeros

And within the loop, I see the following:

    movdqu  (%rdi), %xmm1
    movdqu  (%rdi), %xmm3
    addq    $16, %rdi
    pcmpeqb %xmm2, %xmm1    ; check for zeros
    por %xmm3, %xmm0        ; OR data into accumulator
    por %xmm1, %xmm0        ; OR zero check results into accumulator
    cmpq    %rdi, %rsi

So the call to _mm_setzero_si128() within the loop is fine.  Apologies for
the noise.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: use SSE2 for is_valid_ascii

From

John Naylor

Date:

25 August 2022, 09:41:53

v3 applies on top of the v9 json_lex_string patch in [1] and adds a
bit more to that, resulting in a simpler patch that is more amenable
to additional SIMD-capable platforms.

[1] https://www.postgresql.org/message-id/CAFBsxsFV4v802idV0-Bo%3DV7wLMHRbOZ4er0hgposhyGCikmVGA%40mail.gmail.com

-- 
John Naylor
EDB: http://www.enterprisedb.com

Attachment

v3-use-simd-is-valid-ascii.patch

Re: use SSE2 for is_valid_ascii

From

Nathan Bossart

Date:

26 August 2022, 03:26:25

On Thu, Aug 25, 2022 at 04:41:53PM +0700, John Naylor wrote:
> v3 applies on top of the v9 json_lex_string patch in [1] and adds a
> bit more to that, resulting in a simpler patch that is more amenable
> to additional SIMD-capable platforms.

LGTM

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: use SSE2 for is_valid_ascii

From

John Naylor

Date:

26 August 2022, 09:02:32

On Fri, Aug 26, 2022 at 10:26 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Thu, Aug 25, 2022 at 04:41:53PM +0700, John Naylor wrote:
> > v3 applies on top of the v9 json_lex_string patch in [1] and adds a
> > bit more to that, resulting in a simpler patch that is more amenable
> > to additional SIMD-capable platforms.
>
> LGTM

Thanks for looking, pushed with some rearrangements.

-- 
John Naylor
EDB: http://www.enterprisedb.com