Re: [POC] verifying UTF-8 using SIMD instructions - Mailing list pgsql-hackers

From John Naylor
Subject Re: [POC] verifying UTF-8 using SIMD instructions
Date
Msg-id CAFBsxsFtTbnSehSVDBfy0dNLe+_TBhnvhyDt8_AfPct_XkTT7g@mail.gmail.com
In response to Re: [POC] verifying UTF-8 using SIMD instructions  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers

On Wed, Jul 21, 2021 at 8:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Jul 22, 2021 at 6:16 AM John Naylor

> One question is whether this "one size fits all" approach will be
> extensible to wider SIMD.

Sure, it'll just take a little more work and complexity. For one, 16-byte SIMD can operate on 32-byte chunks with a bit of repetition:

-       __m128i         input;
+       __m128i         input1;
+       __m128i         input2;

-#define SIMD_STRIDE_LENGTH (sizeof(__m128i))
+#define SIMD_STRIDE_LENGTH 32

        while (len >= SIMD_STRIDE_LENGTH)
        {
-               input = vload(s);
+               input1 = vload(s);
+               input2 = vload(s + sizeof(input1));

-               check_for_zeros(input, &error);
+               check_for_zeros(input1, &error);
+               check_for_zeros(input2, &error);

                /*
                 * If the chunk is all ASCII, we can skip the full UTF-8 check, but we
@@ -460,17 +463,18 @@ pg_validate_utf8_sse42(const unsigned char *s, int len)
                 * sequences at the end. We only update prev_incomplete if the chunk
                 * contains non-ASCII, since the error is cumulative.
                 */
-               if (is_highbit_set(input))
+               if (is_highbit_set(bitwise_or(input1, input2)))
                {
-                       check_utf8_bytes(prev, input, &error);
-                       prev_incomplete = is_incomplete(input);
+                       check_utf8_bytes(prev, input1, &error);
+                       check_utf8_bytes(input1, input2, &error);
+                       prev_incomplete = is_incomplete(input2);
                }
                else
                {
                        error = bitwise_or(error, prev_incomplete);
                }

-               prev = input;
+               prev = input2;
                s += SIMD_STRIDE_LENGTH;
                len -= SIMD_STRIDE_LENGTH;
        }

So with a few #ifdefs, we can accommodate two sizes if we like. 
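To show the shape of that change outside the patch, here is a minimal standalone sketch of just the ASCII fast path over a 32-byte stride, using two 16-byte SSE2 loads ORed together. The names chunk32_has_highbit() and count_ascii_prefix() are made up for illustration and are not in the patch. Builds without SSE2 fall back to plain 64-bit words, and the zero-byte check is left out for brevity.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define STRIDE 32

/* Returns nonzero if any of the 32 bytes at s has its high bit set. */
static int
chunk32_has_highbit(const unsigned char *s)
{
#if defined(__SSE2__)
	__m128i		input1 = _mm_loadu_si128((const __m128i *) s);
	__m128i		input2 = _mm_loadu_si128((const __m128i *) (s + sizeof(input1)));

	/* ORing the two registers first means one movemask covers all 32 bytes */
	return _mm_movemask_epi8(_mm_or_si128(input1, input2)) != 0;
#else
	uint64_t	w[4];

	/* portable stand-in: same test on four 8-byte words */
	memcpy(w, s, sizeof(w));
	return ((w[0] | w[1] | w[2] | w[3]) & UINT64_C(0x8080808080808080)) != 0;
#endif
}

/*
 * Hypothetical helper: number of leading bytes, counted in whole strides,
 * that are pure ASCII. A real validator would hand the rest to the full
 * UTF-8 check.
 */
static size_t
count_ascii_prefix(const unsigned char *s, size_t len)
{
	size_t		done = 0;

	while (len - done >= STRIDE && !chunk32_has_highbit(s + done))
		done += STRIDE;

	return done;
}
```

The point is only that the per-stride branch is taken once per 32 bytes instead of once per 16, at the cost of one extra load and one OR.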

For another, the prevN() functions would need to change, at least on x86 -- that would require replacing _mm_alignr_epi8() with _mm256_alignr_epi8() plus _mm256_permute2x128_si256(). Also, we might have to do something with the vector typedef.
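As a rough sketch of what that replacement could look like for the "previous 1 byte" case, assuming 32-byte vectors: _mm256_alignr_epi8() only shifts within each 128-bit lane, so the 16 bytes preceding each lane have to be gathered first with _mm256_permute2x128_si256(). The function names prev1_avx2(), prev1_scalar() and prev1_matches_reference() are invented here and are not from the patch. The AVX2 path is only compiled on x86 with GCC or Clang and only run if the CPU reports AVX2 support.

```c
#include <stdint.h>
#include <string.h>

#if (defined(__x86_64__) || defined(__i386__)) && \
	(defined(__GNUC__) || defined(__clang__))
#define HAVE_AVX2_SKETCH 1
#include <immintrin.h>

/* out[i] = byte that precedes input[i] in the stream (prev is the last vector) */
__attribute__((target("avx2")))
static void
prev1_avx2(const uint8_t prev[32], const uint8_t input[32], uint8_t out[32])
{
	__m256i		p = _mm256_loadu_si256((const __m256i *) prev);
	__m256i		in = _mm256_loadu_si256((const __m256i *) input);

	/* low lane = high half of prev, high lane = low half of input */
	__m256i		shifted = _mm256_permute2x128_si256(p, in, 0x21);

	/* per lane: take the last byte of "shifted", then 15 bytes of "in" */
	__m256i		r = _mm256_alignr_epi8(in, shifted, 16 - 1);

	_mm256_storeu_si256((__m256i *) out, r);
}
#endif

/* Scalar statement of the same operation, used as the reference. */
static void
prev1_scalar(const uint8_t prev[32], const uint8_t input[32], uint8_t out[32])
{
	out[0] = prev[31];
	memcpy(out + 1, input, 31);
}

/* Returns 1 if the vector version (when available) agrees with the reference. */
static int
prev1_matches_reference(const uint8_t prev[32], const uint8_t input[32])
{
	uint8_t		expected[32];
	uint8_t		actual[32];

	prev1_scalar(prev, input, expected);
#ifdef HAVE_AVX2_SKETCH
	if (__builtin_cpu_supports("avx2"))
		prev1_avx2(prev, input, actual);
	else
#endif
		prev1_scalar(prev, input, actual);

	return memcmp(expected, actual, 32) == 0;
}
```

The prev2 and prev3 variants would presumably differ only in the shift count (16 - 2, 16 - 3).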

That said, I think we can punt on that until we have an application that's much more compute-intensive. As it is, with SSE4, COPY FROM WHERE <selective predicate> already pushes UTF-8 validation way down in the profiles.

> FWIW here are some performance results from my humble RPI4:
>
> master:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>     4172 |  2763 |  1823
> (1 row)
>
> Your v15 patch:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>     2267 |  1248 |   399
> (1 row)
>
> Your v15 patch set + the NEON patch, configured with USE_UTF8_SIMD=1:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>      909 |   620 |   318
> (1 row)
>
> It's so good I wonder if it's producing incorrect results :-)

Nice! If it passes regression tests, it *should* be fine, but stress testing would be welcome on any platform.
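For anyone who wants to try, one low-tech way to stress test is differential testing: feed the same buffers to the SIMD validator and to a deliberately simple scalar one, and compare the answers. Below is a sketch of such a reference. The name utf8_valid_reference() is made up, and this is not the server's own verifier. It rejects zero bytes, on the assumption that this matches what check_for_zeros() does in the diff above.

```c
#include <stddef.h>

/* Returns 1 if s[0..len) is well-formed UTF-8 with no zero bytes, else 0. */
static int
utf8_valid_reference(const unsigned char *s, size_t len)
{
	size_t		i = 0;

	while (i < len)
	{
		unsigned char b = s[i];
		unsigned int cp;		/* decoded code point */
		unsigned int min;		/* smallest code point allowed at this length */
		size_t		n;			/* sequence length in bytes */
		size_t		k;

		if (b == 0)
			return 0;
		if (b < 0x80)
		{
			i++;
			continue;
		}

		if ((b & 0xE0) == 0xC0)
		{
			n = 2; cp = b & 0x1F; min = 0x80;
		}
		else if ((b & 0xF0) == 0xE0)
		{
			n = 3; cp = b & 0x0F; min = 0x800;
		}
		else if ((b & 0xF8) == 0xF0)
		{
			n = 4; cp = b & 0x07; min = 0x10000;
		}
		else
			return 0;			/* stray continuation byte or invalid lead */

		if (len - i < n)
			return 0;			/* sequence truncated at end of input */

		for (k = 1; k < n; k++)
		{
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;		/* not a continuation byte */
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}

		if (cp < min)
			return 0;			/* overlong encoding */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;			/* UTF-16 surrogate */
		if (cp > 0x10FFFF)
			return 0;			/* beyond the Unicode range */

		i += n;
	}

	return 1;
}
```

Inputs worth generating include sequences that straddle a 16-byte boundary, truncated sequences at the very end of the buffer, and long ASCII runs followed by a single bad byte.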

> I also tried to do a quick and dirty AltiVec patch to see if it could
> fit into the same code "shape", with less immediate success: it works
> out slower than the fallback code on the POWER7 machine I scrounged an
> account on.  I'm not sure what's wrong there, but maybe it's a useful
> start (I'm probably confused about endianness, or the encoding of
> boolean vectors which may be different (is true 0x01 or 0xff, does it
> matter?), or something else, and it's falling back on errors all the
> time?).

Hmm, I have access to a POWER8 machine to play with, but I also don't mind having some type of server-class hardware that relies on the recent nifty DFA fallback, which performs even better on powerpc64le than v15.

--
John Naylor
EDB: http://www.enterprisedb.com
