Re: [POC] verifying UTF-8 using SIMD instructions - Mailing list pgsql-hackers

From	John Naylor
Subject	Re: [POC] verifying UTF-8 using SIMD instructions
Date	July 22, 2021 14:38:50
Msg-id	CAFBsxsFtTbnSehSVDBfy0dNLe+_TBhnvhyDt8_AfPct_XkTT7g@mail.gmail.com
In response to	Re: [POC] verifying UTF-8 using SIMD instructions (Thomas Munro <thomas.munro@gmail.com>)
List	pgsql-hackers

On Wed, Jul 21, 2021 at 8:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Jul 22, 2021 at 6:16 AM John Naylor

> One question is whether this "one size fits all" approach will be
> extensible to wider SIMD.

Sure, it'll just take a little more work and complexity. For one, 16-byte SIMD can operate on 32-byte chunks with a bit of repetition:

-    __m128i input;
+    __m128i input1;
+    __m128i input2;

-#define SIMD_STRIDE_LENGTH (sizeof(__m128i))
+#define SIMD_STRIDE_LENGTH 32

     while (len >= SIMD_STRIDE_LENGTH)
     {
-        input = vload(s);
+        input1 = vload(s);
+        input2 = vload(s + sizeof(input1));

-        check_for_zeros(input, &error);
+        check_for_zeros(input1, &error);
+        check_for_zeros(input2, &error);

         /*
          * If the chunk is all ASCII, we can skip the full UTF-8 check, but we
@@ -460,17 +463,18 @@ pg_validate_utf8_sse42(const unsigned char *s, int len)
          * sequences at the end. We only update prev_incomplete if the chunk
          * contains non-ASCII, since the error is cumulative.
          */
-        if (is_highbit_set(input))
+        if (is_highbit_set(bitwise_or(input1, input2)))
         {
-            check_utf8_bytes(prev, input, &error);
-            prev_incomplete = is_incomplete(input);
+            check_utf8_bytes(prev, input1, &error);
+            check_utf8_bytes(input1, input2, &error);
+            prev_incomplete = is_incomplete(input2);
         }
         else
         {
             error = bitwise_or(error, prev_incomplete);
         }

-        prev = input;
+        prev = input2;
         s += SIMD_STRIDE_LENGTH;
         len -= SIMD_STRIDE_LENGTH;
     }

So with a few #ifdefs, we can accommodate two sizes if we like.
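
To make the two-loads-per-stride idea concrete outside the patch, here is a minimal standalone sketch of just the all-ASCII fast-path test over 32-byte strides. It is not code from the patch: is_all_ascii() and STRIDE are hypothetical names, the SSE2 intrinsics stand in for the patch's vload()/bitwise_or()/is_highbit_set() wrappers, and the scalar loop is only there so the sketch also builds where SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#define HAVE_SSE2_SKETCH 1
#endif

#define STRIDE 32				/* two 16-byte vectors per iteration */

/*
 * Hypothetical helper (not from the patch): returns 1 if every byte in
 * s[0..len) has its high bit clear, else 0.
 */
static int
is_all_ascii(const unsigned char *s, size_t len)
{
#ifdef HAVE_SSE2_SKETCH
	while (len >= STRIDE)
	{
		__m128i		input1 = _mm_loadu_si128((const __m128i *) s);
		__m128i		input2 = _mm_loadu_si128((const __m128i *) (s + sizeof(input1)));

		/* OR the halves so a single movemask covers all 32 high bits. */
		if (_mm_movemask_epi8(_mm_or_si128(input1, input2)) != 0)
			return 0;

		s += STRIDE;
		len -= STRIDE;
	}
#endif

	/* Scalar tail; this is the whole job when SSE2 is not compiled in. */
	while (len > 0)
	{
		if (*s & 0x80)
			return 0;
		s++;
		len--;
	}
	return 1;
}
```

The point of ORing input1 and input2 first is that the high-bit test costs the same per iteration whether the stride is 16 or 32 bytes; only the loads are repeated.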

For another, the prevN() functions would need to change, at least on x86 -- that would require replacing _mm_alignr_epi8() with _mm256_alignr_epi8() plus _mm256_permute2x128_si256(). Also, we might have to do something with the vector typedef.
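
For illustration only, a 32-byte prev1 could look roughly like the sketch below. This is the commonly used AVX2 idiom, not anything from the patch set: prev1_avx2(), prev_bytes_ref() and prev1_self_check() are hypothetical names. The extra permute is needed because _mm256_alignr_epi8() shifts within each 128-bit lane independently, so the low lane has to be fed the high lane of the previous vector. The AVX2 part is compiled only when the compiler defines __AVX2__ (e.g. with -mavx2); otherwise only the scalar reference is built.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#ifdef __AVX2__
#include <immintrin.h>
#endif

#define VEC 32

/*
 * Scalar reference (hypothetical): out[i] is the byte that sits n positions
 * before input[i] in the stream prev || input, for 1 <= n <= VEC.
 */
static void
prev_bytes_ref(const uint8_t *prev, const uint8_t *input, int n, uint8_t *out)
{
	for (int i = 0; i < VEC; i++)
		out[i] = (i >= n) ? input[i - n] : prev[VEC + i - n];
}

#ifdef __AVX2__
/* Assumed AVX2 counterpart of a 16-byte prev1() built on _mm_alignr_epi8(). */
static __m256i
prev1_avx2(__m256i prev, __m256i input)
{
	/* bridge = [ high lane of prev | low lane of input ] */
	__m256i		bridge = _mm256_permute2x128_si256(prev, input, 0x21);

	/* Per lane: take the last byte of bridge, then the first 15 of input. */
	return _mm256_alignr_epi8(input, bridge, 16 - 1);
}
#endif

/* Returns 1 if the reference (and the AVX2 version, when built) agree. */
static int
prev1_self_check(void)
{
	uint8_t		prev[VEC];
	uint8_t		input[VEC];
	uint8_t		want[VEC];

	for (int i = 0; i < VEC; i++)
	{
		prev[i] = (uint8_t) i;
		input[i] = (uint8_t) (100 + i);
	}
	prev_bytes_ref(prev, input, 1, want);

	/* Spot checks, including the byte that crosses the 128-bit lane. */
	if (want[0] != 31 || want[1] != 100 || want[16] != 115 || want[31] != 130)
		return 0;

#ifdef __AVX2__
	uint8_t		got[VEC];
	__m256i		v = prev1_avx2(_mm256_loadu_si256((const __m256i *) prev),
							   _mm256_loadu_si256((const __m256i *) input));

	_mm256_storeu_si256((__m256i *) got, v);
	if (memcmp(got, want, VEC) != 0)
		return 0;
#endif
	return 1;
}
```

prev2 and prev3 would follow the same shape with 16 - 2 and 16 - 3 as the shift count, reusing the same permuted vector.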

That said, I think we can punt on that until we have an application that's much more compute-intensive. As it is with SSE4, COPY FROM WHERE <selective predicate> already pushes the utf8 validation way down in profiles.

> FWIW here are some performance results from my humble RPI4:
>
> master:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>     4172 |  2763 |  1823
> (1 row)
>
> Your v15 patch:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>     2267 |  1248 |   399
> (1 row)
>
> Your v15 patch set + the NEON patch, configured with USE_UTF8_SIMD=1:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>      909 |   620 |   318
> (1 row)
>
> It's so good I wonder if it's producing incorrect results :-)

Nice! If it passes regression tests, it *should* be fine, but stress testing would be welcome on any platform.
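
One way such stress testing could be set up is a differential test: feed random byte strings to a simple byte-at-a-time validator and to the implementation under test, and count disagreements. The sketch below is only an assumption about how one might do it, not part of the patch. All names are hypothetical, utf8_valid_fast() is a trivial stand-in for a real SIMD path (it skips 16-byte all-ASCII chunks and then defers to the reference), and NUL bytes are treated as invalid to mirror the check_for_zeros() calls in the diff above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Reference: 1 if s[0..len) is well-formed UTF-8 with no NUL bytes. */
static int
utf8_valid_ref(const unsigned char *s, size_t len)
{
	size_t		i = 0;

	while (i < len)
	{
		unsigned char c = s[i];
		size_t		n;			/* continuation bytes expected */
		uint32_t	cp;			/* decoded code point */
		uint32_t	min;		/* smallest code point for this length */

		if (c == 0)
			return 0;
		if (c < 0x80)
		{
			i++;
			continue;
		}
		else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; min = 0x80; }
		else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
		else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
		else
			return 0;			/* stray continuation or invalid lead byte */

		if (len - i <= n)
			return 0;			/* sequence truncated at end of input */
		for (size_t k = 1; k <= n; k++)
		{
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}
		/* Reject overlong forms, out-of-range values, and surrogates. */
		if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
			return 0;
		i += n + 1;
	}
	return 1;
}

/* Stand-in for the implementation under test (hypothetical). */
static int
utf8_valid_fast(const unsigned char *s, size_t len)
{
	while (len >= 16)
	{
		unsigned char acc = 0;
		int			has_zero = 0;

		for (int k = 0; k < 16; k++)
		{
			acc |= s[k];
			has_zero |= (s[k] == 0);
		}
		if (has_zero)
			return 0;
		if (acc & 0x80)
			break;				/* non-ASCII chunk: take the slow path */
		s += 16;
		len -= 16;
	}
	return utf8_valid_ref(s, len);
}

/* Returns the number of inputs on which the two validators disagree. */
static long
stress(unsigned seed, long rounds)
{
	unsigned char buf[64];
	long		mismatches = 0;

	srand(seed);
	for (long r = 0; r < rounds; r++)
	{
		size_t		len = (size_t) (rand() % 64) + 1;

		/* Mostly ASCII letters, with arbitrary bytes mixed in. */
		for (size_t i = 0; i < len; i++)
			buf[i] = (rand() % 4 == 0) ? (unsigned char) (rand() & 0xFF)
				: (unsigned char) ('a' + rand() % 26);
		if (utf8_valid_fast(buf, len) != utf8_valid_ref(buf, len))
			mismatches++;
	}
	return mismatches;
}
```

To test a real SIMD validator, utf8_valid_fast() would be replaced by a call into it; purely random bytes rarely form long valid multibyte runs, so a generator that stitches together valid sequences and then corrupts a byte would probably exercise more of the interesting paths.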

> I also tried to do a quick and dirty AltiVec patch to see if it could
> fit into the same code "shape", with less immediate success: it works
> out slower than the fallback code on the POWER7 machine I scrounged an
> account on. I'm not sure what's wrong there, but maybe it's a useful
> start (I'm probably confused about endianness, or the encoding of
> boolean vectors which may be different (is true 0x01 or 0xff, does it
> matter?), or something else, and it's falling back on errors all the
> time?).

Hmm, I have access to a power8 machine to play with, but I also don't mind having some type of server-class hardware that relies on the recent nifty DFA fallback, which performs even better on powerpc64le than v15.

--
John Naylor
EDB: http://www.enterprisedb.com
