Re: [POC] verifying UTF-8 using SIMD instructions - Mailing list pgsql-hackers

From John Naylor
Subject Re: [POC] verifying UTF-8 using SIMD instructions
Date
Msg-id CAFBsxsEChYg6Bh4iS_Sc2Kt1N=c_92tgdkX-hABiu6SSpeEpnA@mail.gmail.com
In response to Re: [POC] verifying UTF-8 using SIMD instructions  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: [POC] verifying UTF-8 using SIMD instructions
List pgsql-hackers
On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
>   mixed | ascii
> -------+-------
>     447 |    46
> (1 row)
>
> I think we should at least try to adopt that. At a high level, it looks
> pretty similar your patch: you load the data 8 bytes at a time, check if
> there are all ASCII. If there are any non-ASCII chars, you check the
> bytes one by one, otherwise you load the next 8 bytes. Your patch should
> be able to achieve the same performance, if done right. I don't think
> the simdjson code forbids \0 bytes, so that will add a few cycles, but
> still.

Attached is a patch that does roughly what the simdjson fallback does, except that I use straight tests on the bytes and only calculate code points in assertion builds. In the course of doing this, I found that my earlier concerns about putting the ascii check in a static inline function were due to my suboptimal loop implementation. I had assumed that if the chunked ascii check failed, the loop had to check all the bytes in that chunk one at a time. As it turns out, that's a waste of the branch predictor. In the v2 patch, we do the chunked ascii check every time we loop. With that, I can also confirm the claim in the Lemire paper that it's better to do the check on 16-byte chunks:

(MacOS, Clang 10)

master:

 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366

v2 patch, with 16-byte stride:

 chinese | mixed | ascii
---------+-------+-------
     806 |   474 |    83

v2 patch, but with 8-byte stride:

 chinese | mixed | ascii
---------+-------+-------
     792 |   490 |   105
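
To make the loop structure described above concrete, here is a minimal standalone sketch. It is not the code from the attached patch. The names STRIDE, chunk_is_ascii, utf8_verifychar_sketch and utf8_verifystr_sketch are made up for illustration, and I'm assuming the function returns the number of leading valid bytes. The point is only that the chunked ascii check is retried on every iteration, and a failed check costs one character's worth of byte tests rather than a whole chunk's:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STRIDE 16				/* hypothetical name; bytes per ascii check */
#define HIGH_BITS UINT64_C(0x8080808080808080)

/* True if the STRIDE bytes at s are all ASCII and none of them is zero. */
static inline int
chunk_is_ascii(const unsigned char *s)
{
	uint64_t	highbits = 0;
	uint64_t	nonzero = HIGH_BITS;

	for (int i = 0; i < STRIDE; i += 8)
	{
		uint64_t	w;

		memcpy(&w, s + i, sizeof(w));
		highbits |= w;
		/* if all bytes are < 0x80, adding 0x7f sets a byte's high bit iff it is nonzero */
		nonzero &= w + UINT64_C(0x7f7f7f7f7f7f7f7f);
	}
	return (highbits & HIGH_BITS) == 0 && nonzero == HIGH_BITS;
}

/* Length of the valid UTF-8 character at s, or -1. Straight byte tests only. */
static int
utf8_verifychar_sketch(const unsigned char *s, size_t len)
{
	unsigned char b1 = s[0];

	if (b1 == 0)
		return -1;				/* embedded zero bytes are rejected */
	if (b1 < 0x80)
		return 1;
	if (b1 >= 0xC2 && b1 <= 0xDF)
		return (len >= 2 && (s[1] & 0xC0) == 0x80) ? 2 : -1;
	if ((b1 & 0xF0) == 0xE0)
	{
		if (len < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
			return -1;
		if (b1 == 0xE0 && s[1] < 0xA0)
			return -1;			/* overlong */
		if (b1 == 0xED && s[1] > 0x9F)
			return -1;			/* surrogate */
		return 3;
	}
	if (b1 >= 0xF0 && b1 <= 0xF4)
	{
		if (len < 4 || (s[1] & 0xC0) != 0x80 ||
			(s[2] & 0xC0) != 0x80 || (s[3] & 0xC0) != 0x80)
			return -1;
		if (b1 == 0xF0 && s[1] < 0x90)
			return -1;			/* overlong */
		if (b1 == 0xF4 && s[1] > 0x8F)
			return -1;			/* above U+10FFFF */
		return 4;
	}
	return -1;					/* stray continuation byte or invalid lead */
}

/* Returns the number of leading bytes of s that are valid. */
static size_t
utf8_verifystr_sketch(const unsigned char *s, size_t len)
{
	const unsigned char *start = s;

	assert(s != NULL || len == 0);
	while (len > 0)
	{
		int			l;

		/* try the fast path on every iteration */
		if (len >= STRIDE && chunk_is_ascii(s))
		{
			s += STRIDE;
			len -= STRIDE;
			continue;
		}
		/* otherwise verify just one character, then loop again */
		l = utf8_verifychar_sketch(s, len);
		if (l < 0)
			break;
		s += l;
		len -= (size_t) l;
	}
	return (size_t) (s - start);
}
```

Changing STRIDE to 8 gives the 8-byte variant of the same loop.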

I also added the fast path to all the other multibyte encodings, and the performance there is also pretty good. It regresses from master on pure multibyte input, but that case is still faster than PG13, which I simulated by reverting 6c5576075b0f9 and b80e10638e3:

~PG13:

 chinese | mixed | ascii
---------+-------+-------
    1565 |   848 |   365

ascii fast-path plus pg_*_verifychar():

 chinese | mixed | ascii
---------+-------+-------
    1279 |   656 |    94
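
For the other encodings, the combination of the ascii fast path with a per-character verifier can be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: verifychar_fn is a hypothetical callback type standing in for the pg_*_verifychar() functions, and toy_euc_verifychar is a deliberately simplified two-byte checker, not a real encoding verifier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callback: length of the valid character at s, or -1. */
typedef int (*verifychar_fn) (const unsigned char *s, size_t len);

/* Toy stand-in for a pg_*_verifychar(): ASCII, or two bytes in 0xA1..0xFE. */
static int
toy_euc_verifychar(const unsigned char *s, size_t len)
{
	if (s[0] == 0)
		return -1;
	if (s[0] < 0x80)
		return 1;
	if (s[0] < 0xA1 || s[0] > 0xFE)
		return -1;
	if (len < 2 || s[1] < 0xA1 || s[1] > 0xFE)
		return -1;
	return 2;
}

/* True if the 16 bytes at s are all ASCII and nonzero. */
static inline int
is_ascii_16(const unsigned char *s)
{
	const uint64_t hi = UINT64_C(0x8080808080808080);
	const uint64_t lo = UINT64_C(0x7f7f7f7f7f7f7f7f);
	uint64_t	a,
				b;

	memcpy(&a, s, 8);
	memcpy(&b, s + 8, 8);
	if ((a | b) & hi)
		return 0;
	/* high bits are clear, so adding 0x7f flags every nonzero byte */
	return ((a + lo) & (b + lo) & hi) == hi;
}

/* Returns the number of leading valid bytes; the encoding is in verifychar. */
static size_t
mb_verifystr_sketch(const unsigned char *s, size_t len, verifychar_fn verifychar)
{
	const unsigned char *start = s;

	assert(verifychar != NULL);
	while (len > 0)
	{
		int			l;

		if (len >= 16 && is_ascii_16(s))
		{
			s += 16;
			len -= 16;
			continue;
		}
		l = verifychar(s, len);
		if (l < 0)
			break;
		s += l;
		len -= (size_t) l;
	}
	return (size_t) (s - start);
}
```

The fast path is encoding-independent here because it only ever skips bytes below 0x80, so only the callback differs per encoding. Pure multibyte input pays for the failed chunk check before every character, which would be consistent with the regression from master noted above.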


v2 also makes a rough start on supporting multiple implementations in src/backend/port. Next steps are:

1. Add more tests for utf-8 coverage (in addition to the ones to be added by the noError argument patch)
2. Add SSE4 validator -- it turns out the demo I referred to earlier doesn't match the algorithm in the paper. I plan to copy only the lookup tables from simdjson verbatim; the rest of the code will basically be written from scratch, using simdjson as a hint.
3. Adjust configure.ac

--
John Naylor
EDB: http://www.enterprisedb.com
