Re: [POC] verifying UTF-8 using SIMD instructions - Mailing list pgsql-hackers

From John Naylor
Subject Re: [POC] verifying UTF-8 using SIMD instructions
Date
Msg-id CAFBsxsHVXK_e5=yCdrB_vSUdV9waZf1tqh5SJpoPW65_Y_bm0A@mail.gmail.com
Whole thread Raw
In response to Re: [POC] verifying UTF-8 using SIMD instructions  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: [POC] verifying UTF-8 using SIMD instructions  (John Naylor <john.naylor@enterprisedb.com>)
List pgsql-hackers
On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> As a quick test, I hacked up pg_utf8_verifystr() to use Lemire's
> algorithm from the simdjson library [1], see attached patch. I
> microbenchmarked it using the the same test I used before [2].

I've been looking at various iterations of Lemire's utf8 code, and trying it out was next on my list, so thanks for doing that!

> These results are with "gcc -O2" using "gcc (Debian 10.2.1-6) 10.2.1
> 20210110"
>
> unpatched master:
>
> postgres=# \i mbverifystr-speed.sql
> CREATE FUNCTION
>   mixed | ascii
> -------+-------
>     728 |   393
> (1 row)
>
> v1-0001-Add-an-ASCII-fast-path-to-multibyte-encoding-veri.patch:
>
>   mixed | ascii
> -------+-------
>     759 |    98
> (1 row)

Hmm, the mixed case got worse -- I haven't seen that in any of my tests.

> simdjson-utf8-hack.patch:
>
>   mixed | ascii
> -------+-------
>      53 |    31
> (1 row)
>
> So clearly that algorithm is fast. Not sure if it has a high startup
> cost, or large code size, or other tradeoffs that we don't want.

The simdjson lib uses everything up through AVX512 depending on what hardware is available. I seem to remember reading that high start-up cost is more relevant to floating point than to integer ops, but I could be wrong. Just the utf8 portion is surely tiny also.

> At
> least it depends on SIMD instructions, so it requires more code for the
> architecture-specific implementations and autoconf logic and all that.

One of his earlier demos [1] (in simdutf8check.h) had a version that used mostly SSE2 with just three intrinsics from SSSE3. That's widely available by now. He measured that at 0.7 cycles per byte, which is still good compared to AVX2 0.45 cycles per byte [2].

Testing for three SSSE3 intrinsics in autoconf is pretty easy. I would assume that if that check (and the corresponding runtime check) passes, we can assume SSE2. That code has three licenses to choose from -- Apache 2, Boost, and MIT. Something like that might be straightforward to start from. I think the only obstacles to worry about are license and getting it to fit into our codebase. Adding more than zero high-level comments with a good description of how it works in detail is also a bit of a challenge.

> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
>   mixed | ascii
> -------+-------
>     447 |    46
> (1 row)
>
> I think we should at least try to adopt that. At a high level, it looks
> pretty similar your patch: you load the data 8 bytes at a time, check if
> there are all ASCII. If there are any non-ASCII chars, you check the
> bytes one by one, otherwise you load the next 8 bytes. Your patch should
> be able to achieve the same performance, if done right. I don't think
> the simdjson code forbids \0 bytes, so that will add a few cycles, but
> still.

Okay, I'll look into that.

> PS. Your patch as it stands isn't safe on systems with strict alignment,
> the string passed to the verify function isn't guaranteed to be 8 bytes
> aligned. Use memcpy to fetch the next 8-byte chunk to fix.

Will do.

[1] https://github.com/lemire/fastvalidate-utf-8/tree/master/include
[2] https://lemire.me/blog/2018/10/19/validating-utf-8-bytes-using-only-0-45-cycles-per-byte-avx-edition/

--
John Naylor
EDB: http://www.enterprisedb.com

pgsql-hackers by date:

Previous
From: "Tang, Haiying"
Date:
Subject: RE: Support tab completion for upper character inputs in psql
Next
From: Amit Langote
Date:
Subject: Re: Parallel INSERT (INTO ... SELECT ...)