Re: [POC] verifying UTF-8 using SIMD instructions - Mailing list pgsql-hackers

From John Naylor
Subject Re: [POC] verifying UTF-8 using SIMD instructions
Msg-id CAFBsxsFGMif55e1doEL_X6+CvK7sCaFpDLEj4OQ+0GSfx-_wSQ@mail.gmail.com
In response to Re: [POC] verifying UTF-8 using SIMD instructions  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: [POC] verifying UTF-8 using SIMD instructions
List pgsql-hackers


On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
>   mixed | ascii
> -------+-------
>     447 |    46
> (1 row)
>
> I think we should at least try to adopt that. At a high level, it looks
> pretty similar to your patch: you load the data 8 bytes at a time, check if
> they are all ASCII. If there are any non-ASCII chars, you check the
> bytes one by one, otherwise you load the next 8 bytes. Your patch should
> be able to achieve the same performance, if done right. I don't think
> the simdjson code forbids \0 bytes, so that will add a few cycles, but
> still.

That fallback is very similar to my "inline C" case upthread, and both actually check 16 bytes at a time (the comment in the patch you shared is wrong). Working back from there, here is how the performance changes with each difference (macOS, clang 10 only):

master

 mixed | ascii
-------+-------
   757 |   366

v1, but using memcpy()

 mixed | ascii
-------+-------
   601 |   129

remove zero-byte check:

 mixed | ascii
-------+-------
   588 |    93

inline ascii fastpath into pg_utf8_verifystr()

 mixed | ascii
-------+-------
   595 |    71

use 16-byte stride

 mixed | ascii
-------+-------
   652 |    49

With this CPU and compiler, v1 is fastest on the mixed input, all else being equal.
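
To make the "16-byte stride" variant concrete, here is a minimal sketch of an ASCII fast path that loads two 8-byte words with memcpy() and tests their high bits together. It is not the code from either patch: the names chunk16_is_ascii() and ascii_prefix_len() are made up for illustration, and the zero-byte check is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true if the 16 bytes starting at s are all ASCII. */
static inline bool
chunk16_is_ascii(const unsigned char *s)
{
	uint64_t	half1;
	uint64_t	half2;

	assert(s != NULL);

	/* memcpy() avoids unaligned-access and strict-aliasing problems */
	memcpy(&half1, s, sizeof(half1));
	memcpy(&half2, s + sizeof(half1), sizeof(half2));

	/* any byte >= 0x80 has its high bit set */
	return ((half1 | half2) & UINT64_C(0x8080808080808080)) == 0;
}

/*
 * Hypothetical driver: number of leading bytes that were verified as ASCII
 * in whole 16-byte chunks.  The caller would handle the remainder byte by
 * byte with the regular verifier.
 */
static size_t
ascii_prefix_len(const unsigned char *s, size_t len)
{
	size_t		done = 0;

	while (len - done >= 16 && chunk16_is_ascii(s + done))
		done += 16;

	return done;
}
```

A caller would skip the returned number of bytes and fall back to the byte-at-a-time path from there. A wider stride means fewer branches on pure-ASCII input, but one non-ASCII byte sends the whole chunk to the slow path, which may be part of why the mixed case got slower here.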

Maybe there's a smarter way to check for zeros in C. Or maybe we need to be more careful about cache -- running memchr() over the whole input first might not be the best thing to do.
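
One possible direction, sketched here as an assumption and not as something tested in this thread: fold the zero-byte check into the same pass as the ASCII check, using the well-known "has zero byte" word trick, so the input is not scanned separately by memchr(). The helper name chunk8_is_nonzero_ascii() is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define HIGH_BITS	UINT64_C(0x8080808080808080)
#define LOW_ONES	UINT64_C(0x0101010101010101)

/*
 * Hypothetical helper: true if all 8 bytes starting at s are in the range
 * 0x01..0x7F, i.e. ASCII with no embedded NUL.
 */
static inline bool
chunk8_is_nonzero_ascii(const unsigned char *s)
{
	uint64_t	v;

	assert(s != NULL);
	memcpy(&v, s, sizeof(v));

	/* reject any byte with the high bit set (non-ASCII) */
	if (v & HIGH_BITS)
		return false;

	/*
	 * Classic word-at-a-time test: the expression is nonzero if and only if
	 * at least one byte of v is 0x00.
	 */
	if ((v - LOW_ONES) & ~v & HIGH_BITS)
		return false;

	return true;
}
```

Whether this beats a separate memchr() pass would need measuring; it trades one extra subtract-and-mask per word for touching each cache line only once.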

--
John Naylor
EDB: http://www.enterprisedb.com
