Re: [POC] verifying UTF-8 using SIMD instructions - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: [POC] verifying UTF-8 using SIMD instructions
Date
Msg-id b92ed011-02b6-1403-741d-ae3f017001b5@iki.fi
Whole thread Raw
In response to Re: [POC] verifying UTF-8 using SIMD instructions  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: [POC] verifying UTF-8 using SIMD instructions
Re: [POC] verifying UTF-8 using SIMD instructions
Re: [POC] verifying UTF-8 using SIMD instructions
List pgsql-hackers
On 07/02/2021 22:24, John Naylor wrote:
> Here is a more polished version of the function pointer approach, now 
> adapted to all multibyte encodings. Using the not-yet-committed tests 
> from [1], I found a thinko bug that resulted in the test for nul bytes 
> to not only be wrong, but probably also elided by the compiler. Doing it 
> correctly is noticeably slower on pure ascii, but still several times 
> faster than before, so the conclusions haven't changed any. I'll run 
> full measurements later this week, but I'll share the patch now for review.

As a quick test, I hacked up pg_utf8_verifystr() to use Lemire's 
algorithm from the simdjson library [1], see attached patch. I 
microbenchmarked it using the the same test I used before [2].

These results are with "gcc -O2" using "gcc (Debian 10.2.1-6) 10.2.1 
20210110"

unpatched master:

postgres=# \i mbverifystr-speed.sql
CREATE FUNCTION
  mixed | ascii
-------+-------
    728 |   393
(1 row)

v1-0001-Add-an-ASCII-fast-path-to-multibyte-encoding-veri.patch:

  mixed | ascii
-------+-------
    759 |    98
(1 row)

simdjson-utf8-hack.patch:

  mixed | ascii
-------+-------
     53 |    31
(1 row)

So clearly that algorithm is fast. Not sure if it has a high startup 
cost, or large code size, or other tradeoffs that we don't want. At 
least it depends on SIMD instructions, so it requires more code for the 
architecture-specific implementations and autoconf logic and all that. 
Nevertheless I think it deserves a closer look, I'm a bit reluctant to 
put in half-way measures, when there's a clearly superior algorithm out 
there.

I also tested the fallback implementation from the simdjson library 
(included in the patch, if you uncomment it in simdjson-glue.c):

  mixed | ascii
-------+-------
    447 |    46
(1 row)

I think we should at least try to adopt that. At a high level, it looks 
pretty similar your patch: you load the data 8 bytes at a time, check if 
there are all ASCII. If there are any non-ASCII chars, you check the 
bytes one by one, otherwise you load the next 8 bytes. Your patch should 
be able to achieve the same performance, if done right. I don't think 
the simdjson code forbids \0 bytes, so that will add a few cycles, but 
still.

[1] https://github.com/simdjson/simdjson
[2] 
https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5a2f@iki.fi

- Heikki

PS. Your patch as it stands isn't safe on systems with strict alignment, 
the string passed to the verify function isn't guaranteed to be 8 bytes 
aligned. Use memcpy to fetch the next 8-byte chunk to fix.


Attachment

pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: repeated decoding of prepared transactions
Next
From: Pavel Borisov
Date:
Subject: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.