Re: [POC] verifying UTF-8 using SIMD instructions - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: [POC] verifying UTF-8 using SIMD instructions
Date
Msg-id bc278857-f70b-716b-558f-6f9c1475fb61@iki.fi
Whole thread Raw
In response to [POC] verifying UTF-8 using SIMD instructions  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: [POC] verifying UTF-8 using SIMD instructions
List pgsql-hackers
On 01/02/2021 19:32, John Naylor wrote:
> It makes sense to start with the ascii subset of UTF-8 for a couple 
> reasons. First, ascii is very widespread in database content, 
> particularly in bulk loads. Second, ascii can be validated using the 
> simple SSE2 intrinsics that come with (I believe) any x64-64 chip, and 
> I'm guessing we can detect that at compile time and not mess with 
> runtime checks. The examples above using SSE for the general case are 
> much more complicated and involve SSE 4.2 or AVX.

I wonder how using SSE compares with dealing with 64 or 32-bit words at 
a time, using regular instructions? That would be more portable.

> Here are some numbers on my laptop (MacOS/clang 10 -- if the concept is 
> okay, I'll do Linux/gcc and add more inputs). The test is the same as 
> Heikki shared in [3], but I added a case with >95% Chinese characters 
> just to show how that compares to the mixed ascii/multibyte case.
> 
> master:
> 
>   chinese | mixed | ascii
> ---------+-------+-------
>      1081 |   761 |   366
> 
> patch:
> 
>   chinese | mixed | ascii
> ---------+-------+-------
>      1103 |   498 |    51
> 
> The speedup in the pure ascii case is nice.

Yep.

> In the attached POC, I just have a pro forma portability stub, and left 
> full portability detection for later. The fast path is inlined inside 
> pg_utf8_verifystr(). I imagine the ascii fast path could be abstracted 
> into a separate function to which is passed a function pointer for full 
> encoding validation. That would allow other encodings with strict ascii 
> subsets to use this as well, but coding that abstraction might be a 
> little messy, and b80e10638e3 already gives a performance boost over PG13.

All supported encodings are ASCII subsets. Might be best to putt the 
ASCII-check into a static inline function and use it in all the verify 
functions. I presume it's only a few instructions, and these functions 
can be pretty performance sensitive.

> I also gave a shot at doing full UTF-8 recognition using a DFA, but so 
> far that has made performance worse. If I ever have more success with 
> that, I'll add that in the mix.

That's disappointing. Perhaps the SIMD algorithms have higher startup 
costs, so that you need longer inputs to benefit? In that case, it might 
make sense to check the length of the input and only use the SIMD 
algorithm if the input is long enough.

- Heikki



pgsql-hackers by date:

Previous
From: Magnus Hagander
Date:
Subject: Re: Proposal: Save user's original authenticated identity for logging
Next
From: Peter Eisentraut
Date:
Subject: Re: SEARCH and CYCLE clauses