Re: speed up verifying UTF-8 - Mailing list pgsql-hackers

From	John Naylor
Subject	Re: speed up verifying UTF-8
Date	July 26, 2021 14:09:00
Msg-id	CAFBsxsHR08mHEf06PvrMRstfcyPJLwF69g0r1pvRrxWD4GEVoQ@mail.gmail.com
In response to	Re: speed up verifying UTF-8 (John Naylor <john.naylor@enterprisedb.com>)
Responses	Re: speed up verifying UTF-8
List	pgsql-hackers

Attached is v20, which has a number of improvements:

1. Cleaned up and explained DFA coding.
2. Adjusted check_ascii to return bool (now called is_valid_ascii) and to produce an optimized loop, using branch-free accumulators. That way, it doesn't need to be rewritten for different input lengths. I also think it's a bit easier to understand this way.
3. Put SSE helper functions in their own file.
4. Mostly-cosmetic edits to the configure detection.
5. Draft commit message.
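
To illustrate item 2, here is a minimal sketch of a branch-free ASCII check that works for any input length that is a multiple of 8 bytes. It is not the code from the attached patch: the name sketch_is_valid_ascii is hypothetical, and treating a zero byte as invalid is an assumption made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch, not the patch's implementation.
 * Returns true if all "len" bytes are in the range 0x01..0x7F.
 * "len" must be a multiple of 8.
 */
static bool
sketch_is_valid_ascii(const unsigned char *s, int len)
{
    uint64_t chunk;
    /* Accumulates every bit seen, so any high bit set shows up at the end. */
    uint64_t highbit_cum = 0;
    /* Starts with the high bit set in each lane; a zero byte clears its lane. */
    uint64_t zero_cum = UINT64_C(0x8080808080808080);

    assert(len % (int) sizeof(chunk) == 0);

    while (len > 0)
    {
        /* memcpy avoids unaligned-access problems; compilers emit a plain load. */
        memcpy(&chunk, s, sizeof(chunk));

        /*
         * Adding 0x7F to a byte in 0x01..0x7F sets its high bit without
         * carrying into the next byte; a zero byte leaves the high bit clear.
         * Bytes above 0x7F may carry, but those are rejected by highbit_cum.
         */
        zero_cum &= chunk + UINT64_C(0x7f7f7f7f7f7f7f7f);
        highbit_cum |= chunk;

        s += sizeof(chunk);
        len -= (int) sizeof(chunk);
    }

    /* Both tests happen once, after the loop, so the loop body has no branches. */
    if (highbit_cum & UINT64_C(0x8080808080808080))
        return false;
    if (zero_cum != UINT64_C(0x8080808080808080))
        return false;
    return true;
}
```

Because the loop only accumulates and the decision is made once at the end, the same function can be called with 8, 16, or 32 bytes without being rewritten.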

With #2 above in place, I wanted to try different strides for the DFA, so here are more measurements (hopefully not many more of these):

Power8, gcc 4.8

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v20, 8-byte stride:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1189 |   550 |   246 |     600 |    936

v20, 16-byte stride (in the actual patch):
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     981 |   440 |   134 |     791 |    820

v20, 32-byte stride:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     857 |   481 |   141 |     834 |    839

Based on the above, I decided that 16 bytes had the best overall balance. Other platforms may differ, but I don't expect it to make a huge amount of difference.

Just for fun, I was also a bit curious about what Vladimir mentioned upthread about x86-64-v3 offering a different shift instruction. Somehow, clang 12 refused to build with that target, even though the release notes say it can, but gcc 11 was fine:

x86 Macbook, gcc 11, USE_FALLBACK_UTF8=1:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1200 |   728 |   370 |     544 |    637

v20:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     459 |   243 |    77 |     424 |    440

v20, CFLAGS="-march=x86-64-v3 -O2" :
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     390 |   215 |    77 |     303 |    323

And, gcc does generate the desired shift here:

objdump -S src/port/pg_utf8_fallback.o | grep shrx
53: c4 e2 eb f7 d1 shrxq %rdx, %rcx, %rdx

While it looks good, clang can do about as well by simply unrolling all 16 shifts in the loop, which gcc won't do. To be clear, this is irrelevant in practice, since x86-64-v3 includes AVX2, and if we had that we would just use it with the SIMD algorithm.
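
For readers wondering why a variable shift such as shrx matters here: in a shift-based DFA, each input byte costs one table load plus one right shift by the current state. The following is a minimal sketch of that technique for UTF-8, not the code in the attached patch. The state names, the 6-bit state width, the table built at run time, and the rejection of zero bytes are all assumptions made for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical state encoding: each state is the bit offset of its own
 * 6-bit slot inside a 64-bit table entry. ERR is 0, so any transition
 * that is never registered leads to the error state.
 */
enum
{
    ERR = 0, BGN = 6, CS1 = 12, CS2 = 18, CS3 = 24,
    P3A = 30, P3B = 36, P4A = 42, P4B = 48
};

static uint64_t sketch_table[256];

/* Register "in from_state, bytes lo..hi lead to to_state". */
static void
add_range(int from_state, int lo, int hi, int to_state)
{
    for (int b = lo; b <= hi; b++)
        sketch_table[b] |= (uint64_t) to_state << from_state;
}

static void
sketch_init_table(void)
{
    memset(sketch_table, 0, sizeof(sketch_table));
    add_range(BGN, 0x01, 0x7F, BGN);    /* ASCII, excluding NUL */
    add_range(BGN, 0xC2, 0xDF, CS1);    /* 2-byte lead */
    add_range(BGN, 0xE0, 0xE0, P3A);    /* 3-byte lead, overlong check */
    add_range(BGN, 0xE1, 0xEC, CS2);    /* 3-byte lead */
    add_range(BGN, 0xED, 0xED, P3B);    /* 3-byte lead, surrogate check */
    add_range(BGN, 0xEE, 0xEF, CS2);    /* 3-byte lead */
    add_range(BGN, 0xF0, 0xF0, P4A);    /* 4-byte lead, overlong check */
    add_range(BGN, 0xF1, 0xF3, CS3);    /* 4-byte lead */
    add_range(BGN, 0xF4, 0xF4, P4B);    /* 4-byte lead, range check */
    add_range(CS1, 0x80, 0xBF, BGN);    /* last continuation byte */
    add_range(CS2, 0x80, 0xBF, CS1);
    add_range(CS3, 0x80, 0xBF, CS2);
    add_range(P3A, 0xA0, 0xBF, CS1);    /* after E0 */
    add_range(P3B, 0x80, 0x9F, CS1);    /* after ED */
    add_range(P4A, 0x90, 0xBF, CS2);    /* after F0 */
    add_range(P4B, 0x80, 0x8F, CS2);    /* after F4 */
}

/* Call sketch_init_table() once before using this function. */
static bool
sketch_utf8_valid(const unsigned char *s, size_t len)
{
    uint64_t state = BGN;

    /* One load, one variable shift, and one mask per byte; no branches. */
    for (size_t i = 0; i < len; i++)
        state = (sketch_table[s[i]] >> state) & 63;

    /* ERR is sticky: bits 0..5 of every entry are zero. */
    return state == BGN;
}
```

The shift count depends on the previous iteration's result, so the shift sits on the loop's critical path. That is why the choice of shift instruction, or unrolling the shifts, can show up in the numbers.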

Macbook x86, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v20, USE_FALLBACK_UTF8=1:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     351 |   172 |    88 |     349 |    350

v20, with SSE4:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     142 |    92 |    59 |     141 |    141

I'm pretty happy with the patch at this point.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment

v20-0001-Add-a-fast-path-for-validating-UTF-8-text.patch
