Attached is v20, which has a number of improvements:
1. Cleaned up and explained DFA coding.
2. Adjusted check_ascii to return bool (it's now called is_valid_ascii) and to use branch-free accumulators in the loop, so it no longer needs to be rewritten for different input lengths. I also think it's a bit easier to understand this way.
3. Put SSE helper functions in their own file.
4. Mostly-cosmetic edits to the configure detection.
5. Draft commit message.
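To illustrate the idea behind #2 (not the patch's exact coding, and the names here are made up for the sketch): OR the input into a 64-bit accumulator in chunks and test the high bits once at the end, so the loop body has no data-dependent branches and the same function works for any input length:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Branch-free ASCII check, sketched: accumulate all bytes with OR and
 * test bit 7 of every byte position in one final comparison.
 */
bool
is_valid_ascii_sketch(const unsigned char *s, size_t len)
{
	uint64_t	accum = 0;

	/* Main loop: fold 8 bytes at a time into the accumulator. */
	while (len >= sizeof(uint64_t))
	{
		uint64_t	chunk;

		memcpy(&chunk, s, sizeof(chunk));	/* safe unaligned load */
		accum |= chunk;						/* no branches on the data */
		s += sizeof(chunk);
		len -= sizeof(chunk);
	}

	/* Fold any leftover tail bytes in the same way. */
	while (len > 0)
	{
		accum |= *s++;
		len--;
	}

	/* Valid ASCII iff no byte ever had its high bit set. */
	return (accum & UINT64_C(0x8080808080808080)) == 0;
}
```

The accumulator approach is why the stride experiments below were cheap to run: only the chunking changes, not the validity test.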
With #2 above in place, I wanted to try different strides for the DFA, so here are more measurements (hopefully not many more of these):
Power8, gcc 4.8
HEAD:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
2944 | 1523 | 871 | 1473 | 1509
v20, 8-byte stride:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
1189 | 550 | 246 | 600 | 936
v20, 16-byte stride (in the actual patch):
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
981 | 440 | 134 | 791 | 820
v20, 32-byte stride:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
857 | 481 | 141 | 834 | 839
Based on the above, I decided that a 16-byte stride has the best overall balance. Other platforms may differ, but I don't expect it to make a huge difference.
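For anyone following along, the shape of the strided loop looks roughly like this. This is only a sketch with made-up names, and it uses plain range checks per sequence rather than the patch's table-driven DFA, but the control flow is the same: try an all-ASCII fast path one stride at a time, and fall back to per-sequence validation when a stride contains high bits:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STRIDE 16

/*
 * Validate one UTF-8 sequence at *s (len bytes available); return its
 * length, or 0 if invalid or truncated.  Bounds follow RFC 3629: the
 * first continuation byte is restricted for E0/ED/F0/F4 leads to reject
 * overlong forms and surrogates.
 */
int
utf8_seq(const unsigned char *s, size_t len)
{
	unsigned char b = s[0];
	unsigned char lo = 0x80, hi = 0xBF;
	int			nbytes;

	if (b < 0x80)
		return 1;
	else if (b >= 0xC2 && b <= 0xDF)
		nbytes = 2;
	else if (b == 0xE0) { nbytes = 3; lo = 0xA0; }
	else if (b >= 0xE1 && b <= 0xEC)
		nbytes = 3;
	else if (b == 0xED) { nbytes = 3; hi = 0x9F; }
	else if (b == 0xEE || b == 0xEF)
		nbytes = 3;
	else if (b == 0xF0) { nbytes = 4; lo = 0x90; }
	else if (b >= 0xF1 && b <= 0xF3)
		nbytes = 4;
	else if (b == 0xF4) { nbytes = 4; hi = 0x8F; }
	else
		return 0;				/* C0, C1, F5-FF, or bare continuation */

	if (len < (size_t) nbytes)
		return 0;
	for (int i = 1; i < nbytes; i++)
	{
		if (s[i] < lo || s[i] > hi)
			return 0;
		lo = 0x80;				/* only the first continuation is special */
		hi = 0xBF;
	}
	return nbytes;
}

bool
validate_utf8_strided(const unsigned char *s, size_t len)
{
	while (len > 0)
	{
		/* Fast path: skip a whole stride if it is pure ASCII. */
		if (len >= STRIDE)
		{
			uint64_t	c1, c2;

			memcpy(&c1, s, 8);
			memcpy(&c2, s + 8, 8);
			if (((c1 | c2) & UINT64_C(0x8080808080808080)) == 0)
			{
				s += STRIDE;
				len -= STRIDE;
				continue;
			}
		}

		/* Slow path: validate one complete sequence, then retry. */
		int			n = utf8_seq(s, len);

		if (n == 0)
			return false;
		s += n;
		len -= n;
	}
	return true;
}
```

Because the slow path always consumes complete sequences, the fast path is only ever tried at a sequence boundary, which sidesteps the question of multibyte sequences straddling a stride.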
Just for fun, I was also a bit curious about what Vladimir mentioned upthread about x86-64-v3 offering a different shift instruction. Somehow, clang 12 refused to build with that target, even though the release notes say it can, but gcc 11 was fine:
x86 Macbook, gcc 11, USE_FALLBACK_UTF8=1:
HEAD:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
1200 | 728 | 370 | 544 | 637
v20:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
459 | 243 | 77 | 424 | 440
v20, CFLAGS="-march=x86-64-v3 -O2" :
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
390 | 215 | 77 | 303 | 323
And gcc does generate the desired shift here:
objdump -S src/port/pg_utf8_fallback.o | grep shrx
53: c4 e2 eb f7 d1 shrxq %rdx, %rcx, %rdx
While that looks good, clang does about as well by simply unrolling all 16 shifts in the loop, which gcc won't do. To be clear, this is irrelevant in practice: x86-64-v3 implies AVX2, and if we had AVX2 we would just use it with the SIMD algorithm.
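For context, the pattern that compiles to shrx is just a 64-bit right shift by a runtime-variable amount, as in this sketch (placeholder names and table, not the patch's transition values):

```c
#include <stdint.h>

/*
 * The kind of expression that becomes shrx under -march=x86-64-v3:
 * each input byte selects a packed table row, and the current state
 * selects a bit offset within it.  BMI2's shrx takes the shift count
 * in any register and doesn't clobber flags, which helps keep the
 * per-byte loop tight.
 */
uint64_t
dfa_step(const uint64_t *transitions, uint64_t state, unsigned char byte)
{
	return transitions[byte] >> (state & 63);
}
```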
Macbook x86, clang 12:
HEAD:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
974 | 691 | 370 | 456 | 526
v20, USE_FALLBACK_UTF8=1:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
351 | 172 | 88 | 349 | 350
v20, with SSE4:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
142 | 92 | 59 | 141 | 141
I'm pretty happy with the patch at this point.
--
John Naylor
EDB:
http://www.enterprisedb.com