Re: speed up verifying UTF-8 - Mailing list pgsql-hackers

From John Naylor
Subject Re: speed up verifying UTF-8
Msg-id CAFBsxsHR08mHEf06PvrMRstfcyPJLwF69g0r1pvRrxWD4GEVoQ@mail.gmail.com
In response to Re: speed up verifying UTF-8  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: speed up verifying UTF-8
List pgsql-hackers
Attached is v20, which has a number of improvements:

1. Cleaned up and explained DFA coding.
2. Adjusted check_ascii to return bool (now called is_valid_ascii) and to produce an optimized loop, using branch-free accumulators. That way, it doesn't need to be rewritten for different input lengths. I also think it's a bit easier to understand this way.
3. Put SSE helper functions in their own file.
4. Mostly-cosmetic edits to the configure detection.
5. Draft commit message.
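
As a rough illustration of what "branch-free accumulators" in item 2 can look like, here is a minimal sketch. It is not the code from the attached patch: the name sketch_is_valid_ascii, the 8-byte word size, and the precondition that len is a multiple of 8 are all assumptions made for the example. It also assumes that a zero byte counts as invalid, as it does for PostgreSQL text.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: returns true if every byte in s[0..len) is in the
 * range 0x01..0x7F. Assumes len is a multiple of 8; any trailing partial
 * word is ignored.
 */
static bool
sketch_is_valid_ascii(const unsigned char *s, size_t len)
{
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    uint64_t    highbit_acc = 0;         /* collects any high bit seen */
    uint64_t    nonzero_acc = high_bits; /* loses a bit if a byte was zero */

    for (size_t i = 0; i + 8 <= len; i += 8)
    {
        uint64_t    chunk;

        memcpy(&chunk, s + i, sizeof(chunk));   /* safe unaligned load */

        /* Any byte >= 0x80 leaves its high bit in the accumulator. */
        highbit_acc |= chunk;

        /*
         * Adding 0x7F to a byte in 0x01..0x7F sets its high bit without
         * carrying; a zero byte stays below 0x80. If a byte was >= 0x80
         * the sum is unreliable, but highbit_acc already rejects it.
         */
        nonzero_acc &= chunk + UINT64_C(0x7f7f7f7f7f7f7f7f);
    }

    /* The only branches are here, after the loop. */
    return (highbit_acc & high_bits) == 0 && nonzero_acc == high_bits;
}
```

Because the loop body has no data-dependent branch, the same function works for any input length that is a multiple of the word size, and the compiler is free to unroll or vectorize it.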

With #2 above in place, I wanted to try different strides for the DFA, so here are more measurements (hopefully not many more of these):

Power8, gcc 4.8

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v20, 8-byte stride:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1189 |   550 |   246 |     600 |    936

v20, 16-byte stride (in the actual patch):
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     981 |   440 |   134 |     791 |    820

v20, 32-byte stride:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     857 |   481 |   141 |     834 |    839

Based on the above, I decided that 16 bytes had the best overall balance. Other platforms may differ, but I don't expect it to make a huge amount of difference.
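
To make the stride trade-off concrete, the following sketch counts how many bytes would fall through to a byte-at-a-time DFA for a given stride. It is only an illustration under simplifying assumptions: the function names are hypothetical, the DFA itself is left out, and the sketch ignores whether the DFA is mid-sequence at a chunk boundary, which a real implementation would also have to check before skipping a chunk.

```c
#include <stddef.h>

/* Returns nonzero if none of the `stride` bytes at s has its high bit set. */
static int
sketch_chunk_is_ascii(const unsigned char *s, size_t stride)
{
    unsigned char acc = 0;

    for (size_t i = 0; i < stride; i++)
        acc |= s[i];
    return (acc & 0x80) == 0;
}

/*
 * Hypothetical sketch: number of input bytes that would be handed to the
 * slow byte-at-a-time path when pure-ASCII chunks of `stride` bytes are
 * skipped whole.
 */
static size_t
sketch_slow_path_bytes(const unsigned char *s, size_t len, size_t stride)
{
    size_t      slow = 0;

    while (len >= stride)
    {
        /* One non-ASCII byte sends the whole chunk to the slow path. */
        if (!sketch_chunk_is_ascii(s, stride))
            slow += stride;
        s += stride;
        len -= stride;
    }

    /* A tail shorter than one stride is always handled bytewise. */
    return slow + len;
}
```

For a 32-byte input containing a single two-byte character at the start, this sketch sends 8, 16, or 32 bytes to the slow path for strides of 8, 16, and 32 respectively, while a larger stride needs fewer checks on pure-ASCII input.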

Just for fun, I was also a bit curious about what Vladimir mentioned upthread about x86-64-v3 offering a different shift instruction. Somehow, clang 12 refused to build with that target, even though the release notes say it can, but gcc 11 was fine:

x86 Macbook, gcc 11, USE_FALLBACK_UTF8=1:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1200 |   728 |   370 |     544 |    637

v20:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     459 |   243 |    77 |     424 |    440

v20, CFLAGS="-march=x86-64-v3 -O2" :
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     390 |   215 |    77 |     303 |    323

And, gcc does generate the desired shift here:

objdump -S src/port/pg_utf8_fallback.o | grep shrx
      53: c4 e2 eb f7 d1               shrxq %rdx, %rcx, %rdx

While that looks good, clang does about as well by simply unrolling all 16 shifts in the loop, which gcc won't do. To be clear, this is irrelevant in practice, since x86-64-v3 includes AVX2, and if we had that we would just use it with the SIMD algorithm.
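
For context on why a variable shift sits on the critical path, here is a minimal sketch of a shift-based DFA for UTF-8. It is not the encoding used in the patch: the state offsets, the 6-bit field width, the names, and the table being rebuilt on every call are all choices made for brevity here. It follows the standard UTF-8 well-formedness rules and, unlike PostgreSQL's verifier, accepts a zero byte. Each byte maps to a 64-bit row that packs the next state for every current state, so one transition is a table load plus a shift by the current state.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Each state is the bit offset of its 6-bit field within a table row. */
enum
{
    ERR = 0, BGN = 6, CS1 = 12, CS2 = 18, CS3 = 24,
    P3A = 30, P3B = 36, P4A = 42, P4B = 48
};

/* Encode "from state -> to state"; unlisted transitions default to ERR. */
#define T(from, to) ((uint64_t) (to) << (from))

static uint64_t
sketch_row_for_byte(unsigned b)
{
    if (b <= 0x7F) return T(BGN, BGN);                      /* ASCII */
    if (b <= 0x8F) return T(CS1, BGN) | T(CS2, CS1) | T(CS3, CS2)
                        | T(P3B, CS1) | T(P4B, CS2);        /* 80..8F */
    if (b <= 0x9F) return T(CS1, BGN) | T(CS2, CS1) | T(CS3, CS2)
                        | T(P3B, CS1) | T(P4A, CS2);        /* 90..9F */
    if (b <= 0xBF) return T(CS1, BGN) | T(CS2, CS1) | T(CS3, CS2)
                        | T(P3A, CS1) | T(P4A, CS2);        /* A0..BF */
    if (b <= 0xC1) return 0;                 /* overlong 2-byte lead */
    if (b <= 0xDF) return T(BGN, CS1);       /* 2-byte lead */
    if (b == 0xE0) return T(BGN, P3A);       /* 3-byte, overlong check */
    if (b == 0xED) return T(BGN, P3B);       /* 3-byte, surrogate check */
    if (b <= 0xEF) return T(BGN, CS2);       /* other 3-byte leads */
    if (b == 0xF0) return T(BGN, P4A);       /* 4-byte, overlong check */
    if (b <= 0xF3) return T(BGN, CS3);       /* 4-byte lead */
    if (b == 0xF4) return T(BGN, P4B);       /* 4-byte, > U+10FFFF check */
    return 0;                                /* F5..FF never valid */
}

static bool
sketch_utf8_is_valid(const unsigned char *s, size_t len)
{
    uint64_t    table[256];
    uint64_t    state = BGN;

    /* Built per call only to keep the sketch short. */
    for (unsigned b = 0; b < 256; b++)
        table[b] = sketch_row_for_byte(b);

    /* One load and one variable shift per input byte, no branches. */
    for (size_t i = 0; i < len; i++)
        state = (table[s[i]] >> state) & 63;

    return state == BGN;    /* must end on a character boundary */
}
```

The shift amount in the inner loop depends on the previous iteration's result, which is why the cost of the variable shift instruction shows up directly in the per-byte latency.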

x86 Macbook, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v20, USE_FALLBACK_UTF8=1:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     351 |   172 |    88 |     349 |    350

v20, with SSE4:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     142 |    92 |    59 |     141 |    141

I'm pretty happy with the patch at this point.

--
John Naylor
EDB: http://www.enterprisedb.com