> - check_ascii() seems to be used only for 64-bit chunks. So why not
> function. We can rename it to check_ascii64() for clarity.
Well yes, but there's nothing so intrinsic to 64 bits that the name needs to reflect that. Earlier versions worked on 16 bytes at a time. The compiler will optimize away the len check, but we could replace it with an assert instead.
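To make the assert idea concrete, here is a minimal sketch of what such a chunk check could look like. It is not the code from any of the posted patches: the name check_ascii_sketch is made up, and rejecting zero bytes along with non-ASCII bytes is my assumption about what the caller wants.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch, not the patch's code. Returns 8 if the next eight
 * bytes at s are all non-zero ASCII, else 0. The caller must guarantee
 * that at least eight bytes are readable.
 */
static inline int
check_ascii_sketch(const unsigned char *s, int len)
{
	uint64_t	chunk;

	/* State the precondition instead of testing it at run time. */
	assert(len >= (int) sizeof(chunk));

	/* memcpy sidesteps alignment issues; compilers emit a plain load. */
	memcpy(&chunk, s, sizeof(chunk));

	/* Any byte with the high bit set is non-ASCII. */
	if (chunk & UINT64_C(0x8080808080808080))
		return 0;

	/*
	 * Every byte is now <= 0x7F, so adding 0x7F to each byte cannot carry
	 * into its neighbour. Afterwards the high bit is set in exactly those
	 * bytes that were non-zero. (Assumption: zero bytes are rejected.)
	 */
	chunk += UINT64_C(0x7f7f7f7f7f7f7f7f);
	if ((chunk & UINT64_C(0x8080808080808080)) != UINT64_C(0x8080808080808080))
		return 0;

	return (int) sizeof(chunk);
}
```

With the precondition written as an assertion, the length test costs nothing in non-assert builds, and nothing in the body depends on the chunk being 64 bits beyond the width of the constants.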
> - I was thinking, why not have a pg_utf8_verify64() that processes
> 64-bit chunks (or a 32-bit version). In check_ascii(), we anyway
> extract a 64-bit chunk from the string. We can use the same chunk to
> extract the required bits from a two byte char or a 4 byte char. This
> way we can avoid extraction of separate bytes like b1 = *s; b2 = s[1]
> etc.
Loading bytes from L1 cache is really fast -- I wouldn't even call it "extraction".
> More importantly, we can avoid the separate continuation-char
> checks for each individual byte.
On a pipelined superscalar CPU, I wouldn't expect it to matter in the slightest.
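For concreteness, the two styles of continuation check being compared look roughly like this for a two-byte sequence. This is only an illustration, not code from any of the patches, and all the function names are made up. The bytes are assembled in a fixed order so the constants don't depend on endianness.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Byte-at-a-time style: separate loads, separate checks. */
static inline bool
valid_2byte_bytewise(const unsigned char *s)
{
	unsigned char b1 = s[0];
	unsigned char b2 = s[1];

	if ((b1 & 0xE0) != 0xC0)	/* lead byte must be 110xxxxx */
		return false;
	if ((b2 & 0xC0) != 0x80)	/* continuation must be 10xxxxxx */
		return false;
	return b1 >= 0xC2;			/* 0xC0 and 0xC1 would be overlong */
}

/* Combined style: one mask-and-compare covers both bytes. */
static inline bool
valid_2byte_combined(const unsigned char *s)
{
	uint32_t	w = ((uint32_t) s[0] << 8) | s[1];

	if ((w & 0xE0C0) != 0xC080)	/* 110xxxxx 10xxxxxx in one test */
		return false;
	return w >= 0xC200;			/* same overlong check on the pair */
}

/* Exhaustively confirm that the two styles agree on every byte pair. */
static void
check_styles_agree(void)
{
	for (unsigned int i = 0; i < 0x10000; i++)
	{
		unsigned char s[2] = {(unsigned char) (i >> 8), (unsigned char) (i & 0xFF)};

		assert(valid_2byte_bytewise(s) == valid_2byte_combined(s));
	}
}
```

The two functions accept exactly the same inputs, so the only open question is which form the compiler and CPU handle better, and that has to be measured.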
> Additionally, we can try to simplify
> the subsequent overlong or surrogate char checks. Something like this
My recent experience with itemptrs has made me skeptical of this kind of thing, but the idea was interesting enough that I couldn't resist trying it out. I have two attempts, which are attached as v16*.txt and apply independently. They are rough, and some comments are now lies. To simplify the constants, I shift down to uint32, and I didn't bother working around that. v16alpha regressed on worst-case input, so for v16beta I went back to the earlier coding for the one-byte ASCII check. That helped, but it's still slower than v14.
That was not unexpected, but I was mildly shocked to find out that v15 is also slower than the v14 that Heikki posted. The only non-cosmetic difference is using pg_utf8_verifychar_internal within pg_utf8_verifychar. I'm not sure why it would make such a big difference here. The numbers on Power8 / gcc 4.8 (little endian):

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2951 |  1521 |   871 |    1474 |   1508

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     885 |   607 |   179 |     774 |   1325

v15:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1085 |   671 |   180 |    1032 |   1799

v16alpha:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1268 |   822 |   180 |    1410 |   2518

v16beta:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1096 |   654 |   182 |     814 |   1403

As it stands now, for v17 I'm inclined to go back to v15, but without the attempt at being clever that seems to have slowed it down from v14.
Any interest in testing on 64-bit Arm?
--
John Naylor
EDB: http://www.enterprisedb.com