Re: speed up verifying UTF-8 - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: speed up verifying UTF-8 |
Date | |
Msg-id | 2356cfd5-f8fd-0d5c-3a1c-28e170d89b6e@iki.fi Whole thread Raw |
In response to | speed up verifying UTF-8 (John Naylor <john.naylor@enterprisedb.com>) |
Responses |
Re: speed up verifying UTF-8
Re: speed up verifying UTF-8 |
List | pgsql-hackers |
On 02/06/2021 19:26, John Naylor wrote: > For v10, I've split the patch up into two parts. 0001 uses pure C > everywhere. This is much smaller and easier to review, and gets us the > most bang for the buck. > > One concern Heikki raised upthread is that platforms with poor > unaligned-memory access will see a regression. We could easily add an > #ifdef to take care of that, but I haven't done so here. > > To recap: On ascii-only input with storage taken out of the picture, > profiles of COPY FROM show a reduction from nealy 10% down to just over > 1%. In microbenchmarks found earlier in this thread, this works out to > about 7 times faster. On multibyte/mixed input, 0001 is a bit faster, > but not really enough to make a difference in copy performance. Nice! This kind of bit-twiddling is fun, so I couldn't resist tinkering with it, to see if we can shave some more instructions from it: > +/* from https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord */ > +#define HAS_ZERO(chunk) ( \ > + ((chunk) - UINT64CONST(0x0101010101010101)) & \ > + ~(chunk) & \ > + UINT64CONST(0x8080808080808080)) > + > +/* Verify a chunk of bytes for valid ASCII including a zero-byte check. */ > +static inline int > +check_ascii(const unsigned char *s, int len) > +{ > + uint64 half1, > + half2, > + highbits_set; > + > + if (len >= 2 * sizeof(uint64)) > + { > + memcpy(&half1, s, sizeof(uint64)); > + memcpy(&half2, s + sizeof(uint64), sizeof(uint64)); > + > + /* If there are zero bytes, bail and let the slow path handle it. */ > + if (HAS_ZERO(half1) || HAS_ZERO(half2)) > + return 0; > + > + /* Check if any bytes in this chunk have the high bit set. */ > + highbits_set = ((half1 | half2) & UINT64CONST(0x8080808080808080)); > + > + if (!highbits_set) > + return 2 * sizeof(uint64); > + else > + return 0; > + } > + else > + return 0; > +} Some ideas: 1. Better to check if any high bits are set first. We care more about the speed of that than of detecting zero bytes, because input with high bits is valid but zeros are an error. 2. Since we check that there are no high bits, we can do the zero-checks with fewer instructions like this: /* NB: this is only correct if 'chunk' doesn't have any high bits set */ #define HAS_ZERO(chunk) ( \ ((chunk) + \ UINT64CONST(0x7f7f7f7f7f7f7f7f)) & \ UINT64CONST(0x8080808080808080) == UINT64CONST(0x8080808080808080)) 3. It's probably cheaper perform the HAS_ZERO check just once on (half1 | half2). We have to compute (half1 | half2) anyway. Putting all that together: /* Verify a chunk of bytes for valid ASCII including a zero-byte check. */ static inline int check_ascii(const unsigned char *s, int len) { uint64 half1, half2, highbits_set; uint64 x; if (len >= 2 * sizeof(uint64)) { memcpy(&half1, s, sizeof(uint64)); memcpy(&half2, s + sizeof(uint64), sizeof(uint64)); /* Check if any bytes in this chunk have the high bit set. */ highbits_set = ((half1 | half2) & UINT64CONST(0x8080808080808080)); if (highbits_set) return 0; /* * Check if there are any zero bytes in this chunk. This is only correct * if there are no high bits set, but we checked that already. */ x = (half1 | half2) + UINT64CONST(0x7f7f7f7f7f7f7f7f); x &= UINT64CONST(0x8080808080808080); if (x != UINT64CONST(0x8080808080808080)) return 0; return 2 * sizeof(uint64); } else return 0; } In quick testing, that indeed compiles into fewer instructions. With GCC, there's no measurable difference in performance. But with clang, this version is much faster than the original, because the original version is much slower than when compiled with GCC. In other words, this version seems to avoid some clang misoptimization. I tested only with ASCII input, I haven't tried other cases. What test set have you been using for performance testing this? I'd like to know how this version compares, and I could also try running it on my old raspberry pi, which is more strict about alignmnt. > 0002 adds the SSE4 implementation on x86-64, and is equally fast on all > input, at the cost of greater complexity. Didn't look closely, but seems reasonable at a quick glance. - Heikki
pgsql-hackers by date: