Re: speed up verifying UTF-8 - Mailing list pgsql-hackers

From: Heikki Linnakangas
Subject: Re: speed up verifying UTF-8
Date:
Msg-id: 2356cfd5-f8fd-0d5c-3a1c-28e170d89b6e@iki.fi
In response to: speed up verifying UTF-8  (John Naylor <john.naylor@enterprisedb.com>)
Responses: Re: speed up verifying UTF-8
           Re: speed up verifying UTF-8
List: pgsql-hackers
On 02/06/2021 19:26, John Naylor wrote:
> For v10, I've split the patch up into two parts. 0001 uses pure C 
> everywhere. This is much smaller and easier to review, and gets us the 
> most bang for the buck.
> 
> One concern Heikki raised upthread is that platforms with poor 
> unaligned-memory access will see a regression. We could easily add an 
> #ifdef to take care of that, but I haven't done so here.
> 
> To recap: On ascii-only input with storage taken out of the picture, 
> profiles of COPY FROM show a reduction from nearly 10% down to just over 
> 1%. In microbenchmarks found earlier in this thread, this works out to 
> about 7 times faster. On multibyte/mixed input, 0001 is a bit faster, 
> but not really enough to make a difference in copy performance.

Nice!

This kind of bit-twiddling is fun, so I couldn't resist tinkering with 
it, to see if we can shave some more instructions from it:

> +/* from https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord */
> +#define HAS_ZERO(chunk) ( \
> +    ((chunk) - UINT64CONST(0x0101010101010101)) & \
> +     ~(chunk) & \
> +     UINT64CONST(0x8080808080808080))
> +
> +/* Verify a chunk of bytes for valid ASCII including a zero-byte check. */
> +static inline int
> +check_ascii(const unsigned char *s, int len)
> +{
> +    uint64        half1,
> +                half2,
> +                highbits_set;
> +
> +    if (len >= 2 * sizeof(uint64))
> +    {
> +        memcpy(&half1, s, sizeof(uint64));
> +        memcpy(&half2, s + sizeof(uint64), sizeof(uint64));
> +
> +        /* If there are zero bytes, bail and let the slow path handle it. */
> +        if (HAS_ZERO(half1) || HAS_ZERO(half2))
> +            return 0;
> +
> +        /* Check if any bytes in this chunk have the high bit set. */
> +        highbits_set = ((half1 | half2) & UINT64CONST(0x8080808080808080));
> +
> +        if (!highbits_set)
> +            return 2 * sizeof(uint64);
> +        else
> +            return 0;
> +    }
> +    else
> +        return 0;
> +}

Some ideas:

1. Better to check if any high bits are set first. We care more about 
the speed of that than of detecting zero bytes, because input with high 
bits is valid but zeros are an error.

2. Since we check that there are no high bits, we can do the zero-checks 
with fewer instructions like this:

/* NB: this is only correct if 'chunk' doesn't have any high bits set */
#define HAS_ZERO(chunk) ( \
   (((chunk) + \
      UINT64CONST(0x7f7f7f7f7f7f7f7f)) & \
     UINT64CONST(0x8080808080808080)) != UINT64CONST(0x8080808080808080))

3. It's probably cheaper to combine the zero-checks for the two halves 
into a single mask test. (Note that HAS_ZERO(half1 | half2) would not 
work: a zero byte in one half is masked by a nonzero byte at the same 
position in the other half. But ANDing the two sums does work.)


Putting all that together:

/* Verify a chunk of bytes for valid ASCII including a zero-byte check. */
static inline int
check_ascii(const unsigned char *s, int len)
{
    uint64        half1,
                half2,
                highbits_set,
                x;

    if (len >= 2 * sizeof(uint64))
    {
        memcpy(&half1, s, sizeof(uint64));
        memcpy(&half2, s + sizeof(uint64), sizeof(uint64));

        /* Check if any bytes in this chunk have the high bit set. */
        highbits_set = ((half1 | half2) & UINT64CONST(0x8080808080808080));
        if (highbits_set)
            return 0;

        /*
         * Check if there are any zero bytes in this chunk. The high bit of
         * each byte of (half + 0x7f7f...) is set iff that byte is nonzero,
         * which is only correct because we already know no high bits are
         * set. ANDing the two sums catches a zero byte in either half.
         */
        x = (half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f)) &
            (half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f));
        if ((x & UINT64CONST(0x8080808080808080)) !=
            UINT64CONST(0x8080808080808080))
            return 0;

        return 2 * sizeof(uint64);
    }
    else
        return 0;
}

In quick testing, that indeed compiles into fewer instructions. With 
GCC, there's no measurable difference in performance, but with clang 
this version is much faster than the original: clang compiles the 
original into much slower code than GCC does, so this version seems to 
avoid a clang misoptimization. I tested only with ASCII input; I haven't 
tried other cases.

What test set have you been using for performance testing this? I'd like 
to know how this version compares, and I could also try running it on my 
old Raspberry Pi, which is more strict about alignment.

> 0002 adds the SSE4 implementation on x86-64, and is equally fast on all 
> input, at the cost of greater complexity.

Didn't look closely, but seems reasonable at a quick glance.

- Heikki


