Re: speed up verifying UTF-8 - Mailing list pgsql-hackers

From John Naylor
Subject Re: speed up verifying UTF-8
Msg-id CAFBsxsGU9osh5j16FdzrFHLPTV0sR0ccxHx5p_gRwxqEFAjsbA@mail.gmail.com
In response to Re: speed up verifying UTF-8  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: speed up verifying UTF-8
List pgsql-hackers


On Thu, Jun 3, 2021 at 3:08 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 03/06/2021 17:33, Greg Stark wrote:
> >> 3. It's probably cheaper to perform the HAS_ZERO check just once on (half1
> >> | half2). We have to compute (half1 | half2) anyway.
> >
> > Wouldn't you have to check (half1 & half2) ?
>
> Ah, you're right of course. But & is not quite right either, it will
> give false positives. That's ok from a correctness point of view here,
> because we then fall back to checking byte by byte, but I don't think
> it's a good tradeoff.

Ah, of course.
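
To make the problem with both shortcuts concrete, here is a small sketch. The helper name has_zero_byte and the sample values are made up for illustration, and it uses a simple byte loop rather than the bit trick from the patch. Combining the raw halves with | can hide a zero byte, while combining them with & can invent one:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if any of the 8 bytes in v is zero. */
static bool
has_zero_byte(uint64_t v)
{
	for (int i = 0; i < 8; i++)
	{
		if (((v >> (8 * i)) & 0xff) == 0)
			return true;
	}
	return false;
}

/* OR hides a zero: or_half1 has a zero in its lowest byte, or_half2 does not. */
static const uint64_t or_half1 = UINT64_C(0x4141414141414100);
static const uint64_t or_half2 = UINT64_C(0x4141414141414141);

/* AND invents a zero: 0x01 & 0x02 == 0x00, though neither byte is zero. */
static const uint64_t and_half1 = UINT64_C(0x0101010101010101);
static const uint64_t and_half2 = UINT64_C(0x0202020202020202);
```

With these values, has_zero_byte(or_half1 | or_half2) misses the zero in or_half1, and has_zero_byte(and_half1 & and_half2) reports a zero that neither input contains.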

>                 /*
>                  * Check if there are any zero bytes in this chunk.
>                  *
>                  * First, add 0x7f to each byte. This sets the high bit in each byte,
>                  * unless it was a zero. We already checked that none of the bytes had
>                  * the high bit set previously, so the max value each byte can have
>                  * after the addition is 0x7f + 0x7f = 0xfe, and we don't need to
>                  * worry about carrying over to the next byte.
>                  */
>                 x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
>                 x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
>
>                 /* then check that the high bit is set in each byte. */
>                 x = (x1 & x2);
>                 x &= UINT64CONST(0x8080808080808080);
>                 if (x != UINT64CONST(0x8080808080808080))
>                         return 0;

That seems right. I'll try that and update the patch. (I forgot to attach it earlier anyway.)
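
A standalone sketch of the check described in the quoted comment might look like the following. It uses plain uint64_t and UINT64_C instead of UINT64CONST, and the function name halves_have_no_zero_bytes is hypothetical, not from the patch. Like the quoted code, it assumes the caller has already verified that no byte has its high bit set. It combines the two sums with &, so a zero byte in either half clears the matching high bit (an | would let a nonzero byte in one half mask a zero in the other):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if neither half contains a zero byte.
 * Precondition (assumed, not checked here): no byte in either half has its
 * high bit set, so adding 0x7f per byte cannot carry into the next byte.
 */
static bool
halves_have_no_zero_bytes(uint64_t half1, uint64_t half2)
{
	const uint64_t low7 = UINT64_C(0x7f7f7f7f7f7f7f7f);
	const uint64_t high = UINT64_C(0x8080808080808080);

	/* Adding 0x7f sets the high bit of every byte that was nonzero. */
	uint64_t	x1 = half1 + low7;
	uint64_t	x2 = half2 + low7;

	/* Every high bit must survive in both halves. */
	return ((x1 & x2) & high) == high;
}
```

Bytes such as 0x01 and 0x02 in matching positions are handled correctly here, because the & is applied after the addition rather than to the raw input.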

--
John Naylor
EDB: http://www.enterprisedb.com
