Re: speed up verifying UTF-8 - Mailing list pgsql-hackers

From Amit Khandekar
Subject Re: speed up verifying UTF-8
Date
Msg-id CAJ3gD9ejC+puY=Lgco2SGyD4tR46kye7qLZoskW0PXumtLcCpQ@mail.gmail.com
Whole thread Raw
In response to Re: speed up verifying UTF-8  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: speed up verifying UTF-8  (John Naylor <john.naylor@enterprisedb.com>)
List pgsql-hackers
On Tue, 13 Jul 2021 at 01:15, John Naylor <john.naylor@enterprisedb.com> wrote:
> > It seems like it would be easy to have pg_utf8_verify_one in my proposed pg_utf8.h header and replace the body of
pg_utf8_verifycharwith it. 
>
> 0001: I went ahead and tried this for v15, and also attempted some clean-up:
>
> - Rename pg_utf8_verify_one to pg_utf8_verifychar_internal.
> - Have pg_utf8_verifychar_internal return -1 for invalid input to match other functions in the file. We could also do
thisfor check_ascii, but it's not quite the same thing, because the string could still have valid bytes in it, just not
enoughto advance the pointer by the stride length. 
> - Remove hard-coded numbers (not wedded to this).
>
> - Use a call to pg_utf8_verifychar in the slow path.
> - Reduce pg_utf8_verifychar to thin wrapper around pg_utf8_verifychar_internal.

- check_ascii() seems to be used only for 64-bit chunks. So why not
remove the len argument and the len <= sizeof(int64) checks inside the
function. We can rename it to check_ascii64() for clarity.

- I was thinking, why not have a pg_utf8_verify64() that processes
64-bit chunks (or a 32-bit version). In check_ascii(), we anyway
extract a 64-bit chunk from the string. We can use the same chunk to
extract the required bits from a two byte char or a 4 byte char. This
way we can avoid extraction of separate bytes like b1 = *s; b2 = s[1]
etc. More importantly, we can avoid the separate continuation-char
checks for each individual byte. Additionally, we can try to simplify
the subsequent overlong or surrogate char checks. Something like this
:

int pg_utf8_verifychar_32(uint32 chunk)
{
    int        len, l;

    for (len = sizeof(chunk); len > 0; (len -= l), (chunk = chunk << l))
    {
        /* Is 2-byte lead */
        if ((chunk & 0xF0000000) == 0xC0000000)
        {
            l = 2;
            /* .......  .......  */
        }
        /* Is 3-byte lead */
        else if ((chunk & 0xF0000000) == 0xE0000000)
        {
            l = 3;
            if (len < l)
                break;

            /* b2 and b3 should be continuation bytes */
            if ((chunk & 0x00C0C000) != 0x00808000)
                return sizeof(chunk) - len;

            switch (chunk & 0xFF200000)
            {
                /* check 3-byte overlong: 1110.0000 1001.xxxx 10xx.xxxx
                 * i.e. (b1 == 0xE0 && b2 < 0xA0). We already know b2
is of the form
                 * 10xx since it's a continuation char. Additionally
condition b2 <=
                 * 0x9F means it is of the form 100x.xxxx.  i.e.
either 1000.xxxx
                 * or 1001.xxxx. So just verify that it is xx0x.xxxx
                 */
                case 0xE0000000:
                    return sizeof(chunk) - len;

                /* check surrogate: 1110.1101 101x.xxxx 10xx.xxxx
                 * i.e. (b1 == 0xED && b2 > 0x9F): Here, > 0x9F means either
                 * 1010.xxxx, 1011.xxxx, 1100.xxxx, or 1110.xxxx. Last
two are not
                 * possible because b2 is a continuation char. So it has to be
                 * first two. So just verify that it is xx1x.xxxx
                 */
                case 0xED200000:
                    return sizeof(chunk) - len;
                default:
                    ;
            }

        }
        /* Is 4-byte lead */
        else if ((chunk & 0xF0000000) == 0xF0000000)
        {
            /* .........  */
            l = 4;
        }
        else
            return sizeof(chunk) - len;
    }
    return sizeof(chunk) - len;
}



pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: ERROR: "ft1" is of the wrong type.
Next
From: "osumi.takamichi@fujitsu.com"
Date:
Subject: RE: logical replication empty transactions