Re: speed up verifying UTF-8 - Mailing list pgsql-hackers

From John Naylor
Subject Re: speed up verifying UTF-8
Date
Msg-id CAFBsxsHP556FGZZZi=gmYxZFXv0QjPXHWmbp1HdeFPYGVUQ-XQ@mail.gmail.com
Whole thread Raw
In response to Re: speed up verifying UTF-8  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: speed up verifying UTF-8
List pgsql-hackers
> On Mon, Jul 19, 2021 at 9:43 AM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:

> > An alternative idea: should we optimize for validation of **valid** inputs rather than optimizing the worst case?
> > In other words, what if the implementation processes all characters always and uses a slower method in case of validation failure?
> > I would guess it is more important to be faster with accepting valid input rather than "faster to reject invalid input".
>
> > static int pg_utf8_verifystr2(const unsigned char *s, int len) {
> >     if (pg_is_valid_utf8(s, s+len)) { // fast path: if string is valid, then just accept it
> >         return s + len;
> >     }
> >     // slow path: the string is not valid, perform a slower analysis
> >     return s + ....;
> > }

This turned out to be a really good idea (v19 attached):

Power8, gcc 4.8:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     888 |   607 |   179 |     777 |   1328

v19:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     809 |   472 |   223 |     558 |    805

x86 Macbook, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     674 |   346 |    78 |     309 |    504

v19:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     379 |   181 |    94 |     219 |    376

Note that the branchy code's worst case (mixed8) is here the same speed as multibyte. With Vladimir's idea * , we call check_ascii only every 8 bytes of input, not every time we verify one multibyte character. Also, we only have to check the DFA state every time we loop over 8 bytes, not every time we step through the DFA. That means we have to walk backwards at the end to find the last leading byte, but the SSE code already knew how to do that, so I used that logic here in the caller, which will allow some simplification of how the SSE code returns.

The state check is likely why the ascii case is slightly slower than v14. We could go back to checking ascii 16 bytes at a time, since there's little penalty for doing so.

* (Greg was thinking the same thing upthread, but I don't think the branchy code I posted at the time could have taken advantage of this)

I'm pretty confident this improvement is architecture-independent. Next month I'll clean this up and rebase the SSE patch over this.

I wrote:

> + /*
> + * NB: This check must be strictly greater-than, otherwise an invalid byte
> + * at the end might not get detected.
> + */
> + while (len > sizeof(__m128i))

Note to self: I actually think this isn't needed anymore since I changed how the SSE code deals with remainder sequences at the end.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: logical decoding and replication of sequences
Next
From: Zhihong Yu
Date:
Subject: Re: Have I found an interval arithmetic bug?