Re: speed up verifying UTF-8 - Mailing list pgsql-hackers

From	John Naylor
Subject	Re: speed up verifying UTF-8
Date	July 28, 2021 18:12:11
Msg-id	CAFBsxsH=jfWgo7-ToygfdjnC60C3V_N=6=EoCfQ50U3cED_W8g@mail.gmail.com
In response to	Re: speed up verifying UTF-8 (John Naylor <john.naylor@enterprisedb.com>)
List	pgsql-hackers

I wrote:

> On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
> >
> > >+ utf8_advance(s, state, len);
> > >+
> > >+ /*
> > >+ * If we saw an error during the loop, let the caller handle it. We treat
> > >+ * all other states as success.
> > >+ */
> > >+ if (state == ERR)
> > >+ return 0;
> >
> > Did you mean state = utf8_advance(s, state, len); there? (reassign state variable)
>
> Yep, that's a bug, thanks for catching!

Fixed in v21, with a regression test added. Also, utf8_advance() now updates the state directly through a passed pointer rather than returning a value. Some cosmetic changes:

s/valid_bytes/non_error_bytes/ since the former is kind of misleading now.

Some other variable name and symbol changes. In my first DFA experiment, ASC conflicted with the parser or scanner somehow, but it doesn't here, so it's clearer to use that name.

Rewrote a lot of comments about the state machine and regression tests.
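
To illustrate the utf8_advance() change mentioned above, here is a minimal sketch of the pass-by-pointer calling convention. It is not the code from the patch: the state names other than ERR, the int state type, the parameter types, and the demo_no_error() wrapper are all assumptions for illustration, and the transitions are simplified (no overlong or surrogate checks).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical, simplified states. Only ERR appears in the quoted code;
 * the other names are made up for this sketch.
 */
enum { END = 0, ERR = 1, CS1 = 2, CS2 = 3, CS3 = 4 };

/*
 * Advance *state over len bytes of s. The state is updated in place
 * through the pointer, so the caller cannot forget to reassign it.
 */
static void
utf8_advance(const unsigned char *s, int *state, size_t len)
{
    assert(state != NULL);

    for (size_t i = 0; i < len; i++)
    {
        unsigned char c = s[i];
        int     is_cont = (c & 0xC0) == 0x80;   /* continuation byte? */

        switch (*state)
        {
            case END:
                if (c < 0x80)
                    *state = END;       /* ASCII */
                else if (c >= 0xC2 && c <= 0xDF)
                    *state = CS1;       /* 2-byte lead */
                else if (c >= 0xE0 && c <= 0xEF)
                    *state = CS2;       /* 3-byte lead */
                else if (c >= 0xF0 && c <= 0xF4)
                    *state = CS3;       /* 4-byte lead */
                else
                    *state = ERR;
                break;
            case CS1:
                *state = is_cont ? END : ERR;
                break;
            case CS2:
                *state = is_cont ? CS1 : ERR;
                break;
            case CS3:
                *state = is_cont ? CS2 : ERR;
                break;
            default:
                *state = ERR;           /* the error state is sticky */
                break;
        }
    }
}

/*
 * Hypothetical caller mirroring the quoted hunk: return 0 if an error
 * was seen, and treat all other states as success.
 */
static int
demo_no_error(const unsigned char *s, size_t len)
{
    int     state = END;

    utf8_advance(s, &state, len);

    if (state == ERR)
        return 0;
    return 1;
}
```

With this shape, the mistake Vladimir spotted (calling the function and dropping its result) cannot happen, since there is no return value to drop. Note that an input ending in the middle of a multibyte sequence leaves a non-ERR state, so the sketched caller reports it as non-error, in line with the quoted comment.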
--
John Naylor
EDB: http://www.enterprisedb.com

Attachment

v21-0001-Add-fast-paths-for-validating-UTF-8-text.patch