Re: speed up verifying UTF-8 - Mailing list pgsql-hackers

From John Naylor
Subject Re: speed up verifying UTF-8
Msg-id CAFBsxsGdHEeci+pQNtVXT=yyfTJ3-1+=zoJcyozrx0VBBnYLNQ@mail.gmail.com
In response to Re: speed up verifying UTF-8  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: speed up verifying UTF-8
List pgsql-hackers
On Thu, Jun 3, 2021 at 9:16 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> Some ideas:
>
> 1. Better to check if any high bits are set first. We care more about
> the speed of that than of detecting zero bytes, because input with high
> bits is valid but zeros are an error.
>
> 2. Since we check that there are no high bits, we can do the zero-checks
> with fewer instructions like this:

Both ideas make sense, and I like the shortcut we can take with the zero check. I think Greg is right that the zero check needs “half1 & half2”, so I tested with that (updated patches attached).
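For readers following along, here is a minimal sketch of how I understand the two ideas combining on one 16-byte chunk. It is not the code from the attached patches: the function name chunk16_is_nonzero_ascii and the two macros are made up for illustration, and only half1 and half2 come from the discussion above. The high-bit test runs first. If it passes, every byte is at most 0x7F, so adding 0x7F to each byte cannot carry into its neighbor and sets that byte's high bit exactly when the byte is nonzero. The two halves are then combined with AND, so a zero byte in either half clears a bit.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative constants, not taken from the patch. */
#define HIGH_BITS UINT64_C(0x8080808080808080)
#define LOW_SEVEN UINT64_C(0x7f7f7f7f7f7f7f7f)

/*
 * Hypothetical helper: returns true if all 16 bytes starting at s are
 * in the range 0x01..0x7F (ASCII with no zero bytes).
 */
static bool
chunk16_is_nonzero_ascii(const unsigned char *s)
{
    uint64_t    half1;
    uint64_t    half2;

    /* memcpy avoids unaligned-access and aliasing problems */
    memcpy(&half1, s, sizeof(half1));
    memcpy(&half2, s + sizeof(half1), sizeof(half2));

    /* Idea 1: check for high bits first; multibyte input leaves here. */
    if ((half1 | half2) & HIGH_BITS)
        return false;

    /*
     * Idea 2: with no high bits set, byte + 0x7F is at most 0xFE, so there
     * is no carry between bytes, and the high bit of each sum is set only
     * if the byte was nonzero. AND the halves so that a zero byte in
     * either one clears the corresponding bit.
     */
    return (((half1 + LOW_SEVEN) & (half2 + LOW_SEVEN)) & HIGH_BITS)
        == HIGH_BITS;
}
```

A caller would presumably take this path for each full 16-byte chunk and fall back to the ordinary byte-at-a-time verifier when it returns false or when fewer than 16 bytes remain.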

> What test set have you been using for performance testing this? I'd like

The microbenchmark is the same one you attached to [1], which I extended with a 95% multibyte case. With the new zero check:

clang 12.0.5 / macOS:

master:

 chinese | mixed | ascii
---------+-------+-------
     981 |   688 |   371

0001:

 chinese | mixed | ascii
---------+-------+-------
     932 |   548 |   110

plus optimized zero check:

 chinese | mixed | ascii
---------+-------+-------
     689 |   573 |    59

It makes sense that the Chinese text case is faster since the zero check is skipped.

gcc 4.8.5 / Linux:

master:

 chinese | mixed | ascii
---------+-------+-------
    2561 |  1493 |   825

0001:

 chinese | mixed | ascii
---------+-------+-------
    2968 |  1035 |   158

plus optimized zero check:

 chinese | mixed | ascii
---------+-------+-------
    2413 |  1078 |   137

The second machine is a bit older and has an old compiler, but there is still a small speed increase over master on the Chinese case (2561 to 2413). In fact, without Heikki's tweaks, 0001 regresses on multibyte (2968 versus 2561 on master).

(Note: I'm not seeing the 7x improvement I claimed for 0001 here, but that was from memory and I think that was a different machine and newer gcc. We can report a range of results as we proceed.)

[1] https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5a2f@iki.fi

--
John Naylor
EDB: http://www.enterprisedb.com
