Re: speed up verifying UTF-8 - Mailing list pgsql-hackers

From John Naylor
Subject Re: speed up verifying UTF-8
Msg-id CAFBsxsEzzTR=Zd=HnT2TZcQ8So1AzWbD1xXUvRsos8w-0C_nPg@mail.gmail.com
In response to Re: speed up verifying UTF-8  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: speed up verifying UTF-8  (Vladimir Sitnikov <sitnikov.vladimir@gmail.com>)
List pgsql-hackers
I wrote:

> To simplify the constants, I do shift down to uint32, and I didn't bother working around that. v16alpha regressed on worst-case input, so for v16beta I went back to earlier coding for the one-byte ascii check. That helped, but it's still slower than v14.

It occurred to me that I could rewrite the switch test into simple comparisons, like I already had for the 2- and 4-byte lead cases. While at it, I folded the leading byte and continuation tests into a single operation, like this:

/* 3-byte lead with two continuation bytes */
else if ((chunk & 0xF0C0C00000000000) == 0xE080800000000000)

...and also tried using 64-bit constants to avoid shifting. It still didn't quite beat v14, but got pretty close (within about 8% on every input below):

> The numbers on Power8 / gcc 4.8 (little endian):
>
> HEAD:
>
>  chinese | mixed | ascii | mixed16 | mixed8
> ---------+-------+-------+---------+--------
>     2951 |  1521 |   871 |    1474 |   1508
>
> v14:
>
>  chinese | mixed | ascii | mixed16 | mixed8
> ---------+-------+-------+---------+--------
>      885 |   607 |   179 |     774 |   1325

v16gamma:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     952 |   632 |   180 |     800 |   1333

A big-endian 64-bit platform just might shave enough cycles to beat v14 this way... or not.
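
For readers following along, here is a minimal, self-contained sketch of how the folded 3-byte test quoted above behaves. It assumes the chunk is built so that the first input byte lands in the most significant position, which is what the constants imply. The names load_chunk_be and is_3byte_shape are hypothetical and are not taken from the patch. Note that the mask only checks the bit pattern of the lead byte and two continuation bytes; overlong encodings and surrogates would still need separate checks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): pack up to 8 input bytes into
 * a 64-bit chunk with s[0] in the most significant byte. Missing bytes are
 * zero-padded, and a zero byte never matches a continuation pattern.
 */
static uint64_t
load_chunk_be(const unsigned char *s, size_t len)
{
	uint64_t	chunk = 0;

	for (size_t i = 0; i < 8; i++)
		chunk = (chunk << 8) | (i < len ? s[i] : 0);
	return chunk;
}

/*
 * One mask-and-compare covers all three bytes:
 *   byte 0 & 0xF0 == 0xE0   (3-byte lead, 1110xxxx)
 *   byte 1 & 0xC0 == 0x80   (continuation, 10xxxxxx)
 *   byte 2 & 0xC0 == 0x80   (continuation, 10xxxxxx)
 * This verifies the shape only, not overlong or surrogate ranges.
 */
static bool
is_3byte_shape(uint64_t chunk)
{
	return (chunk & UINT64_C(0xF0C0C00000000000)) ==
		UINT64_C(0xE080800000000000);
}
```

On a little-endian machine the packing loop (or a byte swap after a plain load) costs something, which is one reason a big-endian platform might come out ahead here.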

--
John Naylor
EDB: http://www.enterprisedb.com