Re: speed up verifying UTF-8 - Mailing list pgsql-hackers

From John Naylor
Subject Re: speed up verifying UTF-8
Date
Msg-id CAFBsxsHu4WrdQHbr73+aez4wjzLHF-LYihutn0-Huex8vNX5zg@mail.gmail.com
Whole thread Raw
In response to Re: speed up verifying UTF-8  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: speed up verifying UTF-8  (Vladimir Sitnikov <sitnikov.vladimir@gmail.com>)
List pgsql-hackers
I wrote:

> On Fri, Jul 16, 2021 at 1:44 AM Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
> >
> > Have you considered shift-based DFA for a portable implementation https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 ?
>
> I did consider some kind of DFA a while back and it was too slow.

I took a closer look at this "shift-based DFA", and it seemed pretty straightforward to implement this on top of my DFA attempt from some months ago. The DFA technique is not a great fit with our API, since we need to return how many bytes we found valid. On x86 (not our target for the fallback, but convenient to test) all my attempts were either worse than HEAD in multiple cases, or showed no improvement for the important ASCII case. On Power8, it's more compelling, and competitive with v14, so I'll characterize it on that platform as I describe the patch series:

0001 is a pure DFA, and has decent performance on multibyte, but terrible on ascii.
0002 dispatches on the leading byte category, unrolls the DFA loop according to how many valid bytes we need, and only checks the DFA state afterwards. It's good on multibyte (3-byte, at least) but still terrible on ascii.
0003 adds a 1-byte ascii fast path -- while robust on all inputs, it still regresses a bit on ascii.
0004 uses the same 8-byte ascii check as previous patches do.
0005 and 0006 use combinations of 1- and 8-byte ascii checks similar to in v17.

0005 seems the best on Power8, and is very close to v4. FWIW, v14's measurements seem lucky and fragile -- if I change any little thing, even

- return -1;
+ return 0;

it easily loses 100-200ms on non-pure-ascii tests. That said, v14 still seems the logical choice, unless there is some further tweak on top of v17 or v18 that gives some non-x86 platform a significant boost.

Power8, gcc 4.8:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v18-0001:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1257 |  1681 |  1385 |    1744 |   2018

v18-0002:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     951 |  1381 |  1217 |    1469 |   1172

v18-0003:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     911 |  1111 |   942 |    1112 |    865

v18-0004:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     987 |   730 |   222 |    1325 |   2306

v18-0005:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     962 |   664 |   180 |     928 |   1179

v18-0006:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     908 |   663 |   244 |    1026 |   1464

and for comparison,

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     888 |   607 |   179 |     777 |   1328

v17-0003:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1205 |   662 |   180 |     767 |   1256


Macbook, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v18-0001:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1334 |  2713 |  2802 |    2665 |   2541

v18-0002:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     733 |  1212 |  1064 |    1034 |   1007

v18-0003:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     653 |   560 |   370 |     420 |    465

v18-0004:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     574 |   402 |    88 |     584 |   1033

v18-0005:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1345 |   730 |   334 |     578 |    909

v18-0006:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     674 |   485 |   153 |     594 |    989

and for comparison,

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     674 |   346 |    78 |     309 |    504

v17-0002:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     516 |   324 |    78 |     331 |    544

--
Attachment

pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: 回复: Why is XLOG_FPI_FOR_HINT always need backups?
Next
From: Yugo NAGATA
Date:
Subject: Re: pgbench: using prepared BEGIN statement in a pipeline could cause an error