I wrote:
> On Fri, Jul 16, 2021 at 1:44 AM Vladimir Sitnikov <
sitnikov.vladimir@gmail.com> wrote:
> >
> > Have you considered shift-based DFA for a portable implementation
https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 ?
>
> I did consider some kind of DFA a while back and it was too slow.
I took a closer look at this "shift-based DFA", and it seemed pretty straightforward to implement this on top of my DFA attempt from some months ago. The DFA technique is not a great fit with our API, since we need to return how many bytes we found valid. On x86 (not our target for the fallback, but convenient to test) all my attempts were either worse than HEAD in multiple cases, or showed no improvement for the important ASCII case. On Power8, it's more compelling, and competitive with v14, so I'll characterize it on that platform as I describe the patch series:
0001 is a pure DFA, and has decent performance on multibyte, but terrible on ascii.
0002 dispatches on the leading byte category, unrolls the DFA loop according to how many valid bytes we need, and only checks the DFA state afterwards. It's good on multibyte (3-byte, at least) but still terrible on ascii.
0003 adds a 1-byte ascii fast path -- while robust on all inputs, it still regresses a bit on ascii.
0004 uses the same 8-byte ascii check as previous patches do.
0005 and 0006 use combinations of 1- and 8-byte ascii checks similar to in v17.
0005 seems the best on Power8, and is very close to v4. FWIW, v14's measurements seem lucky and fragile -- if I change any little thing, even
- return -1;
+ return 0;
it easily loses 100-200ms on non-pure-ascii tests. That said, v14 still seems the logical choice, unless there is some further tweak on top of v17 or v18 that gives some non-x86 platform a significant boost.
Power8, gcc 4.8:
HEAD:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
2944 | 1523 | 871 | 1473 | 1509
v18-0001:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
1257 | 1681 | 1385 | 1744 | 2018
v18-0002:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
951 | 1381 | 1217 | 1469 | 1172
v18-0003:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
911 | 1111 | 942 | 1112 | 865
v18-0004:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
987 | 730 | 222 | 1325 | 2306
v18-0005:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
962 | 664 | 180 | 928 | 1179
v18-0006:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
908 | 663 | 244 | 1026 | 1464
and for comparison,
v14:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
888 | 607 | 179 | 777 | 1328
v17-0003:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
1205 | 662 | 180 | 767 | 1256
Macbook, clang 12:
HEAD:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
974 | 691 | 370 | 456 | 526
v18-0001:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
1334 | 2713 | 2802 | 2665 | 2541
v18-0002:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
733 | 1212 | 1064 | 1034 | 1007
v18-0003:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
653 | 560 | 370 | 420 | 465
v18-0004:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
574 | 402 | 88 | 584 | 1033
v18-0005:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
1345 | 730 | 334 | 578 | 909
v18-0006:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
674 | 485 | 153 | 594 | 989
and for comparison,