On Thu, Apr 04, 2024 at 04:28:58PM +1300, David Rowley wrote:
> On Thu, 4 Apr 2024 at 11:50, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> If we can verify this approach won't cause segfaults and can stomach the
>> regression between 8 and 16 bytes, I'd happily pivot to this approach so
>> that we can avoid the function call dance that I have in v25.
>
> If we're worried about regressions with some narrow range of byte
> values, wouldn't it make more sense to compare that to cc4826dd5~1 at
> the latest rather than to some version that's already probably faster
> than PG16?

Good point.  When compared with REL_16_STABLE, Ants's idea still wins:

 bytes       v25   v25+ants   REL_16_STABLE
     2  1108.205   1033.132        2039.342
     4  1311.227   1289.373        3207.217
     8  1927.954   2360.113        3200.238
    16  2281.091   2365.408        4457.769
    32  3856.992   2390.688        6206.689
    64  3648.72    3242.498        9619.403
   128  4108.549   3607.148       17912.081
   256  4910.076   4496.852       33591.385

As before, with 2 and 4 bytes, HEAD is using the inlined approach, but
REL_16_STABLE is doing a function call. For 8 bytes, REL_16_STABLE is
doing a function call as well as a call to a function pointer. At 16
bytes, it's doing a function call and two calls to a function pointer.
With Ants's approach, both 8 and 16 bytes require a single call to a
function pointer, and of course we are using the AVX-512 implementation for
both.
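
As a rough illustration of the call structure described above (not the
actual patch), the sketch below inlines the work for buffers under 8 bytes
and makes a single call through a function pointer for anything larger.
Every name here is hypothetical, the 8-byte threshold is an assumption
inferred from the numbers above, and a plain bit count stands in for the
real operation; the function pointer would normally be set once after CPU
feature detection to something like an AVX-512 implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical; this is not PostgreSQL's code. */

/* Assumed cutoff: buffers shorter than this are handled inline. */
#define SKETCH_INLINE_THRESHOLD 8

/* Count the set bits in one byte by clearing the lowest set bit each pass. */
static inline int
sketch_bits_in_byte(uint8_t b)
{
	int			n = 0;

	while (b)
	{
		b &= (uint8_t) (b - 1);
		n++;
	}
	return n;
}

/*
 * Portable fallback.  In a real build, a SIMD variant with the same
 * signature would be selected instead when the CPU supports it.
 */
static uint64_t
sketch_count_portable(const char *buf, size_t bytes)
{
	uint64_t	total = 0;

	while (bytes--)
		total += sketch_bits_in_byte((uint8_t) *buf++);
	return total;
}

/* Set once at startup; every large-buffer call goes through this pointer. */
static uint64_t (*sketch_count_impl) (const char *, size_t) = sketch_count_portable;

/*
 * Entry point: small inputs never leave the caller, larger inputs cost
 * exactly one indirect call regardless of length.
 */
static inline uint64_t
sketch_count(const char *buf, size_t bytes)
{
	if (bytes < SKETCH_INLINE_THRESHOLD)
	{
		uint64_t	total = 0;

		while (bytes--)
			total += sketch_bits_in_byte((uint8_t) *buf++);
		return total;
	}
	return sketch_count_impl(buf, bytes);
}
```

With this shape, 2-byte and 4-byte inputs take the inlined loop, while
8-byte and 16-byte inputs each take one call through sketch_count_impl.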

I think this is sufficient to justify switching approaches.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com