On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma <ants.aasma@cybertec.at> wrote:
> But given that we know the data length and we have it in a register
> already, it's easy enough to just mask out data past the end with a
> shift. See patch 1. Performance benefit is about 1.5x Measured on a
> small test harness that just hashes and finalizes an array of strings,
> with a data dependency between consecutive hashes (next address
> depends on the previous hash output).
Interesting work! I've taken this idea and (I'm guessing, haven't
tested) improved it by re-using an intermediate step for the
conditional, simplifying the creation of the mask, and moving the
bitscan out of the longest dependency chain. Since you didn't attach
the test harness, would you like to run this and see how it fares?
(v16-0001 is same as your 0001, and v16-0002 builds upon it.) I plan
to test myself as well, but since your test tries to model true
latency, I'm more interested in that one.
> Not sure if the second one is worth the extra code.
I'd say it's not worth optimizing the case we think won't be taken
anyway. I also like having a simple path to assert against.