On Wed, 6 Nov 2024 at 04:03, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> Another option could be to use SIMD instructions to check multiple bytes
> is zero in a single operation. Maybe just an idea to keep in mind and experiment
> if we feel the need later on.
Could do. I just wrote it that way to give the compiler flexibility to
do SIMD implicitly. That seemed easier than messing around with SIMD
intrinsics. I guess the compiler won't use SIMD with the single
size_t-at-a-time version as it can't be certain it's ok to access the
memory beyond the first zero word. Because I wrote the "if" condition
using bitwise-OR, there's no boolean short-circuiting, so the compiler
sees it must be safe to access all the memory for the loop iteration.
If I use -march=native or -march=znver2 on my Zen2 machine, gcc does
use SIMD operators. Clang uses some 128-bit registers without
specifying -march:
drowley@amd3990x:~$ gcc -O2 allzeros.c -march=native -o allzeros &&
for i in {1..3}; do ./allzeros; done
char: done in 1940539 nanoseconds
size_t: done in 261731 nanoseconds (7.41425 times faster than char)
size_t * 4: done in 130415 nanoseconds (14.8797 times faster than char)
size_t * 8: done in 70031 nanoseconds (27.7097 times faster than char)
char: done in 3030132 nanoseconds
size_t: done in 477044 nanoseconds (6.35189 times faster than char)
size_t * 4: done in 123551 nanoseconds (24.5254 times faster than char)
size_t * 8: done in 68549 nanoseconds (44.2039 times faster than char)
char: done in 3214037 nanoseconds
size_t: done in 256901 nanoseconds (12.5108 times faster than char)
size_t * 4: done in 126017 nanoseconds (25.5048 times faster than char)
size_t * 8: done in 73167 nanoseconds (43.9274 times faster than char)
David