> Hm. These results are so similar that I'm tempted to suggest we just
> remove the section of code dedicated to alignment. Is there any reason not
> to do that?
It seems that the double-load overhead from unaligned memory access isn’t
too taxing, even on larger inputs. We can remove the alignment code to
simplify things.
> Does this hand-rolled loop unrolling offer any particular advantage? What
> do the numbers look like if we don't do this or if we process, say, 4
> vectors at a time?
The unrolled version performs better than the non-unrolled one (nearly
2x faster for buffers of 1024 bytes and larger), but processing four
vectors per iteration provides no additional benefit over two. The numbers
and the four-vector loop are given below.

 buf  | Not Unrolled | Unrolled x2 | Unrolled x4
------+--------------+-------------+-------------
   16 |        4.774 |       4.759 |       5.634
   32 |        6.872 |       6.486 |       7.348
   64 |       11.070 |      10.249 |      10.617
  128 |       20.003 |      16.205 |      16.764
  256 |       40.234 |      28.377 |      29.108
  512 |       83.825 |      53.420 |      53.658
 1024 |      191.181 |     101.677 |     102.727
 2048 |      389.160 |     200.291 |     201.544
 4096 |      785.742 |     404.593 |     399.134
 8192 |     1587.226 |     811.314 |     810.961

/* Process 4 vectors */
for (; i < loop_bytes; i += vec_len * 4)
{
    vec64_1 = svld1(pred, (const uint64 *) (buf + i));
    accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64_1));
    vec64_2 = svld1(pred, (const uint64 *) (buf + i + vec_len));
    accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64_2));
    vec64_3 = svld1(pred, (const uint64 *) (buf + i + 2 * vec_len));
    accum3 = svadd_x(pred, accum3, svcnt_x(pred, vec64_3));
    vec64_4 = svld1(pred, (const uint64 *) (buf + i + 3 * vec_len));
    accum4 = svadd_x(pred, accum4, svcnt_x(pred, vec64_4));
}
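
For anyone who wants to play with the loop shape without SVE hardware, here
is a rough portable sketch of the "Unrolled x2" structure. It is not the
patch code: 64-bit scalar words stand in for SVE vectors, the function name
popcount_unrolled2 is made up for illustration, and it assumes a GCC or
Clang compiler for __builtin_popcountll. The point is only to show the two
independent accumulators and the tail handling.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not from the patch: counts set bits in buf using
 * two independent accumulators, mirroring the shape of the x2 loop.
 */
static uint64_t
popcount_unrolled2(const char *buf, size_t bytes)
{
    const size_t word_len = sizeof(uint64_t);
    /* Largest multiple of two words that fits in the buffer. */
    size_t      loop_bytes = bytes - bytes % (word_len * 2);
    uint64_t    accum1 = 0;
    uint64_t    accum2 = 0;
    size_t      i = 0;

    /* Process 2 words per iteration; the two sums do not depend on each other. */
    for (; i < loop_bytes; i += word_len * 2)
    {
        uint64_t    w1;
        uint64_t    w2;

        /* memcpy keeps the loads valid for unaligned buffers. */
        memcpy(&w1, buf + i, word_len);
        memcpy(&w2, buf + i + word_len, word_len);
        accum1 += (uint64_t) __builtin_popcountll(w1);
        accum2 += (uint64_t) __builtin_popcountll(w2);
    }

    /* Handle the remaining bytes one at a time. */
    for (; i < bytes; i++)
        accum1 += (uint64_t) __builtin_popcount((unsigned char) buf[i]);

    return accum1 + accum2;
}
```

The x4 variant would just add accum3 and accum4 in the same way; the table
above suggests that the extra accumulators do not buy anything here.
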
-Chiranmoy