Here is what I have staged for commit. One notable difference in this
version of the patch is that I've changed

+    if (nelem <= nelem_per_iteration)
+        goto one_by_one;

to

+    if (nelem < nelem_per_iteration)
+        goto one_by_one;

I realized that there's no reason to jump to the one-by-one linear search
code when nelem == nelem_per_iteration, as the worst thing that will happen
is that we'll process all the elements twice if the value isn't present in
the array. The benchmark I've been using also shows a significant speedup
for this case with this change (on the order of 75%), which I imagine is
due to some combination of better branch prediction, caching, and fewer
instructions.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com