Fabien COELHO писал 2021-07-06 09:13:
> Hello Yura,
>
>> I believe most "range" values are small, much smaller than UINT32_MAX.
>> In this case, according to [1] fastest method is Lemire's one (I'd
>> take
>> original version from [2]) [...]
>
> Yep.
>
> I share your point that the range is more often 32 bits.
>
> However, I'm not enthousiastic at combining two methods depending on
> the range, the function looks complex enough without that, so I would
> suggest not to take this option. Also, the decision process adds to
> the average cost, which is undesirable.
Given 99.99% cases will be in the likely case, branch predictor should
eliminate decision cost.
And as Dean Rasheed correctly mentioned, mask method will
have really bad pattern for branch predictor if range is not just below
or equal to power of two.
For example, rand_range(10000) will have 60% probability to pass through
`while (val > range)` and 40% probability to go to next loop iteration.
rand_range(100000) will have 76%/24% probabilities. Branch predictor
doesn't like it. Even rand_range(1000000), which is quite close to 2^20,
will have 95%/5%, and still not enough to please BP.
But with unbias multiply method it will be 0.0002%/99.9998% for 10000,
0,002%/99.998% for 100000 and 0.02%/99.98% for 1000000 - much-much
better.
Branch predictor will make it almost free (i hope).
And __builtin_clzl is not free lunch either, it has latency 3-4 cycles
on modern processor. Well, probably it could run in parallel with some
part of xoroshiro, but it depends on how compiler will optimize this
function.
> I would certainly select the unbias multiply method if we want a u32
> range variant.
There could be two functions.
regards,
Sokolov Yura.