Hello Thomas,
> modular_multiply() is an interesting device. I will leave this to
> committers with a stronger mathematical background than me, but I have
> a small comment in passing:
For testing this function, I have manually commented out the various
shortcuts so that the manual multiplication code is always used, and the
tests passed. I just did it again.
> +#ifdef HAVE__BUILTIN_POPCOUNTLL
> + return __builtin_popcountll(n);
> +#else /* use clever no branching bitwise operator version */
>
> I think it is not enough for the compiler to have
> __builtin_popcountll(). The CPU that this is eventually executed on
> must actually have the POPCNT instruction[1] (or equivalent on ARM,
> POWER etc), or the program will die with SIGILL.
Hmmm, I'd be pretty surprised: The point of the builtin is to delegate to
the compiler the hassle of selecting the best option available, depending
on target hardware class: whether it issues a cpu/sse4 instruction is
not/should not be our problem.
> There are CPUs in circulation produced in this decade that don't have
> it.
Then the compiler, when generating code that is expected to run for a
large class of hardware which include such old ones, should not use a
possibly unavailable instruction, or the runtime should take
responsability for dynamically selecting a viable option.
My understanding is that it should always be safe to call __builtin_XYZ
functions when available. Then if you compile saying that you want code
specific to cpu X and then run it on cpu Y and it fails, basically you
just shot yourself in the foot.
> I have previously considered something like this[2], but realised you
> would therefore need a runtime check. There are a couple of ways to do
> that: see commit a7a73875 for one example, also
> __builtin_cpu_supports(), and direct CPU ID bit checks (some platforms).
> There is also the GCC "multiversion" system that takes care of that for
> you but that worked only for C++ last time I checked.
ISTM that the purpose of a dynamic check would be to have the better
hardware benefit even when compiling for a generic class of hardware which
may or may not have the better option.
This approach is fine for reaching a better performance/portability
compromise, but I do not think that it is that useful for pgbench to go to
this level of optimization, unless someone else takes care, hence the
compiler builtin.
An interesting side effect of your mail is that while researching a bit on
the subject I noticed __builtin_clzll() which helps improve the nbits code
compared to using popcount. Patch attached uses CLZ insted of POPCOUNT.
--
Fabien.