On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote:
> On 2024-Mar-31, Nathan Bossart wrote:
>> + popcnt = _mm512_reduce_add_epi64(accum);
>> + return popcnt + pg_popcount_fast(buf, bytes);
>
> Hmm, doesn't this arrangement cause an extra function call to
> pg_popcount_fast to be used here? Given the level of micro-optimization
> being used by this code, I would have thought that you'd have tried to
> avoid that. (At least, maybe avoid the call if bytes is 0, no?)
Yes, it does. I did another benchmark on very small arrays and can see the
overhead. This is the time in milliseconds to run pg_popcount() on an
array 1 billion times:
  size (bytes)      HEAD    AVX512-POPCNT
             1  1707.685         3480.424
             2  1926.694         4606.182
             4  3210.412         5284.506
             8  1920.703         3640.968
            16  2936.910         4045.586
            32  3627.956         5538.418
            64  5347.213         3748.212
I suspect that anything below 64 bytes will see this regression, since
that's the first point where there are enough bytes to fill a 512-bit ZMM
register.
We could avoid the call if there are no remaining bytes, but the numbers
for the smallest arrays probably wouldn't improve much, and that might
actually add some overhead due to branching. The other option to avoid
this overhead is to put most of pg_bitutils.c into its header file so that
we can inline the call.
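For illustration, the guard being discussed would look roughly like the
sketch below.  The names here are hypothetical: popcount_tail() stands in
for pg_popcount_fast() (here just a portable byte-wise count), and
popcount_with_guard() shows the branch that skips the tail call when no
bytes remain after the vectorized loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for pg_popcount_fast(): count set bits one byte
 * at a time using the compiler builtin.
 */
static uint64_t
popcount_tail(const char *buf, size_t bytes)
{
	uint64_t	cnt = 0;

	while (bytes--)
		cnt += __builtin_popcount((unsigned char) *buf++);
	return cnt;
}

/*
 * Sketch of the proposed guard: vector_cnt is the count already
 * accumulated by the AVX-512 loop; skip the tail call entirely when
 * there are no remaining bytes.
 */
static uint64_t
popcount_with_guard(uint64_t vector_cnt, const char *buf, size_t bytes)
{
	if (bytes == 0)
		return vector_cnt;		/* avoid the extra function call */
	return vector_cnt + popcount_tail(buf, bytes);
}
```

As noted above, though, the branch itself isn't free, so it's not obvious
this helps for the smallest inputs.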
Reviewing the current callers of pg_popcount(), IIUC the only ones that are
passing very small arrays are the bit_count() implementations and a call in
the syslogger for a single byte. I don't know how much to worry about the
overhead for bit_count() since there's presumably a bunch of other
overhead, and the syslogger one could probably be fixed via an inline
function that pulled the value from pg_number_of_ones (which would probably
be an improvement over the status quo, anyway). But this is all to save a
couple of nanoseconds...
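The single-byte inline function I have in mind would be something like
this sketch.  It uses a local 256-entry lookup table in place of
PostgreSQL's pg_number_of_ones array (the function and table names here
are hypothetical, not the actual patch):

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for PostgreSQL's pg_number_of_ones[256] lookup table. */
static uint8_t number_of_ones[256];

static void
init_table(void)
{
	for (int i = 0; i < 256; i++)
		number_of_ones[i] = (uint8_t) __builtin_popcount(i);
}

/*
 * Sketch of an inline single-byte popcount for callers like the
 * syslogger that only ever count one byte: a table lookup avoids any
 * function-pointer or function-call overhead.
 */
static inline int
popcount_byte(unsigned char byte)
{
	return number_of_ones[byte];
}
```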
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com