On Tue, Apr 02, 2024 at 01:40:21PM -0500, Nathan Bossart wrote:
> On Tue, Apr 02, 2024 at 01:43:48PM -0400, Tom Lane wrote:
>> I don't like the double evaluation of the macro argument. Seems like
>> you could get the same results more safely with
>>
>> static inline uint64
>> pg_popcount(const char *buf, int bytes)
>> {
>>     if (bytes < 64)
>>     {
>>         uint64 popcnt = 0;
>>
>>         while (bytes--)
>>             popcnt += pg_number_of_ones[(unsigned char) *buf++];
>>
>>         return popcnt;
>>     }
>>     return pg_popcount_optimized(buf, bytes);
>> }
>
> Yeah, I like that better. I'll do some testing to see what the threshold
> really should be before posting an actual patch.

My testing shows that inlining beats the current "fast" implementation
for inputs of fewer than 8 bytes, and that the "fast" implementation beats
the AVX-512 implementation for inputs of fewer than 64 bytes. These results
are pretty intuitive because 8 and 64 bytes are the points at which the
respective optimizations kick in.

In v21, 0001 is just the above inlining idea, which seems worth doing
independent of $SUBJECT. 0002 and 0003 are the AVX-512 patches, which I've
modified similarly to 0001, i.e., I've inlined the "fast" version into the
function behind the function pointer to avoid the function call overhead
when there are fewer than 64 bytes. Roughly speaking, all of this overhead
juggling should result in choosing the optimal popcount implementation for
the number of bytes to process.
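
To illustrate the size-based selection described above, here is a minimal
sketch. It is not the code from the patches: the names popcount_sketch,
popcount_bytewise, popcount_fast, and popcount_wide are hypothetical, the
thresholds are simply the 8 and 64 bytes from my testing, uint64_t stands in
for uint64, and popcount_wide is only a placeholder for the AVX-512 version.

```c
#include <stdint.h>
#include <string.h>

/* Thresholds assumed from the measurements above; hypothetical names. */
#define INLINE_THRESHOLD 8
#define WIDE_THRESHOLD   64

/* Count set bits in one 64-bit word (portable Kernighan loop). */
static inline uint64_t
popcount_word(uint64_t w)
{
    uint64_t n = 0;

    while (w)
    {
        w &= w - 1;             /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Byte-at-a-time loop, cheap enough to inline for very short inputs. */
static inline uint64_t
popcount_bytewise(const char *buf, int bytes)
{
    uint64_t popcnt = 0;

    while (bytes--)
        popcnt += popcount_word((unsigned char) *buf++);
    return popcnt;
}

/* Stand-in for the "fast" version: 8 bytes per iteration, then the tail. */
static uint64_t
popcount_fast(const char *buf, int bytes)
{
    uint64_t popcnt = 0;

    while (bytes >= 8)
    {
        uint64_t word;

        memcpy(&word, buf, sizeof(word));   /* avoids unaligned access */
        popcnt += popcount_word(word);
        buf += 8;
        bytes -= 8;
    }
    return popcnt + popcount_bytewise(buf, bytes);
}

/* Placeholder for the AVX-512 version; here it just reuses popcount_fast. */
static uint64_t
popcount_wide(const char *buf, int bytes)
{
    return popcount_fast(buf, bytes);
}

/* Pick an implementation based on the input length. */
static inline uint64_t
popcount_sketch(const char *buf, int bytes)
{
    if (bytes < INLINE_THRESHOLD)
        return popcount_bytewise(buf, bytes);
    if (bytes < WIDE_THRESHOLD)
        return popcount_fast(buf, bytes);
    return popcount_wide(buf, bytes);
}
```

In the actual patches the last step goes through a function pointer chosen
at startup, so the sketch only shows the ordering of the length checks, not
the real dispatch mechanism.
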
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com