On Wed, Sep 03, 2025 at 02:47:25PM -0400, Andres Freund wrote:
>> I see a variety for increased CPU usage:
>> 
>> 1) The private ref count infrastructure in bufmgr.c gets a bit slower once
>>    more buffers are pinned
> 
> The problem mainly seems to be that the branches in the loop at the start of
> GetPrivateRefCountEntry() are entirely unpredictable in this workload.  I had
> an old patch that tried to make it possible to use SIMD for the search, by
> using a separate array for the Buffer ids - with that gcc generates fairly
> crappy code, but does make the code branchless.
> 
> Here that substantially reduces the overhead of doing prefetching. Afterwards
> it's not a meaningful source of misses anymore.

I quickly hacked together some patches for this.  0001 adds new static
variables so that we have a separate array of the buffers and the index for
the current ReservedRefCountEntry.  0002 optimizes the linear search in
GetPrivateRefCountEntry() using our simd.h routines.  The SIMD code feels
expensive (see vector8_highbit_mask()'s implementation for AArch64), but if
the main goal is to avoid branches, I think this is about as "branchless"
as we can make it.  I'm going to stare at this a bit longer, but I figured
I'd get something on the lists while it is fresh in my mind.
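
For anyone reading along without the patches applied, here is a rough
standalone sketch of the general idea, not the patch itself.  It keeps the
buffer ids in their own small array and scans every slot unconditionally, so
the loop has no data-dependent branch.  The names (ref_buffers,
find_ref_slot, REF_ARRAY_ENTRIES) and the array size of 8 are assumptions
made for illustration, and it uses plain C arithmetic rather than the simd.h
routines.  It also assumes a valid buffer id appears in at most one slot.

```c
#include <stdint.h>

/* Hypothetical size of the small fixed array; chosen for illustration. */
#define REF_ARRAY_ENTRIES 8

typedef int32_t Buffer;

/*
 * Hypothetical separate array holding only the buffer ids, so the search
 * touches one contiguous block of integers.  Zero marks an empty slot.
 */
static Buffer ref_buffers[REF_ARRAY_ENTRIES];

/*
 * Return the index of the slot holding "buffer", or -1 if it is absent.
 *
 * Every slot is compared; there is no early exit.  Each comparison yields
 * 0 or 1, which is turned into an all-zeros or all-ones mask and used to
 * select (i + 1).  Assumes "buffer" is non-zero and stored at most once,
 * so at most one iteration contributes to "found".
 */
static int
find_ref_slot(Buffer buffer)
{
    int         found = 0;

    for (int i = 0; i < REF_ARRAY_ENTRIES; i++)
    {
        int         match = (ref_buffers[i] == buffer);

        found |= -match & (i + 1);
    }

    return found - 1;
}
```

The trade-off is the same one mentioned above: the loop always does all
REF_ARRAY_ENTRIES comparisons, even when the first slot matches, in exchange
for not having an unpredictable branch per element.  Whether the compiler
vectorizes this form well is a separate question.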
-- 
nathan