From: Andres Freund
Subject: Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock
Msg-id: 2avffd4n6e5lu7kbuvpjclw3dzcqsw4qctj5ch4qin5gakk3r3@euyewe6tf3z3
In response to: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock ("Greg Burd" <greg@burd.me>)
List: pgsql-hackers
Hi,

On 2025-07-11 13:26:53 -0400, Greg Burd wrote:
> In recent conversations [1] about how best to adapt the code to become
> NUMA-aware, Andres commented, "FWIW, I've started to wonder if we
> shouldn't just get rid of the freelist entirely" and, because I'm a glutton
> for punishment (and I think this idea has some merit), I took him up on
> this.

> In freelist.c the function StrategyGetBuffer() currently tries first to use
> the BufferAccessStrategy, if present, via GetBufferFromRing().  Failing
> that, the second step is to check the "freelist" as defined by
> StrategyControl->firstFreeBuffer/lastFreeBuffer to determine if it contains
> any available buffers, and finally it will "Use the clock-sweep algorithm to
> find a free buffer."  The freelist was intended to be "a list of buffers
> that are prime candidates for replacement" but I question the value of that
> versus the overhead of managing it.  Without the list some operations are
> likely faster (I plan to measure this) since, as Andres points out in [1],
> "just using clock sweep actually makes things like DROP TABLE perform better
> because it doesn't need to maintain the freelist anymore."  It may be the
> case that with very large NBuffers where most buffers are in use this
> approach is slower (I plan to measure this too), but in those cases I'd imagine the
> freelist is likely empty and so the code will use the clock-sweep algorithm
> anyway, so I'm not sure there is a performance penalty at all.
>
> This change does remove the have_free_buffer() function used by the
> contrib/pg_prewarm module.  On the surface this doesn't seem to cause any
> issues, but honestly I've not thought too deeply about this one.

I think we'll likely need something to replace it.

TBH, I'm not convinced that autoprewarm's use of have_free_buffer() is quite
right. The goal of using have_free_buffer() is obviously to stop prewarming
shared buffers if doing so would just evict buffers.  But it's not clear to me
that we should simply stop when there aren't any free buffers - what if the
previous buffer contents aren't the right ones?  It'd make more sense to me to
stop autoprewarm once NBuffers buffers have been prewarmed...
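
If we went that way, the check could be as simple as the sketch below, where
"blocks_prewarmed" is a hypothetical counter kept by autoprewarm's load loop
(just an illustration, not proposed code):

/*
 * Hypothetical stop condition for autoprewarm's load loop: instead of
 * polling have_free_buffer(), stop once we've loaded as many blocks as
 * there are shared buffers - anything more would start evicting blocks
 * we just prewarmed.
 */
if (blocks_prewarmed >= NBuffers)
        break;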



> v2-0001 Eliminate the freelist from the buffer manager and depend on clock-sweep.
>
>
> Once the freelist was removed [2] and tests passed [3], I took a long hard
> look at the buffer_strategy_lock that used to serialize concurrent access to
> members of BufferStrategyControl, and I couldn't find a good reason to keep
> it around.  Let's review what it is guarding:
>
> completePasses: a count of the number of times the clock-sweep hand wraps
> around.  StrategySyncStart() provides this to the bgwriter which in turn
> uses it to compute a strategic location at which to start scanning for pages
> to evict.  There's an interesting comment that states a "requirement" and
> then equivocates with "but that's highly unlikely and wouldn't be
> particularly harmful," conflicting with itself.  I tried to find a
> reason that nextVictimBuffer could overflow, or that the use of the
> completePasses value could somehow cause harm if off by one or more in the
> bgwriter, and either I missed it (please tell me) or there isn't one.
> However, it does make sense to change completePasses into an atomic value so
> that it is consistent across backends and in the bgwriter.

I don't think it's *quite* that easy. If you just maintain nextVictimBuffer
and completePasses as separate atomic counters, without a lock keeping the two
consistent, StrategySyncStart(), as coded right now / in your patch, won't
necessarily return reasonable values for the two, which I think would lead to
the bgwriter suddenly becoming overly active for one cycle and then very
inactive for a while afterwards.

With really large shared buffers that'd be rare to hit, but with small shared
buffers I think it'd be common enough to worry about.

The most obvious way around this would be to make the clock hand a 64-bit
atomic, which would avoid the need for a separate tracker for the number of
passes.  Unfortunately doing so would require a modulo operation on every
clock tick, which I think would likely be too expensive on common platforms -
with small shared_buffers I actually see the existing, relatively rarely
reached, modulo in ClockSweepTick() show up in profiles on a Skylake-X system.
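
For illustration, a 64-bit hand would look roughly like the sketch below
("clockHand64" is a hypothetical field); the per-tick modulo is the expensive
part:

/* Sketch only: one 64-bit hand replaces nextVictimBuffer/completePasses. */
uint64  hand = pg_atomic_fetch_add_u64(&StrategyControl->clockHand64, 1);
uint32  victim = (uint32) (hand % NBuffers);    /* modulo on every tick */
uint32  passes = (uint32) (hand / NBuffers);    /* complete passes for free */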

It'd be easier if we could rely on NBuffers to be a power of two, but that
doesn't seem like a realistic requirement.


I think the easiest way here would be to make StrategySyncStart(), a
relatively rare operation, retry whenever it detects that it read an
out-of-sync nextVictimBuffer. E.g. something roughly like:

uint32  completePasses;
uint32  nextVictimBuffer;

while (true)
{
        completePasses = pg_atomic_read_u32(&StrategyControl->completePasses);
        pg_memory_barrier();
        nextVictimBuffer = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
        pg_memory_barrier();

        /* re-read both values; retry if a wraparound raced with us */
        if (completePasses == pg_atomic_read_u32(&StrategyControl->completePasses) &&
            nextVictimBuffer <= pg_atomic_read_u32(&StrategyControl->nextVictimBuffer))
                break;
}

which I think would detect the case where we read a nextVictimBuffer value
that was decreased while reading a completePasses value that was increased.


While at it, I think we should also make ClockSweepTick() decrement
nextVictimBuffer by atomically subtracting NBuffers, rather than using a CAS
loop. I recently noticed that the CAS sometimes has to retry a fair number of
times, which in turn makes the `victim % NBuffers` show up in profiles.
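
Sketched out, under the assumption that completePasses has become an atomic as
discussed above (the structure below is illustrative, not the patch's code):

/*
 * ClockSweepTick() wraparound without the CAS loop: the one backend whose
 * tick lands exactly on a multiple of NBuffers pulls the counter back down
 * with an atomic subtraction and accounts for the completed pass.
 */
victim = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
if (victim >= (uint32) NBuffers)
{
        uint32  wrapped = victim % NBuffers;

        if (wrapped == 0)
        {
                /* exactly one tick per pass takes this branch */
                pg_atomic_fetch_sub_u32(&StrategyControl->nextVictimBuffer, NBuffers);
                pg_atomic_fetch_add_u32(&StrategyControl->completePasses, 1);
        }
        victim = wrapped;
}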


> I'd like to hear ideas for worst cases to test and/or benchmark.  I plan on
> attempting what I saw Michael do with flamegraphs (before/after/delta) in a
> follow-up to this.  If people feel this has merit I'll add it to CF/2.

What I've benchmarked is both single-threaded and concurrent clock sweep, by
doing pg_prewarm() of per-pgbench-client relations. I used:

c=40 && \
  psql -Xq -c "select pg_buffercache_evict_all()" \
       -c 'SELECT numa_node, sum(size), count(*) FROM pg_shmem_allocations_numa WHERE size != 0 GROUP BY numa_node;' && \
  pgbench -n -P1 -c$c -j$c -f <(echo "SELECT pg_prewarm('copytest_:client_id');") -t1

(with c=1 for the single-threaded case obviously)

The reason for the pg_shmem_allocations_numa query is to ensure that
shared_buffers is actually mapped, as otherwise the bottleneck will be the
kernel zeroing out buffers.

The reason for doing -t1 is that I wanted to compare freelist vs clock sweep,
rather than clock sweep in general.

Note that I patched EvictUnpinnedBufferInternal() to call
StrategyFreeBuffer(), as otherwise running this a second time won't actually
exercise the freelist.  And the first run after postmaster start will always
be noisier...
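
I.e. roughly the following, just for the benchmark (a sketch; the variable
names are made up and the exact placement within EvictUnpinnedBufferInternal()
is approximate):

/*
 * Benchmark-only tweak, not part of the patch: once the buffer has been
 * evicted, put it back on the freelist so that subsequent runs allocate
 * from the freelist again rather than via the clock sweep.
 */
if (evicted)
        StrategyFreeBuffer(desc);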

Greetings,

Andres Freund


