Thread: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

Hello,

In conversations [1] recently about considering how best to adapt the code to become NUMA-aware Andres commented,
"FWIW, I've started to wonder if we shouldn't just get rid of the freelist entirely" and because I'm a glutton for
punishment (and I think this idea has some merit) I took him up on this task.

In freelist.c the function StrategyGetBuffer() currently tries first to use the BufferAccessStrategy, if present, via
GetBufferFromRing().  Failing that, the second step is to check the "freelist" as defined by
StrategyControl->firstFreeBuffer/lastFreeBuffer to determine if it contains any available buffers, and finally it will
"Use the clock-sweep algorithm to find a free buffer."  The freelist was intended to be "a list of buffers that are
prime candidates for replacement" but I question the value of that versus the overhead of managing it.  Without the list
some operations are likely (I plan to measure this) faster due, as Andres points out in [1], "just using clock sweep
actually makes things like DROP TABLE perform better because it doesn't need to maintain the freelist anymore."  It may
be the case that with very large NBuffers where most are in use that this approach is slower (I plan to measure this
too), but in those cases I'd imagine the freelist is likely empty and so the code will use the clock-sweep algorithm
anyway, so I'm not sure there is a performance penalty at all.
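To make the clock-sweep step concrete, here's a rough, self-contained sketch of a GCLOCK-style victim search (illustrative names only, not the actual freelist.c code):

```c
#include <stdint.h>

/* Sweep the hand, decrementing per-buffer usage counts, until a buffer
 * with a zero count is found.  Illustrative names only; not the actual
 * freelist.c code. */
typedef struct
{
    int      nbuffers;      /* size of the buffer pool      */
    uint8_t *usage_count;   /* per-buffer reference counter */
    int      next_victim;   /* the clock hand               */
} ClockSweep;

int
clock_sweep_pick_victim(ClockSweep *cs)
{
    for (;;)
    {
        int victim = cs->next_victim;

        /* advance the hand, wrapping at nbuffers */
        cs->next_victim = (cs->next_victim + 1) % cs->nbuffers;

        if (cs->usage_count[victim] == 0)
            return victim;              /* replaceable buffer found */

        cs->usage_count[victim]--;      /* give it another pass     */
    }
}
```

The point of the comparison above is that this loop has no list to maintain on the free/drop path; the cost shows up only at allocation time.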

This change does remove the have_free_buffer() function used by the contrib/pg_prewarm module.  On the surface this
doesn't seem to cause any issues, but honestly I've not thought too deeply on this one.

v2-0001 Eliminate the freelist from the buffer manager and depend on clock-sweep.


Once removed [2] and tests passing [3] I took a long hard look at the buffer_strategy_lock that used to serialize
concurrent access to members of BufferStrategyControl and I couldn't find a good reason to keep it around.  Let's review
what it is guarding:

completePasses: a count of the number of times the clock-sweep hand wraps around.  StrategySyncStart() provides this to
the bgwriter which in turn uses it to compute a strategic location at which to start scanning for pages to evict.
There's an interesting comment that indicates both a "requirement" and an equivocal "but that's highly unlikely and
wouldn't be particularly harmful" statement conflicting with itself.  I tried to find a reason that nextVictimBuffers
could overflow or that the use of the completePasses value could somehow cause harm if off by one or more in the
bgwriter and either I missed it (please tell me) or there isn't one.  However, it does make sense to change
completePasses into an atomic value so that it is consistent across backends and in the bgwriter.
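To illustrate why the pair of values matters, here's a sketch (with made-up names; the real logic lives in BgBufferSync()/StrategySyncStart()) of how the bgwriter folds completePasses and the hand position into one monotonic counter to measure recent sweep progress:

```c
#include <stdint.h>

/* Fold (complete_passes, next_victim) into one monotonic position so the
 * bgwriter can measure how far the hand moved since its previous call.
 * Made-up names; not the actual bufmgr code. */
uint64_t
sweep_position(uint32_t complete_passes, uint32_t next_victim,
               uint32_t nbuffers)
{
    return (uint64_t) complete_passes * nbuffers + next_victim;
}

uint32_t
buffers_swept_since(uint64_t prev_pos, uint32_t complete_passes,
                    uint32_t next_victim, uint32_t nbuffers)
{
    uint64_t cur = sweep_position(complete_passes, next_victim, nbuffers);

    /* If the two reads were torn (the hand wrapped between reading one
     * value and the other), cur can appear to run backwards -- which is
     * exactly the consistency question raised above. */
    return (uint32_t) (cur - prev_pos);
}
```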

bgwprocno: when not -1 is the PID of the allocation notification latch (ProcGlobal->allProcs[bgwprocno].procLatch).
This is a "power savings" feature where the goal is to signal the bgwriter "When a backend starts using buffers again,
it will wake us up by setting our latch."  Here the code reads, "Since we don't want to rely on a spinlock for this we
force a read from shared memory once, and then set the latch based on that value." and uses INT_ACCESS_ONCE() to read
the value and set the latch.  The function StrategyNotifyBgWriter() is where bgwprocno is set, I see no reason to use
atomic or other synchronization here.
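For reference, the shape of that lock-free notification is roughly the following; this is a simplified sketch with stand-in types, not the actual PGPROC/SetLatch() machinery:

```c
/* Read the shared proc number exactly once (the real code forces this
 * with INT_ACCESS_ONCE), then act only on the local copy so a concurrent
 * writer can't change it between the test and the use. */
#define READ_ONCE_INT(var) ((int) (*((volatile int *) &(var))))

struct latch { int is_set; };
struct proc  { struct latch procLatch; };

void
notify_bgwriter(int *bgwprocno_shared, struct proc *procs)
{
    int bgwprocno = READ_ONCE_INT(*bgwprocno_shared);

    if (bgwprocno != -1)
    {
        /* Clear the flag first so the bgwriter is woken at most once
         * per idle period, then set its latch. */
        *bgwprocno_shared = -1;
        procs[bgwprocno].procLatch.is_set = 1;  /* stand-in for SetLatch */
    }
}
```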

And that's all there is to it now that the freelist is gone.  As a result, IMO it seems unnecessary to require a spin
lock for access to BufferStrategyControl.

v2-0002 Remove the buffer_strategy_lock.


The attached patch is also a branch on GitHub [2] if that is of interest or helpful, and it passes check-world [3] (I
use the GitHub PR to trigger CirrusCI tests, not as the way to convey the change set).

I also made a few minor changes such that we're consistently referring to "clock-sweep" (not "clock sweep" or
"clocksweep"); I'm not wedded to those but consistency isn't a bad thing, right?

As an aside, we're really implementing "generalized CLOCK" [4][5], which uses counters rather than a single bit as
pointed out [6] in the CLOCK-Pro [7] paper, but I digress.

I'd like to hear ideas for worst cases to test and/or benchmark.  I plan on attempting what I saw Michael do with
flamegraphs before/after/delta in a follow up to this.  If people feel this has merit I'll add it to CF/2.

thank you for your time considering this idea,

-greg

[1] https://postgr.es/m/ndvygkpdx44pmi4xbkf52gfrl77cohpefr42tipvd5dgiaeuyd%40fe2og2kxyjnc
[2] https://github.com/gburd/postgres/tree/gregburd/rm-freelist/patch-v2
[3] https://github.com/gburd/postgres/pull/4/checks
[4] V. F. Nicola, A. Dan, and D. M. Dias, "Analysis of the Generalized Clock Buffer Replacement Scheme for Database
Transaction Processing", Proceedings of the 1992 ACM SIGMETRICS Conference, June 1992, pp. 35-46.
[5] A. J. Smith, "Sequentiality and Prefetching in Database Systems", ACM Trans. on Database Systems, Vol. 3, No. 3,
1978, pp. 223-247.
[6] "In a generalized CLOCK version called GCLOCK [25,17], a counter is associated with each page rather than a single
bit. Its counter will be incremented if a page is hit. The cycling clock hand sweeps over the pages decrementing their
counters until a page whose counter is zero is found for replacement."
[7] CLOCK-Pro: An Effective Improvement of the CLOCK Replacement
https://www.usenix.org/legacy/event/usenix05/tech/general/full_papers/jiang/jiang.pdf

On Fri, Jul 11, 2025 at 01:26:53PM -0400, Greg Burd wrote:
> This change does remove the have_free_buffer() function used by the
> contrib/pg_prewarm module.  On the surface this doesn't seem to cause any
> issues, but honestly I've not thought too deeply on this one.

Hm.  ISTM we'll either need to invent another similarly inexpensive way to
test for this or to justify to ourselves that it's not necessary.  My guess
is that we do want to keep autoprewarm from evicting things, but FWIW the
docs already say "prewarming may also evict other data from cache" [0].

> Once removed [2] and tests passing [3] I took a long hard look at the
> buffer_strategy_lock that used to serialize concurrent access to members
> of BufferStrategyControl and I couldn't find a good reason to keep it
> around.  Let's review what it is guarding:
> 
> completePasses: a count of the number of times the clock-sweep hand wraps
> around.  StrategySyncStart() provides this to the bgwriter which in turn
> uses it to compute a strategic location at which to start scanning for
> pages to evict.  There's an interesting comment that indicates both a
> "requirement" and an equivocal "but that's highly unlikely and wouldn't
> be particularly harmful" statement conflicting with itself.  I tried to
> find a reason that nextVictimBuffers could overflow or that the use of
> the completePasses value could somehow cause harm if off by one or more
> in the bgwriter and either I missed it (please tell me) or there isn't
> one.  However, it does make sense to change completePasses into an atomic
> value so that it is consistent across backends and in the bgwriter.
> 
> bgwprocno: when not -1 is the PID of the allocation notification latch
> (ProcGlobal->allProcs[bgwprocno].procLatch).  This is a "power savings"
> feature where the goal is to signal the bgwriter "When a backend starts
> using buffers again, it will wake us up by setting our latch."  Here the
> code reads, "Since we don't want to rely on a spinlock for this we force
> a read from shared memory once, and then set the latch based on that
> value." and uses INT_ACCESS_ONCE() to read the value and set the latch.
> The function StrategyNotifyBgWriter() is where bgwprocno is set, I see no
> reason to use atomic or other synchronization here.
> 
> And that's all there is to it now that the freelist is gone.  As a
> result, IMO it seems unnecessary to require a spin lock for access to
> BufferStrategyControl.

I haven't followed your line of reasoning closely yet, but I typically
recommend that patches that replace locks with atomics use functions with
full barrier semantics (e.g., pg_atomic_read_membarrier_u32(),
pg_atomic_fetch_add_u32()) to make things easier to reason about.  But that
might not be as straightforward in cases like StrategySyncStart() where we
atomically retrieve two values that are used together.  Nevertheless,
minimizing cognitive load might be nice, and there's a chance it doesn't
impact performance very much.

[0] https://www.postgresql.org/docs/devel/pgprewarm.html

-- 
nathan



Hi,

On 2025-07-11 13:26:53 -0400, Greg Burd wrote:
> In conversations [1] recently about considering how best to adapt the code
> to become NUMA-aware Andres commented, "FWIW, I've started to wonder if we
> shouldn't just get rid of the freelist entirely" and because I'm a glutton
> for punishment (and I think this idea has some merit) I took him up on this
> task.

> In freelist.c the function StrategyGetBuffer() currently tries first to use
> the BufferAccessStrategy, if present, via GetBufferFromRing().  Failing
> that, the second step is to check the "freelist" as defined by
> StrategyControl->firstFreeBuffer/lastFreeBuffer to determine if it contains
> any available buffers, and finally it will "Use the clock-sweep algorithm to
> find a free buffer."  The freelist was intended to be "a list of buffers
> that are prime candidates for replacement" but I question the value of that
> versus the overhead of managing it.  Without the list some operations are
> likely (I plan to measure this) faster due, as Andres points out in [1],
> "just using clock sweep actually makes things like DROP TABLE perform better
> because it doesn't need to maintain the freelist anymore."  It may be the
> case that with very large NBuffers where most are in use that this approach
> is slower (I plan to measure this too), but in those cases I'd imagine the
> freelist is likely empty and so the code will use the clock-sweep algorithm
> anyway, so I'm not sure there is a performance penalty at all.
>
> This change does remove the have_free_buffer() function used by the
> contrib/pg_prewarm module.  On the surface this doesn't seem to cause any
> issues, but honestly I've not thought too deeply on this one.

I think we'll likely need something to replace it.

TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
right. The goal of the use have_free_buffer() is obviously to stop prewarming
shared buffers if doing so would just evict buffers.  But it's not clear to me
that we should just stop when there aren't any free buffers - what if the
previous buffer contents aren't the right ones?  It'd make more sense to me to
stop autoprewarm once NBuffers have been prewarmed...



> v2-0001 Eliminate the freelist from the buffer manager and depend on clock-sweep.
>
>
> Once removed [2] and tests passing [3] I took a long hard look at the
> buffer_strategy_lock that used to serialize concurrent access to members of
> BufferStrategyControl and I couldn't find a good reason to keep it around.
> Let's review what it is guarding:
>
> completePasses: a count of the number of times the clock-sweep hand wraps
> around.  StrategySyncStart() provides this to the bgwriter which in turn
> uses it to compute a strategic location at which to start scanning for pages
> to evict.  There's an interesting comment that indicates both a
> "requirement" and an equivocal "but that's highly unlikely and wouldn't be
> particularly harmful" statement conflicting with itself.  I tried to find a
> reason that nextVictimBuffers could overflow or that the use of the
> completePasses value could somehow cause harm if off by one or more in the
> bgwriter and either I missed it (please tell me) or there isn't one.
> However, it does make sense to change completePasses into an atomic value so
> that it is consistent across backends and in the bgwriter.

I don't think it's *quite* that easy. If you just maintain nextVictimBuffer
and completePasses as separate atomic counters, without a lock making the two
consistent, StrategySyncStart(), as coded right now / in your patch, won't
necessarily return reasonable values for the two.  Which I think would lead to
bgwriter suddenly becoming overly active for one cycle and then very inactive
for a while after.

With really large shared buffers that'd be rather rare to be hit, but with
small shared buffers I think it'd be common enough to worry.

The most obvious way around this would be to make the clock hand a 64bit
atomic, which would avoid the need to have a separate tracker for the number
of passes.  Unfortunately doing so would require doing a modulo operation each
clock tick, which I think would likely be too expensive on common platforms -
on small shared_buffers I actually see existing, relatively rarely reached,
modulo in ClockSweepTick() show up on a Skylake-X system.

It'd be easier if we could rely on NBuffers to be a power of two, but that
doesn't seem like a realistic requirement.


I think the easiest way here would be to make StrategySyncStart(), a
relatively rare operation, retry whenever it detects that it saw out-of-sync
nextVictimBuffer. E.g. something roughly like

while (true)
{
        completePasses = atomic_read_u32(&StrategyControl->completePasses);
        pg_memory_barrier();
        nextVictimBuffer = atomic_read_u32(&StrategyControl->nextVictimBuffer);
        pg_memory_barrier();

        if (completePasses == atomic_read_u32(&StrategyControl->completePasses) &&
            nextVictimBuffer <= atomic_read_u32(&StrategyControl->nextVictimBuffer))
           break;
}

which I think would detect the case that we read a nextVictimBuffer value that was
decreased while reading a completePasses value that was increased.


I think while at it, we should make ClockSweepTick() decrement
nextVictimBuffer by atomically subtracting NBuffers, rather than using CAS. I
recently noticed that the CAS sometimes has to retry a fair number of times,
which in turn makes the `victim % NBuffers` show up in profiles.
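A rough sketch of that fetch-sub idea, using C11 atomics and illustrative names (not the actual ClockSweepTick() code, and with the completePasses bump omitted): every ticker takes its ticket with one fetch-add, and exactly one ticker per wrap sees the pre-increment value land on a multiple of nbuffers and pulls the shared hand back into range, with no CAS retry loop.

```c
#include <stdatomic.h>

/* Hand out tickets with fetch-add; the single ticker whose pre-increment
 * value is a multiple of nbuffers subtracts nbuffers once, so after k
 * wraps the hand has been pulled back by exactly k * nbuffers. */
unsigned
sweep_tick(atomic_uint *next_victim, unsigned nbuffers)
{
    unsigned victim = atomic_fetch_add(next_victim, 1);

    if (victim >= nbuffers)
    {
        /* rare path: only tickers that raced past the boundary pay for
         * the modulo */
        victim %= nbuffers;

        if (victim == 0)
            atomic_fetch_sub(next_victim, nbuffers);
    }
    return victim;
}
```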


> I'd like to hear ideas for worst cases to test and/or benchmark.  I plan on
> attempting what I saw Michael do with flamegraphs before/after/delta in a
> follow up to this.  If people feel this has merit I'll add it to CF/2.

What I've benchmarked is both single threaded and concurrent clock sweep, by
doing pg_prewarm() of per-pgbench-client relations. I used

c=40 && psql -Xq -c "select pg_buffercache_evict_all()" -c 'SELECT numa_node, sum(size), count(*) FROM
pg_shmem_allocations_numa WHERE size != 0 GROUP BY numa_node;' && pgbench -n -P1 -c$c -j$c -f <(echo "SELECT
pg_prewarm('copytest_:client_id');") -t1

(with c=1 for the single-threaded case obviously)

The reason for the pg_shmem_allocations_numa is to ensure that shared_buffers
is actually mapped, as otherwise the bottleneck will be the kernel zeroing out
buffers.

The reason for doing -t1 is that I wanted to compare freelist vs clock sweep,
rather than clock sweep in general.

Note that I patched EvictUnpinnedBufferInternal() to call
StrategyFreeBuffer(), otherwise running this a second time won't actually
measure the freelist.  And the first time run after postmaster start will
always be more noisy...

Greetings,

Andres Freund



On Fri, Jul 11, 2025, at 2:50 PM, Nathan Bossart wrote:
> On Fri, Jul 11, 2025 at 01:26:53PM -0400, Greg Burd wrote:
>> This change does remove the have_free_buffer() function used by the
>> contrib/pg_prewarm module.  On the surface this doesn't seem to cause any
>> issues, but honestly I've not thought too deeply on this one.
>
> Hm.  ISTM we'll either need to invent another similarly inexpensive way to
> test for this or to justify to ourselves that it's not necessary.  My guess
> is that we do want to keep autoprewarm from evicting things, but FWIW the
> docs already say "prewarming may also evict other data from cache" [0].

Thank you for spending time reviewing my proposal/patch!

I briefly considered how one might use what was left after surgery to produce a similar boolean signal, to no avail.
I think that autoprewarm was simply trying to warm at most NBuffers and then stop.  The freelist at startup was just a
convenient thing to drain to get that done.  Maybe I'll try adapting autoprewarm to consider that global instead.
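Concretely, something like this shape, perhaps (purely illustrative, not the actual contrib/pg_prewarm code):

```c
#include <stdbool.h>

/* Count the blocks loaded so far and stop at NBuffers, instead of
 * polling have_free_buffer().  Hypothetical helper; names are made up. */
bool
apw_keep_warming(long blocks_loaded, long nbuffers)
{
    /* Warming past NBuffers can only evict pages we just loaded. */
    return blocks_loaded < nbuffers;
}
```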
 

>> Once removed [2] and tests passing [3] I took a long hard look at the
>> buffer_strategy_lock that used to serialize concurrent access to members
>> of BufferStrategyControl and I couldn't find a good reason to keep it
>> around.  Let's review what it is guarding:
>> 
>> completePasses: a count of the number of times the clock-sweep hand wraps
>> around.  StrategySyncStart() provides this to the bgwriter which in turn
>> uses it to compute a strategic location at which to start scanning for
>> pages to evict.  There's an interesting comment that indicates both a
>> "requirement" and an equivocal "but that's highly unlikely and wouldn't
>> be particularly harmful" statement conflicting with itself.  I tried to
>> find a reason that nextVictimBuffers could overflow or that the use of
>> the completePasses value could somehow cause harm if off by one or more
>> in the bgwriter and either I missed it (please tell me) or there isn't
>> one.  However, it does make sense to change completePasses into an atomic
>> value so that it is consistent across backends and in the bgwriter.
>> 
>> bgwprocno: when not -1 is the PID of the allocation notification latch
>> (ProcGlobal->allProcs[bgwprocno].procLatch).  This is a "power savings"
>> feature where the goal is to signal the bgwriter "When a backend starts
>> using buffers again, it will wake us up by setting our latch."  Here the
>> code reads, "Since we don't want to rely on a spinlock for this we force
>> a read from shared memory once, and then set the latch based on that
>> value." and uses INT_ACCESS_ONCE() to read the value and set the latch.
>> The function StrategyNotifyBgWriter() is where bgwprocno is set, I see no
>> reason to use atomic or other synchronization here.
>> 
>> And that's all there is to it now that the freelist is gone.  As a
>> result, IMO it seems unnecessary to require a spin lock for access to
>> BufferStrategyControl.
>
> I haven't followed your line of reasoning closely yet, but I typically
> recommend that patches that replace locks with atomics use functions with
> full barrier semantics (e.g., pg_atomic_read_membarrier_u32(),
> pg_atomic_fetch_add_u32()) to make things easier to reason about.  But that
> might not be as straightforward in cases like StrategySyncStart() where we
> atomically retrieve two values that are used together.  Nevertheless,
> minimizing cognitive load might be nice, and there's a chance it doesn't
> impact performance very much.

Good thought.  I'll review carefully and either explain here the solid reasons I believe they don't need full barrier
semantics or change the patch accordingly.
 

again, thank you for your time, best.

-greg

> [0] https://www.postgresql.org/docs/devel/pgprewarm.html
>
> -- 
> nathan