Home > mailing lists
Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock - Mailing list pgsql-hackers

From	Greg Burd
Subject	Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock
Date	July 17 21:35:13
Msg-id	f0e3c02e-e217-4f04-8dab-1e7e80a228c0@burd.me Whole thread Raw
In response to	Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock (Andres Freund <andres@anarazel.de>)
List	pgsql-hackers
Tree view
On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
> Hi,
>
> On 2025-07-11 13:26:53 -0400, Greg Burd wrote:
>> In conversations [1] recently about considering how best to adapt the 
>> code
>> to become NUMA-aware Andres commented, "FWIW, I've started to wonder 
>> if we
>> shouldn't just get rid of the freelist entirely" and because I'm a 
>> glutton
>> for punishment (and I think this idea has some merit) I took him up 
>> on this
>> task.
>
>> In freelist.c the function StrategyGetBuffer() currently tries first 
>> to use
>> the BufferAccessStrategy, if present, via GetBufferFromRing().  Failing
>> that, the second step is to check the "freelist" as defined by
>> StrategyControl->firstFreeBuffer/lastFreeBuffer to determine if it 
>> contains
>> any available buffers, and finally it will "Use the clock-sweep 
>> algorithm to
>> find a free buffer."  The freelist was intended to be "a list of buffers
>> that are prime candidates for replacement" but I question the value 
>> of that
>> versus the overhead of managing it.  Without the list some operations are
>> likely (I plan to measure this) faster due, as Anders points out in [1],
>> "just using clock sweep actually makes things like DROP TABLE perform 
>> better
>> because it doesn't need to maintain the freelist anymore."  It may be the
>> case that with very large NBuffers where most are in use that this 
>> approach
>> is slower (I plan to measure this too), but in those cases I'd 
>> imagine the
>> freelist is likely empty and so the code will use the clock-sweep 
>> algorithm
>> anyway, so I'm not sure there is a performance penalty at all.
>>
>> This change does remove the have_free_buffer() function used by the
>> contrib/pg_prewarm module.  On the surface this doesn't seem to cause any
>> issues, but honestly I've not thought too deeply on this one.

Thank you for spending time reviewing my proposal and patch!  I value 
your time
and insights and in this case your inspiration. :)

> I think we'll likely need something to replace it.

Fair, this (v5) patch doesn't yet try to address this.

> TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
> right. The goal of the use have_free_buffer() is obviously to stop 
> prewarming
> shared buffers if doing so would just evict buffers.  But it's not 
> clear to me
> that we should just stop when there aren't any free buffers - what if the
> previous buffer contents aren't the right ones?  It'd make more sense 
> to me to
> stop autoprewarm once NBuffers have been prewarmed...

I had the same high level reaction, that autoprewarm was leveraging 
something
convenient but not necessarily required or even correct.  I'd considered 
using
NBuffers as you describe due to similar intuitions, I'll dig into that 
idea for
the next revision after I get to know autoprewarm a bit better.

>> v2-0001 Eliminate the freelist from the buffer manager and depend on 
>> clock-sweep.
>>
>>
>> Once removed [2] and tests passing [3] I took a long hard look at the
>> buffer_strategy_lock that used to serialize concurrent access to 
>> members of
>> BufferStrategyControl and I couldn't find a good reason to keep it 
>> around.
>> Let's review what it is guarding:
>>
>> completePasses: a count of the number of times the clock-sweep hand wraps
>> around.  StrategySyncStart() provides this to the bgwriter which in turn
>> uses it to compute a strategic location at which to start scanning 
>> for pages
>> to evict.  There's an interesting comment that indicates both a
>> "requirement" and an equivocal "but that's highly unlikely and 
>> wouldn't be
>> particularly harmful" statement conflicting with itself.  I tried to 
>> find a
>> reason that nextVictimBuffers could overflow or that the use of the
>> completePasses value could somehow cause harm if off by one or more 
>> in the
>> bgwriter and either I missed it (please tell me) or there isn't one.
>> However, it does make sense to change completePasses into an atomic 
>> value so
>> that it is consistent across backends and in the bgwriter.
>
> I don't think it's *quite* that easy. If you just maintain 
> nextVictimBuffer
> and completePasses as separate atomic counters, without a lock making 
> the two
> consistent, StrategySyncStart(), as coded right now / in your patch, won't
> necessarily return reasonable values for the two.  Which I think would 
> lead to
> bgwriter suddenly becoming overly active for one cycle and then very 
> inactive
> for a while after.

Keeping them separate as atomics still requires coordination that I 
think is unnecessary.  I spent some time and came up with a working 
version [1] where
the two values (nextVictimBuffer and completePasses) were in a singular 
uint64
atomic, but as soon as I had it working I didn't like the idea at all.

> With really large shared buffers that'd be rather rare to be hit, but with
> small shared buffers I think it'd be common enough to worry.

Agreed, not coordinating these values isn't a viable solution.

> The most obvious way around this would be to make the clock hand a 64bit
> atomic, which would avoid the need to have a separate tracker for the 
> number
> of passes.  Unfortunately doing so would require doing a modulo 
> operation each
> clock tick, which I think would likely be too expensive on common 
> platforms -
> on small shared_buffers I actually see existing, relatively rarely 
> reached,
> modulo in ClockSweepTick() show up on a Skylake-X system.

So, this idea came back to me today as I tossed out the union branch and 
started
over.

a) can't require a power of 2 for NBuffers
b) would like a power of 2 for NBuffers to make a few things more efficient
c) a simple uint64 atomic counter would simplify things

The attached (v5) patch takes this approach *and* avoids the modulo you were
concerned with.  My approach is to have nextVictimBuffer as a uint64 
that only
increments (and at some point 200 years or so might wrap around, but I 
digress).
To get the actual "victim" you modulo that, but not with "%" you call
clock_modulo().  In that function I use a "next power of 2" value rather 
than
NBuffers to efficiently find the modulo and adjust for the actual 
value.  Same
for completePasses which is now a function clock_passes() that does similar
trickery and returns the number of times the counter (nextVictimBuffer) has
"wrapped" around modulo NBuffers.

Now that both values exist in the same uint64 it can be the atomic 
vessel that coordinates them, no synchronization problems at all and no 
requirement for
the buffer_strategy_lock.

> It'd be easier if we could rely on NBuffers to be a power of two, but that
> doesn't seem like a realistic requirement.

Yes, this was the good idea I ran with it in a slightly more creative 
way that
doesn't require users to set NBuffers to a power of 2.

> I think the easiest way here would be to make StrategySyncStart(), a
> relatively rare operation, retry whenever it detects that it saw 
> out-of-sync
> nextVictimBuffer. E.g. something roughly like
>
> while (true)
> {
>        completePasses = atomic_read_u32(&StrategyControl->completePasses);
>        pg_memory_barrier();
>        nextVictimBuffer = 
> atomic_read_u32(&StrategyControl->nextVictimBuffer);
>        pg_memory_barrier();
>
>        if (completePasses ==
> atomic_read_u32(&StrategyControl->completePasses) &&
>            nextVictimBuffer <=
> atomic_read_u32(&StrategyControl->nextVictimBuffer))
>           break;
> }
>
> which I think would detect the case that we read a nextVictimBuffer
> value that was
> decreased while reading a completePasses value that was increased.

Okay, interesting idea.  If you dislike this approach I'll circle back and
consider it.

> I think while at it, we should make ClockSweepTick() decrement
> nextVictimBuffer by atomically subtracting NBuffers, rather than using 
> CAS. I
> recently noticed that the CAS sometimes has to retry a fair number of
> times,
> which in turn makes the `victim % NBuffers` show up in profiles.

In my (v5) patch there is one CAS that increments NBuffers.  All other
operations on NBuffers are atomic reads.  The modulo you mention is gone
entirely, unnecessary AFAICT.

>> I'd like to hear ideas for worst cases to test and/or benchmark.  I 
>> plan on
>> attempting what I saw Michael do with flamegraphs before/after/delta in a
>> follow up to this.  If people feel this has merit I'll add it to CF/2.
>
> What I've benchmarked is both single threaded and concurrent clock 
> sweep, by
> doing pg_prewarm() of per-pgbench-client relations. I used
>
> c=40 && psql -Xq -c "select pg_buffercache_evict_all()" -c 'SELECT
> numa_node, sum(size), count(*) FROM pg_shmem_allocations_numa WHERE
> size != 0 GROUP BY numa_node;' && pgbench -n -P1 -c$c -j$c -f <(echo
> "SELECT pg_prewarm('copytest_:client_id');") -t1
>
> (with c=1 for the single-threaded case obviously)
>
> The reason for the pg_shmem_allocations_numa is to ensure that 
> shared_buffers
> is actually mapped, as otherwise the bottleneck will be the kernel 
> zeroing out
> buffers.
>
> The reason for doing -t1 is that I wanted to compare freelist vs clock 
> sweep,
> rather than clock sweep in general.
>
> Note that I patched EvictUnpinnedBufferInternal() to call
> StrategyFreeBuffer(), otherwise running this a second time won't actually
> measure the freelist.  And the first time run after postmaster start will
> always be more noisy...

This is very helpful, thanks! I've started doing some of this but I was
anxious to get this out before the weekend.  I'll work on the prewarm module
and get some benchmarks done next week.

Meanwhile, the tests except for Windows pass [2] for this new patch 
[3].  I'll
dig into the Windows issues next week as well.

> Greetings,
>
> Andres Freund

I'm very curious to hear back your thoughts (or anyone else) on this
approach.

best,

-greg


[1] https://github.com/gburd/postgres/tree/gregburd/rm-freelist/patch-v4
[2] https://github.com/gburd/postgres/pull/6/checks
[3] https://github.com/gburd/postgres/tree/gregburd/rm-freelist/patch-v5
Attachment

pgsql-hackers by date:
From: Andrey Borodin
Date: 17 July, 21:34:05
Subject: Re: IPC/MultixactCreation on the Standby server
From: Laurenz Albe
Date: 17 July, 21:53:13
Subject: Re: Fix PQport to never return NULL if the connection is valid
Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock - Mailing list pgsql-hackers

Attachment

Previous

Next