Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock - Mailing list pgsql-hackers

From Greg Burd
Subject Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock
Date
Msg-id fa0b7e51-75bf-43ac-9285-7b866eb3a032@burd.me
Whole thread Raw
In response to Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock  (Andres Freund <andres@anarazel.de>)
Responses Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock
List pgsql-hackers
On 7/18/25 13:03, Andres Freund wrote:
> Hi,

Hello.  Thanks again for taking the time to review the email and patch,
I think we're onto something good here.

>
> I'd be curious if anybody wants to argue for keeping the clock sweep. Except
> for the have_free_buffer() use in autoprewarm, it's a rather trivial
> patch. And I really couldn't measure regressions above the noise level, even
> if absurdly extreme use cases.

Hmmm... was "argue for keeping the clock sweep" supposed to read "argue
for keeping the freelist"?

> On 2025-07-17 14:35:13 -0400, Greg Burd wrote:
>> On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
>>> I think we'll likely need something to replace it.
>> Fair, this (v5) patch doesn't yet try to address this.
>>
>>> TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
>>> right. The goal of the use have_free_buffer() is obviously to stop
>>> prewarming
>>> shared buffers if doing so would just evict buffers.  But it's not clear
>>> to me
>>> that we should just stop when there aren't any free buffers - what if the
>>> previous buffer contents aren't the right ones?  It'd make more sense to
>>> me to
>>> stop autoprewarm once NBuffers have been prewarmed...
>> I had the same high level reaction, that autoprewarm was leveraging
>> something
>> convenient but not necessarily required or even correct.  I'd considered
>> using
>> NBuffers as you describe due to similar intuitions, I'll dig into that idea
>> for
>> the next revision after I get to know autoprewarm a bit better.
> Cool. I do think that'll be good enough.

I re-added the have_free_buffer() function only now it returns false
once nextVictimBuffer > NBuffers signaling to autoprewarm that the clock
has made its first complete pass.  With that I reverted my changes in
the autoprewarm module.  The net should be the same behavior as before
at startup when using that module.

>>> The most obvious way around this would be to make the clock hand a 64bit
>>> atomic, which would avoid the need to have a separate tracker for the
>>> number
>>> of passes.  Unfortunately doing so would require doing a modulo
>>> operation each
>>> clock tick, which I think would likely be too expensive on common
>>> platforms -
>>> on small shared_buffers I actually see existing, relatively rarely
>>> reached,
>>> modulo in ClockSweepTick() show up on a Skylake-X system.
>> So, this idea came back to me today as I tossed out the union branch and
>> started
>> over.
>>
>> a) can't require a power of 2 for NBuffers
>> b) would like a power of 2 for NBuffers to make a few things more efficient
>> c) a simple uint64 atomic counter would simplify things
>>
>> The attached (v5) patch takes this approach *and* avoids the modulo you were
>> concerned with.  My approach is to have nextVictimBuffer as a uint64 that
>> only
>> increments (and at some point 200 years or so might wrap around, but I
>> digress).
>> To get the actual "victim" you modulo that, but not with "%" you call
>> clock_modulo().  In that function I use a "next power of 2" value rather
>> than
>> NBuffers to efficiently find the modulo and adjust for the actual value.
>> Same
>> for completePasses which is now a function clock_passes() that does similar
>> trickery and returns the number of times the counter (nextVictimBuffer) has
>> "wrapped" around modulo NBuffers.
> Yea, that could work!  It'd be interesting to see some performance numbers for
> this...

Still no performance comparisons yet, but my gut says this should reduce
contention across cores on a very hot path so I'd imagine some
performance improvement.

>> Now that both values exist in the same uint64 it can be the atomic vessel
>> that coordinates them, no synchronization problems at all and no requirement
>> for the buffer_strategy_lock.
> Nice!
>
>
>>> I think while at it, we should make ClockSweepTick() decrement
>>> nextVictimBuffer by atomically subtracting NBuffers, rather than using
>>> CAS. I
>>> recently noticed that the CAS sometimes has to retry a fair number of
>>> times,
>>> which in turn makes the `victim % NBuffers` show up in profiles.
>> In my (v5) patch there is one CAS that increments NBuffers.  All other
>> operations on NBuffers are atomic reads.  The modulo you mention is gone
>> entirely, unnecessary AFAICT.
> There shouldn't be any CASes needed now, right? Just a fetch-add? The latter
> often scales *way* better under contention.
>
> [Looks at the patch ...]
>
> Which I think is true in your patch, I don't see any CAS.

You are correct, no CAS at all anymore just a mental mistake in the last
email.  Now there are only atomic reads and single atomic fetch-add in
ClockSweepTick().

>> Meanwhile, the tests except for Windows pass [2] for this new patch [3].
>> I'll dig into the Windows issues next week as well.
> FWIW, there are backtraces generated on windows. E.g.
>
>
https://api.cirrus-ci.com/v1/artifact/task/6327899394932736/crashlog/crashlog-postgres.exe_00c0_2025-07-17_19-19-00-008.txt
>
> 000000cd`827fdea0 00007ff7`6ad82f88     ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp
@77]
 
> 000000cd`827fdee0 00007ff7`6aae2b7c     postgres!ExceptionalCondition(
>             char * conditionName = 0x00007ff7`6b2a4cb8 "result < NBuffers",
>             char * fileName = 0x00007ff7`6b2a4c88 "../src/backend/storage/buffer/freelist.c",
>             int lineNumber = 0n139)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
> 000000cd`827fdf20 00007ff7`6aae272c     postgres!clock_modulo(
>             unsigned int64 counter = 0x101)+0x6c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 139]
> 000000cd`827fdf60 00007ff7`6aad8647     postgres!StrategySyncStart(
>             unsigned int * complete_passes = 0x000000cd`827fdfc0,
>             unsigned int * num_buf_alloc = 0x000000cd`827fdfcc)+0x2c [c:\cirrus\src\backend\storage\buffer\freelist.c
@300]
 
> 000000cd`827fdfa0 00007ff7`6aa254a3     postgres!BgBufferSync(
>             struct WritebackContext * wb_context = 0x000000cd`827fe180)+0x37
[c:\cirrus\src\backend\storage\buffer\bufmgr.c@ 3649]
 
> 000000cd`827fe030 00007ff7`6aa278a7     postgres!BackgroundWriterMain(
>             void * startup_data = 0x00000000`00000000,
>             unsigned int64 startup_data_len = 0)+0x243 [c:\cirrus\src\backend\postmaster\bgwriter.c @ 236]
> 000000cd`827ff5a0 00007ff7`6a8daf19     postgres!SubPostmasterMain(
>             int argc = 0n3,
>             char ** argv = 0x0000028f`e75d24d0)+0x2f7 [c:\cirrus\src\backend\postmaster\launch_backend.c @ 714]
> 000000cd`827ff620 00007ff7`6af0f5a9     postgres!main(
>             int argc = 0n3,
>             char ** argv = 0x0000028f`e75d24d0)+0x329 [c:\cirrus\src\backend\main\main.c @ 222]
>
> I.e. your new assertion failed for some reason that i can't *immediately* see.
I put that in as a precaution and as a way to communicate the intention
of the other code above it.  I never imagined it would assert.  I've
changed clock_read() to only assert when the modulo differs and left
that assert in the calling ClockSweepTick() function because it was
redundant and I'm curious to see if we see a similar assert when testing
the modulo.
>> @@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
>>
>>      /*
>>       * Compute strategy_delta = how many buffers have been scanned by the
>> -     * clock sweep since last time.  If first time through, assume none. Then
>> -     * see if we are still ahead of the clock sweep, and if so, how many
>> +     * clock-sweep since last time.  If first time through, assume none. Then
>> +     * see if we are still ahead of the clock-sweep, and if so, how many
>>       * buffers we could scan before we'd catch up with it and "lap" it. Note:
>>       * weird-looking coding of xxx_passes comparisons are to avoid bogus
>>       * behavior when the passes counts wrap around.
>>       */
>>      if (saved_info_valid)
>>      {
>> -        int32        passes_delta = strategy_passes - prev_strategy_passes;
>> +        int32        passes_delta;
>> +
>> +        if (unlikely(prev_strategy_passes > strategy_passes))
>> +        {
>> +            /* wrap-around case */
>> +            passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes);
>> +        }
>> +        else
>> +        {
>> +            passes_delta = (int32) (strategy_passes - prev_strategy_passes);
>> +        }
>>
>>          strategy_delta = strategy_buf_id - prev_strategy_buf_id;
>>          strategy_delta += (long) passes_delta * NBuffers;
> That seems somewhat independent of the rest of the change, or am I missing something?

That change is there to cover the possibility of someone managing to
overflow and wrap a uint64 which is *highly* unlikely.  If this degree
of paranoia isn't required I'm happy to remove it.

>> +static uint32 NBuffersPow2;        /* NBuffers rounded up to the next power of 2 */
>> +static uint32 NBuffersPow2Shift;    /* Amount to bitshift NBuffers for
>> +                                     * division */
> For performance in ClockSweepTick() it might more sense to store the mask
> (i.e. NBuffersPow2 - 1), rather than the actual power of two.
Agreed, I've done that and created one more calculated value that could
be pre-computed once and never again (unless NBuffers changes) at runtime.
> Greetings,
>
> Andres Freund


thanks again for the review, v6 attached and re-based onto afa5c365ec5,
also on GitHub at [1][2].


-greg


[1] https://github.com/gburd/postgres/pull/7/checks

[2] https://github.com/gburd/postgres/tree/gregburd/rm-freelist/patch-v6

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Proposal: QUALIFY clause
Next
From: Masahiko Sawada
Date:
Subject: Re: Conflict detection for update_deleted in logical replication