Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock
Date
Msg-id 4h5wgvkjsrjtf7km7b74lag6abmzhrhmfvutzymetetv4zv6yn@ms7wbujlyip7
Whole thread Raw
In response to Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock  (Greg Burd <greg@burd.me>)
List pgsql-hackers
Hi,

On 2025-07-21 13:37:04 -0400, Greg Burd wrote:
> On 7/18/25 13:03, Andres Freund wrote:
> Hello.  Thanks again for taking the time to review the email and patch,
> I think we're onto something good here.
> 
> >
> > I'd be curious if anybody wants to argue for keeping the clock sweep. Except
> > for the have_free_buffer() use in autoprewarm, it's a rather trivial
> > patch. And I really couldn't measure regressions above the noise level, even
> > if absurdly extreme use cases.
> 
> Hmmm... was "argue for keeping the clock sweep" supposed to read "argue
> for keeping the freelist"?

Err, yes :(


> > On 2025-07-17 14:35:13 -0400, Greg Burd wrote:
> >> On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
> >>> I think we'll likely need something to replace it.
> >> Fair, this (v5) patch doesn't yet try to address this.
> >>
> >>> TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
> >>> right. The goal of the use have_free_buffer() is obviously to stop
> >>> prewarming
> >>> shared buffers if doing so would just evict buffers.  But it's not clear
> >>> to me
> >>> that we should just stop when there aren't any free buffers - what if the
> >>> previous buffer contents aren't the right ones?  It'd make more sense to
> >>> me to
> >>> stop autoprewarm once NBuffers have been prewarmed...
> >> I had the same high level reaction, that autoprewarm was leveraging
> >> something
> >> convenient but not necessarily required or even correct.  I'd considered
> >> using
> >> NBuffers as you describe due to similar intuitions, I'll dig into that idea
> >> for
> >> the next revision after I get to know autoprewarm a bit better.
> > Cool. I do think that'll be good enough.
> 
> I re-added the have_free_buffer() function only now it returns false
> once nextVictimBuffer > NBuffers signaling to autoprewarm that the clock
> has made its first complete pass.  With that I reverted my changes in
> the autoprewarm module.  The net should be the same behavior as before
> at startup when using that module.

I don't think we should have a have_free_buffer() that doesn't actually test
whether we have a free buffer, that seems too likely to cause
misunderstandings down the line.  What if we instead just limit the amount of
buffers we load in apw_load_buffers()? apw_load_buffers() knows NBuffers and
the number of to-be-loaded buffers, so that shouldn't be hard.


> >> Meanwhile, the tests except for Windows pass [2] for this new patch [3].
> >> I'll dig into the Windows issues next week as well.
> > FWIW, there are backtraces generated on windows. E.g.
> >
> >
https://api.cirrus-ci.com/v1/artifact/task/6327899394932736/crashlog/crashlog-postgres.exe_00c0_2025-07-17_19-19-00-008.txt
> >
> > 000000cd`827fdea0 00007ff7`6ad82f88     ucrtbased!abort(void)+0x5a
[minkernel\crts\ucrt\src\appcrt\startup\abort.cpp@ 77]
 
> > 000000cd`827fdee0 00007ff7`6aae2b7c     postgres!ExceptionalCondition(
> >             char * conditionName = 0x00007ff7`6b2a4cb8 "result < NBuffers",
> >             char * fileName = 0x00007ff7`6b2a4c88 "../src/backend/storage/buffer/freelist.c",
> >             int lineNumber = 0n139)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
> > 000000cd`827fdf20 00007ff7`6aae272c     postgres!clock_modulo(
> >             unsigned int64 counter = 0x101)+0x6c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 139]
> > 000000cd`827fdf60 00007ff7`6aad8647     postgres!StrategySyncStart(
> >             unsigned int * complete_passes = 0x000000cd`827fdfc0,
> >             unsigned int * num_buf_alloc = 0x000000cd`827fdfcc)+0x2c
[c:\cirrus\src\backend\storage\buffer\freelist.c@ 300]
 
> > 000000cd`827fdfa0 00007ff7`6aa254a3     postgres!BgBufferSync(
> >             struct WritebackContext * wb_context = 0x000000cd`827fe180)+0x37
[c:\cirrus\src\backend\storage\buffer\bufmgr.c@ 3649]
 
> > 000000cd`827fe030 00007ff7`6aa278a7     postgres!BackgroundWriterMain(
> >             void * startup_data = 0x00000000`00000000,
> >             unsigned int64 startup_data_len = 0)+0x243 [c:\cirrus\src\backend\postmaster\bgwriter.c @ 236]
> > 000000cd`827ff5a0 00007ff7`6a8daf19     postgres!SubPostmasterMain(
> >             int argc = 0n3,
> >             char ** argv = 0x0000028f`e75d24d0)+0x2f7 [c:\cirrus\src\backend\postmaster\launch_backend.c @ 714]
> > 000000cd`827ff620 00007ff7`6af0f5a9     postgres!main(
> >             int argc = 0n3,
> >             char ** argv = 0x0000028f`e75d24d0)+0x329 [c:\cirrus\src\backend\main\main.c @ 222]
> >
> > I.e. your new assertion failed for some reason that i can't *immediately* see.
> I put that in as a precaution and as a way to communicate the intention
> of the other code above it.  I never imagined it would assert.  I've
> changed clock_read() to only assert when the modulo differs and left
> that assert in the calling ClockSweepTick() function because it was
> redundant and I'm curious to see if we see a similar assert when testing
> the modulo.

Do you understand why it triggered? Because I don't immediately. The fact that
it triggered only on windows, where the compiler is rather different, makes it
worth understanding imo.


> >> @@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
> >>
> >>      /*
> >>       * Compute strategy_delta = how many buffers have been scanned by the
> >> -     * clock sweep since last time.  If first time through, assume none. Then
> >> -     * see if we are still ahead of the clock sweep, and if so, how many
> >> +     * clock-sweep since last time.  If first time through, assume none. Then
> >> +     * see if we are still ahead of the clock-sweep, and if so, how many
> >>       * buffers we could scan before we'd catch up with it and "lap" it. Note:
> >>       * weird-looking coding of xxx_passes comparisons are to avoid bogus
> >>       * behavior when the passes counts wrap around.
> >>       */
> >>      if (saved_info_valid)
> >>      {
> >> -        int32        passes_delta = strategy_passes - prev_strategy_passes;
> >> +        int32        passes_delta;
> >> +
> >> +        if (unlikely(prev_strategy_passes > strategy_passes))
> >> +        {
> >> +            /* wrap-around case */
> >> +            passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes);
> >> +        }
> >> +        else
> >> +        {
> >> +            passes_delta = (int32) (strategy_passes - prev_strategy_passes);
> >> +        }
> >>
> >>          strategy_delta = strategy_buf_id - prev_strategy_buf_id;
> >>          strategy_delta += (long) passes_delta * NBuffers;
> > That seems somewhat independent of the rest of the change, or am I missing something?
> 
> That change is there to cover the possibility of someone managing to
> overflow and wrap a uint64 which is *highly* unlikely.

That risk existed previously too - I'm not against shoring things up, I'd just
do it in a precursor commit, to make this easier to review.

> If this degree of paranoia isn't required I'm happy to remove it.

That does indeed seem really unlikely.  Assuming that postgres stays up for 10
years without a single restart, it'd be ~59 billion ticks a second.

I don't mind a defense, but I think we'd be better off putting it into
ClockSweepTick() or such, simply erroring out if we ever hit this. It's
unlikely that we'd get (and keep) all the relevant untested code correct ime.
Then we also can assert that prev_strategy_passes <= strategy_passes.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Thom Brown
Date:
Subject: Re: Proposal: QUALIFY clause
Next
From: Marko Tiikkaja
Date:
Subject: Re: Proposal: QUALIFY clause