Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock - Mailing list pgsql-hackers
From: Greg Burd
Subject: Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock
Msg-id: fa0b7e51-75bf-43ac-9285-7b866eb3a032@burd.me
In response to: Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock (Andres Freund <andres@anarazel.de>)
Responses: Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock
List: pgsql-hackers
On 7/18/25 13:03, Andres Freund wrote:
> Hi,

Hello. Thanks again for taking the time to review the email and patch; I think we're onto something good here.

> I'd be curious if anybody wants to argue for keeping the clock sweep. Except
> for the have_free_buffer() use in autoprewarm, it's a rather trivial
> patch. And I really couldn't measure regressions above the noise level, even
> in absurdly extreme use cases.

Hmmm... was "argue for keeping the clock sweep" supposed to read "argue for keeping the freelist"?

> On 2025-07-17 14:35:13 -0400, Greg Burd wrote:
>> On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
>>> I think we'll likely need something to replace it.
>>
>> Fair, this (v5) patch doesn't yet try to address this.
>>
>>> TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
>>> right. The goal of the use of have_free_buffer() is obviously to stop
>>> prewarming shared buffers if doing so would just evict buffers. But it's
>>> not clear to me that we should just stop when there aren't any free
>>> buffers - what if the previous buffer contents aren't the right ones? It'd
>>> make more sense to me to stop autoprewarm once NBuffers have been
>>> prewarmed...
>>
>> I had the same high-level reaction, that autoprewarm was leveraging
>> something convenient but not necessarily required or even correct. I'd
>> considered using NBuffers as you describe due to similar intuitions; I'll
>> dig into that idea for the next revision after I get to know autoprewarm a
>> bit better.
>
> Cool. I do think that'll be good enough.

I re-added the have_free_buffer() function, only now it returns false once nextVictimBuffer > NBuffers, signaling to autoprewarm that the clock has made its first complete pass. With that I reverted my changes in the autoprewarm module. The net should be the same behavior as before at startup when using that module.
>>> The most obvious way around this would be to make the clock hand a 64bit
>>> atomic, which would avoid the need to have a separate tracker for the
>>> number of passes. Unfortunately doing so would require doing a modulo
>>> operation each clock tick, which I think would likely be too expensive on
>>> common platforms - on small shared_buffers I actually see existing,
>>> relatively rarely reached, modulo in ClockSweepTick() show up on a
>>> Skylake-X system.
>>
>> So, this idea came back to me today as I tossed out the union branch and
>> started over.
>>
>> a) can't require a power of 2 for NBuffers
>> b) would like a power of 2 for NBuffers to make a few things more efficient
>> c) a simple uint64 atomic counter would simplify things
>>
>> The attached (v5) patch takes this approach *and* avoids the modulo you
>> were concerned with. My approach is to have nextVictimBuffer as a uint64
>> that only increments (and at some point 200 years or so from now might wrap
>> around, but I digress). To get the actual "victim" you modulo that, but not
>> with "%"; you call clock_modulo(). In that function I use a "next power of
>> 2" value rather than NBuffers to efficiently find the modulo and adjust for
>> the actual value. Same for completePasses, which is now a function
>> clock_passes() that does similar trickery and returns the number of times
>> the counter (nextVictimBuffer) has "wrapped" around modulo NBuffers.
>
> Yea, that could work! It'd be interesting to see some performance numbers
> for this...

Still no performance comparisons yet, but my gut says this should reduce contention across cores on a very hot path, so I'd imagine some performance improvement.

>> Now that both values exist in the same uint64 it can be the atomic vessel
>> that coordinates them, no synchronization problems at all and no
>> requirement for the buffer_strategy_lock.
>
> Nice!
>>> I think while at it, we should make ClockSweepTick() decrement
>>> nextVictimBuffer by atomically subtracting NBuffers, rather than using
>>> CAS. I recently noticed that the CAS sometimes has to retry a fair number
>>> of times, which in turn makes the `victim % NBuffers` show up in profiles.
>>
>> In my (v5) patch there is one CAS that increments NBuffers. All other
>> operations on NBuffers are atomic reads. The modulo you mention is gone
>> entirely, unnecessary AFAICT.
>
> There shouldn't be any CASes needed now, right? Just a fetch-add? The latter
> often scales *way* better under contention.
>
> [Looks at the patch ...]
>
> Which I think is true in your patch, I don't see any CAS.

You are correct, no CAS at all anymore; that was just a mental mistake in the last email. Now there are only atomic reads and a single atomic fetch-add in ClockSweepTick().

>> Meanwhile, the tests except for Windows pass [2] for this new patch [3].
>> I'll dig into the Windows issues next week as well.
>
> FWIW, there are backtraces generated on windows. E.g.
> https://api.cirrus-ci.com/v1/artifact/task/6327899394932736/crashlog/crashlog-postgres.exe_00c0_2025-07-17_19-19-00-008.txt
>
> 000000cd`827fdea0 00007ff7`6ad82f88 ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77]
> 000000cd`827fdee0 00007ff7`6aae2b7c postgres!ExceptionalCondition(
>         char * conditionName = 0x00007ff7`6b2a4cb8 "result < NBuffers",
>         char * fileName = 0x00007ff7`6b2a4c88 "../src/backend/storage/buffer/freelist.c",
>         int lineNumber = 0n139)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
> 000000cd`827fdf20 00007ff7`6aae272c postgres!clock_modulo(
>         unsigned int64 counter = 0x101)+0x6c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 139]
> 000000cd`827fdf60 00007ff7`6aad8647 postgres!StrategySyncStart(
>         unsigned int * complete_passes = 0x000000cd`827fdfc0,
>         unsigned int * num_buf_alloc = 0x000000cd`827fdfcc)+0x2c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 300]
> 000000cd`827fdfa0 00007ff7`6aa254a3 postgres!BgBufferSync(
>         struct WritebackContext * wb_context = 0x000000cd`827fe180)+0x37 [c:\cirrus\src\backend\storage\buffer\bufmgr.c @ 3649]
> 000000cd`827fe030 00007ff7`6aa278a7 postgres!BackgroundWriterMain(
>         void * startup_data = 0x00000000`00000000,
>         unsigned int64 startup_data_len = 0)+0x243 [c:\cirrus\src\backend\postmaster\bgwriter.c @ 236]
> 000000cd`827ff5a0 00007ff7`6a8daf19 postgres!SubPostmasterMain(
>         int argc = 0n3,
>         char ** argv = 0x0000028f`e75d24d0)+0x2f7 [c:\cirrus\src\backend\postmaster\launch_backend.c @ 714]
> 000000cd`827ff620 00007ff7`6af0f5a9 postgres!main(
>         int argc = 0n3,
>         char ** argv = 0x0000028f`e75d24d0)+0x329 [c:\cirrus\src\backend\main\main.c @ 222]
>
> I.e. your new assertion failed for some reason that I can't *immediately* see.

I put that in as a precaution and as a way to communicate the intention of the other code above it. I never imagined it would assert.
I've changed clock_read() to only assert when the modulo differs, and left that assert in the calling ClockSweepTick() function because it was redundant; I'm curious to see if we see a similar assert when testing the modulo.

>> @@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
>>
>>  	/*
>>  	 * Compute strategy_delta = how many buffers have been scanned by the
>> -	 * clock sweep since last time.  If first time through, assume none.  Then
>> -	 * see if we are still ahead of the clock sweep, and if so, how many
>> +	 * clock-sweep since last time.  If first time through, assume none.  Then
>> +	 * see if we are still ahead of the clock-sweep, and if so, how many
>>  	 * buffers we could scan before we'd catch up with it and "lap" it. Note:
>>  	 * weird-looking coding of xxx_passes comparisons are to avoid bogus
>>  	 * behavior when the passes counts wrap around.
>>  	 */
>>  	if (saved_info_valid)
>>  	{
>> -		int32		passes_delta = strategy_passes - prev_strategy_passes;
>> +		int32		passes_delta;
>> +
>> +		if (unlikely(prev_strategy_passes > strategy_passes))
>> +		{
>> +			/* wrap-around case */
>> +			passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes);
>> +		}
>> +		else
>> +		{
>> +			passes_delta = (int32) (strategy_passes - prev_strategy_passes);
>> +		}
>>
>>  		strategy_delta = strategy_buf_id - prev_strategy_buf_id;
>>  		strategy_delta += (long) passes_delta * NBuffers;
>
> That seems somewhat independent of the rest of the change, or am I missing
> something?

That change is there to cover the possibility of someone managing to overflow and wrap a uint64, which is *highly* unlikely. If this degree of paranoia isn't required I'm happy to remove it.

>> +static uint32 NBuffersPow2;		/* NBuffers rounded up to the next power of 2 */
>> +static uint32 NBuffersPow2Shift;	/* Amount to bitshift NBuffers for
>> +									 * division */
>
> For performance in ClockSweepTick() it might make more sense to store the
> mask (i.e. NBuffersPow2 - 1), rather than the actual power of two.
Agreed, I've done that and created one more calculated value that could be pre-computed once and never again (unless NBuffers changes) at runtime.

> Greetings,
>
> Andres Freund

Thanks again for the review. v6 is attached and re-based onto afa5c365ec5, also on GitHub at [1][2].

-greg

[1] https://github.com/gburd/postgres/pull/7/checks
[2] https://github.com/gburd/postgres/tree/gregburd/rm-freelist/patch-v6