Re: O(n) tasks cause lengthy startups and checkpoints - Mailing list pgsql-hackers
From | Bharath Rupireddy |
---|---|
Subject | Re: O(n) tasks cause lengthy startups and checkpoints |
Date | |
Msg-id | CALj2ACVod5vJS7M7LqmPXfnhmxRPKB2CQKYBvUK3iqhomzqh1Q@mail.gmail.com Whole thread Raw |
In response to | Re: O(n) tasks cause lengthy startups and checkpoints (Nathan Bossart <nathandbossart@gmail.com>) |
Responses |
Re: O(n) tasks cause lengthy startups and checkpoints
|
List | pgsql-hackers |
On Wed, Nov 30, 2022 at 10:48 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > > cfbot is not happy with v16. AFAICT this is just due to poor placement, so > here is another attempt with the tests moved to a new location. Apologies > for the noise. Thanks for the patches. I spent some time on reviewing v17 patch set and here are my comments: 0001: 1. I think the custodian process needs documentation - it needs a definition in glossary.sgml and perhaps a dedicated page describing what tasks it takes care of. 2. + LWLockReleaseAll(); + ConditionVariableCancelSleep(); + AbortBufferIO(); + UnlockBuffers(); + ReleaseAuxProcessResources(false); + AtEOXact_Buffers(false); + AtEOXact_SMgr(); + AtEOXact_Files(false); + AtEOXact_HashTables(false); Do we need all of these in the exit path? Isn't the stuff that ShutdownAuxiliaryProcess() does enough for the custodian process? AFAICS, the custodian process uses LWLocks (which the ShutdownAuxiliaryProcess() takes care of) and it doesn't access shared buffers and so on. Having said that, I'm fine to keep them for future use and all of those cleanup functions exit if nothing related occurs. 3. + * Advertise out latch that backends can use to wake us up while we're Typo - %s/out/our 4. Is it a good idea to add log messages in the DoCustodianTasks() loop? Maybe at a debug level? The log message can say the current task the custodian is processing. And/Or setting the custodian's status on the ps display is also a good idea IMO. 0002 and 0003: 1. +CHECKPOINT; +DO $$ I think we need to ensure that there are some snapshot files before the checkpoint. Otherwise, it may happen that the above test case exits without the custodian process doing anything. 2. I think the best way to test the custodian process code is by adding a TAP test module to see actually the custodian process kicks in. Perhaps, add elog(DEBUGX,...) messages to various custodian process functions and see if we see the logs in server logs. 0004: I think the 0004 patch can be merged into 0001, 0002 and 0003 patches. Otherwise the patch LGTM. Few thoughts: 1. I think we can trivially extend the custodian process to remove any future WAL files on the old timeline, something like the attached 0001-Move-removal-of-future-WAL-files-on-the-old-timeline.text file). While this offloads the recovery a bit, the server may archive such WAL files before the custodian removes them. We can do a bit more to stop the server from archiving such WAL files, but that needs more coding. I don't think we need to do all that now, perhaps, we can give it a try once the basic custodian stuff gets in. 2. Moving RemovePgTempFiles() to the custodian can bring up the server soon. The idea is that the postmaster just renames the temp directories and informs the custodian so that it can go delete such temp files and directories. I have personally seen cases where the server spent a good amount of time cleaning up temp files. We can park it for later. 3. Moving RemoveOldXlogFiles() to the custodian can make checkpoints faster. 4. PreallocXlogFiles() - if we ever have plans to make pre-allocation more aggressive (pre-allocate more than 1 WAL file), perhaps letting custodian do that is a good idea. Again, too many tasks for a single process. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: