Re: O(n) tasks cause lengthy startups and checkpoints - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: O(n) tasks cause lengthy startups and checkpoints
Date
Msg-id CALj2ACVod5vJS7M7LqmPXfnhmxRPKB2CQKYBvUK3iqhomzqh1Q@mail.gmail.com
Whole thread Raw
In response to Re: O(n) tasks cause lengthy startups and checkpoints  (Nathan Bossart <nathandbossart@gmail.com>)
Responses Re: O(n) tasks cause lengthy startups and checkpoints
List pgsql-hackers
On Wed, Nov 30, 2022 at 10:48 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
>
> cfbot is not happy with v16.  AFAICT this is just due to poor placement, so
> here is another attempt with the tests moved to a new location.  Apologies
> for the noise.

Thanks for the patches. I spent some time on reviewing v17 patch set
and here are my comments:

0001:
1. I think the custodian process needs documentation - it needs a
definition in glossary.sgml and perhaps a dedicated page describing
what tasks it takes care of.

2.
+        LWLockReleaseAll();
+        ConditionVariableCancelSleep();
+        AbortBufferIO();
+        UnlockBuffers();
+        ReleaseAuxProcessResources(false);
+        AtEOXact_Buffers(false);
+        AtEOXact_SMgr();
+        AtEOXact_Files(false);
+        AtEOXact_HashTables(false);
Do we need all of these in the exit path? Isn't the stuff that
ShutdownAuxiliaryProcess() does enough for the custodian process?
AFAICS, the custodian process uses LWLocks (which the
ShutdownAuxiliaryProcess() takes care of) and it doesn't access shared
buffers and so on.
Having said that, I'm fine to keep them for future use and all of
those cleanup functions exit if nothing related occurs.

3.
+     * Advertise out latch that backends can use to wake us up while we're
Typo - %s/out/our

4. Is it a good idea to add log messages in the DoCustodianTasks()
loop? Maybe at a debug level? The log message can say the current task
the custodian is processing. And/Or setting the custodian's status on
the ps display is also a good idea IMO.

0002 and 0003:
1.
+CHECKPOINT;
+DO $$
I think we need to ensure that there are some snapshot files before
the checkpoint. Otherwise, it may happen that the above test case
exits without the custodian process doing anything.

2. I think the best way to test the custodian process code is by
adding a TAP test module to see actually the custodian process kicks
in. Perhaps, add elog(DEBUGX,...) messages to various custodian
process functions and see if we see the logs in server logs.

0004:
I think the 0004 patch can be merged into 0001, 0002 and 0003 patches.
Otherwise the patch LGTM.

Few thoughts:
1. I think we can trivially extend the custodian process to remove any
future WAL files on the old timeline, something like the attached
0001-Move-removal-of-future-WAL-files-on-the-old-timeline.text file).
While this offloads the recovery a bit, the server may archive such
WAL files before the custodian removes them. We can do a bit more to
stop the server from archiving such WAL files, but that needs more
coding. I don't think we need to do all that now, perhaps, we can give
it a try once the basic custodian stuff gets in.
2. Moving RemovePgTempFiles() to the custodian can bring up the server
soon. The idea is that the postmaster just renames the temp
directories and informs the custodian so that it can go delete such
temp files and directories. I have personally seen cases where the
server spent a good amount of time cleaning up temp files. We can park
it for later.
3. Moving RemoveOldXlogFiles() to the custodian can make checkpoints faster.
4. PreallocXlogFiles() - if we ever have plans to make pre-allocation
more aggressive (pre-allocate more than 1 WAL file), perhaps letting
custodian do that is a good idea. Again, too many tasks for a single
process.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Perform streaming logical transactions by background workers and parallel apply
Next
From: Dean Rasheed
Date:
Subject: Re: Non-decimal integer literals