Home > mailing lists

Re: O(n) tasks cause lengthy startups and checkpoints - Mailing list pgsql-hackers

From	Nathan Bossart
Subject	Re: O(n) tasks cause lengthy startups and checkpoints
Date	December 2, 2022 19:15:07
Msg-id	20221202191507.GA2277157@nathanxps13 Whole thread
In response to	Re: O(n) tasks cause lengthy startups and checkpoints (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses	Re: O(n) tasks cause lengthy startups and checkpoints
List	pgsql-hackers

Tree view

On Fri, Dec 02, 2022 at 12:11:35PM +0530, Bharath Rupireddy wrote:
> On Fri, Dec 2, 2022 at 3:10 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> The test appears to reliably create snapshot and mapping files, so if the
>> directories are empty at some point after the checkpoint at the end, we can
>> be reasonably certain the custodian took action.  I didn't add explicit
>> checks that there are files in the directories before the checkpoint
>> because a concurrent checkpoint could make such checks unreliable.
> 
> I think you're right. I added sqls to see if the snapshot and mapping
> files count > 0, see [1] and the cirrus-ci members are happy too -
> https://github.com/BRupireddy/postgres/tree/custodian_review_2. I
> think we can consider adding these count > 0 checks to tests.

My worry about adding "count > 0" checks is that a concurrent checkpoint
could make them unreliable.  In other words, those checks might ordinarily
work, but if an automatic checkpoint causes the files be cleaned up just
beforehand, they will fail.

> Having said above, I'm okay to process ShutdownRequestPending as early
> as possible, however, should we also add CHECK_FOR_INTERRUPTS()
> alongside ShutdownRequestPending?

I'm not seeing a need for CHECK_FOR_INTERRUPTS.  Do you see one?

> While thinking about this, one thing that really struck me is what
> happens if we let the custodian exit, say after processing
> ShutdownRequestPending immediately or after a restart, leaving other
> queued tasks? The custodian will never get to work on those tasks
> unless the requestors (say checkpoint or some other process) requests
> it to do so after restart. Maybe, we don't need to worry about it.
> Maybe we need to worry about it. Maybe it's an overkill to save the
> custodian's task state to disk so that it can come up and do the
> leftover tasks upon restart.

Yes, tasks will need to be retried when the server starts again.  The ones
in this patch set should be requested again during the next checkpoint.
Temporary file cleanup would always be requested during server start, so
that should be handled as well.  Even today, the server might abruptly shut
down while executing these tasks, and we don't have any special handling
for that.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

pgsql-hackers by date:

From: Corey Huinker
Date: 02 December 2022, 19:06:09
Subject: Re: Error-safe user functions

From: Nathan Bossart
Date: 02 December 2022, 19:21:01
Subject: Re: wake up logical workers after ALTER SUBSCRIPTION

Re: O(n) tasks cause lengthy startups and checkpoints - Mailing list pgsql-hackers

Previous

Next