Re: O(n) tasks cause lengthy startups and checkpoints - Mailing list pgsql-hackers

From Nathan Bossart
Subject Re: O(n) tasks cause lengthy startups and checkpoints
Date
Msg-id 20221202191507.GA2277157@nathanxps13
Whole thread Raw
In response to Re: O(n) tasks cause lengthy startups and checkpoints  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses Re: O(n) tasks cause lengthy startups and checkpoints
List pgsql-hackers
On Fri, Dec 02, 2022 at 12:11:35PM +0530, Bharath Rupireddy wrote:
> On Fri, Dec 2, 2022 at 3:10 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> The test appears to reliably create snapshot and mapping files, so if the
>> directories are empty at some point after the checkpoint at the end, we can
>> be reasonably certain the custodian took action.  I didn't add explicit
>> checks that there are files in the directories before the checkpoint
>> because a concurrent checkpoint could make such checks unreliable.
> 
> I think you're right. I added sqls to see if the snapshot and mapping
> files count > 0, see [1] and the cirrus-ci members are happy too -
> https://github.com/BRupireddy/postgres/tree/custodian_review_2. I
> think we can consider adding these count > 0 checks to tests.

My worry about adding "count > 0" checks is that a concurrent checkpoint
could make them unreliable.  In other words, those checks might ordinarily
work, but if an automatic checkpoint causes the files be cleaned up just
beforehand, they will fail.

> Having said above, I'm okay to process ShutdownRequestPending as early
> as possible, however, should we also add CHECK_FOR_INTERRUPTS()
> alongside ShutdownRequestPending?

I'm not seeing a need for CHECK_FOR_INTERRUPTS.  Do you see one?

> While thinking about this, one thing that really struck me is what
> happens if we let the custodian exit, say after processing
> ShutdownRequestPending immediately or after a restart, leaving other
> queued tasks? The custodian will never get to work on those tasks
> unless the requestors (say checkpoint or some other process) requests
> it to do so after restart. Maybe, we don't need to worry about it.
> Maybe we need to worry about it. Maybe it's an overkill to save the
> custodian's task state to disk so that it can come up and do the
> leftover tasks upon restart.

Yes, tasks will need to be retried when the server starts again.  The ones
in this patch set should be requested again during the next checkpoint.
Temporary file cleanup would always be requested during server start, so
that should be handled as well.  Even today, the server might abruptly shut
down while executing these tasks, and we don't have any special handling
for that.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Corey Huinker
Date:
Subject: Re: Error-safe user functions
Next
From: Nathan Bossart
Date:
Subject: Re: wake up logical workers after ALTER SUBSCRIPTION