Re: O(n) tasks cause lengthy startups and checkpoints - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: O(n) tasks cause lengthy startups and checkpoints
Date
Msg-id CALj2ACW1F-T866iaqA_9h9fA+Axdyc86CHz7_k55RGH5HH2rtA@mail.gmail.com
Whole thread Raw
In response to Re: O(n) tasks cause lengthy startups and checkpoints  (Nathan Bossart <nathandbossart@gmail.com>)
Responses Re: O(n) tasks cause lengthy startups and checkpoints
List pgsql-hackers
On Sat, Dec 3, 2022 at 12:45 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Fri, Dec 02, 2022 at 12:11:35PM +0530, Bharath Rupireddy wrote:
> > On Fri, Dec 2, 2022 at 3:10 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> The test appears to reliably create snapshot and mapping files, so if the
> >> directories are empty at some point after the checkpoint at the end, we can
> >> be reasonably certain the custodian took action.  I didn't add explicit
> >> checks that there are files in the directories before the checkpoint
> >> because a concurrent checkpoint could make such checks unreliable.
> >
> > I think you're right. I added sqls to see if the snapshot and mapping
> > files count > 0, see [1] and the cirrus-ci members are happy too -
> > https://github.com/BRupireddy/postgres/tree/custodian_review_2. I
> > think we can consider adding these count > 0 checks to tests.
>
> My worry about adding "count > 0" checks is that a concurrent checkpoint
> could make them unreliable.  In other words, those checks might ordinarily
> work, but if an automatic checkpoint causes the files be cleaned up just
> beforehand, they will fail.

Hm. It would have been better with a TAP test module for testing the
custodian code reliably. Anyway, that mustn't stop the patch getting
in. If required, we can park the TAP test module for later - IMO.
Others may have different thoughts here.

> > Having said above, I'm okay to process ShutdownRequestPending as early
> > as possible, however, should we also add CHECK_FOR_INTERRUPTS()
> > alongside ShutdownRequestPending?
>
> I'm not seeing a need for CHECK_FOR_INTERRUPTS.  Do you see one?

Since the custodian has SignalHandlerForShutdownRequest as SIGINT and
SIGTERM handlers, unlike StatementCancelHandler and die respectively,
no need of CFI I guess. And also none of the CFI signal handler flags
applies to the custodian.

> > While thinking about this, one thing that really struck me is what
> > happens if we let the custodian exit, say after processing
> > ShutdownRequestPending immediately or after a restart, leaving other
> > queued tasks? The custodian will never get to work on those tasks
> > unless the requestors (say checkpoint or some other process) requests
> > it to do so after restart. Maybe, we don't need to worry about it.
> > Maybe we need to worry about it. Maybe it's an overkill to save the
> > custodian's task state to disk so that it can come up and do the
> > leftover tasks upon restart.
>
> Yes, tasks will need to be retried when the server starts again.  The ones
> in this patch set should be requested again during the next checkpoint.
> Temporary file cleanup would always be requested during server start, so
> that should be handled as well.  Even today, the server might abruptly shut
> down while executing these tasks, and we don't have any special handling
> for that.

Right.

The v18 patch set posted upthread
https://www.postgresql.org/message-id/20221201214026.GA1799688%40nathanxps13
looks good to me. I see the CF entry is marked RfC -
https://commitfest.postgresql.org/41/3448/.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures
Next
From: Amit Kapila
Date:
Subject: Re: Perform streaming logical transactions by background workers and parallel apply