Re: needless complexity in StartupXLOG - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: needless complexity in StartupXLOG
Date
Msg-id 20210726201523.GY20766@tamriel.snowman.net
Whole thread Raw
In response to Re: needless complexity in StartupXLOG  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: needless complexity in StartupXLOG  (Robert Haas <robertmhaas@gmail.com>)
Re: needless complexity in StartupXLOG  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Jul 26, 2021 at 1:32 PM Stephen Frost <sfrost@snowman.net> wrote:
> > Yeah, tend to agree with this too ... but something I find a bit curious
> > is the comment:
> >
> > * Insert a special WAL record to mark the end of
> > * recovery, since we aren't doing a checkpoint.
> >
> > ... immediately after setting promoted = true, and then at the end of
> > StartupXLOG() having:
> >
> > if (promoted)
> >         RequestCheckpoint(CHECKPOINT_FORCE);
> >
> > maybe I'm missing something, but seems like that comment isn't being
> > terribly clear.  Perhaps we aren't doing a full checkpoint *there*, but
> > sure looks like we're going to do one moments later regardless of
> > anything else since we've set promoted to true..?
>
> Yep. So it's a question of whether we allow operations that might
> write WAL in the meantime. When we write the checkpoint record right
> here, there can't be any WAL from the new server lifetime until the
> checkpoint completes. When we write an end-of-recovery record, there
> can. And there could actually be quite a bit, because if we do the
> checkpoint right in this section of code, it will be a fast
> checkpoint, whereas in the code you quoted above, it's a spread
> checkpoint, which takes a lot longer. So the question is whether it's
> reasonable to give the checkpoint some time to complete or whether it
> needs to be completed right now.

All I was really trying to point out above was that the comment might be
improved upon, just so someone understands that we aren't doing a
checkpoint at this particular place, but one will be done later due to
the promotion.  Maybe I'm being a bit extra with that, but that was my
thought when reading the code and the use of the promoted flag variable.

> > Agreed that simpler logic is better, provided it's correct logic, of
> > course.  Finding better ways to test all of this would be really nice.
>
> Yeah, and there again, it's a lot easier to test if we don't have as
> many cases. Now no single change is going to fix that, but the number
> of flag variables in xlog.c is simply bonkers. This particular stretch
> of code uses 3 of them to even decide whether to attempt the test in
> question, and all of those are set in complex ways depending on the
> values of still more flag variables. The comments where
> bgwriterLaunched is set claim that we only do this during archive
> recovery, not crash recovery, but the code depends on the value of
> ArchiveRecoveryRequested, not InArchiveRecovery. So I wonder if we
> can't get the bgwriter to run even during crash recovery in the
> scenario described by:
>
>          * It's possible that archive recovery was requested, but we don't
>          * know how far we need to replay the WAL before we reach consistency.
>          * This can happen for example if a base backup is taken from a
>          * running server using an atomic filesystem snapshot, without calling
>          * pg_start/stop_backup. Or if you just kill a running primary server
>          * and put it into archive recovery by creating a recovery signal
>          * file.
>
> If we ran the bgwriter all the time during crash recovery, we'd know
> for sure whether that causes any problems. If we only do it like this
> in certain corner cases, it's much more likely that we have bugs.
> Grumble, grumble.

Yeah ... not to mention that it really is just incredibly dangerous to
use such an approach for PITR.  For my 2c, I'd rather we figure out a
way to prevent this than to imply that we support it when we have no way
of knowing if we actually have replayed far enough to be consistent.
That isn't to say that using snapshots for database backups isn't
possible, but it should be done in-between pg_start/stop_backup calls
which properly grab the returned info from those and store the backup
label with the snapshot, etc.

Thanks,

Stephen

Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
Next
From: ilmari@ilmari.org (Dagfinn Ilmari Mannsåker)
Date:
Subject: Re: Removing "long int"-related limit on hash table sizes