Re: recovery_init_sync_method=wal - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: recovery_init_sync_method=wal
Date
Msg-id 20210321220718.GO20766@tamriel.snowman.net
Whole thread Raw
In response to Re: recovery_init_sync_method=wal  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
Greetings,

* Thomas Munro (thomas.munro@gmail.com) wrote:
> On Mon, Mar 22, 2021 at 4:31 AM Stephen Frost <sfrost@snowman.net> wrote:
> > Presuming that we do add to the documentation the language to document
> > what's assumed (and already done by modern backup tools) that they're
> > fsync'ing everything they're restoring, do we/can we have an option
> > which those tools could set that explicitly tells PG "everything in the
> > cluster has been fsync'd already, you don't need to do anything extra"?
> > Perhaps also/seperately one for WAL that's restored with restore command
> > if we think that's necessary?
>
> In the earlier thread, we did contemplate
> recovery_init_sync_method=none, but it has the problem that after
> recovery completes you have a cluster running with a setting that is
> really bad if you eventually crash again and run crash recovery again,
> this time with a dirty kernel page cache.  I was one of the people
> voting against that feature, but I also wrote a strawman patch just
> for the visceral experience of every cell in my body suddenly
> whispering "yeah, nah, I'm not committing that" as I wrote the
> weaselwordage for the manual.

Why not have a 'recovery_init_sync_method=backup'?  It's not like
there's a question about if we're doing recovery from a backup or not.

> In that thread we also contemplated safe ways for a basebackup-type
> tool to promise that data has been sync'd, to avoid that problem with
> the GUC.  Maybe something like: the backup label file could contain a
> "SYNC_DONE" message.  But then what if someone copies the whole
> directory to a new location, how can you invalidate the promise?  This
> is another version of the question of whether it's our problem or the
> user's to worry about buffering of pgdata files that they copy around
> with unknown tools.  If it's our problem, maybe something like:
> "SYNC_DONE for pgdata_inode=1234, hostname=x, ..." is enough to detect
> cases where we still believe the claim.  But there's probably a long
> tail of weird ways for whatever you come up with to be deceived.

I don't really care for the idea of backup tools modifying the backup
label... they're explicitly told not to do that and that seems like the
best move.  I also don't particularly care about silly "what ifs" like
if someone randomly copies the data directory after the restore- yes,
there's a lot of ways that people can screw things up by doing things
that aren't sane, but that doesn't mean we should try to cater to such
cases.

> In the case of the recovery_init_sync_method=wal patch proposed in
> *this* thread, here's the thing: there's not much to gain by trying to
> skip the sync, anyway!  For the WAL, you'll be opening those files
> soon anyway to replay them, and if they're already sync'd then fsync()
> will return quickly.  For the relfile data, you'll be opening all
> relfiles that are referenced by the WAL soon anyway, and syncing them
> if required.  So that just leaves relfiles that are not referenced in
> the WAL you're about to replay.  Whether we have a duty to sync those
> too is that central question again, and one of the things I'm trying
> to get an answer to with this thread.  The
> recovery_init_sync_method=wal patch only makes sense if the answer is
> "no".

I'm not too bothered by an extra fsync() for WAL files, just to be
clear, it's running around fsync'ing everything else that seems
objectionable to me.

> > Otherwise, just in general, agree with doing this to address the risks
> > discussed around regular crash recovery.  We have some pretty clear "if
> > the DB was doing recovery and was interrupted, you need to restore from
> > backup" messages today in xlog.c, and this patch didn't seem to change
> > that?  Am I missing something or isn't the idea here that these changes
> > would make it so you aren't going to end up with corruption in those
> > cases?  Specifically looking at-
> >
> > xlog.c:6509-
> >         case DB_IN_CRASH_RECOVERY:
> >             ereport(LOG,
> >                     (errmsg("database system was interrupted while in recovery at %s",
> >                             str_time(ControlFile->time)),
> >                      errhint("This probably means that some data is corrupted and"
> >                              " you will have to use the last backup for recovery.")));
> >             break;
> >
> >         case DB_IN_ARCHIVE_RECOVERY:
> >             ereport(LOG,
> >                     (errmsg("database system was interrupted while in recovery at log time %s",
> >                             str_time(ControlFile->checkPointCopy.time)),
> >                      errhint("If this has occurred more than once some data might be corrupted"
> >                              " and you might need to choose an earlier recovery target.")));
> >             break;
>
> Maybe I missed your point... but I don't think anything changes here?
> If recovery is crashing in some deterministic way (not just because
> you lost power the first time, but rather because a particular log
> record hits the same bug or gets confused by the same corruption and
> implodes every time) it'll probably keep doing so, and our sync
> algorithm doesn't seem to make a difference to that.

These errors aren't thrown only in the case where we hit a bad XLOG
record though, are they..?  Maybe I missed that somehow but it seems
like these get thrown in the simple case that we, say, lost power,
started recovery and didn't finish recovery and lost power again, even
though with your patches hopefully that wouldn't actually result in a
failure case or in corruption..?

In the 'bad XLOG' or 'confused by corruption' cases, I wonder if it'd be
helpful to write that out more explicitly somehow..  essentially
segregating these cases.

Thanks,

Stephen

Attachment

pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: recovery_init_sync_method=wal
Next
From: Stephen Frost
Date:
Subject: Re: shared memory stats: high level design decisions: consistency, dropping