Thread: recovery_init_sync_method=wal
Hello hackers,

I'm starting a new thread and CF entry for the material for r15 from the earlier thread[1] that introduced the recovery_init_sync_method GUC for r14. I wrote a summary of this topic as I see it, while it's still fresh in my mind from working on commit 61752afb, starting from what problem this solves. TL;DR: Here's a patch that adds a less pessimistic, faster-starting crash recovery mode based on first principles.

=== Background ===

Why do we synchronise the data directory before we run crash recovery?

1. WAL: It's not safe for changes to data pages to reach the disk before the WAL. This is the basic log-before-data rule. Suppose we didn't do that. If our online cluster crashed after calling pwrite(<some WAL data>) but before calling fdatasync(), the WAL data we later read in crash recovery may differ from what's really on disk, and it'd be dangerous to replay its contents, because its effects on data pages might then be written to disk at any time and break the rule. If that happens and you lose power and then run crash recovery a second time, now you have some phantom partial changes already applied but no WAL left to redo them, leading to hazards including xids being recycled, effects of committed transactions being partially lost, multi-page changes being half done, and other such corruption.

2. Data files: We can't skip changes to a data page based on the page header's LSN if the page is not known to be on disk (that is, it is clean in PostgreSQL's buffer pool, but possibly dirty in the kernel's page cache). Otherwise, the end-of-recovery checkpoint will do nothing for the page (assuming nothing else marks the page dirty in our buffer pool before that), so we'll complete the checkpoint and allow the WAL to be discarded. Then we might lose power before the kernel gets a chance to write the data page to disk, and when the lights come back on we'll run crash recovery but we won't replay that forgotten change from before the bogus checkpoint, and we'll have lost committed changes. (I don't think this can happen with full_page_writes=on, because in that mode we never skip pages and always do a full replay, which has various tradeoffs.)

I believe those are the fundamental reasons. If you know of more reasons, I'd love to hear them.

Why don't we synchronise the data directory for a clean startup?

When we start up the database from a shutdown checkpoint, we take the checkpoint at face value. A checkpoint is a promise that all changes up to a given LSN have been durably stored on disk. There are a couple of cases where that isn't true:

1. You were previously running with fsync=off. That's OK, we told you not to do that. Checkpoints created without fsync barriers to enforce the strict WAL-then-data-then-checkpoint protocol are forgeries.

2. You made a file system-level copy of a cluster that you shut down cleanly first, using cp, tar, scp, rsync, xmodem etc. Now you start up the copy. Its checkpoint is a forgery. (Maybe our manual should mention this problem under "25.2. File System Level Backup" where it teaches you to rsync your cluster.)

How do the existing recovery_init_sync_method modes work?

You can think of recovery_init_sync_method as different "query plans" for finding dirty buffers in the kernel's page cache to sync.

1. fsync: Go through the directory tree and call fsync() on each file, just in case that file had some dirty pages.
This is a terrible plan if the files aren't currently in the kernel's VFS cache, because it could take up to a few milliseconds to get each one in there (random read to slower SSDs or network storage or IOPS-capped cloud storage). If there really is a lot of dirty data, that's a good bet, because the files must have been in the VFS cache already. But if there are one million mostly read-only tables, it could take ages just to *open* all the files, even though there's not much to actually write out.

2. syncfs: Go through the kernel page cache instead, looking for dirty data in the small number of file systems that contain our directories. This is driven by data that is already in the kernel's cache, so we avoid the need to perform I/O to search for dirty data. That's great if your cluster is running mostly alone on the file system in question, but it's not great if you're running another PostgreSQL cluster on the same file system, because now we generate extra write I/O when syncfs() finds incidental other stuff to write out. (Both plans are sketched in plain POSIX terms after this message.)

These are both scatter-gun approaches that can sometimes do a lot of useless work, and I'd like to find a precise version that uses information we already have about what might be dirty according to the meaning of a checkpoint and a transaction log. The attached patch does that as follows:

1. Sync the WAL using fsync(), to enforce the log-before-data rule. That's moved into the existing loop that scans the WAL files looking for temporary files to unlink. (I suppose it should perhaps do the "presync" trick too. Not done yet.)

2. While replaying the WAL, if we ever decide to skip a page because of its LSN, remember to fsync() the file in the next checkpoint anyway (because the data might be dirty in the kernel). This way we sync all files that changed since the last checkpoint (even if we don't have to apply the change again). (A more paranoid mode would mark the page dirty instead, so that we'll not only fsync() it, but we'll also write it out again. This would defend against kernels that have writeback failure modes that include keeping changes but dropping their own dirty flag. Not done here.)

One thing about this approach is that it takes the checkpoint it recovers from at face value. This is similar to the current situation with startup from a clean shutdown checkpoint. If you're starting up a database that was previously running with fsync=off, it won't fix the problems that may have been created, and if you beamed a copy of your crashed cluster to another machine with rsync and took no steps to sync it, then it won't fix the problems caused by random files that are not yet flushed to disk and that don't happen to be dirtied (or skipped with BLK_DONE) by WAL replay. recovery_init_sync_method=fsync or syncfs will fix at least that second problem for you.

Now, what holes are there in this scheme?

[1] https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org
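
To make the contrast concrete, here is a minimal standalone sketch of those two "query plans" in plain POSIX/Linux terms. This is not PostgreSQL code and not the patch: the data directory path is made up, error handling is minimal, recursion into subdirectories is omitted, and syncfs() is Linux-only.

/*
 * Standalone sketch (not PostgreSQL code) of the two existing sync
 * strategies described above.
 */
#define _GNU_SOURCE             /* for syncfs() on Linux */
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Plan "fsync": visit every file under dir and fsync() it, whether or
 * not it has dirty pages.  Cheap per file if the inode is already
 * cached, slow if we have to fault in a million inodes first.  (A real
 * version recurses into subdirectories; omitted here.)
 */
static void
fsync_every_file(const char *dir)
{
    DIR        *d = opendir(dir);
    struct dirent *de;

    if (d == NULL)
        return;
    while ((de = readdir(d)) != NULL)
    {
        char        path[4096];
        int         fd;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        fd = open(path, O_RDONLY);
        if (fd >= 0)
        {
            fsync(fd);          /* near-free if the file was already clean */
            close(fd);
        }
    }
    closedir(d);
}

/*
 * Plan "syncfs" (Linux-only): ask the kernel to write back everything
 * dirty on the file system containing dir, no matter who dirtied it.
 */
static void
syncfs_whole_filesystem(const char *dir)
{
    int         fd = open(dir, O_RDONLY);

    if (fd >= 0)
    {
        syncfs(fd);
        close(fd);
    }
}

int
main(void)
{
    fsync_every_file("/tmp/pgdata-example");    /* illustrative path only */
    syncfs_whole_filesystem("/tmp/pgdata-example");
    return 0;
}

The patch described above replaces both of these sweeps with a targeted set: the WAL segment files themselves, plus the relation files whose pages replay either modified or skipped because of their LSNs.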
Greetings,

* Thomas Munro (thomas.munro@gmail.com) wrote:
> 2. You made a file system-level copy of a cluster that you shut down cleanly first, using cp, tar, scp, rsync, xmodem etc. Now you start up the copy. Its checkpoint is a forgery. (Maybe our manual should mention this problem under "25.2. File System Level Backup" where it teaches you to rsync your cluster.)

Yes, it'd be good to get some updates to the backup documentation around this which stress, in all cases, that your backup utility should make sure to fsync everything it restores.

> These are both scatter-gun approaches that can sometimes do a lot of useless work, and I'd like to find a precise version that uses information we already have about what might be dirty according to the meaning of a checkpoint and a transaction log. The attached patch does that as follows:
>
> 1. Sync the WAL using fsync(), to enforce the log-before-data rule. That's moved into the existing loop that scans the WAL files looking for temporary files to unlink. (I suppose it should perhaps do the "presync" trick too. Not done yet.)
>
> 2. While replaying the WAL, if we ever decide to skip a page because of its LSN, remember to fsync() the file in the next checkpoint anyway (because the data might be dirty in the kernel). This way we sync all files that changed since the last checkpoint (even if we don't have to apply the change again). (A more paranoid mode would mark the page dirty instead, so that we'll not only fsync() it, but we'll also write it out again. This would defend against kernels that have writeback failure modes that include keeping changes but dropping their own dirty flag. Not done here.)

Presuming that we do add to the documentation the language to document what's assumed (and already done by modern backup tools), namely that they're fsync'ing everything they're restoring, do we/can we have an option which those tools could set that explicitly tells PG "everything in the cluster has been fsync'd already, you don't need to do anything extra"? Perhaps also/separately one for WAL that's restored with restore_command, if we think that's necessary?

Otherwise, just in general, I agree with doing this to address the risks discussed around regular crash recovery. We have some pretty clear "if the DB was doing recovery and was interrupted, you need to restore from backup" messages today in xlog.c, and this patch didn't seem to change that? Am I missing something, or isn't the idea here that these changes would make it so you aren't going to end up with corruption in those cases? Specifically looking at-

xlog.c:6509-
        case DB_IN_CRASH_RECOVERY:
            ereport(LOG,
                    (errmsg("database system was interrupted while in recovery at %s",
                            str_time(ControlFile->time)),
                     errhint("This probably means that some data is corrupted and"
                             " you will have to use the last backup for recovery.")));
            break;

        case DB_IN_ARCHIVE_RECOVERY:
            ereport(LOG,
                    (errmsg("database system was interrupted while in recovery at log time %s",
                            str_time(ControlFile->checkPointCopy.time)),
                     errhint("If this has occurred more than once some data might be corrupted"
                             " and you might need to choose an earlier recovery target.")));
            break;

Thanks!

Stephen
On Mon, Mar 22, 2021 at 4:31 AM Stephen Frost <sfrost@snowman.net> wrote:
> Presuming that we do add to the documentation the language to document what's assumed (and already done by modern backup tools), namely that they're fsync'ing everything they're restoring, do we/can we have an option which those tools could set that explicitly tells PG "everything in the cluster has been fsync'd already, you don't need to do anything extra"? Perhaps also/separately one for WAL that's restored with restore_command, if we think that's necessary?

In the earlier thread, we did contemplate recovery_init_sync_method=none, but it has the problem that after recovery completes you have a cluster running with a setting that is really bad if you eventually crash again and run crash recovery again, this time with a dirty kernel page cache. I was one of the people voting against that feature, but I also wrote a strawman patch, just for the visceral experience of every cell in my body suddenly whispering "yeah, nah, I'm not committing that" as I wrote the weaselwordage for the manual.

In that thread we also contemplated safe ways for a basebackup-type tool to promise that data has been sync'd, to avoid that problem with the GUC. Maybe something like: the backup label file could contain a "SYNC_DONE" message. But then what if someone copies the whole directory to a new location? How can you invalidate the promise? This is another version of the question of whether it's our problem or the user's to worry about buffering of pgdata files that they copy around with unknown tools. If it's our problem, maybe something like "SYNC_DONE for pgdata_inode=1234, hostname=x, ..." is enough to detect cases where we still believe the claim. But there's probably a long tail of weird ways for whatever you come up with to be deceived.

In the case of the recovery_init_sync_method=wal patch proposed in *this* thread, here's the thing: there's not much to gain by trying to skip the sync anyway! For the WAL, you'll be opening those files soon anyway to replay them, and if they're already sync'd then fsync() will return quickly. For the relfile data, you'll be opening all relfiles that are referenced by the WAL soon anyway, and syncing them if required. So that just leaves relfiles that are not referenced in the WAL you're about to replay. Whether we have a duty to sync those too is that central question again, and one of the things I'm trying to get an answer to with this thread. The recovery_init_sync_method=wal patch only makes sense if the answer is "no".

> Otherwise, just in general, I agree with doing this to address the risks discussed around regular crash recovery. We have some pretty clear "if the DB was doing recovery and was interrupted, you need to restore from backup" messages today in xlog.c, and this patch didn't seem to change that? Am I missing something, or isn't the idea here that these changes would make it so you aren't going to end up with corruption in those cases?
> Specifically looking at-
>
> xlog.c:6509-
>         case DB_IN_CRASH_RECOVERY:
>             ereport(LOG,
>                     (errmsg("database system was interrupted while in recovery at %s",
>                             str_time(ControlFile->time)),
>                      errhint("This probably means that some data is corrupted and"
>                              " you will have to use the last backup for recovery.")));
>             break;
>
>         case DB_IN_ARCHIVE_RECOVERY:
>             ereport(LOG,
>                     (errmsg("database system was interrupted while in recovery at log time %s",
>                             str_time(ControlFile->checkPointCopy.time)),
>                      errhint("If this has occurred more than once some data might be corrupted"
>                              " and you might need to choose an earlier recovery target.")));
>             break;

Maybe I missed your point... but I don't think anything changes here? If recovery is crashing in some deterministic way (not just because you lost power the first time, but rather because a particular log record hits the same bug or gets confused by the same corruption and implodes every time) it'll probably keep doing so, and our sync algorithm doesn't seem to make a difference to that.
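
As an aside on the "detect cases where we still believe the claim" idea above, here is a hypothetical sketch of what such a check could look like, comparing the live data directory against facts a backup tool recorded next to its promise. The function name, the recorded fields, and the idea of storing them in the backup label are all invented here for illustration and are not part of any patch.

#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical: a backup tool that wrote a SYNC_DONE promise also
 * recorded the data directory's inode number and the hostname it ran
 * on.  If either no longer matches, assume the directory was copied
 * somewhere else and ignore the promise.
 */
static int
sync_promise_still_believable(const char *pgdata,
                              unsigned long recorded_inode,
                              const char *recorded_host)
{
    struct stat st;
    char        host[256];

    if (stat(pgdata, &st) != 0)
        return 0;
    if (gethostname(host, sizeof(host)) != 0)
        return 0;
    return (unsigned long) st.st_ino == recorded_inode &&
        strcmp(host, recorded_host) == 0;
}

int
main(void)
{
    /* Values that would have come from the (hypothetical) promise. */
    if (sync_promise_still_believable("/tmp/pgdata-example", 1234, "x"))
        printf("claim believable: skip the startup sync\n");
    else
        printf("claim not believable: sync the data directory anyway\n");
    return 0;
}

Even so, as noted above there is a long tail of ways for such a claim to be deceived, which is part of the argument for making the startup sync cheap rather than skippable.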
Greetings,

* Thomas Munro (thomas.munro@gmail.com) wrote:
> On Mon, Mar 22, 2021 at 4:31 AM Stephen Frost <sfrost@snowman.net> wrote:
> > Presuming that we do add to the documentation the language to document what's assumed (and already done by modern backup tools), namely that they're fsync'ing everything they're restoring, do we/can we have an option which those tools could set that explicitly tells PG "everything in the cluster has been fsync'd already, you don't need to do anything extra"? Perhaps also/separately one for WAL that's restored with restore_command, if we think that's necessary?
>
> In the earlier thread, we did contemplate recovery_init_sync_method=none, but it has the problem that after recovery completes you have a cluster running with a setting that is really bad if you eventually crash again and run crash recovery again, this time with a dirty kernel page cache. I was one of the people voting against that feature, but I also wrote a strawman patch, just for the visceral experience of every cell in my body suddenly whispering "yeah, nah, I'm not committing that" as I wrote the weaselwordage for the manual.

Why not have a 'recovery_init_sync_method=backup'? It's not like there's a question about whether we're doing recovery from a backup or not.

> In that thread we also contemplated safe ways for a basebackup-type tool to promise that data has been sync'd, to avoid that problem with the GUC. Maybe something like: the backup label file could contain a "SYNC_DONE" message. But then what if someone copies the whole directory to a new location? How can you invalidate the promise? This is another version of the question of whether it's our problem or the user's to worry about buffering of pgdata files that they copy around with unknown tools. If it's our problem, maybe something like "SYNC_DONE for pgdata_inode=1234, hostname=x, ..." is enough to detect cases where we still believe the claim. But there's probably a long tail of weird ways for whatever you come up with to be deceived.

I don't really care for the idea of backup tools modifying the backup label... they're explicitly told not to do that and that seems like the best move. I also don't particularly care about silly "what ifs" like if someone randomly copies the data directory after the restore- yes, there are a lot of ways that people can screw things up by doing things that aren't sane, but that doesn't mean we should try to cater to such cases.

> In the case of the recovery_init_sync_method=wal patch proposed in *this* thread, here's the thing: there's not much to gain by trying to skip the sync anyway! For the WAL, you'll be opening those files soon anyway to replay them, and if they're already sync'd then fsync() will return quickly. For the relfile data, you'll be opening all relfiles that are referenced by the WAL soon anyway, and syncing them if required. So that just leaves relfiles that are not referenced in the WAL you're about to replay. Whether we have a duty to sync those too is that central question again, and one of the things I'm trying to get an answer to with this thread. The recovery_init_sync_method=wal patch only makes sense if the answer is "no".

I'm not too bothered by an extra fsync() for WAL files, just to be clear; it's running around fsync'ing everything else that seems objectionable to me.

> > Otherwise, just in general, I agree with doing this to address the risks discussed around regular crash recovery.
> > We have some pretty clear "if the DB was doing recovery and was interrupted, you need to restore from backup" messages today in xlog.c, and this patch didn't seem to change that? Am I missing something, or isn't the idea here that these changes would make it so you aren't going to end up with corruption in those cases? Specifically looking at-
> >
> > xlog.c:6509-
> >         case DB_IN_CRASH_RECOVERY:
> >             ereport(LOG,
> >                     (errmsg("database system was interrupted while in recovery at %s",
> >                             str_time(ControlFile->time)),
> >                      errhint("This probably means that some data is corrupted and"
> >                              " you will have to use the last backup for recovery.")));
> >             break;
> >
> >         case DB_IN_ARCHIVE_RECOVERY:
> >             ereport(LOG,
> >                     (errmsg("database system was interrupted while in recovery at log time %s",
> >                             str_time(ControlFile->checkPointCopy.time)),
> >                      errhint("If this has occurred more than once some data might be corrupted"
> >                              " and you might need to choose an earlier recovery target.")));
> >             break;
>
> Maybe I missed your point... but I don't think anything changes here? If recovery is crashing in some deterministic way (not just because you lost power the first time, but rather because a particular log record hits the same bug or gets confused by the same corruption and implodes every time) it'll probably keep doing so, and our sync algorithm doesn't seem to make a difference to that.

These errors aren't thrown only in the case where we hit a bad XLOG record though, are they..? Maybe I missed that somehow, but it seems like these get thrown in the simple case that we, say, lost power, started recovery, didn't finish recovery, and lost power again, even though with your patches hopefully that wouldn't actually result in a failure case or in corruption..?

In the 'bad XLOG' or 'confused by corruption' cases, I wonder if it'd be helpful to write that out more explicitly somehow... essentially segregating these cases.

Thanks,

Stephen