
From: Thomas Munro
Subject: recovery_init_sync_method=wal
Msg-id: CA+hUKG+8Wm8TSfMWPteMEHfh194RytVTBNoOkggTQT1p5NTY7Q@mail.gmail.com
List: pgsql-hackers
Hello hackers,

I'm starting a new thread and CF entry for the material for r15 from
the earlier thread[1] that introduced the recovery_init_sync_method
GUC for r14.  I wrote a summary of this topic as I see it, while it's
still fresh in my mind from working on commit 61752afb, starting with
the problem this solves.

TL;DR: Here's a patch that adds a less pessimistic, faster starting
crash recovery mode based on first principles.

=== Background ===

Why do we synchronise the data directory before we run crash recovery?

1.  WAL: It's not safe for changes to data pages to reach the disk
before the WAL.  This is the basic log-before-data rule.  Suppose we
didn't do that.  If our online cluster crashed after calling
pwrite(<some WAL data>) but before calling fdatasync(), the WAL data
we later read in crash recovery may differ from what's really on disk,
and it'd be dangerous to replay its contents, because its effects on
data pages might then be written to disk at any time and break the
rule.  If that happens and you lose power and then run crash recovery
a second time, now you have some phantom partial changes already
applied but no WAL left to redo them, leading to hazards including
xids being recycled, effects of committed transactions being partially
lost, multi-page changes being half done, and other such corruption.

2.  Data files: We can't skip changes to a data page based on the page
header's LSN if the page is not known to be on disk (that is, it is
clean in PostgreSQL's buffer pool, but possibly dirty in the kernel's
page cache).  Otherwise, the end-of-recovery checkpoint will do
nothing for the page (assuming nothing else marks the page dirty in
our buffer pool before that), so we'll complete the checkpoint, and
allow the WAL to be discarded.  Then we might lose power before the
kernel gets a chance to write the data page to disk, and when the
lights come back on we'll run crash recovery again, but we won't
replay that forgotten change from before the bogus checkpoint, so
committed changes are lost.  (I don't think this can happen with
full_page_writes=on, because in that mode we never skip pages and
always do a full replay, which has various tradeoffs.)

I believe those are the fundamental reasons.  If you know of more
reasons, I'd love to hear them.
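
To make rule 1 concrete, here's a minimal standalone sketch of the
ordering in plain POSIX calls; the file names and record contents are
invented for illustration, and this is of course not our actual WAL
code or format:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char        wal_record[64] = "change tuple X on page 42 ...";
    char        data_page[8192] = {0};
    int         wal_fd;
    int         data_fd;

    wal_fd = open("sketch_wal.log", O_WRONLY | O_CREAT | O_APPEND, 0600);
    data_fd = open("sketch_page.bin", O_WRONLY | O_CREAT, 0600);
    if (wal_fd < 0 || data_fd < 0)
        return 1;

    /* 1. Describe the change in the log first ... */
    write(wal_fd, wal_record, sizeof(wal_record));

    /* 2. ... and make it durable before any dependent data page write. */
    if (fdatasync(wal_fd) != 0)
        return 1;

    /* 3. Only now may the modified data page reach disk. */
    memcpy(data_page, "new tuple contents", sizeof("new tuple contents"));
    write(data_fd, data_page, sizeof(data_page));

    close(wal_fd);
    close(data_fd);
    return 0;
}

The only point of the sketch is that the fdatasync() in step 2 has to
succeed before step 3 is allowed to happen; crash recovery relies on
being able to assume that ordering.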

Why don't we synchronise the data directory for a clean startup?

When we start up the database from a shutdown checkpoint, we take the
checkpoint at face value.  A checkpoint is a promise that all changes
up to a given LSN have been durably stored on disk.  There are a
couple of cases where that isn't true:

1.  You were previously running with fsync=off.  That's OK, we told
you not to do that.  Checkpoints created without fsync barriers to
enforce the strict WAL-then-data-then-checkpoint protocol are
forgeries.

2.  You made a file system-level copy of a cluster that you shut down
cleanly first, using cp, tar, scp, rsync, xmodem etc.  Now you start
up the copy.  Its checkpoint is a forgery.  (Maybe our manual should
mention this problem under "25.2. File System Level Backup" where it
teaches you to rsync your cluster.)

How do the existing recovery_init_sync_method modes work?

You can think of recovery_init_sync_method as different "query plans"
for finding dirty buffers in the kernel's page cache to sync.

1.  fsync:  Go through the directory tree and call fsync() on each
file, just in case that file had some dirty pages.  This is a terrible
plan if the files aren't currently in the kernel's VFS cache, because
it could take up to a few milliseconds to get each one in there (a
random read on slower SSDs, network storage, or IOPS-capped cloud
storage).  If there really is a lot of dirty data, this plan is a
good bet, because the files must already have been in the VFS cache.
there are one million mostly read-only tables, it could take ages just
to *open* all the files, even though there's not much to actually
write out.

2.  syncfs:  Go through the kernel page cache instead, looking for
dirty data in the small number of file systems that contain our
directories.  This is driven by data that is already in the kernel's
cache, so we avoid the need to perform I/O to search for dirty data.
That's great if your cluster is running mostly alone on the file
system in question, but it's not great if you're running another
PostgreSQL cluster on the same file system, because syncfs() will
find that cluster's incidental dirty data too and generate extra
write I/O flushing it out (see the sketch below).
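
In rough POSIX/Linux terms, the shape of the work each plan does looks
something like this sketch (the paths are invented, and this is not
the server's actual implementation):

#define _GNU_SOURCE             /* for syncfs() and nftw() on Linux/glibc */
#include <fcntl.h>
#include <ftw.h>
#include <sys/stat.h>
#include <unistd.h>

/* recovery_init_sync_method = fsync: open and fsync every file we find. */
static int
fsync_one(const char *path, const struct stat *sb, int typeflag,
          struct FTW *ftwbuf)
{
    (void) sb;
    (void) ftwbuf;

    if (typeflag == FTW_F)
    {
        int         fd = open(path, O_RDONLY);

        if (fd >= 0)
        {
            fsync(fd);          /* may cost a random read just to get here */
            close(fd);
        }
    }
    return 0;                   /* keep walking the tree */
}

int
main(void)
{
    int         dirfd;

    /* Plan 1 (fsync): visit every file under the data directory. */
    nftw("/path/to/pgdata", fsync_one, 64, FTW_PHYS);

    /* Plan 2 (syncfs, Linux): one call per file system we live on. */
    dirfd = open("/path/to/pgdata", O_RDONLY);
    if (dirfd >= 0)
    {
        syncfs(dirfd);          /* flushes everything dirty on that fs */
        close(dirfd);
    }
    return 0;
}

The trade-off is visible right in the calls: plan 1 pays an open() and
fsync() per file whether or not it has anything dirty, while plan 2
pays for whatever else happens to be dirty on the same file system.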

These are both scatter-gun approaches that can sometimes do a lot of
useless work, and I'd like to find a precise version that uses
information we already have about what might be dirty according to the
meaning of a checkpoint and a transaction log.  The attached patch
does that as follows:

1.  Sync the WAL using fsync(), to enforce the log-before-data rule.
That's moved into the existing loop that scans the WAL files looking
for temporary files to unlink.  (I suppose it should perhaps do the
"presync" trick too.  Not done yet.)

2.  While replaying the WAL, if we ever decide to skip a page because
of its LSN, remember to fsync() the file in the next checkpoint anyway
(because the data might be dirty in the kernel).  This way we sync
all files that changed since the last checkpoint (even if we don't
have to apply the change again).  (A more paranoid mode would mark the
page dirty instead, so that we'll not only fsync() it, but we'll also
write it out again.  This would defend against kernels that have
writeback failure modes that include keeping changes but dropping
their own dirty flag.  Not done here.)
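
Expressed as a toy sketch with invented names (request_checkpoint_fsync()
and redo_page() are made up for illustration and are not the patch's
actual functions), step 2 amounts to this:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;    /* stands in for the real LSN type */

/* Invented hook: ask the end-of-recovery checkpoint to fsync this file. */
static void
request_checkpoint_fsync(const char *relpath)
{
    printf("remember to fsync %s at the end-of-recovery checkpoint\n",
           relpath);
}

/* Grossly simplified replay of one WAL record against one page. */
static void
redo_page(const char *relpath, XLogRecPtr page_lsn, XLogRecPtr record_lsn)
{
    if (page_lsn >= record_lsn)
    {
        /*
         * BLK_DONE: the page already shows this change, but it may still
         * be dirty only in the kernel's page cache, so the file must at
         * least be fsync'd before the end-of-recovery checkpoint is done.
         */
        request_checkpoint_fsync(relpath);
        return;
    }

    /* Otherwise apply the change; the buffer becomes dirty in our pool. */
    printf("re-applying change to %s\n", relpath);
}

int
main(void)
{
    redo_page("base/13000/16384", 200, 150);    /* skipped, fsync queued */
    redo_page("base/13000/16384", 100, 150);    /* re-applied as usual */
    return 0;
}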

One thing about this approach is that it takes the checkpoint it
recovers from at face value.  This is similar to the current situation
with startup from a clean shutdown checkpoint.  If you're starting up
a database that was previously running with fsync=off, it won't fix
the problems that might have created, and if you beamed a copy of your
crashed cluster to another machine with rsync and took no steps to
sync it, then it won't fix the problems caused by random files that
are not yet flushed to disk, and that don't happen to be dirtied (or
skipped with BLK_DONE) by WAL replay.
recovery_init_sync_method=fsync or syncfs will fix at least that second
problem for you.

Now, what holes are there in this scheme?

[1] https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org

