Thread: recovery_init_sync_method=wal
Hello hackers,

I'm starting a new thread and CF entry for the material for r15 from the earlier thread[1] that introduced the recovery_init_sync_method GUC for r14. I wrote a summary of this topic as I see it, while it's still fresh in my mind from working on commit 61752afb, starting from what problem this solves. TL;DR: Here's a patch that adds a less pessimistic, faster-starting crash recovery mode based on first principles.

=== Background ===

Why do we synchronise the data directory before we run crash recovery?

1. WAL: It's not safe for changes to data pages to reach the disk before the WAL. This is the basic log-before-data rule. Suppose we didn't do that. If our online cluster crashed after calling pwrite(<some WAL data>) but before calling fdatasync(), the WAL data we later read in crash recovery may differ from what's really on disk, and it'd be dangerous to replay its contents, because its effects on data pages might then be written to disk at any time and break the rule. If that happens and you lose power and then run crash recovery a second time, now you have some phantom partial changes already applied but no WAL left to redo them, leading to hazards including xids being recycled, effects of committed transactions being partially lost, multi-page changes being half done, and other such corruption.

2. Data files: We can't skip changes to a data page based on the page header's LSN if the page is not known to be on disk (that is, it is clean in PostgreSQL's buffer pool, but possibly dirty in the kernel's page cache). Otherwise, the end-of-recovery checkpoint will do nothing for the page (assuming nothing else marks the page dirty in our buffer pool before that), so we'll complete the checkpoint and allow the WAL to be discarded. Then we might lose power before the kernel gets a chance to write the data page to disk, and when the lights come back on we'll run crash recovery but we won't replay that forgotten change from before the bogus checkpoint, and we'll have lost committed changes. (I don't think this can happen with full_page_writes=on, because in that mode we never skip pages and always do a full replay, which has various tradeoffs.)

I believe those are the fundamental reasons. If you know of more reasons, I'd love to hear them.

Why don't we synchronise the data directory for a clean startup?

When we start up the database from a shutdown checkpoint, we take the checkpoint at face value. A checkpoint is a promise that all changes up to a given LSN have been durably stored on disk. There are a couple of cases where that isn't true:

1. You were previously running with fsync=off. That's OK, we told you not to do that. Checkpoints created without fsync barriers to enforce the strict WAL-then-data-then-checkpoint protocol are forgeries.

2. You made a file system-level copy of a cluster that you shut down cleanly first, using cp, tar, scp, rsync, xmodem etc. Now you start up the copy. Its checkpoint is a forgery. (Maybe our manual should mention this problem under "25.2. File System Level Backup" where it teaches you to rsync your cluster.)

How do the existing recovery_init_sync_method modes work?

You can think of recovery_init_sync_method as different "query plans" for finding dirty buffers in the kernel's page cache to sync.

1. fsync: Go through the directory tree and call fsync() on each file, just in case that file had some dirty pages.
This is a terrible plan if the files aren't currently in the kernel's VFS cache, because it could take up to a few milliseconds to get each one in there (random read to slower SSDs or network storage or IOPS-capped cloud storage). If there really is a lot of dirty data, that's a good bet, because the files must have been in the VFS cache already. But if there are one million mostly read-only tables, it could take ages just to *open* all the files, even though there's not much to actually write out.

2. syncfs: Go through the kernel page cache instead, looking for dirty data in the small number of file systems that contain our directories. This is driven by data that is already in the kernel's cache, so we avoid the need to perform I/O to search for dirty data. That's great if your cluster is running mostly alone on the file system in question, but it's not great if you're running another PostgreSQL cluster on the same file system, because now we generate extra write I/O when syncfs() finds incidental other stuff to write out. (Both plans are sketched in plain POSIX terms after this message.)

These are both scatter-gun approaches that can sometimes do a lot of useless work, and I'd like to find a precise version that uses information we already have about what might be dirty according to the meaning of a checkpoint and a transaction log. The attached patch does that as follows:

1. Sync the WAL using fsync(), to enforce the log-before-data rule. That's moved into the existing loop that scans the WAL files looking for temporary files to unlink. (I suppose it should perhaps do the "presync" trick too. Not done yet.)

2. While replaying the WAL, if we ever decide to skip a page because of its LSN, remember to fsync() the file in the next checkpoint anyway (because the data might be dirty in the kernel). This way we sync all files that changed since the last checkpoint (even if we don't have to apply the change again). (A more paranoid mode would mark the page dirty instead, so that we'll not only fsync() it, but we'll also write it out again. This would defend against kernels that have writeback failure modes that include keeping changes but dropping their own dirty flag. Not done here.)

One thing about this approach is that it takes the checkpoint it recovers from at face value. This is similar to the current situation with startup from a clean shutdown checkpoint. If you're starting up a database that was previously running with fsync=off, it won't fix the problems that may have been created, and if you beamed a copy of your crashed cluster to another machine with rsync and took no steps to sync it, then it won't fix the problems caused by random files that are not yet flushed to disk and that don't happen to be dirtied (or skipped with BLK_DONE) by WAL replay. recovery_init_sync_method=fsync or syncfs will fix at least that second problem for you.

Now, what holes are there in this scheme?

[1] https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org
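
To make the contrast concrete, here is a minimal standalone sketch of those two "query plans" in plain POSIX/Linux terms. This is not PostgreSQL code and not the patch: the data directory path is made up, error handling is minimal, recursion into subdirectories is omitted, and syncfs() is Linux-only.

/*
 * Standalone sketch (not PostgreSQL code) of the two existing sync
 * strategies described above.
 */
#define _GNU_SOURCE             /* for syncfs() on Linux */
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Plan "fsync": visit every file under dir and fsync() it, whether or
 * not it has dirty pages.  Cheap per file if the inode is already
 * cached, slow if we have to fault in a million inodes first.  (A real
 * version recurses into subdirectories; omitted here.)
 */
static void
fsync_every_file(const char *dir)
{
    DIR        *d = opendir(dir);
    struct dirent *de;

    if (d == NULL)
        return;
    while ((de = readdir(d)) != NULL)
    {
        char        path[4096];
        int         fd;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        fd = open(path, O_RDONLY);
        if (fd >= 0)
        {
            fsync(fd);          /* near-free if the file was already clean */
            close(fd);
        }
    }
    closedir(d);
}

/*
 * Plan "syncfs" (Linux-only): ask the kernel to write back everything
 * dirty on the file system containing dir, no matter who dirtied it.
 */
static void
syncfs_whole_filesystem(const char *dir)
{
    int         fd = open(dir, O_RDONLY);

    if (fd >= 0)
    {
        syncfs(fd);
        close(fd);
    }
}

int
main(void)
{
    fsync_every_file("/tmp/pgdata-example");    /* illustrative path only */
    syncfs_whole_filesystem("/tmp/pgdata-example");
    return 0;
}

The patch described above replaces both of these sweeps with a targeted set: the WAL segment files themselves, plus the relation files whose pages replay either modified or skipped because of their LSNs.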
Greetings,

* Thomas Munro (thomas.munro@gmail.com) wrote:
> 2. You made a file system-level copy of a cluster that you shut down cleanly first, using cp, tar, scp, rsync, xmodem etc. Now you start up the copy. Its checkpoint is a forgery. (Maybe our manual should mention this problem under "25.2. File System Level Backup" where it teaches you to rsync your cluster.)

Yes, it'd be good to get some updates to the backup documentation around this which stress, in all cases, that your backup utility should make sure to fsync everything it restores.

> These are both scatter-gun approaches that can sometimes do a lot of useless work, and I'd like to find a precise version that uses information we already have about what might be dirty according to the meaning of a checkpoint and a transaction log. The attached patch does that as follows:
>
> 1. Sync the WAL using fsync(), to enforce the log-before-data rule. That's moved into the existing loop that scans the WAL files looking for temporary files to unlink. (I suppose it should perhaps do the "presync" trick too. Not done yet.)
>
> 2. While replaying the WAL, if we ever decide to skip a page because of its LSN, remember to fsync() the file in the next checkpoint anyway (because the data might be dirty in the kernel). This way we sync all files that changed since the last checkpoint (even if we don't have to apply the change again). (A more paranoid mode would mark the page dirty instead, so that we'll not only fsync() it, but we'll also write it out again. This would defend against kernels that have writeback failure modes that include keeping changes but dropping their own dirty flag. Not done here.)

Presuming that we do add to the documentation the language to document what's assumed (and already done by modern backup tools), namely that they're fsync'ing everything they're restoring, do we/can we have an option which those tools could set that explicitly tells PG "everything in the cluster has been fsync'd already, you don't need to do anything extra"? Perhaps also/separately one for WAL that's restored with restore_command, if we think that's necessary?

Otherwise, just in general, I agree with doing this to address the risks discussed around regular crash recovery. We have some pretty clear "if the DB was doing recovery and was interrupted, you need to restore from backup" messages today in xlog.c, and this patch didn't seem to change that? Am I missing something, or isn't the idea here that these changes would make it so you aren't going to end up with corruption in those cases? Specifically looking at-

xlog.c:6509-
        case DB_IN_CRASH_RECOVERY:
            ereport(LOG,
                    (errmsg("database system was interrupted while in recovery at %s",
                            str_time(ControlFile->time)),
                     errhint("This probably means that some data is corrupted and"
                             " you will have to use the last backup for recovery.")));
            break;

        case DB_IN_ARCHIVE_RECOVERY:
            ereport(LOG,
                    (errmsg("database system was interrupted while in recovery at log time %s",
                            str_time(ControlFile->checkPointCopy.time)),
                     errhint("If this has occurred more than once some data might be corrupted"
                             " and you might need to choose an earlier recovery target.")));
            break;

Thanks!

Stephen
On Mon, Mar 22, 2021 at 4:31 AM Stephen Frost <sfrost@snowman.net> wrote:
> Presuming that we do add to the documentation the language to document what's assumed (and already done by modern backup tools), namely that they're fsync'ing everything they're restoring, do we/can we have an option which those tools could set that explicitly tells PG "everything in the cluster has been fsync'd already, you don't need to do anything extra"? Perhaps also/separately one for WAL that's restored with restore_command, if we think that's necessary?

In the earlier thread, we did contemplate recovery_init_sync_method=none, but it has the problem that after recovery completes you have a cluster running with a setting that is really bad if you eventually crash again and run crash recovery again, this time with a dirty kernel page cache. I was one of the people voting against that feature, but I also wrote a strawman patch, just for the visceral experience of every cell in my body suddenly whispering "yeah, nah, I'm not committing that" as I wrote the weaselwordage for the manual.

In that thread we also contemplated safe ways for a basebackup-type tool to promise that data has been sync'd, to avoid that problem with the GUC. Maybe something like: the backup label file could contain a "SYNC_DONE" message. But then what if someone copies the whole directory to a new location? How can you invalidate the promise? This is another version of the question of whether it's our problem or the user's to worry about buffering of pgdata files that they copy around with unknown tools. If it's our problem, maybe something like "SYNC_DONE for pgdata_inode=1234, hostname=x, ..." is enough to detect cases where we still believe the claim. But there's probably a long tail of weird ways for whatever you come up with to be deceived.

In the case of the recovery_init_sync_method=wal patch proposed in *this* thread, here's the thing: there's not much to gain by trying to skip the sync anyway! For the WAL, you'll be opening those files soon anyway to replay them, and if they're already sync'd then fsync() will return quickly. For the relfile data, you'll be opening all relfiles that are referenced by the WAL soon anyway, and syncing them if required. So that just leaves relfiles that are not referenced in the WAL you're about to replay. Whether we have a duty to sync those too is that central question again, and one of the things I'm trying to get an answer to with this thread. The recovery_init_sync_method=wal patch only makes sense if the answer is "no".

> Otherwise, just in general, I agree with doing this to address the risks discussed around regular crash recovery. We have some pretty clear "if the DB was doing recovery and was interrupted, you need to restore from backup" messages today in xlog.c, and this patch didn't seem to change that? Am I missing something, or isn't the idea here that these changes would make it so you aren't going to end up with corruption in those cases?
> Specifically looking at-
>
> xlog.c:6509-
>         case DB_IN_CRASH_RECOVERY:
>             ereport(LOG,
>                     (errmsg("database system was interrupted while in recovery at %s",
>                             str_time(ControlFile->time)),
>                      errhint("This probably means that some data is corrupted and"
>                              " you will have to use the last backup for recovery.")));
>             break;
>
>         case DB_IN_ARCHIVE_RECOVERY:
>             ereport(LOG,
>                     (errmsg("database system was interrupted while in recovery at log time %s",
>                             str_time(ControlFile->checkPointCopy.time)),
>                      errhint("If this has occurred more than once some data might be corrupted"
>                              " and you might need to choose an earlier recovery target.")));
>             break;

Maybe I missed your point... but I don't think anything changes here? If recovery is crashing in some deterministic way (not just because you lost power the first time, but rather because a particular log record hits the same bug or gets confused by the same corruption and implodes every time) it'll probably keep doing so, and our sync algorithm doesn't seem to make a difference to that.
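
As an aside on the "detect cases where we still believe the claim" idea above, here is a hypothetical sketch of what such a check could look like, comparing the live data directory against facts a backup tool recorded next to its promise. The function name, the recorded fields, and the idea of storing them in the backup label are all invented here for illustration and are not part of any patch.

#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical: a backup tool that wrote a SYNC_DONE promise also
 * recorded the data directory's inode number and the hostname it ran
 * on.  If either no longer matches, assume the directory was copied
 * somewhere else and ignore the promise.
 */
static int
sync_promise_still_believable(const char *pgdata,
                              unsigned long recorded_inode,
                              const char *recorded_host)
{
    struct stat st;
    char        host[256];

    if (stat(pgdata, &st) != 0)
        return 0;
    if (gethostname(host, sizeof(host)) != 0)
        return 0;
    return (unsigned long) st.st_ino == recorded_inode &&
        strcmp(host, recorded_host) == 0;
}

int
main(void)
{
    /* Values that would have come from the (hypothetical) promise. */
    if (sync_promise_still_believable("/tmp/pgdata-example", 1234, "x"))
        printf("claim believable: skip the startup sync\n");
    else
        printf("claim not believable: sync the data directory anyway\n");
    return 0;
}

Even so, as noted above there is a long tail of ways for such a claim to be deceived, which is part of the argument for making the startup sync cheap rather than skippable.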
Greetings,

* Thomas Munro (thomas.munro@gmail.com) wrote:
> On Mon, Mar 22, 2021 at 4:31 AM Stephen Frost <sfrost@snowman.net> wrote:
> > Presuming that we do add to the documentation the language to document what's assumed (and already done by modern backup tools), namely that they're fsync'ing everything they're restoring, do we/can we have an option which those tools could set that explicitly tells PG "everything in the cluster has been fsync'd already, you don't need to do anything extra"? Perhaps also/separately one for WAL that's restored with restore_command, if we think that's necessary?
>
> In the earlier thread, we did contemplate recovery_init_sync_method=none, but it has the problem that after recovery completes you have a cluster running with a setting that is really bad if you eventually crash again and run crash recovery again, this time with a dirty kernel page cache. I was one of the people voting against that feature, but I also wrote a strawman patch, just for the visceral experience of every cell in my body suddenly whispering "yeah, nah, I'm not committing that" as I wrote the weaselwordage for the manual.

Why not have a 'recovery_init_sync_method=backup'? It's not like there's a question about whether we're doing recovery from a backup or not.

> In that thread we also contemplated safe ways for a basebackup-type tool to promise that data has been sync'd, to avoid that problem with the GUC. Maybe something like: the backup label file could contain a "SYNC_DONE" message. But then what if someone copies the whole directory to a new location? How can you invalidate the promise? This is another version of the question of whether it's our problem or the user's to worry about buffering of pgdata files that they copy around with unknown tools. If it's our problem, maybe something like "SYNC_DONE for pgdata_inode=1234, hostname=x, ..." is enough to detect cases where we still believe the claim. But there's probably a long tail of weird ways for whatever you come up with to be deceived.

I don't really care for the idea of backup tools modifying the backup label... they're explicitly told not to do that and that seems like the best move. I also don't particularly care about silly "what ifs" like if someone randomly copies the data directory after the restore- yes, there are a lot of ways that people can screw things up by doing things that aren't sane, but that doesn't mean we should try to cater to such cases.

> In the case of the recovery_init_sync_method=wal patch proposed in *this* thread, here's the thing: there's not much to gain by trying to skip the sync anyway! For the WAL, you'll be opening those files soon anyway to replay them, and if they're already sync'd then fsync() will return quickly. For the relfile data, you'll be opening all relfiles that are referenced by the WAL soon anyway, and syncing them if required. So that just leaves relfiles that are not referenced in the WAL you're about to replay. Whether we have a duty to sync those too is that central question again, and one of the things I'm trying to get an answer to with this thread. The recovery_init_sync_method=wal patch only makes sense if the answer is "no".

I'm not too bothered by an extra fsync() for WAL files, just to be clear; it's running around fsync'ing everything else that seems objectionable to me.

> > Otherwise, just in general, I agree with doing this to address the risks discussed around regular crash recovery.
> > We have some pretty clear "if the DB was doing recovery and was interrupted, you need to restore from backup" messages today in xlog.c, and this patch didn't seem to change that? Am I missing something, or isn't the idea here that these changes would make it so you aren't going to end up with corruption in those cases? Specifically looking at-
> >
> > xlog.c:6509-
> >         case DB_IN_CRASH_RECOVERY:
> >             ereport(LOG,
> >                     (errmsg("database system was interrupted while in recovery at %s",
> >                             str_time(ControlFile->time)),
> >                      errhint("This probably means that some data is corrupted and"
> >                              " you will have to use the last backup for recovery.")));
> >             break;
> >
> >         case DB_IN_ARCHIVE_RECOVERY:
> >             ereport(LOG,
> >                     (errmsg("database system was interrupted while in recovery at log time %s",
> >                             str_time(ControlFile->checkPointCopy.time)),
> >                      errhint("If this has occurred more than once some data might be corrupted"
> >                              " and you might need to choose an earlier recovery target.")));
> >             break;
>
> Maybe I missed your point... but I don't think anything changes here? If recovery is crashing in some deterministic way (not just because you lost power the first time, but rather because a particular log record hits the same bug or gets confused by the same corruption and implodes every time) it'll probably keep doing so, and our sync algorithm doesn't seem to make a difference to that.

These errors aren't thrown only in the case where we hit a bad XLOG record though, are they..? Maybe I missed that somehow, but it seems like these get thrown in the simple case that we, say, lost power, started recovery, didn't finish recovery, and lost power again, even though with your patches hopefully that wouldn't actually result in a failure case or in corruption..?

In the 'bad XLOG' or 'confused by corruption' cases, I wonder if it'd be helpful to write that out more explicitly somehow... essentially segregating these cases.

Thanks,

Stephen