recovery_init_sync_method=wal - Mailing list pgsql-hackers

From: Thomas Munro
Subject: recovery_init_sync_method=wal
Msg-id: CA+hUKG+8Wm8TSfMWPteMEHfh194RytVTBNoOkggTQT1p5NTY7Q@mail.gmail.com
List: pgsql-hackers
Hello hackers,

I'm starting a new thread and CF entry for the material for r15 from the earlier thread[1] that introduced the recovery_init_sync_method GUC for r14. I wrote a summary of this topic as I see it, while it's still fresh in my mind from working on commit 61752afb, starting from what problem this solves.

TL;DR: Here's a patch that adds a less pessimistic, faster-starting crash recovery mode based on first principles.

=== Background ===

Why do we synchronise the data directory before we run crash recovery?

1. WAL: It's not safe for changes to data pages to reach the disk before the WAL does. This is the basic log-before-data rule. Suppose we didn't enforce it. If our online cluster crashed after calling pwrite(<some WAL data>) but before calling fdatasync(), the WAL data we later read in crash recovery may differ from what's really on disk, and it'd be dangerous to replay its contents, because their effects on data pages might then be written to disk at any time and break the rule. If that happens and you lose power and then run crash recovery a second time, you now have phantom partial changes already applied but no WAL left to redo them, leading to hazards including xids being recycled, effects of committed transactions being partially lost, multi-page changes being half done, and other such corruption.

2. Data files: We can't skip a change to a data page based on the page header's LSN if the page is not known to be on disk (that is, it is clean in PostgreSQL's buffer pool, but possibly dirty in the kernel's page cache). Otherwise, the end-of-recovery checkpoint will do nothing for the page (assuming nothing else marks it dirty in our buffer pool before then), so we'll complete the checkpoint and allow the WAL to be discarded.
Then we might lose power before the kernel gets a chance to write the data page to disk, and when the lights come back on we'll run crash recovery, fail to replay that forgotten change from before the bogus checkpoint, and lose committed changes. (I don't think this can happen with full_page_writes=on, because in that mode we never skip pages and always do a full replay, which has various tradeoffs.)

I believe those are the fundamental reasons. If you know of more, I'd love to hear them.

Why don't we synchronise the data directory for a clean startup?

When we start up the database from a shutdown checkpoint, we take the checkpoint at face value. A checkpoint is a promise that all changes up to a given LSN have been durably stored on disk. There are a couple of cases where that isn't true:

1. You were previously running with fsync=off. That's OK: we told you not to do that. Checkpoints created without fsync barriers to enforce the strict WAL-then-data-then-checkpoint protocol are forgeries.

2. You made a file-system-level copy of a cluster that you shut down cleanly first, using cp, tar, scp, rsync, xmodem etc., and now you start up the copy. Its checkpoint is a forgery. (Maybe our manual should mention this problem under "25.2. File System Level Backup", where it teaches you to rsync your cluster.)

How do the existing recovery_init_sync_method modes work?

You can think of recovery_init_sync_method as different "query plans" for finding dirty buffers in the kernel's page cache to sync.

1. fsync: Go through the directory tree and call fsync() on each file, just in case that file had some dirty pages. This is a terrible plan if the files aren't currently in the kernel's VFS cache, because it can take up to a few milliseconds to get each one in there (a random read to slower SSDs, network storage or IOPS-capped cloud storage). If there really is a lot of dirty data, it's a good bet, because the files must have been in the VFS cache already.
But if there are one million mostly read-only tables, it can take ages just to *open* all the files, even though there's not much to actually write out.

2. syncfs: Go through the kernel page cache instead, looking for dirty data in the small number of file systems that contain our directories. This is driven by data that is already in the kernel's cache, so we avoid performing I/O just to search for dirty data. That's great if your cluster is running mostly alone on the file system in question, but not so great if another PostgreSQL cluster is running on the same file system, because now we generate extra write I/O when syncfs finds incidental other stuff to write out.

These are both scatter-gun approaches that can sometimes do a lot of useless work, and I'd like to find a precise version that uses information we already have about what might be dirty, according to the meaning of a checkpoint and a transaction log. The attached patch does that as follows:

1. Sync the WAL using fsync(), to enforce the log-before-data rule. That's moved into the existing loop that scans the WAL files looking for temporary files to unlink. (I suppose it should perhaps do the "presync" trick too. Not done yet.)

2. While replaying the WAL, if we ever decide to skip a page because of its LSN, remember to fsync() the file in the next checkpoint anyway (because the data might be dirty in the kernel). This way we sync all files that changed since the last checkpoint, even if we don't have to apply the change again. (A more paranoid mode would mark the page dirty instead, so that we'd not only fsync() it but also write it out again. This would defend against kernels whose writeback failure modes include keeping changes but dropping their own dirty flag. Not done here.)

One thing about this approach is that it takes the checkpoint it recovers from at face value. This is similar to the current situation with startup from a clean shutdown checkpoint.
If you're starting up a database that was previously running with fsync=off, it won't fix the problems that may have created; and if you beamed a copy of your crashed cluster to another machine with rsync and took no steps to sync it, it won't fix the problems caused by random files that are not yet flushed to disk and that don't happen to be dirtied (or skipped with BLK_DONE) by WAL replay. recovery_init_sync_method=fsync or syncfs will fix at least that second problem for you.

Now, what holes are there in this scheme?

[1] https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org