Thread: Missing data_sync_elevel() for some calls of pg_fsync()?
Hi all,

I was just looking at some callers of pg_fsync() and noticed that some code paths don't use data_sync_elevel(). For some of those paths it is actually better never to PANIC (say the backup_label file, logical decoding snapshots, or the lock file, where FATAL/LOG are used now). However, I have spotted three code paths where this is not done, and I think that's not fine:
- The 2PC file generated at checkpoint time.
- WAL segment initialization.
- The temporary state file for a replication slot save, which may cause ERRORs at checkpoint time.

Any thoughts?
--
Michael
Attachment
On Mon, Dec 2, 2019 at 5:58 PM Michael Paquier <michael@paquier.xyz> wrote:
> I was just looking at some callers of pg_fsync(), to notice that some
> code paths don't use data_sync_elevel(). For some code paths, that's
> actually better to never PANIC (say backup_label file, logical
> decoding snapshot, lock file where FATAL/LOG are used now, etc.).
> However I have spotted three code paths where this is not done and I
> think that's not fine:
> - 2PC file generated at checkpoint time.
> - WAL segment initialization.
> - Temporary state file for a replication slot save, which may cause
> ERRORs at checkpoint time.

One of the distinctions I had in mind when reviewing/working on the PANIC stuff was this:

1. Some code opens a file, writes some stuff to it, closes it, and then fsyncs it; if that fails and it ever retries, it'll redo all of those steps again. We know that some kernels might have thrown away the data, but we don't care about the copy in the kernel's cache because we'll write it out again next time around.

2. Some code, primarily buffer pool write-back code, writes data out to the file, then throws away the only copy we have of it other than the WAL by using the buffer for some other block, and only later (usually in the checkpointer) fsyncs it. One way to look at it is that if the fsync fails, the only place left to get that data (which may represent committed transactions) is the WAL, and the only way to get it is crash recovery. Another way to look at it is that if we didn't PANIC, the checkpointer would try again, but it's easily demonstrable that if it tries again, certain kernels will do nothing and then tell you that it succeeded, so we need to prevent that or our checkpoint would be a data-eating lie.

I didn't look closely at your patch, but I think those are category 1, no?
Admittedly there is an argument that we should panic in those cases too, because otherwise we're second-guessing how the kernel works, and I did make a similar argument for why sync_file_range() failure is panic-worthy.
On Mon, 2 Dec 2019 at 13:43, Thomas Munro <thomas.munro@gmail.com> wrote:
> 1. Some code opens a file, writes some stuff to it, closes it, and then
> fsyncs it; if that fails and it ever retries, it'll redo all of
> those steps again. We know that some kernels might have thrown away
> the data, but we don't care about the copy in the kernel's cache
> because we'll write it out again next time around.
Can we trust the kernel to report EIO or ENOSPC only for writeback of the actual file we're fsync()ing, though? And not for other dirty buffers it flushed while servicing our fsync() request, failed to flush, and complained about?
I'm not confident I want to assume that.