Thread: Missing data_sync_elevel() for some calls of pg_fsync()?
Hi all,

I was just looking at some callers of pg_fsync() and noticed that some code paths don't use data_sync_elevel(). For some of those paths it is actually better never to PANIC (say the backup_label file, logical decoding snapshots, or the lock file, where FATAL/LOG are used now). However, I have spotted three code paths where this is not done, and I think that's not fine:
- The 2PC file generated at checkpoint time.
- WAL segment initialization.
- The temporary state file for a replication slot save, which may cause ERRORs at checkpoint time.

Any thoughts?
--
Michael
Attachment
On Mon, Dec 2, 2019 at 5:58 PM Michael Paquier <michael@paquier.xyz> wrote:
> I was just looking at some callers of pg_fsync(), to notice that some
> code paths don't use data_sync_elevel(). For some code paths, that's
> actually better to never PANIC (say backup_label file, logical
> decoding snapshot, lock file where FATAL/LOG are used now, etc.).
> However I have spotted three code paths where this is not done and I
> think that's not fine:
> - 2PC file generated at checkpoint time.
> - WAL segment initialization.
> - Temporary state file for a replication slot save, which may cause
> ERRORs at checkpoint time.

One of the distinctions I had in mind when reviewing/working on the PANIC stuff was this:

1. Some code opens a file, writes some stuff to it, closes it, and then fsyncs it; if that fails and it ever retries, it'll redo all of those steps again. We know that some kernels might have thrown away the data, but we don't care about the copy in the kernel's cache because we'll write it out again next time around.

2. Some code, primarily buffer pool write-back code, writes data out to the file, then throws away the only copy we have of it other than the WAL by using the buffer for some other block, and only later (usually in the checkpointer) fsyncs it. One way to look at it is that if the fsync fails, the only place left to get that data (which may represent committed transactions) is the WAL, and the only way to get it is crash recovery. Another way to look at it is that if we didn't PANIC, the checkpointer would try again, but it's easily demonstrable that if it tries again, certain kernels will do nothing and then tell you that it succeeded, so we need to prevent that or our checkpoint would be a data-eating lie.

I didn't look closely at your patch, but I think those are category 1, no?
Admittedly there is an argument that we should panic in those cases too, because otherwise we're second-guessing how the kernel works, and I did make a similar argument for why sync_file_range() failure is panic-worthy.
On Mon, 2 Dec 2019 at 13:43, Thomas Munro <thomas.munro@gmail.com> wrote:
> 1. Some code opens a file, writes some stuff to it, closes it, and then
> fsyncs it; if that fails and it ever retries, it'll redo all of
> those steps again. We know that some kernels might have thrown away
> the data, but we don't care about the copy in the kernel's cache
> because we'll write it out again next time around.
Can we trust the kernel to report EIO or ENOSPC only for writeback of the actual file we're fsync()ing, though? And not for other dirty buffers it flushed while servicing our fsync() request, failed to flush, and complained about?
I'm not confident I want to assume that.