Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date
Msg-id 20180402205808.GZ24540@tamriel.snowman.net
In response to Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Anthony Iliopoulos <ailiop@altatus.com>)
Responses Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
List pgsql-hackers
Greetings,

* Anthony Iliopoulos (ailiop@altatus.com) wrote:
> On Mon, Apr 02, 2018 at 12:32:45PM -0700, Andres Freund wrote:
> > On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:
> > > On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:
> > > > Throwing away the dirty pages *and* persisting the error seems a lot
> > > > more reasonable. Then provide a fcntl (or whatever) extension that can
> > > > clear the error status in the few cases that the application that wants
> > > > to gracefully deal with the case.
> > >
> > > Given precisely that the dirty pages which cannot be written-out are
> > > practically thrown away, the semantics of fsync() (after the 4.13 fixes)
> > > are essentially correct: the first call indicates that a writeback error
> > > indeed occurred, while subsequent calls have no reason to indicate an error
> > > (assuming no other errors occurred in the meantime).
> >
> > Meh^2.
> >
> > "no reason" - except that there's absolutely no way to know what state
> > the data is in. And that your application needs explicit handling of
> > such failures. And that one FD might be used in lots of different
> > parts of the application, that fsyncs in one part of the application
> > might be an ok failure, and in another not.  Requiring explicit actions
> > to acknowledge "we've thrown away your data for unknown reason" seems
> > entirely reasonable.
>
> As long as fsync() indicates error on first invocation, the application
> is fully aware that between this point of time and the last call to fsync()
> data has been lost. Persisting this error any further does not change this
> or add any new info - on the contrary it adds confusion as subsequent write()s
> and fsync()s on other pages can succeed, but will be reported as failures.

fsync() doesn't reflect the status of given pages, however; it reflects
the status of the file descriptor, and as such the file, on which it's
called.  This notion that fsync() is actually only responsible for the
changes which were made to a file since the last fsync() call is pure
foolishness.  If we were able to pass a list of pages or data ranges to
fsync() for it to verify they're on disk then perhaps things would be
different, but we can't; all we can do is ask to "please flush all the
dirty pages associated with this file descriptor, which represents this
file we opened, to disk, and let us know if you were successful."
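
To make that concrete, here's a rough sketch (not PostgreSQL code; the
helper name is made up) of the only contract fsync() actually gives us:
one return value covering every dirty page of the whole file, with no
way to narrow the question to the pages we care about.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: the only question we can ask is "please flush
 * all the dirty pages of this file and tell me whether you succeeded".
 */
int
flush_whole_file(const char *path)
{
    int         fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;

    if (fsync(fd) != 0)
    {
        /* One error for the whole file; no hint which pages failed. */
        fprintf(stderr, "fsync of %s failed: %s\n", path, strerror(errno));
        close(fd);
        return -1;
    }

    close(fd);
    return 0;
}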

Give us a way to ask "are these specific pages written out to persistent
storage?" and we would certainly be happy to use it, and to repeatedly
try to flush out pages which weren't synced to disk due to some
transient error, and to track those cases and make sure that we don't
incorrectly assume that they've been transferred to persistent storage.
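
For the record, nothing like this exists today; the following is purely
a hypothetical declaration of the kind of interface being asked for
above, with made-up names:

#include <sys/types.h>

/* Hypothetical only; no such interface exists in POSIX or Linux. */
struct page_range
{
    off_t       offset;
    size_t      length;
};

/*
 * "Are these specific ranges of the file durably on disk?"  Returns 0
 * if all listed ranges have been written out, -1 (with errno set)
 * otherwise.
 */
extern int fsync_ranges(int fd, const struct page_range *ranges, int nranges);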

> The application will need to deal with that first error irrespective of
> subsequent return codes from fsync(). Conceptually every fsync() invocation
> demarcates an epoch for which it reports potential errors, so the caller
> needs to take responsibility for that particular epoch.

We do deal with that error- by realizing that it failed and later
*retrying* the fsync(), which is when we get back an "all good!
everything with this file descriptor you've opened is sync'd!" and
happily take that to be the truth, when, in reality, it's an
unfortunate lie and there are still pages associated with that file
descriptor which are dirty and not sync'd to disk.
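
Here's roughly that pattern as a sketch (hypothetical code, not lifted
from PostgreSQL); under "the error is cleared after the first fsync()"
semantics, the retry reports success even though the pages it was
supposed to cover were thrown away:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
fsync_with_retry(int fd)
{
    if (fsync(fd) == 0)
        return 0;               /* reported as durable */

    fprintf(stderr, "fsync failed: %s, will retry\n", strerror(errno));

    /*
     * If the kernel dropped the failed pages and cleared the error,
     * this second call returns 0 without having written anything: the
     * "all good!" answer that is, in reality, a lie.
     */
    return fsync(fd);
}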

Consider two independent programs where the first one writes to a set
of files and then calls the second one, whose job it is to go out and
fsync() those files, perhaps asynchronously from the first.  Is the
second program supposed to go write to each page that the first one
wrote to, in order to ensure that all the dirty bits are set, so that
the fsync() actually has something to flush and can report whether
those pages made it to disk?
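
As a sketch of that scenario (file name and details are made up; these
are two separate programs):

/* writer.c: dirties the file and exits without syncing */
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
    int         fd = open("/data/shared.dat", O_WRONLY | O_CREAT, 0644);

    if (fd < 0)
        return 1;
    write(fd, "important", 9);  /* page is now dirty in the page cache */
    close(fd);                  /* durability is the syncer's job */
    return 0;
}

/* syncer.c: opens the same file later, purely to fsync() it */
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
    int         fd = open("/data/shared.dat", O_WRONLY);

    if (fd < 0)
        return 1;

    /*
     * Is this program supposed to rewrite every page the writer
     * touched, just so this fsync() has dirty pages to flush and
     * report on?
     */
    if (fsync(fd) != 0)
    {
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}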

> Callers that are not affected by the potential outcome of fsync() and
> do not react on errors, have no reason for calling it in the first place
> (and thus masking failure from subsequent callers that may indeed care).

Reacting to an error from an fsync() call could, based on how it's
documented and actually implemented in other OSes, mean "run another
fsync() to see if the error has resolved itself."  Requiring that to
mean "you have to go dirty all of the pages you previously dirtied to
actually get a subsequent fsync() to do anything" is really just not
reasonable- a given program may have no idea what was written to
previously, nor any particular reason to need to know, on the
expectation that the fsync() call will flush any dirty pages, as it's
documented to do.
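
If that requirement stood, every caller would be reduced to something
like this sketch (made-up helper, illustrative buffer size) just to
make a retried fsync() meaningful:

#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical: re-dirty every page of the file so that a subsequent
 * fsync() actually has something to write, then call it.
 */
int
redirty_and_fsync(int fd)
{
    char        buf[8192];
    struct stat st;
    off_t       off;

    if (fstat(fd, &st) != 0)
        return -1;

    for (off = 0; off < st.st_size; off += (off_t) sizeof(buf))
    {
        ssize_t     n = pread(fd, buf, sizeof(buf), off);

        if (n <= 0)
            return -1;
        if (pwrite(fd, buf, (size_t) n, off) != n)  /* dirty it again */
            return -1;
    }

    return fsync(fd);           /* only now does retrying write anything */
}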

Thanks!

Stephen
