Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers

On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> Given precisely that the dirty pages which cannot been written-out are
> practically thrown away, the semantics of fsync() (after the 4.13 fixes)
> are essentially correct: the first call indicates that a writeback error
> indeed occurred, while subsequent calls have no reason to indicate an error
> (assuming no other errors occurred in the meantime).

Like other people here, I think this is 100% unreasonable, starting
with "the dirty pages which cannot been written out are practically
thrown away".  Who decided that was OK, and on the basis of what
wording in what specification?  I think it's always unreasonable to
throw away the user's data.  If the writes are going to fail, then let
them keep on failing every time.  *That* wouldn't cause any data loss,
because we'd never be able to checkpoint, and eventually the user
would have to kill the server uncleanly, and that would trigger
recovery.

Also, this really does make it impossible to write reliable programs.
Imagine that, while the server is running, somebody runs a program
which opens a file in the data directory, calls fsync() on it, and
closes it.  If the fsync() fails, postgres is now borked and has no
way of being aware of the problem.  If we knew, we could PANIC, but
we'll never find out, because the unrelated process ate the error.
This is exactly the sort of ill-considered behavior that makes fcntl()
locking nearly useless.

Even leaving that aside, a PANIC means a prolonged outage on a
prolonged system - it could easily take tens of minutes or longer to
run recovery.  So saying "oh, just do that" is not really an answer.
Sure, we can do it, but it's like trying to lose weight by
intentionally eating a tapeworm.  Now, it's possible to shorten the
checkpoint_timeout so that recovery runs faster, but then performance
drops because data has to be fsync()'d more often instead of getting
buffered in the OS cache for the maximum possible time.  We could also
dodge this issue in another way: suppose that when we write a page
out, we don't consider it really written until fsync() succeeds.  Then
we wouldn't need to PANIC if an fsync() fails; we could just re-write
the page.  Unfortunately, this would also be terrible for performance,
for pretty much the same reasons: letting the OS cache absorb lots of
dirty blocks and do write-combining is *necessary* for good
performance.

> The error reporting is thus consistent with the intended semantics (which
> are sadly not properly documented). Repeated calls to fsync() simply do not
> imply that the kernel will retry to writeback the previously-failed pages,
> so the application needs to be aware of that. Persisting the error at the
> fsync() level would essentially mean moving application policy into the
> kernel.

I might accept this argument if I accepted that it was OK to decide
that an fsync() failure means you can forget that the write() ever
happened in the first place, but it's hard to imagine an application
that wants that behavior.  If the application didn't care about
whether the bytes really got to disk or not, it would not have called
fsync() in the first place.  If it does care, reporting the error only
once is never an improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: "Tsunakawa, Takayuki"
Date:
Subject: RE: Speedup of relation deletes during recovery
Next
From: Robert Haas
Date:
Subject: Re: [HACKERS] MERGE SQL Statement for PG11