Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date
Msg-id CAMsr+YF1wcY6TkRXvtD=se5=qX1zJ8N8dieBV7qeCOoG7fXRCg@mail.gmail.com
Whole thread Raw
In response to Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
On 28 March 2018 at 11:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
> TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as well to avoid similar lost-page-write issues.

It's not necessary on ext3/ext4 with errors=remount-ro, but that's only because the FS stops us dead in our tracks.

I don't pretend it's sane. The kernel behaviour is IMO crazy. If it's going to lose a write, it should at minimum mark the FD as broken so no further fsync() or anything else can succeed on the FD, and an app that cares about durability must repeat the whole set of work since the prior succesful fsync(). Just reporting it once and forgetting it is madness.

But even if we convince the kernel folks of that, how do other platforms behave? And how long before these kernels are out of use? We'd better deal with it, crazy or no.

Please see my StackOverflow post for the kernel-level explanation. Note also the test case link there. https://stackoverflow.com/a/42436054/398670

> Retrying fsync() is not OK at
> least on Linux. When fsync() returns success it means "all writes since the
> last fsync have hit disk" but we assume it means "all writes since the last
> SUCCESSFUL fsync have hit disk".

If that's actually the case, we need to push back on this kernel brain
damage, because as you're describing it fsync would be completely useless.

It's not useless, it's just telling us something other than what we think it means. The promise it seems to give us is that if it reports an error once, everything *after* that is useless, so we should throw our toys, close and reopen everything, and redo from the last known-good state.

Though as Tomas posted below, it provides rather weaker guarantees than I thought in some other areas too. See that lwn.net article he linked.
 
Moreover, POSIX is entirely clear that successful fsync means all
preceding writes for the file have been completed, full stop, doesn't
matter when they were issued.

I can't find anything that says so to me. Please quote relevant spec.

I'm working from http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html which states that

"The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. The fsync() function shall not return until the system has completed that action or until an error is detected."

My reading is that POSIX does not specify what happens AFTER an error is detected. It doesn't say that error has to be persistent and that subsequent calls must also report the error. It also says:

"If the fsync() function fails, outstanding I/O operations are not guaranteed to have been completed."

but that doesn't clarify matters much either, because it can be read to mean that once there's been an error reported for some IO operations there's no guarantee those operations are ever completed even after a subsequent fsync returns success.

I'm not seeking to defend what the kernel seems to be doing. Rather, saying that we might see similar behaviour on other platforms, crazy or not. I haven't looked past linux yet, though.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

pgsql-hackers by date:

Previous
From: Haribabu Kommi
Date:
Subject: Re: [HACKERS] Pluggable storage
Next
From: Pavan Deolasee
Date:
Subject: Re: [HACKERS] MERGE SQL Statement for PG11