Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date
Msg-id CAMsr+YETSXaZ-kVMekgmsZFL2X7A7198ghHN4JFoHG-k3TvT2Q@mail.gmail.com
In response to Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Craig Ringer <craig@2ndquadrant.com>)
List pgsql-hackers
On 5 April 2018 at 15:09, Craig Ringer <craig@2ndquadrant.com> wrote:
 
Also, it's been reported to me off-list that anyone on the system calling sync(2) or the sync shell command will also generally consume the write error, causing us not to see it when we fsync(). The same is true for /proc/sys/vm/drop_caches. I have not tested these yet.

I just confirmed this with a tweak to the test that

records the file position
close()s the fd
sync()s
open()s the file
lseek()s back to the recorded position

This causes the test to completely ignore the I/O error, which is not reported to it at any time.
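Roughly, the tweaked sequence looks like this. It's only a minimal sketch, not the actual test: the file name "testfile" is a placeholder, error checking is trimmed, and it assumes the file sits on a device already configured to fail writes (e.g. a dm-error/dm-flakey target):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char  buf[4096] = {0};
        int   fd = open("testfile", O_WRONLY | O_CREAT, 0644);

        write(fd, buf, sizeof buf);          /* dirty a page; the device will fail it in writeback */
        off_t pos = lseek(fd, 0, SEEK_CUR);  /* record the file position */
        close(fd);                           /* drop the only open fd */

        sync();                              /* writeback hits EIO here and consumes it */

        fd = open("testfile", O_WRONLY);     /* re-open the same file */
        lseek(fd, pos, SEEK_SET);            /* seek back to the recorded position */

        if (fsync(fd) == 0)                  /* reports success: the write error is gone */
            puts("fsync succeeded despite the lost write");

        close(fd);
        return 0;
    }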

Fair enough, really, when you look at it from the kernel's point of view. What else can it do? Nobody has the file open. It'd have to mark the file itself as bad somehow. But that's pretty bad for our robustness AFAICS.
 
There's some level of agreement that we should PANIC on fsync() errors, at least on Linux, but likely everywhere. But we also now know that PANICing alone is insufficient to fully protect us.


If dirty writeback fails between our close() and re-open(), I see the same behaviour as with sync(). To test that I set dirty_writeback_centisecs and dirty_expire_centisecs to 1 and added a usleep(3*100*1000) between close() and open(). (That's still plenty of time for writeback to run.) So sync() is just a convenient way to simulate something other than our own fsync() writing out the dirty buffer.
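That variant, sketched against the sequence above, amounts to swapping the sync() call for a short sleep, with the two sysctls lowered beforehand (e.g. vm.dirty_writeback_centisecs=1 and vm.dirty_expire_centisecs=1); again, just an illustration of the shape of the change:

    close(fd);                        /* drop the only open fd, as before */
    usleep(3 * 100 * 1000);           /* ~300 ms: let background writeback hit the failing block */
    fd = open("testfile", O_WRONLY);  /* re-open, then lseek() and fsync() as above */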


If I omit the sync(), then we do get the error reported by fsync() once, when we re-open() the file and fsync() it: the buffers hadn't been written out yet, so the error wasn't generated until after we re-open()ed the file. But I doubt that will happen much in practice, because dirty writeback will usually get to it first, so the error will be seen and discarded before we reopen the file in the checkpointer.

In other words, it looks like *even with a new kernel that has the error-reporting bug fixes*, if I understand how the backends and checkpointer interact when it comes to file descriptors, we're unlikely to notice I/O errors and fail a checkpoint. We may notice them if a backend does its own eager writeback for large I/O operations, or if the checkpointer fsync()s a file before the kernel's dirty writeback gets around to trying to flush the pages that will fail.

I haven't tested anything with multiple processes / multiple FDs yet, where we keep one fd open while writing on another.

But at this point I don't see any way to make Pg reliably detect I/O errors, fail a checkpoint, then redo and retry. Even to fix this by PANICing, as I proposed originally, we first need to know that we have to PANIC.

AFAICS it's completely unsafe to write(), close(), open() and fsync() and expect the fsync() to make any promises about the write(). Which, if I read Pg's low-level storage code right, makes it completely unable to reliably detect I/O errors.

When you put it that way, it sounds fair enough too. How long is the kernel meant to remember that there was a write error on a since-closed file, triggered by a write initiated by some seemingly unrelated process, some unbounded time ago?

But it seems to put Pg on the fast track to O_DIRECT.
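The appeal there is that with O_DIRECT the data transfer bypasses the page cache, so (metadata aside, which still needs fsync()) an I/O error comes back from the write() itself rather than from some later writeback nobody is watching. A rough, hypothetical illustration only; the file name and the 4096-byte alignment are assumptions, and real alignment requirements depend on the filesystem and device:

    #define _GNU_SOURCE            /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        int   fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);

        /* O_DIRECT needs an aligned buffer and transfer size; 4096 is a
         * common safe choice, but not guaranteed everywhere. */
        if (posix_memalign(&buf, 4096, 4096) != 0)
            return 1;
        memset(buf, 0, 4096);

        /* Any I/O error from the data transfer is reported right here. */
        if (write(fd, buf, 4096) < 0)
            perror("write");

        free(buf);
        close(fd);
        return 0;
    }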

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
