Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers

From: Anthony Iliopoulos
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: Mon, 9 Apr 2018
Msg-id: 20180409105039.GA4233@ai-wks
In response to: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS (Greg Stark <stark@mit.edu>)
Responses: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
List: pgsql-hackers
On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote:
> On 8 April 2018 at 22:47, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> > On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:
> >> On 8 April 2018 at 04:27, Craig Ringer <craig@2ndquadrant.com> wrote:
> >> > On 8 April 2018 at 10:16, Thomas Munro <thomas.munro@enterprisedb.com>
> >
> > The question is, what should the kernel and application do in cases
> > where this is simply not possible (according to FreeBSD, which keeps
> > dirty pages around after failure, for example, -EIO from the block
> > layer is a contract for unrecoverable errors, so it is pointless to
> > keep them dirty). You'd need a specialized interface to clear out
> > the errors (and drop the dirty pages), or potentially just remount
> > the filesystem.
> 
> Well, firstly, that's not necessarily the question. ENOSPC is not an
> unrecoverable error. And even an unrecoverable error for a single
> write doesn't mean the write will never be able to succeed in the
> future.

To make things a bit simpler, let us focus on EIO for the moment.
The contract between the block layer and the filesystem layer is
assumed to be that when an EIO is propagated up to the fs, all
possibilities for recovery have been exhausted in the lower
layers of the stack. Mind you, I am not claiming that this
contract is either documented or necessarily respected (in fact
there have been studies on the error propagation and handling of
the block layer, see [1]). Let us assume that this is the design
contract though (which appears to be the case across a number of
open-source kernels), and if not - it's a bug. In this case,
indeed the specific write()s will never be able to succeed in
the future, at least not as long as the BIOs are allocated to
the specific failing LBAs.
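
To make that concrete, here is a minimal userspace sketch (mine,
purely illustrative, with a hypothetical helper name) of what
treating EIO as terminal means for a caller: retrying fsync() on
the same fd buys you nothing, and the only safe recovery is to
rewrite the data elsewhere from a copy the application still
holds.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Illustrative only: under the contract above, EIO from
     * fsync() is terminal for the affected LBAs. */
    static int
    flush_or_fail(int fd)
    {
        if (fsync(fd) == 0)
            return 0;
        if (errno == EIO)
        {
            /* Lower layers have exhausted recovery; do not
             * retry. Rewrite the data to a different file or
             * device from an application-held copy instead. */
            fprintf(stderr, "fsync: %s (unrecoverable)\n",
                    strerror(errno));
            return -1;
        }
        return -1;          /* other errors may be transient */
    }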

> But secondly doesn't such an interface already exist? When the device
> is dropped any dirty pages already get dropped with it. What's the
> point in dropping them but keeping the failing device?

I think there are degrees of failure. There are certainly cases
where one may encounter localized unrecoverable medium errors
(specific to certain LBAs) that are non-maskable from the block
layer and below. That does not mean that the device is dropped
at all, so it does make sense to continue serving all other
operations to the regions of the device that are still
functional. In the case of total device failure, the filesystem
will prevent you from proceeding anyway.

> But just to underline the point. "pointless to keep them dirty" is
> exactly backwards from the application's point of view. If the error
> writing to persistent media really is unrecoverable then it's all the
> more critical that the pages be kept so the data can be copied to some
> other device. The last thing user space expects to happen is if the
> data can't be written to persistent storage then also immediately
> delete it from RAM. (And the *really* last thing user space expects is
> for this to happen and return no error.)

Right. This implies, though, that apart from the kernel having
to keep the dirtied-but-unrecoverable pages around for an
unbounded time, there would further need to be an interface for
obtaining the exact failed pages so that you can read them back.
This in turn means that there needs to be an association between
the fsync() caller and the specific dirtied pages that the
caller intends to drain (for which we'd need an fsync_range(),
among other things). BTW, currently the failed writebacks are
not dropped from memory, but rather marked clean. They could be
lost though due to memory pressure or due to explicit request
(e.g. /proc/sys/vm/drop_caches), unless mlocked.
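
For completeness, a sketch (again mine, not something any
application does today as far as I know) of pinning a file's
cached pages with mlock(), so that pages marked clean after a
failed writeback at least cannot be reclaimed:

    #include <stddef.h>
    #include <sys/mman.h>

    /* Illustrative only: mlock() a shared mapping of the file
     * so its cached pages cannot be evicted, even after a
     * failed writeback has left them marked clean. */
    static void *
    pin_file_pages(int fd, size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

        if (p == MAP_FAILED)
            return NULL;
        if (mlock(p, len) != 0)
        {
            munmap(p, len);
            return NULL;
        }
        /* The data can still be copied out to another device
         * even if it never reaches this one. */
        return p;
    }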

The application has a clear responsibility to keep its buffers
around until a successful fsync(). The kernels do report the
error (albeit with all the complexities of dealing with the
interface), at which point the application may no longer assume
that the write()s were ever even buffered in the kernel page
cache in the first place.
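
In code, that responsibility looks roughly like the following
sketch (durable_write() is a hypothetical name of mine): the
caller must not discard buf until the call reports success.

    #include <errno.h>
    #include <unistd.h>

    /* Sketch: a successful write() alone says nothing about
     * persistence; only a successful fsync() does. */
    static int
    durable_write(int fd, const char *buf, size_t len)
    {
        size_t done = 0;

        while (done < len)
        {
            ssize_t n = write(fd, buf + done, len - done);

            if (n < 0)
            {
                if (errno == EINTR)
                    continue;
                return -1;      /* caller must keep buf */
            }
            done += (size_t) n;
        }
        return fsync(fd);       /* 0 here makes buf droppable */
    }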

What you seem to be asking for is the capability of dropping
buffers over the (kernel) fence and indemnifying the application
from any further responsibility, i.e. a hard assurance that
either the kernel will persist the pages or it will keep them
around until the application recovers them asynchronously, the
filesystem is unmounted, or the system is rebooted.

Best regards,
Anthony

[1] https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi.pdf

