Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers
From: Anthony Iliopoulos
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Msg-id: 20180409105039.GA4233@ai-wks
In response to: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS (Greg Stark <stark@mit.edu>)
Responses:
  Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
  Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
List: pgsql-hackers
On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote:
> On 8 April 2018 at 22:47, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> > On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:
> >> On 8 April 2018 at 04:27, Craig Ringer <craig@2ndquadrant.com> wrote:
> >> > On 8 April 2018 at 10:16, Thomas Munro <thomas.munro@enterprisedb.com>
> >
> > The question is, what should the kernel and application do in cases
> > where this is simply not possible (according to freebsd that keeps
> > dirty pages around after failure, for example, -EIO from the block
> > layer is a contract for unrecoverable errors so it is pointless to
> > keep them dirty). You'd need a specialized interface to clear-out
> > the errors (and drop the dirty pages), or potentially just remount
> > the filesystem.
>
> Well firstly that's not necessarily the question. ENOSPC is not an
> unrecoverable error. And even unrecoverable errors for a single write
> doesn't mean the write will never be able to succeed in the future.

To make things a bit simpler, let us focus on EIO for the moment. The
contract between the block layer and the filesystem layer is assumed to
be that when an EIO is propagated up to the fs, all possibilities for
recovery have been exhausted in the lower layers of the stack. Mind you,
I am not claiming that this contract is either documented or necessarily
respected (in fact there have been studies on the error propagation and
handling of the block layer, see [1]). Let us assume that this is the
design contract though (which appears to be the case across a number of
open-source kernels), and if not - it's a bug. In this case, the specific
write()s will indeed never be able to succeed in the future, at least not
as long as the BIOs are allocated to the specific failing LBAs.

> But secondly doesn't such an interface already exist? When the device
> is dropped any dirty pages already get dropped with it.
> What's the point in dropping them but keeping the failing device?

I think there are degrees of failure. There are certainly cases where
one may encounter localized, unrecoverable medium errors (specific to
certain LBAs) that are non-maskable from the block layer and below. That
does not mean that the device is dropped at all, so it does make sense to
continue all other operations to all other regions of the device that are
functional. In cases of total device failure, the filesystem will prevent
you from proceeding anyway.

> But just to underline the point. "pointless to keep them dirty" is
> exactly backwards from the application's point of view. If the error
> writing to persistent media really is unrecoverable then it's all the
> more critical that the pages be kept so the data can be copied to some
> other device. The last thing user space expects to happen is if the
> data can't be written to persistent storage then also immediately
> delete it from RAM. (And the *really* last thing user space expects is
> for this to happen and return no error.)

Right. This implies, though, that apart from the kernel having to keep
the dirtied-but-unrecoverable pages around for an unbounded time, there
would further need to be an interface for obtaining the exact failed
pages so that you can read them back. This in turn means that there needs
to be an association between the fsync() caller and the specific dirtied
pages that the caller intends to drain (for which we'd need an
fsync_range(), among other things).

BTW, currently the failed writebacks are not dropped from memory, but
rather marked clean. They could be lost though due to memory pressure or
due to explicit request (e.g. proc drop_caches), unless mlocked. There is
a clear responsibility of the application to keep its buffers around
until a successful fsync().
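To illustrate the discipline described above, here is a minimal sketch (using Python's os module as a thin wrapper over the POSIX calls; the file names and the fallback path are purely illustrative): the application keeps its own copy of the data until fsync() succeeds, and on failure it re-routes that copy elsewhere instead of retrying fsync() on pages the kernel has already marked clean.

```python
# Sketch, not a prescription: the application retains ownership of `data`
# until fsync() has succeeded. On a failed fsync() it does NOT retry on
# the same fd (on Linux the failed pages are marked clean, so a retried
# fsync() can spuriously report success); it re-writes its own copy to an
# alternate destination instead. File names are hypothetical.
import os

def durable_write(data: bytes, primary: str, fallback: str) -> str:
    """Write `data` durably, returning the path that accepted it."""
    for path in (primary, fallback):
        fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
        try:
            if os.write(fd, data) != len(data):
                continue            # short write: try the next destination
            os.fsync(fd)            # only after this may `data` be released
            return path
        except OSError:
            continue                # kernel copy is untrustworthy; ours isn't
        finally:
            os.close(fd)
    raise RuntimeError("no destination accepted the data")

print(durable_write(b"important record\n", "primary.dat", "fallback.dat"))
```

On a healthy filesystem this writes and syncs the primary file; the fallback only comes into play when write() or fsync() reports an error, which is exactly the point at which the kernel's buffered copy can no longer be relied upon.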
The kernels do report the error (albeit with all the complexities of
dealing with the interface), at which point the application may not
assume that the write()s were ever even buffered in the kernel page cache
in the first place. What you seem to be asking for is the capability of
dropping buffers over the (kernel) fence and indemnifying the application
from any further responsibility, i.e. a hard assurance that either the
kernel will persist the pages or it will keep them around until the
application recovers them asynchronously, the filesystem is unmounted, or
the system is rebooted.

Best regards,
Anthony

[1] https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi.pdf