Thread: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

Hi all

Some time ago I ran into an issue where a user encountered data corruption after a storage error. PostgreSQL played a part in that corruption by allowing a checkpoint to complete after what should've been a fatal error.

TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means "all writes since the last fsync have hit disk" but we assume it means "all writes since the last SUCCESSFUL fsync have hit disk".

Pg wrote some blocks, which went to OS dirty buffers for writeback. Writeback failed due to an underlying storage error. The block I/O layer and XFS marked the writeback page as failed (AS_EIO), but had no way to tell the app about the failure. When Pg called fsync() on the FD during the next checkpoint, fsync() returned EIO because of the flagged page, to tell Pg that a previous async write failed. Pg treated the checkpoint as failed and didn't advance the redo start position in the control file.

All good so far.

But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() *cleared the AS_EIO bad page flag*.

The write never made it to disk, but we completed the checkpoint, and merrily carried on our way. Whoops, data loss.
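The sequence is easier to see in a toy model. Here's a small Python simulation of the behaviour described above (purely illustrative: ToyKernel, checkpoint(), and the page/disk dicts are inventions for the example, not real kernel or Pg code) of a kernel that marks a failed writeback page clean and clears the error flag once fsync() has reported it:

```python
class ToyKernel:
    """Toy model of the Linux behaviour described above: a page that
    fails writeback is dropped (marked clean, not re-dirtied), and the
    error flag is cleared by the first fsync() that reports it."""
    def __init__(self):
        self.dirty = {}        # page -> data waiting for writeback
        self.disk = {}         # what actually reached storage
        self.io_error = False  # stand-in for the AS_EIO flag

    def write(self, page, data):
        self.dirty[page] = data         # buffered write: page cache only

    def writeback(self, fail_pages=()):
        for page, data in list(self.dirty.items()):
            if page in fail_pages:
                self.io_error = True    # error flagged...
                del self.dirty[page]    # ...but page marked clean anyway
            else:
                self.disk[page] = data
                del self.dirty[page]

    def fsync(self):
        self.writeback()
        if self.io_error:
            self.io_error = False       # reported once, then forgotten
            raise OSError("EIO")

def checkpoint(kernel):
    try:
        kernel.fsync()
        return True     # checkpoint "complete": redo start would advance
    except OSError:
        return False    # checkpoint failed; we'll retry it later

kernel = ToyKernel()
kernel.write(0, b"important block")
kernel.writeback(fail_pages=[0])     # storage error during background writeback

assert checkpoint(kernel) is False   # first checkpoint correctly fails
assert checkpoint(kernel) is True    # the retry "succeeds"...
assert 0 not in kernel.disk          # ...but the block never reached disk
```

The final assertion is the data loss: the retried checkpoint reports success while the write never reached storage.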

The clear-error-and-continue behaviour of fsync is not documented as far as I can tell. Nor is fsync() returning EIO, unless you have very new Linux man-pages with the patch I wrote to add it. But from what I can see in the POSIX standard we are given no guarantees about what happens on fsync() failure at all, so we're probably wrong to assume that retrying fsync() is safe.

If the server had been using ext3 or ext4 with errors=remount-ro, the problem wouldn't have occurred because the first I/O error would've remounted the FS and stopped Pg from continuing. But XFS doesn't have that option. There may be other situations where this can occur too, involving LVM and/or multipath, but I haven't comprehensively dug out the details yet.
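For comparison, the ext3/ext4 behaviour referred to here is selected with a mount option, e.g. in /etc/fstab (device and mount point are illustrative):

```
/dev/sda2  /var/lib/postgresql  ext4  errors=remount-ro  0  2
```

With that option the filesystem remounts itself read-only on an error, so Pg's next write fails hard instead of a checkpoint quietly "succeeding".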

It proved possible to recover the system by faking up a backup label from before the first incorrectly-successful checkpoint, forcing redo to repeat and write the lost blocks. But ... what a mess.

I posted about the underlying fsync issue here some time ago:


but haven't had a chance to follow up about the Pg specifics.

I've been looking at the problem on and off and haven't come up with a good answer. I think we should just PANIC and let redo sort it out by repeating the failed write when it repeats work since the last checkpoint.

The API offered by async buffered writes and fsync gives us no way to find out which page failed, so we can't selectively redo just that write. I think we do know the relfilenode associated with the fd that failed to fsync, but not much more. So the alternative seems to be some sort of potentially complex online-redo scheme where we replay WAL only for the relation on which we had the fsync() error, while otherwise servicing queries normally. That's likely to be extremely error-prone and hard to test, and it's trying to solve a case where on other filesystems the whole DB would grind to a halt anyway.

I looked into whether we can solve it with use of the AIO API instead, but the mess is even worse there - from what I can tell you can't even reliably guarantee fsync at all on all Linux kernel versions.

We already PANIC on fsync() failure for WAL segments. We just need to do the same for data forks at least for EIO. This isn't as bad as it seems because AFAICS fsync only returns EIO in cases where we should be stopping the world anyway, and many FSes will do that for us.

There are rather a lot of pg_fsync() callers. While we could handle this case-by-case for each one, I'm tempted to just make pg_fsync() itself intercept EIO and PANIC. Thoughts?
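To sketch what that interception might look like (illustrative Python rather than the real C in the backend; panic() and the fsync_impl parameter are inventions for the example):

```python
import errno
import os

def panic(msg):
    """Stand-in for PANIC: crash the process so that WAL redo repeats
    all work since the last known-good checkpoint."""
    raise SystemExit("PANIC: " + msg)

def pg_fsync(fd, fsync_impl=os.fsync):
    """Hypothetical wrapper: treat EIO from fsync() as unrecoverable,
    since a retry cannot see the error again on the affected kernels.
    Other errors (e.g. ENOSPC) are re-raised for the caller to handle
    and retry as before."""
    try:
        fsync_impl(fd)
    except OSError as e:
        if e.errno == errno.EIO:
            panic("could not fsync: lost page writes may exist")
        raise
```

A real patch would of course use ereport(PANIC, ...) and would need a decision on which errnos besides EIO are unrecoverable.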

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Craig Ringer <craig@2ndquadrant.com> writes:
> TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

> Retrying fsync() is not OK at
> least on Linux. When fsync() returns success it means "all writes since the
> last fsync have hit disk" but we assume it means "all writes since the last
> SUCCESSFUL fsync have hit disk".

If that's actually the case, we need to push back on this kernel brain
damage, because as you're describing it fsync would be completely useless.

Moreover, POSIX is entirely clear that successful fsync means all
preceding writes for the file have been completed, full stop, doesn't
matter when they were issued.

            regards, tom lane


On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
> Craig Ringer <craig@2ndquadrant.com> writes:
>> TL;DR: Pg should PANIC on fsync() EIO return.
>
> Surely you jest.

Any callers of pg_fsync in the backend code are careful enough to check
the returned status, sometimes doing retries like in mdsync, so what is
proposed here would be a regression.
--
Michael

On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
>> Craig Ringer <craig@2ndquadrant.com> writes:
>>> TL;DR: Pg should PANIC on fsync() EIO return.
>>
>> Surely you jest.
>
> Any callers of pg_fsync in the backend code are careful enough to check
> the returned status, sometimes doing retries like in mdsync, so what is
> proposed here would be a regression.

Craig, is the phenomenon you described the same as the second issue
"Reporting writeback errors" discussed in this article?

https://lwn.net/Articles/724307/

"Current kernels might report a writeback error on an fsync() call,
but there are a number of ways in which that can fail to happen."

That's... I'm speechless.

-- 
Thomas Munro
http://www.enterprisedb.com


On Thu, Mar 29, 2018 at 11:30:59AM +0900, Michael Paquier wrote:
> On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
> > Craig Ringer <craig@2ndquadrant.com> writes:
> >> TL;DR: Pg should PANIC on fsync() EIO return.
> > 
> > Surely you jest.
> 
> Any callers of pg_fsync in the backend code are careful enough to check
> the returned status, sometimes doing retries like in mdsync, so what is
> proposed here would be a regression.

The retries are the source of the problem; the first fsync() can return EIO,
and also *clears the error*, causing a 2nd fsync (of the same data) to return
success.

(Note: I can see that it might be useful to PANIC on EIO but to retry on ENOSPC.)

On Thu, Mar 29, 2018 at 03:48:27PM +1300, Thomas Munro wrote:
> Craig, is the phenomenon you described the same as the second issue
> "Reporting writeback errors" discussed in this article?
> https://lwn.net/Articles/724307/

Worse, the article acknowledges the behavior without apparently suggesting to
change it:

 "Storing that value in the file structure has an important benefit: it makes
it possible to report a writeback error EXACTLY ONCE TO EVERY PROCESS THAT
CALLS FSYNC() .... In current kernels, ONLY THE FIRST CALLER AFTER AN ERROR
OCCURS HAS A CHANCE OF SEEING THAT ERROR INFORMATION."

I believe I reproduced the problem behavior using dmsetup "error" target, see
attached.

strace looks like this:

kernel is Linux 4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

     1    open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
     2    write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
     3    write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
     4    write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
     5    write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
     6    write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
     7    write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
     8    write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 2560
     9    write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = -1 ENOSPC (No space left on device)
 
    10    dup(2)                                  = 4
    11    fcntl(4, F_GETFL)                       = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE)
    12    brk(NULL)                               = 0x1299000
    13    brk(0x12ba000)                          = 0x12ba000
    14    fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
    15    write(4, "write(1): No space left on devic"..., 34write(1): No space left on device
    16    ) = 34
    17    close(4)                                = 0
    18    fsync(3)                                = -1 EIO (Input/output error)
    19    dup(2)                                  = 4
    20    fcntl(4, F_GETFL)                       = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE)
    21    fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
    22    write(4, "fsync(1): Input/output error\n", 29fsync(1): Input/output error
    23    ) = 29
    24    close(4)                                = 0
    25    close(3)                                = 0
    26    open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
    27    fsync(3)                                = 0
    28    write(3, "\0", 1)                       = 1
    29    fsync(3)                                = 0
    30    exit_group(0)                           = ?

2: EIO isn't seen initially due to writeback page cache;
9: ENOSPC due to small device
18: original IO error reported by fsync, good
25: the original FD is closed
26: ..and file reopened
27: fsync on file with still-dirty data+EIO returns success BAD

10, 19: I'm not sure why there's dup(2), I guess glibc thinks that perror
should write to a separate FD (?)

Also note, close() ALSO returned success... which you might think exonerates the
2nd fsync(), but I think it may itself be problematic, no?  In any case, the 2nd
byte certainly never got written to the dm-error device, and the failure status
was lost following fsync().

I get the exact same behavior if I break after one write() loop, such as to
avoid ENOSPC.
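For readers without the attachment, the syscall sequence above corresponds to roughly the following (a Python re-sketch, not the actual attached test program; on a healthy filesystem every step succeeds, so it only demonstrates the problem when pointed at something like the dm "error" target):

```python
import os

def run(path, nblocks=8):
    """Mirror the strace above: buffered writes, fsync, close, then
    reopen and fsync again; returns (step, errno) outcomes. The
    original writes until ENOSPC, but as noted above one write loop
    is enough to reproduce the behaviour."""
    results = []
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    for _ in range(nblocks):
        try:
            os.write(fd, b"\0" * 8192)
        except OSError as e:
            results.append(("write", e.errno))   # ENOSPC on the tiny device
            break
    try:
        os.fsync(fd)                   # on dm-error: -1 EIO, reported once
        results.append(("fsync1", 0))
    except OSError as e:
        results.append(("fsync1", e.errno))
    os.close(fd)                       # returns success despite lost writes
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.fsync(fd)                   # on affected kernels: success, BAD
        results.append(("fsync2", 0))
    except OSError as e:
        results.append(("fsync2", e.errno))
    os.close(fd)
    return results
```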

Justin

On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> The retries are the source of the problem ; the first fsync() can return EIO,
> and also *clears the error* causing a 2nd fsync (of the same data) to return
> success.

What I'm failing to grok here is how that error flag even matters,
whether it's a single bit or a counter as described in that patch.  If
writeback failed, *the page is still dirty*.  So all future calls to
fsync() need to try to flush it again, and (presumably) fail
again (unless it happens to succeed this time around).

-- 
Thomas Munro
http://www.enterprisedb.com


On 29 March 2018 at 13:06, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> The retries are the source of the problem ; the first fsync() can return EIO,
> and also *clears the error* causing a 2nd fsync (of the same data) to return
> success.

What I'm failing to grok here is how that error flag even matters,
whether it's a single bit or a counter as described in that patch.  If
writeback failed, *the page is still dirty*.  So all future calls to
fsync() need to try to flush it again, and (presumably) fail
again (unless it happens to succeed this time around).

You'd think so. But it doesn't appear to work that way. You can see for yourself with the device-mapper "error" target mapped over part of a volume.

I wrote a test case here.


I don't pretend the kernel behaviour is sane, and it's possible I've made an error in my analysis. But since I've observed this in the wild and seen it in a test case, I strongly suspect that what I've described is just what's happening, brain-dead or no.

Presumably the kernel marks the page clean when it dispatches it to the I/O subsystem and doesn't dirty it again on I/O error? I haven't dug that deep on the kernel side. See the stackoverflow post for details on what I found in kernel code analysis.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 29 March 2018 at 10:48, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
>> Craig Ringer <craig@2ndquadrant.com> writes:
>>> TL;DR: Pg should PANIC on fsync() EIO return.
>>
>> Surely you jest.
>
> Any callers of pg_fsync in the backend code are careful enough to check
> the returned status, sometimes doing retries like in mdsync, so what is
> proposed here would be a regression.

Craig, is the phenomenon you described the same as the second issue
"Reporting writeback errors" discussed in this article?

https://lwn.net/Articles/724307/

A variant of it, by the looks.

The problem in our case is that the kernel only tells us about the error once. It then forgets about it. So yes, that seems like a variant of the statement:
 
"Current kernels might report a writeback error on an fsync() call,
but there are a number of ways in which that can fail to happen."

That's... I'm speechless.

Yeah.

It's a bit nuts.

I was astonished when I saw the behaviour, and that it appears undocumented.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 29 March 2018 at 10:30, Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
> Craig Ringer <craig@2ndquadrant.com> writes:
>> TL;DR: Pg should PANIC on fsync() EIO return.
>
> Surely you jest.

Any callers of pg_fsync in the backend code are careful enough to check
the returned status, sometimes doing retries like in mdsync, so what is
proposed here would be a regression.


I covered this in my original post.

Yes, we check the return value. But what do we do about it? For fsyncs of heap files, we ERROR, aborting the checkpoint. We'll retry the checkpoint later, which will retry the fsync(). **Which will now appear to succeed** because the kernel forgot that it lost our writes after telling us the first time. So we do check the error code, which returns success, and we complete the checkpoint and move on.

But we only retried the fsync, not the writes before the fsync.

So we lost data. Or rather, we failed to detect that the kernel lost it, so we completed a checkpoint that was bad and should never have been completed.

The problem is that we keep retrying checkpoints *without* repeating the writes leading up to the checkpoint, and retrying fsync.

I don't pretend the kernel behaviour is sane, but we'd better deal with it anyway.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 28 March 2018 at 11:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Craig Ringer <craig@2ndquadrant.com> writes:
> TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as well to avoid similar lost-page-write issues.

It's not necessary on ext3/ext4 with errors=remount-ro, but that's only because the FS stops us dead in our tracks.

I don't pretend it's sane. The kernel behaviour is IMO crazy. If it's going to lose a write, it should at minimum mark the FD as broken so that no further fsync() or anything else can succeed on it, and an app that cares about durability must repeat the whole set of work since the prior successful fsync(). Just reporting it once and forgetting it is madness.

But even if we convince the kernel folks of that, how do other platforms behave? And how long before these kernels are out of use? We'd better deal with it, crazy or no.

Please see my StackOverflow post for the kernel-level explanation. Note also the test case link there. https://stackoverflow.com/a/42436054/398670

> Retrying fsync() is not OK at
> least on Linux. When fsync() returns success it means "all writes since the
> last fsync have hit disk" but we assume it means "all writes since the last
> SUCCESSFUL fsync have hit disk".

If that's actually the case, we need to push back on this kernel brain
damage, because as you're describing it fsync would be completely useless.

It's not useless, it's just telling us something other than what we think it means. The promise it seems to give us is that if it reports an error once, everything *after* that is useless, so we should throw our toys, close and reopen everything, and redo from the last known-good state.

Though as Tomas posted below, it provides rather weaker guarantees than I thought in some other areas too. See that lwn.net article he linked.
 
Moreover, POSIX is entirely clear that successful fsync means all
preceding writes for the file have been completed, full stop, doesn't
matter when they were issued.

I can't find anything that says so to me. Please quote relevant spec.

I'm working from http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html which states that

"The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. The fsync() function shall not return until the system has completed that action or until an error is detected."

My reading is that POSIX does not specify what happens AFTER an error is detected. It doesn't say that error has to be persistent and that subsequent calls must also report the error. It also says:

"If the fsync() function fails, outstanding I/O operations are not guaranteed to have been completed."

but that doesn't clarify matters much either, because it can be read to mean that once there's been an error reported for some IO operations there's no guarantee those operations are ever completed even after a subsequent fsync returns success.

I'm not seeking to defend what the kernel seems to be doing. Rather, saying that we might see similar behaviour on other platforms, crazy or not. I haven't looked past linux yet, though.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Mar 29, 2018 at 6:58 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 28 March 2018 at 11:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Craig Ringer <craig@2ndquadrant.com> writes:
>> > TL;DR: Pg should PANIC on fsync() EIO return.
>>
>> Surely you jest.
>
> No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as
> well to avoid similar lost-page-write issues.

I found your discussion with kernel hacker Jeff Layton at
https://lwn.net/Articles/718734/ in which he said: "The stackoverflow
writeup seems to want a scheme where pages stay dirty after a
writeback failure so that we can try to fsync them again. Note that
that has never been the case in Linux after hard writeback failures,
AFAIK, so programs should definitely not assume that behavior."

The article above that says the same thing a couple of different ways,
ie that writeback failure leaves you with pages that are neither
written to disk successfully nor marked dirty.

If I'm reading various articles correctly, the situation was even
worse before his errseq_t stuff landed.  That fixed cases of
completely unreported writeback failures due to sharing of PG_error
for both writeback and read errors with certain filesystems, but it
doesn't address the clean pages problem.

Yeah, I see why you want to PANIC.

>> Moreover, POSIX is entirely clear that successful fsync means all
>> preceding writes for the file have been completed, full stop, doesn't
>> matter when they were issued.
>
> I can't find anything that says so to me. Please quote relevant spec.
>
> I'm working from
> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html which
> states that
>
> "The fsync() function shall request that all data for the open file
> descriptor named by fildes is to be transferred to the storage device
> associated with the file described by fildes. The nature of the transfer is
> implementation-defined. The fsync() function shall not return until the
> system has completed that action or until an error is detected."
>
> My reading is that POSIX does not specify what happens AFTER an error is
> detected. It doesn't say that error has to be persistent and that subsequent
> calls must also report the error. It also says:

FWIW my reading is the same as Tom's.  It says "all data for the open
file descriptor" without qualification or special treatment after
errors.  Not "some".

> I'm not seeking to defend what the kernel seems to be doing. Rather, saying
> that we might see similar behaviour on other platforms, crazy or not. I
> haven't looked past linux yet, though.

I see no reason to think that any other operating system would behave
that way without strong evidence...  This is openly acknowledged to be
"a mess" and "a surprise" in the Filesystem Summit article.  I am not
really qualified to comment, but from a cursory glance at FreeBSD's
vfs_bio.c I think it's doing what you'd hope for... see the code near
the comment "Failed write, redirty."

-- 
Thomas Munro
http://www.enterprisedb.com


On 29 March 2018 at 20:07, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Thu, Mar 29, 2018 at 6:58 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 28 March 2018 at 11:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Craig Ringer <craig@2ndquadrant.com> writes:
>> > TL;DR: Pg should PANIC on fsync() EIO return.
>>
>> Surely you jest.
>
> No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as
> well to avoid similar lost-page-write issues.

I found your discussion with kernel hacker Jeff Layton at
https://lwn.net/Articles/718734/ in which he said: "The stackoverflow
writeup seems to want a scheme where pages stay dirty after a
writeback failure so that we can try to fsync them again. Note that
that has never been the case in Linux after hard writeback failures,
AFAIK, so programs should definitely not assume that behavior."

The article above that says the same thing a couple of different ways,
ie that writeback failure leaves you with pages that are neither
written to disk successfully nor marked dirty.

If I'm reading various articles correctly, the situation was even
worse before his errseq_t stuff landed.  That fixed cases of
completely unreported writeback failures due to sharing of PG_error
for both writeback and read errors with certain filesystems, but it
doesn't address the clean pages problem.

Yeah, I see why you want to PANIC.

In more ways than one ;)

> I'm not seeking to defend what the kernel seems to be doing. Rather, saying
> that we might see similar behaviour on other platforms, crazy or not. I
> haven't looked past linux yet, though.

I see no reason to think that any other operating system would behave
that way without strong evidence...  This is openly acknowledged to be
"a mess" and "a surprise" in the Filesystem Summit article.  I am not
really qualified to comment, but from a cursory glance at FreeBSD's
vfs_bio.c I think it's doing what you'd hope for... see the code near
the comment "Failed write, redirty."

Ok, that's reassuring, but doesn't help us on the platform the great majority of users deploy on :(

"If on Linux, PANIC"

Hrm.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Mar 29, 2018 at 2:07 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I found your discussion with kernel hacker Jeff Layton at
> https://lwn.net/Articles/718734/ in which he said: "The stackoverflow
> writeup seems to want a scheme where pages stay dirty after a
> writeback failure so that we can try to fsync them again. Note that
> that has never been the case in Linux after hard writeback failures,
> AFAIK, so programs should definitely not assume that behavior."

And a bit below in the same comments, to this question about PG: "So,
what are the options at this point? The assumption was that we can
repeat the fsync (which as you point out is not the case), or shut
down the database and perform recovery from WAL", the same Jeff Layton
seems to agree PANIC is the appropriate response:
"Replaying the WAL synchronously sounds like the simplest approach
when you get an error on fsync. These are uncommon occurrences for the
most part, so having to fall back to slow, synchronous error recovery
modes when this occurs is probably what you want to do.".
And right after, he confirms the errseq_t patches are about always
detecting this, not more:
"The main thing I working on is to better guarantee is that you
actually get an error when this occurs rather than silently corrupting
your data. The circumstances where that can occur require some
corner-cases, but I think we need to make sure that it doesn't occur."

Jeff's comments in the pull request that merged errseq_t are worth
reading as well:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750

> The article above that says the same thing a couple of different ways,
> ie that writeback failure leaves you with pages that are neither
> written to disk successfully nor marked dirty.
>
> If I'm reading various articles correctly, the situation was even
> worse before his errseq_t stuff landed.  That fixed cases of
> completely unreported writeback failures due to sharing of PG_error
> for both writeback and read errors with certain filesystems, but it
> doesn't address the clean pages problem.

Indeed, that's exactly how I read it as well (opinion formed
independently before reading your sentence above). The errseq_t
patches landed in v4.13 by the way, so very recently.

> Yeah, I see why you want to PANIC.

Indeed. Even doing that leaves question marks about all the kernel
versions before v4.13, which at this point is pretty much everything
out there, not even detecting this reliably. This is messy.


On Fri, Mar 30, 2018 at 5:20 AM, Catalin Iacob <iacobcatalin@gmail.com> wrote:
> Jeff's comments in the pull request that merged errseq_t are worth
> reading as well:
>
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750

Wow.  It looks like there may be a separate question of when each
filesystem adopted this new infrastructure?

>> Yeah, I see why you want to PANIC.
>
> Indeed. Even doing that leaves question marks about all the kernel
> versions before v4.13, which at this point is pretty much everything
> out there, not even detecting this reliably. This is messy.

The pre-errseq_t problems are beyond our control.  There's nothing we
can do about that in userspace (except perhaps abandon OS-buffered IO,
a big project).  We just need to be aware that this problem exists in
certain kernel versions and be grateful to Layton for fixing it.

The dropped dirty flag problem is something we can and in my view
should do something about, whatever we might think about that design
choice.  As Andrew Gierth pointed out to me in an off-list chat about
this, by the time you've reached this state, both PostgreSQL's buffer
and the kernel's buffer are clean and might be reused for another
block at any time, so your data might be gone from the known universe
-- we don't even have the option to rewrite our buffers in general.
Recovery is the only option.

Thank you to Craig for chasing this down and +1 for his proposal, on Linux only.

-- 
Thomas Munro
http://www.enterprisedb.com


On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:

> >> Yeah, I see why you want to PANIC.
> >
> > Indeed. Even doing that leaves question marks about all the kernel
> > versions before v4.13, which at this point is pretty much everything
> > out there, not even detecting this reliably. This is messy.

There may still be a way to reliably detect this on older kernel
versions from userspace, but it will be messy in any case. On EIO
errors, the kernel will not restore the dirty page flags, but it
will flip the error flags on the failed pages. One could mmap()
the file in question, obtain the PFNs (via /proc/pid/pagemap)
and enumerate those to match the ones with the error flag switched
on (via /proc/kpageflags). This could serve at least as a detection
mechanism, but one could also further use this info to logically
map the pages that failed IO back to the original file offsets,
and potentially retry IO just for those file ranges that cover
the failed pages. Just an idea, not tested.
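A minimal sketch of the pagemap half of that idea (Linux-specific and illustrative, untested against an actual EIO as noted; the /proc/kpageflags lookup for the ERROR bit additionally needs root, and unprivileged readers see the PFN field zeroed on recent kernels):

```python
import mmap
import os
import struct

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")

def pagemap_entry(vaddr):
    """Return the 64-bit /proc/self/pagemap entry for a virtual
    address.  Per the kernel's pagemap documentation, bit 63 is
    "page present" and bits 0-54 are the PFN (zeroed for
    unprivileged readers on recent kernels)."""
    with open("/proc/self/pagemap", "rb") as f:
        f.seek((vaddr // PAGE_SIZE) * 8)
        (entry,) = struct.unpack("<Q", f.read(8))
    return entry

def pfn_of(vaddr):
    """PFN for a present page, else None.  Given a (privileged) PFN,
    the next step would be reading the matching 8-byte entry in
    /proc/kpageflags and testing its error bit."""
    entry = pagemap_entry(vaddr)
    if not (entry >> 63) & 1:
        return None
    return entry & ((1 << 55) - 1)
```

One would mmap() the suspect file, walk its pages through pfn_of(), and match the resulting PFNs against the flagged entries in /proc/kpageflags to recover which file ranges failed IO.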

Best regards,
Anthony




On 31 March 2018 at 21:24, Anthony Iliopoulos <ailiop@altatus.com> wrote:
On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:

> >> Yeah, I see why you want to PANIC.
> >
> > Indeed. Even doing that leaves question marks about all the kernel
> > versions before v4.13, which at this point is pretty much everything
> > out there, not even detecting this reliably. This is messy.

There may still be a way to reliably detect this on older kernel
versions from userspace, but it will be messy in any case. On EIO
errors, the kernel will not restore the dirty page flags, but it
will flip the error flags on the failed pages. One could mmap()
the file in question, obtain the PFNs (via /proc/pid/pagemap)
and enumerate those to match the ones with the error flag switched
on (via /proc/kpageflags). This could serve at least as a detection
mechanism, but one could also further use this info to logically
map the pages that failed IO back to the original file offsets,
and potentially retry IO just for those file ranges that cover
the failed pages. Just an idea, not tested.

That sounds like a huge amount of complexity, with uncertainty as to how it'll behave kernel-to-kernel, for negligible benefit.

I was exploring the idea of doing selective recovery of one relfilenode, based on the assumption that we know the filenode related to the fd that failed to fsync(). We could redo only WAL on that relation. But it fails the same test: it's too complex for a niche case that shouldn't happen in the first place, so it'll probably have bugs, or grow bugs in bitrot over time.

Remember, if you're on ext4 with errors=remount-ro, you get shut down even harder than a PANIC. So we should just use the big hammer here.

I'll send a patch this week.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Craig Ringer <craig@2ndquadrant.com> writes:
> So we should just use the big hammer here.

And bitch, loudly and publicly, about how broken this kernel behavior is.
If we make enough of a stink maybe it'll get fixed.

            regards, tom lane


On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:
> Craig Ringer <craig@2ndquadrant.com> writes:
>> So we should just use the big hammer here.
>
> And bitch, loudly and publicly, about how broken this kernel behavior is.
> If we make enough of a stink maybe it'll get fixed.

That won't fix anything released already, so as per the information
gathered something has to be done anyway.  The discussion in this thread
is actually spreading quite widely.

Handling things at a low-level looks like a better plan for the backend.
Tools like pg_basebackup and pg_dump also issue fsync's on the data
created, we should do an equivalent for them, with some exit() calls in
file_utils.c.  As of now failures are logged to stderr but not
considered fatal.
--
Michael

On Sun, Apr 01, 2018 at 12:13:09AM +0800, Craig Ringer wrote:
>    On 31 March 2018 at 21:24, Anthony Iliopoulos <ailiop@altatus.com>
>    wrote:
> 
>      On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:
> 
>      > >> Yeah, I see why you want to PANIC.
>      > >
>      > > Indeed. Even doing that leaves question marks about all the kernel
>      > > versions before v4.13, which at this point is pretty much everything
>      > > out there, not even detecting this reliably. This is messy.
> 
>      There may still be a way to reliably detect this on older kernel
>      versions from userspace, but it will be messy nonetheless. On EIO
>      errors, the kernel will not restore the dirty page flags, but it
>      will flip the error flags on the failed pages. One could mmap()
>      the file in question, obtain the PFNs (via /proc/pid/pagemap)
>      and enumerate those to match the ones with the error flag switched
>      on (via /proc/kpageflags). This could serve at least as a detection
>      mechanism, but one could also further use this info to logically
>      map the pages that failed IO back to the original file offsets,
>      and potentially retry IO just for those file ranges that cover
>      the failed pages. Just an idea, not tested.
> 
>    That sounds like a huge amount of complexity, with uncertainty as to how
>    it'll behave kernel-to-kernel, for negligible benefit.

Those interfaces have been around since the kernel 2.6 times and are
rather stable, but I was merely responding to your original post comment
regarding having a way of finding out which page(s) failed. I assume
that indeed there would be no benefit, especially since those errors
are usually not transient (typically they come from hard medium faults),
and although a filesystem could theoretically mask the error by allocating
a different logical block, I am not aware of any implementation that
currently does that.

>    I was exploring the idea of doing selective recovery of one relfilenode,
>    based on the assumption that we know the filenode related to the fd that
>    failed to fsync(). We could redo only WAL on that relation. But it fails
>    the same test: it's too complex for a niche case that shouldn't happen in
>    the first place, so it'll probably have bugs, or grow bugs in bitrot over
>    time.

Fully agree, those cases should be sufficiently rare that a complex
and possibly non-maintainable solution is not really warranted.

>    Remember, if you're on ext4 with errors=remount-ro, you get shut down even
>    harder than a PANIC. So we should just use the big hammer here.

I am not entirely sure what you mean here - does Pg really treat write()
errors as fatal? Also, the kinds of errors that ext4 detects with this
option are at the superblock level and govern metadata rather than actual
data writes (recall that those are buffered anyway; no actual device IO
has to take place at the time of write()).

Best regards,
Anthony


On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:
> Craig Ringer <craig@2ndquadrant.com> writes:
> > So we should just use the big hammer here.
>
> And bitch, loudly and publicly, about how broken this kernel behavior is.
> If we make enough of a stink maybe it'll get fixed.

It is not likely to be fixed (beyond what has been done already with the
manpage patches and errseq_t fixes on the reporting level). The issue is,
the kernel needs to deal with hard IO errors at that level somehow, and
since those errors typically persist, re-dirtying the pages would not
really solve the problem (unless some filesystem remaps the request to a
different block, assuming the device is alive). Keeping around dirty
pages that cannot possibly be written out is essentially a memory leak,
as those pages would stay around even after the application has exited.

Best regards,
Anthony


On Fri, Mar 30, 2018 at 10:18 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> ... on Linux only.

Apparently I was too optimistic.  I had looked only at FreeBSD, which
keeps the page around and dirties it so we can retry, but the other
BSDs apparently don't (FreeBSD changed that in 1999).  From what I can
tell from the sources below, we have:

Linux, OpenBSD, NetBSD: retrying fsync() after EIO lies
FreeBSD, Illumos: retrying fsync() after EIO tells the truth

Maybe my drive-by assessment of those kernel routines is wrong and
someone will correct me, but I'm starting to think you might be better
to assume the worst on all systems.  Perhaps a GUC that defaults to
panicking, so that users on those rare OSes could turn that off?  Even
then I'm not sure if the failure mode will be that great anyway or if
it's worth having two behaviours.  Thoughts?

http://mail-index.netbsd.org/netbsd-users/2018/03/30/msg020576.html
https://github.com/NetBSD/src/blob/trunk/sys/kern/vfs_bio.c#L1059
https://github.com/openbsd/src/blob/master/sys/kern/vfs_bio.c#L867
https://github.com/freebsd/freebsd/blob/master/sys/kern/vfs_bio.c#L2631
https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266
https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/os/bio.c#L441

-- 
Thomas Munro
http://www.enterprisedb.com


On 2 April 2018 at 02:24, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
 

> Maybe my drive-by assessment of those kernel routines is wrong and
> someone will correct me, but I'm starting to think you might be better
> to assume the worst on all systems.  Perhaps a GUC that defaults to
> panicking, so that users on those rare OSes could turn that off?  Even
> then I'm not sure if the failure mode will be that great anyway or if
> it's worth having two behaviours.  Thoughts?


I see little benefit to not just PANICing unconditionally on EIO, really. It shouldn't happen, and if it does, we want to be pretty conservative and adopt a data-protective approach.

I'm rather more worried by doing it on ENOSPC. Which looks like it might be necessary from what I recall finding in my test case + kernel code reading. I really don't want to respond to a possibly-transient ENOSPC by PANICing the whole server unnecessarily.

BTW, the support team at 2ndQ is presently working on two separate issues where ENOSPC resulted in DB corruption, though neither of them involve logs of lost page writes. I'm planning on taking some time tomorrow to write a torture tester for Pg's ENOSPC handling and to verify ENOSPC handling in the test case I linked to in my original StackOverflow post.

If this is just an EIO issue then I see no point doing anything other than PANICing unconditionally.

If it's a concern for ENOSPC too, we should try harder to fail more nicely whenever we possibly can.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Hi,

On 2018-04-01 03:14:46 +0200, Anthony Iliopoulos wrote:
> On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:
> > Craig Ringer <craig@2ndquadrant.com> writes:
> > > So we should just use the big hammer here.
> >
> > And bitch, loudly and publicly, about how broken this kernel behavior is.
> > If we make enough of a stink maybe it'll get fixed.
> 
> It is not likely to be fixed (beyond what has been done already with the
> manpage patches and errseq_t fixes on the reporting level). The issue is,
> the kernel needs to deal with hard IO errors at that level somehow, and
> since those errors typically persist, re-dirtying the pages would not
> really solve the problem (unless some filesystem remaps the request to a
> different block, assuming the device is alive).

Throwing away the dirty pages *and* persisting the error seems a lot
more reasonable. Then provide a fcntl (or whatever) extension that can
clear the error status, for the few cases where an application wants to
deal with the failure gracefully.


> Keeping around dirty
> pages that cannot possibly be written out is essentially a memory leak,
> as those pages would stay around even after the application has exited.

Why do dirty pages need to be kept around in the case of persistent
errors? I don't think the lack of automatic recovery in that case is
what anybody is complaining about. It's that the error goes away and
there's no reasonable way to separate out such an error from some
potential transient errors.

Greetings,

Andres Freund


On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2018-04-01 03:14:46 +0200, Anthony Iliopoulos wrote:
> > On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:
> > > Craig Ringer <craig@2ndquadrant.com> writes:
> > > > So we should just use the big hammer here.
> > >
> > > And bitch, loudly and publicly, about how broken this kernel behavior is.
> > > If we make enough of a stink maybe it'll get fixed.
> > 
> > It is not likely to be fixed (beyond what has been done already with the
> > manpage patches and errseq_t fixes on the reporting level). The issue is,
> > the kernel needs to deal with hard IO errors at that level somehow, and
> > since those errors typically persist, re-dirtying the pages would not
> > really solve the problem (unless some filesystem remaps the request to a
> > different block, assuming the device is alive).
> 
> Throwing away the dirty pages *and* persisting the error seems a lot
> more reasonable. Then provide a fcntl (or whatever) extension that can
> clear the error status in the few cases that the application that wants
> to gracefully deal with the case.

Given precisely that the dirty pages which cannot be written out are
effectively thrown away, the semantics of fsync() (after the 4.13 fixes)
are essentially correct: the first call indicates that a writeback error
indeed occurred, while subsequent calls have no reason to indicate an error
(assuming no other errors occurred in the meantime).

The error reporting is thus consistent with the intended semantics (which
are sadly not properly documented). Repeated calls to fsync() simply do not
imply that the kernel will retry writing back the previously-failed pages,
so the application needs to be aware of that. Persisting the error at the
fsync() level would essentially mean moving application policy into the
kernel.

Best regards,
Anthony


On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:
> On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:
> > Throwing away the dirty pages *and* persisting the error seems a lot
> > more reasonable. Then provide a fcntl (or whatever) extension that can
> > clear the error status in the few cases that the application that wants
> > to gracefully deal with the case.
> 
> Given precisely that the dirty pages which cannot been written-out are
> practically thrown away, the semantics of fsync() (after the 4.13 fixes)
> are essentially correct: the first call indicates that a writeback error
> indeed occurred, while subsequent calls have no reason to indicate an error
> (assuming no other errors occurred in the meantime).

Meh^2.

"no reason" - except that there's absolutely no way to know what state
the data is in. And that your application needs explicit handling of
such failures. And that one FD might be used in lots of different
parts of the application, where an fsync failure might be acceptable in
one part and not in another.  Requiring explicit action
to acknowledge "we've thrown away your data for unknown reasons" seems
entirely reasonable.


> The error reporting is thus consistent with the intended semantics (which
> are sadly not properly documented). Repeated calls to fsync() simply do not
> imply that the kernel will retry to writeback the previously-failed pages,
> so the application needs to be aware of that.

Which isn't what I've suggested.


> Persisting the error at the fsync() level would essentially mean
> moving application policy into the kernel.

Meh.

Greetings,

Andres Freund


On Mon, Apr 02, 2018 at 12:32:45PM -0700, Andres Freund wrote:
> On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:
> > On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:
> > > Throwing away the dirty pages *and* persisting the error seems a lot
> > > more reasonable. Then provide a fcntl (or whatever) extension that can
> > > clear the error status in the few cases that the application that wants
> > > to gracefully deal with the case.
> > 
> > Given precisely that the dirty pages which cannot been written-out are
> > practically thrown away, the semantics of fsync() (after the 4.13 fixes)
> > are essentially correct: the first call indicates that a writeback error
> > indeed occurred, while subsequent calls have no reason to indicate an error
> > (assuming no other errors occurred in the meantime).
> 
> Meh^2.
> 
> "no reason" - except that there's absolutely no way to know what state
> the data is in. And that your application needs explicit handling of
> such failures. And that one FD might be used in a lots of different
> parts of the application, that fsyncs in one part of the application
> might be an ok failure, and in another not.  Requiring explicit actions
> to acknowledge "we've thrown away your data for unknown reason" seems
> entirely reasonable.

As long as fsync() indicates an error on its first invocation, the
application is fully aware that data has been lost between that point in
time and the last call to fsync(). Persisting this error any further does
not change this or add any new information - on the contrary, it adds
confusion, as subsequent write()s and fsync()s of other pages can succeed
but would be reported as failures.

The application will need to deal with that first error irrespective of
subsequent return codes from fsync(). Conceptually, every fsync()
invocation demarcates an epoch for which it reports potential errors, so
the caller needs to take responsibility for that particular epoch.

Callers that are not affected by the potential outcome of fsync() and do
not react to errors have no reason to call it in the first place (and
thus mask failures from subsequent callers that may indeed care).

Best regards,
Anthony


Greetings,

* Anthony Iliopoulos (ailiop@altatus.com) wrote:
> On Mon, Apr 02, 2018 at 12:32:45PM -0700, Andres Freund wrote:
> > On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:
> > > On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:
> > > > Throwing away the dirty pages *and* persisting the error seems a lot
> > > > more reasonable. Then provide a fcntl (or whatever) extension that can
> > > > clear the error status in the few cases that the application that wants
> > > > to gracefully deal with the case.
> > >
> > > Given precisely that the dirty pages which cannot been written-out are
> > > practically thrown away, the semantics of fsync() (after the 4.13 fixes)
> > > are essentially correct: the first call indicates that a writeback error
> > > indeed occurred, while subsequent calls have no reason to indicate an error
> > > (assuming no other errors occurred in the meantime).
> >
> > Meh^2.
> >
> > "no reason" - except that there's absolutely no way to know what state
> > the data is in. And that your application needs explicit handling of
> > such failures. And that one FD might be used in a lots of different
> > parts of the application, that fsyncs in one part of the application
> > might be an ok failure, and in another not.  Requiring explicit actions
> > to acknowledge "we've thrown away your data for unknown reason" seems
> > entirely reasonable.
>
> As long as fsync() indicates error on first invocation, the application
> is fully aware that between this point of time and the last call to fsync()
> data has been lost. Persisting this error any further does not change this
> or add any new info - on the contrary it adds confusion as subsequent write()s
> and fsync()s on other pages can succeed, but will be reported as failures.

fsync() doesn't reflect the status of given pages, however, it reflects
the status of the file descriptor, and as such the file, on which it's
called.  This notion that fsync() is actually only responsible for the
changes which were made to a file since the last fsync() call is pure
foolishness.  If we were able to pass a list of pages or data ranges to
fsync() for it to verify they're on disk then perhaps things would be
different, but we can't, all we can do is ask to "please flush all the
dirty pages associated with this file descriptor, which represents this
file we opened, to disk, and let us know if you were successful."

Give us a way to ask "are these specific pages written out to persistent
storage?" and we would certainly be happy to use it, and to repeatedly
try to flush out pages which weren't synced to disk due to some
transient error, and to track those cases and make sure that we don't
incorrectly assume that they've been transferred to persistent storage.

> The application will need to deal with that first error irrespective of
> subsequent return codes from fsync(). Conceptually every fsync() invocation
> demarcates an epoch for which it reports potential errors, so the caller
> needs to take responsibility for that particular epoch.

We do deal with that error- by realizing that it failed and later
*retrying* the fsync(), which is when we get back an "all good!
everything with this file descriptor you've opened is sync'd!" and
happily expect that to be truth, when, in reality, it's an unfortunate
lie and there are still pages associated with that file descriptor which
are, in reality, dirty and not sync'd to disk.

Consider two independent programs where the first one writes to a file
and then calls the second one whose job it is to go out and fsync(),
perhaps async from the first, those files.  Is the second program
supposed to go write to each page that the first one wrote to, in order
to ensure that all the dirty bits are set, so that the fsync() will
actually report whether all the dirty pages were written?

> Callers that are not affected by the potential outcome of fsync() and
> do not react on errors, have no reason for calling it in the first place
> (and thus masking failure from subsequent callers that may indeed care).

Reacting on an error from an fsync() call could, based on how it's
documented and actually implemented in other OS's, mean "run another
fsync() to see if the error has resolved itself."  Requiring that to
mean "you have to go dirty all of the pages you previously dirtied to
actually get a subsequent fsync() to do anything" is really just not
reasonable- a given program may have no idea what was written to
previously nor any particular reason to need to know, on the expectation
that the fsync() call will flush any dirty pages, as it's documented to
do.

Thanks!

Stephen

Hi Stephen,

On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote:
>
> fsync() doesn't reflect the status of given pages, however, it reflects
> the status of the file descriptor, and as such the file, on which it's
> called.  This notion that fsync() is actually only responsible for the
> changes which were made to a file since the last fsync() call is pure
> foolishness.  If we were able to pass a list of pages or data ranges to
> fsync() for it to verify they're on disk then perhaps things would be
> different, but we can't, all we can do is ask to "please flush all the
> dirty pages associated with this file descriptor, which represents this
> file we opened, to disk, and let us know if you were successful."
>
> Give us a way to ask "are these specific pages written out to persistent
> storage?" and we would certainly be happy to use it, and to repeatedly
> try to flush out pages which weren't synced to disk due to some
> transient error, and to track those cases and make sure that we don't
> incorrectly assume that they've been transferred to persistent storage.

Indeed fsync() is simply a rather blunt instrument and a narrow legacy
interface but further changing its established semantics (no matter how
unreasonable they may be) is probably not the way to go.

Would using sync_file_range() be helpful? Potential errors would only
apply to pages that cover the requested file ranges. There are a few
caveats though:

(a) it still messes with the top-level error reporting so mixing it
with callers that use fsync() and do care about errors will produce
the same issue (clearing the error status).

(b) the error-reporting granularity is coarse (failure reporting applies
to the entire requested range so you still don't know which particular
pages/file sub-ranges failed writeback)

(c) the same "report and forget" semantics apply to repeated invocations
of the sync_file_range() call, so again action will need to be taken
upon first error encountered for the particular ranges.

> > The application will need to deal with that first error irrespective of
> > subsequent return codes from fsync(). Conceptually every fsync() invocation
> > demarcates an epoch for which it reports potential errors, so the caller
> > needs to take responsibility for that particular epoch.
> 
> We do deal with that error- by realizing that it failed and later
> *retrying* the fsync(), which is when we get back an "all good!
> everything with this file descriptor you've opened is sync'd!" and
> happily expect that to be truth, when, in reality, it's an unfortunate
> lie and there are still pages associated with that file descriptor which
> are, in reality, dirty and not sync'd to disk.

It really turns out that this is not how the fsync() semantics work
though, exactly because of the nature of the errors: even if the kernel
retained the dirty bits on the failed pages, retrying to persist them
to the same disk location would simply fail. Instead the kernel opts
for marking those pages clean (since there is no other recovery
strategy), and reporting once to the caller, who can potentially deal
with it in some manner. It is sadly a bad and undocumented convention.

> Consider two independent programs where the first one writes to a file
> and then calls the second one whose job it is to go out and fsync(),
> perhaps async from the first, those files.  Is the second program
> supposed to go write to each page that the first one wrote to, in order
> to ensure that all the dirty bits are set so that the fsync() will
> actually return if all the dirty pages are written?

I think what you have in mind are the semantics of sync() rather
than fsync(), but as long as an application needs to ensure data
are persisted to storage, it needs to retain those data in its heap
until fsync() is successful instead of discarding them and relying
on the kernel after write(). The pattern should be roughly like:
write() -> fsync() -> free(), rather than write() -> free() -> fsync().
For example, if a partition gets full upon fsync(), then the application
has a chance to persist the data in a different location, while
the kernel cannot possibly make this decision and recover.

> > Callers that are not affected by the potential outcome of fsync() and
> > do not react on errors, have no reason for calling it in the first place
> > (and thus masking failure from subsequent callers that may indeed care).
> 
> Reacting on an error from an fsync() call could, based on how it's
> documented and actually implemented in other OS's, mean "run another
> fsync() to see if the error has resolved itself."  Requiring that to
> mean "you have to go dirty all of the pages you previously dirtied to
> actually get a subsequent fsync() to do anything" is really just not
> reasonable- a given program may have no idea what was written to
> previously nor any particular reason to need to know, on the expectation
> that the fsync() call will flush any dirty pages, as it's documented to
> do.

I think we are conflating a few issues here: having the OS kernel be
responsible for error recovery (so that a subsequent fsync() would fix
the problems) is one. That is clearly a design which most kernels have
not adopted, for reasons outlined above (although having the FS layer
transparently recover from hard errors seems to be open for discussion
[1]). Then there is the issue of granularity of error reporting:
userspace could benefit from a fine-grained indication of failed pages
(or file ranges). Another issue is that of reporting semantics (report
and clear), which is also a design choice made to avoid having
higher-resolution error tracking and the corresponding memory
overheads [1].

Best regards,
Anthony

[1] https://lwn.net/Articles/718734/


On 2018-04-03 01:05:44 +0200, Anthony Iliopoulos wrote:
> Would using sync_file_range() be helpful? Potential errors would only
> apply to pages that cover the requested file ranges. There are a few
> caveats though:

To quote sync_file_range(2):

   Warning
       This system call is extremely dangerous and should not be used in
       portable programs.  None of these operations writes out the file's
       metadata.  Therefore, unless the application is strictly performing
       overwrites of already-instantiated disk blocks, there are no
       guarantees that the data will be available after a crash.  There is
       no user interface to know if a write is purely an overwrite.  On
       filesystems using copy-on-write semantics (e.g., btrfs) an overwrite
       of existing allocated blocks is impossible.  When writing into
       preallocated space, many filesystems also require calls into the
       block allocator, which this system call does not sync out to disk.
       This system call does not flush disk write caches and thus does not
       provide any data integrity on systems with volatile disk write
       caches.

Given the lack of metadata safety, that seems entirely a no-go.  We use
sfr(2), but only to force the kernel's hand into writing back earlier
without throwing away cache contents.


> > > The application will need to deal with that first error irrespective of
> > > subsequent return codes from fsync(). Conceptually every fsync() invocation
> > > demarcates an epoch for which it reports potential errors, so the caller
> > > needs to take responsibility for that particular epoch.
> > 
> > We do deal with that error- by realizing that it failed and later
> > *retrying* the fsync(), which is when we get back an "all good!
> > everything with this file descriptor you've opened is sync'd!" and
> > happily expect that to be truth, when, in reality, it's an unfortunate
> > lie and there are still pages associated with that file descriptor which
> > are, in reality, dirty and not sync'd to disk.
> 
> It really turns out that this is not how the fsync() semantics work
> though

Except on freebsd and solaris, and perhaps others.


>, exactly because the nature of the errors: even if the kernel
> retained the dirty bits on the failed pages, retrying persisting them
> on the same disk location would simply fail.

That's not guaranteed at all, think NFS.


> Instead the kernel opts for marking those pages clean (since there is
> no other recovery strategy), and reporting once to the caller who can
> potentially deal with it in some manner. It is sadly a bad and
> undocumented convention.

It's broken behaviour justified post facto with the only rationale that
was available, which explains why it's so unconvincing. You could just
say "this ship has sailed, and it's too onerous to change because xxx"
and this'd be a done deal. But claiming this is reasonable behaviour is
ridiculous.

Again, you could just continue to error for this fd and still throw away
the data.


> > Consider two independent programs where the first one writes to a file
> > and then calls the second one whose job it is to go out and fsync(),
> > perhaps async from the first, those files.  Is the second program
> > supposed to go write to each page that the first one wrote to, in order
> > to ensure that all the dirty bits are set so that the fsync() will
> > actually return if all the dirty pages are written?
> 
> I think what you have in mind are the semantics of sync() rather
> than fsync()

If you open the same file with two fds, write with one, and fsync with
the other, that's definitely supposed to work. And sync() isn't a
realistic replacement in any sort of way, because it's obviously
system-wide and thus entirely and completely unsuitable. Nor does it
have any sort of better error reporting behaviour, does it?

Greetings,

Andres Freund


On 3 April 2018 at 07:05, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> Hi Stephen,
>
> On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote:
> >
> > fsync() doesn't reflect the status of given pages, however, it reflects
> > the status of the file descriptor, and as such the file, on which it's
> > called.  This notion that fsync() is actually only responsible for the
> > changes which were made to a file since the last fsync() call is pure
> > foolishness.  If we were able to pass a list of pages or data ranges to
> > fsync() for it to verify they're on disk then perhaps things would be
> > different, but we can't, all we can do is ask to "please flush all the
> > dirty pages associated with this file descriptor, which represents this
> > file we opened, to disk, and let us know if you were successful."
> >
> > Give us a way to ask "are these specific pages written out to persistent
> > storage?" and we would certainly be happy to use it, and to repeatedly
> > try to flush out pages which weren't synced to disk due to some
> > transient error, and to track those cases and make sure that we don't
> > incorrectly assume that they've been transferred to persistent storage.
>
> Indeed fsync() is simply a rather blunt instrument and a narrow legacy
> interface, but further changing its established semantics (no matter how
> unreasonable they may be) is probably not the way to go.

They're undocumented and extremely surprising semantics that are arguably a violation of the POSIX spec for fsync(), or at least a surprising interpretation of it.

So I don't buy this argument.

> It really turns out that this is not how the fsync() semantics work
> though, exactly because of the nature of the errors: even if the kernel
> retained the dirty bits on the failed pages, retrying persisting them
> on the same disk location would simply fail.

*might* simply fail.

It depends on why the error occurred.
 
I originally identified this behaviour on a multipath system. Multipath defaults to "throw the writes away, nobody really cares anyway" on error. It seems to figure a higher level will retry, or the application will receive the error and retry.

(See no_path_retry in multipath config. AFAICS the default is insanely dangerous and only suitable for specialist apps that understand the quirks; you should use no_path_retry=queue).

> Instead the kernel opts for marking those pages clean (since there is
> no other recovery strategy), and reporting once to the caller who can
> potentially deal with it in some manner. It is sadly a bad and
> undocumented convention.


It could mark the FD.

It's not just undocumented, it's a slightly creative interpretation of the POSIX spec for fsync.
 
> > Consider two independent programs where the first one writes to a file
> > and then calls the second one whose job it is to go out and fsync(),
> > perhaps async from the first, those files.  Is the second program
> > supposed to go write to each page that the first one wrote to, in order
> > to ensure that all the dirty bits are set so that the fsync() will
> > actually return if all the dirty pages are written?
>
> I think what you have in mind are the semantics of sync() rather
> than fsync(), but as long as an application needs to ensure data
> are persisted to storage, it needs to retain those data in its heap
> until fsync() is successful instead of discarding them and relying
> on the kernel after write().

This is almost exactly what we tell application authors using PostgreSQL: the data isn't written until you receive a successful commit confirmation, so you'd better not forget it.

We provide applications with *clear boundaries* so they can know *exactly* what was, and was not, written. I guess the argument from the kernel is the same is true: whatever was written since the last *successful* fsync is potentially lost and must be redone.

But the fsync behaviour is utterly undocumented and dubiously standard.
 
I think we are conflating a few issues here: having the OS kernel be
responsible for error recovery (so that a subsequent fsync() would fix
the problems) is one. This clearly is a design which most kernels have
not really adopted, for reasons outlined above.

[citation needed] 

What do other major platforms do here? The post above suggests it's a bit of a mix of behaviours.

Now, there is the issue of granularity of
error reporting: userspace could benefit from a fine-grained indication
of failed pages (or file ranges).

Yep. I looked at AIO in the hopes that, if we used AIO, we'd be able to map a sync failure back to an individual AIO write.

But it seems AIO just adds more problems and fixes none. Flush behaviour with AIO is, from what I can tell, inconsistent from version to version and generally unhelpful. The kernel should really report such sync failures back to the app via its AIO write mapping, but it seems nothing of the sort happens.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


> On Apr 2, 2018, at 16:27, Craig Ringer <craig@2ndQuadrant.com> wrote:
>
> They're undocumented and extremely surprising semantics that are arguably a violation of the POSIX spec for fsync(), or at least a surprising interpretation of it.

Even accepting that (I personally go with surprising over violation, as
if my vote counted), it is highly unlikely that we will convince every
kernel team to declare "What fools we've been!" and push a change... and
even if they did, PostgreSQL can look forward to many years of running
on kernels with the broken semantics.  Given that, I think the PANIC
option is the soundest one, as unappetizing as it is.

--
-- Christophe Pettus
   xof@thebuild.com




On April 2, 2018 5:03:39 PM PDT, Christophe Pettus <xof@thebuild.com> wrote:
>
>> On Apr 2, 2018, at 16:27, Craig Ringer <craig@2ndQuadrant.com> wrote:
>>
>> They're undocumented and extremely surprising semantics that are
>arguably a violation of the POSIX spec for fsync(), or at least a
>surprising interpretation of it.
>
>Even accepting that (I personally go with surprising over violation, as
>if my vote counted), it is highly unlikely that we will convince every
>kernel team to declare "What fools we've been!" and push a change...
>and even if they did, PostgreSQL can look forward to many years of
>running on kernels with the broken semantics.  Given that, I think the
>PANIC option is the soundest one, as unappetizing as it is.

Don't we pretty much already have agreement in that? And Craig is the main proponent of it?

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


> On Apr 2, 2018, at 17:05, Andres Freund <andres@anarazel.de> wrote:
>
> Don't we pretty much already have agreement in that? And Craig is the main proponent of it?

For sure on the second sentence; the first was not clear to me.

--
-- Christophe Pettus
   xof@thebuild.com



On Mon, Apr 2, 2018 at 5:05 PM, Andres Freund <andres@anarazel.de> wrote:
>>Even accepting that (I personally go with surprising over violation, as
>>if my vote counted), it is highly unlikely that we will convince every
>>kernel team to declare "What fools we've been!" and push a change...
>>and even if they did, PostgreSQL can look forward to many years of
>>running on kernels with the broken semantics.  Given that, I think the
>>PANIC option is the soundest one, as unappetizing as it is.
>
> Don't we pretty much already have agreement in that? And Craig is the main proponent of it?

I wonder how bad it will be in practice if we PANIC. Craig said "This
isn't as bad as it seems because AFAICS fsync only returns EIO in
cases where we should be stopping the world anyway, and many FSes will
do that for us". It would be nice to get more information on that.

-- 
Peter Geoghegan


On Tue, Apr 3, 2018 at 3:03 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> I see little benefit to not just PANICing unconditionally on EIO, really. It
> shouldn't happen, and if it does, we want to be pretty conservative and
> adopt a data-protective approach.
>
> I'm rather more worried by doing it on ENOSPC. Which looks like it might be
> necessary from what I recall finding in my test case + kernel code reading.
> I really don't want to respond to a possibly-transient ENOSPC by PANICing
> the whole server unnecessarily.

Yeah, it'd be nice to give an administrator the chance to free up some
disk space after ENOSPC is reported, and stay up.  Running out of
space really shouldn't take down the database without warning!  The
question is whether the data remains in cache and marked dirty, so
that retrying is a safe option (since it's potentially gone from our
own buffers, so if the OS doesn't have it the only place your
committed data can definitely still be found is the WAL... recovery
time).  Who can tell us?  Do we need a per-filesystem answer?  Delayed
allocation is a somewhat filesystem-specific thing, so maybe.
Interestingly, there don't seem to be many operating systems that can
report ENOSPC from fsync(), based on a quick scan through some
documentation:

POSIX, AIX, HP-UX, FreeBSD, OpenBSD, NetBSD: no
Illumos/Solaris, Linux, macOS: yes

I don't know if macOS really means it or not; it just tells you to see
the errors for read(2) and write(2).  By the way, speaking of macOS, I
was curious to see if the common BSD heritage would show here.  Yeah,
somewhat.  It doesn't appear to keep buffers on writeback error, if
this is the right code[1] (though it could be handling it somewhere
else for all I know).

[1] https://github.com/apple/darwin-xnu/blob/master/bsd/vfs/vfs_bio.c#L2695

-- 
Thomas Munro
http://www.enterprisedb.com


On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> Given precisely that the dirty pages which cannot be written out are
> practically thrown away, the semantics of fsync() (after the 4.13 fixes)
> are essentially correct: the first call indicates that a writeback error
> indeed occurred, while subsequent calls have no reason to indicate an error
> (assuming no other errors occurred in the meantime).

Like other people here, I think this is 100% unreasonable, starting
with "the dirty pages which cannot been written out are practically
thrown away".  Who decided that was OK, and on the basis of what
wording in what specification?  I think it's always unreasonable to
throw away the user's data.  If the writes are going to fail, then let
them keep on failing every time.  *That* wouldn't cause any data loss,
because we'd never be able to checkpoint, and eventually the user
would have to kill the server uncleanly, and that would trigger
recovery.

Also, this really does make it impossible to write reliable programs.
Imagine that, while the server is running, somebody runs a program
which opens a file in the data directory, calls fsync() on it, and
closes it.  If the fsync() fails, postgres is now borked and has no
way of being aware of the problem.  If we knew, we could PANIC, but
we'll never find out, because the unrelated process ate the error.
This is exactly the sort of ill-considered behavior that makes fcntl()
locking nearly useless.

Even leaving that aside, a PANIC means a prolonged outage on a
production system - it could easily take tens of minutes or longer to
run recovery.  So saying "oh, just do that" is not really an answer.
Sure, we can do it, but it's like trying to lose weight by
intentionally eating a tapeworm.  Now, it's possible to shorten the
checkpoint_timeout so that recovery runs faster, but then performance
drops because data has to be fsync()'d more often instead of getting
buffered in the OS cache for the maximum possible time.  We could also
dodge this issue in another way: suppose that when we write a page
out, we don't consider it really written until fsync() succeeds.  Then
we wouldn't need to PANIC if an fsync() fails; we could just re-write
the page.  Unfortunately, this would also be terrible for performance,
for pretty much the same reasons: letting the OS cache absorb lots of
dirty blocks and do write-combining is *necessary* for good
performance.

> The error reporting is thus consistent with the intended semantics (which
> are sadly not properly documented). Repeated calls to fsync() simply do not
> imply that the kernel will retry to writeback the previously-failed pages,
> so the application needs to be aware of that. Persisting the error at the
> fsync() level would essentially mean moving application policy into the
> kernel.

I might accept this argument if I accepted that it was OK to decide
that an fsync() failure means you can forget that the write() ever
happened in the first place, but it's hard to imagine an application
that wants that behavior.  If the application didn't care about
whether the bytes really got to disk or not, it would not have called
fsync() in the first place.  If it does care, reporting the error only
once is never an improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Mon, Apr 2, 2018 at 7:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Also, this really does make it impossible to write reliable programs.
> Imagine that, while the server is running, somebody runs a program
> which opens a file in the data directory, calls fsync() on it, and
> closes it.  If the fsync() fails, postgres is now borked and has no
> way of being aware of the problem.  If we knew, we could PANIC, but
> we'll never find out, because the unrelated process ate the error.
> This is exactly the sort of ill-considered behavior that makes fcntl()
> locking nearly useless.

I fear that the conventional wisdom from the Kernel people is now "you
should be using O_DIRECT for granular control".  The LWN article
Thomas linked (https://lwn.net/Articles/718734) cites Ted Ts'o:

"Monakhov asked why a counter was needed; Layton said it was to handle
multiple overlapping writebacks. Effectively, the counter would record
whether a writeback had failed since the file was opened or since the
last fsync(). Ts'o said that should be fine; applications that want
more information should use O_DIRECT. For most applications, knowledge
that an error occurred somewhere in the file is all that is necessary;
applications that require better granularity already use O_DIRECT."

-- 
Peter Geoghegan


Hi Robert,

On Mon, Apr 02, 2018 at 10:54:26PM -0400, Robert Haas wrote:
> On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> > Given precisely that the dirty pages which cannot be written out are
> > practically thrown away, the semantics of fsync() (after the 4.13 fixes)
> > are essentially correct: the first call indicates that a writeback error
> > indeed occurred, while subsequent calls have no reason to indicate an error
> > (assuming no other errors occurred in the meantime).
> 
> Like other people here, I think this is 100% unreasonable, starting
> with "the dirty pages which cannot been written out are practically
> thrown away".  Who decided that was OK, and on the basis of what
> wording in what specification?  I think it's always unreasonable to

If you insist on strict conformance to POSIX, indeed the linux
glibc configuration and associated manpage are probably wrong in
stating that _POSIX_SYNCHRONIZED_IO is supported. The implementation
matches the flexibility allowed by not supporting SIO.
There's a long history of brokenness between linux and posix,
and I think there was never an intention of conforming to the
standard.

> throw away the user's data.  If the writes are going to fail, then let
> them keep on failing every time.  *That* wouldn't cause any data loss,
> because we'd never be able to checkpoint, and eventually the user
> would have to kill the server uncleanly, and that would trigger
> recovery.

I believe (as I tried to explain earlier) there is a certain assumption
being made that the writer and original owner of data is responsible
for dealing with potential errors in order to avoid data loss (which
should be only of interest to the original writer anyway). It would
be very questionable for the interface to persist the error while
subsequent writes and fsyncs to different offsets may as well go through.
Another process may need to write into the file and fsync; being
unaware of those newly introduced semantics, it is now faced with EIO
because some unrelated previous process failed some earlier writes
and did not bother to clear the error for those writes. In a similar
scenario where the second process is aware of the new semantics, it would
naturally go ahead and clear the global error in order to proceed
with its own write()+fsync(), which would essentially amount to the
same problematic semantics you have now.

> Also, this really does make it impossible to write reliable programs.
> Imagine that, while the server is running, somebody runs a program
> which opens a file in the data directory, calls fsync() on it, and
> closes it.  If the fsync() fails, postgres is now borked and has no
> way of being aware of the problem.  If we knew, we could PANIC, but
> we'll never find out, because the unrelated process ate the error.
> This is exactly the sort of ill-considered behavior that makes fcntl()
> locking nearly useless.

Fully agree, and the errseq_t fixes have dealt exactly with the issue
of making sure that the error is reported to all file descriptors that
*happen to be open at the time of error*. But I think one would have a
hard time defending a modification to the kernel where this is further
extended to cover cases where:

process A does write() on some file offset which fails writeback,
fsync() gets EIO and exit()s.

process B does write() on some other offset which succeeds writeback,
but fsync() gets EIO due to (uncleared) failures of earlier process.

This would be a highly user-visible change of semantics from edge-
triggered to level-triggered behavior.

> dodge this issue in another way: suppose that when we write a page
> out, we don't consider it really written until fsync() succeeds.  Then

That's the only way to think about fsync() guarantees unless you
are on a kernel that keeps retrying to persist dirty pages. Assuming
such a model, after repeated and unrecoverable hard failures the
process would have to explicitly inform the kernel to drop the dirty
pages. All the process could do at that point is read back to userspace
the dirty/failed pages and attempt to rewrite them at a different place
(which is currently possible too). Most applications would not bother
though to inform the kernel and drop the permanently failed pages;
and thus someone eventually would hit the case that a large amount
of failed writeback pages are running his server out of memory,
at which point people will complain that those semantics are completely
unreasonable.

> we wouldn't need to PANIC if an fsync() fails; we could just re-write
> the page.  Unfortunately, this would also be terrible for performance,
> for pretty much the same reasons: letting the OS cache absorb lots of
> dirty blocks and do write-combining is *necessary* for good
> performance.

Not sure I understand this case. The application may indeed re-write
a bunch of pages that have failed and proceed with fsync(). The kernel
will deal with combining the writeback of all the re-written pages. But
further the necessity of combining for performance really depends on
the exact storage medium. At the point you start caring about
write-combining, the kernel community will naturally redirect you to
use O_DIRECT.

> > The error reporting is thus consistent with the intended semantics (which
> > are sadly not properly documented). Repeated calls to fsync() simply do not
> > imply that the kernel will retry to writeback the previously-failed pages,
> > so the application needs to be aware of that. Persisting the error at the
> > fsync() level would essentially mean moving application policy into the
> > kernel.
> 
> I might accept this argument if I accepted that it was OK to decide
> that an fsync() failure means you can forget that the write() ever
> happened in the first place, but it's hard to imagine an application
> that wants that behavior.  If the application didn't care about
> whether the bytes really got to disk or not, it would not have called
> fsync() in the first place.  If it does care, reporting the error only
> once is never an improvement.

Again, conflating two separate issues, that of buffering and retrying
failed pages and that of error reporting. Yes it would be convenient
for applications not to have to care at all about recovery of failed
write-backs, but at some point they would have to face this issue one
way or another (I am assuming we are always talking about hard failures,
other kinds of failures are probably already being dealt with transparently
at the kernel level).

As for the reporting, it is also unreasonable to effectively signal
and persist an error on a file-wide granularity while it pertains
to subsets of that file and other writes can go through, but I am
repeating myself.

I suppose that if the check-and-clear semantics are problematic for
Pg, one could suggest a kernel patch that opts in to a level-triggered
reporting of fsync() on a per-descriptor basis, which seems to be
non-intrusive and probably sufficient to cover your expected use-case.

Best regards,
Anthony


On 3 April 2018 at 11:35, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> Hi Robert,
>
> Fully agree, and the errseq_t fixes have dealt exactly with the issue
> of making sure that the error is reported to all file descriptors that
> *happen to be open at the time of error*. But I think one would have a
> hard time defending a modification to the kernel where this is further
> extended to cover cases where:
>
> process A does write() on some file offset which fails writeback,
> fsync() gets EIO and exit()s.
>
> process B does write() on some other offset which succeeds writeback,
> but fsync() gets EIO due to (uncleared) failures of earlier process.


Surely that's exactly what process B would want? If it calls fsync and
gets a success and later finds out that the file is corrupt and didn't
match what was in memory it's not going to be happy.

This seems like an attempt to co-opt fsync for a new and different
purpose for which it's poorly designed. It's not an async error
reporting mechanism for writes. It would be useless as that as any
process could come along and open your file and eat the errors for
writes you performed. An async error reporting mechanism would have to
document which writes it was giving errors for and give you ways to
control that.

The semantics described here are useless for everyone. For a program
needing to know the error status of the writes it executed, it doesn't
know which writes are included in which fsync call. For a program
using fsync for its original intended purpose of guaranteeing that
all writes are synced to disk, it no longer has any guarantee at all.


> This would be a highly user-visible change of semantics from edge-
> triggered to level-triggered behavior.

It was always documented as level-triggered. This edge-triggered
concept is a complete surprise to application writers.

-- 
greg


On Tue, Apr 03, 2018 at 12:26:05PM +0100, Greg Stark wrote:
> On 3 April 2018 at 11:35, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> > Hi Robert,
> >
> > Fully agree, and the errseq_t fixes have dealt exactly with the issue
> > of making sure that the error is reported to all file descriptors that
> > *happen to be open at the time of error*. But I think one would have a
> > hard time defending a modification to the kernel where this is further
> > extended to cover cases where:
> >
> > process A does write() on some file offset which fails writeback,
> > fsync() gets EIO and exit()s.
> >
> > process B does write() on some other offset which succeeds writeback,
> > but fsync() gets EIO due to (uncleared) failures of earlier process.
> 
> 
> Surely that's exactly what process B would want? If it calls fsync and
> gets a success and later finds out that the file is corrupt and didn't
> match what was in memory it's not going to be happy.

You can't possibly make this assumption. Process B may be reading
and writing to completely disjoint regions from those of process A,
and as such not really caring about earlier failures, only wanting
to ensure its own writes go all the way through. But even if it did
care, the file interfaces make no transactional guarantees. Even
without fsync() there is nothing preventing process B from reading
dirty pages from process A, and based on their content proceed
to its own business and write/persist new data, while process A
further modifies the not-yet-flushed pages in-memory before flushing.
In this case you'd need explicit synchronization/locking between
the processes anyway, so why would fsync() be an exception?

> This seems like an attempt to co-opt fsync for a new and different
> purpose for which it's poorly designed. It's not an async error
> reporting mechanism for writes. It would be useless as that as any
> process could come along and open your file and eat the errors for
> writes you performed. An async error reporting mechanism would have to
> document which writes it was giving errors for and give you ways to
> control that.

The errseq_t fixes deal with that; errors will be reported to any
process that has an open fd, irrespective to who is the actual caller
of the fsync() that may have induced errors. This is anyway required
as the kernel may evict dirty pages on its own by doing writeback and
as such there needs to be a way to report errors on all open fds.

> The semantics described here are useless for everyone. For a program
> needing to know the error status of the writes it executed, it doesn't
> know which writes are included in which fsync call. For a program

If EIO persists between invocations until explicitly cleared, a process
cannot possibly make any decision as to whether it should clear the error
and proceed or some other process will need to leverage that without
coordination, or which writes actually failed for that matter.
We would be back to the case of requiring explicit synchronization
between processes that care about this, in which case the processes
may as well synchronize over calling fsync() in the first place.

Having an opt-in persisting EIO per-fd would practically be a form
of "contract" between "cooperating" processes anyway.

But instead of deconstructing and debating the semantics of the
current mechanism, why not come up with the ideal desired form of
error reporting/tracking granularity etc., and see how this may be
fitted into kernels as a new interface.

Best regards,
Anthony


On 3 April 2018 at 10:54, Robert Haas <robertmhaas@gmail.com> wrote:
 
I think it's always unreasonable to
throw away the user's data.

Well, we do that. If a txn aborts, all writes in the txn are discarded.

I think that's perfectly reasonable. Though we also promise an all-or-nothing effect, we make exceptions even there.

The FS doesn't offer transactional semantics, but the fsync behaviour can be interpreted kind of similarly.

I don't *agree* with it, but I don't think it's as wholly unreasonable as all that. I think leaving it undocumented is absolutely gobsmacking, and it's dubious at best, but it's not totally insane.
 
If the writes are going to fail, then let
them keep on failing every time. 

Like we do, where we require an explicit rollback.

But POSIX may pose issues there, it doesn't really define any interface for that AFAIK. Unless you expect the app to close() and re-open() the file. Replacing one nonstandard issue with another may not be a win.
 
*That* wouldn't cause any data loss,
because we'd never be able to checkpoint, and eventually the user
would have to kill the server uncleanly, and that would trigger
recovery.

Yep. That's what I expected to happen on unrecoverable I/O errors. Because, y'know, unrecoverable.

I was stunned to learn it's not so. And I'm even more amazed to learn that ext4's errors=remount-ro apparently doesn't concern itself with mere user data, and may exhibit the same behaviour - I need to rerun my test case on it tomorrow.
 
Also, this really does make it impossible to write reliable programs.

In the presence of multiple apps interacting on the same file, yes. I think that's a little bit of a stretch though.

For a single app, you can recover by remembering and redoing all the writes you did.

Sucks if your app wants to have multiple processes working together on a file without some kind of journal or WAL, relying on fsync() alone, mind you. But at least we have WAL.

Hrm. I wonder how this interacts with wal_level=minimal.
 
Even leaving that aside, a PANIC means a prolonged outage on a
production system - it could easily take tens of minutes or longer to
run recovery.  So saying "oh, just do that" is not really an answer.
Sure, we can do it, but it's like trying to lose weight by
intentionally eating a tapeworm.  Now, it's possible to shorten the
checkpoint_timeout so that recovery runs faster, but then performance
drops because data has to be fsync()'d more often instead of getting
buffered in the OS cache for the maximum possible time.

It's also spikier. Users have more issues with latency with short, frequent checkpoints.
 
  We could also
dodge this issue in another way: suppose that when we write a page
out, we don't consider it really written until fsync() succeeds.  Then
we wouldn't need to PANIC if an fsync() fails; we could just re-write
the page.  Unfortunately, this would also be terrible for performance,
for pretty much the same reasons: letting the OS cache absorb lots of
dirty blocks and do write-combining is *necessary* for good
performance.

Our double-caching is already plenty bad enough anyway, as well.

(Ideally I want to be able to swap buffers between shared_buffers and the OS buffer-cache. Almost like a 2nd level of buffer pinning. When we write out a block, we *transfer* ownership to the OS.  Yeah, I'm dreaming. But we'd sure need to be able to trust the OS not to just forget the block then!)
 
> The error reporting is thus consistent with the intended semantics (which
> are sadly not properly documented). Repeated calls to fsync() simply do not
> imply that the kernel will retry to writeback the previously-failed pages,
> so the application needs to be aware of that. Persisting the error at the
> fsync() level would essentially mean moving application policy into the
> kernel.

I might accept this argument if I accepted that it was OK to decide
that an fsync() failure means you can forget that the write() ever
happened in the first place, but it's hard to imagine an application
that wants that behavior.  If the application didn't care about
whether the bytes really got to disk or not, it would not have called
fsync() in the first place.  If it does care, reporting the error only
once is never an improvement.

Many RDBMSes do just that. It's hardly behaviour unique to the kernel. They report an ERROR on a statement in a txn then go on with life, merrily forgetting that anything was ever wrong.

I agree with PostgreSQL's stance that this is wrong. We require an explicit rollback (or ROLLBACK TO SAVEPOINT) to  restore the session to a usable state. This is good.

But we're the odd one out there. Almost everyone else does much like what fsync() does on Linux, report the error and forget it.

In any case, we're not going to get anyone to backpatch a fix for this into all kernels, so we're stuck working around it.

I'll do some testing with ENOSPC tomorrow, propose a patch, report back.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On 3 April 2018 at 14:36, Anthony Iliopoulos <ailiop@altatus.com> wrote:

> If EIO persists between invocations until explicitly cleared, a process
> cannot possibly make any decision as to if it should clear the error

I still don't understand what "clear the error" means here. The writes
still haven't been written out. We don't care about tracking errors,
we just care whether all the writes to the file have been flushed to
disk. By "clear the error" you mean throw away the dirty pages and
revert part of the file to some old data? Why would anyone ever want
that?

> But instead of deconstructing and debating the semantics of the
> current mechanism, why not come up with the ideal desired form of
> error reporting/tracking granularity etc., and see how this may be
> fitted into kernels as a new interface.

Because Postgres is portable software that won't be able to use some
Linux-specific interface. And doesn't really need any granular error
reporting system anyways. It just needs to know when all writes have
been synced to disk.

-- 
greg


On Tue, Apr 03, 2018 at 03:37:30PM +0100, Greg Stark wrote:
> On 3 April 2018 at 14:36, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> 
> > If EIO persists between invocations until explicitly cleared, a process
> > cannot possibly make any decision as to if it should clear the error
> 
> I still don't understand what "clear the error" means here. The writes
> still haven't been written out. We don't care about tracking errors,
> we just care whether all the writes to the file have been flushed to
> disk. By "clear the error" you mean throw away the dirty pages and
> revert part of the file to some old data? Why would anyone ever want
> that?

It means that the responsibility of recovering the data is passed
back to the application. The writes may never be able to be written
out. How would a kernel deal with that? Either discard the data
(and have the writer acknowledge) or buffer the data until reboot
and simply risk going OOM. It's not what someone would want, but
rather *need* to deal with, one way or the other. At least on the
application-level there's a fighting chance for restoring to a
consistent state. The kernel does not have that opportunity.

> > But instead of deconstructing and debating the semantics of the
> > current mechanism, why not come up with the ideal desired form of
> > error reporting/tracking granularity etc., and see how this may be
> > fitted into kernels as a new interface.
> 
> Because Postgres is portable software that won't be able to use some
> Linux-specific interface. And doesn't really need any granular error

I don't really follow this argument; Pg is admittedly using non-portable
interfaces (e.g. sync_file_range()). While it's nice to avoid platform
specific hacks, expecting that the POSIX semantics will be consistent
across systems is simply a 90's pipe dream. While it would be lovely
to have really consistent interfaces for application writers, this is
simply not going to happen any time soon.

And since those problematic semantics of fsync() appear to be prevalent
in other systems as well that are not likely to be changed, you cannot
rely on the preconception that once buffers are handed over to the kernel
you have a guarantee that they will be eventually persisted no matter what.
(Why even bother having fsync() in that case? The kernel would eventually
evict and writeback dirty pages anyway. The point of reporting the error
back to the application is to give it a chance to recover - the kernel
could repeat "fsync()" itself internally if this would solve anything).

> reporting system anyways. It just needs to know when all writes have
> been synced to disk.

Well, it does know when *some* writes have *not* been synced to disk,
exactly because the responsibility is passed back to the application.
I do realize this puts more burden back to the application, but what
would a viable alternative be? Would you rather have a kernel that
risks periodically going OOM due to this design decision?

Best regards,
Anthony


On Tue, Apr 3, 2018 at 6:35 AM, Anthony Iliopoulos <ailiop@altatus.com> wrote:
>> Like other people here, I think this is 100% unreasonable, starting
>> with "the dirty pages which cannot been written out are practically
>> thrown away".  Who decided that was OK, and on the basis of what
>> wording in what specification?  I think it's always unreasonable to
>
> If you insist on strict conformance to POSIX, indeed the linux
> glibc configuration and associated manpage are probably wrong in
> stating that _POSIX_SYNCHRONIZED_IO is supported. The implementation
> matches that of the flexibility allowed by not supporting SIO.
> There's a long history of brokenness between linux and posix,
> and I think there was never an intention of conforming to the
> standard.

Well, then the man page probably shouldn't say CONFORMING TO 4.3BSD,
POSIX.1-2001, which on the first system I tested, it did.  Also, the
summary should be changed from the current "fsync, fdatasync -
synchronize a file's in-core state with storage device" by adding ",
possibly by randomly undoing some of the changes you think you made to
the file".

> I believe (as tried to explain earlier) there is a certain assumption
> being made that the writer and original owner of data is responsible
> for dealing with potential errors in order to avoid data loss (which
> should be only of interest to the original writer anyway). It would
> be very questionable for the interface to persist the error while
> subsequent writes and fsyncs to different offsets may as well go through.

No, that's not questionable at all.  fsync() doesn't take any argument
saying which part of the file you care about, so the kernel is
entirely not entitled to assume it knows to which writes a given
fsync() call was intended to apply.

> Another process may need to write into the file and fsync, while being
> unaware of those newly introduced semantics is now faced with EIO
> because some unrelated previous process failed some earlier writes
> and did not bother to clear the error for those writes. In a similar
> scenario where the second process is aware of the new semantics, it would
> naturally go ahead and clear the global error in order to proceed
> with its own write()+fsync(), which would essentially amount to the
> same problematic semantics you have now.

I don't deny that it's possible that somebody could have an
application which is utterly indifferent to the fact that earlier
modifications to a file failed due to I/O errors, but is A-OK with
that as long as later modifications can be flushed to disk, but I
don't think that's a normal thing to want.

>> Also, this really does make it impossible to write reliable programs.
>> Imagine that, while the server is running, somebody runs a program
>> which opens a file in the data directory, calls fsync() on it, and
>> closes it.  If the fsync() fails, postgres is now borked and has no
>> way of being aware of the problem.  If we knew, we could PANIC, but
>> we'll never find out, because the unrelated process ate the error.
>> This is exactly the sort of ill-considered behavior that makes fcntl()
>> locking nearly useless.
>
> Fully agree, and the errseq_t fixes have dealt exactly with the issue
> of making sure that the error is reported to all file descriptors that
> *happen to be open at the time of error*.

Well, in PostgreSQL, we have a background process called the
checkpointer which is the process that normally does all of the
fsync() calls but only a subset of the write() calls.  The
checkpointer does not, however, necessarily have every file open all
the time, so these fixes aren't sufficient to make sure that the
checkpointer ever sees an fsync() failure.  What you have (or someone
has) basically done here is made an undocumented assumption about
which file descriptors might care about a particular error, but it
just so happens that PostgreSQL has never conformed to that
assumption.  You can keep on saying the problem is with our
assumptions, but it doesn't seem like a very good guess to me to
suppose that we're the only program that has ever made them.  The
documentation for fsync() gives zero indication that it's
edge-triggered, and so complaining that people wouldn't like it if it
became level-triggered seems like an ex post facto justification for a
poorly-chosen behavior: they probably think (as we did prior to a week
ago) that it already is.
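To make that concrete, here is a minimal Python sketch (illustrative only, not PostgreSQL's actual code; `checkpoint_fsync` is a hypothetical name) of the only response that seems safe under edge-triggered, clear-on-report semantics: treat the first fsync() failure as fatal and recover from the WAL, never retry.

```python
import os
import sys
import tempfile

def checkpoint_fsync(fd):
    """Flush fd durably, treating the FIRST failure as fatal.

    Under clear-on-report ("edge-triggered") semantics, a failed fsync()
    may already have marked the dirty pages clean, so a retry can return
    success without the data ever reaching disk.  The safe response is
    to PANIC and recover from the WAL, never to retry.
    """
    try:
        os.fsync(fd)
    except OSError as e:
        # PANIC path: do not retry this fsync().
        sys.stderr.write("PANIC: fsync failed: %s\n" % e)
        raise

# Success-path demo on a scratch file.  Provoking a real EIO needs fault
# injection (e.g. dm-error), which is out of scope for this sketch.
fd, path = tempfile.mkstemp()
os.write(fd, b"checkpointed data")
checkpoint_fsync(fd)   # succeeds here; on failure it raises instead of retrying
os.close(fd)
```

The point of the sketch is what it does *not* do: there is no retry loop around os.fsync().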

> Not sure I understand this case. The application may indeed re-write
> a bunch of pages that have failed and proceed with fsync(). The kernel
> will deal with combining the writeback of all the re-written pages. But
> further the necessity of combining for performance really depends on
> the exact storage medium. At the point you start caring about
> write-combining, the kernel community will naturally redirect you to
> use DIRECT_IO.

Well, the way PostgreSQL works today, we typically run with say 8GB of
shared_buffers even if the system memory is, say, 200GB.  As pages are
evicted from our relatively small cache to the operating system, we
track which files need to be fsync()'d at checkpoint time, but we
don't hold onto the blocks.  Until checkpoint time, the operating
system is left to decide whether it's better to keep caching the dirty
blocks (thus leaving less memory for other things, but possibly
allowing write-combining if the blocks are written again) or whether
it should clean them to make room for other things.  This means that
only a small portion of the operating system memory is directly
managed by PostgreSQL, while allowing the effective size of our cache
to balloon to some very large number if the system isn't under heavy
memory pressure.
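A toy Python sketch of that division of labour (hypothetical names like `evict_page`; PostgreSQL's real bookkeeping is the checkpointer's pending-ops table in C):

```python
import os
import tempfile

# Hypothetical bookkeeping: a set of paths needing fsync() at checkpoint.
pending_fsyncs = set()

def evict_page(path, offset, page):
    """A backend writes an evicted buffer into the OS page cache, closes
    the file, and merely remembers that it needs fsync() later."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.pwrite(fd, page, offset)
    finally:
        os.close(fd)            # dirty data now lives only in kernel cache
    pending_fsyncs.add(path)

def checkpoint():
    """The checkpointer opens each remembered file *later*, with a fresh
    descriptor, and fsyncs it -- relying on fsync() flushing writes made
    through other, already-closed descriptors."""
    for path in sorted(pending_fsyncs):
        fd = os.open(path, os.O_WRONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
    pending_fsyncs.clear()

d = tempfile.mkdtemp()
evict_page(os.path.join(d, "16384"), 0, b"\x00" * 8192)
checkpoint()
```

Note that between evict_page() and checkpoint() nobody in the application holds the file open, which is exactly the window at issue in this thread.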

Now, I hear the DIRECT_IO thing and I assume we're eventually going to
have to go that way: Linux kernel developers seem to think that "real
men use O_DIRECT" and so if other forms of I/O don't provide useful
guarantees, well that's our fault for not using O_DIRECT.  That's a
political reason, not a technical reason, but it's a reason all the
same.

Unfortunately, that is going to add a huge amount of complexity,
because if we ran with shared_buffers set to a large percentage of
system memory, we couldn't allocate large chunks of memory for sorts
and hash tables from the operating system any more.  We'd have to
allocate it from our own shared_buffers because that's basically all
the memory there is and using substantially more might run the system
out entirely.  So it's a huge, huge architectural change.  And even
once it's done it is in some ways inferior to what we are doing today
-- true, it gives us superior control over writeback timing, but it
also makes PostgreSQL play less nicely with other things running on
the same machine, because now PostgreSQL has a dedicated chunk of
whatever size it has, rather than using some portion of the OS buffer
cache that can grow and shrink according to memory needs both of other
parts of PostgreSQL and other applications on the system.

> I suppose that if the check-and-clear semantics are problematic for
> Pg, one could suggest a kernel patch that opts-in to a level-triggered
> reporting of fsync() on a per-descriptor basis, which seems to be
> non-intrusive and probably sufficient to cover your expected use-case.

That would certainly be better than nothing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Tue, Apr 3, 2018 at 1:29 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Interestingly, there don't seem to be many operating systems that can
> report ENOSPC from fsync(), based on a quick scan through some
> documentation:
>
> POSIX, AIX, HP-UX, FreeBSD, OpenBSD, NetBSD: no
> Illumos/Solaris, Linux, macOS: yes

Oops, reading comprehension fail.  POSIX yes (since issue 5), via the
note that read() and write()'s error conditions can also be returned.

-- 
Thomas Munro
http://www.enterprisedb.com


On Tue, Apr  3, 2018 at 05:47:01PM -0400, Robert Haas wrote:
> Well, in PostgreSQL, we have a background process called the
> checkpointer which is the process that normally does all of the
> fsync() calls but only a subset of the write() calls.  The
> checkpointer does not, however, necessarily have every file open all
> the time, so these fixes aren't sufficient to make sure that the
> checkpointer ever sees an fsync() failure.

There has been a lot of focus in this thread on the workflow:

    write() -> blocks remain in kernel memory -> fsync() -> panic?

But what happens in this workflow:

    write() -> kernel syncs blocks to storage -> fsync()

Is fsync() going to see a "kernel syncs blocks to storage" failure?

There was already discussion that if the fsync() causes the "syncs
blocks to storage", fsync() will only report the failure once, but will
it see any failure in the second workflow?  There is indication that a
failed write to storage reports back an error once and clears the dirty
flag, but do we know it keeps things around long enough to report an
error to a future fsync()?

You would think it does, but I have to ask since our fsync() assumptions
have been wrong for so long.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


On Wed, Apr 4, 2018 at 12:56 PM, Bruce Momjian <bruce@momjian.us> wrote:
> There has been a lot of focus in this thread on the workflow:
>
>         write() -> blocks remain in kernel memory -> fsync() -> panic?
>
> But what happens in this workflow:
>
>         write() -> kernel syncs blocks to storage -> fsync()
>
> Is fsync() going to see a "kernel syncs blocks to storage" failure?
>
> There was already discussion that if the fsync() causes the "syncs
> blocks to storage", fsync() will only report the failure once, but will
> it see any failure in the second workflow?  There is indication that a
> failed write to storage reports back an error once and clears the dirty
> flag, but do we know it keeps things around long enough to report an
> error to a future fsync()?
>
> You would think it does, but I have to ask since our fsync() assumptions
> have been wrong for so long.

I believe there were some problems of that nature (with various
twists, based on other concurrent activity and possibly different
fds), and those problems were fixed by the errseq_t system developed
by Jeff Layton in Linux 4.13.  Call that "bug #1".

The second issue is that the pages are marked clean after the error
is reported, so further attempts to fsync() the data (in our case for
a new attempt to checkpoint) will be futile but appear successful.
Call that "bug #2", with the proviso that some people apparently think
it's reasonable behaviour and not a bug.  At least there is a
plausible workaround for that: namely the nuclear option proposed by
Craig.

-- 
Thomas Munro
http://www.enterprisedb.com


On Wed, Apr  4, 2018 at 01:54:50PM +1200, Thomas Munro wrote:
> On Wed, Apr 4, 2018 at 12:56 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > There has been a lot of focus in this thread on the workflow:
> >
> >         write() -> blocks remain in kernel memory -> fsync() -> panic?
> >
> > But what happens in this workflow:
> >
> >         write() -> kernel syncs blocks to storage -> fsync()
> >
> > Is fsync() going to see a "kernel syncs blocks to storage" failure?
> >
> > There was already discussion that if the fsync() causes the "syncs
> > blocks to storage", fsync() will only report the failure once, but will
> > it see any failure in the second workflow?  There is indication that a
> > failed write to storage reports back an error once and clears the dirty
> > flag, but do we know it keeps things around long enough to report an
> > error to a future fsync()?
> >
> > You would think it does, but I have to ask since our fsync() assumptions
> > have been wrong for so long.
> 
> I believe there were some problems of that nature (with various
> twists, based on other concurrent activity and possibly different
> fds), and those problems were fixed by the errseq_t system developed
> by Jeff Layton in Linux 4.13.  Call that "bug #1".

So all our non-cutting-edge Linux systems are vulnerable and there is no
workaround Postgres can implement?  Wow.

> The second issue is that the pages are marked clean after the error
> is reported, so further attempts to fsync() the data (in our case for
> a new attempt to checkpoint) will be futile but appear successful.
> Call that "bug #2", with the proviso that some people apparently think
> it's reasonable behaviour and not a bug.  At least there is a
> plausible workaround for that: namely the nuclear option proposed by
> Craig.

Yes, that one I understood.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


On Tue, Apr  3, 2018 at 10:05:19PM -0400, Bruce Momjian wrote:
> On Wed, Apr  4, 2018 at 01:54:50PM +1200, Thomas Munro wrote:
> > I believe there were some problems of that nature (with various
> > twists, based on other concurrent activity and possibly different
> > fds), and those problems were fixed by the errseq_t system developed
> > by Jeff Layton in Linux 4.13.  Call that "bug #1".
> 
> So all our non-cutting-edge Linux systems are vulnerable and there is no
> workaround Postgres can implement?  Wow.

Uh, are you sure it fixes our use-case?  From the email description it
sounded like it only reported fsync errors for every open file
descriptor at the time of the failure, but the checkpoint process might
open the file _after_ the failure and try to fsync a write that happened
_before_ the failure.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


On 4 April 2018 at 05:47, Robert Haas <robertmhaas@gmail.com> wrote:
 

> Now, I hear the DIRECT_IO thing and I assume we're eventually going to
> have to go that way: Linux kernel developers seem to think that "real
> men use O_DIRECT" and so if other forms of I/O don't provide useful
> guarantees, well that's our fault for not using O_DIRECT.  That's a
> political reason, not a technical reason, but it's a reason all the
> same.

I looked into buffered AIO a while ago, by the way, and just ... hell no. Run, run as fast as you can.

The trouble with direct I/O is that it pushes a _lot_ of work back on PostgreSQL regarding knowledge of the storage subsystem, I/O scheduling, etc. It's absurd to have the kernel do this, unless you want it reliable, in which case you bypass it and drive the hardware directly.

We'd need pools of writer threads to deal with all the blocking I/O. It'd be such a nightmare. Hey, why bother having a kernel at all, except for drivers?

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Tue, Apr  3, 2018 at 10:05:19PM -0400, Bruce Momjian wrote:
>> On Wed, Apr  4, 2018 at 01:54:50PM +1200, Thomas Munro wrote:
>> > I believe there were some problems of that nature (with various
>> > twists, based on other concurrent activity and possibly different
>> > fds), and those problems were fixed by the errseq_t system developed
>> > by Jeff Layton in Linux 4.13.  Call that "bug #1".
>>
>> So all our non-cutting-edge Linux systems are vulnerable and there is no
>> workaround Postgres can implement?  Wow.
>
> Uh, are you sure it fixes our use-case?  From the email description it
> sounded like it only reported fsync errors for every open file
> descriptor at the time of the failure, but the checkpoint process might
> open the file _after_ the failure and try to fsync a write that happened
> _before_ the failure.

I'm not sure of anything.  I can see that it's designed to report
errors since the last fsync() of the *file* (presumably via any fd),
which sounds like the desired behaviour:

https://github.com/torvalds/linux/blob/master/mm/filemap.c#L682

 * When userland calls fsync (or something like nfsd does the equivalent), we
 * want to report any writeback errors that occurred since the last fsync (or
 * since the file was opened if there haven't been any).

But I'm not sure what the lifetime of the passed-in "file" and more
importantly "file->f_wb_err" is.  Specifically, what happens to it if
no one has the file open at all, between operations?  It is reference
counted, see fs/file_table.c.  I don't know enough about it to
comment.

-- 
Thomas Munro
http://www.enterprisedb.com


On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> Uh, are you sure it fixes our use-case?  From the email description it
>> sounded like it only reported fsync errors for every open file
>> descriptor at the time of the failure, but the checkpoint process might
>> open the file _after_ the failure and try to fsync a write that happened
>> _before_ the failure.
>
> I'm not sure of anything.  I can see that it's designed to report
> errors since the last fsync() of the *file* (presumably via any fd),
> which sounds like the desired behaviour:
>
> [..]

Scratch that.  Whenever you open a file descriptor you can't see any
preceding errors at all, because:

/* Ensure that we skip any errors that predate opening of the file */
f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

https://github.com/torvalds/linux/blob/master/fs/open.c#L752

Our whole design is based on being able to open, close and reopen
files at will from any process, and in particular to fsync() from a
different process that didn't inherit the fd but instead opened it
later.  But it looks like that might be able to eat errors that
occurred during asynchronous writeback (when there was nobody to
report them to), before you opened the file?

If so I'm not sure how that can possibly be considered to be an
implementation of _POSIX_SYNCHRONIZED_IO:  "the fsync() function shall
force all currently queued I/O operations associated with the file
indicated by file descriptor fildes to the synchronized I/O completion
state."  Note "the file", not "this file descriptor + copies", and
without reference to when you opened it.
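For clarity, here is the exact syscall sequence whose error visibility is in question, as a runnable Python sketch. No fault is injected (that needs something like dm-error), so step (2) appears only as a comment marking where the hypothetical writeback failure would occur:

```python
import os
import tempfile

_fd, path = tempfile.mkstemp()
os.close(_fd)

# (1) A backend writes and closes; dirty pages stay in the page cache.
fd1 = os.open(path, os.O_WRONLY)
os.write(fd1, b"important data")
os.close(fd1)

# (2) HYPOTHETICAL: asynchronous writeback fails here, while nobody has
#     the file open.  (No fault is injected in this run.)

# (3) The checkpointer opens a fresh descriptor and fsyncs.  Because
#     open() samples wb_err at open time, an error from step (2) would
#     be invisible here and this fsync() would falsely report success.
fd2 = os.open(path, os.O_WRONLY)
os.fsync(fd2)   # legitimately succeeds in this run; the worry is it
os.close(fd2)   # would also "succeed" after a step-(2) failure
```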

> But I'm not sure what the lifetime of the passed-in "file" and more
> importantly "file->f_wb_err" is.

It's really inode->i_mapping->wb_err's lifetime that I should have
been asking about there, not file->f_wb_err, but I see now that that
question is irrelevant due to the above.

-- 
Thomas Munro
http://www.enterprisedb.com


On 4 April 2018 at 13:29, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian <bruce@momjian.us> wrote:
>>> Uh, are you sure it fixes our use-case?  From the email description it
>>> sounded like it only reported fsync errors for every open file
>>> descriptor at the time of the failure, but the checkpoint process might
>>> open the file _after_ the failure and try to fsync a write that happened
>>> _before_ the failure.
>>
>> I'm not sure of anything.  I can see that it's designed to report
>> errors since the last fsync() of the *file* (presumably via any fd),
>> which sounds like the desired behaviour:
>>
>> [..]
>
> Scratch that.  Whenever you open a file descriptor you can't see any
> preceding errors at all, because:
>
> /* Ensure that we skip any errors that predate opening of the file */
> f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
>
> https://github.com/torvalds/linux/blob/master/fs/open.c#L752
>
> Our whole design is based on being able to open, close and reopen
> files at will from any process, and in particular to fsync() from a
> different process that didn't inherit the fd but instead opened it
> later.  But it looks like that might be able to eat errors that
> occurred during asynchronous writeback (when there was nobody to
> report them to), before you opened the file?

Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?

I'll see if I can expand my testcase for that. I'm presently dockerizing it to make it easier for others to use, but that turns out to be a major pain when using devmapper etc. Docker in privileged mode doesn't seem to play nice with device-mapper.

Does that mean that the ONLY ways to do reliable I/O are:

- single-process, single-file-descriptor write() then fsync(); on failure, retry all work since last successful fsync()

or

- direct I/O

?
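Option one might look like the following sketch (assumptions: the application can afford to keep every unsynced write in memory, and a second fsync() failure after the replay is treated as fatal; `ReliableWriter` is a hypothetical name):

```python
import os
import tempfile

class ReliableWriter:
    """Sketch of the single-process pattern: keep every write since the
    last successful fsync(), and on an fsync() error reopen the file and
    replay them all -- never trust a bare fsync() retry."""

    def __init__(self, path):
        self.path = path
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        self.unsynced = []      # list of (offset, data) since last good sync

    def pwrite(self, offset, data):
        os.pwrite(self.fd, data, offset)
        self.unsynced.append((offset, data))

    def sync(self):
        try:
            os.fsync(self.fd)
        except OSError:
            # Assume the kernel dropped the failed pages: reopen and
            # rewrite everything since the last successful sync.
            os.close(self.fd)
            self.fd = os.open(self.path, os.O_WRONLY)
            for offset, data in self.unsynced:
                os.pwrite(self.fd, data, offset)
            os.fsync(self.fd)   # a second failure here would mean PANIC
        self.unsynced = []      # forget data only after a successful sync

_fd, _path = tempfile.mkstemp()
os.close(_fd)
w = ReliableWriter(_path)
w.pwrite(0, b"block 0")
w.sync()
```

For PostgreSQL that memory cost is precisely what the small-shared_buffers design described upthread was meant to avoid.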

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 4 April 2018 at 13:29, Thomas Munro <thomas.munro@enterprisedb.com>
> wrote:
>> /* Ensure that we skip any errors that predate opening of the file */
>> f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
>>
>> [...]
>
> Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel
> will deliberately hide writeback errors that predate our fsync() call from
> us?

Predates the opening of the file by the process that calls fsync().
Yeah, it sure looks that way based on the above code fragment.  Does
anyone know better?

> Does that mean that the ONLY ways to do reliable I/O are:
>
> - single-process, single-file-descriptor write() then fsync(); on failure,
> retry all work since last successful fsync()

I suppose you could come up with some crazy complicated IPC scheme to
make sure that the checkpointer always has an fd older than any writes
to be flushed, with some fallback strategy for when it can't take any
more fds.

I haven't got any good ideas right now.

> - direct I/O

As a bit of an aside, I gather that when you resize files (think
truncating/extending relation files) you still need to call fsync()
even if you read/write all data with O_DIRECT, to make it flush the
filesystem meta-data.  I have no idea if that could also be affected
by eaten writeback errors.
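A sketch of that point (hedged: O_DIRECT support varies by platform and filesystem, so this falls back to buffered I/O where it isn't available): even when the data itself goes direct, extending the file is a metadata change that still needs fsync().

```python
import mmap
import os
import tempfile

_fd, path = tempfile.mkstemp()
os.close(_fd)

# Try O_DIRECT for the data; not every platform/filesystem supports it
# (e.g. tmpfs), so this sketch falls back rather than failing.
direct = getattr(os, "O_DIRECT", 0)
try:
    fd = os.open(path, os.O_WRONLY | direct)
except OSError:
    fd = os.open(path, os.O_WRONLY)

# O_DIRECT needs aligned memory; an anonymous mmap is page-aligned.
buf = mmap.mmap(-1, 4096)
buf.write(b"\x01" * 4096)
try:
    os.pwrite(fd, buf, 0)
except OSError:
    # Alignment/blocksize quirk under O_DIRECT: redo buffered for the sketch.
    os.close(fd)
    fd = os.open(path, os.O_WRONLY)
    os.pwrite(fd, buf, 0)

# Extending the file is a pure *metadata* change: it bypasses O_DIRECT
# entirely, so an fsync() is still required to make it durable -- and is
# therefore still exposed to the error-reporting problems above.
os.ftruncate(fd, 1024 * 1024)
os.fsync(fd)
os.close(fd)
```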

-- 
Thomas Munro
http://www.enterprisedb.com


On 4 April 2018 at 14:00, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 4 April 2018 at 13:29, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian <bruce@momjian.us> wrote:
>>>> Uh, are you sure it fixes our use-case?  From the email description it
>>>> sounded like it only reported fsync errors for every open file
>>>> descriptor at the time of the failure, but the checkpoint process might
>>>> open the file _after_ the failure and try to fsync a write that happened
>>>> _before_ the failure.
>>>
>>> I'm not sure of anything.  I can see that it's designed to report
>>> errors since the last fsync() of the *file* (presumably via any fd),
>>> which sounds like the desired behaviour:
>>>
>>> [..]
>>
>> Scratch that.  Whenever you open a file descriptor you can't see any
>> preceding errors at all, because:
>>
>> /* Ensure that we skip any errors that predate opening of the file */
>> f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
>>
>> https://github.com/torvalds/linux/blob/master/fs/open.c#L752
>>
>> Our whole design is based on being able to open, close and reopen
>> files at will from any process, and in particular to fsync() from a
>> different process that didn't inherit the fd but instead opened it
>> later.  But it looks like that might be able to eat errors that
>> occurred during asynchronous writeback (when there was nobody to
>> report them to), before you opened the file?
>
> Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?
>
> I'll see if I can expand my testcase for that. I'm presently dockerizing it to make it easier for others to use, but that turns out to be a major pain when using devmapper etc. Docker in privileged mode doesn't seem to play nice with device-mapper.



Warning, this runs a Docker container in privileged mode on your system, and it uses devicemapper. Read it before you run it, and while I've tried to keep it safe, beware that it might eat your system.

For now it tests only xfs and EIO. Other FSs should be easy enough.

I haven't added coverage for multi-processing yet, but given what you found above, I should. I'll probably just system() a copy of the same proc with instructions to only fsync(). I'll do that next.

I haven't worked out a reliable way to trigger ENOSPC on fsync() yet, when mapping without the error hole. It happens sometimes but I don't know why, it almost always happens on write() instead. I know it can happen on nfs, but I'm hoping for a saner example than that to test with. ext4 and xfs do delayed allocation but eager reservation so it shouldn't happen to them.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On Wed, Apr  4, 2018 at 07:32:04PM +1200, Thomas Munro wrote:
> On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> > On 4 April 2018 at 13:29, Thomas Munro <thomas.munro@enterprisedb.com>
> > wrote:
> >> /* Ensure that we skip any errors that predate opening of the file */
> >> f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
> >>
> >> [...]
> >
> > Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel
> > will deliberately hide writeback errors that predate our fsync() call from
> > us?
> 
> Predates the opening of the file by the process that calls fsync().
> Yeah, it sure looks that way based on the above code fragment.  Does
> anyone know better?

Uh, just to clarify, what is new here is that it is ignoring any
_errors_ that happened before the open().  It is not ignoring write()s
that happened before the open() but had not yet been written to storage.

FYI, pg_test_fsync has always tested the ability to fsync() write()s
from other processes:

    Test if fsync on non-write file descriptor is honored:
    (If the times are similar, fsync() can sync data written on a different
    descriptor.)
            write, fsync, close                5360.341 ops/sec     187 usecs/op
            write, close, fsync                4785.240 ops/sec     209 usecs/op

Those two numbers should be similar.  I added this as a check to make
sure the behavior we were relying on was working.  I never tested sync
errors though.

I think the fundamental issue is that we always assumed that writes to
the kernel that could not be written to storage would remain in the
kernel until they succeeded, and that fsync() would report their
existence.

I can understand why kernel developers don't want to keep failed sync
buffers in memory, and once they are gone we lose reporting of their
failure.  Also, if the kernel is not going to retry the syncs, how long
should it keep reporting the sync failure?  Until the first fsync() that
happens after the failure?  How long should it continue to record the
failure?  What if no fsync() ever happens, which is likely for
non-Postgres workloads?  I think once they decided to discard failed
syncs and not retry them, the fsync() behavior we are complaining about
was almost required.

Our only option might be to tell administrators to closely watch for
kernel write failure messages, and then restore or failover.  :-(

The last time I remember being this surprised about storage was in the
early Postgres years when we learned that just because the BSD file
system uses 8k pages doesn't mean those are atomically written to
storage.  We knew the operating system wrote the data in 8k chunks to
storage but:

o  the 8k pages are written as separate 512-byte sectors
o  the 8k might be contiguous logically on the drive but not physically
o  even 512-byte sectors are not written atomically

This is why we added full-page images to the WAL, which is what
full_page_writes controls.
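The idea can be sketched in a few lines (a toy model only; the real WAL also logs compact deltas after the first full-page image since a checkpoint):

```python
PAGE = 8192

class MiniWAL:
    """Toy version of full_page_writes: the first change to a page after
    a checkpoint logs the whole 8k image (not a delta), so recovery can
    repair a torn, partially written page by replacing it wholesale."""

    def __init__(self):
        self.records = []      # (page_no, full 8k image)
        self.imaged = set()    # pages already imaged since last checkpoint

    def checkpoint(self):
        self.imaged.clear()    # next touch of each page logs a fresh image

    def log_change(self, page_no, new_image):
        assert len(new_image) == PAGE
        if page_no not in self.imaged:
            self.records.append((page_no, new_image))
            self.imaged.add(page_no)
        # (the real WAL would log later changes as compact deltas)

    def recover(self, storage):
        # A torn page is simply overwritten with its logged full image.
        for page_no, image in self.records:
            storage[page_no] = image

wal = MiniWAL()
storage = {}
good = b"\x07" * PAGE
wal.log_change(0, good)
storage[0] = good[:4096] + b"\x00" * 4096   # simulate a torn 8k write
wal.recover(storage)
```

Because the image is whole, recovery never has to interpret the garbage halves of a torn page.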

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


On Wed, Apr  4, 2018 at 10:40:16AM +0800, Craig Ringer wrote:
> The trouble with direct I/O is that it pushes a _lot_ of work back on
> PostgreSQL regarding knowledge of the storage subsystem, I/O scheduling, etc.
> It's absurd to have the kernel do this, unless you want it reliable, in which
> case you bypass it and drive the hardware directly.
> 
> We'd need pools of writer threads to deal with all the blocking I/O. It'd be
> such a nightmare. Hey, why bother having a kernel at all, except for drivers?

I believe this is how Oracle views the kernel, so there is precedent for
this approach, though I am not advocating it.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


On 4 April 2018 at 15:51, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 4 April 2018 at 14:00, Craig Ringer <craig@2ndquadrant.com> wrote:
>> On 4 April 2018 at 13:29, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>>> On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro
>>> <thomas.munro@enterprisedb.com> wrote:
>>>> On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian <bruce@momjian.us> wrote:
>>>>> Uh, are you sure it fixes our use-case?  From the email description it
>>>>> sounded like it only reported fsync errors for every open file
>>>>> descriptor at the time of the failure, but the checkpoint process might
>>>>> open the file _after_ the failure and try to fsync a write that happened
>>>>> _before_ the failure.
>>>>
>>>> I'm not sure of anything.  I can see that it's designed to report
>>>> errors since the last fsync() of the *file* (presumably via any fd),
>>>> which sounds like the desired behaviour:
>>>>
>>>> [..]
>>>
>>> Scratch that.  Whenever you open a file descriptor you can't see any
>>> preceding errors at all, because:
>>>
>>> /* Ensure that we skip any errors that predate opening of the file */
>>> f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
>>>
>>> https://github.com/torvalds/linux/blob/master/fs/open.c#L752
>>>
>>> Our whole design is based on being able to open, close and reopen
>>> files at will from any process, and in particular to fsync() from a
>>> different process that didn't inherit the fd but instead opened it
>>> later.  But it looks like that might be able to eat errors that
>>> occurred during asynchronous writeback (when there was nobody to
>>> report them to), before you opened the file?
>>
>> Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?
>>
>> I'll see if I can expand my testcase for that. I'm presently dockerizing it to make it easier for others to use, but that turns out to be a major pain when using devmapper etc. Docker in privileged mode doesn't seem to play nice with device-mapper.




Update. Now supports multiple FSes.

I've tried xfs, jfs, ext3, ext4, even vfat. All behave the same on EIO. Didn't try zfs-on-linux or other platforms yet.

Still working on getting ENOSPC on fsync() rather than write(). Kernel code reading suggests this is possible, but all the above FSes reserve space eagerly on write() even if they do delayed allocation of the actual storage, so it doesn't seem to happen, at least in my simple single-process test.

I'm not overly inclined to complain about a fsync() succeeding after a write() error. That seems reasonable enough, the kernel told the app at the time of the failure. What else is it going to do? I don't personally even object hugely to the current fsync() behaviour if it were, say, DOCUMENTED and conformant to the relevant standards, though not giving us any sane way to find out the affected file ranges makes it drastically harder to recover sensibly.

But what's come out since on this thread, that we cannot even rely on fsync() giving us an EIO *once* when it loses our data, because:

- all currently widely deployed kernels can fail to deliver info due to a recently fixed limitation; and
- the kernel deliberately hides errors from us if they relate to writes that occurred before we opened the FD (?)

... that's really troubling. I thought we could at least fix this by PANICing on EIO, and was mostly worried about ENOSPC. But now it seems we can't even do that and expect reliability. So what the @#$ are we meant to do?

It's the error reporting issues around closing and reopening files with outstanding buffered I/O that's really going to hurt us here. I'll be expanding my test case to cover that shortly.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 4 April 2018 at 22:00, Craig Ringer <craig@2ndquadrant.com> wrote:
 
It's the error reporting issues around closing and reopening files with outstanding buffered I/O that's really going to hurt us here. I'll be expanding my test case to cover that shortly.


Also, just to be clear, this is not in any way confined to xfs and/or lvm as I originally thought it might be. 

Nor is ext3/ext4's errors=remount-ro protective. data_err=abort doesn't help either (so what does it do?).

What bewilders me is that running with data=journal doesn't seem to be safe either. WTF? 

[26438.846111] EXT4-fs (dm-0): mounted filesystem with journalled data mode. Opts: errors=remount-ro,data_err=abort,data=journal
[26454.125319] EXT4-fs warning (device dm-0): ext4_end_bio:323: I/O error 10 writing to inode 12 (offset 0 size 0 starting block 59393)
[26454.125326] Buffer I/O error on device dm-0, logical block 59393
[26454.125337] Buffer I/O error on device dm-0, logical block 59394
[26454.125343] Buffer I/O error on device dm-0, logical block 59395
[26454.125350] Buffer I/O error on device dm-0, logical block 59396

and splat, there goes your data anyway.

It's possible that this is in some way related to using the device-mapper "error" target and a loopback device in testing. But I don't really see how.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Apr  4, 2018 at 10:09:09PM +0800, Craig Ringer wrote:
> On 4 April 2018 at 22:00, Craig Ringer <craig@2ndquadrant.com> wrote:
>  
> 
>     It's the error reporting issues around closing and reopening files with
>     outstanding buffered I/O that's really going to hurt us here. I'll be
>     expanding my test case to cover that shortly.
> 
> 
> 
> Also, just to be clear, this is not in any way confined to xfs and/or lvm as I
> originally thought it might be. 
> 
> Nor is ext3/ext4's errors=remount-ro protective. data_err=abort doesn't help
> either (so what does it do?).

Anthony Iliopoulos reported in this thread that errors=remount-ro is
only affected by metadata writes.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


On 4 April 2018 at 22:25, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Apr  4, 2018 at 10:09:09PM +0800, Craig Ringer wrote:
> On 4 April 2018 at 22:00, Craig Ringer <craig@2ndquadrant.com> wrote:
>  
>
>     It's the error reporting issues around closing and reopening files with
>     outstanding buffered I/O that's really going to hurt us here. I'll be
>     expanding my test case to cover that shortly.
>
>
>
> Also, just to be clear, this is not in any way confined to xfs and/or lvm as I
> originally thought it might be. 
>
> Nor is ext3/ext4's errors=remount-ro protective. data_err=abort doesn't help
> either (so what does it do?).

Anthony Iliopoulos reported in this thread that errors=remount-ro is
only affected by metadata writes.

Yep, I gathered. I was referring to data_err.  

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On Wed, Apr 4, 2018 at 4:42 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
>
> On 4 April 2018 at 22:25, Bruce Momjian <bruce@momjian.us> wrote:
>>
>> On Wed, Apr  4, 2018 at 10:09:09PM +0800, Craig Ringer wrote:
>> > On 4 April 2018 at 22:00, Craig Ringer <craig@2ndquadrant.com> wrote:
>> >  
>> >
>> >     It's the error reporting issues around closing and reopening files with
>> >     outstanding buffered I/O that's really going to hurt us here. I'll be
>> >     expanding my test case to cover that shortly.
>> >
>> >
>> >
>> > Also, just to be clear, this is not in any way confined to xfs and/or lvm as I
>> > originally thought it might be.
>> >
>> > Nor is ext3/ext4's errors=remount-ro protective. data_err=abort doesn't help
>> > either (so what does it do?).
>>
>> Anthony Iliopoulos reported in this thread that errors=remount-ro is
>> only affected by metadata writes.
>
>
> Yep, I gathered. I was referring to data_err.  

As far as I recall data_err=abort pertains to the jbd2 handling of
potential writeback errors. Jbd2 will internally attempt to drain
the data upon txn commit (and it's even kind enough to restore
the EIO at the address space level, which otherwise would get eaten).

When data_err=abort is set, then jbd2 forcibly shuts down the
entire journal, with the error being propagated upwards to ext4.
I am not sure at which point this would be manifested to userspace
and how, but in principle any subsequent fs operations would get
some filesystem error due to the journal being down (I would
assume similar to remounting the fs read-only).

Since you are using data=journal, I would indeed expect to see
something more than what you saw in dmesg.

I can have a look later, I plan to also respond to some of the other
interesting issues that you guys raised in the thread.

Best regards,
Anthony

On 4 April 2018 at 21:49, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Apr  4, 2018 at 07:32:04PM +1200, Thomas Munro wrote:
> On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> > On 4 April 2018 at 13:29, Thomas Munro <thomas.munro@enterprisedb.com>
> > wrote:
> >> /* Ensure that we skip any errors that predate opening of the file */
> >> f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
> >>
> >> [...]
> >
> > Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel
> > will deliberately hide writeback errors that predate our fsync() call from
> > us?
>
> Predates the opening of the file by the process that calls fsync().
> Yeah, it sure looks that way based on the above code fragment.  Does
> anyone know better?

Uh, just to clarify, what is new here is that it is ignoring any
_errors_ that happened before the open().  It is not ignoring write()s
that happened but had not been written to storage before the open().

FYI, pg_test_fsync has always tested the ability to fsync() write()s
from other processes:

        Test if fsync on non-write file descriptor is honored:
        (If the times are similar, fsync() can sync data written on a different
        descriptor.)
                write, fsync, close                5360.341 ops/sec     187 usecs/op
                write, close, fsync                4785.240 ops/sec     209 usecs/op

Those two numbers should be similar.  I added this as a check to make
sure the behavior we were relying on was working.  I never tested sync
errors though.

I think the fundamental issue is that we always assumed that writes to
the kernel that could not be written to storage would remain in the
kernel until they succeeded, and that fsync() would report their
existence.

I can understand why kernel developers don't want to keep failed sync
buffers in memory, and once they are gone we lose reporting of their
failure.  Also, if the kernel is going to not retry the syncs, how long
should it keep reporting the sync failure?

Ideally until the app tells it not to.

But there's no standard API for that.

The obvious answer seems to be "until the FD is closed". But we just discussed how Pg relies on being able to open and close files freely. That may not be as reasonable a thing to do as we thought it was when you consider error reporting. What's the kernel meant to do? How long should it remember "I had an error while doing writeback on this file"? Should it flag the file metadata and remember across reboots? Obviously not, but where does it stop? Tell the next program that does an fsync() and forget? How could it associate a dirty buffer on a file with no open FDs with any particular program at all? And what if the app did a write then closed the file and went away, never to bother to check the file again, like most apps do?

Some I/O errors are transient (network issue, etc). Some are recoverable with some sort of action, like disk space issues, but may take a long time before an admin steps in. Some are entirely unrecoverable (disk 1 in striped array is on fire) and there's no possible recovery. Currently we kind of hope the kernel will deal with figuring out which is which and retrying. Turns out it doesn't do that so much, and I don't think the reasons for that are wholly unreasonable. We may have been asking too much.

That does leave us in a pickle when it comes to the checkpointer and opening/closing FDs. I don't know what the "right" thing for the kernel to do from our perspective even is here, but the best I can come up with is actually pretty close to what it does now: report the fsync() error to the first process that does an fsync() since the writeback error, then forget about it.

Ideally I'd have liked it to mark all FDs pointing to the file with a flag so they report EIO on the next fsync() too, but it turns out that won't even help us due to our opening and closing behaviour. So we're going to have to take responsibility for handling and communicating that ourselves, preventing checkpoint completion if any backend gets an fsync error. Probably by PANICing. Some extra work may be needed to ensure reliable ordering and to stop checkpoints completing if their fsync() succeeds only because of a recent failed fsync() in a normal backend that hasn't PANICed or where the postmaster hasn't noticed yet.

Our only option might be to tell administrators to closely watch for
kernel write failure messages, and then restore or failover.  :-(

Speaking of, there's not necessarily any lost page write error in the logs AFAICS. My tests often just show "Buffer I/O error on device dm-0, logical block 59393" or the like.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

On 04. 04. 2018 15:49, Bruce Momjian wrote:
> I can understand why kernel developers don't want to keep failed sync
> buffers in memory, and once they are gone we lose reporting of their
> failure.  Also, if the kernel is going to not retry the syncs, how long
> should it keep reporting the sync failure?  To the first fsync that
> happens after the failure?  How long should it continue to record the
> failure?  What if no fsync() ever happens, which is likely for
> non-Postgres workloads?  I think once they decided to discard failed
> syncs and not retry them, the fsync behavior we are complaining about
> was almost required.
Ideally the kernel would keep its data for as little time as possible.
With fsync, it doesn't really know which process is interested in
knowing about a write error; it just assumes the caller will know how
to deal with it. The most unfortunate issue is that there's no way to
get any detailed information about a write error.

Thinking aloud - couldn't/shouldn't a write error also be a file system
event reported by inotify? Admittedly that's only a thing on Linux, but
still.


Kind regards,
Gasper


On Wed, Apr  4, 2018 at 11:23:51PM +0800, Craig Ringer wrote:
> On 4 April 2018 at 21:49, Bruce Momjian <bruce@momjian.us> wrote:
>     I can understand why kernel developers don't want to keep failed sync
>     buffers in memory, and once they are gone we lose reporting of their
>     failure.  Also, if the kernel is going to not retry the syncs, how long
>     should it keep reporting the sync failure?
> 
> Ideally until the app tells it not to.
> 
> But there's no standard API for that.

You would almost need an API that registers _before_ the failure that
you care about sync failures, and that you plan to call fsync() to
gather such information.  I am not sure how you would allow more than
the first fsync() to see the failure unless you added _another_ API to
clear the fsync failure, but I don't see the point since the first
fsync() might call that clear function.  How many applications are going
to know there is _another_ application that cares about the failure? Not
many.

> Currently we kind of hope the kernel will deal with figuring out which
> is which and retrying. Turns out it doesn't do that so much, and I
> don't think the reasons for that are wholly unreasonable. We may have
> been asking too much.

Agreed.

>     Our only option might be to tell administrators to closely watch for
>     kernel write failure messages, and then restore or failover.  :-(
> 
> Speaking of, there's not necessarily any lost page write error in the logs
> AFAICS. My tests often just show "Buffer I/O error on device dm-0, logical
> block 59393" or the like.

I assume those are the kernel logs.  I am thinking the kernel logs have to
be monitored, but how many administrators do that?  The other issue I
think you are pointing out is: how is the administrator going to know
this is a Postgres file?  I guess any sync error on a device that
contains Postgres has to be assumed to mean Postgres is corrupted.  :-(

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


On Thu, Apr 5, 2018 at 2:00 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> I've tried xfs, jfs, ext3, ext4, even vfat. All behave the same on EIO.
> Didn't try zfs-on-linux or other platforms yet.

I think ZFS will be an outlier here, at least in a pure
write()/fsync() test.  (1) It doesn't even use the OS page cache,
except when you mmap()*.  (2) Its idea of syncing data is to journal
it, and its journal presumably isn't in the OS page cache.  In other
words it doesn't use Linux's usual write-back code paths.

While contemplating what exactly it would do (not sure), I came across
an interesting old thread on the freebsd-current mailing list that
discusses UFS, ZFS and the meaning of POSIX fsync().  Here we see a
report of FreeBSD + UFS doing exactly what the code suggests:

https://lists.freebsd.org/pipermail/freebsd-current/2007-August/076578.html

That is, it keeps the pages dirty so it tells the truth later.
Apparently like Solaris/Illumos (based on drive-by code inspection,
see explicit treatment of retrying, though I'm not entirely sure if
the retry flag is set just for async write-back), and apparently
unlike every other kernel I've tried to grok so far (things descended
from ancestral BSD but not descended from FreeBSD, with macOS/Darwin
apparently in the first category for this purpose).

Here's a new ticket in the NetBSD bug database for this stuff:

http://gnats.netbsd.org/53152

As mentioned in that ticket and by Andres earlier in this thread,
keeping the page dirty isn't the only strategy that would work and may
be problematic in different ways (it tells the truth but floods your
cache with unflushable stuff until you force an unmount and
your buffers are eventually invalidated after ENXIO errors?  I don't
know.).  I have no qualified opinion on that.  I just know that we
need a way for fsync() to tell the truth about all preceding writes or
our checkpoints are busted.

*We mmap() + msync() in pg_flush_data() if you don't have
sync_file_range(), and I see now that that is probably not a great
idea on ZFS because you'll finish up double-buffering (or is that
triple-buffering?), flooding your page cache with transient data.
Oops.  That is off-topic and not relevant for the checkpoint
correctness topic of this thread though, since pg_flush_data() is
advisory only.

-- 
Thomas Munro
http://www.enterprisedb.com


On Thu, Apr 5, 2018 at 9:28 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Thu, Apr 5, 2018 at 2:00 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> I've tried xfs, jfs, ext3, ext4, even vfat. All behave the same on EIO.
>> Didn't try zfs-on-linux or other platforms yet.
>
> While contemplating what exactly it would do (not sure),

See manual for failmode=wait | continue | panic.  Even "continue"
returns EIO to all new write requests, so they apparently didn't
bother to supply an 'eat-my-data-but-tell-me-everything-is-fine' mode.
Figures.

-- 
Thomas Munro
http://www.enterprisedb.com


Summary to date:


It's worse than I thought originally, because:

- Most widely deployed kernels have cases where they don't tell you about losing your writes at all; and
- Information about loss of writes can be masked by closing and re-opening a file

So the checkpointer cannot trust that a successful fsync() means ... a successful fsync().

Also, it's been reported to me off-list that anyone on the system calling sync(2) or the sync shell command will also generally consume the write error, causing us not to see it when we fsync(). The same is true for /proc/sys/vm/drop_caches. I have not tested these yet.

There's some level of agreement that we should PANIC on fsync() errors, at least on Linux, but likely everywhere. But we also now know it's insufficient to be fully protective.

I previously thought that errors=remount-ro was a sufficient safeguard. It isn't. There doesn't seem to be anything that is, for ext3, ext4, btrfs or xfs.

It's not clear to me yet why data_err=abort isn't sufficient in data=ordered or data=writeback mode on ext3 or ext4; that needs more digging. (In my test tools that's:
    make FSTYPE=ext4 MKFSOPTS="" MOUNTOPTS="errors=remount-ro,data_err=abort,data=journal"
as of the current version d7fe802ec.) AFAICS that's because data_err=abort only affects data=ordered, not data=journal. If you use data=ordered, you at least get retries of the same failed write. This post https://lkml.org/lkml/2008/10/10/80 added the option and has some explanation, but doesn't explain why it doesn't affect data=journal.

zfs is probably not affected by the issues, per Thomas Munro. I haven't run my test scripts on it yet because my kernel doesn't have zfs support and I'm prioritising the multi-process / open-and-close issues.

So far none of the FSes and options I've tried exhibit the behaviour I actually want, which is to make the fs readonly or inaccessible on I/O error.

ENOSPC doesn't seem to be a concern during normal operation of major file systems (ext3, ext4, btrfs, xfs) because they reserve space before returning from write(). But if a buffered write does manage to fail due to ENOSPC we'll definitely see the same problems. This makes ENOSPC on NFS a potentially data corrupting condition since NFS doesn't preallocate space before returning from write().

I think what we really need is a block-layer fix, where an I/O error flips the block device into read-only mode, as if blockdev --setro had been used. Though I'd settle for a kernel panic, frankly. I don't think anybody really wants this, but I'd rather have either of those than silent data loss.

I'm currently tweaking my test to close and reopen the file between each write() and fsync(), and to support running with NFS.

I've also just found the device-mapper "flakey" driver, which looks fantastic for simulating unreliable I/O with intermittent faults. I've been using the "error" target in a mapping, which lets me remap some of the device to always error, but "flakey" looks very handy for actual PostgreSQL testing.

For the sake of Google, these are errors known to be associated with the problem:

ext4, and ext3 mounted with ext4 driver:

[42084.327345] EXT4-fs warning (device dm-0): ext4_end_bio:323: I/O error 10 writing to inode 12 (offset 0 size 0 starting block 59393)
[42084.327352] Buffer I/O error on device dm-0, logical block 59393

xfs:

[42193.771367] XFS (dm-0): writeback error on sector 118784
[42193.784477] XFS (dm-0): writeback error on sector 118784

jfs: (nil, silence in the kernel logs)

You should also beware of "lost page write" or "lost write" errors.
On 5 April 2018 at 15:09, Craig Ringer <craig@2ndquadrant.com> wrote:
 
Also, it's been reported to me off-list that anyone on the system calling sync(2) or the sync shell command will also generally consume the write error, causing us not to see it when we fsync(). The same is true for /proc/sys/vm/drop_caches. I have not tested these yet.

I just confirmed this with a tweak to the test that

records the file position
close()s the fd
sync()s
open()s the file
lseek()s back to the recorded position

This causes the test to completely ignore the I/O error, which is not reported to it at any time.

Fair enough, really, when you look at it from the kernel's point of view. What else can it do? Nobody has the file open. It'd have to mark the file itself as bad somehow. But that's pretty bad for our robustness AFAICS.
 
There's some level of agreement that we should PANIC on fsync() errors, at least on Linux, but likely everywhere. But we also now know it's insufficient to be fully protective.


If dirty writeback fails between our close() and re-open() I see the same behaviour as with sync(). To test that I set dirty_writeback_centisecs and dirty_expire_centisecs to 1 and added a usleep(3*100*1000) between close() and open(). (It's still plenty slow). So sync() is a convenient way to simulate something other than our own fsync() writing out the dirty buffer.


If I omit the sync() then we get the error reported by fsync() once when we re-open() the file and fsync() it, because the buffers weren't written out yet, so the error wasn't generated until after we re-open()ed the file. But I doubt that'll happen much in practice because dirty writeback will get to it first, so the error will be seen and discarded before we reopen the file in the checkpointer.

In other words, it looks like *even with a new kernel with the error reporting bug fixes*, if I understand how the backends and checkpointer interact when it comes to file descriptors, we're unlikely to notice I/O errors and fail the checkpoint as we should. We may notice I/O errors if a backend does its own eager writeback for large I/O operations, or if the checkpointer fsync()s a file before the kernel's dirty writeback gets around to trying to flush the pages that will fail.

I haven't tested anything with multiple processes / multiple FDs yet, where we keep one fd open while writing on another.

But at this point I don't see any way to make Pg reliably detect I/O errors and fail a checkpoint then redo and retry. To even fix this by PANICing like I proposed originally, we need to know we have to PANIC.

AFAICS it's completely unsafe to write(), close(), open() and fsync() and expect that the fsync() makes any promises about the write(). Which if I read Pg's low level storage code right, makes it completely unable to reliably detect I/O errors.

When you put it that way, it sounds fair enough too. How long is the kernel meant to remember that there was a write error on the file, triggered by a write initiated by some seemingly unrelated process some unbounded time ago, on a since-closed file?

But it seems to put Pg on the fast track to O_DIRECT.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Apr  5, 2018 at 03:09:57PM +0800, Craig Ringer wrote:
> ENOSPC doesn't seem to be a concern during normal operation of major file
> systems (ext3, ext4, btrfs, xfs) because they reserve space before returning
> from write(). But if a buffered write does manage to fail due to ENOSPC we'll
> definitely see the same problems. This makes ENOSPC on NFS a potentially data
> corrupting condition since NFS doesn't preallocate space before returning from
> write().

This does explain why NFS has a reputation for unreliability for
Postgres.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Note: as I've brought up in another thread, it turns out that PG is not
handling fsync errors correctly even when the OS _does_ do the right
thing (discovered by testing on FreeBSD).

-- 
Andrew (irc:RhodiumToad)


On 6 April 2018 at 07:37, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
Note: as I've brought up in another thread, it turns out that PG is not
handling fsync errors correctly even when the OS _does_ do the right
thing (discovered by testing on FreeBSD).

Yikes. For other readers, the related thread for this is https://www.postgresql.org/message-id/87y3i1ia4w.fsf@news-spur.riddles.org.uk 

Meanwhile, I've extended my test to run postgres on a deliberately faulty volume and confirmed my results there. 

_____________

2018-04-06 01:11:40.555 UTC [58] LOG:  checkpoint starting: immediate force wait
2018-04-06 01:11:40.567 UTC [58] ERROR:  could not fsync file "base/12992/16386": Input/output error
2018-04-06 01:11:40.655 UTC [66] ERROR:  checkpoint request failed
2018-04-06 01:11:40.655 UTC [66] HINT:  Consult recent messages in the server log for details.
2018-04-06 01:11:40.655 UTC [66] STATEMENT:  CHECKPOINT

Checkpoint failed with checkpoint request failed
HINT:  Consult recent messages in the server log for details.

Retrying

2018-04-06 01:11:41.568 UTC [58] LOG:  checkpoint starting: immediate force wait
2018-04-06 01:11:41.614 UTC [58] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.046 s; sync files=3, longest=0.000 s, average=0.000 s; distance=2727 kB, estimate=2779 kB

Oops, it worked! We ignored the error and checkpointed OK.
_____________

 
Given your report, now I have to wonder if we even reissued the fsync() at all this time. 'perf' time. OK, with

sudo perf record -e syscalls:sys_enter_fsync,syscalls:sys_exit_fsync -a
sudo perf script

I see the failed fsync(), then the same fd being fsync()d without error on the next checkpoint, which succeeds.

        postgres  9602 [003] 72380.325817: syscalls:sys_enter_fsync: fd: 0x00000005
        postgres  9602 [003] 72380.325931:  syscalls:sys_exit_fsync: 0xfffffffffffffffb
... 
        postgres  9602 [000] 72381.336767: syscalls:sys_enter_fsync: fd: 0x00000005
        postgres  9602 [000] 72381.336840:  syscalls:sys_exit_fsync: 0x0

... and Pg continues merrily on its way without realising it lost data:

[72379.834872] XFS (dm-0): writeback error on sector 118752
[72380.324707] XFS (dm-0): writeback error on sector 118688

In this test I set things up so the checkpointer would see the first fsync() error. But if I make checkpoints less frequent, the bgwriter aggressive, and kernel dirty writeback aggressive, it should be possible to have the failure go completely unobserved too. I'll try that next, because we've already largely concluded that the solution to the issue above is to PANIC on fsync() error. But if we don't see the error at all we're in trouble.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Apr 6, 2018 at 1:27 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 6 April 2018 at 07:37, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>> Note: as I've brought up in another thread, it turns out that PG is not
>> handling fsync errors correctly even when the OS _does_ do the right
>> thing (discovered by testing on FreeBSD).
>
> Yikes. For other readers, the related thread for this is
> https://www.postgresql.org/message-id/87y3i1ia4w.fsf@news-spur.riddles.org.uk

Yeah.  That's really embarrassing, especially after beating up on
various operating systems all week.  It's also an independent issue --
let's keep that on the other thread and get it fixed.

> I see the failed fync, then the same fd being fsync()d without error on the
> next checkpoint, which succeeds.
>
>         postgres  9602 [003] 72380.325817: syscalls:sys_enter_fsync: fd:
> 0x00000005
>         postgres  9602 [003] 72380.325931:  syscalls:sys_exit_fsync:
> 0xfffffffffffffffb
> ...
>         postgres  9602 [000] 72381.336767: syscalls:sys_enter_fsync: fd:
> 0x00000005
>         postgres  9602 [000] 72381.336840:  syscalls:sys_exit_fsync: 0x0
>
> ... and Pg continues merrily on its way without realising it lost data:
>
> [72379.834872] XFS (dm-0): writeback error on sector 118752
> [72380.324707] XFS (dm-0): writeback error on sector 118688
>
> In this test I set things up so the checkpointer would see the first fsync()
> error. But if I make checkpoints less frequent, the bgwriter aggressive, and
> kernel dirty writeback aggressive, it should be possible to have the failure
> go completely unobserved too. I'll try that next, because we've already
> largely concluded that the solution to the issue above is to PANIC on
> fsync() error. But if we don't see the error at all we're in trouble.

I suppose you only see errors because the file descriptors linger open
in the virtual file descriptor cache, which is a matter of luck
depending on how many relation segment files you touched.  One thing
you could try to confirm our understanding of the Linux 4.13+ policy
would be to hack PostgreSQL so that it reopens the file descriptor
every time in mdsync().  See attached.

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment
On 6 April 2018 at 10:53, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Fri, Apr 6, 2018 at 1:27 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 6 April 2018 at 07:37, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>> Note: as I've brought up in another thread, it turns out that PG is not
>> handling fsync errors correctly even when the OS _does_ do the right
>> thing (discovered by testing on FreeBSD).
>
> Yikes. For other readers, the related thread for this is
> https://www.postgresql.org/message-id/87y3i1ia4w.fsf@news-spur.riddles.org.uk

Yeah.  That's really embarrassing, especially after beating up on
various operating systems all week.  It's also an independent issue --
let's keep that on the other thread and get it fixed.

> I see the failed fync, then the same fd being fsync()d without error on the
> next checkpoint, which succeeds.
>
>         postgres  9602 [003] 72380.325817: syscalls:sys_enter_fsync: fd:
> 0x00000005
>         postgres  9602 [003] 72380.325931:  syscalls:sys_exit_fsync:
> 0xfffffffffffffffb
> ...
>         postgres  9602 [000] 72381.336767: syscalls:sys_enter_fsync: fd:
> 0x00000005
>         postgres  9602 [000] 72381.336840:  syscalls:sys_exit_fsync: 0x0
>
> ... and Pg continues merrily on its way without realising it lost data:
>
> [72379.834872] XFS (dm-0): writeback error on sector 118752
> [72380.324707] XFS (dm-0): writeback error on sector 118688
>
> In this test I set things up so the checkpointer would see the first fsync()
> error. But if I make checkpoints less frequent, the bgwriter aggressive, and
> kernel dirty writeback aggressive, it should be possible to have the failure
> go completely unobserved too. I'll try that next, because we've already
> largely concluded that the solution to the issue above is to PANIC on
> fsync() error. But if we don't see the error at all we're in trouble.

I suppose you only see errors because the file descriptors linger open
in the virtual file descriptor cache, which is a matter of luck
depending on how many relation segment files you touched.

In this case I think it's because the kernel didn't get around to doing the writeback before the eagerly forced checkpoint fsync()'d it. Or we didn't even queue it for writeback from our own shared_buffers until just before we fsync()'d it. After all, it's a contrived test case that tries to reproduce the issue rapidly with big writes and frequent checkpoints.

So the checkpointer had the relation open to fsync() it, and it was the checkpointer's fsync() that did writeback on the dirty page and noticed the error.

If the kernel had done the writeback before the checkpointer opened the relation to fsync() it, we might not have seen the error at all - though as you note this depends on the file descriptor cache. You can see the silent-error behaviour in my standalone test case where I confirmed the post-4.13 behaviour. (I'm on 4.14 here).

I can try to reproduce it with postgres too, but it not only requires closing and reopening the FDs, it also requires forcing writeback before opening the fd. To make it occur in a practical timeframe I have to make my kernel writeback settings insanely aggressive and/or call sync() before re-open()ing. I don't really think it's worth it, since I've confirmed the behaviour already with the simpler test in standalone/ in the test repo. To try it yourself, clone 


and in the master branch

cd testcases/fsync-error-clear
less README
make REOPEN=reopen standalone-run


I've pushed the postgres test to that repo too; "make postgres-run".

You'll need docker, and be warned, it's using privileged docker containers and messing with dmsetup.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
So, what can we actually do about this new Linux behaviour?

Idea 1:

* whenever you open a file, either tell the checkpointer so it can
open it too (and wait for it to tell you that it has done so, because
it's not safe to write() until then), or send it a copy of the file
descriptor via IPC (since duplicated file descriptors share the same
f_wb_err)

* if the checkpointer can't take any more file descriptors (how would
that limit even work in the IPC case?), then it somehow needs to tell
you that so that you know that you're responsible for fsyncing that
file yourself, both on close (due to fd cache recycling) and also when
the checkpointer tells you to

Maybe it could be made to work, but sheesh, that seems horrible.  Is
there some simpler idea along these lines that could make sure that
fsync() is only ever called on file descriptors that were opened
before all unflushed writes, or file descriptors cloned from such file
descriptors?

Idea 2:

Give up, complain that this implementation is defective and
unworkable, both on POSIX-compliance grounds and on POLA grounds, and
campaign to get it fixed more fundamentally (actual details left to
the experts, no point in speculating here, but we've seen a few
approaches that work on other operating systems including keeping
buffers dirty and marking the whole filesystem broken/read-only).

Idea 3:

Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.

Any other ideas?

For a while I considered suggesting an idea which I now think doesn't
work.  I thought we could try asking for a new fcntl interface that
spits out wb_err counter.  Call it an opaque error token or something.
Then we could store it in our fsync queue and safely close the file.
Check again before fsync()ing, and if we ever see a different value,
PANIC because it means a writeback error happened while we weren't
looking.  Sadly I think it doesn't work because AIUI inodes are not
pinned in kernel memory when no one has the file open and there are no
dirty buffers, so I think the counters could go away and be reset.
Perhaps you could keep inodes pinned by keeping the associated buffers
dirty after an error (like FreeBSD), but if you did that you'd have
solved the problem already and wouldn't really need the wb_err system
at all.  Is there some other idea along these lines that could work?

-- 
Thomas Munro
http://www.enterprisedb.com


On Sun, Apr  8, 2018 at 02:16:07PM +1200, Thomas Munro wrote:
> So, what can we actually do about this new Linux behaviour?
> 
> Idea 1:
> 
> * whenever you open a file, either tell the checkpointer so it can
> open it too (and wait for it to tell you that it has done so, because
> it's not safe to write() until then), or send it a copy of the file
> descriptor via IPC (since duplicated file descriptors share the same
> f_wb_err)
> 
> * if the checkpointer can't take any more file descriptors (how would
> that limit even work in the IPC case?), then it somehow needs to tell
> you that so that you know that you're responsible for fsyncing that
> file yourself, both on close (due to fd cache recycling) and also when
> the checkpointer tells you to
> 
> Maybe it could be made to work, but sheesh, that seems horrible.  Is
> there some simpler idea along these lines that could make sure that
> fsync() is only ever called on file descriptors that were opened
> before all unflushed writes, or file descriptors cloned from such file
> descriptors?
> 
> Idea 2:
> 
> Give up, complain that this implementation is defective and
> unworkable, both on POSIX-compliance grounds and on POLA grounds, and
> campaign to get it fixed more fundamentally (actual details left to
> the experts, no point in speculating here, but we've seen a few
> approaches that work on other operating systems including keeping
> buffers dirty and marking the whole filesystem broken/read-only).
> 
> Idea 3:
> 
> Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.

Idea 4 would be for people to assume their database is corrupt if their
server logs report any I/O error on the file systems Postgres uses.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


> On Apr 7, 2018, at 19:33, Bruce Momjian <bruce@momjian.us> wrote:
> Idea 4 would be for people to assume their database is corrupt if their
> server logs report any I/O error on the file systems Postgres uses.

Pragmatically, that's where we are right now.  The best answer in this bad situation is (a) fix the error, then (b) replay from a checkpoint before the error occurred, but it appears we can't even guarantee that a PostgreSQL process will be the one to see the error.

--
-- Christophe Pettus
   xof@thebuild.com



On 8 April 2018 at 10:16, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
So, what can we actually do about this new Linux behaviour?

Yeah, I've been chewing over that myself.

More below, but here's an idea #5: decide InnoDB has the right idea, and go to using a single massive blob file, or a few giant blobs.

We have a storage abstraction that makes this way, way less painful than it should be.

We can virtualize relfilenodes into storage extents in relatively few big files. We could use sparse regions to make the addressing more convenient, but that makes copying and backup painful, so I'd rather not.

Even one file per tablespace for persistent relation heaps, another for indexes, another for each fork type.

That way we can use something like your #1 (which is what I was also thinking about then rejecting previously), but reduce the pain by reducing the FD count drastically so exhausting FDs stops being a problem.


Previously I was leaning toward what you've described here:
 
* whenever you open a file, either tell the checkpointer so it can
open it too (and wait for it to tell you that it has done so, because
it's not safe to write() until then), or send it a copy of the file
descriptor via IPC (since duplicated file descriptors share the same
f_wb_err)

* if the checkpointer can't take any more file descriptors (how would
that limit even work in the IPC case?), then it somehow needs to tell
you that so that you know that you're responsible for fsyncing that
file yourself, both on close (due to fd cache recycling) and also when
the checkpointer tells you to

Maybe it could be made to work, but sheesh, that seems horrible.  Is
there some simpler idea along these lines that could make sure that
fsync() is only ever called on file descriptors that were opened
before all unflushed writes, or file descriptors cloned from such file
descriptors?

... and got stuck on "yuck, that's awful".

I was assuming we'd force early checkpoints if the checkpointer hit its fd limit, but that's even worse.

We'd need to urgently do away with segmented relations, and partitions would start to become a hindrance.

Even then it's going to be an unworkable nightmare with heavily partitioned systems, systems that use schema-sharding, etc. And it'll mean we need to play with process limits and, often, system wide limits on FDs. I imagine the performance implications won't be pretty.


Idea 2:

Give up, complain that this implementation is defective and
unworkable, both on POSIX-compliance grounds and on POLA grounds, and
campaign to get it fixed more fundamentally (actual details left to
the experts, no point in speculating here, but we've seen a few
approaches that work on other operating systems including keeping
buffers dirty and marking the whole filesystem broken/read-only).

This appears to be what SQLite does AFAICS.


though it has the huge luxury of a single writer, so it's probably only subject to the original issue not the multiprocess / checkpointer issues we face.

 
Idea 3:

Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.

That seems to be what the kernel folks will expect. But that's going to KILL performance. We'll need writer threads to have any hope of it not *totally* sucking, because otherwise simple things like updating a heap tuple and two related indexes will incur enormous disk latencies.

But I suspect it's the path forward.

Goody.


 
Any other ideas?

For a while I considered suggesting an idea which I now think doesn't
work.  I thought we could try asking for a new fcntl interface that
spits out wb_err counter.  Call it an opaque error token or something.
Then we could store it in our fsync queue and safely close the file.
Check again before fsync()ing, and if we ever see a different value,
PANIC because it means a writeback error happened while we weren't
looking.  Sadly I think it doesn't work because AIUI inodes are not
pinned in kernel memory when no one has the file open and there are no
dirty buffers, so I think the counters could go away and be reset.
Perhaps you could keep inodes pinned by keeping the associated buffers
dirty after an error (like FreeBSD), but if you did that you'd have
solved the problem already and wouldn't really need the wb_err system
at all.  Is there some other idea long these lines that could work?

I think our underlying data syncing concept is fundamentally broken, and it's not really the kernel's fault. 

We assume that we can safely:

procA: open()
procA: write()
procA: close()

... some long time later, unbounded as far as the kernel is concerned ...

procB: open()
procB: fsync()
procB: close()


If the kernel does writeback in the middle, how on earth is it supposed to know we expect to reopen the file and check back later?

Should it just remember "this file had an error" forever, and tell every caller? In that case how could we recover? We'd need some new API to say "yeah, ok already, I'm redoing all my work since the last good fsync() so you can clear the error flag now". Otherwise it'd keep reporting an error after we did redo to recover, too.

I never really clicked to the fact that we closed relations with pending buffered writes, left them closed, then reopened them to fsync. That's .... well, the kernel isn't the only thing doing crazy things here.

Right now I think we're at option (4): If you see anything that smells like a write error in your kernel logs, hard-kill postgres with -m immediate (do NOT let it do a shutdown checkpoint). If it did a checkpoint since the logs, fake up a backup label to force redo to start from the last checkpoint before the error. Otherwise, it's safe to just let it start up again and do redo again.

Fun times.

This also means AFAICS that running Pg on NFS is extremely unsafe: you MUST make sure you don't run out of disk, because the usual safeguard of space reservation against ENOSPC in fsync doesn't apply to NFS. (I haven't tested this with nfsv3 in sync,hard,nointr mode yet, *maybe* that's safe, but I doubt it). The same applies to thin-provisioned storage. Just. Don't.

This helps explain various reports of corruption in Docker and various other tools that use various sorts of thin provisioning. If you hit ENOSPC in fsync(), bye bye data.




--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Apr 7, 2018 at 8:27 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> More below, but here's an idea #5: decide InnoDB has the right idea, and go
> to using a single massive blob file, or a few giant blobs.
>
> We have a storage abstraction that makes this way, way less painful than it
> should be.
>
> We can virtualize relfilenodes into storage extents in relatively few big
> files. We could use sparse regions to make the addressing more convenient,
> but that makes copying and backup painful, so I'd rather not.
>
> Even one file per tablespace for persistent relation heaps, another for
> indexes, another for each fork type.

I'm not sure that we can do that now, since it would break the new
"Optimize btree insertions for common case of increasing values"
optimization. (I did mention this before it went in.)

I've asked Pavan to at least add a note to the nbtree README that
explains the high level theory behind the optimization, as part of
post-commit clean-up. I'll ask him to say something about how it might
affect extent-based storage, too.

-- 
Peter Geoghegan


> On Apr 7, 2018, at 20:27, Craig Ringer <craig@2ndQuadrant.com> wrote:
>
> Right now I think we're at option (4): If you see anything that smells like a write error in your kernel logs,
> hard-kill postgres with -m immediate (do NOT let it do a shutdown checkpoint). If it did a checkpoint since the logs,
> fake up a backup label to force redo to start from the last checkpoint before the error. Otherwise, it's safe to just
> let it start up again and do redo again.

Before we spiral down into despair and excessive alcohol consumption, this is basically the same situation as a checksum failure or some other kind of uncorrected media-level error.  The bad part is that we have to find out from the kernel logs rather than from PostgreSQL directly.  But this does not strike me as otherwise significantly different from, say, an infrequently-accessed disk block reporting an uncorrectable error when we finally get around to reading it.

--
-- Christophe Pettus
   xof@thebuild.com



On 04/08/2018 05:27 AM, Craig Ringer wrote:
> More below, but here's an idea #5: decide InnoDB has the right idea, and
> go to using a single massive blob file, or a few giant blobs.

FYI: MySQL has by default one file per table these days. The old 
approach with one massive file was a maintenance headache so they changed 
the default some releases ago.

https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html

Andreas




On 8 April 2018 at 11:46, Christophe Pettus <xof@thebuild.com> wrote:

> On Apr 7, 2018, at 20:27, Craig Ringer <craig@2ndQuadrant.com> wrote:
>
> Right now I think we're at option (4): If you see anything that smells like a write error in your kernel logs, hard-kill postgres with -m immediate (do NOT let it do a shutdown checkpoint). If it did a checkpoint since the logs, fake up a backup label to force redo to start from the last checkpoint before the error. Otherwise, it's safe to just let it start up again and do redo again.

Before we spiral down into despair and excessive alcohol consumption, this is basically the same situation as a checksum failure or some other kind of uncorrected media-level error.  The bad part is that we have to find out from the kernel logs rather than from PostgreSQL directly.  But this does not strike me as otherwise significantly different from, say, an infrequently-accessed disk block reporting an uncorrectable error when we finally get around to reading it.

I don't entirely agree - because it affects ENOSPC, I/O errors on thin provisioned storage, I/O errors on multipath storage, etc. (I identified the original issue on a thin provisioned system that ran out of backing space, mangling PostgreSQL in a way that made no sense at the time).

These are way more likely than bit flips or other storage level corruption, and things that we previously expected to detect and fail gracefully for.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 8 April 2018 at 17:41, Andreas Karlsson <andreas@proxel.se> wrote:
On 04/08/2018 05:27 AM, Craig Ringer wrote:
> More below, but here's an idea #5: decide InnoDB has the right idea, and
> go to using a single massive blob file, or a few giant blobs.

FYI: MySQL has by default one file per table these days. The old approach with one massive file was a maintenance headache so they change the default some releases ago.

https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html


Huh, thanks for the update.

We should see how they handle reliable flushing and see if they've looked into it. If they haven't, we should give them a heads-up and if they have, lets learn from them.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
> On Apr 8, 2018, at 03:30, Craig Ringer <craig@2ndQuadrant.com> wrote:
>
> These are way more likely than bit flips or other storage level corruption, and things that we previously expected to
> detect and fail gracefully for.

This is definitely bad, and it explains a few otherwise-inexplicable corruption issues we've seen.  (And great work tracking it down!)  I think it's important not to panic, though; PostgreSQL doesn't have a reputation for horrible data integrity.  I'm not sure it makes sense to do a major rearchitecting of the storage layer (especially with pluggable storage coming along) to address this.  While the failure modes are more common, the solution (a PITR backup) is one that an installation should have anyway against media failures.

--
-- Christophe Pettus
   xof@thebuild.com



On 8 April 2018 at 04:27, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 8 April 2018 at 10:16, Thomas Munro <thomas.munro@enterprisedb.com>
> wrote:
>
> If the kernel does writeback in the middle, how on earth is it supposed to
> know we expect to reopen the file and check back later?
>
> Should it just remember "this file had an error" forever, and tell every
> caller? In that case how could we recover? We'd need some new API to say
> "yeah, ok already, I'm redoing all my work since the last good fsync() so
> you can clear the error flag now". Otherwise it'd keep reporting an error
> after we did redo to recover, too.

There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep
track of any errors. We just need fsync to accurately report whether
all the buffers in the file have been written out. When you call fsync
again the kernel needs to initiate i/o on all the dirty buffers and
block until they complete successfully. If they complete successfully
then nobody cares whether they had some failure in the past when i/o
was initiated at some point in the past.

The problem is not that errors aren't being tracked correctly. The
problem is that dirty buffers are being marked clean when they haven't
been written out. They consider dirty filesystem buffers when there's
hardware failure preventing them from being written "a memory leak".

As long as any error means the kernel has discarded writes then
there's no real hope of any reliable operation through that interface.

Going to DIRECTIO is basically recognizing this. That the kernel
filesystem buffer provides no reliable interface so we need to
reimplement it ourselves in user space.

It's rather disheartening. Aside from having to do all that work we
have the added barrier that we don't have as much information about
the hardware as the kernel has. We don't know where raid stripes begin
and end, how big the memory controller buffers are or how to tell when
they're full or empty or how to flush them. etc etc. We also don't
know what else is going on on the machine.

-- 
greg


> On Apr 8, 2018, at 14:23, Greg Stark <stark@mit.edu> wrote:
>
> They consider dirty filesystem buffers when there's
> hardware failure preventing them from being written "a memory leak".

That's not an irrational position.  File system buffers are *not* dedicated memory for file system caching; they're being used for that because no one has a better use for them at that moment.  If an inability to flush them to disk meant that they suddenly became pinned memory, a large copy operation to a yanked USB drive could result in the system having no more allocatable memory.  I guess in theory that they could swap them, but swapping out a file system buffer in hopes that sometime in the future it could be properly written doesn't seem very architecturally sound to me.

--
-- Christophe Pettus
   xof@thebuild.com



On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:
> On 8 April 2018 at 04:27, Craig Ringer <craig@2ndquadrant.com> wrote:
> > On 8 April 2018 at 10:16, Thomas Munro <thomas.munro@enterprisedb.com>
> > wrote:
> >
> > If the kernel does writeback in the middle, how on earth is it supposed to
> > know we expect to reopen the file and check back later?
> >
> > Should it just remember "this file had an error" forever, and tell every
> > caller? In that case how could we recover? We'd need some new API to say
> > "yeah, ok already, I'm redoing all my work since the last good fsync() so
> > you can clear the error flag now". Otherwise it'd keep reporting an error
> > after we did redo to recover, too.
> 
> There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep
> track of any errors. We just need fsync to accurately report whether
> all the buffers in the file have been written out. When you call fsync

Instead, fsync() reports when some of the buffers have not been
written out, due to reasons outlined before. As such it may make
some sense to maintain some tracking regarding errors even after
marking failed dirty pages as clean (in fact it has been proposed,
but this introduces memory overhead).

> again the kernel needs to initiate i/o on all the dirty buffers and
> block until they complete successfully. If they complete successfully
> then nobody cares whether they had some failure in the past when i/o
> was initiated at some point in the past.

The question is, what should the kernel and application do in cases
where this is simply not possible (according to freebsd that keeps
dirty pages around after failure, for example, -EIO from the block
layer is a contract for unrecoverable errors so it is pointless to
keep them dirty). You'd need a specialized interface to clear-out
the errors (and drop the dirty pages), or potentially just remount
the filesystem.

> The problem is not that errors aren't been tracked correctly. The
> problem is that dirty buffers are being marked clean when they haven't
> been written out. They consider dirty filesystem buffers when there's
> hardware failure preventing them from being written "a memory leak".
> 
> As long as any error means the kernel has discarded writes then
> there's no real hope of any reliable operation through that interface.

This does not necessarily follow. Whether the kernel discards writes
or not would not really help (see above). It is more a matter of
proper "reporting contract" between userspace and kernel, and tracking
would be a way for facilitating this vs. having a more complex userspace
scheme (as described by others in this thread) where synchronization
for fsync() is required in a multi-process application.

Best regards,
Anthony


On Sun, Apr 8, 2018 at 09:38:03AM -0700, Christophe Pettus wrote:
>
> > On Apr 8, 2018, at 03:30, Craig Ringer <craig@2ndQuadrant.com>
> > wrote:
> >
> > These are way more likely than bit flips or other storage level
> > corruption, and things that we previously expected to detect and
> > fail gracefully for.
>
> This is definitely bad, and it explains a few otherwise-inexplicable
> corruption issues we've seen.  (And great work tracking it down!)  I
> think it's important not to panic, though; PostgreSQL doesn't have a
> reputation for horrible data integrity.  I'm not sure it makes sense
> to do a major rearchitecting of the storage layer (especially with
> pluggable storage coming along) to address this.  While the failure
> modes are more common, the solution (a PITR backup) is one that an
> installation should have anyway against media failures.

I think the big problem is that we don't have any way of stopping
Postgres at the time the kernel reports the errors to the kernel log, so
we are then returning potentially incorrect results and committing
transactions that might be wrong or lost.  If we could stop Postgres
when such errors happen, at least the administrator could fix the
problem or fail over to a standby.

A crazy idea would be to have a daemon that checks the logs and stops
Postgres when it sees something wrong.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


> On Apr 8, 2018, at 15:29, Bruce Momjian <bruce@momjian.us> wrote:
> I think the big problem is that we don't have any way of stopping
> Postgres at the time the kernel reports the errors to the kernel log, so
> we are then returning potentially incorrect results and committing
> transactions that might be wrong or lost.

Yeah, it's bad.  In the short term, the best advice to installations is to monitor their kernel logs for errors (which very few do right now), and make sure they have a backup strategy which can encompass restoring from an error like this.  Even Craig's smart fix of patching the backup label to recover from a previous checkpoint doesn't do much good if we don't have WAL records back that far (or one of the required WAL records also took a hit).

In the longer term... O_DIRECT seems like the most plausible way out of this, but that might not be popular with people running on file systems or OSes that don't have this issue.  (Setting aside the daunting prospect of implementing that.)

--
-- Christophe Pettus
   xof@thebuild.com



On 2018-04-08 18:29:16 -0400, Bruce Momjian wrote:
> On Sun, Apr 8, 2018 at 09:38:03AM -0700, Christophe Pettus wrote:
> >
> > > On Apr 8, 2018, at 03:30, Craig Ringer <craig@2ndQuadrant.com>
> > > wrote:
> > >
> > > These are way more likely than bit flips or other storage level
> > > corruption, and things that we previously expected to detect and
> > > fail gracefully for.
> >
> > This is definitely bad, and it explains a few otherwise-inexplicable
> > corruption issues we've seen.  (And great work tracking it down!)  I
> > think it's important not to panic, though; PostgreSQL doesn't have a
> > reputation for horrible data integrity.  I'm not sure it makes sense
> > to do a major rearchitecting of the storage layer (especially with
> > pluggable storage coming along) to address this.  While the failure
> > modes are more common, the solution (a PITR backup) is one that an
> > installation should have anyway against media failures.
> 
> I think the big problem is that we don't have any way of stopping
> Postgres at the time the kernel reports the errors to the kernel log, so
> we are then returning potentially incorrect results and committing
> transactions that might be wrong or lost.  If we could stop Postgres
> when such errors happen, at least the administrator could fix the
> problem of fail-over to a standby.
> 
> An crazy idea would be to have a daemon that checks the logs and stops
> Postgres when it seems something wrong.

I think the danger presented here is far smaller than some of the
statements in this thread might make one think. In all likelihood, once
you've got an IO error that kernel level retries don't fix, your
database is screwed. Whether fsync reports that or not is really
somewhat beside the point. We don't panic that way when getting IO
errors during reads either, and they're more likely to be persistent
than errors during writes (because remapping on storage layer can fix
issues, but not during reads).

There's a lot of not so great things here, but I don't think there's any
need to panic.

We should fix things so that reported errors are treated with crash
recovery, and for the rest I think there's very fair arguments to be
made that that's far outside postgres's remit.

I think there's pretty good reasons to go to direct IO where supported,
but error handling doesn't strike me as a particularly good reason for
the move.

Greetings,

Andres Freund


> On Apr 8, 2018, at 16:16, Andres Freund <andres@anarazel.de> wrote:
> We don't panic that way when getting IO
> errors during reads either, and they're more likely to be persistent
> than errors during writes (because remapping on storage layer can fix
> issues, but not during reads).

There is a distinction to be drawn there, though, because we immediately pass an error back to the client on a read, but a write problem in this situation can be masked for an extended period of time.

That being said...

> There's a lot of not so great things here, but I don't think there's any
> need to panic.

No reason to panic, yes.  We can assume that if this was a very big persistent problem, it would be much more widely reported.  It would, however, be good to find a way to get the error surfaced back up to the client in a way that is not just monitoring the kernel logs.

--
-- Christophe Pettus
   xof@thebuild.com



On 9 April 2018 at 05:28, Christophe Pettus <xof@thebuild.com> wrote:

> On Apr 8, 2018, at 14:23, Greg Stark <stark@mit.edu> wrote:
>
> They consider dirty filesystem buffers when there's
> hardware failure preventing them from being written "a memory leak".

That's not an irrational position.  File system buffers are *not* dedicated memory for file system caching; they're being used for that because no one has a better use for them at that moment.  If an inability to flush them to disk meant that they suddenly became pinned memory, a large copy operation to a yanked USB drive could result in the system having no more allocatable memory.  I guess in theory that they could swap them, but swapping out a file system buffer in hopes that sometime in the future it could be properly written doesn't seem very architecturally sound to me.

Yep.

Another example is a write to an NFS or iSCSI volume that goes away forever. What if the app keeps write()ing in the hopes it'll come back, and by the time the kernel starts reporting EIO for write(), it's already saddled with a huge volume of dirty writeback buffers it can't get rid of because someone, one day, might want to know about them?

You could make the argument that it's OK to forget if the entire file system goes away. But actually, why is that ok? What if it's remounted again? That'd be really bad too, for someone expecting write reliability.

You can coarsen from dirty buffer tracking to marking the FD(s) bad, but what if there's no FD to mark because the file isn't open at the moment?

You can mark the inode cache entry and pin it, I guess. But what if your app triggered I/O errors over vast numbers of small files? Again, the kernel's left holding the ball.

It doesn't know if/when an app will return to check. It doesn't know how long to remember the failure for. It doesn't know when all interested clients have been informed and it can treat the fault as cleared/repaired, either, so it'd have to *keep on reporting EIO for PostgreSQL's own writes and fsyncs() indefinitely*, even once we do recovery.

The only way it could avoid that would be to keep the dirty writeback pages around and flagged bad, then clear the flag when a new write() replaces the same file range. I can't imagine that being practical.

Blaming the kernel for this sure is the easy way out.

But IMO we cannot rationally expect the kernel to remember error state forever for us, then forget it when we expect, all without actually telling it anything about our activities or even that we still exist and are still interested in the files/writes. We've closed the files and gone away.

Whatever we do, it's likely going to have to involve not doing that anymore.

Even if we can somehow convince the kernel folks to add a new interface for us that reports I/O errors to some listener, like an inotify/fnotify/dnotify/whatever-it-is-today-notify extension reporting errors in buffered async writes, we won't be able to rely on having it for 5-10 years, and then only on Linux.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 9 April 2018 at 06:29, Bruce Momjian <bruce@momjian.us> wrote:
 

I think the big problem is that we don't have any way of stopping
Postgres at the time the kernel reports the errors to the kernel log, so
we are then returning potentially incorrect results and committing
transactions that might be wrong or lost.

Right.

Specifically, we need a way to ask the kernel at checkpoint time "was everything written to [this set of files] flushed successfully since the last time I asked, no matter who did the writing and no matter how the writes were flushed?"

If the result is "no" we PANIC and redo. If the hardware/volume is screwed, the user can fail over to a standby, do PITR, etc.

But we don't have any way to ask that reliably at present.
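The trap being discussed can be sketched in miniature. Below, a mock fsync() models the Linux behaviour described upthread: after a failed writeback, the first fsync() returns EIO and clears the error flag, so a naive checkpoint retry concludes everything is durable. All the names here are illustrative, not PostgreSQL code:

```c
#include <errno.h>
#include <stdbool.h>

/* Mock of the fsync() behaviour described upthread: a failed writeback
 * flags the page (AS_EIO); the first fsync() reports EIO and clears the
 * flag, so a retry returns success even though the data never reached
 * disk. Illustrative sketch only. */
static bool page_flagged_bad = true;   /* writeback already failed */
static bool data_on_disk     = false;  /* the write never made it  */

static int mock_fsync(void)
{
    if (page_flagged_bad)
    {
        page_flagged_bad = false;      /* error state is consumed */
        errno = EIO;
        return -1;
    }
    return 0;                          /* "success", data still lost */
}

/* The pre-fix checkpoint pattern: ERROR on the first fsync() failure,
 * retry the checkpoint, and trust the second fsync()'s return value.
 * Returns true when the retry wrongly reports the data as durable. */
static bool retry_masks_lost_write(void)
{
    if (mock_fsync() == 0)
        return false;                  /* no error: nothing to mask */
    /* ... checkpoint is retried ... */
    return (mock_fsync() == 0) && !data_on_disk;
}
```

PANIC-on-EIO sidesteps the trap by never consulting fsync() a second time on the same error state; WAL redo rewrites and re-flushes the data instead.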

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Hi,

On 2018-04-08 16:27:57 -0700, Christophe Pettus wrote:
> > On Apr 8, 2018, at 16:16, Andres Freund <andres@anarazel.de> wrote:
> > We don't panic that way when getting IO
> > errors during reads either, and they're more likely to be persistent
> > than errors during writes (because remapping on storage layer can fix
> > issues, but not during reads).
> 
> There is a distinction to be drawn there, though, because we
> immediately pass an error back to the client on a read, but a write
> problem in this situation can be masked for an extended period of
> time.

Only if you're "lucky" enough that your clients actually read that data,
and then you're somehow able to figure out across the whole stack that
these 0.001% of transactions that fail are due to IO errors. Or you also
need to do log analysis.

If you want to solve things like that you need regular reads of all your
data, including verifications etc.

Greetings,

Andres Freund


On 9 April 2018 at 07:16, Andres Freund <andres@anarazel.de> wrote:
 

I think the danger presented here is far smaller than some of the
statements in this thread might make one think.

Clearly it's not happening a huge amount or we'd have a lot of noise about Pg eating people's data, people shouting about how unreliable it is, etc. We don't. So it's not some earth shattering imminent threat to everyone's data. It's gone unnoticed, or the root cause unidentified, for a long time.

I suspect we've written off a fair few issues in the past as "it's bad hardware" when actually, the hardware fault was the trigger for a Pg/kernel interaction bug. And we've blamed containers for things that weren't really the container's fault. But even so, if it were happening in volume, we'd hear more noise.

I was already very surprised when I learned that PostgreSQL completely ignores wholly absent relfilenodes: if you unlink() a relation's backing relfilenode while Pg is down and that file has writes pending in the WAL, we merrily re-create it with uninitialized pages and go on our way. As Andres pointed out in an offlist discussion, redo isn't a consistency check, and it's not obliged to fail in such cases. We can say "well, don't do that then" and define away file losses from FS corruption etc as not our problem; the lower levels we expect to take care of this have failed.

We have to look at what checkpoints are and are not supposed to promise, and whether this is a problem we just define away as "not our problem, the lower level failed, we're not obliged to detect this and fail gracefully."

We can choose to say that checkpoints are required to guarantee crash/power loss safety ONLY and do not attempt to protect against I/O errors of any sort. In fact, I think we should likely amend the documentation for release versions to say just that.

In all likelihood, once
you've got an IO error that kernel level retries don't fix, your
database is screwed.

Your database is going to be down or have interrupted service.  It's possible you may have some unreadable data. This could result in localised damage to one or more relations. That could affect FK relationships, indexes, all sorts. If you're really unlucky you might lose something critical like pg_clog/ contents.

But in general your DB should be repairable/recoverable even in those cases.

And in many failure modes there's no reason to expect any data loss at all, like:

* Local disk fills up (seems to be safe already due to space reservation at write() time)
* Thin-provisioned storage backing a local volume, iSCSI, or paravirt block device fills up
* NFS volume fills up
* Multipath I/O error
* Interruption of connectivity to network block device
* Disk develops localized bad sector where we haven't previously written data

Except for the ENOSPC on NFS, all the rest of the cases can be handled by expecting the kernel to retry forever and not return until the block is written or we reach the heat death of the universe. And NFS, well...

Part of the trouble is that the kernel *won't* retry forever in all these cases, and doesn't seem to have a way to ask it to in all cases.

And if the user hasn't configured it for the right behaviour in terms of I/O error resilience, we don't find out about it.

So it's not the end of the world, but it'd sure be nice to fix.

Whether fsync reports that or not is really
somewhat besides the point. We don't panic that way when getting IO
errors during reads either, and they're more likely to be persistent
than errors during writes (because remapping on storage layer can fix
issues, but not during reads).

That's because reads don't make promises about what's committed and synced. I think that's quite different.
 
We should fix things so that reported errors are treated with crash
recovery, and for the rest I think there's very fair arguments to be
made that that's far outside postgres's remit.

Certainly for current versions.

I think we need to think about a more robust path in future. But it's certainly not "stop the world" territory.

The docs need an update to indicate that we explicitly disclaim responsibility for I/O errors on async writes, and that the kernel and I/O stack must be configured never to give up on buffered writes. If it does, that's not our problem anymore.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 2018-04-09 10:00:41 +0800, Craig Ringer wrote:
> I suspect we've written off a fair few issues in the past as "it'd bad
> hardware" when actually, the hardware fault was the trigger for a Pg/kernel
> interaction bug. And blamed containers for things that weren't really the
> container's fault. But even so, if it were happening tons, we'd hear more
> noise.

Agreed on that, but I think that's FAR more likely to be things like
multixacts, index structure corruption due to logic bugs etc.


> I've already been very surprised there when I learned that PostgreSQL
> completely ignores wholly absent relfilenodes. Specifically, if you
> unlink() a relation's backing relfilenode while Pg is down and that file
> has writes pending in the WAL. We merrily re-create it with uninitialized
> pages and go on our way. As Andres pointed out in an offlist discussion,
> redo isn't a consistency check, and it's not obliged to fail in such cases.
> We can say "well, don't do that then" and define away file losses from FS
> corruption etc as not our problem, the lower levels we expect to take care
> of this have failed.

And it'd be a really bad idea to behave differently.


> And in many failure modes there's no reason to expect any data loss at all,
> like:
> 
> * Local disk fills up (seems to be safe already due to space reservation at
> write() time)

That definitely should be treated separately.


> * Thin-provisioned storage backing local volume iSCSI or paravirt block
> device fills up
> * NFS volume fills up

Those should be the same as the above.


> I think we need to think about a more robust path in future. But it's
> certainly not "stop the world" territory.

I think you're underestimating the complexity of doing that by at least
two orders of magnitude.

Greetings,

Andres Freund


On 9 April 2018 at 10:06, Andres Freund <andres@anarazel.de> wrote:
 

> And in many failure modes there's no reason to expect any data loss at all,
> like:
>
> * Local disk fills up (seems to be safe already due to space reservation at
> write() time)

That definitely should be treated separately.

It is, because all the FSes I looked at reserve space before returning from write(), even if they do delayed allocation. So they won't fail with ENOSPC at fsync() time or silently due to lost errors on background writeback. Otherwise we'd be hearing a LOT more noise about this.
 
> * Thin-provisioned storage backing local volume iSCSI or paravirt block
> device fills up
> * NFS volume fills up

Those should be the same as the above.

Unfortunately, they aren't.

AFAICS NFS doesn't reserve space with the other end before returning from write(), even if mounted with the sync option. So we can get ENOSPC lazily when the buffer writeback fails due to a full backing file system. This then travels the same paths as EIO: we fsync(), ERROR, retry, appear to succeed, and carry on with life losing the data. Or we never hear about the error in the first place.

(There's a proposed extension that'd allow this, see https://tools.ietf.org/html/draft-iyer-nfsv4-space-reservation-ops-02#page-5, but I see no mention of it in fs/nfs. All the reserve_space / xdr_reserve_space stuff seems to be related to space in protocol messages at a quick read.)

Thin provisioned storage could vary a fair bit depending on the implementation. But the specific failure case I saw, prompting this thread, was on a volume using the stack:

xfs -> lvm2 -> multipath -> ??? -> SAN

(the HBA/iSCSI/whatever was not recorded by the looks, but IIRC it was iSCSI. I'm checking.)

The SAN ran out of space. Due to use of thin provisioning, Linux *thought* there was plenty of space on the volume; LVM thought it had plenty of physical extents free and unallocated, XFS thought there was tons of free space, etc. The space exhaustion manifested as I/O errors on flushes of writeback buffers.

The logs were like this:

kernel: sd 2:0:0:1: [sdd] Unhandled sense code
kernel: sd 2:0:0:1: [sdd]   
kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 2:0:0:1: [sdd]   
kernel: Sense Key : Data Protect [current] 
kernel: sd 2:0:0:1: [sdd]   
kernel: Add. Sense: Space allocation failed write protect
kernel: sd 2:0:0:1: [sdd] CDB: 
kernel: Write(16): **HEX-DATA-CUT-OUT**
kernel: Buffer I/O error on device dm-0, logical block 3098338786 
kernel: lost page write due to I/O error on dm-0
kernel: Buffer I/O error on device dm-0, logical block 3098338787 

The immediate cause was that Linux's multipath driver didn't seem to recognise the sense code as retryable, so it gave up and reported it to the next layer up (LVM). LVM and XFS both seem to think that the lower layer is responsible for retries, so they toss the write away, and tell any interested writers if they feel like it, per discussion upthread.

In this case Pg did get the news and reported fsync() errors on checkpoints, but it only reported an error once per relfilenode. Once it ran out of failed relfilenodes to cause the checkpoint to ERROR, it "completed" a "successful" checkpoint and kept on running until the resulting corruption started to manifest itself and it segfaulted some time later. As we've now learned, there's no guarantee we'd even get the news about the I/O errors at all.

WAL was on a separate volume that didn't run out of room immediately, so we didn't PANIC on WAL write failure and prevent the issue.

In this case if Pg had PANIC'd (and been able to guarantee to get the news of write failures reliably), there'd have been no corruption and no data loss despite the underlying storage issue.

If, prior to seeing this, you'd asked me "will my PostgreSQL database be corrupted if my thin-provisioned volume runs out of space" I'd have said "Surely not. PostgreSQL won't be corrupted by running out of disk space, it orders writes carefully and forces flushes so that it will recover gracefully from write failures."

Except not. I was very surprised.

BTW, it also turns out that the *default* for multipath is to give up on errors anyway; see the queue_if_no_path and no_path_retry options. (Hint: run PostgreSQL with no_path_retry=queue.) That's a sane default if you use O_DIRECT|O_SYNC, and otherwise pretty much a data-eating setup.
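For reference, the queue-forever behaviour would look something like the fragment below. Treat it as a sketch and check it against your multipath-tools version's documentation; it can also be set per-device rather than globally:

```
# multipath.conf fragment (illustrative; verify against your
# multipath-tools version). Queue I/O indefinitely when all paths
# fail, rather than erroring out writes Pg may never hear about.
defaults {
    no_path_retry    queue
}
```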


I regularly see rather a lot of multipath systems, iSCSI systems, SAN backed systems, etc. I think we need to be pretty clear that we expect them to retry indefinitely, and if they report an I/O error we cannot reliably handle it. We need to patch Pg to PANIC on any fsync() failure and document that Pg won't notice some storage failure modes that might otherwise be considered nonfatal or transient, so very specific storage configuration and testing is required. (Not that anyone will do it).  Also warn against running on NFS even with "hard,sync,nointr".

It'd be interesting to have a tool that tested error handling, allowing people to do iSCSI plug-pull tests, that sort of thing. But as far as I can tell nobody ever tests their storage stack anyway, so I don't plan on writing something that'll never get used.
  
> I think we need to think about a more robust path in future. But it's
> certainly not "stop the world" territory.

I think you're underestimating the complexity of doing that by at least
two orders of magnitude.

Oh, it's just a minor total rewrite of half Pg, no big deal ;) 

I'm sure that no matter how big I think it is, I'm still underestimating it.

The most workable option IMO would be some sort of fnotify/dnotify/whatever that reports all I/O errors on a volume. Some kind of error reporting handle we can keep open on a volume level that we can check for each volume/tablespace after we fsync() everything to see if it all really worked. If we PANIC if that gives us a bad answer, and PANIC on fsync errors, we guard against the great majority of these sorts of should-be-transient-if-the-kernel-didn't-give-up-and-throw-away-our-data errors.

Even then, good luck getting those events from an NFS volume in which the backing volume experiences an issue.

And it's kind of moot because AFAICS no such interface exists.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 8 April 2018 at 22:47, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:
>> On 8 April 2018 at 04:27, Craig Ringer <craig@2ndquadrant.com> wrote:
>> > On 8 April 2018 at 10:16, Thomas Munro <thomas.munro@enterprisedb.com>
>
> The question is, what should the kernel and application do in cases
> where this is simply not possible (according to freebsd that keeps
> dirty pages around after failure, for example, -EIO from the block
> layer is a contract for unrecoverable errors so it is pointless to
> keep them dirty). You'd need a specialized interface to clear-out
> the errors (and drop the dirty pages), or potentially just remount
> the filesystem.

Well firstly that's not necessarily the question. ENOSPC is not an
unrecoverable error. And even unrecoverable errors for a single write
doesn't mean the write will never be able to succeed in the future.
But secondly doesn't such an interface already exist? When the device
is dropped any dirty pages already get dropped with it. What's the
point in dropping them but keeping the failing device?

But just to underline the point. "pointless to keep them dirty" is
exactly backwards from the application's point of view. If the error
writing to persistent media really is unrecoverable then it's all the
more critical that the pages be kept so the data can be copied to some
other device. The last thing user space expects to happen is if the
data can't be written to persistent storage then also immediately
delete it from RAM. (And the *really* last thing user space expects is
for this to happen and return no error.)

-- 
greg


On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote:
> On 8 April 2018 at 22:47, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> > On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:
> >> On 8 April 2018 at 04:27, Craig Ringer <craig@2ndquadrant.com> wrote:
> >> > On 8 April 2018 at 10:16, Thomas Munro <thomas.munro@enterprisedb.com>
> >
> > The question is, what should the kernel and application do in cases
> > where this is simply not possible (according to freebsd that keeps
> > dirty pages around after failure, for example, -EIO from the block
> > layer is a contract for unrecoverable errors so it is pointless to
> > keep them dirty). You'd need a specialized interface to clear-out
> > the errors (and drop the dirty pages), or potentially just remount
> > the filesystem.
> 
> Well firstly that's not necessarily the question. ENOSPC is not an
> unrecoverable error. And even unrecoverable errors for a single write
> doesn't mean the write will never be able to succeed in the future.

To make things a bit simpler, let us focus on EIO for the moment.
The contract between the block layer and the filesystem layer is
assumed to be that of, when an EIO is propagated up to the fs,
then you may assume that all possibilities for recovering have
been exhausted in lower layers of the stack. Mind you, I am not
claiming that this contract is either documented or necessarily
respected (in fact there have been studies on the error propagation
and handling of the block layer, see [1]). Let us assume that
this is the design contract though (which appears to be the case
across a number of open-source kernels), and if not - it's a bug.
In this case, indeed the specific write()s will never be able
to succeed in the future, at least not as long as the BIOs are
allocated to the specific failing LBAs.

> But secondly doesn't such an interface already exist? When the device
> is dropped any dirty pages already get dropped with it. What's the
> point in dropping them but keeping the failing device?

I think there are degrees of failure. There are certainly cases
where one may encounter localized unrecoverable medium errors
(specific to certain LBAs) that are non-maskable from the block
layer and below. That does not mean that the device is dropped
at all, so it does make sense to continue all other operations
to all other regions of the device that are functional. In cases
of total device failure, then the filesystem will prevent you
from proceeding anyway.

> But just to underline the point. "pointless to keep them dirty" is
> exactly backwards from the application's point of view. If the error
> writing to persistent media really is unrecoverable then it's all the
> more critical that the pages be kept so the data can be copied to some
> other device. The last thing user space expects to happen is if the
> data can't be written to persistent storage then also immediately
> delete it from RAM. (And the *really* last thing user space expects is
> for this to happen and return no error.)

Right. This implies though that apart from the kernel having
to keep around the dirtied-but-unrecoverable pages for an
unbounded time, that there's further an interface for obtaining
the exact failed pages so that you can read them back. This in
turn means that there needs to be an association between the
fsync() caller and the specific dirtied pages that the caller
intends to drain (for which we'd need an fsync_range(), among
other things). BTW, currently the failed writebacks are not
dropped from memory, but rather marked clean. They could be
lost though due to memory pressure or due to explicit request
(e.g. proc drop_caches), unless mlocked.

There is a clear responsibility of the application to keep
its buffers around until a successful fsync(). The kernels
do report the error (albeit with all the complexities of
dealing with the interface), at which point the application
may not assume that the write()s were ever even buffered
in the kernel page cache in the first place.
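The responsibility described here, keeping your buffers around until a successful fsync(), amounts to a write discipline like the following sketch (a hypothetical helper for illustration, not a proposal for Pg's md.c):

```c
#include <fcntl.h>
#include <unistd.h>

/* Sketch of "the application keeps its buffers until a successful
 * fsync()": on any failure the caller still owns buf and can rewrite
 * it, possibly to a different device, instead of trusting the kernel
 * page cache to retry. Hypothetical helper, illustration only. */
static ssize_t durable_write(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);

    if (n < 0 || (size_t) n != len)
        return -1;                 /* short write: caller retries from buf */
    if (fsync(fd) != 0)
        return -1;                 /* not durable: caller still holds buf */
    return n;                      /* only now may the caller discard buf */
}
```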

What you seem to be asking for is the capability of dropping
buffers over the (kernel) fence and indemnifying the application
from any further responsibility, i.e. a hard assurance
that either the kernel will persist the pages or it will
keep them around till the application recovers them
asynchronously, the filesystem is unmounted, or the system
is rebooted.

Best regards,
Anthony

[1] https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi.pdf


On 9 April 2018 at 11:50, Anthony Iliopoulos <ailiop@altatus.com> wrote:
What you seem to be asking for is the capability of dropping
buffers over the (kernel) fence and indemnifying the application
from any further responsibility, i.e. a hard assurance
that either the kernel will persist the pages or it will
keep them around till the application recovers them
asynchronously, the filesystem is unmounted, or the system
is rebooted.

That seems like a perfectly reasonable position to take, frankly.

The whole _point_ of an Operating System should be that you can do exactly that. As a developer I should be able to call write() and fsync() and know that if both calls have succeeded then the result is on disk, no matter what another application has done in the meantime. If that's a "difficult" problem then that's the OS's problem, not mine. If the OS doesn't do that, it's _not_doing_its_job_.

Geoff
On 9 April 2018 at 18:50, Anthony Iliopoulos <ailiop@altatus.com> wrote:

There is a clear responsibility of the application to keep
its buffers around until a successful fsync(). The kernels
do report the error (albeit with all the complexities of
dealing with the interface), at which point the application
may not assume that the write()s were ever even buffered
in the kernel page cache in the first place.



 
What you seem to be asking for is the capability of dropping
buffers over the (kernel) fence and indemnifying the application
from any further responsibility, i.e. a hard assurance
that either the kernel will persist the pages or it will
keep them around till the application recovers them
asynchronously, the filesystem is unmounted, or the system
is rebooted.

That's what Pg appears to assume now, yes.
 
Whether that's reasonable is a whole different topic.

I'd like a middle ground where the kernel lets us register our interest and tells us if it lost something, without us having to keep eight million FDs open for some long period. "Tell us about anything that happens under pgdata/" or an inotify-style per-directory-registration option. I'd even say that's ideal.

In the mean time, I propose that we fsync() on close() before we age FDs out of the LRU on backends. Yes, that will hurt throughput and cause stalls, but we don't seem to have many better options. At least it'll only flush what we actually wrote to the OS buffers not what we may have in shared_buffers. If the bgwriter does the same thing, we should be 100% safe from this problem on 4.13+, and it'd be trivial to make it a GUC much like the fsync or full_page_writes options that people can turn off if they know the risks / know their storage is safe / don't care.
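The fsync-before-close rule for aging FDs out of the LRU could look roughly like this (a sketch with a hypothetical helper name, not the actual md.c change):

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Sketch of "fsync() on close() before we age FDs out of the LRU":
 * never drop a written-to descriptor without a flush, so any pending
 * writeback error is reported to us rather than silently discarded
 * along with the FD. Hypothetical helper, not PostgreSQL code. */
static int close_with_fsync(int fd)
{
    if (fsync(fd) != 0)
    {
        int save_errno = errno;    /* caller decides what to do: PANIC */
        (void) close(fd);
        errno = save_errno;
        return -1;
    }
    return close(fd);
}
```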

Some keen person who wants to later could optimise it by adding a fsync worker thread pool in backends, so we don't block the main thread. Frankly that might be a nice thing to have in the checkpointer anyway. But it's out of scope for fixing this in durability terms.

I'm partway through a patch that makes fsync panic on errors now. Once that's done, the next step will be to force fsync on close() in md and see how we go with that.

Thoughts?

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Apr 09, 2018 at 01:03:28PM +0100, Geoff Winkless wrote:
> On 9 April 2018 at 11:50, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> 
> > What you seem to be asking for is the capability of dropping
> > buffers over the (kernel) fence and indemnifying the application
> > from any further responsibility, i.e. a hard assurance
> > that either the kernel will persist the pages or it will
> > keep them around till the application recovers them
> > asynchronously, the filesystem is unmounted, or the system
> > is rebooted.
> >
> 
> That seems like a perfectly reasonable position to take, frankly.

Indeed, as long as you are willing to ignore the consequences of
this design decision: mainly, how you would recover memory when no
application is interested in clearing the error. At which point
other applications with different priorities will find this position
rather unreasonable since there can be no way out of it for them.
Good luck convincing any OS kernel upstream to go with this design.

> The whole _point_ of an Operating System should be that you can do exactly
> that. As a developer I should be able to call write() and fsync() and know
> that if both calls have succeeded then the result is on disk, no matter
> what another application has done in the meantime. If that's a "difficult"
> problem then that's the OS's problem, not mine. If the OS doesn't do that,
> it's _not_doing_its_job_.

No OS kernel that I know of provides any promises for atomicity of a
write()+fsync() sequence, unless one is using O_SYNC. It doesn't
provide you with isolation either, as this is delegated to userspace,
where processes that share a file should coordinate accordingly.

It's not a difficult problem, but rather the kernels provide a common
denominator of possible interfaces and designs that could accommodate
a wider range of potential application scenarios for which the kernel
cannot possibly anticipate requirements. There have been plenty of
experimental works for providing a transactional (ACID) filesystem
interface to applications. On the opposite end, there have been quite
a few commercial databases that completely bypass the kernel storage
stack. But I would assume it is reasonable to figure out something
between those two extremes that can work in a "portable" fashion.

Best regards,
Anthony


On Mon, Apr 09, 2018 at 08:16:38PM +0800, Craig Ringer wrote:
> 
> I'd like a middle ground where the kernel lets us register our interest and
> tells us if it lost something, without us having to keep eight million FDs
> open for some long period. "Tell us about anything that happens under
> pgdata/" or an inotify-style per-directory-registration option. I'd even
> say that's ideal.

I see what you are saying. So basically you'd always maintain the
notification descriptor open, where the kernel would inject events
related to writeback failures of files under watch (potentially
enriched to contain info regarding the exact failed pages and
the file offset they map to). The kernel wouldn't even have to
maintain per-page bits to trace the errors, since they will be
consumed by the process that reads the events (or discarded,
when the notification fd is closed).

Assuming this would be possible, wouldn't Pg still need to deal
with synchronizing writers and related issues (since this would
be merely a notification mechanism and would not prevent any
process from continuing)? I understand that would be rather
intrusive for the current Pg multi-process design.

But other than that, this interface could in principle be
implemented in the BSDs via kqueue(), I suppose, to provide
what you need.

Best regards,
Anthony


On 04/09/2018 02:31 PM, Anthony Iliopoulos wrote:
> On Mon, Apr 09, 2018 at 01:03:28PM +0100, Geoff Winkless wrote:
>> On 9 April 2018 at 11:50, Anthony Iliopoulos <ailiop@altatus.com> wrote:
>>
>>> What you seem to be asking for is the capability of dropping
>>> buffers over the (kernel) fence and indemnifying the application
>>> from any further responsibility, i.e. a hard assurance
>>> that either the kernel will persist the pages or it will
>>> keep them around till the application recovers them
>>> asynchronously, the filesystem is unmounted, or the system
>>> is rebooted.
>>>
>>
>> That seems like a perfectly reasonable position to take, frankly.
> 
> Indeed, as long as you are willing to ignore the consequences of
> this design decision: mainly, how you would recover memory when no
> application is interested in clearing the error. At which point
> other applications with different priorities will find this position
> rather unreasonable since there can be no way out of it for them.

Sure, but the question is whether the system can reasonably operate
after some of the writes failed and the data got lost. Because if it
can't, then recovering the memory is rather useless. It might be better
to stop the system in that case, forcing the system administrator to
resolve the issue somehow (fail-over to a replica, perform recovery from
the last checkpoint, ...).

We already have dirty_bytes and dirty_background_bytes, for example. I
don't see why there couldn't be another limit defining how much dirty
data to allow before blocking writes altogether. I'm sure it's not that
simple, but you get the general idea - do not allow using all available
memory because of writeback issues, but don't throw the data away in
case it's just a temporary issue.
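The knobs mentioned above already exist as sysctls; the additional "block writes on unresolved writeback errors" limit is hypothetical. A sketch of the existing half, with purely illustrative values:

```
# /etc/sysctl.conf fragment (illustrative values). These bound how much
# dirty data accumulates before background writeback kicks in and before
# writers block; the error-aware limit discussed above does not exist.
vm.dirty_background_bytes = 67108864    # begin background writeback at 64 MB
vm.dirty_bytes = 268435456              # block writers at 256 MB dirty
```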

> Good luck convincing any OS kernel upstream to go with this design.
> 

Well, there seem to be kernels that seem to do exactly that already. At
least that's how I understand what this thread says about FreeBSD and
Illumos, for example. So it's not an entirely insane design, apparently.

The question is whether the current design makes it any easier for
user-space developers to build reliable systems. We have tried using it,
and unfortunately the answer seems to be "no" and "Use direct I/O and
manage everything on your own!"

>> The whole _point_ of an Operating System should be that you can do exactly
>> that. As a developer I should be able to call write() and fsync() and know
>> that if both calls have succeeded then the result is on disk, no matter
>> what another application has done in the meantime. If that's a "difficult"
>> problem then that's the OS's problem, not mine. If the OS doesn't do that,
>> it's _not_doing_its_job_.
> 
> No OS kernel that I know of provides any promises for atomicity of a
> write()+fsync() sequence, unless one is using O_SYNC. It doesn't
> provide you with isolation either, as this is delegated to userspace,
> where processes that share a file should coordinate accordingly.
> 

We can (and do) take care of the atomicity and isolation. Implementation
of those parts is obviously very application-specific, and we have WAL
and locks for that purpose. I/O on the other hand seems to be a generic
service provided by the OS - at least that's how we saw it until now.

> It's not a difficult problem, but rather the kernels provide a common
> denominator of possible interfaces and designs that could accommodate
> a wider range of potential application scenarios for which the kernel
> cannot possibly anticipate requirements. There have been plenty of
> experimental works for providing a transactional (ACID) filesystem
> interface to applications. On the opposite end, there have been quite
> a few commercial databases that completely bypass the kernel storage
> stack. But I would assume it is reasonable to figure out something
> between those two extremes that can work in a "portable" fashion.
> 

Users ask us about this quite often, actually. The question is usually
about "RAW devices" and performance, but ultimately it boils down to
buffered vs. direct I/O. So far our answer has been that we rely on the
kernel to do this reliably, because kernel developers know how to do it
correctly and we simply don't have the manpower to implement it ourselves
(portably, reliably, handling different types of storage, ...).

One has to wonder how many applications actually use this correctly,
considering PostgreSQL cares about data durability/consistency so much
and yet we've been misunderstanding how it works for 20+ years.
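The misunderstanding is easy to reproduce in miniature. The following simulation (a fake kernel standing in for the clear-on-report behavior described at the top of this thread, not real kernel code) shows a retry-on-EIO checkpointer declaring success even though nothing ever reached disk:

```python
import errno

class Kernel:
    """Toy model of the behavior under discussion: reporting a
    writeback error via fsync() clears the AS_EIO flag."""

    def __init__(self):
        self.error_pending = False
        self.on_disk = False

    def write(self, ok=True):
        if ok:
            self.on_disk = True
        else:
            self.error_pending = True    # async writeback failed

    def fsync(self):
        if self.error_pending:
            self.error_pending = False   # the report clears the flag!
            raise OSError(errno.EIO, "Input/output error")

def checkpoint_with_retry(kernel, attempts=2):
    """The unsafe pattern: treat EIO as transient and retry the fsync()."""
    for _ in range(attempts):
        try:
            kernel.fsync()
            return True                  # checkpoint "completed"
        except OSError:
            continue
    return False
```

The second fsync() finds no flagged pages and returns success, so the checkpoint advances the redo pointer past data that was never persisted. The only safe response to the first EIO is to PANIC.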

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On 04/09/2018 12:29 AM, Bruce Momjian wrote:
> 
> A crazy idea would be to have a daemon that checks the logs and
> stops Postgres when it sees something wrong.
> 

That doesn't seem like a very practical approach. It's better than
nothing, of course, but I wonder how that would work with containers
(where I think you may not have access to the kernel log at all). Also,
I'm pretty sure the messages change based on kernel version (and
possibly filesystem), so parsing them reliably seems rather difficult.
And we probably don't want to PANIC after an I/O error on an unrelated
device, so we'd need to understand which devices are related to PostgreSQL.
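That last point (knowing which devices matter) is at least mechanically easy for such a hypothetical daemon to start on: stat the data directory and each tablespace, and compare device numbers against the device named in the kernel message. A minimal sketch:

```python
import os

def device_of(path):
    """Return the (major, minor) device numbers the given path lives on."""
    st = os.stat(path)
    return os.major(st.st_dev), os.minor(st.st_dev)
```

Matching a message like "I/O error, dev sdb" would still require translating names through /sys/dev/block/<major>:<minor>, and none of this solves the log-access or message-stability problems above.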

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


At 2018-04-09 15:42:35 +0200, tomas.vondra@2ndquadrant.com wrote:
>
> On 04/09/2018 12:29 AM, Bruce Momjian wrote:
> > 
> > A crazy idea would be to have a daemon that checks the logs and
> > stops Postgres when it sees something wrong.
> > 
> 
> That doesn't seem like a very practical approach.

Not least because Craig's tests showed that you can't rely on *always*
getting an error message in the logs.

-- Abhijit


On 04/09/2018 04:00 AM, Craig Ringer wrote:
> On 9 April 2018 at 07:16, Andres Freund <andres@anarazel.de
> <mailto:andres@anarazel.de>> wrote:
>  
> 
> 
>     I think the danger presented here is far smaller than some of the
>     statements in this thread might make one think.
> 
> 
> Clearly it's not happening a huge amount or we'd have a lot of noise
> about Pg eating people's data, people shouting about how unreliable it
> is, etc. We don't. So it's not some earth shattering imminent threat to
> everyone's data. It's gone unnoticed, or the root cause unidentified,
> for a long time.
> 

Yeah, it clearly isn't the case that everything we do suddenly got
pointless. It's fairly annoying, though.

> I suspect we've written off a fair few issues in the past as "it's
> bad hardware" when actually, the hardware fault was the trigger for
> a Pg/kernel interaction bug. And blamed containers for things that
> weren't really the container's fault. But even so, if it were
> happening tons, we'd hear more noise.
> 

Right. Write errors are fairly rare, and we've probably ignored a fair
number of cases demonstrating this issue. It kinda reminds me of the
wisdom that not seeing planes with bullet holes in the engine does not
mean engines don't need armor [1].

[1]
https://medium.com/@penguinpress/an-excerpt-from-how-not-to-be-wrong-by-jordan-ellenberg-664e708cfc3d



regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>
> We already have dirty_bytes and dirty_background_bytes, for example. I
> don't see why there couldn't be another limit defining how much dirty
> data to allow before blocking writes altogether. I'm sure it's not that
> simple, but you get the general idea - do not allow using all available
> memory because of writeback issues, but don't throw the data away in
> case it's just a temporary issue.

Sure, there could be knobs for limiting how much memory such "zombie"
pages may occupy. Not sure how helpful it would be in the long run
since this tends to be highly application-specific, and for something
with a large data footprint one would end up tuning this accordingly
in a system-wide manner. This has the potential to leave other
applications running in the same system with very little memory, in
cases where, for example, the original application crashes and never
clears the error. Apart from that, further interfaces would need to be
provided
for actually dealing with the error (again assuming non-transient
issues that may not be fixed transparently and that temporary issues
are taken care of by lower layers of the stack).

> Well, there seem to be kernels that seem to do exactly that already. At
> least that's how I understand what this thread says about FreeBSD and
> Illumos, for example. So it's not an entirely insane design, apparently.

It is reasonable, but even FreeBSD has a big fat comment right
there (since 2017), mentioning that there can be no recovery from
EIO at the block layer and this needs to be done differently. No
idea how an application running on top of either FreeBSD or Illumos
would actually recover from this error (and clear it out), other
than remounting the fs in order to force dropping of the relevant
pages. It does, however, provide a persistent error indication that
would allow Pg to simply and reliably panic. But again, this does
not necessarily play well with other applications that may be using
the filesystem reliably at the same time, and are now faced with
EIO even though their own writes were successfully persisted.

Ideally, you'd want a (potentially persistent) indication of error
localized to a file region (mapping the corresponding failed writeback
pages). NetBSD already implements fsync_range(), which could
be a step in the right direction.
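To illustrate what range-localized error state buys you, here is a toy model (a simulation only — this is NOT how NetBSD's fsync_range() works internally; the error bookkeeping below is purely hypothetical):

```python
import errno

class RangeErrors:
    """Toy model of per-region writeback error state: one writer's
    failure does not poison an unrelated region of the same file."""

    def __init__(self):
        self.failed = []    # half-open byte ranges whose writeback failed

    def record_failure(self, start, end):
        self.failed.append((start, end))

    def fsync_range(self, start, end):
        # Report (and clear) only failures overlapping the requested range.
        hits = [(s, e) for (s, e) in self.failed if s < end and start < e]
        if hits:
            self.failed = [r for r in self.failed if r not in hits]
            raise OSError(errno.EIO, "writeback failed in %r" % hits)
```

With this, a checkpointer syncing its own range gets its own error, while other writers to disjoint ranges of the same file are unaffected.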

> One has to wonder how many applications actually use this correctly,
> considering PostgreSQL cares about data durability/consistency so much
> and yet we've been misunderstanding how it works for 20+ years.

I would expect very few, potentially those that have a very simple
process model (e.g. embedded DBs that can abort a txn on fsync()
EIO). I think durability is a rather complex cross-layer issue
which has been similarly misunderstood in the past (e.g. see [1]).
It seems that both the OS and DB communities would greatly benefit
from a periodic reality check, and I see this as an opportunity for
strengthening the I/O stack in an end-to-end manner.

Best regards,
Anthony

[1] https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf


On 9 April 2018 at 15:22, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>>
> Sure, there could be knobs for limiting how much memory such "zombie"
> pages may occupy. Not sure how helpful it would be in the long run
> since this tends to be highly application-specific, and for something
> with a large data footprint one would end up tuning this accordingly
> in a system-wide manner.

Surely this is exactly what the kernel is there to manage. It has to
control how much memory is allowed to be full of dirty buffers in the
first place to ensure that the system won't get memory starved if it
can't clean them fast enough. That isn't even about persistent
hardware errors. Even when the hardware is working perfectly it can
only flush buffers so fast.  The whole point of the kernel is to
abstract away shared resources. It's not like user space has any
better view of the situation here. If Postgres implemented all this
with O_DIRECT it would have exactly the same problem, only with less
visibility into what the rest of the system is doing. If every
application implemented its own buffer cache, we would be back in the
same boat, only with fragmented memory allocation.

> This has the potential to leave other
> applications running in the same system with very little memory, in
> cases where for example original application crashes and never clears
> the error.

I still think we're speaking two different languages. There's no
application anywhere that's going to "clear the error". The
application has done the writes and if it's calling fsync it wants to
wait until the filesystem can arrange for the write to be persisted.
If the application could manage without the persistence then it
wouldn't have called fsync.

The only way to "clear out" the error would be by having the writes
succeed. There's no reason to think that wouldn't be possible
sometime. The filesystem could remap blocks or an administrator could
replace degraded raid device components. The only thing Postgres could
do to recover would be create a new file and move the data (reading
from the dirty buffer in memory!) to a new file anyways so we would
"clear the error" by just no longer calling fsync on the old file.

We always read fsync as a simple write barrier. That's what the
documentation promised and it's what Postgres always expected. It
sounds like the kernel implementors looked at it as some kind of
communication channel for reporting the status of specific writes
back to user space. That's a much more complex problem and would have
an entirely different interface. I think this is why we're having so
much difficulty communicating.



> It is reasonable, but even FreeBSD has a big fat comment right
> there (since 2017), mentioning that there can be no recovery from
> EIO at the block layer and this needs to be done differently. No
> idea how an application running on top of either FreeBSD or Illumos
> would actually recover from this error (and clear it out), other
> than remounting the fs in order to force dropping of the relevant
> pages. It does, however, provide a persistent error indication that
> would allow Pg to simply and reliably panic. But again, this does
> not necessarily play well with other applications that may be using
> the filesystem reliably at the same time, and are now faced with
> EIO even though their own writes were successfully persisted.

Well if they're writing to the same file that had a previous error I
doubt there are many applications that would be happy to consider
their writes "persisted" when the file was corrupt. Ironically, the
earlier quoted discussion talked about how applications that wanted
more granular communication would be using O_DIRECT -- but what we
have is fsync trying to be *too* granular such that it's impossible to
get any strong guarantees about anything with it.

>> One has to wonder how many applications actually use this correctly,
>> considering PostgreSQL cares about data durability/consistency so much
>> and yet we've been misunderstanding how it works for 20+ years.
>
> I would expect it would be very few, potentially those that have
> a very simple process model (e.g. embedded DBs that can abort a
> txn on fsync() EIO).

Honestly, I don't think there's *any* way to use the current interface
to implement reliable operation. Even an embedded database using a
single process and keeping every file open all the time (which means
file descriptor limits cap its scalability) can suffer silent
corruption whenever some other process, like a backup program, comes
along and calls fsync (or even sync?).
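The "backup program steals the error" scenario, simulated (a toy model of the one-flag-per-inode behavior, where reporting the error also clears it for everyone else):

```python
import errno

class Inode:
    """One shared error flag per inode -- the pre-errseq_t model."""
    def __init__(self):
        self.error_pending = False

def fsync_via(inode):
    """Whichever opener calls fsync() first reports -- and clears --
    the shared error, stealing it from all other openers."""
    if inode.error_pending:
        inode.error_pending = False
        raise OSError(errno.EIO, "Input/output error")
```

The database did everything right, yet the error was delivered to a process that has no idea what to do with it, and the database's own fsync() then reports clean.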



-- 
greg


On Mon, Apr 9, 2018 at 8:16 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> In the mean time, I propose that we fsync() on close() before we age FDs out
> of the LRU on backends. Yes, that will hurt throughput and cause stalls, but
> we don't seem to have many better options. At least it'll only flush what we
> actually wrote to the OS buffers not what we may have in shared_buffers. If
> the bgwriter does the same thing, we should be 100% safe from this problem
> on 4.13+, and it'd be trivial to make it a GUC much like the fsync or
> full_page_writes options that people can turn off if they know the risks /
> know their storage is safe / don't care.
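For concreteness, the quoted proposal amounts to something like the following fd-cache sketch (Python for brevity; the capacity, flags, and exception-based "PANIC" are illustrative — the real logic would live in PostgreSQL's fd.c):

```python
import os
from collections import OrderedDict

class FdCache:
    """Flush-before-evict fd LRU: never let a descriptor go without
    first forcing its writes (and any pending error) to surface."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.fds = OrderedDict()            # path -> fd, oldest first

    def get(self, path):
        if path in self.fds:
            self.fds.move_to_end(path)      # mark as most recently used
            return self.fds[path]
        if len(self.fds) >= self.capacity:
            _, old_fd = self.fds.popitem(last=False)
            os.fsync(old_fd)                # flush before losing the fd;
            os.close(old_fd)                # a failure here must PANIC
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        self.fds[path] = fd
        return fd
```

The cost is exactly the stalls discussed in the replies: every eviction now waits on storage.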

Ouch.  If a process exits -- say, because the user typed \q into psql
-- then you're talking about potentially calling fsync() on a really
large number of file descriptors, flushing many gigabytes of data to
disk.  And it may well be that you never actually wrote any data to
any of those file descriptors -- those writes could have come from
other backends.  Or you may have written a little bit of data through
those FDs, but there could be lots of other data that you end up
flushing incidentally.  Perfectly innocuous things like starting up a
backend, running a few short queries, and then having that backend
exit suddenly turn into something that could have a massive
system-wide performance impact.

Also, if a backend ever manages to exit without running through this
code, or writes any dirty blocks afterward, then this still fails to
fix the problem completely.  I guess that's probably avoidable -- we
can put this late in the shutdown sequence and PANIC if it fails.

I have a really tough time believing this is the right way to solve
the problem.  We suffered for years because of ext3's desire to flush
the entire page cache whenever any single file was fsync()'d, which
was terrible.  Eventually ext4 became the norm, and the problem went
away.  Now we're going to deliberately insert logic to do a very
similar kind of terrible thing because the kernel developers have
decided that fsync() doesn't have to do what it says on the tin?  I
grant that there doesn't seem to be a better option, but I bet we're
going to have a lot of really unhappy users if we do this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On 04/09/2018 09:45 AM, Robert Haas wrote:
> On Mon, Apr 9, 2018 at 8:16 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> In the mean time, I propose that we fsync() on close() before we age FDs out
>> of the LRU on backends. Yes, that will hurt throughput and cause stalls, but
>> we don't seem to have many better options. At least it'll only flush what we
>> actually wrote to the OS buffers not what we may have in shared_buffers. If
>> the bgwriter does the same thing, we should be 100% safe from this problem
>> on 4.13+, and it'd be trivial to make it a GUC much like the fsync or
>> full_page_writes options that people can turn off if they know the risks /
>> know their storage is safe / don't care.
> I have a really tough time believing this is the right way to solve
> the problem.  We suffered for years because of ext3's desire to flush
> the entire page cache whenever any single file was fsync()'d, which
> was terrible.  Eventually ext4 became the norm, and the problem went
> away.  Now we're going to deliberately insert logic to do a very
> similar kind of terrible thing because the kernel developers have
> decided that fsync() doesn't have to do what it says on the tin?  I
> grant that there doesn't seem to be a better option, but I bet we're
> going to have a lot of really unhappy users if we do this.

I don't have a better option, but whatever we do, it should be an
optional (GUC) change. We have had plenty of YEARS of people not
noticing this issue, and Robert's correct: if we go back to an era of
things like stalls, it is going to look bad on us no matter how we
describe the problem.

Thanks,

JD


-- 
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
*****     Unless otherwise stated, opinions are my own.   *****



On 09. 04. 2018 15:42, Tomas Vondra wrote:
> On 04/09/2018 12:29 AM, Bruce Momjian wrote:
>> A crazy idea would be to have a daemon that checks the logs and
>> stops Postgres when it sees something wrong.
>>
> That doesn't seem like a very practical approach. It's better than
> nothing, of course, but I wonder how that would work with containers
> (where I think you may not have access to the kernel log at all).
> Also, I'm pretty sure the messages change based on kernel version
> (and possibly filesystem), so parsing them reliably seems rather
> difficult. And we probably don't want to PANIC after an I/O error on
> an unrelated device, so we'd need to understand which devices are
> related to PostgreSQL.
>
> regards
>

For a slightly less (or more) crazy idea, I'd imagine that creating a
Linux kernel module with a kprobe/kretprobe capturing the file passed
to fsync (or even the byte range within the file) and the corresponding
return value shouldn't be that hard. Kprobes have been part of the
Linux kernel for a really long time, and at first glance it seems like
they could be backported to 2.6 too.

Then you could have stable log messages, or implement some kind of
"fsync error notification" via whatever is the sanest way to get this
out of the kernel.

If the kernel is new enough and has eBPF support (seems like >=4.4),
using bcc-tools[1] should enable you to write a quick script to get
exactly that info via perf events[2].

Obviously, that's a stopgap solution ...


Kind regards,
Gasper


[1] https://github.com/iovisor/bcc
[2]
https://blog.yadutaf.fr/2016/03/30/turn-any-syscall-into-event-introducing-ebpf-kernel-probes/


> On Apr 9, 2018, at 10:26 AM, Joshua D. Drake <jd@commandprompt.com> wrote:

> We have plenty of YEARS of people not noticing this issue

I disagree.  I have noticed this problem, but blamed it on other things.
For over five years now, I have had to tell customers not to use thin
provisioning, and I have had to add code to postgres to refuse to perform
inserts or updates if the disk volume is more than 80% full.  I have lost
count of the number of customers who are running an older version of the
product (because they refuse to upgrade) and come back with complaints that
they ran out of disk and now their database is corrupt.  All this time, I
have been blaming this on virtualization and thin provisioning.

mark


On Mon, Apr 9, 2018 at 12:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Ouch.  If a process exits -- say, because the user typed \q into psql
> -- then you're talking about potentially calling fsync() on a really
> large number of file descriptors, flushing many gigabytes of data to
> disk.  And it may well be that you never actually wrote any data to
> any of those file descriptors -- those writes could have come from
> other backends.  Or you may have written a little bit of data through
> those FDs, but there could be lots of other data that you end up
> flushing incidentally.  Perfectly innocuous things like starting up a
> backend, running a few short queries, and then having that backend
> exit suddenly turn into something that could have a massive
> system-wide performance impact.
>
> Also, if a backend ever manages to exit without running through this
> code, or writes any dirty blocks afterward, then this still fails to
> fix the problem completely.  I guess that's probably avoidable -- we
> can put this late in the shutdown sequence and PANIC if it fails.
>
> I have a really tough time believing this is the right way to solve
> the problem.  We suffered for years because of ext3's desire to flush
> the entire page cache whenever any single file was fsync()'d, which
> was terrible.  Eventually ext4 became the norm, and the problem went
> away.  Now we're going to deliberately insert logic to do a very
> similar kind of terrible thing because the kernel developers have
> decided that fsync() doesn't have to do what it says on the tin?  I
> grant that there doesn't seem to be a better option, but I bet we're
> going to have a lot of really unhappy users if we do this.

What about the bug we fixed in
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ce439f3379aed857517c8ce207485655000fc8e
?  Say somebody does something along the lines of:

ps uxww | grep postgres | grep -v grep | awk '{print $2}' | xargs kill -9

...and then restarts postgres.  Craig's proposal wouldn't cover this
case, because there was no opportunity to run fsync() after the first
crash, and there's now no way to go back and fsync() any stuff we
didn't fsync() before, because the kernel may have already thrown away
the error state, or may lie to us and tell us everything is fine
(because our new fd wasn't opened early enough).  I can't find the
original discussion that led to that commit right now, so I'm not
exactly sure what scenarios we were thinking about.  But I think it
would at least be a problem if full_page_writes=off or if you had
previously started the server with fsync=off and now wish to switch to
fsync=on after completing a bulk load or similar.  Recovery can read a
page, see that it looks OK, and continue, and then a later fsync()
failure can revert that page to an earlier state and now your database
is corrupted -- and there's absolutely no way to detect this because
write() gives you the new page contents later, fsync() doesn't feel
obliged to tell you about the error because your fd wasn't opened
early enough, and eventually the write can be discarded and you'll
revert back to the old page version with no errors ever being reported
anywhere.
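The sequence is easier to see as a simulation (a toy model of the page-cache behavior described above, not real kernel code):

```python
class File:
    """Toy model of a file with an on-disk copy and a (dirty) cached page."""

    def __init__(self, disk_data):
        self.disk = disk_data
        self.cache = None           # dirty page-cache contents, if any

    def write(self, data):
        self.cache = data           # buffered write: page cache only

    def read(self):
        # Reads are served from the cache while the dirty page exists.
        return self.cache if self.cache is not None else self.disk

    def writeback(self, ok):
        if ok:
            self.disk = self.cache
        self.cache = None           # a failed page is dropped, not kept dirty

    def fsync_from_late_fd(self):
        pass                        # fd opened after the failure: no error seen
```

Recovery's read() sees the new contents and is satisfied; after the failed writeback, every subsequent read silently returns the old version, and the late-opened fd's fsync() never reports anything.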

Another consequence of this behavior is that initdb -S is never
reliable, so pg_rewind's use of it doesn't actually fix the problem it
was intended to solve.  It also means that initdb itself isn't
crash-safe, since the data file changes are made by the backend while
initdb itself is doing the fsyncs, and initdb has no way of knowing
what files the backend is going to create and therefore can't -- even
theoretically -- open them first.

What's being presented to us as the API contract that we should expect
from buffered I/O is that if you open a file and read() from it, call
fsync(), and get no error, the kernel may nevertheless decide that
some previous write that it never managed to flush can't be flushed,
and then revert the page to the contents it had at some point in the
past.  That's more or less equivalent to letting a malicious
adversary randomly overwrite database pages with plausible-looking but
incorrect contents without notice and hoping you can still build a
reliable system.  You can avoid the problem if you can always open an
fd for every file you want to modify before it's written and hold on
to it until after it's fsync'd, but that's pretty hard to guarantee in
the face of kill -9.

I think the simplest technological solution to this problem is to
rewrite the entire backend and all supporting processes to use
O_DIRECT everywhere.  To maintain adequate performance, we'll have to
write a complete I/O scheduling system inside PostgreSQL.  Also, since
we'll now have to make shared_buffers much larger -- since we'll no
longer be benefiting from the OS cache -- we'll need to replace the
use of malloc() with an allocator that pulls from shared_buffers.
Plus, as noted, we'll need to totally rearchitect several of our
critical frontend tools.  Let's freeze all other development for the
next year while we work on that, and put out a notice that Linux is no
longer a supported platform for any existing release.  Before we do
that, we might want to check whether fsync() actually writes the data
to disk in a usable way even with O_DIRECT.  If not, we should just
de-support Linux entirely as a hopelessly broken and unsupportable
platform.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Hi,

On 2018-04-09 15:02:11 -0400, Robert Haas wrote:
> I think the simplest technological solution to this problem is to
> rewrite the entire backend and all supporting processes to use
> O_DIRECT everywhere.  To maintain adequate performance, we'll have to
> write a complete I/O scheduling system inside PostgreSQL.  Also, since
> we'll now have to make shared_buffers much larger -- since we'll no
> longer be benefiting from the OS cache -- we'll need to replace the
> use of malloc() with an allocator that pulls from shared_buffers.
> Plus, as noted, we'll need to totally rearchitect several of our
> critical frontend tools.  Let's freeze all other development for the
> next year while we work on that, and put out a notice that Linux is no
> longer a supported platform for any existing release.  Before we do
> that, we might want to check whether fsync() actually writes the data
> to disk in a usable way even with O_DIRECT.  If not, we should just
> de-support Linux entirely as a hopelessly broken and unsupportable
> platform.

Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
absurd, as is some of the proposed ways this is all supposed to
work. But I think the case we're discussing is much closer to a near
irresolvable corner case than anything else.

We're talking about the storage layer returning an irresolvable
error. You're hosed even if we report it properly.  Yes, it'd be nice if
we could report it reliably.  But that doesn't change the fact that what
we're doing is ensuring that data is safely fsynced unless storage
fails, in which case it's not safely fsynced anyway.

Greetings,

Andres Freund



On 04/09/2018 08:29 PM, Mark Dilger wrote:
> 
>> On Apr 9, 2018, at 10:26 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
> 
>> We have plenty of YEARS of people not noticing this issue
> 
> I disagree.  I have noticed this problem, but blamed it on other things.
> For over five years now, I have had to tell customers not to use thin
> provisioning, and I have had to add code to postgres to refuse to perform
> inserts or updates if the disk volume is more than 80% full.  I have lost
> count of the number of customers who are running an older version of the
> product (because they refuse to upgrade) and come back with complaints that
> they ran out of disk and now their database is corrupt.  All this time, I
> have been blaming this on virtualization and thin provisioning.
> 

Yeah. There's a big difference between not noticing an issue because it
does not happen very often vs. attributing it to something else. If we
had the ability to revisit past data corruption cases, we would probably
discover a fair number of cases caused by this.

The other thing we probably need to acknowledge is that the environment
is changing significantly - things like thin provisioning are likely to
get even more common, increasing the incidence of these issues.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On Mon, Apr 9, 2018 at 12:13 PM, Andres Freund <andres@anarazel.de> wrote:
> Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
> absurd, as is some of the proposed ways this is all supposed to
> work. But I think the case we're discussing is much closer to a near
> irresolvable corner case than anything else.

+1

> We're talking about the storage layer returning an irresolvable
> error. You're hosed even if we report it properly.  Yes, it'd be nice if
> we could report it reliably.  But that doesn't change the fact that what
> we're doing is ensuring that data is safely fsynced unless storage
> fails, in which case it's not safely fsynced anyway.

Right. We seem to be implicitly assuming that there is a big
difference between a problem in the storage layer that we could in
principle detect, but don't, and any other problem in the storage
layer. I've read articles claiming that technologies like SMART are
not really reliable in a practical sense [1], so it seems to me that
there is reason to doubt that this gap is all that big.

That said, I suspect that the problems with running out of disk space
are serious practical problems. I have personally scoffed at stories
involving Postgres database corruption that gets attributed to
running out of disk space. Looks like I was dead wrong.

[1] https://danluu.com/file-consistency/ -- "Filesystem correctness"
-- 
Peter Geoghegan


On Mon, Apr 09, 2018 at 04:29:36PM +0100, Greg Stark wrote:
> Honestly, I don't think there's *any* way to use the current interface
> to implement reliable operation. Even an embedded database using a
> single process and keeping every file open all the time (which means
> file descriptor limits cap its scalability) can suffer silent
> corruption whenever some other process, like a backup program, comes
> along and calls fsync (or even sync?).

That is indeed true (sync would induce fsync on open inodes and clear
the error), and that's a nasty bug that apparently went unnoticed for
a very long time. Hopefully the errseq_t fixes in Linux 4.13 deal with
at least this issue, but similar fixes need to be adopted by many other
kernels (all those that mark failed pages as clean).
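A rough model of how errseq_t changes the picture (simplified: the real mechanism also carries a "seen" marker so that a file opened while an error is still unreported gets at least one report):

```python
import errno

class Mapping:
    """The file's address_space, carrying an error sequence counter."""
    def __init__(self):
        self.errseq = 0

    def writeback_failed(self):
        self.errseq += 1

class Fd:
    """Each open file description samples the counter when opened, and
    fsync() reports an error iff the counter advanced since last checked."""
    def __init__(self, mapping):
        self.mapping = mapping
        self.seen = mapping.errseq

    def fsync(self):
        if self.mapping.errseq != self.seen:
            self.seen = self.mapping.errseq
            raise OSError(errno.EIO, "Input/output error")
```

Every descriptor that was open across the failure gets the error exactly once, and nobody can clear it for anybody else, which is precisely what fixes the backup-steals-the-error case.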

I honestly do not expect that keeping around the failed pages will
be an acceptable change for most kernels, and as such the recommendation
will probably be to coordinate in userspace for the fsync().

What about having buffered I/O with implied fsync() atomicity via O_SYNC?
This would probably necessitate some helper threads that mask the
latency and present an async interface to the rest of PG, but it sounds
less intrusive than going for DIO.
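For reference, the O_SYNC variant looks like this (a minimal sketch; the path and the append-only pattern are illustrative, and every write() now waits for stable storage, which is the performance cost raised in the replies):

```python
import os

def durable_append(path, data):
    """Append with write()+fsync() semantics folded into the write itself."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT | os.O_SYNC, 0o600)
    try:
        # With O_SYNC the write() returns (or fails with EIO) only once
        # the data has reached stable storage -- there is no separate
        # fsync() whose error state can be cleared behind our back.
        os.write(fd, data)
    finally:
        os.close(fd)
```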

Best regards,
Anthony


On 2018-04-09 21:26:21 +0200, Anthony Iliopoulos wrote:
> What about having buffered IO with implied fsync() atomicity via
> O_SYNC?

You're kidding, right?  We could also just add sleep(30)'s all over the
tree, and hope that that'll solve the problem.  There's a reason we
don't permanently fsync everything. Namely that it'll be way too slow.

Greetings,

Andres Freund



On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos <ailiop@altatus.com> wrote:

>I honestly do not expect that keeping around the failed pages will
>be an acceptable change for most kernels, and as such the
>recommendation
>will probably be to coordinate in userspace for the fsync().

Why is that required? You could very well just keep per-inode information
about fatal failures around, and report errors until that bit is
explicitly cleared.  Yes, that keeps some memory around until unmount if
nobody clears it. But it's orders of magnitude less, and results in
usable semantics.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:
> You could make the argument that it's OK to forget if the entire file
> system goes away. But actually, why is that ok?

I was going to say that it'd be okay to clear the error flag on umount,
since any opened files would prevent unmounting; but then I realized we
need to consider the case of close()ing all FDs and then opening them
later... in another process.

I was going to say that's fine for postgres, since it chdir()s into its
basedir, but actually it's not fine for non-default tablespaces...

On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:
> notification descriptor open, where the kernel would inject events
> related to writeback failures of files under watch (potentially
> enriched to contain info regarding the exact failed pages and
> the file offset they map to).

For postgres that'd require backend processes to open() a file such that,
following its close(), any writeback errors are "signalled" to the checkpointer
process...

Justin


On Mon, Apr 09, 2018 at 12:29:16PM -0700, Andres Freund wrote:
> On 2018-04-09 21:26:21 +0200, Anthony Iliopoulos wrote:
> > What about having buffered IO with implied fsync() atomicity via
> > O_SYNC?
> 
> You're kidding, right?  We could also just add sleep(30)'s all over the
> tree, and hope that that'll solve the problem.  There's a reason we
> don't permanently fsync everything. Namely that it'll be way too slow.

I am assuming you could apply the same principle of selectively using O_SYNC
at the times and places where you'd currently actually call fsync().

Also assuming that you'd want to have a backwards-compatible solution for
all those kernels that don't keep the pages around, irrespective of future
fixes. Short of loading a kernel module and dealing with the problem directly,
the only other available options seem to be either O_SYNC, O_DIRECT or ignoring
the issue.
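[Editor's sketch of the selective-O_SYNC idea being floated here, in illustrative Python rather than PostgreSQL's C; the helper name and file layout are made up. The point is that an O_SYNC write() reports a storage error synchronously, instead of the error surfacing, or being silently lost, at some later fsync().]

```python
import os

def write_durably(path, offset, data):
    """Write through an O_SYNC descriptor: the write() itself waits for
    the device and raises OSError (e.g. EIO) on failure, so there is no
    deferred-writeback error for a later fsync() to lose.
    Illustrative sketch only; every such write pays the device latency,
    which is exactly the performance objection raised in this thread."""
    fd = os.open(path, os.O_WRONLY | os.O_SYNC)
    try:
        os.pwrite(fd, data, offset)  # raises OSError on a failed sync write
    finally:
        os.close(fd)
```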

Best regards,
Anthony


On 04/09/2018 04:22 PM, Anthony Iliopoulos wrote:
> On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>>
>> We already have dirty_bytes and dirty_background_bytes, for example. I
>> don't see why there couldn't be another limit defining how much dirty
>> data to allow before blocking writes altogether. I'm sure it's not that
>> simple, but you get the general idea - do not allow using all available
>> memory because of writeback issues, but don't throw the data away in
>> case it's just a temporary issue.
> 
> Sure, there could be knobs for limiting how much memory such "zombie"
> pages may occupy. Not sure how helpful it would be in the long run
> since this tends to be highly application-specific, and for something
> with a large data footprint one would end up tuning this accordingly
> in a system-wide manner. This has the potential to leave other
> applications running in the same system with very little memory, in
> cases where for example original application crashes and never clears
> the error. Apart from that, further interfaces would need to be provided
> for actually dealing with the error (again assuming non-transient
> issues that may not be fixed transparently and that temporary issues
> are taken care of by lower layers of the stack).
> 

I don't quite see how this is any different from other possible issues
when running multiple applications on the same system. One application
can generate a lot of dirty data, reaching dirty_bytes and forcing the
other applications on the same host to do synchronous writes.

Of course, you might argue that is a temporary condition - it will
resolve itself once the dirty pages get written to storage. In case of
an I/O issue, it is a permanent impact - it will not resolve itself
unless the I/O problem gets fixed.

I'm not sure what interfaces would need to be provided. Possibly something
that says "drop dirty pages for these files" after the application gets
killed or something. That makes sense, of course.

>> Well, there seem to be kernels that seem to do exactly that already. At
>> least that's how I understand what this thread says about FreeBSD and
>> Illumos, for example. So it's not an entirely insane design, apparently.
> 
> It is reasonable, but even FreeBSD has a big fat comment right
> there (since 2017), mentioning that there can be no recovery from
> EIO at the block layer and this needs to be done differently. No
> idea how an application running on top of either FreeBSD or Illumos
> would actually recover from this error (and clear it out), other
> than remounting the fs in order to force dropping of relevant pages.
> It does provide though indeed a persistent error indication that
> would allow Pg to simply reliably panic. But again this does not
> necessarily play well with other applications that may be using
> the filesystem reliably at the same time, and are now faced with
> EIO while their own writes succeed to be persisted.
> 

In my experience when you have a persistent I/O error on a device, it
likely affects all applications using that device. So unmounting the fs
to clear the dirty pages seems like an acceptable solution to me.

I don't see what else the application could do. In a way I'm suggesting
applications don't really want to be responsible for recovering (cleanup
of dirty pages etc.). We're more than happy to hand that over to the kernel,
not least because each kernel will do that differently. What we do want,
however, is reliable information about the fsync outcome, which we need to
properly manage WAL, checkpoints etc.

> Ideally, you'd want a (potentially persistent) indication of error
> localized to a file region (mapping the corresponding failed writeback
> pages). NetBSD is already implementing fsync_ranges(), which could
> be a step in the right direction.
> 
>> One has to wonder how many applications actually use this correctly,
>> considering PostgreSQL cares about data durability/consistency so much
>> and yet we've been misunderstanding how it works for 20+ years.
> 
> I would expect it would be very few, potentially those that have
> a very simple process model (e.g. embedded DBs that can abort a
> txn on fsync() EIO). I think that durability is a rather complex
> cross-layer issue which has been grossly misunderstood similarly
> in the past (e.g. see [1]). It seems that both the OS and DB
> communities greatly benefit from a periodic reality check, and
> I see this as an opportunity for strengthening the IO stack in
> an end-to-end manner.
> 

Right. What I was getting to is that perhaps the current fsync()
behavior is not very practical for building actual applications.

> Best regards,
> Anthony
> 
> [1] https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf
> 

Thanks. The paper looks interesting.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On Mon, Apr 09, 2018 at 12:37:03PM -0700, Andres Freund wrote:
> 
> 
> On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> 
> >I honestly do not expect that keeping around the failed pages will
> >be an acceptable change for most kernels, and as such the
> >recommendation
> >will probably be to coordinate in userspace for the fsync().
> 
> Why is that required? You could very well just keep per inode information about fatal failures that occurred around.
> Report errors until that bit is explicitly cleared.  Yes, that keeps some memory around until unmount if nobody clears
> it. But it's orders of magnitude less, and results in usable semantics.

As discussed before, I think this could be acceptable, especially
if you pair it with an opt-in mechanism (only applications that
care to deal with this will have to), and would give it a shot.

Still need a way to deal with all other systems and prior kernel
releases that are eating fsync() writeback errors even over sync().

Best regards,
Anthony



On 04/09/2018 09:37 PM, Andres Freund wrote:
> 
> 
> On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> 
>> I honestly do not expect that keeping around the failed pages will
>> be an acceptable change for most kernels, and as such the
>> recommendation
>> will probably be to coordinate in userspace for the fsync().
> 
> Why is that required? You could very well just keep per inode 
> information about fatal failures that occurred around. Report errors 
> until that bit is explicitly cleared. Yes, that keeps some memory
> around until unmount if nobody clears it. But it's orders of
> magnitude less, and results in usable semantics.
>

Isn't the expectation that when a fsync call fails, the next one will
retry writing the pages in the hope that it succeeds?

Of course, it's also possible to do what you suggested, and simply mark
the inode as failed. In that case the next fsync can't possibly retry
the writes (e.g. after freeing some space on a thin-provisioned system),
but we'd get a reliable failure mode.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On 2018-04-09 14:41:19 -0500, Justin Pryzby wrote:
> On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:
> > You could make the argument that it's OK to forget if the entire file
> > system goes away. But actually, why is that ok?
> 
> I was going to say that it'd be okay to clear error flag on umount, since any
> opened files would prevent unmounting; but, then I realized we need to consider
> the case of close()ing all FDs then opening them later..in another process.

> On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:
> > notification descriptor open, where the kernel would inject events
> > related to writeback failures of files under watch (potentially
> > enriched to contain info regarding the exact failed pages and
> > the file offset they map to).
> 
> For postgres that'd require backend processes to open() an file such that,
> following its close(), any writeback errors are "signalled" to the checkpointer
> process...

I don't think that's as hard as some people argued in this thread.  We
could very well open a pipe in postmaster with the write end open in
each subprocess, and the read end open only in checkpointer (and
postmaster, but unused there).  Whenever closing a file descriptor that
was dirtied in the current process, send it over the pipe to the
checkpointer. The checkpointer then can receive all those file
descriptors (making sure it's not above the fd limit, fsync()ing and close()ing
to make room if necessary).  The biggest complication would presumably
be to deduplicate the received file descriptors for the same file,
without losing track of any errors.

Even better, we could do so via a dedicated worker. That'd quite
possibly end up as a performance benefit.
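[Editor's note: descriptor passing actually requires a Unix-domain socket carrying SCM_RIGHTS ancillary data; a plain pipe() can move bytes but not descriptors. A minimal sketch of the mechanism, in illustrative Python rather than PostgreSQL code, with made-up helper names:]

```python
import array
import os
import socket

def send_fd(sock, fd):
    """Hand a file descriptor to another process over a Unix-domain
    socket via SCM_RIGHTS ancillary data."""
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                           array.array("i", [fd]))])

def recv_fd(sock):
    """Receive one descriptor; the kernel installs it as a new fd
    referring to the same open file description, so a later fsync()
    on it observes the same error state as the sender's fd did."""
    fdsize = array.array("i").itemsize
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(fdsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            return array.array("i", bytes(data[:fdsize]))[0]
    raise RuntimeError("no descriptor in message")
```

A backend would call send_fd() when closing a dirtied file; the checkpointer loops on recv_fd(), deduplicates, and fsync()s at checkpoint time.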


> I was going to say that's fine for postgres, since it chdir()s into its
> basedir, but actually not fine for nondefault tablespaces..

I think it'd be fair to open PG_VERSION of all created
tablespaces. It would require some hooks to signal the checkpointer (or
whichever process) to do so when creating one, but it shouldn't be too
hard.  Some people would complain because they can't do some nasty hacks
anymore, but it'd also save people's butts by preventing them from
accidentally unmounting.

Greetings,

Andres Freund


Hi,

On 2018-04-09 21:54:05 +0200, Tomas Vondra wrote:
> Isn't the expectation that when a fsync call fails, the next one will
> retry writing the pages in the hope that it succeeds?

Some people expect that, I personally don't think it's a useful
expectation.

We should just deal with this by crash-recovery.  The big problem I see
is that you always need to keep a file descriptor open for pretty much
any file written to, inside and outside of postgres, to be guaranteed to
see errors; crash-recovery would sidestep that.  Even if retrying
worked, I'd still advocate this approach (I've done so in the past, and
I've written code in pg that panics on fsync failure...).
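[Editor's sketch of the panic-on-failure discipline, in illustrative Python; PostgreSQL's real code is C and would use ereport(PANIC, ...). The function name is made up.]

```python
import os
import sys

def fsync_or_panic(fd):
    """fsync() and treat *any* failure as fatal.  Retrying is unsafe on
    Linux: the failed fsync() clears the AS_EIO page flag, so a retry
    can return success even though the data never reached disk.  The
    only safe reaction is to crash and replay WAL from the last good
    checkpoint's redo position."""
    try:
        os.fsync(fd)
    except OSError as e:
        sys.stderr.write("PANIC: fsync failed: %s\n" % os.strerror(e.errno))
        os._exit(1)  # no retry, no error-swallowing: crash into recovery
```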

What we'd need to do however is to clear that bit during crash
recovery... Which is interesting from a policy perspective. Could be
that other apps wouldn't want that.

I also wonder if we couldn't just somewhere read each relevant mounted
filesystem's errseq value. Whenever checkpointer notices before
finishing a checkpoint that it has changed, do a crash restart.


Greetings,

Andres Freund


> On Apr 9, 2018, at 12:13 PM, Andres Freund <andres@anarazel.de> wrote:
> 
> Hi,
> 
> On 2018-04-09 15:02:11 -0400, Robert Haas wrote:
>> I think the simplest technological solution to this problem is to
>> rewrite the entire backend and all supporting processes to use
>> O_DIRECT everywhere.  To maintain adequate performance, we'll have to
>> write a complete I/O scheduling system inside PostgreSQL.  Also, since
>> we'll now have to make shared_buffers much larger -- since we'll no
>> longer be benefiting from the OS cache -- we'll need to replace the
>> use of malloc() with an allocator that pulls from shared_buffers.
>> Plus, as noted, we'll need to totally rearchitect several of our
>> critical frontend tools.  Let's freeze all other development for the
>> next year while we work on that, and put out a notice that Linux is no
>> longer a supported platform for any existing release.  Before we do
>> that, we might want to check whether fsync() actually writes the data
>> to disk in a usable way even with O_DIRECT.  If not, we should just
>> de-support Linux entirely as a hopelessly broken and unsupportable
>> platform.
> 
> Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
> absurd, as is some of the proposed ways this is all supposed to
> work. But I think the case we're discussing is much closer to a near
> irresolvable corner case than anything else.
> 
> We're talking about the storage layer returning an irresolvable
> error. You're hosed even if we report it properly.  Yes, it'd be nice if
> we could report it reliably.  But that doesn't change the fact that what
> we're doing is ensuring that data is safely fsynced unless storage
> fails, in which case it's not safely fsynced anyway.

I was reading this thread up until now as meaning that the standby could
receive corrupt WAL data and become corrupted.  That seems a much bigger
problem than merely having the master become corrupted in some unrecoverable
way.  It is a long standing expectation that serious hardware problems on
the master can result in the master needing to be replaced.  But there has
not been an expectation that the one or more standby servers would be taken
down along with the master, leaving all copies of the database unusable.
If this bug corrupts the standby servers, too, then it is a whole different
class of problem than the one folks have come to expect.

Your comment reads as if this is a problem isolated to whichever server has
the problem, and will not get propagated to other servers.  Am I reading
that right?

Can anybody clarify this for non-core-hacker folks following along at home?


mark




On 04/09/2018 10:04 PM, Andres Freund wrote:
> Hi,
> 
> On 2018-04-09 21:54:05 +0200, Tomas Vondra wrote:
>> Isn't the expectation that when a fsync call fails, the next one will
>> retry writing the pages in the hope that it succeeds?
> 
> Some people expect that, I personally don't think it's a useful
> expectation.
> 

Maybe. I'd certainly prefer automated recovery from temporary I/O
issues (like a full disk on thin-provisioning) without the database
crashing and restarting. But I'm not sure it's worth the effort.

And most importantly, it's rather delusional to think the kernel
developers are going to be enthusiastic about that approach ...

>
> We should just deal with this by crash-recovery. The big problem I
> see is that you always need to keep an file descriptor open for
> pretty much any file written to inside and outside of postgres, to be
> guaranteed to see errors. And that'd solve that. Even if retrying
> would work, I'd advocate for that (I've done so in the past, and I've
> written code in pg that panics on fsync failure...).
> 

Sure. And it's likely way less invasive from kernel perspective.

>
> What we'd need to do however is to clear that bit during crash 
> recovery... Which is interesting from a policy perspective. Could be 
> that other apps wouldn't want that.
> 

IMHO it'd be enough if a remount clears it.

>
> I also wonder if we couldn't just somewhere read each relevant
> mounted filesystem's errseq value. Whenever checkpointer notices
> before finishing a checkpoint that it has changed, do a crash
> restart.
> 

Hmmmm, that's an interesting idea, and it's about the only thing that
would help us on older kernels. There's a wb_err in address_space, but
that's at the inode level. Not sure if there's something at the fs level.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Hi,

On 2018-04-09 13:25:54 -0700, Mark Dilger wrote:
> I was reading this thread up until now as meaning that the standby could
> receive corrupt WAL data and become corrupted.

I don't see that as a real problem here. For one the problematic
scenarios shouldn't readily apply, for another WAL is checksummed.

There's the problem that a new basebackup would potentially become
corrupted however. And similarly pg_rewind.

Note that I'm not saying that we and/or linux shouldn't change
anything. Just that the apocalypse isn't here.


> Your comment reads as if this is a problem isolated to whichever server has
> the problem, and will not get propagated to other servers.  Am I reading
> that right?

I think that's basically right. There's cases where corruption could get
propagated, but they're not straightforward.

Greetings,

Andres Freund


Hi,

On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:
> Maybe. I'd certainly prefer automated recovery from an temporary I/O
> issues (like full disk on thin-provisioning) without the database
> crashing and restarting. But I'm not sure it's worth the effort.

Oh, I agree on that one. But that's more a question of how we force the
kernel's hand on allocating disk space. In most cases the kernel
allocates the disk space immediately, even if delayed allocation is in
effect. For the cases where that's not the case (if there are current
ones, rather than just past bugs), we should be able to make sure that's
not an issue by pre-zeroing the data and/or using fallocate.
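[Editor's sketch of the fallocate approach, in illustrative Python; requires a platform with posix_fallocate(3), e.g. Linux, and the helper name is made up.]

```python
import os

def preallocate(path, size):
    """Force real block allocation up front with posix_fallocate(), so
    a nearly-full volume reports ENOSPC at extend time, where the
    caller can handle it cleanly, rather than during later writeback,
    where the failure becomes an unreliable fsync() error."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)  # raises OSError(ENOSPC) on failure
    finally:
        os.close(fd)
```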

Greetings,

Andres Freund



On 04/09/2018 10:25 PM, Mark Dilger wrote:
> 
>> On Apr 9, 2018, at 12:13 PM, Andres Freund <andres@anarazel.de> wrote:
>>
>> Hi,
>>
>> On 2018-04-09 15:02:11 -0400, Robert Haas wrote:
>>> I think the simplest technological solution to this problem is to
>>> rewrite the entire backend and all supporting processes to use
>>> O_DIRECT everywhere.  To maintain adequate performance, we'll have to
>>> write a complete I/O scheduling system inside PostgreSQL.  Also, since
>>> we'll now have to make shared_buffers much larger -- since we'll no
>>> longer be benefiting from the OS cache -- we'll need to replace the
>>> use of malloc() with an allocator that pulls from shared_buffers.
>>> Plus, as noted, we'll need to totally rearchitect several of our
>>> critical frontend tools.  Let's freeze all other development for the
>>> next year while we work on that, and put out a notice that Linux is no
>>> longer a supported platform for any existing release.  Before we do
>>> that, we might want to check whether fsync() actually writes the data
>>> to disk in a usable way even with O_DIRECT.  If not, we should just
>>> de-support Linux entirely as a hopelessly broken and unsupportable
>>> platform.
>>
>> Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
>> absurd, as is some of the proposed ways this is all supposed to
>> work. But I think the case we're discussing is much closer to a near
>> irresolvable corner case than anything else.
>>
>> We're talking about the storage layer returning an irresolvable
>> error. You're hosed even if we report it properly.  Yes, it'd be nice if
>> we could report it reliably.  But that doesn't change the fact that what
>> we're doing is ensuring that data is safely fsynced unless storage
>> fails, in which case it's not safely fsynced anyway.
> 
> I was reading this thread up until now as meaning that the standby could
> receive corrupt WAL data and become corrupted.  That seems a much bigger
> problem than merely having the master become corrupted in some unrecoverable
> way.  It is a long standing expectation that serious hardware problems on
> the master can result in the master needing to be replaced.  But there has
> not been an expectation that the one or more standby servers would be taken
> down along with the master, leaving all copies of the database unusable.
> If this bug corrupts the standby servers, too, then it is a whole different
> class of problem than the one folks have come to expect.
> 
> Your comment reads as if this is a problem isolated to whichever server has
> the problem, and will not get propagated to other servers.  Am I reading
> that right?
> 
> Can anybody clarify this for non-core-hacker folks following along at home?
> 

That's a good question. I don't see any guarantee it'd be isolated to
the master node. Consider this example:

(0) checkpoint happens on the primary

(1) a page gets modified, a full-page gets written to WAL

(2) the page is written out to page cache

(3) writeback of that page fails (and gets discarded)

(4) we attempt to modify the page again, but we read the stale version

(5) we modify the stale version, writing the change to WAL


The standby will get the full-page, and then a WAL from the stale page
version. That doesn't seem like a story with a happy end, I guess. But I
might be easily missing some protection built into the WAL ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


> On Apr 9, 2018, at 1:43 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>
>
> On 04/09/2018 10:25 PM, Mark Dilger wrote:
>>
>>> On Apr 9, 2018, at 12:13 PM, Andres Freund <andres@anarazel.de> wrote:
>>>
>>> Hi,
>>>
>>> On 2018-04-09 15:02:11 -0400, Robert Haas wrote:
>>>> I think the simplest technological solution to this problem is to
>>>> rewrite the entire backend and all supporting processes to use
>>>> O_DIRECT everywhere.  To maintain adequate performance, we'll have to
>>>> write a complete I/O scheduling system inside PostgreSQL.  Also, since
>>>> we'll now have to make shared_buffers much larger -- since we'll no
>>>> longer be benefiting from the OS cache -- we'll need to replace the
>>>> use of malloc() with an allocator that pulls from shared_buffers.
>>>> Plus, as noted, we'll need to totally rearchitect several of our
>>>> critical frontend tools.  Let's freeze all other development for the
>>>> next year while we work on that, and put out a notice that Linux is no
>>>> longer a supported platform for any existing release.  Before we do
>>>> that, we might want to check whether fsync() actually writes the data
>>>> to disk in a usable way even with O_DIRECT.  If not, we should just
>>>> de-support Linux entirely as a hopelessly broken and unsupportable
>>>> platform.
>>>
>>> Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
>>> absurd, as is some of the proposed ways this is all supposed to
>>> work. But I think the case we're discussing is much closer to a near
>>> irresolvable corner case than anything else.
>>>
>>> We're talking about the storage layer returning an irresolvable
>>> error. You're hosed even if we report it properly.  Yes, it'd be nice if
>>> we could report it reliably.  But that doesn't change the fact that what
>>> we're doing is ensuring that data is safely fsynced unless storage
>>> fails, in which case it's not safely fsynced anyway.
>>
>> I was reading this thread up until now as meaning that the standby could
>> receive corrupt WAL data and become corrupted.  That seems a much bigger
>> problem than merely having the master become corrupted in some unrecoverable
>> way.  It is a long standing expectation that serious hardware problems on
>> the master can result in the master needing to be replaced.  But there has
>> not been an expectation that the one or more standby servers would be taken
>> down along with the master, leaving all copies of the database unusable.
>> If this bug corrupts the standby servers, too, then it is a whole different
>> class of problem than the one folks have come to expect.
>>
>> Your comment reads as if this is a problem isolated to whichever server has
>> the problem, and will not get propagated to other servers.  Am I reading
>> that right?
>>
>> Can anybody clarify this for non-core-hacker folks following along at home?
>>
>
> That's a good question. I don't see any guarantee it'd be isolated to
> the master node. Consider this example:
>
> (0) checkpoint happens on the primary
>
> (1) a page gets modified, a full-page gets written to WAL
>
> (2) the page is written out to page cache
>
> (3) writeback of that page fails (and gets discarded)
>
> (4) we attempt to modify the page again, but we read the stale version
>
> (5) we modify the stale version, writing the change to WAL
>
>
> The standby will get the full-page, and then a WAL from the stale page
> version. That doesn't seem like a story with a happy end, I guess. But I
> might be easily missing some protection built into the WAL ...

I can also imagine a master and standby that are similarly provisioned,
and thus hit an out of disk error at around the same time, resulting in
corruption on both, even if not the same corruption.  When choosing to
have one standby, or two standbys, or ten standbys, one needs to be able
to assume a certain amount of statistical independence between failures
on one server and failures on another.  If they are tightly correlated
dependent variables, then the conclusion that the probability of all
nodes failing simultaneously is vanishingly small becomes invalid.

mark

Hi,

On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:
> I can also imagine a master and standby that are similarly provisioned,
> and thus hit an out of disk error at around the same time, resulting in
> corruption on both, even if not the same corruption.

I think it's a grave mistake conflating ENOSPC issues (which we should
solve by making sure there's always enough space pre-allocated), with
EIO type errors.  The problem is different, the solution is different.

Greetings,

Andres Freund



On 04/09/2018 11:08 PM, Andres Freund wrote:
> Hi,
> 
> On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:
>> I can also imagine a master and standby that are similarly provisioned,
>> and thus hit an out of disk error at around the same time, resulting in
>> corruption on both, even if not the same corruption.
> 
> I think it's a grave mistake conflating ENOSPC issues (which we should
> solve by making sure there's always enough space pre-allocated), with
> EIO type errors.  The problem is different, the solution is different.
> 

In any case, that certainly does not count as data corruption spreading
from the master to standby.


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


> On Apr 9, 2018, at 2:25 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>
>
> On 04/09/2018 11:08 PM, Andres Freund wrote:
>> Hi,
>>
>> On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:
>>> I can also imagine a master and standby that are similarly provisioned,
>>> and thus hit an out of disk error at around the same time, resulting in
>>> corruption on both, even if not the same corruption.
>>
>> I think it's a grave mistake conflating ENOSPC issues (which we should
>> solve by making sure there's always enough space pre-allocated), with
>> EIO type errors.  The problem is different, the solution is different.

I'm happy to take your word for that.

> In any case, that certainly does not count as data corruption spreading
> from the master to standby.

Maybe not from the point of view of somebody looking at the code.  But a
user might see it differently.  If the data being loaded into the master
and getting replicated to the standby "causes" both to get corrupt, then
it seems like corruption spreading.  I put "causes" in quotes because there
is some argument to be made about "correlation does not prove cause" and so
forth, but it still feels like causation from an arms length perspective.
If there is a pattern of standby servers tending to fail more often right
around the time that the master fails, you'll have a hard time comforting
users, "hey, it's not technically causation."  If loading data into the
master causes the master to hit ENOSPC, and replicating that data to the
standby causes the standby to hit ENOSPC, and if the bug around ENOSPC has
not been fixed, then this looks like corruption spreading.

I'm certainly planning on taking a hard look at the disk allocation on my
standby servers right soon now.

mark



On Tue, Apr 10, 2018 at 2:22 AM, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>> Well, there seem to be kernels that seem to do exactly that already. At
>> least that's how I understand what this thread says about FreeBSD and
>> Illumos, for example. So it's not an entirely insane design, apparently.
>
> It is reasonable, but even FreeBSD has a big fat comment right
> there (since 2017), mentioning that there can be no recovery from
> EIO at the block layer and this needs to be done differently. No
> idea how an application running on top of either FreeBSD or Illumos
> would actually recover from this error (and clear it out), other
> than remounting the fs in order to force dropping of relevant pages.
> It does provide though indeed a persistent error indication that
> would allow Pg to simply reliably panic. But again this does not
> necessarily play well with other applications that may be using
> the filesystem reliably at the same time, and are now faced with
> EIO while their own writes succeed to be persisted.

Right.  For anyone interested, here is the change you mentioned, and
an interesting one that came a bit earlier last year:

https://reviews.freebsd.org/rS316941 -- drop buffers after device goes away
https://reviews.freebsd.org/rS326029 -- update comment about EIO contract

Retrying may well be futile, but at least future fsync() calls won't
report success bogusly.  There may of course be more space-efficient
ways to represent that state as the comment implies, while never lying
to the user -- perhaps involving filesystem level or (pinned) inode
level errors that stop all writes until unmounted.  Something tells me
they won't resort to flakey fsync() error reporting.

I wonder if anyone can tell us what Windows, AIX and HPUX do here.

> [1] https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf

Very interesting, thanks.

-- 
Thomas Munro
http://www.enterprisedb.com


On Tue, Apr 10, 2018 at 10:33 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I wonder if anyone can tell us what Windows, AIX and HPUX do here.

I created a wiki page to track what we know (or think we know) about
fsync() on various operating systems:

https://wiki.postgresql.org/wiki/Fsync_Errors

If anyone has more information or sees mistakes, please go ahead and edit it.

-- 
Thomas Munro
http://www.enterprisedb.com


On 04/09/2018 02:16 PM, Craig Ringer wrote:
> I'd like a middle ground where the kernel lets us register our interest 
> and tells us if it lost something, without us having to keep eight 
> million FDs open for some long period. "Tell us about anything that 
> happens under pgdata/" or an inotify-style per-directory-registration 
> option. I'd even say that's ideal.

Could there be a risk of a race condition here, where fsync incorrectly
returns success before we get the notification that something went wrong?

Andreas


On 10 April 2018 at 03:59, Andres Freund <andres@anarazel.de> wrote:
> On 2018-04-09 14:41:19 -0500, Justin Pryzby wrote:
>> On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:
>> > You could make the argument that it's OK to forget if the entire file
>> > system goes away. But actually, why is that ok?
>>
>> I was going to say that it'd be okay to clear error flag on umount, since any
>> opened files would prevent unmounting; but, then I realized we need to consider
>> the case of close()ing all FDs then opening them later..in another process.
>
>> On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:
>> > notification descriptor open, where the kernel would inject events
>> > related to writeback failures of files under watch (potentially
>> > enriched to contain info regarding the exact failed pages and
>> > the file offset they map to).
>>
>> For postgres that'd require backend processes to open() a file such that,
>> following its close(), any writeback errors are "signalled" to the checkpointer
>> process...
>
> I don't think that's as hard as some people argued in this thread.  We
> could very well open a pipe in postmaster with the write end open in
> each subprocess, and the read end open only in checkpointer (and
> postmaster, but unused there).  Whenever closing a file descriptor that
> was dirtied in the current process, send it over the pipe to the
> checkpointer. The checkpointer then can receive all those file
> descriptors (making sure it's not above the limit, fsync()ing and
> close()ing to make room if necessary).  The biggest complication would
> presumably be to deduplicate the received file descriptors for the same
> file, without losing track of any errors.

Yep. That'd be a cheaper way to do it, though it wouldn't work on
Windows. Though we don't know how Windows behaves here at all yet.

Prior discussion upthread had the checkpointer open()ing a file at the
same time as a backend, before the backend writes to it. But passing
the fd when the backend is done with it would be better.

We'd need a way to dup() the fd and pass it back to a backend when it
needed to reopen it sometimes, or just make sure to keep the oldest
copy of the fd when a backend reopens multiple times, but that's no
biggie.

We'd still have to fsync() out early in the checkpointer if we ran out
of space in our FD list, and initscripts would need to change our
ulimit or we'd have to do it ourselves in the checkpointer. But
neither seems insurmountable.
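For anyone unfamiliar with the mechanism being proposed: passing an open
fd between processes is done over a Unix-domain socket with SCM_RIGHTS.
A minimal sketch (assuming Linux and Python >= 3.9; the "backend" and
"checkpointer" role split here is purely illustrative, not Pg code):

```python
import os
import socket

# Sketch only: a "backend" dirties a file, then hands its open fd to a
# "checkpointer" over a Unix-domain socketpair via SCM_RIGHTS. The
# checkpointer receives the same open file description, so it still
# observes writeback errors on fsync() even after the backend has
# closed its own copy of the fd.

def backend_send_dirty_fd(sock, path):
    fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o600)
    os.write(fd, b"dirty page image")
    socket.send_fds(sock, [b"F"], [fd])   # SCM_RIGHTS under the hood
    os.close(fd)                          # backend is done with the file

def checkpointer_fsync_received_fd(sock):
    _msg, fds, _flags, _addr = socket.recv_fds(sock, 1, 1)
    fd = fds[0]
    try:
        os.fsync(fd)                      # writeback errors surface here
    finally:
        os.close(fd)
```

A pipe as Andres suggests works the same way; the key point is that the
checkpointer holds a descriptor referring to the same open file
description that was dirtied, so error state isn't lost with the close().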

FWIW, I agree that this is a corner case, but it's getting to be a
pretty big corner with the spread of overcommitted, dedupliating SANs,
cloud storage, etc. Not all I/O errors indicate permanent hardware
faults, disk failures, etc, as I outlined earlier. I'm very curious to
know what AWS EBS's error semantics are, and other cloud network block
stores. (I posted on Amazon forums
https://forums.aws.amazon.com/thread.jspa?threadID=279274&tstart=0 but
nothing so far).

I'm also not particularly inclined to trust that all file systems will
always reliably reserve space without having some cases where they'll
fail writeback on space exhaustion.

So we don't need to panic and freak out, but it's worth looking at the
direction the storage world is moving in, and whether this will become
a bigger issue over time.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On Tue, Apr 10, 2018 at 1:44 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 10 April 2018 at 03:59, Andres Freund <andres@anarazel.de> wrote:
>> I don't think that's as hard as some people argued in this thread.  We
>> could very well open a pipe in postmaster with the write end open in
>> each subprocess, and the read end open only in checkpointer (and
>> postmaster, but unused there).  Whenever closing a file descriptor that
>> was dirtied in the current process, send it over the pipe to the
>> checkpointer. The checkpointer then can receive all those file
>> descriptors (making sure it's not above the limit, fsync()ing and
>> close()ing to make room if necessary).  The biggest complication would
>> presumably be to deduplicate the received file descriptors for the same
>> file, without losing track of any errors.
>
> Yep. That'd be a cheaper way to do it, though it wouldn't work on
> Windows. Though we don't know how Windows behaves here at all yet.
>
> Prior discussion upthread had the checkpointer open()ing a file at the
> same time as a backend, before the backend writes to it. But passing
> the fd when the backend is done with it would be better.

How would that interlock with concurrent checkpoints?

I can see how to make that work if the share-fd-or-fsync-now logic
happens in smgrwrite() when called by FlushBuffer() while you hold
io_in_progress, but not if you defer it to some random time later.

-- 
Thomas Munro
http://www.enterprisedb.com


On 10 April 2018 at 04:25, Mark Dilger <hornschnorter@gmail.com> wrote:

> I was reading this thread up until now as meaning that the standby could
> receive corrupt WAL data and become corrupted.

Yes, it can, but not directly through the first error.

What can happen is that we think a block got written when it didn't.

If our in memory state diverges from our on disk state, we can make
subsequent WAL writes based on that wrong information. But that's
actually OK, since the standby will have replayed the original WAL
correctly.

I think the only time we'd run into trouble is if we evict the good
(but not written out) data from s_b and the fs buffer cache, then
later read in the old version of a block we failed to overwrite. Data
checksums (if enabled) might catch it unless the write left the whole
block stale. In that case we might generate a full page write with the
stale block and propagate that over WAL to the standby.

So I'd say standbys are relatively safe - very safe if the issue is
caught promptly, and less so over time. But AFAICS WAL-based
replication (physical or logical) is not a perfect defense for this.

However, remember, if your storage system is free of any sort of
overprovisioning, is on a non-network file system, and doesn't use
multipath (or sets it up right) this issue *is exceptionally unlikely
to affect you*.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On 10 April 2018 at 04:37, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:
>> Maybe. I'd certainly prefer automated recovery from temporary I/O
>> issues (like full disk on thin-provisioning) without the database
>> crashing and restarting. But I'm not sure it's worth the effort.
>
> Oh, I agree on that one. But that's more a question of how we force the
> kernel's hand on allocating disk space. In most cases the kernel
> allocates the disk space immediately, even if delayed allocation is in
> effect. For the cases where that's not the case (if there are current
> ones, rather than just past bugs), we should be able to make sure that's
> not an issue by pre-zeroing the data and/or using fallocate.

Nitpick: In most cases the kernel reserves disk space immediately,
before returning from write(). NFS seems to be the main exception
here.

EXT4 and XFS don't allocate until later, doing so by performing actual
writes to FS metadata, initializing disk blocks, etc. So we won't
notice errors that are only detectable at actual allocation time,
like thin provisioning problems, until after write() returns and we
face the same writeback issues.

So I reckon you're safe from space-related issues if you're not on NFS
(and whyyy would you do that?) and not thinly provisioned. I'm sure
there are other corner cases, but I don't see any reason to expect
space-exhaustion-related corruption problems on a sensible FS backed
by a sensible block device. I haven't tested things like quotas,
verified how reliable space reservation is under concurrency, etc as
yet.
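To illustrate the fallocate approach Andres suggests, forcing allocation
up front makes ENOSPC surface synchronously, at a point where nothing has
been lost yet, instead of at writeback time where error reporting is the
whole problem. A sketch (extend_file is a hypothetical helper, not
anything in Pg; assumes a filesystem supporting posix_fallocate):

```python
import errno
import os

# Sketch of the "force allocation up front" idea (hypothetical helper,
# not PostgreSQL code). Space reserved via posix_fallocate is really
# allocated, so ENOSPC is reported synchronously here rather than at
# some later, hard-to-report writeback.
def extend_file(path, new_size):
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        os.posix_fallocate(fd, 0, new_size)
    except OSError as e:
        if e.errno == errno.ENOSPC:
            raise RuntimeError("no space at allocation time; "
                               "fail cleanly, nothing was lost") from e
        raise
    finally:
        os.close(fd)
```

Pre-zeroing with explicit write() calls achieves much the same on
filesystems that allocate at write() time; fallocate just makes the
reservation explicit and cheap.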

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



On April 9, 2018 6:59:03 PM PDT, Craig Ringer <craig@2ndquadrant.com> wrote:
>On 10 April 2018 at 04:37, Andres Freund <andres@anarazel.de> wrote:
>> Hi,
>>
>> On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:
>>> Maybe. I'd certainly prefer automated recovery from temporary I/O
>>> issues (like full disk on thin-provisioning) without the database
>>> crashing and restarting. But I'm not sure it's worth the effort.
>>
>> Oh, I agree on that one. But that's more a question of how we force
>the
>> kernel's hand on allocating disk space. In most cases the kernel
>> allocates the disk space immediately, even if delayed allocation is
>in
>> effect. For the cases where that's not the case (if there are current
>> ones, rather than just past bugs), we should be able to make sure
>that's
>> not an issue by pre-zeroing the data and/or using fallocate.
>
>Nitpick: In most cases the kernel reserves disk space immediately,
>before returning from write(). NFS seems to be the main exception
>here.
>
>EXT4 and XFS don't allocate until later, doing so by performing actual
>writes to FS metadata, initializing disk blocks, etc. So we won't
>notice errors that are only detectable at actual allocation time,
>like thin provisioning problems, until after write() returns and we
>face the same writeback issues.
>
>So I reckon you're safe from space-related issues if you're not on NFS
>(and whyyy would you do that?) and not thinly provisioned. I'm sure
>there are other corner cases, but I don't see any reason to expect
>space-exhaustion-related corruption problems on a sensible FS backed
>by a sensible block device. I haven't tested things like quotas,
>verified how reliable space reservation is under concurrency, etc as
>yet.

How's that not solved by pre-zeroing and/or fallocate as I suggested above?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


On 10 April 2018 at 08:41, Andreas Karlsson <andreas@proxel.se> wrote:
> On 04/09/2018 02:16 PM, Craig Ringer wrote:
>>
>> I'd like a middle ground where the kernel lets us register our interest
>> and tells us if it lost something, without us having to keep eight million
>> FDs open for some long period. "Tell us about anything that happens under
>> pgdata/" or an inotify-style per-directory-registration option. I'd even say
>> that's ideal.
>
>
> Could there be a risk of a race condition here where fsync incorrectly
> returns success before we get the notification that something went wrong?

We'd examine the notification queue only once all our checkpoint
fsync()s had succeeded, and before we updated the control file to
advance the redo position.
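That ordering can be sketched like so (all names hypothetical;
drain_error_queue stands in for whatever kernel notification interface
we might get):

```python
import os

# Sketch of the ordering described above (all names hypothetical).
# The control file only advances once every fsync() has succeeded AND
# the (hypothetical) kernel error-notification queue is empty, so a
# late-arriving writeback error can never be masked by an earlier
# fsync() "success".
def complete_checkpoint(fds, drain_error_queue, update_control_file):
    for fd in fds:
        os.fsync(fd)                 # any failure aborts the checkpoint
    errors = drain_error_queue()     # checked AFTER all fsyncs succeed
    if errors:
        raise RuntimeError("writeback errors detected: %r" % (errors,))
    update_control_file()            # safe to advance the redo position
```

So long as the kernel delivers the notification no later than it clears
the error state fsync() would have seen, there's no window for the race
Andreas describes.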

I'm intrigued by the suggestion upthread of using a kprobe or similar
to achieve this. It's a horrifying unportable hack that'd make kernel
people cry, and I don't know if we have any way to flush buffered
probe data to be sure we really get the news in time, but it's a cool
idea too.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On Mon, Apr 09, 2018 at 03:02:11PM -0400, Robert Haas wrote:
> Another consequence of this behavior is that initdb -S is never reliable,
> so pg_rewind's use of it doesn't actually fix the problem it was
> intended to solve.  It also means that initdb itself isn't crash-safe,
> since the data file changes are made by the backend but initdb itself
> is doing the fsyncs, and initdb has no way of knowing what files the
> backend is going to create and therefore can't -- even theoretically
> -- open them first.

And pg_basebackup.  And pg_dump.  And pg_dumpall.  Anything using initdb
-S or fsync_pgdata would enter in those waters.
--
Michael

On 10 April 2018 at 13:04, Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Apr 09, 2018 at 03:02:11PM -0400, Robert Haas wrote:
>> Another consequence of this behavior is that initdb -S is never reliable,
>> so pg_rewind's use of it doesn't actually fix the problem it was
>> intended to solve.  It also means that initdb itself isn't crash-safe,
>> since the data file changes are made by the backend but initdb itself
>> is doing the fsyncs, and initdb has no way of knowing what files the
>> backend is going to create and therefore can't -- even theoretically
>> -- open them first.
>
> And pg_basebackup.  And pg_dump.  And pg_dumpall.  Anything using initdb
> -S or fsync_pgdata would enter in those waters.

... but *only if they hit an I/O error* or they're on a FS that
doesn't reserve space and hit ENOSPC.

It still does 99% of the job. It still flushes all buffers to
persistent storage and maintains write ordering. It may not detect and
report failures to the user the way we'd expect it to, yes, and that's
not great. But it's hardly throw-up-our-hands-and-give-up territory
either. Also, at least for initdb, we can make initdb fsync() its own
files before close(). Annoying but hardly the end of the world.
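The fsync-before-close discipline for initdb's own files might look
roughly like this (a sketch with a hypothetical helper, not initdb's
actual code):

```python
import os

# Sketch (hypothetical helper, not initdb's actual code): fsync the file
# before close() so any writeback error is reported to the process that
# wrote the data, rather than silently dropped after close(). Also fsync
# the containing directory so the entry itself is durable.
def write_file_durably(path, data):
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)                 # errors surface here, before close()
    finally:
        os.close(fd)
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)                # make the directory entry durable
    finally:
        os.close(dfd)
```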

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On Tue, Apr 10, 2018 at 01:37:19PM +0800, Craig Ringer wrote:
> On 10 April 2018 at 13:04, Michael Paquier <michael@paquier.xyz> wrote:
>> And pg_basebackup.  And pg_dump.  And pg_dumpall.  Anything using initdb
>> -S or fsync_pgdata would enter in those waters.
>
> ... but *only if they hit an I/O error* or they're on a FS that
> doesn't reserve space and hit ENOSPC.

Sure.

> It still does 99% of the job. It still flushes all buffers to
> persistent storage and maintains write ordering. It may not detect and
> report failures to the user how we'd expect it to, yes, and that's not
> great. But it's hardly throw up our hands and give up territory
> either. Also, at least for initdb, we can make initdb fsync() its own
> files before close(). Annoying but hardly the end of the world.

Well, I think that there is room for improving the reporting of failures
in file_utils.c for frontends, or at worst having an exit() for any kind
of critical failure, equivalent to a PANIC.
--
Michael

On 10 April 2018 at 14:10, Michael Paquier <michael@paquier.xyz> wrote:

> Well, I think that there is room for improving the reporting of failures
> in file_utils.c for frontends, or at worst having an exit() for any kind
> of critical failure, equivalent to a PANIC.

Yup.

In the meantime, speaking of PANIC, here's a first-cut patch to
make Pg panic on fsync() failures. I need to do some closer review and
testing, but it's presented here for anyone interested.

I intentionally left some failures as ERROR not PANIC, where the
entire operation is done as a unit, and an ERROR will cause us to
retry the whole thing.

For example, when we fsync() a temp file before we move it into place,
there's no point panicking on failure, because we'll discard the temp
file on ERROR and retry the whole thing.

I've verified that it works as expected with some modifications to the
test tool I've been using (pushed).

The main downside is that if we panic in redo, we don't try again. We
throw our toys and shut down. But arguably if we get the same I/O
error again in redo, that's the right thing to do anyway, and quite
likely safer than continuing to ERROR on checkpoints indefinitely.
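For illustration, the unsafe retry pattern the patch is aimed at, next
to the panic-on-failure policy it adopts, might be sketched like so
(a sketch only; neither function is the real checkpointer code):

```python
import os

# Sketch only (neither function is real PostgreSQL code).
# On Linux, a failed fsync() clears the per-file error state, so the
# retry loop below can report success even though writes from before
# the first, failed fsync() never reached disk.
def unsafe_checkpoint_fsync(fd, retries=3):
    for _ in range(retries):
        try:
            os.fsync(fd)
            return True              # may be a false success after EIO
        except OSError:
            continue                 # WRONG: the error is now cleared
    return False

# The policy argued for in this thread: treat fsync() failure as fatal,
# forcing crash recovery to replay WAL from the last good checkpoint.
def checkpoint_fsync_or_panic(fd):
    try:
        os.fsync(fd)
    except OSError as e:
        raise SystemExit("PANIC: fsync failed: %s" % e)
```

The panic variant never advances the redo position past data that may
not be on disk; the retry variant can.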

Patch attached.

To be clear, this patch only deals with the issue of us retrying
fsyncs when it turns out to be unsafe. This does NOT address any of
the issues where we won't find out about writeback errors at all.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

On Mon, Apr 9, 2018 at 3:13 PM, Andres Freund <andres@anarazel.de> wrote:
> Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
> absurd, as are some of the proposed ways this is all supposed to
> work. But I think the case we're discussing is much closer to a near
> irresolvable corner case than anything else.

Well, I admit that I wasn't entirely serious about that email, but I
wasn't entirely not-serious either.  If you can't reliably find
out whether the contents of the file on disk are the same as the
contents that the kernel is giving you when you call read(), then you
are going to have a heck of a time building a reliable system.  If the
kernel developers are determined to insist on these semantics (and,
admittedly, I don't know whether that's the case - I've only read
Anthony's remarks), then I don't really see what we can do except give
up on buffered I/O (or on Linux).

> We're talking about the storage layer returning an irresolvable
> error. You're hosed even if we report it properly.  Yes, it'd be nice if
> we could report it reliably.  But that doesn't change the fact that what
> we're doing is ensuring that data is safely fsynced unless storage
> fails, in which case it's not safely fsynced anyway.

I think that reliable error reporting is more than "nice" -- I think
it's essential.  The only argument for the current Linux behavior that
has been so far advanced on this thread, at least as far as I can see,
is that if it kept retrying the buffers forever, it would be pointless
and might run the machine out of memory, so we might as well discard
them.  But previous comments have already illustrated that the kernel
is not really up against a wall there -- it could put individual
inodes into a permanent failure state when it discards their dirty
data, as you suggested, or it could do what others have suggested, and
what I think is better, which is to put the whole filesystem into a
permanent failure state that can be cleared by remounting the FS.
That could be done on an as-needed basis -- if the number of dirty
buffers you're holding onto for some filesystem becomes too large, put
the filesystem into infinite-fail mode and discard them all.  That
behavior would be pretty easy for administrators to understand and
would resolve the entire problem here provided that no PostgreSQL
processes survived the eventual remount.

I also don't really know what we mean by an "unresolvable" error.  If
the drive is beyond all hope, then it doesn't really make sense to
talk about whether the database stored on it is corrupt.  In general
we can't be sure that we'll even get an error - e.g. the system could
be idle and the drive could be on fire.  Maybe this is the case you
meant by "it'd be nice if we could report it reliably".  But at least
in my experience, that's typically not what's going on.  You get some
I/O errors and so you remount the filesystem, or reboot, or rebuild
the array, or ... something.  And then the errors go away and, at that
point, you want to run recovery and continue using your database.  In
this scenario, it matters *quite a bit* what the error reporting was
like during the period when failures were occurring.  In particular,
if the database was allowed to think that it had successfully
checkpointed when it didn't, you're going to start recovery from the
wrong place.

I'm going to shut up now because I'm telling you things that you
obviously already know, but this doesn't sound like a "near
irresolvable corner case".  When the storage goes bonkers, either
PostgreSQL and the kernel can interact in such a way that a checkpoint
can succeed without all of the relevant data getting persisted, or
they don't.  It sounds like right now they do, and I'm not really
clear that we have a reasonable idea how to fix that.  It does not
sound like a PANIC is sufficient.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Tue, Apr 10, 2018 at 1:37 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> ... but *only if they hit an I/O error* or they're on a FS that
> doesn't reserve space and hit ENOSPC.
>
> It still does 99% of the job. It still flushes all buffers to
> persistent storage and maintains write ordering. It may not detect and
> report failures to the user the way we'd expect it to, yes, and that's
> not great. But it's hardly throw-up-our-hands-and-give-up territory
> either. Also, at least for initdb, we can make initdb fsync() its own
> files before close(). Annoying but hardly the end of the world.

I think we'd need every child postgres process started by initdb to do
that individually, which I suspect would slow down initdb quite a lot.
Now admittedly for anybody other than a PostgreSQL developer that's
only a minor issue, and our regression tests mostly run with fsync=off
anyway.  But I have a strong suspicion that our assumptions about how
fsync() reports errors are baked into an awful lot of parts of the
system, and by the time we finish unbaking them I think it's going to be
really surprising if we haven't done real harm to overall system
performance.

BTW, I took a look at the MariaDB source code to see whether they've
got this problem too and it sure looks like they do.
os_file_fsync_posix() retries the fsync in a loop with a 0.2 second
sleep after each retry.  It warns after 100 failures and fails an
assertion after 1000 failures.  It is hard to understand why they
would have written the code this way unless they expect errors
reported by fsync() to continue being reported until the underlying
condition is corrected.  But, it looks like they wouldn't have the
problem that we do with trying to reopen files to fsync() them later
-- I spot checked a few places where this code is invoked and in all
of those it looks like the file is already expected to be open.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Hi Robert,

On Tue, Apr 10, 2018 at 11:15:46AM -0400, Robert Haas wrote:
> On Mon, Apr 9, 2018 at 3:13 PM, Andres Freund <andres@anarazel.de> wrote:
> > Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
> > absurd, as are some of the proposed ways this is all supposed to
> > work. But I think the case we're discussing is much closer to a near
> > irresolvable corner case than anything else.
> 
> Well, I admit that I wasn't entirely serious about that email, but I
> wasn't entirely not-serious either.  If you can't reliably find
> out whether the contents of the file on disk are the same as the
> contents that the kernel is giving you when you call read(), then you
> are going to have a heck of a time building a reliable system.  If the
> kernel developers are determined to insist on these semantics (and,
> admittedly, I don't know whether that's the case - I've only read
> Anthony's remarks), then I don't really see what we can do except give
> up on buffered I/O (or on Linux).

I think it would be interesting to get in touch with some of the
respective linux kernel maintainers and open up this topic for
more detailed discussions. LSF/MM'18 is upcoming and it would
have been the perfect opportunity but it's past the CFP deadline.
It may still be worth contacting the organizers to bring forward
the issue, and see if there is a chance to have someone from
Pg invited for further discussions.

Best regards,
Anthony


On 9 April 2018 at 11:50, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote:
>> On 8 April 2018 at 22:47, Anthony Iliopoulos <ailiop@altatus.com> wrote:

> To make things a bit simpler, let us focus on EIO for the moment.
> The contract between the block layer and the filesystem layer is
> assumed to be that of, when an EIO is propagated up to the fs,
> then you may assume that all possibilities for recovering have
> been exhausted in lower layers of the stack.

Well Postgres is using the filesystem. The interface between the block
layer and the filesystem may indeed need to be more complex, I
wouldn't know.

But I don't think "all possibilities" is a very useful concept.
Neither layer here is going to be perfect. They can only promise that
all possibilities that have actually been implemented have been
exhausted. And even among those only to the degree they can be done
automatically within the engineering tradeoffs and constraints. There
will always be cases like thin provisioned devices that an operator
can expand, or degraded raid arrays that can be repaired after a long
operation, and so on. A network device can't be sure whether a remote
server will eventually come back or not, and may have to be reconfigured
by a human or a system automation tool to point to the new server or new
network configuration.

> Right. This implies though that apart from the kernel having
> to keep around the dirtied-but-unrecoverable pages for an
> unbounded time, that there's further an interface for obtaining
> the exact failed pages so that you can read them back.

No, the interface we have is fsync which gives us that information
with the granularity of a single file. The database could in theory
recognize that fsync is not completing on a file and read that file
back and write it to a new file. More likely we would implement a
feature Oracle has of writing key files to multiple devices. But
currently in practice that's not what would happen, what would happen
would be a human would recognize that the database has stopped being
able to commit and there are hardware errors in the log and would stop
the database, take a backup, and restore onto a new working device.
The current interface is that there's one error and then Postgres
would pretty much have to say, "sorry, your database is corrupt and
the data is gone, restore from your backups". Which is pretty dismal.

> There is a clear responsibility of the application to keep
> its buffers around until a successful fsync(). The kernels
> do report the error (albeit with all the complexities of
> dealing with the interface), at which point the application
> may not assume that the write()s where ever even buffered
> in the kernel page cache in the first place.

Postgres cannot just store the entire database in RAM. It writes
things to the filesystem all the time. It calls fsync only when it
needs a write barrier to ensure consistency. That's only frequent on
the transaction log to ensure it's flushed before data modifications
and then periodically to checkpoint the data files. The amount of data
written between checkpoints can be arbitrarily large and Postgres has
no idea how much memory is available as filesystem buffers or how much
i/o bandwidth is available, or what other memory pressure there is. What
you're suggesting is that the application should have to babysit the
filesystem buffer cache and reimplement all of it in user-space
because the filesystem is free to throw away any data any time it
chooses?

The current interface to throw away filesystem buffer cache is
unmount. It sounds like the kernel would like a more granular way to
discard just part of a device which makes a lot of sense in the age of
large network block devices. But I don't think just saying that the
filesystem buffer cache is now something every application needs to
re-implement in user-space really helps with that, they're going to
have the same problems to solve.

-- 
greg


On 10 April 2018 at 02:59, Craig Ringer <craig@2ndquadrant.com> wrote:

> Nitpick: In most cases the kernel reserves disk space immediately,
> before returning from write(). NFS seems to be the main exception
> here.

I'm kind of puzzled by this. Surely NFS servers store the data in the
filesystem using write(2) or the in-kernel equivalent? So if the
server is backed by a filesystem where write(2) preallocates space
surely the NFS server must behave as if it's preallocating as well? I
would expect NFS to provide basically the same set of possible
failures as the underlying filesystem (as long as you don't enable
nosync of course).

-- 
greg


-hackers,

I reached out to the Linux ext4 devs, here is tytso@mit.edu response:

"""
Hi Joshua,

This isn't actually an ext4 issue, but a long-standing VFS/MM issue.

There are going to be multiple opinions about what the right thing to
do.  I'll try to give as unbiased a description as possible, but
certainly some of this is going to be filtered by my own biases no
matter how careful I can be.

First of all, what storage devices will do when they hit an exception
condition is quite non-deterministic.  For example, the vast majority
of SSD's are not power fail certified.  What this means is that if
they suffer a power drop while they are doing a GC, it is quite
possible for data written six months ago to be lost as a result.  The
LBA could potentially be far, far away from any LBAs that were
recently written, and there could have been multiple CACHE FLUSH
operations since the LBA in question was last written six months
ago.  No matter; for a consumer-grade SSD, it's possible for
that LBA to be trashed after an unexpected power drop.

Which is why after a while, one can get quite paranoid and assume that
the only way you can guarantee data robustness is to store multiple
copies and/or use erasure encoding, with some of the copies or shards
written to geographically diverse data centers.

Secondly, I think it's fair to say that the vast majority of the
companies who require data robustness, and are either willing to pay
$$$ to an enterprise distro company like Red Hat, or command a large
enough paying customer base that they can afford to dictate terms to
an enterprise distro, or hire a consultant such as Christoph, or have
their own staffed Linux kernel teams, have tended to use O_DIRECT.  So
for better or for worse, there has not been as much investment in
buffered I/O and data robustness in the face of exception handling of
storage devices.

Next, the reason why fsync() has the behaviour that it does is that one
of the most common cases of I/O storage errors in buffered use
cases, certainly as seen by the community distros, is the user who
pulls out a USB stick while it is in use.  In that case, if there are
dirtied pages in the page cache, the question is what can you do?
Sooner or later the writes will time out, and if you leave the pages
dirty, then it effectively becomes a permanent memory leak.  You can't
unmount the file system --- that requires writing out all of the pages
such that the dirty bit is turned off.  And if you don't clear the
dirty bit on an I/O error, then they can never be cleaned.  You can't
even re-insert the USB stick; the re-inserted USB stick will get a new
block device.  Worse, when the USB stick was pulled, it will have
suffered a power drop, and see above about what could happen after a
power drop for non-power fail certified flash devices --- it goes
double for the cheap sh*t USB sticks found in the checkout aisle of
Micro Center.

So this is the explanation for why Linux handles I/O errors by
clearing the dirty bit after reporting the error up to user space.
And why there is no eagerness to solve the problem simply by "don't
clear the dirty bit".  For every one Postgres installation that might
have a better recovery after an I/O error, there's probably a thousand
clueless Fedora and Ubuntu users who will have a much worse user
experience after a USB stick pull happens.

I can think of things that could be done --- for example, it could be
switchable on a per-block device basis (or maybe a per-mount basis)
whether or not the dirty bit gets cleared after the error is reported
to userspace.  And perhaps there could be a new unmount flag that
causes all dirty pages to be wiped out, which could be used to recover
after a permanent loss of the block device.  But the question is who
is going to invest the time to make these changes?  If there is a
company who is willing to pay to commission this work, it's almost
certainly soluble.  Or if a company which has a kernel team on staff is
willing to direct an engineer to work on it, it certainly could be
solved.  But again, of the companies who have client code where we
care about robustness and proper handling of failed disk drives, and
which have a kernel team on staff, pretty much all of the ones I can
think of (e.g., Oracle, Google, etc.) use O_DIRECT and they don't try
to make buffered writes and error reporting via fsync(2) work well.

In general these companies want low-level control over buffer cache
eviction algorithms, which drives them towards the design decision of
effectively implementing the page cache in userspace, and using
O_DIRECT reads/writes.

If you are aware of a company who is willing to pay to have a new
kernel feature implemented to meet your needs, we might be able to
refer you to a company or a consultant who might be able to do that
work.  Let me know off-line if that's the case...

                    - Ted
"""

-- 
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
*****     Unless otherwise stated, opinions are my own.   *****


-hackers,

The thread is picking up over on the ext4 list. They don't update their 
archives as often as we do, so I can't link to the discussion. What 
would be the preferred method of sharing the info?

Thanks,

JD




On 04/10/2018 12:51 PM, Joshua D. Drake wrote:
> -hackers,
> 
> The thread is picking up over on the ext4 list. They don't update their 
> archives as often as we do, so I can't link to the discussion. What 
> would be the preferred method of sharing the info?

Thanks to Anthony for this link:

http://lists.openwall.net/linux-ext4/2018/04/10/33

It isn't quite real time but it keeps things close enough.

jD


> 
> Thanks,
> 
> JD
> 
> 




On Tue, 10 Apr 2018 17:40:05 +0200
Anthony Iliopoulos <ailiop@altatus.com> wrote:

> LSF/MM'18 is upcoming and it would
> have been the perfect opportunity but it's past the CFP deadline.
> It may still be worth contacting the organizers to bring forward
> the issue, and see if there is a chance to have someone from
> Pg invited for further discussions.

FWIW, it is my current intention to be sure that the development
community is at least aware of the issue by the time LSFMM starts.

The event is April 23-25 in Park City, Utah.  I bet that room could be
found for somebody from the postgresql community, should there be
somebody who would like to represent the group on this issue.  Let me
know if an introduction or advocacy from my direction would be helpful.

jon


On 10 April 2018 at 19:58, Joshua D. Drake <jd@commandprompt.com> wrote:
> You can't unmount the file system --- that requires writing out all of the pages
> such that the dirty bit is turned off.

I always wondered why Linux didn't implement umount -f. It's been in
BSD since forever and it's a major annoyance that it's missing in
Linux. Even without leaking memory it still leaks other resources,
causes confusion and awkward workarounds in UI and automation
software.

-- 
greg


Hi,

On 2018-04-11 06:05:27 -0600, Jonathan Corbet wrote:
> The event is April 23-25 in Park City, Utah.  I bet that room could be
> found for somebody from the postgresql community, should there be
> somebody who would like to represent the group on this issue.  Let me
> know if an introduction or advocacy from my direction would be helpful.

If that room can be found, I might be able to make it. Being in SF, I'm
probably the physically closest PG dev involved in the discussion.

Thanks for chiming in,

Andres


On Wed, 11 Apr 2018 07:29:09 -0700
Andres Freund <andres@anarazel.de> wrote:

> If that room can be found, I might be able to make it. Being in SF, I'm
> probably the physically closest PG dev involved in the discussion.

OK, I've dropped the PC a note; hopefully you'll be hearing from them.

jon


On Tue, Apr 10, 2018 at 05:54:40PM +0100, Greg Stark wrote:
> On 10 April 2018 at 02:59, Craig Ringer <craig@2ndquadrant.com> wrote:
> 
> > Nitpick: In most cases the kernel reserves disk space immediately,
> > before returning from write(). NFS seems to be the main exception
> > here.
> 
> I'm kind of puzzled by this. Surely NFS servers store the data in the
> filesystem using write(2) or the in-kernel equivalent? So if the
> server is backed by a filesystem where write(2) preallocates space
surely the NFS server must behave as if it's preallocating as well? I
> would expect NFS to provide basically the same set of possible
> failures as the underlying filesystem (as long as you don't enable
> nosync of course).

I don't think the write is _sent_ to the NFS server at the time of the
write, so while the NFS side would reserve the space, it might not get
the write request until after we return write success to the process.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


On Mon, Apr  9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:
> On 04/09/2018 12:29 AM, Bruce Momjian wrote:
> > 
> > A crazy idea would be to have a daemon that checks the logs and
> > stops Postgres when it sees something wrong.
> > 
> 
> That doesn't seem like a very practical way. It's better than nothing,
> of course, but I wonder how would that work with containers (where I
> think you may not have access to the kernel log at all). Also, I'm
> pretty sure the messages do change based on kernel version (and possibly
> filesystem) so parsing it reliably seems rather difficult. And we
> probably don't want to PANIC after I/O error on an unrelated device, so
> we'd need to understand which devices are related to PostgreSQL.

My more-considered crazy idea is to have a postgresql.conf setting like
archive_command that allows the administrator to specify a command that
will be run _after_ fsync but before the checkpoint is marked as
complete.  While we can have write flush errors before fsync and never
see the errors during fsync, we will not have write flush errors _after_
fsync that are associated with previous writes.

The script should check for I/O or space-exhaustion errors and return
failure in that case, in which case we can either stop, or stop and
crash-recover.  We could have an exit of 1 do the former, and an exit
of 2 do the latter.
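A minimal sketch of that hook in Python: after the checkpoint's fsyncs, run an administrator-supplied command and map its exit status onto an action. The exit-code protocol (0 = OK, 1 = stop, 2 = stop and crash-recover) follows the proposal above; the stand-in check script and the function names are hypothetical.

```python
import subprocess
import sys
import tempfile

# Proposed hook: after fsync but before marking the checkpoint complete,
# run an admin-supplied command and act on its exit status.  The codes
# follow the proposal in this mail (0 = OK, 1 = stop, 2 = stop and
# crash-recover); unknown codes are treated conservatively.
OK, STOP, STOP_AND_CRASH_RECOVER = 0, 1, 2

def run_checkpoint_check(command):
    status = subprocess.run(command).returncode
    if status == OK:
        return "complete-checkpoint"
    if status == STOP:
        return "stop"
    if status == STOP_AND_CRASH_RECOVER:
        return "stop-and-crash-recover"
    return "stop"  # unknown exit code: fail safe

# Stand-in for an admin script that would grep kernel logs for I/O
# errors or check free space; here it just exits with a fixed code.
script = tempfile.NamedTemporaryFile("w", suffix=".py", delete=False)
script.write("import sys; sys.exit(2)\n")
script.close()

action = run_checkpoint_check([sys.executable, script.name])
```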

Also, if we are relying on WAL, we have to make sure WAL is actually
safe with fsync, and I am betting only the O_DIRECT methods actually
are safe:

    #wal_sync_method = fsync                # the default is the first option
                                            # supported by the operating system:
                                            #   open_datasync
                                     -->    #   fdatasync (default on Linux)
                                     -->    #   fsync
                                     -->    #   fsync_writethrough
                                            #   open_sync

I am betting the marked wal_sync_method methods are not safe since there
is time between the write and fsync.
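The distinction can be sketched in Python: with the fdatasync/fsync-style methods the error surfaces (if at all) only at the later flush call, while an O_DSYNC open makes each write() itself block until the data is durable and report the result. This illustrates the semantics only, not PostgreSQL's actual WAL code; the getattr fallback is for platforms that lack O_DSYNC.

```python
import os
import tempfile

# With wal_sync_method = fdatasync/fsync there is a window between the
# write() and the later flush in which a writeback error can occur and
# (on pre-4.13 kernels) be swallowed.  With open_datasync the file is
# opened with O_DSYNC, so each write() returns only once the data is
# durable, and returns the error itself.  O_DSYNC availability varies
# by platform, hence the getattr fallback to plain buffered I/O here.
O_DSYNC = getattr(os, "O_DSYNC", 0)

def write_wal_fdatasync(path, record):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, record)  # may succeed even if writeback later fails
        os.fsync(fd)          # the error, if any, is reported only here
    finally:
        os.close(fd)

def write_wal_open_datasync(path, record):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND | O_DSYNC, 0o600)
    try:
        os.write(fd, record)  # with O_DSYNC, durable (or failed) on return
    finally:
        os.close(fd)

wal = os.path.join(tempfile.mkdtemp(), "wal")
write_wal_fdatasync(wal, b"rec1\n")
write_wal_open_datasync(wal, b"rec2\n")
```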

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com



On Mon, Apr  9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:
> On 04/09/2018 12:29 AM, Bruce Momjian wrote:
> > 
> > A crazy idea would be to have a daemon that checks the logs and
> > stops Postgres when it sees something wrong.
> > 
> 
> That doesn't seem like a very practical way. It's better than nothing,
> of course, but I wonder how would that work with containers (where I
> think you may not have access to the kernel log at all). Also, I'm
> pretty sure the messages do change based on kernel version (and possibly
> filesystem) so parsing it reliably seems rather difficult. And we
> probably don't want to PANIC after I/O error on an unrelated device, so
> we'd need to understand which devices are related to PostgreSQL.

Replying to your specific case, I am not sure how we would use a script
to check for I/O errors/space-exhaustion if the postgres user doesn't
have access to it.  Does O_DIRECT work in such container cases?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com



On 2018-04-17 17:29:17 -0400, Bruce Momjian wrote:
> Also, if we are relying on WAL, we have to make sure WAL is actually
> safe with fsync, and I am betting only the O_DIRECT methods actually
> are safe:
> 
>     #wal_sync_method = fsync                # the default is the first option
>                                             # supported by the operating system:
>                                             #   open_datasync
>                                      -->    #   fdatasync (default on Linux)
>                                      -->    #   fsync
>                                      -->    #   fsync_writethrough
>                                             #   open_sync
> 
> I am betting the marked wal_sync_method methods are not safe since there
> is time between the write and fsync.

Hm? That's not really the issue though? One issue is that retries are
not necessarily safe in buffered IO, the other that fsync might not
report an error if the fd was closed and opened.

O_DIRECT is only used if wal archiving or streaming isn't used, which
makes it pretty useless anyway.

Greetings,

Andres Freund


On 2018-04-17 17:32:45 -0400, Bruce Momjian wrote:
> On Mon, Apr  9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:
> > That doesn't seem like a very practical way. It's better than nothing,
> > of course, but I wonder how would that work with containers (where I
> > think you may not have access to the kernel log at all). Also, I'm
> > pretty sure the messages do change based on kernel version (and possibly
> > filesystem) so parsing it reliably seems rather difficult. And we
> > probably don't want to PANIC after I/O error on an unrelated device, so
> > we'd need to understand which devices are related to PostgreSQL.

You can certainly have access to the kernel log in containers. I'd
assume such a script wouldn't check various system logs but instead tail
/dev/kmsg or such. Otherwise the variance between installations would be
too big.

There's not *that* many different types of error messages and they don't
change that often. If we'd just detect errors for the most common FSs
we'd probably be good. Detecting a few general storage layer messages
wouldn't be that hard either, most things have been unified over the
last ~8-10 years.


> Replying to your specific case, I am not sure how we would use a script
> to check for I/O errors/space-exhaustion if the postgres user doesn't
> have access to it.

Not sure what you mean?

Space exhaustion can be checked when allocating space, FWIW. We'd just
need to use posix_fallocate et al.
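A sketch of that approach, using Python's wrapper for posix_fallocate(): the reservation either succeeds up front or fails with ENOSPC at allocation time, instead of surfacing during a later writeback. On platforms without it (e.g. macOS) this sketch falls back to a sparse ftruncate(), which does NOT actually reserve blocks.

```python
import errno
import os
import tempfile

# posix_fallocate() reserves disk space eagerly and reports ENOSPC at
# allocation time, instead of letting a later writeback fail.  It is
# not available everywhere (macOS lacks it), so this sketch falls back
# to ftruncate(), which only extends the file sparsely.
SEGMENT_SIZE = 16 * 1024 * 1024  # e.g. one 16MB WAL segment

def preallocate(fd, size):
    if hasattr(os, "posix_fallocate"):
        try:
            os.posix_fallocate(fd, 0, size)
            return "reserved"
        except OSError as e:
            if e.errno == errno.ENOSPC:
                raise  # out of space: fail the operation up front
            # fall through on EOPNOTSUPP etc.
    os.ftruncate(fd, size)  # sparse: space is NOT actually reserved
    return "sparse"

path = os.path.join(tempfile.mkdtemp(), "segment")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
mode = preallocate(fd, SEGMENT_SIZE)
os.close(fd)
```

As discussed later in the thread, even a successful reservation only binds the filesystem, not necessarily a thin-provisioned block layer underneath it.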


> Does O_DIRECT work in such container cases?

Yes.

Greetings,

Andres Freund


On Mon, Apr  9, 2018 at 12:25:33PM -0700, Peter Geoghegan wrote:
> On Mon, Apr 9, 2018 at 12:13 PM, Andres Freund <andres@anarazel.de> wrote:
> > Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
> > absurd, as is some of the proposed ways this is all supposed to
> > work. But I think the case we're discussing is much closer to a near
> > irresolvable corner case than anything else.
> 
> +1
> 
> > We're talking about the storage layer returning an irresolvable
> > error. You're hosed even if we report it properly.  Yes, it'd be nice if
> > we could report it reliably.  But that doesn't change the fact that what
> > we're doing is ensuring that data is safely fsynced unless storage
> > fails, in which case it's not safely fsynced anyway.
> 
> Right. We seem to be implicitly assuming that there is a big
> difference between a problem in the storage layer that we could in
> principle detect, but don't, and any other problem in the storage
> layer. I've read articles claiming that technologies like SMART are
> not really reliable in a practical sense [1], so it seems to me that
> there is reason to doubt that this gap is all that big.
> 
> That said, I suspect that the problems with running out of disk space
> are serious practical problems. I have personally scoffed at stories
> involving Postgres databases corruption that gets attributed to
> running out of disk space. Looks like I was dead wrong.

Yes, I think we need to look at user expectations here.

If the device has a hardware write error, it is true that it is good to
detect it, and it might be permanent or temporary, e.g. NAS/NFS.  The
longer the error persists, the more likely the user will expect
corruption.  However, right now, an outage of any length could cause
corruption, and it will not be reported in all cases.

Running out of disk space is also something you don't expect to corrupt
your database --- you expect it to only prevent future writes.  It seems
NAS/NFS and any thin provisioned storage will have this problem, and
again, not always reported.

So, our initial action might just be to educate users that write errors
can cause silent corruption, and out-of-space errors on NAS/NFS and any
thin provisioned storage can cause corruption.

Kernel logs (not just Postgres logs) should be monitored for these
issues and fail-over/recovering might be necessary.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com



On Tue, Apr 17, 2018 at 02:34:53PM -0700, Andres Freund wrote:
> On 2018-04-17 17:29:17 -0400, Bruce Momjian wrote:
> > Also, if we are relying on WAL, we have to make sure WAL is actually
> > safe with fsync, and I am betting only the O_DIRECT methods actually
> > are safe:
> > 
> >     #wal_sync_method = fsync                # the default is the first option
> >                                             # supported by the operating system:
> >                                             #   open_datasync
> >                                      -->    #   fdatasync (default on Linux)
> >                                      -->    #   fsync
> >                                      -->    #   fsync_writethrough
> >                                             #   open_sync
> > 
> > I am betting the marked wal_sync_method methods are not safe since there
> > is time between the write and fsync.
> 
> Hm? That's not really the issue though? One issue is that retries are
> not necessarily safe in buffered IO, the other that fsync might not
> report an error if the fd was closed and opened.

Well, we have been focusing on the delay between backend or
checkpoint writes and checkpoint fsyncs.  My point is that we have the
same problem in doing a write, _then_ fsync for the WAL.  Yes, the delay
is much shorter, but the issue still exists.  I realize that newer Linux
kernels will not have the problem since the file descriptor remains
open, but the problem exists with older, more common Linux kernels.

> O_DIRECT is only used if wal archiving or streaming isn't used, which
> makes it pretty useless anyway.

Uh, don't 'open_datasync' and 'open_sync' fsync as part of the
write, meaning we can't lose the error report like we can with the
others?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com



On 18 April 2018 at 05:19, Bruce Momjian <bruce@momjian.us> wrote:
> On Tue, Apr 10, 2018 at 05:54:40PM +0100, Greg Stark wrote:
>> On 10 April 2018 at 02:59, Craig Ringer <craig@2ndquadrant.com> wrote:
>>
>> > Nitpick: In most cases the kernel reserves disk space immediately,
>> > before returning from write(). NFS seems to be the main exception
>> > here.
>>
>> I'm kind of puzzled by this. Surely NFS servers store the data in the
>> filesystem using write(2) or the in-kernel equivalent? So if the
>> server is backed by a filesystem where write(2) preallocates space
>> surely the NFS server must behave as if it's preallocating as well? I
>> would expect NFS to provide basically the same set of possible
>> failures as the underlying filesystem (as long as you don't enable
>> nosync of course).
>
> I don't think the write is _sent_ to the NFS server at the time of the
> write, so while the NFS side would reserve the space, it might not get
> the write request until after we return write success to the process.

It should be sent if you're using sync mode.

From my reading of the docs, if you're using async mode you're already
open to so many potential corruptions you might as well not bother.

I need to look into this more re NFS and expand the tests I have to
cover that properly.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On 10 April 2018 at 20:15, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 10 April 2018 at 14:10, Michael Paquier <michael@paquier.xyz> wrote:
>
>> Well, I think that there is place for improving reporting of failure
>> in file_utils.c for frontends, or at worst have an exit() for any kind
>> of critical failures equivalent to a PANIC.
>
> Yup.
>
> In the mean time, speaking of PANIC, here's the first cut patch to
> make Pg panic on fsync() failures. I need to do some closer review and
> testing, but it's presented here for anyone interested.
>
> I intentionally left some failures as ERROR not PANIC, where the
> entire operation is done as a unit, and an ERROR will cause us to
> retry the whole thing.
>
> For example, when we fsync() a temp file before we move it into place,
> there's no point panicing on failure, because we'll discard the temp
> file on ERROR and retry the whole thing.
>
> I've verified that it works as expected with some modifications to the
> test tool I've been using (pushed).
>
> The main downside is that if we panic in redo, we don't try again. We
> throw our toys and shut down. But arguably if we get the same I/O
> error again in redo, that's the right thing to do anyway, and quite
> likely safer than continuing to ERROR on checkpoints indefinitely.
>
> Patch attached.
>
> To be clear, this patch only deals with the issue of us retrying
> fsyncs when it turns out to be unsafe. This does NOT address any of
> the issues where we won't find out about writeback errors at all.

Thinking about this some more, it'll definitely need a GUC to force it
to continue despite a potential hazard. Otherwise we go backwards from
the status quo if we're in a position where uptime is vital and
correctness problems can be tolerated or repaired later. Kind of like
zero_damaged_pages, we'll need some sort of
continue_after_fsync_errors .

Without that, we'll panic once, enter redo, and if the problem
persists we'll panic in redo and exit the startup process. That's not
going to help users.

I'll amend the patch accordingly as time permits.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/


On Wed, Apr 18, 2018 at 06:04:30PM +0800, Craig Ringer wrote:
> On 18 April 2018 at 05:19, Bruce Momjian <bruce@momjian.us> wrote:
> > On Tue, Apr 10, 2018 at 05:54:40PM +0100, Greg Stark wrote:
> >> On 10 April 2018 at 02:59, Craig Ringer <craig@2ndquadrant.com> wrote:
> >>
> >> > Nitpick: In most cases the kernel reserves disk space immediately,
> >> > before returning from write(). NFS seems to be the main exception
> >> > here.
> >>
> >> I'm kind of puzzled by this. Surely NFS servers store the data in the
> >> filesystem using write(2) or the in-kernel equivalent? So if the
> >> server is backed by a filesystem where write(2) preallocates space
> >> surely the NFS server must behave as if it's preallocating as well? I
> >> would expect NFS to provide basically the same set of possible
> >> failures as the underlying filesystem (as long as you don't enable
> >> nosync of course).
> >
> > I don't think the write is _sent_ to the NFS server at the time of the
> > write, so while the NFS side would reserve the space, it might not get
> > the write request until after we return write success to the process.
> 
> It should be sent if you're using sync mode.
> 
> From my reading of the docs, if you're using async mode you're already
> open to so many potential corruptions you might as well not bother.
> 
> I need to look into this more re NFS and expand the tests I have to
> cover that properly.

So, if sync mode passes the write to NFS, and NFS pre-reserves write
space, and throws an error on reservation failure, that means that NFS
will not corrupt a cluster on out-of-space errors.

So, what about thin provisioning?  I can understand sharing _free_ space
among file systems, but once a write arrives I assume it reserves the
space.  Is the problem that many thin provisioning systems don't have a
sync mode, so you can't force the write to appear on the device before
an fsync?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com



On Tue, Apr 17, 2018 at 02:41:42PM -0700, Andres Freund wrote:
> On 2018-04-17 17:32:45 -0400, Bruce Momjian wrote:
> > On Mon, Apr  9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:
> > > That doesn't seem like a very practical way. It's better than nothing,
> > > of course, but I wonder how would that work with containers (where I
> > > think you may not have access to the kernel log at all). Also, I'm
> > > pretty sure the messages do change based on kernel version (and possibly
> > > filesystem) so parsing it reliably seems rather difficult. And we
> > > probably don't want to PANIC after I/O error on an unrelated device, so
> > > we'd need to understand which devices are related to PostgreSQL.
> 
> You can certainly have access to the kernel log in containers. I'd
> assume such a script wouldn't check various system logs but instead tail
> /dev/kmsg or such. Otherwise the variance between installations would be
> too big.

I was thinking 'dmesg', but the result is similar.

> There's not *that* many different types of error messages and they don't
> change that often. If we'd just detect errors for the most common FSs
> we'd probably be good. Detecting a few general storage layer messages
> wouldn't be that hard either, most things have been unified over the
> last ~8-10 years.

It is hard to know exactly what the message formats are for each
operating system because they are hard to generate on demand, and we
would need to filter based on Postgres devices.

The other issue is that once you see a message during a checkpoint and
exit, you don't want to see that message again after the problem has
been fixed and the server restarted.  The simplest solution is to save
the output of the last check and look for only new entries.  I am
attaching a script I run every 15 minutes from cron that emails me any
unexpected kernel messages.
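The save-the-last-output approach might look like this sketch: keep the previously seen log, and on each run report only lines that are both new and match known I/O-error patterns. The patterns are illustrative only, since real kernel messages vary by version and filesystem, and a real tool would also have to cope with the dmesg ring buffer wrapping.

```python
import re

# Sketch of the monitoring approach described above: diff the current
# kernel log against the previously saved copy and alert only on new
# lines matching I/O-error patterns.  The patterns are examples, not an
# exhaustive list; real messages differ across kernels and filesystems.
ERROR_PATTERNS = [
    re.compile(r"I/O error"),
    re.compile(r"Buffer I/O error on device"),
    re.compile(r"(XFS|EXT4-fs).*(error|corrupt)", re.IGNORECASE),
]

def new_errors(previous_log, current_log):
    seen = set(previous_log.splitlines())
    fresh = [l for l in current_log.splitlines() if l not in seen]
    return [l for l in fresh if any(p.search(l) for p in ERROR_PATTERNS)]

old = "kernel: usb 1-1: new device\n"
new = old + "kernel: Buffer I/O error on device sdb1, logical block 2\n"
alerts = new_errors(old, new)
```

A restart after the underlying problem is fixed then produces no alerts, since the offending lines are already in the saved copy.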

I am thinking we would need a contrib module with sample scripts for
various operating systems.

> > Replying to your specific case, I am not sure how we would use a script
> > to check for I/O errors/space-exhaustion if the postgres user doesn't
> > have access to it.
> 
> Not sure what you mean?
> 
> Space exhaustiion can be checked when allocating space, FWIW. We'd just
> need to use posix_fallocate et al.

I was asking about cases where permissions prevent viewing of kernel
messages.  I think you can view them in containers, but in virtual
machines you might not have access to the host operating system's kernel
messages, and that might be where they are.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com


On 18 April 2018 at 19:46, Bruce Momjian <bruce@momjian.us> wrote:

> So, if sync mode passes the write to NFS, and NFS pre-reserves write
> space, and throws an error on reservation failure, that means that NFS
> will not corrupt a cluster on out-of-space errors.

Yeah. I need to verify in a concrete test case.

The thing is that write() is allowed to be asynchronous anyway. Most
file systems choose to implement eager reservation of space, but it's
not mandated. AFAICS that's largely a historical accident to keep
applications happy, because FSes used to *allocate* the space at
write() time too, and when they moved to delayed allocations, apps
tended to break too easily unless they at least reserved space. NFS
would have to do a round-trip on write() to reserve space.

The Linux man pages (http://man7.org/linux/man-pages/man2/write.2.html) say:

"
       A successful return from write() does not make any guarantee that
       data has been committed to disk.  On some filesystems, including NFS,
       it does not even guarantee that space has successfully been reserved
       for the data.  In this case, some errors might be delayed until a
       future write(2), fsync(2), or even close(2).  The only way to be sure
       is to call fsync(2) after you are done writing all your data.
"

... and I'm inclined to believe it when it refuses to make guarantees.
Especially lately.

> So, what about thin provisioning?  I can understand sharing _free_ space
> among file systems

Most thin provisioning is done at the block level, not file system
level. So the FS is usually unaware it's on a thin-provisioned volume.
Usually the whole kernel is unaware, because the thin provisioning is
done on the SAN end or by a hypervisor. But the same sort of thing may
be done via LVM - see lvmthin. For example, you may make 100 different
1TB ext4 FSes, each on 1TB iSCSI volumes backed by SAN with a total of
50TB of concrete physical capacity. The SAN is doing block mapping and
only allocating storage chunks to a given volume when the FS has
written blocks to every previous free block in the previous storage
chunk. It may also do things like block de-duplication, compression of
storage chunks that aren't written to for a while, etc.

The idea is that when the SAN's actual physically allocated storage
gets to 40TB it starts telling you to go buy another rack of storage
so you don't run out. You don't have to resize volumes, resize file
systems, etc. All the storage space admin is centralized on the SAN
and storage team, and your sysadmins, DBAs and app devs are none the
wiser. You buy storage when you need it, not when the DBA demands they
need a 200% free space margin just in case. Whether or not you agree
with this philosophy or think it's sensible is kind of moot, because
it's an extremely widespread model, and servers you work on may well
be backed by thin provisioned storage _even if you don't know it_.

Think of it as a bit like VM overcommit, for storage. You can malloc()
as much memory as you like and everything's fine until you try to
actually use it. Then you go to dirty a page, no free pages are
available, and *boom*.

The thing is, the SAN (or LVM) doesn't have any idea about the FS's
internal in-memory free space counter and its space reservations. Nor
does it understand any FS metadata. All it cares about is "has this
LBA ever been written to by the FS?". If so, it must make sure backing
storage for it exists. If not, it won't bother.

Most FSes only touch the blocks on dirty writeback, or sometimes
lazily as part of delayed allocation. So if your SAN is running out of
space and there's 100MB free, each of your 100 FSes may have
decremented its freelist by 2MB and be happily promising more space to
apps on write() because, well, as far as they know they're only 50%
full. When they all do dirty writeback and flush to storage, kaboom,
there's nowhere to put some of the data.

I don't know if posix_fallocate is a sufficient safeguard either.
You'd have to actually force writes to each page through to the
backing storage to know for sure the space existed. Yes, the docs say

"
       After a
       successful call to posix_fallocate(), subsequent writes to bytes in
       the specified range are guaranteed not to fail because of lack of
       disk space.
"

... but they're speaking from the filesystem's perspective. If the FS
doesn't dirty and flush the actual blocks, a thin provisioned storage
system won't know.
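Forcing the writes through, as described, might look like this sketch; it's essentially what dd'ing zeros over a fresh volume does: dirty every block and fsync, so the provisioning layer has to materialize backing storage for each LBA.

```python
import os
import tempfile

# On thin-provisioned storage the only way to be sure space is really
# backed is to dirty every block and flush it, dd-style: the
# provisioning layer allocates a chunk only once an LBA in it has
# actually been written.
BLOCK = 1024 * 1024  # write in 1MB chunks

def force_allocate(path, size):
    zeros = b"\0" * BLOCK
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(remaining, BLOCK)
            f.write(zeros[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # push the writes to the backing device

path = os.path.join(tempfile.mkdtemp(), "prealloc")
force_allocate(path, 3 * BLOCK + 123)
```

Note that a deduplicating or compressing thin-provisioning layer may still collapse all-zero blocks, so even this is not an airtight guarantee there.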

It's reasonable enough to throw up our hands in this case and say
"your setup is crazy, you're breaking the rules, don't do that". The
truth is they AREN'T breaking the rules, but we can disclaim support
for such configurations anyway.

After all, we tell people not to use Linux's VM overcommit too. How's
that working for you? I see it enabled on the great majority of
systems I work with, and some people are very reluctant to turn it off
because they don't want to have to add swap.

If someone has a 50TB SAN and wants to allow for unpredictable space
use expansion between various volumes, and we say "you can't do that,
go buy a 100TB SAN instead" ... that's not going to go down too well
either. Often we can actually say "make sure the 5TB volume PostgreSQL
is using is eagerly provisioned, and expand it at need using online
resize if required. We don't care about the rest of the SAN.".

I guarantee you that when you create a 100GB EBS volume on AWS EC2,
you don't get 100GB of storage preallocated. AWS are probably pretty
good about not running out of backing store, though.


There _are_ file systems optimised for thin provisioning, etc, too.
But that's more commonly done by having them do things like zero
deallocated space so the thin provisioning system knows it can return
it to the free pool, and now things like DISCARD provide much of that
signalling in a standard way.



-- 
 Craig Ringer                   http://www.2ndQuadrant.com/


On 19/04/18 00:45, Craig Ringer wrote:

>
> I guarantee you that when you create a 100GB EBS volume on AWS EC2,
> you don't get 100GB of storage preallocated. AWS are probably pretty
> good about not running out of backing store, though.
>
>

Some db folks (used to anyway) advise dd'ing to your freshly attached 
devices on AWS (for performance mainly IIRC), but that would help 
prevent some failure scenarios for any thin provisioned storage (but 
probably really annoy the admins thereof).

regards
Mark


On 19 April 2018 at 07:31, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
> On 19/04/18 00:45, Craig Ringer wrote:
>
>>
>> I guarantee you that when you create a 100GB EBS volume on AWS EC2,
>> you don't get 100GB of storage preallocated. AWS are probably pretty
>> good about not running out of backing store, though.
>>
>>
>
> Some db folks (used to anyway) advise dd'ing to your freshly attached
> devices on AWS (for performance mainly IIRC), but that would help prevent
> some failure scenarios for any thin provisioned storage (but probably really
> annoy the admins thereof).

This still makes a lot of sense on AWS EBS, particularly when using a
volume created from a non-empty snapshot. Performance of S3-snapshot
based EBS volumes is spectacularly awful, since they're copy-on-read.
Reading the whole volume helps a lot.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/


On Wed, Apr 18, 2018 at 08:45:53PM +0800, Craig Ringer wrote:
> On 18 April 2018 at 19:46, Bruce Momjian <bruce@momjian.us> wrote:
> 
> > So, if sync mode passes the write to NFS, and NFS pre-reserves write
> > space, and throws an error on reservation failure, that means that NFS
> > will not corrupt a cluster on out-of-space errors.
> 
> Yeah. I need to verify in a concrete test case.

Thanks.

> The thing is that write() is allowed to be asynchronous anyway. Most
> file systems choose to implement eager reservation of space, but it's
> not mandated. AFAICS that's largely a historical accident to keep
> applications happy, because FSes used to *allocate* the space at
> write() time too, and when they moved to delayed allocations, apps
> tended to break too easily unless they at least reserved space. NFS
> would have to do a round-trip on write() to reserve space.
> 
> The Linux man pages (http://man7.org/linux/man-pages/man2/write.2.html) say:
> 
> "
>        A successful return from write() does not make any guarantee that
>        data has been committed to disk.  On some filesystems, including NFS,
>        it does not even guarantee that space has successfully been reserved
>        for the data.  In this case, some errors might be delayed until a
>        future write(2), fsync(2), or even close(2).  The only way to be sure
>        is to call fsync(2) after you are done writing all your data.
> "
> 
> ... and I'm inclined to believe it when it refuses to make guarantees.
> Especially lately.

Uh, even calling fsync() after write() isn't 100% safe, since the kernel
could have flushed the dirty pages to storage, failed, and a later
fsync() would succeed anyway.  I realize newer kernels have that fixed
for files held open across the operation, but those are a minority of
installs.

> The idea is that when the SAN's actual physically allocated storage
> gets to 40TB it starts telling you to go buy another rack of storage
> so you don't run out. You don't have to resize volumes, resize file
> systems, etc. All the storage space admin is centralized on the SAN
> and storage team, and your sysadmins, DBAs and app devs are none the
> wiser. You buy storage when you need it, not when the DBA demands they
> need a 200% free space margin just in case. Whether or not you agree
> with this philosophy or think it's sensible is kind of moot, because
> it's an extremely widespread model, and servers you work on may well
> be backed by thin provisioned storage _even if you don't know it_.


> Most FSes only touch the blocks on dirty writeback, or sometimes
> lazily as part of delayed allocation. So if your SAN is running out of
> space and there's 100MB free, each of your 100 FSes may have
> decremented its freelist by 2MB and be happily promising more space to
> apps on write() because, well, as far as they know they're only 50%
> full. When they all do dirty writeback and flush to storage, kaboom,
> there's nowhere to put some of the data.

I see what you are saying --- that the kernel is reserving the write
space from its free space, but the free space doesn't all exist.  I am
not sure how we can tell people to make sure the file system free space
is real.

> You'd have to actually force writes to each page through to the
> backing storage to know for sure the space existed. Yes, the docs say
> 
> "
>        After a
>        successful call to posix_fallocate(), subsequent writes to bytes in
>        the specified range are guaranteed not to fail because of lack of
>        disk space.
> "
> 
> ... but they're speaking from the filesystem's perspective. If the FS
> doesn't dirty and flush the actual blocks, a thin provisioned storage
> system won't know.

Frankly, in what cases will a write fail _for_ lack of free space?  It
could be a new WAL file (not recycled), or pages added to the end of
the table.

Is that it?  It doesn't sound too terrible.  If we can eliminate the
corruption due to free-space exhaustion, it would be a big step
forward.

The next most common failure would be temporary storage failure or
storage communication failure.

Permanent storage failure is "game over" so we don't need to worry about
that.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Just for the record, I tried the test case with ZFS on an Ubuntu 17.10 host running ZFS on Linux 0.6.5.11.

ZFS does not swallow the fsync() error, but the system does not handle the error nicely: the test program hangs in fsync(), the load jumps up, and a bunch of z_wr_iss and z_null_int kernel threads belonging to ZFS eat up the CPU.

Even so, I managed to reboot the system, so it's not a complete and utter mess.


The test case adjustments are here: https://github.com/zejn/scrapcode/commit/e7612536c346d59a4b69bedfbcafbe8c1079063c


Kind regards,

Gasper

On 29. 03. 2018 07:25, Craig Ringer wrote:
> On 29 March 2018 at 13:06, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
>>> The retries are the source of the problem; the first fsync() can return EIO,
>>> and also *clears the error*, causing a 2nd fsync (of the same data) to return
>>> success.
>>
>> What I'm failing to grok here is how that error flag even matters,
>> whether it's a single bit or a counter as described in that patch.  If
>> write-back failed, *the page is still dirty*.  So all future calls to
>> fsync() need to try to flush it again, and (presumably) fail
>> again (unless it happens to succeed this time around).
>
> You'd think so. But it doesn't appear to work that way. You can see for
> yourself with the "error" device-mapper target mapped over part of a volume.
>
> I wrote a test case here.
>
> I don't pretend the kernel behaviour is sane. And it's possible I've made an
> error in my analysis. But since I've observed this in the wild, and seen it
> in a test case, I strongly suspect that what I've described is just what's
> happening, brain-dead or no.
>
> Presumably the kernel marks the page clean when it dispatches it to the I/O
> subsystem and doesn't dirty it again on I/O error? I haven't dug that deep
> on the kernel side. See the stackoverflow post for details on what I found
> in kernel code analysis.
>
> --
>  Craig Ringer                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services

Hi,

On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:
> TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at
> least on Linux. When fsync() returns success it means "all writes since the
> last fsync have hit disk" but we assume it means "all writes since the last
> SUCCESSFUL fsync have hit disk".

> But then we retried the checkpoint, which retried the fsync(). The retry
> succeeded, because the prior fsync() *cleared the AS_EIO bad page flag*.

Random other thing we should look at: some filesystems (NFS yes, xfs/
ext4 no) flush writes at close(2). We check close()'s return code, but
just log it... So close() counts as an fsync for such filesystems.

I'm at LSF/MM to discuss the future behaviour of Linux here, but that's
how it is right now.

Greetings,

Andres Freund


On Mon, Apr 23, 2018 at 01:14:48PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:
> > TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at
> > least on Linux. When fsync() returns success it means "all writes since the
> > last fsync have hit disk" but we assume it means "all writes since the last
> > SUCCESSFUL fsync have hit disk".
> 
> > But then we retried the checkpoint, which retried the fsync(). The retry
> > succeeded, because the prior fsync() *cleared the AS_EIO bad page flag*.
> 
> Random other thing we should look at: some filesystems (NFS yes, xfs/
> ext4 no) flush writes at close(2). We check close()'s return code, but
> just log it... So close() counts as an fsync for such filesystems.

Well, that's interesting.  You might remember that NFS does not reserve
space for writes like local file systems like ext4/xfs do.  For that
reason, we might be able to capture the out-of-space error on close and
exit sooner for NFS.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


On 24 April 2018 at 04:14, Andres Freund <andres@anarazel.de> wrote:

> I'm at LSF/MM to discuss the future behaviour of Linux here, but that's
> how it is right now.


Interim LWN.net coverage of that can be found here:
https://lwn.net/Articles/752613/

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On Tue, Apr 24, 2018 at 12:09 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Mon, Apr 23, 2018 at 01:14:48PM -0700, Andres Freund wrote:
>> Hi,
>>
>> On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:
>> > TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at
>> > least on Linux. When fsync() returns success it means "all writes since the
>> > last fsync have hit disk" but we assume it means "all writes since the last
>> > SUCCESSFUL fsync have hit disk".
>>
>> > But then we retried the checkpoint, which retried the fsync(). The retry
>> > succeeded, because the prior fsync() *cleared the AS_EIO bad page flag*.
>>
>> Random other thing we should look at: some filesystems (NFS yes, xfs/
>> ext4 no) flush writes at close(2). We check close()'s return code, but
>> just log it... So close() counts as an fsync for such filesystems.
>
> Well, that's interesting.  You might remember that NFS does not reserve
> space for writes like local file systems like ext4/xfs do.  For that
> reason, we might be able to capture the out-of-space error on close and
> exit sooner for NFS.

It seems like some implementations flush on close and therefore
discover the ENOSPC problem at that point, unless they have NFSv4 (RFC
3530) "write delegation" with a promise from the server that a certain
amount of space is available.  It seems like you can't count on that
in any way though, because it's the server that decides when to
delegate and how much space to promise is preallocated, not the
client.  So in userspace you always need to be able to handle errors
including ENOSPC returned by close(), and if you ignore that and
you're using an operating system that immediately incinerates all
evidence after telling you that (so that later fsync() doesn't fail),
you're in trouble.

Some relevant code:

https://github.com/torvalds/linux/commit/5445b1fbd123420bffed5e629a420aa2a16bf849
https://github.com/freebsd/freebsd/blob/master/sys/fs/nfsclient/nfs_clvnops.c#L618

It looks like the bleeding edge of the NFS spec includes a new
ALLOCATE operation that should be able to support posix_fallocate()
(if we were to start using that for extending files):

https://tools.ietf.org/html/rfc7862#page-64

I'm not sure how reliable [posix_]fallocate is on NFS in general,
though, and it seems that there are fall-back implementations of
posix_fallocate() that write zeros (or even just feign success?), which
probably won't do anything useful here if not also flushed (that
fallback strategy might only work on eager-reservation filesystems
that don't have direct fallocate support?).  So there are several layers
(libc, kernel, NFS client, NFS server) that'd need to be aligned for
that to work, and it's not clear how a humble userspace program is
supposed to know whether they are.

I guess if you could find a way to amortise the cost of extending
(like Oracle et al do by extending big container datafiles 10MB at a
time or whatever), then simply writing zeros and flushing when doing
that might work out OK, so you wouldn't need such a thing?  (Unless of
course it's a COW filesystem, but that's a different can of worms.)

-- 
Thomas Munro
http://www.enterprisedb.com


For archive readers, this thread is continued as
https://www.postgresql.org/message-id/20180427222842.in2e4mibx45zdth5@alap3.anarazel.de
and there's a follow-up lwn article at
https://lwn.net/Articles/752613/ too.