Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers

From: Craig Ringer
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Msg-id: CAMsr+YEp21emoQG_dEqgUxckGNgZ8oQfab0giK6XLTyFbRY02Q@mail.gmail.com
In response to: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS (Anthony Iliopoulos <ailiop@altatus.com>)
List: pgsql-hackers
On 3 April 2018 at 07:05, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> Hi Stephen,

> On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote:
>>
>> fsync() doesn't reflect the status of given pages, however, it reflects
>> the status of the file descriptor, and as such the file, on which it's
>> called.  This notion that fsync() is actually only responsible for the
>> changes which were made to a file since the last fsync() call is pure
>> foolishness.  If we were able to pass a list of pages or data ranges to
>> fsync() for it to verify they're on disk then perhaps things would be
>> different, but we can't, all we can do is ask to "please flush all the
>> dirty pages associated with this file descriptor, which represents this
>> file we opened, to disk, and let us know if you were successful."
>>
>> Give us a way to ask "are these specific pages written out to persistent
>> storage?" and we would certainly be happy to use it, and to repeatedly
>> try to flush out pages which weren't synced to disk due to some
>> transient error, and to track those cases and make sure that we don't
>> incorrectly assume that they've been transferred to persistent storage.

> Indeed fsync() is simply a rather blunt instrument and a narrow legacy
> interface but further changing its established semantics (no matter how
> unreasonable they may be) is probably not the way to go.

They're undocumented and extremely surprising semantics, and arguably a violation of the POSIX spec for fsync(), or at least an unexpectedly liberal interpretation of it.

So I don't buy this argument.

> It really turns out that this is not how the fsync() semantics work
> though, exactly because of the nature of the errors: even if the kernel
> retained the dirty bits on the failed pages, retrying persisting them
> on the same disk location would simply fail.

*might* simply fail.

It depends on why the error occurred.
 
I originally identified this behaviour on a multipath system. Multipath defaults to "throw the writes away, nobody really cares anyway" on error. It seems to figure a higher level will retry, or the application will receive the error and retry.

(See no_path_retry in multipath config. AFAICS the default is insanely dangerous and only suitable for specialist apps that understand the quirks; you should use no_path_retry=queue).
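
For reference, the stanza I mean looks roughly like this in /etc/multipath.conf (a sketch; check multipath.conf(5) on your distro for the exact syntax and scope):

    defaults {
        # queue I/O while paths are down instead of failing it, so a
        # failed writeback isn't silently thrown away
        no_path_retry queue
    }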

> Instead the kernel opts for marking those pages clean (since there is
> no other recovery strategy), and reporting once to the caller who can
> potentially deal with it in some manner. It is sadly a bad and
> undocumented convention.


It could mark the FD as failed instead.

It's not just undocumented, it's a slightly creative interpretation of the POSIX spec for fsync.
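
To make that concrete, here's a rough sketch of the failure sequence (illustrative only: it assumes some device that fails a single writeback, like a flaky multipath target as above; on a healthy disk both fsync()s will just succeed):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("datafile", O_WRONLY | O_CREAT, 0644);

        write(fd, "important", 9);    /* into the page cache; succeeds */

        if (fsync(fd) != 0)           /* writeback fails: EIO, reported once */
            perror("first fsync");

        /* The kernel has already marked the failed pages clean, so a
         * naive retry has nothing left to flush ... */
        if (fsync(fd) == 0)
            puts("second fsync 'succeeded'; the data may be gone");

        close(fd);
        return 0;
    }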
 
>> Consider two independent programs where the first one writes to a file
>> and then calls the second one whose job it is to go out and fsync(),
>> perhaps async from the first, those files.  Is the second program
>> supposed to go write to each page that the first one wrote to, in order
>> to ensure that all the dirty bits are set so that the fsync() will
>> actually return if all the dirty pages are written?

> I think what you have in mind are the semantics of sync() rather
> than fsync(), but as long as an application needs to ensure data
> are persisted to storage, it needs to retain those data in its heap
> until fsync() is successful instead of discarding them and relying
> on the kernel after write().

This is almost exactly what we tell application authors using PostgreSQL: the data isn't written until you receive a successful commit confirmation, so you'd better not forget it.

We provide applications with *clear boundaries* so they can know *exactly* what was, and was not, written. I guess the kernel's argument is that the same holds there: whatever was written since the last *successful* fsync is potentially lost and must be redone.

But the fsync behaviour is utterly undocumented and dubiously standard.
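
In application terms that discipline amounts to something like this (a sketch, my names; the recovery path is deliberately left to the caller):

    #include <unistd.h>

    /* Hypothetical helper: the caller keeps its copy of buf until
     * fsync() confirms the data reached storage. */
    int durable_append(int fd, const char *buf, size_t len)
    {
        if (write(fd, buf, len) != (ssize_t) len)
            return -1;

        if (fsync(fd) != 0) {
            /* Do NOT just fsync() again: the error was reported once
             * and the pages may already be marked clean.  The caller
             * still owns buf and must replay it somewhere else (a new
             * file, different storage, ...). */
            return -1;
        }
        return 0;    /* only now is it safe to discard buf */
    }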
 
> I think we are conflating a few issues here: having the OS kernel be
> responsible for error recovery (so that subsequent fsync() would fix
> the problems) is one. This clearly is a design which most kernels have
> not really adopted for reasons outlined above

[citation needed] 

What do other major platforms do here? The post above suggests it's a bit of a mix of behaviours.

> Now, there is the issue of granularity of error reporting: userspace
> could benefit from a fine-grained indication of failed pages (or file
> ranges).

Yep. I looked at AIO in the hopes that, if we used AIO, we'd be able to map a sync failure back to an individual AIO write.

But it seems AIO just adds more problems and fixes none. Flush behaviour with AIO, from what I can tell, is inconsistent from version to version and generally unhelpful. The kernel should really report such sync failures back to the app against the corresponding AIO write, but nothing of the sort seems to happen.
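
For the record, the shape of what I was hoping for looks like this sketch with Linux native AIO / libaio (untested; link with -laio). Per-write completions do carry an error code, but nothing guarantees a deferred writeback failure gets mapped back to the write it belongs to, and on many kernel versions io_submit() rejects the fsync iocb outright:

    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        static char buf[4096] = "important";
        int rc, fd = open("datafile", O_WRONLY | O_CREAT, 0644);

        io_setup(8, &ctx);

        /* submit one write and reap its completion */
        io_prep_pwrite(&cb, fd, buf, sizeof buf, 0);
        io_submit(ctx, 1, cbs);
        io_getevents(ctx, 1, 1, &ev, NULL);
        if ((long) ev.res < 0)        /* res is -errno on failure */
            fprintf(stderr, "write: %s\n", strerror(-(long) ev.res));

        /* ask for an async fsync; this is the unreliable part */
        io_prep_fdsync(&cb, fd);
        rc = io_submit(ctx, 1, cbs);
        if (rc < 0)                   /* e.g. -EINVAL: not supported */
            fprintf(stderr, "fdsync iocb rejected: %s\n", strerror(-rc));
        else {
            io_getevents(ctx, 1, 1, &ev, NULL);
            if ((long) ev.res < 0)
                fprintf(stderr, "fdsync: %s\n", strerror(-(long) ev.res));
        }

        io_destroy(ctx);
        close(fd);
        return 0;
    }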


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
