Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>
> We already have dirty_bytes and dirty_background_bytes, for example. I
> don't see why there couldn't be another limit defining how much dirty
> data to allow before blocking writes altogether. I'm sure it's not that
> simple, but you get the general idea - do not allow using all available
> memory because of writeback issues, but don't throw the data away in
> case it's just a temporary issue.

Sure, there could be knobs for limiting how much memory such "zombie"
pages may occupy. I am not sure how helpful that would be in the long
run, though, since the right limit is highly application-specific, and
for an application with a large data footprint one would end up tuning
it system-wide anyway. That has the potential to leave other
applications running on the same system with very little memory, for
example when the original application crashes and never clears the
error. Apart from that, further interfaces would need to be provided
for actually dealing with the error (again assuming non-transient
issues that cannot be fixed transparently, with temporary issues being
taken care of by lower layers of the stack).
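
For reference, the dirty_bytes and dirty_background_bytes limits
mentioned above are plain sysctls, so a hypothetical zombie-page limit
could presumably be exposed the same way. An untested sketch of setting
the existing ones from C (Linux-only, needs root; the 256 MiB / 64 MiB
values are arbitrary):

    #include <stdio.h>
    #include <stdlib.h>

    /* Write a byte limit into one of the /proc/sys/vm writeback knobs. */
    static void set_vm_knob(const char *knob, long long bytes)
    {
        char path[128];
        snprintf(path, sizeof(path), "/proc/sys/vm/%s", knob);
        FILE *f = fopen(path, "w");
        if (!f || fprintf(f, "%lld\n", bytes) < 0) {
            perror(path);
            exit(1);
        }
        fclose(f);
    }

    int main(void)
    {
        /* Block writers once 256 MiB of dirty data accumulates... */
        set_vm_knob("dirty_bytes", 256LL << 20);
        /* ...and start background writeback already at 64 MiB. */
        set_vm_knob("dirty_background_bytes", 64LL << 20);
        return 0;
    }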

> Well, there seem to be kernels that seem to do exactly that already. At
> least that's how I understand what this thread says about FreeBSD and
> Illumos, for example. So it's not an entirely insane design, apparently.

It is reasonable, but even FreeBSD has a big fat comment right there
(since 2017) noting that there can be no recovery from EIO at the
block layer and that this needs to be handled differently. I have no
idea how an application running on top of either FreeBSD or Illumos
would actually recover from this error (and clear it), other than
remounting the filesystem in order to force the relevant pages to be
dropped. It does, admittedly, provide a persistent error indication
that would allow Pg to simply and reliably panic. But again, this does
not necessarily play well with other applications that may be using
the same filesystem at the same time, and are now faced with EIO even
though their own writes were persisted successfully.
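
For illustration, the panic policy amounts to something like the
following (an untested sketch, not PostgreSQL's actual code). On a
kernel with a persistent error indication the abort is merely
redundant; on Linux it is the only safe reaction, because the error
state is consumed by the failing call and a retried fsync() may then
report success even though the dirty pages were dropped:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Flush one data file during a checkpoint. */
    static void checkpoint_fsync(int fd, const char *path)
    {
        if (fsync(fd) != 0) {
            /* Do not retry: a second fsync() may falsely succeed.
             * Crash instead and recover from the WAL. */
            fprintf(stderr, "fsync of %s failed: %s\n",
                    path, strerror(errno));
            abort();
        }
    }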

Ideally, you'd want a (potentially persistent) indication of error
localized to a file region (mapping the corresponding failed writeback
pages). NetBSD already implements fsync_range(2), which could be a
step in the right direction.
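
To sketch what that could look like for an application, here is an
untested example against NetBSD's fsync_range(2) signature (the flag
combination is illustrative):

    #include <errno.h>
    #include <inttypes.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Flush only the byte range backing one relation segment, so an
     * EIO can be attributed to that region rather than to the whole
     * file.  NetBSD-only. */
    static int sync_file_region(int fd, off_t start, off_t len)
    {
        /* FDATASYNC: sync data plus required metadata;
         * FDISKSYNC: also flush the drive's media cache. */
        if (fsync_range(fd, FDATASYNC | FDISKSYNC, start, len) != 0) {
            fprintf(stderr, "writeback failed in [%jd, %jd): errno %d\n",
                    (intmax_t)start, (intmax_t)(start + len), errno);
            return -1;
        }
        return 0;
    }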

> One has to wonder how many applications actually use this correctly,
> considering PostgreSQL cares about data durability/consistency so much
> and yet we've been misunderstanding how it works for 20+ years.

I would expect very few, probably only those with a very simple
process model (e.g. embedded DBs that can abort a transaction on
fsync() EIO, since the one process that issued the writes is also the
one that observes the error). I think that durability is a rather
complex cross-layer issue which has been similarly misunderstood in
the past (e.g. see [1]). It seems that both the OS and DB communities
would greatly benefit from a periodic reality check, and I see this
as an opportunity for strengthening the IO stack in an end-to-end
manner.
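
As a concrete illustration of the simple-process-model case, here is a
sketch of a hypothetical embedded-DB commit path (not taken from any
particular engine):

    #include <errno.h>
    #include <unistd.h>

    /* Single-process commit: the process that wrote the WAL record is
     * the one calling fsync(), so it reliably observes EIO and can
     * abort the transaction while it still holds the original data in
     * its own buffers -- unlike a multi-process DB, where another
     * process may fsync() the same file later and miss the error. */
    static int commit_txn(int wal_fd, const void *rec, size_t len)
    {
        if (write(wal_fd, rec, len) != (ssize_t)len)
            return -1;          /* abort txn; caller still has the data */
        if (fsync(wal_fd) != 0)
            return -1;          /* EIO: abort txn, do not retry */
        return 0;               /* record durable; txn committed */
    }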

Best regards,
Anthony

[1] Pillai et al., "All File Systems Are Not Created Equal: On the
    Complexity of Crafting Crash-Consistent Applications", OSDI '14.
    https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf

