Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers

From Greg Stark
Subject Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date
Msg-id CAM-w4HPH=g8gEU4yg9XMa_Ai2F1FOQWACg=9D9MD6dm+F_nzwA@mail.gmail.com
In response to Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Anthony Iliopoulos <ailiop@altatus.com>)
List pgsql-hackers
On 9 April 2018 at 15:22, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>>
> Sure, there could be knobs for limiting how much memory such "zombie"
> pages may occupy. Not sure how helpful it would be in the long run
> since this tends to be highly application-specific, and for something
> with a large data footprint one would end up tuning this accordingly
> in a system-wide manner.

Surely this is exactly what the kernel is there to manage. It has to
control how much memory is allowed to fill up with dirty buffers in
the first place, to ensure that the system won't get memory-starved
if it can't clean them fast enough. That isn't even about persistent
hardware errors: even when the hardware is working perfectly, it can
only flush buffers so fast. The whole point of the kernel is to
abstract away shared resources. It's not as if user space has any
better view of the situation here. If Postgres implemented all this
with O_DIRECT it would have exactly the same problem, only with less
visibility into what the rest of the system is doing. If every
application implemented its own buffer cache we would be back in the
same boat, only with fragmented memory allocation.

> This has the potential to leave other
> applications running in the same system with very little memory, in
> cases where for example original application crashes and never clears
> the error.

I still think we're speaking two different languages. There's no
application anywhere that's going to "clear the error". The
application has done the writes, and if it's calling fsync it wants
to wait until the filesystem can arrange for those writes to be
persisted. If the application could manage without the persistence
then it wouldn't have called fsync in the first place.
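
For concreteness, here's a minimal sketch of the contract the
application believes it is getting (illustrative code, not
Postgres's actual implementation; error handling reduced to
exiting):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Write a buffer and wait for it to be persisted.  If fsync
     * fails there is nothing useful to do with this descriptor:
     * retrying may return success without the data ever reaching
     * disk. */
    static void write_durably(int fd, const void *buf, size_t len)
    {
        if (write(fd, buf, len) != (ssize_t) len) {
            perror("write");
            exit(1);
        }
        if (fsync(fd) != 0) {
            perror("fsync");
            exit(1);            /* treat as fatal; do not retry */
        }
    }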

The only way to "clear out" the error would be for the writes to
succeed. There's no reason to think that couldn't happen at some
point: the filesystem could remap blocks, or an administrator could
replace degraded RAID components. The only thing Postgres could do to
recover would be to create a new file and move the data (reading from
the dirty buffer in memory!) into it, so we would "clear the error"
simply by no longer calling fsync on the old file.

We always read fsync as a simple write barrier. That's what the
documentation promised and it's what Postgres always expected. It
sounds like the kernel implementors looked at it as some kind of
channel for communicating status reports for specific writes back to
user space. That's a much more complex problem and would need an
entirely different interface. I think this is why we're having so
much difficulty communicating.
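
To make the mismatch concrete, here's a sketch of the retry loop
that the "write barrier" reading would permit -- precisely the loop
this thread has established is unsafe on Linux:

    #include <errno.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Under the "barrier" reading this loop is safe: the dirty data
     * stays dirty until a later fsync finally persists it.  Under
     * the Linux behaviour discussed here, the first EIO may leave
     * the pages marked clean, so a later fsync can return 0 without
     * the data ever reaching disk. */
    static void fsync_with_retry(int fd)
    {
        while (fsync(fd) != 0) {
            if (errno != EIO && errno != ENOSPC)
                abort();
            sleep(1);       /* hope the admin fixes the device */
        }
    }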



> It is reasonable, but even FreeBSD has a big fat comment right
> there (since 2017), mentioning that there can be no recovery from
> EIO at the block layer and this needs to be done differently. No
> idea how an application running on top of either FreeBSD or Illumos
> would actually recover from this error (and clear it out), other
> than remounting the fs in order to force dropping of relevant pages.
> It does provide though indeed a persistent error indication that
> would allow Pg to simply reliably panic. But again this does not
> necessarily play well with other applications that may be using
> the filesystem reliably at the same time, and are now faced with
> EIO while their own writes succeed to be persisted.

Well, if they're writing to the same file that had a previous error,
I doubt many applications would be happy to consider their writes
"persisted" while the file is corrupt. Ironically, the earlier
discussion quoted talked about how applications that wanted more
granular communication would be using O_DIRECT -- but what we have is
fsync trying to be *too* granular, such that it's impossible to get
any strong guarantee about anything with it.

>> One has to wonder how many applications actually use this correctly,
>> considering PostgreSQL cares about data durability/consistency so much
>> and yet we've been misunderstanding how it works for 20+ years.
>
> I would expect it would be very few, potentially those that have
> a very simple process model (e.g. embedded DBs that can abort a
> txn on fsync() EIO).

Honestly I don't think there's *any* way to use the current
interface to implement reliable operation. Even that embedded
database using a single process and keeping every file open all the
time (which means file descriptor limits constrain its scalability)
can suffer silent corruption whenever some other process, such as a
backup program, comes along and calls fsync (or even sync?).
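
A sketch of that hazard, assuming a kernel that reports a pending
writeback error only to the first fsync caller (the path and
function name here are hypothetical):

    #include <fcntl.h>
    #include <unistd.h>

    /* What an innocent backup tool might do.  On kernels that report
     * a pending writeback error only once per file, this fsync
     * receives the EIO meant for the database -- whose own later
     * fsync on its long-lived descriptor may then report success. */
    static void backup_one_file(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd >= 0) {
            (void) fsync(fd);   /* may swallow the database's error */
            close(fd);
        }
    }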



-- 
greg

