Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers

From Andres Freund
Subject Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date
Msg-id 20180402232324.hdt2345l6srm2upw@alap3.anarazel.de
Whole thread Raw
In response to Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Anthony Iliopoulos <ailiop@altatus.com>)
List pgsql-hackers
On 2018-04-03 01:05:44 +0200, Anthony Iliopoulos wrote:
> Would using sync_file_range() be helpful? Potential errors would only
> apply to pages that cover the requested file ranges. There are a few
> caveats though:

To quote sync_file_range(2):
   Warning
       This  system  call  is  extremely  dangerous and should not be used in portable programs.  None of these
operationswrites out the
 
       file's metadata.  Therefore, unless the application is strictly performing overwrites of already-instantiated
disk blocks,  there
 
       are no guarantees that the data will be available after a crash.  There is no user interface to know if a write
ispurely an over‐
 
       write.  On filesystems using copy-on-write semantics (e.g., btrfs) an overwrite of existing allocated blocks is
impossible.  When
 
       writing  into  preallocated  space,  many filesystems also require calls into the block allocator, which this
systemcall does not
 
       sync out to disk.  This system call does not flush disk write caches and thus does not provide any data
integrityon systems  with
 
       volatile disk write caches.

Given the lack of metadata safety that seems entirely a no go.  We use
sfr(2), but only to force the kernel's hand around writing back earlier
without throwing away cache contents.


> > > The application will need to deal with that first error irrespective of
> > > subsequent return codes from fsync(). Conceptually every fsync() invocation
> > > demarcates an epoch for which it reports potential errors, so the caller
> > > needs to take responsibility for that particular epoch.
> > 
> > We do deal with that error- by realizing that it failed and later
> > *retrying* the fsync(), which is when we get back an "all good!
> > everything with this file descriptor you've opened is sync'd!" and
> > happily expect that to be truth, when, in reality, it's an unfortunate
> > lie and there are still pages associated with that file descriptor which
> > are, in reality, dirty and not sync'd to disk.
> 
> It really turns out that this is not how the fsync() semantics work
> though

Except on freebsd and solaris, and perhaps others.


>, exactly because the nature of the errors: even if the kernel
> retained the dirty bits on the failed pages, retrying persisting them
> on the same disk location would simply fail.

That's not guaranteed at all, think NFS.


> Instead the kernel opts for marking those pages clean (since there is
> no other recovery strategy), and reporting once to the caller who can
> potentially deal with it in some manner. It is sadly a bad and
> undocumented convention.

It's broken behaviour justified post facto with the only rational that
was available, which explains why it's so unconvincing. You could just
say "this ship has sailed, and it's to onerous to change because xxx"
and this'd be a done deal. But claiming this is reasonable behaviour is
ridiculous.

Again, you could just continue to error for this fd and still throw away
the data.


> > Consider two independent programs where the first one writes to a file
> > and then calls the second one whose job it is to go out and fsync(),
> > perhaps async from the first, those files.  Is the second program
> > supposed to go write to each page that the first one wrote to, in order
> > to ensure that all the dirty bits are set so that the fsync() will
> > actually return if all the dirty pages are written?
> 
> I think what you have in mind are the semantics of sync() rather
> than fsync()

If you open the same file with two fds, and write with one, and fsync
with another that's definitely supposed to work. And sync() isn't a
realistic replacement in any sort of way because it's obviously
systemwide, and thus entirely and completely unsuitable. Nor does it
have any sort of better error reporting behaviour, does it?

Greetings,

Andres Freund


pgsql-hackers by date:

Previous
From: Tatsuo Ishii
Date:
Subject: Re: Creating streaming replication standby
Next
From: Alexander Korotkov
Date:
Subject: Re: WIP: Covering + unique indexes.