Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS |
Date | |
Msg-id | 20180402232324.hdt2345l6srm2upw@alap3.anarazel.de Whole thread Raw |
In response to | Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS (Anthony Iliopoulos <ailiop@altatus.com>) |
List | pgsql-hackers |
On 2018-04-03 01:05:44 +0200, Anthony Iliopoulos wrote: > Would using sync_file_range() be helpful? Potential errors would only > apply to pages that cover the requested file ranges. There are a few > caveats though: To quote sync_file_range(2): Warning This system call is extremely dangerous and should not be used in portable programs. None of these operationswrites out the file's metadata. Therefore, unless the application is strictly performing overwrites of already-instantiated disk blocks, there are no guarantees that the data will be available after a crash. There is no user interface to know if a write ispurely an over‐ write. On filesystems using copy-on-write semantics (e.g., btrfs) an overwrite of existing allocated blocks is impossible. When writing into preallocated space, many filesystems also require calls into the block allocator, which this systemcall does not sync out to disk. This system call does not flush disk write caches and thus does not provide any data integrityon systems with volatile disk write caches. Given the lack of metadata safety that seems entirely a no go. We use sfr(2), but only to force the kernel's hand around writing back earlier without throwing away cache contents. > > > The application will need to deal with that first error irrespective of > > > subsequent return codes from fsync(). Conceptually every fsync() invocation > > > demarcates an epoch for which it reports potential errors, so the caller > > > needs to take responsibility for that particular epoch. > > > > We do deal with that error- by realizing that it failed and later > > *retrying* the fsync(), which is when we get back an "all good! > > everything with this file descriptor you've opened is sync'd!" and > > happily expect that to be truth, when, in reality, it's an unfortunate > > lie and there are still pages associated with that file descriptor which > > are, in reality, dirty and not sync'd to disk. > > It really turns out that this is not how the fsync() semantics work > though Except on freebsd and solaris, and perhaps others. >, exactly because the nature of the errors: even if the kernel > retained the dirty bits on the failed pages, retrying persisting them > on the same disk location would simply fail. That's not guaranteed at all, think NFS. > Instead the kernel opts for marking those pages clean (since there is > no other recovery strategy), and reporting once to the caller who can > potentially deal with it in some manner. It is sadly a bad and > undocumented convention. It's broken behaviour justified post facto with the only rational that was available, which explains why it's so unconvincing. You could just say "this ship has sailed, and it's to onerous to change because xxx" and this'd be a done deal. But claiming this is reasonable behaviour is ridiculous. Again, you could just continue to error for this fd and still throw away the data. > > Consider two independent programs where the first one writes to a file > > and then calls the second one whose job it is to go out and fsync(), > > perhaps async from the first, those files. Is the second program > > supposed to go write to each page that the first one wrote to, in order > > to ensure that all the dirty bits are set so that the fsync() will > > actually return if all the dirty pages are written? > > I think what you have in mind are the semantics of sync() rather > than fsync() If you open the same file with two fds, and write with one, and fsync with another that's definitely supposed to work. And sync() isn't a realistic replacement in any sort of way because it's obviously systemwide, and thus entirely and completely unsuitable. Nor does it have any sort of better error reporting behaviour, does it? Greetings, Andres Freund
pgsql-hackers by date: