Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers
From: Anthony Iliopoulos
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date:
Msg-id: 20180403103538.GP11627@technoir
In response to: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
           Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
List: pgsql-hackers
Hi Robert,

On Mon, Apr 02, 2018 at 10:54:26PM -0400, Robert Haas wrote:
> On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> > Given precisely that the dirty pages which cannot been written-out are
> > practically thrown away, the semantics of fsync() (after the 4.13 fixes)
> > are essentially correct: the first call indicates that a writeback error
> > indeed occurred, while subsequent calls have no reason to indicate an error
> > (assuming no other errors occurred in the meantime).
>
> Like other people here, I think this is 100% unreasonable, starting
> with "the dirty pages which cannot been written out are practically
> thrown away". Who decided that was OK, and on the basis of what
> wording in what specification? I think it's always unreasonable to

If you insist on strict conformance to POSIX, then indeed the Linux glibc configuration and the associated manpage are probably wrong in stating that _POSIX_SYNCHRONIZED_IO is supported. The implementation instead matches the flexibility allowed by not supporting SIO. There is a long history of divergence between Linux and POSIX, and I think there was never an intention of conforming to the standard on this point.

> throw away the user's data. If the writes are going to fail, then let
> them keep on failing every time. *That* wouldn't cause any data loss,
> because we'd never be able to checkpoint, and eventually the user
> would have to kill the server uncleanly, and that would trigger
> recovery.

I believe (as I tried to explain earlier) there is a certain assumption being made that the writer and original owner of the data is responsible for dealing with potential errors in order to avoid data loss (which should be of interest only to the original writer anyway). It would be very questionable for the interface to persist the error while subsequent writes and fsyncs to different offsets may well go through. Another process may need to write into the file and fsync(); being unaware of the newly introduced semantics, it would then be faced with EIO because some unrelated previous process failed some earlier writes and did not bother to clear the error for those writes. In the similar scenario where the second process is aware of the new semantics, it would naturally go ahead and clear the global error in order to proceed with its own write()+fsync(), which would essentially amount to the same problematic semantics you have now.

> Also, this really does make it impossible to write reliable programs.
> Imagine that, while the server is running, somebody runs a program
> which opens a file in the data directory, calls fsync() on it, and
> closes it. If the fsync() fails, postgres is now borked and has no
> way of being aware of the problem. If we knew, we could PANIC, but
> we'll never find out, because the unrelated process ate the error.
> This is exactly the sort of ill-considered behavior that makes fcntl()
> locking nearly useless.

Fully agree, and the errseq_t fixes have dealt with exactly this issue of making sure that the error is reported to all file descriptors that *happen to be open at the time of error*. But I think one would have a hard time defending a modification to the kernel where this is further extended to cover cases where:

process A does write() on some file offset which fails writeback, fsync() gets EIO and exit()s.

process B does write() on some other offset which succeeds writeback, but fsync() gets EIO due to the (uncleared) failures of the earlier process.
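
To make those two cases concrete, here is a rough sketch (untested; the path is hypothetical, and the writeback failure would have to be induced externally, e.g. with a dm-error or dm-flakey target) of how the current edge-triggered, check-and-clear semantics behave:

/* Illustrates edge-triggered fsync() error reporting (post-4.13
 * errseq_t semantics). Assumes /mnt/flaky/file sits on a device set
 * up to fail writeback, e.g. via dm-error or dm-flakey. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/flaky/file", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    memset(buf, 'x', sizeof(buf));

    /* Dirty a page that will subsequently fail writeback. */
    if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
        perror("write");

    /* First fsync(): the writeback failure is reported here (EIO),
     * and the error is marked as seen for this open file. The failed
     * pages are no longer dirty; they have effectively been dropped. */
    if (fsync(fd) < 0)
        fprintf(stderr, "first fsync: %s\n", strerror(errno));

    /* Second fsync(): no new writeback error has occurred since the
     * last check, so this returns 0 even though the data was never
     * persisted. A descriptor opened only after the failure (process
     * B in the scenario above) would likewise see success. */
    if (fsync(fd) == 0)
        fprintf(stderr, "second fsync: success (error already consumed)\n");

    close(fd);
    return 0;
}

Covering process B as well is the extension in question.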
This would be a highly user-visible change of semantics, from edge-triggered to level-triggered behavior.

> dodge this issue in another way: suppose that when we write a page
> out, we don't consider it really written until fsync() succeeds. Then

That is the only way to think about fsync() guarantees, unless you are on a kernel that keeps retrying to persist dirty pages. Assuming such a model, after repeated and unrecoverable hard failures the process would have to explicitly inform the kernel to drop the dirty pages. All the process could do at that point is read the dirty/failed pages back to userspace and attempt to rewrite them at a different place (which is currently possible too; see the rough retry sketch in the postscript below). Most applications, though, would not bother to inform the kernel and drop the permanently failed pages; and thus someone would eventually hit the case where a large amount of failed writeback pages is running their server out of memory, at which point people will complain that those semantics are completely unreasonable.

> we wouldn't need to PANIC if an fsync() fails; we could just re-write
> the page. Unfortunately, this would also be terrible for performance,
> for pretty much the same reasons: letting the OS cache absorb lots of
> dirty blocks and do write-combining is *necessary* for good
> performance.

Not sure I understand this case. The application may indeed re-write a bunch of pages that have failed and proceed with fsync(); the kernel will deal with combining the writeback of all the re-written pages. Further, the necessity of write-combining for performance really depends on the exact storage medium. At the point where you start caring about write-combining, the kernel community will naturally redirect you to DIRECT_IO.

> > The error reporting is thus consistent with the intended semantics (which
> > are sadly not properly documented). Repeated calls to fsync() simply do not
> > imply that the kernel will retry to writeback the previously-failed pages,
> > so the application needs to be aware of that. Persisting the error at the
> > fsync() level would essentially mean moving application policy into the
> > kernel.
>
> I might accept this argument if I accepted that it was OK to decide
> that an fsync() failure means you can forget that the write() ever
> happened in the first place, but it's hard to imagine an application
> that wants that behavior. If the application didn't care about
> whether the bytes really got to disk or not, it would not have called
> fsync() in the first place. If it does care, reporting the error only
> once is never an improvement.

Again, this conflates two separate issues: that of buffering and retrying failed pages, and that of error reporting. Yes, it would be convenient for applications not to have to care at all about recovery of failed writebacks, but at some point they would have to face this issue one way or another (I am assuming we are always talking about hard failures; other kinds of failures are probably already being dealt with transparently at the kernel level). As for the reporting, it is also unreasonable to effectively signal and persist an error at file-wide granularity while it pertains to a subset of that file and other writes can go through, but I am repeating myself.

I suppose that if the check-and-clear semantics are problematic for Pg, one could suggest a kernel patch that opts in to level-triggered reporting of fsync() on a per-descriptor basis, which seems to be non-intrusive and probably sufficient to cover your expected use case.

Best regards,
Anthony
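
P.S. To illustrate the "re-write the failed pages and proceed with fsync()" approach mentioned above, here is a rough sketch (untested; the file name and the durable_pwrite() helper are purely illustrative) of the retry loop an application would need under the current semantics, keeping its own copy of the data until fsync() succeeds:

/* Retry loop under edge-triggered fsync() semantics: the application
 * keeps its own copy of the data, because after a failed fsync() the
 * kernel may already have dropped the dirty pages, so it must
 * re-write() them before retrying. Illustrative sketch only. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static int write_all(int fd, const char *buf, size_t len, off_t off)
{
    while (len > 0) {
        ssize_t n = pwrite(fd, buf, len, off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        off += n;
        len -= (size_t) n;
    }
    return 0;
}

/* Durably write one buffer, re-writing it from the userspace copy
 * whenever fsync() reports a failure, up to a bounded number of
 * attempts (after which the caller must treat the data as lost or
 * relocate it to a different file or device). */
static int durable_pwrite(int fd, const char *buf, size_t len, off_t off)
{
    for (int attempt = 0; attempt < 3; attempt++) {
        if (write_all(fd, buf, len, off) < 0)
            return -1;
        if (fsync(fd) == 0)
            return 0;       /* data reached stable storage */
        /* fsync() failed: simply calling fsync() again would "succeed"
         * without persisting anything, so loop and re-write instead. */
        fprintf(stderr, "fsync failed (%s), re-writing\n", strerror(errno));
    }
    return -1;
}

int main(void)
{
    int fd = open("datafile", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char page[8192];
    memset(page, 'x', sizeof(page));

    int rc = durable_pwrite(fd, page, sizeof(page), 0);
    close(fd);
    return rc == 0 ? 0 : 1;
}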