Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Msg-id: CA+TgmoYQuM-8F0mweBFVvhjFa04DOaxamd8_g5wfJAuOgwwXuw@mail.gmail.com
In response to: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS (Anthony Iliopoulos <ailiop@altatus.com>)
List: pgsql-hackers
On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> Given precisely that the dirty pages which cannot been written-out are
> practically thrown away, the semantics of fsync() (after the 4.13 fixes)
> are essentially correct: the first call indicates that a writeback error
> indeed occurred, while subsequent calls have no reason to indicate an error
> (assuming no other errors occurred in the meantime).

Like other people here, I think this is 100% unreasonable, starting with
"the dirty pages which cannot been written out are practically thrown
away". Who decided that was OK, and on the basis of what wording in what
specification? I think it's always unreasonable to throw away the user's
data. If the writes are going to fail, then let them keep on failing
every time. *That* wouldn't cause any data loss, because we'd never be
able to checkpoint, and eventually the user would have to kill the
server uncleanly, and that would trigger recovery.

Also, this really does make it impossible to write reliable programs.
Imagine that, while the server is running, somebody runs a program which
opens a file in the data directory, calls fsync() on it, and closes it.
If the fsync() fails, postgres is now borked and has no way of being
aware of the problem. If we knew, we could PANIC, but we'll never find
out, because the unrelated process ate the error. This is exactly the
sort of ill-considered behavior that makes fcntl() locking nearly
useless.

Even leaving that aside, a PANIC means a prolonged outage on a
production system - it could easily take tens of minutes or longer to
run recovery. So saying "oh, just do that" is not really an answer.
Sure, we can do it, but it's like trying to lose weight by intentionally
eating a tapeworm. Now, it's possible to shorten checkpoint_timeout so
that recovery runs faster, but then performance drops because data has
to be fsync()'d more often instead of getting buffered in the OS cache
for the maximum possible time.

We could also dodge this issue in another way: suppose that when we
write a page out, we don't consider it really written until fsync()
succeeds. Then we wouldn't need to PANIC if an fsync() fails; we could
just re-write the page. Unfortunately, this would also be terrible for
performance, for pretty much the same reasons: letting the OS cache
absorb lots of dirty blocks and do write-combining is *necessary* for
good performance.

> The error reporting is thus consistent with the intended semantics (which
> are sadly not properly documented). Repeated calls to fsync() simply do not
> imply that the kernel will retry to writeback the previously-failed pages,
> so the application needs to be aware of that. Persisting the error at the
> fsync() level would essentially mean moving application policy into the
> kernel.

I might accept this argument if I accepted that it was OK to decide that
an fsync() failure means you can forget that the write() ever happened
in the first place, but it's hard to imagine an application that wants
that behavior. If the application didn't care about whether the bytes
really got to disk or not, it would not have called fsync() in the first
place. If it does care, reporting the error only once is never an
improvement.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
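P.S. To make the "unrelated process" scenario above concrete, here is a
minimal sketch of such a program. The data-directory path is
hypothetical, and the details vary by kernel version; the point is that
on kernels with one-shot error reporting, this innocent-looking code
consumes a pending writeback error that postgres then never sees:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical path to some relation file in the data directory. */
    int fd = open("/pgdata/base/16384/16385", O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* If a writeback error is pending, this fsync() reports it - and,
     * on the kernels under discussion, also clears it, so a later
     * fsync() of the same file by postgres returns success. */
    if (fsync(fd) != 0)
        fprintf(stderr, "fsync failed: %s\n", strerror(errno));

    close(fd);
    return 0;
}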
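P.P.S. And here is a sketch of the "don't consider it really written
until fsync() succeeds" approach - not actual PostgreSQL code, just an
illustration of the idea and of why it performs badly: every page write
is immediately followed by an fsync(), so the OS never gets a chance to
buffer and write-combine. BLCKSZ and write_page_durably() are made-up
names for this example.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192   /* 8kB pages, matching PostgreSQL's default */

/* Write one page and do not return until fsync() confirms it is on
 * disk. We retry from our own copy of the page, because the kernel may
 * have thrown its dirty copy away after a failed writeback. */
static int
write_page_durably(int fd, const char *page, off_t offset)
{
    for (;;)
    {
        if (pwrite(fd, page, BLCKSZ, offset) != BLCKSZ)
        {
            if (errno == EINTR)
                continue;
            return -1;          /* hard write() error */
        }
        if (fsync(fd) == 0)
            return 0;           /* now the page really is written */

        fprintf(stderr, "fsync failed (%s), retrying\n",
                strerror(errno));
        sleep(1);               /* back off, then re-write the page */
    }
}

int main(void)
{
    char page[BLCKSZ] = {0};
    int fd = open("testfile", O_CREAT | O_WRONLY, 0600);

    if (fd < 0 || write_page_durably(fd, page, 0) != 0)
    {
        perror("write_page_durably");
        return 1;
    }
    close(fd);
    return 0;
}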