Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers
From: Greg Stark
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date:
Msg-id: CAM-w4HPH=g8gEU4yg9XMa_Ai2F1FOQWACg=9D9MD6dm+F_nzwA@mail.gmail.com
In response to: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS (Anthony Iliopoulos <ailiop@altatus.com>)
List: pgsql-hackers
On 9 April 2018 at 15:22, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>>
> Sure, there could be knobs for limiting how much memory such "zombie"
> pages may occupy. Not sure how helpful it would be in the long run
> since this tends to be highly application-specific, and for something
> with a large data footprint one would end up tuning this accordingly
> in a system-wide manner.

Surely this is exactly what the kernel is there to manage. It has to
control how much memory is allowed to be full of dirty buffers in the
first place, to ensure the system won't get memory-starved if it can't
clean them fast enough. That isn't even about persistent hardware
errors: even when the hardware is working perfectly, it can only flush
buffers so fast. The whole point of the kernel is to abstract away
shared resources. It's not as if user space has any better view of the
situation here. If Postgres implemented all this with O_DIRECT it would
have exactly the same problem, only with less visibility into what the
rest of the system is doing. If every application implemented its own
buffer cache we would be back in the same boat, only with fragmented
memory allocation.

> This has the potential to leave other
> applications running in the same system with very little memory, in
> cases where for example original application crashes and never clears
> the error.

I still think we're speaking two different languages. There's no
application anywhere that's going to "clear the error". The application
has done the writes, and if it's calling fsync it wants to wait until
the filesystem can arrange for those writes to be persisted. If the
application could manage without the persistence, it wouldn't have
called fsync in the first place.

The only way to "clear out" the error would be by having the writes
succeed. There's no reason to think that wouldn't be possible sometime.
The filesystem could remap blocks, or an administrator could replace
degraded RAID device components. The only thing Postgres could do to
recover would be to create a new file and move the data (reading from
the dirty buffer in memory!) into it, at which point we would "clear
the error" by simply no longer calling fsync on the old file.

We always read fsync as a simple write barrier. That's what the
documentation promised, and it's what Postgres always expected. It
sounds like the kernel implementors looked at it as a kind of
communication channel for reporting the status of specific writes back
to user space. That's a much more complex problem and would have an
entirely different interface. I think this is why we're having so much
difficulty communicating.

> It is reasonable, but even FreeBSD has a big fat comment right
> there (since 2017), mentioning that there can be no recovery from
> EIO at the block layer and this needs to be done differently. No
> idea how an application running on top of either FreeBSD or Illumos
> would actually recover from this error (and clear it out), other
> than remounting the fs in order to force dropping of relevant pages.
> It does provide though indeed a persistent error indication that
> would allow Pg to simply reliably panic. But again this does not
> necessarily play well with other applications that may be using
> the filesystem reliably at the same time, and are now faced with
> EIO while their own writes succeed to be persisted.

Well, if they're writing to the same file that had a previous error, I
doubt there are many applications that would be happy to consider their
writes "persisted" while the file is corrupt. Ironically, the earlier
discussion quoted talked about how applications that wanted more
granular communication would be using O_DIRECT -- but what we have is
fsync trying to be *too* granular, such that it's impossible to get any
strong guarantees about anything with it.
>> One has to wonder how many applications actually use this correctly,
>> considering PostgreSQL cares about data durability/consistency so much
>> and yet we've been misunderstanding how it works for 20+ years.
>
> I would expect it would be very few, potentially those that have
> a very simple process model (e.g. embedded DBs that can abort a
> txn on fsync() EIO).

Honestly, I don't think there's *any* way to use the current interface
to implement reliable operation. Even an embedded database using a
single process and keeping every file open all the time (which means
file descriptor limits cap its scalability) can suffer silent
corruption whenever some other process, like a backup program, comes
along and calls fsync (or even sync?).

-- 
greg