Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers
From | Tomas Vondra
---|---
Subject | Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date |
Msg-id | 63e55a27-a6a4-e7eb-d74f-78a5d0840bd1@2ndquadrant.com
In response to | Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS (Anthony Iliopoulos <ailiop@altatus.com>)
Responses | Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
List | pgsql-hackers
On 04/09/2018 02:31 PM, Anthony Iliopoulos wrote:
> On Mon, Apr 09, 2018 at 01:03:28PM +0100, Geoff Winkless wrote:
>> On 9 April 2018 at 11:50, Anthony Iliopoulos <ailiop@altatus.com> wrote:
>>
>>> What you seem to be asking for is the capability of dropping
>>> buffers over the (kernel) fence and indemnifying the application
>>> from any further responsibility, i.e. a hard assurance that either
>>> the kernel will persist the pages or it will keep them around till
>>> the application recovers them asynchronously, the filesystem is
>>> unmounted, or the system is rebooted.
>>
>> That seems like a perfectly reasonable position to take, frankly.
>
> Indeed, as long as you are willing to ignore the consequences of this
> design decision: mainly, how you would recover memory when no
> application is interested in clearing the error. At which point other
> applications with different priorities will find this position rather
> unreasonable, since there can be no way out of it for them.

Sure, but the question is whether the system can reasonably operate
after some of the writes have failed and the data has been lost.
Because if it can't, then recovering the memory is rather useless. It
might be better to stop the system in that case, forcing the system
administrator to resolve the issue somehow (fail over to a replica,
perform recovery from the last checkpoint, ...).

We already have dirty_bytes and dirty_background_bytes, for example. I
don't see why there couldn't be another limit defining how much dirty
data to allow before blocking writes altogether. I'm sure it's not that
simple, but you get the general idea - do not allow using all available
memory because of writeback issues, but don't throw the data away in
case it's just a temporary issue.

> Good luck convincing any OS kernel upstream to go with this design.

Well, there seem to be kernels that do exactly that already. At least
that's how I understand what this thread says about FreeBSD and
Illumos, for example. So it's not an entirely insane design,
apparently.

The question is whether the current design makes it any easier for
user-space developers to build reliable systems. We have tried using
it, and unfortunately the answer seems to be "no" and "use direct I/O
and manage everything on your own!"

>> The whole _point_ of an Operating System should be that you can do
>> exactly that. As a developer I should be able to call write() and
>> fsync() and know that if both calls have succeeded then the result
>> is on disk, no matter what another application has done in the
>> meantime. If that's a "difficult" problem then that's the OS's
>> problem, not mine. If the OS doesn't do that, it's
>> _not_doing_its_job_.
>
> No OS kernel that I know of provides any promises for atomicity of a
> write()+fsync() sequence, unless one is using O_SYNC. It doesn't
> provide you with isolation either, as this is delegated to userspace,
> where processes that share a file should coordinate accordingly.

We can (and do) take care of the atomicity and isolation.
Implementation of those parts is obviously very application-specific,
and we have WAL and locks for that purpose. I/O, on the other hand,
seems to be a generic service provided by the OS - at least that's how
we saw it until now.
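To make the failure mode behind all this concrete, below is a minimal C
sketch of the plain write-then-fsync pattern (the file name and the
bare-bones error handling are illustrative assumptions, not PostgreSQL
code). The crux, per the kernel behavior discussed in this thread, is
that once fsync() has reported EIO the dirty pages may already have
been marked clean, so a retried fsync() can return success without the
data ever having reached disk:

```c
/*
 * Minimal sketch of the write-then-fsync pattern discussed above,
 * showing why treating a *retried* fsync() as proof of durability is
 * unsafe on kernels that clear the error state after reporting it
 * once. "datafile" is a hypothetical name for illustration.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    const char  buf[] = "some durable data";
    int         fd = open("datafile", O_WRONLY | O_CREAT, 0600);

    if (fd < 0 || write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
    {
        perror("open/write");
        exit(EXIT_FAILURE);
    }

    if (fsync(fd) != 0)
    {
        /*
         * Writeback failed. The kernel may already have marked the
         * dirty pages clean (or dropped them), so calling fsync()
         * again here can return 0 even though the data never reached
         * disk - exactly the behavior this thread is about. The only
         * safe reaction is to treat the data as lost, e.g. stop and
         * redo from WAL, rather than trust a later "successful" fsync.
         */
        fprintf(stderr, "fsync failed: %s\n", strerror(errno));
        exit(EXIT_FAILURE);
    }

    close(fd);
    return 0;
}
```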
> It's not a difficult problem, but rather the kernels provide a common
> denominator of possible interfaces and designs that could accommodate
> a wider range of potential application scenarios for which the kernel
> cannot possibly anticipate requirements. There have been plenty of
> experimental works for providing a transactional (ACID) filesystem
> interface to applications. On the opposite end, there have been quite
> a few commercial databases that completely bypass the kernel storage
> stack. But I would assume it is reasonable to figure out something
> between those two extremes that can work in a "portable" fashion.

Users ask us about this quite often, actually. The question is usually
about "RAW devices" and performance, but ultimately it boils down to
buffered vs. direct I/O. So far our answer has been that we rely on the
kernel to do this reliably, because kernel developers know how to do it
correctly and we simply don't have the manpower to implement it
ourselves (portable, reliable, handling different types of storage,
...).
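For contrast, here is a minimal sketch of what the direct-I/O route
looks like, assuming Linux's O_DIRECT; the file name and the 4096-byte
alignment are assumptions for illustration, since real code has to
discover the correct block size per device and filesystem - which is
part of why a portable direct-I/O path is so much work:

```c
/*
 * Minimal sketch of the "bypass the kernel page cache" alternative
 * mentioned above, using Linux O_DIRECT. The ALIGNMENT value is an
 * illustrative assumption; O_DIRECT requires buffer, offset and
 * length to be suitably aligned for the underlying device.
 */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGNMENT 4096

int
main(void)
{
    void       *buf;
    int         fd;

    if (posix_memalign(&buf, ALIGNMENT, ALIGNMENT) != 0)
        exit(EXIT_FAILURE);
    memset(buf, 'x', ALIGNMENT);

    fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0)
    {
        perror("open(O_DIRECT)");
        exit(EXIT_FAILURE);
    }

    /*
     * With O_DIRECT the transfer bypasses the page cache, so an I/O
     * error surfaces in this write() itself rather than in some later
     * fsync() (the device's own write cache aside). The price is that
     * the application now owns caching, scheduling and alignment for
     * every filesystem and storage type it supports.
     */
    if (write(fd, buf, ALIGNMENT) != ALIGNMENT)
    {
        perror("write");
        exit(EXIT_FAILURE);
    }

    close(fd);
    free(buf);
    return 0;
}
```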
One has to wonder how many applications actually use this correctly,
considering PostgreSQL cares about data durability/consistency so much
and yet we've been misunderstanding how it works for 20+ years.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services