Thread: Postgres, fsync, and OSs (specifically linux)
Hi,

I thought I'd send this separately from [0] as the issue has become more general than what was mentioned in that thread, and it went off into various weeds.

I went to LSF/MM 2018 to discuss [0] and related issues. Overall I'd say it was a very productive discussion. I'll first try to recap the current situation, updated with knowledge I gained. Secondly I'll try to discuss the kernel changes that seem to have been agreed upon. Thirdly I'll try to sum up what postgres needs to change.

== Current Situation ==

The fundamental problem is that postgres assumed that any IO error would be reported at fsync time, and that the error would keep being reported until resolved. That's not true in several operating systems, linux included.

There are various judgement calls leading to the current OS behaviour (specifically linux, but the concerns are similar in other OSs):

- By the time IO errors are treated as fatal, it's unlikely that plain retries attempting to write exactly the same data are going to succeed. There are retries on several layers. Some cases would be resolved by overwriting a larger amount (so device-level remapping functionality can mask dead areas), but plain retries aren't going to get there if they didn't the first time round.

- Retaining all the data necessary for retries would make it quite possible to turn IO errors on some device into out-of-memory errors. This is true to a far lesser degree if only enough information were retained to (re-)report an error, rather than to actually retry the write.

- Continuing to re-report an error after one fsync() failed would make it hard to recover from that fact. There'd need to be a way to "clear" a persistent error bit, and that'd obviously be outside of POSIX.

- Some other databases use direct IO, so these paths haven't been exercised under fire that much.

- Actually marking files as persistently failed would require filesystem changes and filesystem metadata IO, which are far from guaranteed in failure scenarios.

Before linux v4.13, errors in kernel writeback would be reported at most once, without a guarantee that that'd happen (IIUC memory pressure could lead to the relevant information being evicted) - but it was pretty likely. After v4.13 (see https://lwn.net/Articles/724307/), errors are reported exactly once to every file descriptor that was open at the time of the error - but never to file descriptors opened after the error occurred.

It's worth noting that on linux it's not well defined what contents one would read back after a writeback error. IIUC xfs will mark the pagecache contents that triggered an error as invalid, triggering a re-read from the underlying storage (thus either failing or returning old but persistent contents). Some other filesystems (among them ext4, I believe) instead retain the modified contents of the page cache but mark them as clean (thereby returning the new contents until the page cache contents are evicted).

Some filesystems (prominently NFS in many configurations) perform an implicit fsync when closing the file. While postgres checks for an error from close() and reports it, we don't treat it as fatal. It's worth noting that, by my reading, this means an fsync error reported at close() will *not* be re-reported by the time an explicit fsync() is issued. It also means that we won't react properly to the possible ENOSPC errors that may be reported at close() for NFS. At least the latter isn't just the case on linux.
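To make the post-v4.13 behaviour above concrete, here's a minimal sketch (illustrative only, not postgres code; the file name is made up, and it assumes writeback is made to fail at the marked point, e.g. via a dm-error/dm-flakey mapping - without an injected error both fsync() calls simply succeed):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char buf[8192];
    int  fd1 = open("datafile", O_RDWR | O_CREAT, 0600);

    memset(buf, 'x', sizeof(buf));
    if (write(fd1, buf, sizeof(buf)) != (ssize_t) sizeof(buf))  /* dirties the page cache */
        perror("write");

    /*
     * Assume background writeback fails right here (e.g. a dm-error rule
     * kicks in).  fd1 was open at the time of the error, so the error is
     * recorded against it.
     */

    int fd2 = open("datafile", O_RDWR);   /* opened *after* the error */

    if (fsync(fd1) != 0)
        printf("fd1: fsync reports %s\n", strerror(errno));

    /*
     * On >= v4.13 kernels this returns success: fd2 was opened after the
     * error occurred, so it never sees it - even though the dirtied data
     * may never have reached storage.
     */
    if (fsync(fd2) == 0)
        printf("fd2: fsync reports success\n");

    close(fd1);
    close(fd2);
    return 0;
}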
Proposals for how postgres could deal with this included using syncfs(2) - but that turns out not to work at all currently, because syncfs() basically doesn't return any file-level errors. It'd also imply superfluously flushing temporary files etc.

The second major type of proposal was using direct-IO. That'd generally be a desirable feature, but a) it would require some significant changes to postgres to be performant, and b) it isn't really applicable for the large percentage of installations that aren't tuned reasonably well, because at the moment the OS page cache functions as a memory-pressure-aware extension of postgres' page cache.

Another topic brought up in this thread was the handling of ENOSPC errors that aren't triggered on a filesystem level, but rather are triggered by thin provisioning. On linux that currently apparently leads to page cache contents being lost (and errors "eaten") in a lot of places, including just when doing a write(). In a lot of cases it's pretty much expected that the file system will just hang or react unpredictably upon space exhaustion. My reading is that the block-layer thin provisioning code is still pretty fresh, and should only be used with great care. The only way to use it halfway reliably appears to be to change the configuration so that space exhaustion blocks until admin intervention (at least dm-thinp allows that).

There's a clear need to automate more testing in this area so that future behaviour changes don't surprise us.

== Proposed Linux Changes ==

- Matthew Wilcox proposed (and posted a patch) that'd partially revert behaviour to the pre-v4.13 world, by *also* reporting errors to "newer" file descriptors if the error hasn't previously been reported. That'd still not guarantee that the error is reported (memory pressure could evict the information while no fd is open), but in most situations we'll again get the error in the checkpointer.

  This seems to be largely agreed upon. It's unclear whether it'll go into the stable backports for still-maintained >= v4.13 kernels.

- syncfs() will be fixed so it reports errors properly - that'll likely require passing it an O_PATH file descriptor to have space to store the errseq_t value that allows discerning already-reported and new errors (a usage sketch follows after this list).

  No patch has appeared yet, but the behaviour seems largely agreed upon.

- Make per-filesystem error counts available in a uniform (i.e. the same for every supporting fs) manner. Right now it's very hard to figure out whether errors occurred. There seemed to be general agreement that exporting knowledge about such errors is desirable. Quite possibly the syncfs() fix above will provide the necessary infrastructure. It's unclear as yet how the value would be exposed; per-fs /sys/ entries and an ioctl on O_PATH fds have been mentioned.

  These error counts would not vanish due to memory pressure, and they can be checked even without knowing which files in a specific filesystem have been touched (e.g. when just untar-ing something).

  There seemed to be fairly widespread agreement that this'd be a good idea. It's much less clear whether somebody will do the work.

- Provide config knobs that allow defining the FS error behaviour in a consistent way across supported filesystems. XFS currently has various knobs controlling what happens in case of metadata errors [1] (retry forever, timeout, return up). It was proposed that this interface be extended to also deal with data errors, and moved into generic support code.

  While the timeline is unclear, there seemed to be widespread support for the idea. I believe Dave Chinner indicated that he at least has plans to generalize the code.

- Stop inodes with unreported errors from being evicted. This will guarantee that a later fsync (without an open FD) will see the error. The memory pressure concerns here are lower than with keeping all the failed pages in memory, and it could be optimized further.

  I read some tentative agreement behind this idea, but I think it's by far the most controversial one.
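As a usage sketch for the syncfs() item above (illustrative only; the data-directory path is made up): today the call can be issued on any fd on the filesystem, but per the discussion above its return value cannot yet be trusted to carry per-file writeback errors - that's exactly what the proposed fix would change.

#define _GNU_SOURCE             /* for syncfs() */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    /* Any fd on the filesystem will do; the data directory is the obvious choice. */
    int fd = open("/srv/pgdata", O_RDONLY);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /*
     * Flush everything on that filesystem.  As of this discussion the
     * return value does *not* reliably reflect writeback errors of
     * individual files; the proposed fix (storing an errseq_t, possibly
     * requiring an O_PATH fd) is what would make checking it meaningful.
     */
    if (syncfs(fd) != 0)
        fprintf(stderr, "syncfs: %s\n", strerror(errno));

    close(fd);
    return 0;
}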
== Potential Postgres Changes ==

Several operating systems / file systems behave differently (see e.g. [2], thanks Thomas) than we expected. Even the discussed changes to e.g. linux don't get us to where we thought we were. There's obviously also the question of how to deal with kernels / OSs that have not been updated.

Changes that appear to be necessary, even for kernels with the issues addressed:

- Clearly we need to treat fsync() EIO and ENOSPC errors as a PANIC and retry recovery. While ENODEV (underlying device went away) will be persistent, it probably makes sense to treat it the same, or even to just give up and shut down. One question I see here is whether we want to just continue crash-recovery cycles, or whether we want to limit that.

- We need more aggressive error checking on close(), for ENOSPC and EIO. In both cases afaics we'll have to trigger a crash recovery cycle. It's entirely possible to end up in a loop on NFS etc., but I don't think there's a way around that.

  Robert, on IM, wondered whether there'd be a race between some backend doing a close(), triggering a PANIC, and a checkpoint succeeding. I don't *think* so, because the error will only happen if there's outstanding dirty data, and the checkpoint would have flushed that out if it belonged to the current checkpointing cycle.

- The outstanding fsync request queue isn't persisted properly [3]. This means that even if the kernel behaved the way we'd expected, we'd not fail a second checkpoint :(. It's possible that we don't need to deal with this because we'll henceforth PANIC, but I'd argue we should fix it regardless. It seems like a time bomb otherwise (e.g. after moving to DIO somebody might want to relax the PANIC...).

- It might be a good idea to whitelist expected return codes for write() and PANIC on ones that we did not expect (a sketch follows after this list). E.g. when hitting an EIO we should probably PANIC, to get back to a known good state - even though it's likely that we'd see that error again at fsync().

- Docs.

I think we also need to audit a few codepaths. I'd be surprised if we PANICed appropriately on all fsyncs(), particularly around the SLRUs. I think we need to be particularly careful around the WAL handling: it seems fairly likely that there are cases where we write out WAL in one backend and then fsync() in another backend with a file descriptor that was only opened *after* the write occurred, which means we might miss the error entirely.
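As a sketch of the close()/write() checking above (illustrative pseudo-wrapper code, not actual postgres source; pg_panic() merely stands in for ereport(PANIC, ...), and the exact errno whitelist is an assumption for the example):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

void
pg_panic(const char *msg)
{
    fprintf(stderr, "PANIC: %s: %s\n", msg, strerror(errno));
    abort();                    /* in postgres this would trigger crash recovery */
}

ssize_t
checked_write(int fd, const void *buf, size_t len)
{
    ssize_t rc = write(fd, buf, len);

    if (rc < 0)
    {
        /*
         * EINTR (and EAGAIN for non-blocking fds) are expected and can be
         * retried by the caller; anything else - notably EIO and ENOSPC -
         * means data may already have been thrown away, so get back to a
         * known-good state via crash recovery instead of limping on.
         */
        if (errno != EINTR && errno != EAGAIN)
            pg_panic("unexpected error from write()");
    }
    return rc;                  /* short writes are left to the caller to retry */
}

void
checked_close(int fd)
{
    /*
     * On NFS-like filesystems close() can perform an implicit flush and
     * report ENOSPC/EIO here; per the discussion above that must not be
     * swallowed silently.
     */
    if (close(fd) != 0 && (errno == EIO || errno == ENOSPC))
        pg_panic("I/O error reported at close()");
}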
Then there's the question of how we want to deal with kernels that haven't been updated with the aforementioned changes. We could say that we expect decent OS support and declare that we just can't handle this - given that at least various linux versions, NetBSD, OpenBSD and macOS just silently drop errors, and we'd need different approaches for dealing with each of them, that doesn't seem like an insane approach.

What we could do:

- Forward file descriptors from backends to the checkpointer (using SCM_RIGHTS) when marking a segment dirty. That'd require some optimizations (see [4]) to avoid doing so repeatedly. That'd guarantee correct behaviour in all linux kernels >= 4.13 (possibly backported by distributions?), and I think it'd also make it vastly more likely that errors are reported in earlier kernels.

  This should be doable without a noticeable performance impact, I believe. I don't think it'd be that hard either, but it'd be a bit of a pain to backport it to all postgres versions, as well as a bit invasive for that. The infrastructure this'd likely end up building (a hashtable of open relfilenodes) would likely be useful for further things (like caching file sizes).

- Add a pre-checkpoint hook that checks for filesystem errors *after* fsyncing all the files, but *before* logging the checkpoint completion record. Operating systems, filesystems, etc. all log errors in different formats, but for larger installations it'd not be too hard to write code that checks their specific configuration.

  While I'm a bit concerned about adding user code before a checkpoint, if we did it as a shell command it seems pretty reasonable. And it would be useful even without concern for the fsync issue itself: checking for IO errors could e.g. also include checking for read errors - it'd not be unreasonable to refuse to complete a checkpoint if there'd been any media errors.

- Use direct IO. Due to architectural performance issues in PG, and the fact that it'd not be applicable for all installations, I don't think this is a reasonable fix for the issue presented here, although it's independently something we should work on. It might be worthwhile to provide a configuration that allows forcing DIO to be enabled for WAL even if replication is turned on.

- magic

Greetings,

Andres Freund

[0] https://archives.postgresql.org/message-id/CAMsr+YHh+5Oq4xziwwoEfhoTZgr07vdGG+hu=1adXx59aTeaoQ@mail.gmail.com
[1] static const struct xfs_error_init xfs_error_meta_init[XFS_ERR_ERRNO_MAX] = {
        { .name = "default",
          .max_retries = XFS_ERR_RETRY_FOREVER,
          .retry_timeout = XFS_ERR_RETRY_FOREVER,
        },
        { .name = "EIO",
          .max_retries = XFS_ERR_RETRY_FOREVER,
          .retry_timeout = XFS_ERR_RETRY_FOREVER,
        },
        { .name = "ENOSPC",
          .max_retries = XFS_ERR_RETRY_FOREVER,
          .retry_timeout = XFS_ERR_RETRY_FOREVER,
        },
        { .name = "ENODEV",
          .max_retries = 0,     /* We can't recover from devices disappearing */
          .retry_timeout = 0,
        },
    };
[2] https://wiki.postgresql.org/wiki/Fsync_Errors
[3] https://archives.postgresql.org/message-id/87y3i1ia4w.fsf%40news-spur.riddles.org.uk
[4] https://archives.postgresql.org/message-id/20180424180054.inih6bxfspgowjuc@alap3.anarazel.de
On Fri, Apr 27, 2018 at 03:28:42PM -0700, Andres Freund wrote: > - We need more aggressive error checking on close(), for ENOSPC and > EIO. In both cases afaics we'll have to trigger a crash recovery > cycle. It's entirely possible to end up in a loop on NFS etc, but I > don't think there's a way around that. If the no-space or write failures are persistent, as you mentioned above, what is the point of going into crash recovery --- why not just shut down? Also, since we can't guarantee that we can write any persistent state to storage, we have no way of preventing infinite crash recovery loops, which, based on inconsistent writes, might make things worse. I think a single panic with no restart is the right solution. An additional feature we have talked about is running some kind of notification shell script to inform administrators, similar to archive_command. We need this too when sync replication fails. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Hi, On 2018-04-27 19:04:47 -0400, Bruce Momjian wrote: > On Fri, Apr 27, 2018 at 03:28:42PM -0700, Andres Freund wrote: > > - We need more aggressive error checking on close(), for ENOSPC and > > EIO. In both cases afaics we'll have to trigger a crash recovery > > cycle. It's entirely possible to end up in a loop on NFS etc, but I > > don't think there's a way around that. > > If the no-space or write failures are persistent, as you mentioned > above, what is the point of going into crash recovery --- why not just > shut down? Well, I mentioned that as an alternative in my email. But for one we don't really have cases where we do that right now, for another we can't really differentiate between a transient and non-transient state. It's entirely possible that the admin on the system that ran out of space fixes things, clearing up the problem. > Also, since we can't guarantee that we can write any persistent state > to storage, we have no way of preventing infinite crash recovery > loops, which, based on inconsistent writes, might make things worse. How would it make things worse? > An additional features we have talked about is running some kind of > notification shell script to inform administrators, similar to > archive_command. We need this too when sync replication fails. To me that seems like a feature independent of this thread. Greetings, Andres Freund
On Fri, Apr 27, 2018 at 04:10:43PM -0700, Andres Freund wrote: > Hi, > > On 2018-04-27 19:04:47 -0400, Bruce Momjian wrote: > > On Fri, Apr 27, 2018 at 03:28:42PM -0700, Andres Freund wrote: > > > - We need more aggressive error checking on close(), for ENOSPC and > > > EIO. In both cases afaics we'll have to trigger a crash recovery > > > cycle. It's entirely possible to end up in a loop on NFS etc, but I > > > don't think there's a way around that. > > > > If the no-space or write failures are persistent, as you mentioned > > above, what is the point of going into crash recovery --- why not just > > shut down? > > Well, I mentioned that as an alternative in my email. But for one we > don't really have cases where we do that right now, for another we can't > really differentiate between a transient and non-transient state. It's > entirely possible that the admin on the system that ran out of space > fixes things, clearing up the problem. True, but if we get a no-space error, odds are it will not be fixed at the time we are failing. Wouldn't the administrator check that the server is still running after they free the space? > > Also, since we can't guarantee that we can write any persistent state > > to storage, we have no way of preventing infinite crash recovery > > loops, which, based on inconsistent writes, might make things worse. > > How would it make things worse? Uh, I can imagine some writes working and some not, and getting things more inconsistent. I would say at least that we don't know. > > An additional features we have talked about is running some kind of > > notification shell script to inform administrators, similar to > > archive_command. We need this too when sync replication fails. > > To me that seems like a feature independent of this thread. Well, if we are introducing new panic-and-not-restart behavior, we might need this new feature. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On 2018-04-27 19:38:30 -0400, Bruce Momjian wrote: > On Fri, Apr 27, 2018 at 04:10:43PM -0700, Andres Freund wrote: > > Hi, > > > > On 2018-04-27 19:04:47 -0400, Bruce Momjian wrote: > > > On Fri, Apr 27, 2018 at 03:28:42PM -0700, Andres Freund wrote: > > > > - We need more aggressive error checking on close(), for ENOSPC and > > > > EIO. In both cases afaics we'll have to trigger a crash recovery > > > > cycle. It's entirely possible to end up in a loop on NFS etc, but I > > > > don't think there's a way around that. > > > > > > If the no-space or write failures are persistent, as you mentioned > > > above, what is the point of going into crash recovery --- why not just > > > shut down? > > > > Well, I mentioned that as an alternative in my email. But for one we > > don't really have cases where we do that right now, for another we can't > > really differentiate between a transient and non-transient state. It's > > entirely possible that the admin on the system that ran out of space > > fixes things, clearing up the problem. > > True, but if we get a no-space error, odds are it will not be fixed at > the time we are failing. Wouldn't the administrator check that the > server is still running after they free the space? I'd assume it's pretty common that those are separate teams. Given that we currently don't behave that way for other cases where we *already* can enter crash-recovery loops I don't think we need to introduce that here. It's far more common to enter this kind of problem with pg_xlog filling up the ordinary way. And that can lead to such loops. > > > Also, since we can't guarantee that we can write any persistent state > > > to storage, we have no way of preventing infinite crash recovery > > > loops, which, based on inconsistent writes, might make things worse. > > > > How would it make things worse? > > Uh, I can imagine some writes working and some not, and getting things > more inconsistent. I would say at least that we don't know. Recovery needs to fix that or we're lost anyway. And we'll retry exactly the same writes each round. > > > An additional features we have talked about is running some kind of > > > notification shell script to inform administrators, similar to > > > archive_command. We need this too when sync replication fails. > > > > To me that seems like a feature independent of this thread. > > Well, if we are introducing new panic-and-not-restart behavior, we might > need this new feature. I don't see how this follows. It's easier to externally script notification for the server having died, than doing it for crash restarts. That's why we have restart_after_crash=false... There might be some arguments for this type of notification, but I don't think it should be conflated with the problem here. Nor is it guaranteed that such a script could do much, given that disks might be failing and such. Greetings, Andres Freund
On 28 April 2018 at 06:28, Andres Freund <andres@anarazel.de> wrote: > Hi, > > I thought I'd send this separately from [0] as the issue has become more > general than what was mentioned in that thread, and it went off into > various weeds. Thanks very much for going and for the great summary. > - Actually marking files as persistently failed would require filesystem > changes, and filesystem metadata IO, far from guaranteed in failure > scenarios. Yeah, I've avoided suggesting anything like that because it seems way too likely to lead to knock-on errors. Like malloc'ing in an OOM path, just don't. > The second major type of proposal was using direct-IO. That'd generally > be a desirable feature, but a) would require some significant changes to > postgres to be performant, b) isn't really applicable for the large > percentage of installations that aren't tuned reasonably well, because > at the moment the OS page cache functions as a memory-pressure aware > extension of postgres' page cache. Yeah. I've avoided advocating for O_DIRECT because it's a big job (understatement). We'd need to pay so much more attention to details of storage layout if we couldn't rely as much on the kernel neatly organising and queuing everything for us, too. At the risk of displaying my relative ignorance of direct I/O: Does O_DIRECT without O_SYNC even provide a strong guarantee that when you close() the file, all I/O has reliably succeeded? It must've gone through the FS layer, but don't FSes do various caching and reorganisation too? Can the same issue arise in other ways unless we also fsync() before close() or write O_SYNC? At one point I looked into using AIO instead. But last I looked it was pretty spectacularly quirky when it comes to reliably flushing, and outright broken on some versions. In any case, our multiprocessing model would make tracking completions annoying, likely more so than the sort of FD handoff games we've discussed. > Another topic brought up in this thread was the handling of ENOSPC > errors that aren't triggered on a filesystem level, but rather are > triggered by thin provisioning. On linux that currently apprently lead > to page cache contents being lost (and errors "eaten") in a lot of > places, including just when doing a write(). ... wow. Is that with lvm-thin? The thin provisioning I was mainly concerned with is SAN-based thin provisioning, which looks like a normal iSCSI target or a normal LUN on a HBA to Linux. Then it starts failing writes with a weird potentially vendor-specific sense error if it runs out of backing store. How that's handled likely depends on the specific error, the driver, which FS you use, etc. In the case I saw, multipath+lvm+xfs, it resulted in lost writes and fsync() errors being reported once, per the start of the original thread. > In a lot of cases it's > pretty much expected that the file system will just hang or react > unpredictably upon space exhaustion. My reading is that the block-layer > thin provisioning code is still pretty fresh, and should only be used > with great care. The only way to halfway reliably use it appears to > change the configuration so space exhaustion blocks until admin > intervention (at least dm-thinp provides allows that). Seems that should go in the OS-specific configuration part of the docs, along with the advice I gave on the original thread re configuring multipath no_path_retries. > There's some clear need to automate some more testing in this area so > that future behaviour changes don't surprise us. 
We don't routinely test ENOSPC (or memory exhaustion, or crashes) in PostgreSQL even on bog standard setups. Like the performance farm discussion, this is something I'd like to pick up at some point. I'm going to need to talk to the team I work with regarding time/resources allocation, but I think it's important that we make such testing more of a routine thing. > - Matthew Wilcox proposed (and posted a patch) that'd partially revert > behaviour to the pre v4.13 world, by *also* reporting errors to > "newer" file-descriptors if the error hasn't previously been > reported. That'd still not guarantee that the error is reported > (memory pressure could evict information without open fd), but in most > situations we'll again get the error in the checkpointer. > > This seems largely be agreed upon. It's unclear whether it'll go into > the stable backports for still-maintained >= v4.13 kernels. That seems very sensible. In our case we're very unlikely to have some other unrelated process come in and fsync() our files for us. I'd want to be sure the report didn't get eaten by sync() or syncfs() though. > - syncfs() will be fixed so it reports errors properly - that'll likely > require passing it an O_PATH filedescriptor to have space to store the > errseq_t value that allows discerning already reported and new errors. > > No patch has appeared yet, but the behaviour seems largely agreed > upon. Good, but as you noted, of limited use to us unless we want to force users to manage space for temporary and unlogged relations completely separately. I wonder if we could convince the kernel to offer a file_sync_mode xattr to control this? (Hint: I'm already running away in a mylar fire suit). > - Make per-filesystem error counts available in a uniform (i.e. same for > every supporting fs) manner. Right now it's very hard to figure out > whether errors occurred. There seemed general agreement that exporting > knowledge about such errors is desirable. Quite possibly the syncfs() > fix above will provide the necessary infrastructure. It's unclear as > of yet how the value would be exposed. Per-fs /sys/ entries and an > ioctl on O_PATH fds have been mentioned. > > These'd error counts would not vanish due to memory pressure, and they > can be checked even without knowing which files in a specific > filesystem have been touched (e.g. when just untar-ing something). > > There seemed to be fairly widespread agreement that this'd be a good > idea. Much less clearer whether somebody would do the work. > > - Provide config knobs that allow to define the FS error behaviour in a > consistent way across supported filesystems. XFS currently has various > knobs controlling what happens in case of metadata errors [1] (retry > forever, timeout, return up). It was proposed that this interface be > extended to also deal with data errors, and moved into generic support > code. > > While the timeline is unclear, there seemed to be widespread support > for the idea. I believe Dave Chinner indicated that he at least has > plans to generalize the code. That's great. It sounds like this has revitalised some interest in the error reporting and might yield some more general cleanups :) > - Stop inodes with unreported errors from being evicted. This will > guarantee that a later fsync (without an open FD) will see the > error. The memory pressure concerns here are lower than with keeping > all the failed pages in memory, and it could be optimized further. 
> > I read some tentative agreement behind this idea, but I think it's the > by far most controversial one. The main issue there would seem to be cases of whole-FS failure like the USB-key-yank example. You're going to have to be able to get rid of them at some point. > - Clearly we need to treat fsync() EIO, ENOSPC errors as a PANIC and > retry recovery. While ENODEV (underlying device went away) will be > persistent, it probably makes sense to treat it the same or even just > give up and shut down. One question I see here is whether we just > want to continue crash-recovery cycles, or whether we want to limit > that. Right now, we'll panic once, then panic again in redo if the error persists and give up. On some systems, and everywhere that Pg is directly user-managed with pg_ctl, that'll leave Pg down until the operator intervenes. Some init systems will restart the postmaster automatically. Some will give up after a few tries. Some will back off retries over time. It depends on the init system. I'm not sure that's a great outcome. So rather than giving up if redo fails, we might want to offer a knob to retry, possibly with pause/backoff. I'm sure people currently expect PostgreSQL to try to stay up and recover, like it does after a segfault or most other errors. Personally I prefer to run Pg with restart_after_crash=off and let the init system launch a new postmaster, but that's not an option unless you have a sensible init. > - We need more aggressive error checking on close(), for ENOSPC and > EIO. In both cases afaics we'll have to trigger a crash recovery > cycle. It's entirely possible to end up in a loop on NFS etc, but I > don't think there's a way around that. > > Robert, on IM, wondered whether there'd be a race between some backend > doing a close(), triggering a PANIC, and a checkpoint succeeding. I > don't *think* so, because the error will only happen if there's > outstanding dirty data, and the checkpoint would have flushed that out > if it belonged to the current checkpointing cycle. Even if it's possible (which it sounds like it probably isn't), it might also be one of those corner-cases-of-corner-cases where we just shrug and worry about bigger fish. > - The outstanding fsync request queue isn't persisted properly [3]. This > means that even if the kernel behaved the way we'd expected, we'd not > fail a second checkpoint :(. It's possible that we don't need to deal > with this because we'll henceforth PANIC, but I'd argue we should fix > that regardless. Seems like a time-bomb otherwise (e.g. after moving > to DIO somebody might want to relax the PANIC...). Huh! Good find. That definitely merits fixing. > - It might be a good idea to whitelist expected return codes for write() > and PANIC one ones that we did not expect. E.g. when hitting an EIO we > should probably PANIC, to get back to a known good state. Even though > it's likely that we'd again that error at fsync(). > > - Docs. Yep. Especially OS-specific configuration for known dangerous setups (lvm-thin, multipath), etc. I imagine we can distill a lot of it from the discussion and simplify a bit. > I think we also need to audit a few codepaths. I'd be surprised if we > PANICed appropriately on all fsyncs(), particularly around the SLRUs. We _definitely_ do not, see the patch I sent on the other thread. > Then there's the question of how we want to deal with kernels that > haven't been updated with the aforementioned changes. 
We could say that > we expect decent OS support and declare that we just can't handle this - > given that at least various linux versions, netbsd, openbsd, MacOS just > silently drop errors and we'd need different approaches for dealing with > that, that doesn't seem like an insane approach. > What we could do: > > - forward file descriptors from backends to checkpointer (using > SCM_RIGHTS) when marking a segment dirty. That'd require some > optimizations (see [4]) to avoid doing so repeatedly. That'd > guarantee correct behaviour in all linux kernels >= 4.13 (possibly > backported by distributions?), and I think it'd also make it vastly > more likely that errors are reported in earlier kernels. It'd be interesting to see if other platforms that support fd passing will give us the desired behaviour too. But even if it only helps on Linux, that's a huge majority of the PostgreSQL deployments these days. > - Add a pre-checkpoint hook that checks for filesystem errors *after* > fsyncing all the files, but *before* logging the checkpoint completion > record. Operating systems, filesystems, etc. all log the error format > differently, but for larger installations it'd not be too hard to > write code that checks their specific configuration. I looked into using trace event file descriptors for this, btw, but we'd need CAP_SYS_ADMIN to create one that captured events for other processes. Plus filtering the events to find only events for the files / file systems of interest would be far from trivial. And I don't know what guarantees we have about when events are delivered. I'd love to be able to use inotify for this, but again, that'd only be a new-kernels thing since it'd need an inotify extension to report I/O errors. Presumably mostly this check would land up looking at dmesg. I'm not convinced it'd get widely deployed and widely used, or that it'd be used correctly when people tried to use it. Look at the hideous mess that most backup/standby creation scripts, archive_command scripts, etc are. > - Use direct IO. Due to architectural performance issues in PG and the > fact that it'd not be applicable for all installations I don't think > this is a reasonable fix for the issue presented here. Although it's > independently something we should work on. It might be worthwhile to > provide a configuration that allows to force DIO to be enabled for WAL > even if replication is turned on. Seems like a long term goal, but you've noted elsewhere that doing it well would be hard. I suspect we'd need writer threads, we'd need to know more about the underlying FS/storage layout to make better decisions about write parallelism, etc. We get away with a lot right now by letting the kernel and buffered I/O sort that out. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
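As a reference point for the fd-forwarding idea discussed above, the SCM_RIGHTS mechanism looks roughly like this minimal sketch (not postgres code; the socketpair/connection setup and most error handling are omitted):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send fd over a connected AF_UNIX socket.  Returns 0 on success, -1 on error. */
int
send_fd(int sock, int fd)
{
    char    dummy = 'F';        /* at least one byte of real data must go along */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

/* Receive a descriptor sent with send_fd().  Returns the new fd, or -1. */
int
recv_fd(int sock)
{
    char    dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int     fd = -1;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg != NULL &&
        cmsg->cmsg_level == SOL_SOCKET &&
        cmsg->cmsg_type == SCM_RIGHTS)
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));

    return fd;                  /* a new fd referring to the same open file */
}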
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2018-04-27 19:38:30 -0400, Bruce Momjian wrote: > > On Fri, Apr 27, 2018 at 04:10:43PM -0700, Andres Freund wrote: > > > On 2018-04-27 19:04:47 -0400, Bruce Momjian wrote: > > > > On Fri, Apr 27, 2018 at 03:28:42PM -0700, Andres Freund wrote: > > > > > - We need more aggressive error checking on close(), for ENOSPC and > > > > > EIO. In both cases afaics we'll have to trigger a crash recovery > > > > > cycle. It's entirely possible to end up in a loop on NFS etc, but I > > > > > don't think there's a way around that. > > > > > > > > If the no-space or write failures are persistent, as you mentioned > > > > above, what is the point of going into crash recovery --- why not just > > > > shut down? > > > > > > Well, I mentioned that as an alternative in my email. But for one we > > > don't really have cases where we do that right now, for another we can't > > > really differentiate between a transient and non-transient state. It's > > > entirely possible that the admin on the system that ran out of space > > > fixes things, clearing up the problem. > > > > True, but if we get a no-space error, odds are it will not be fixed at > > the time we are failing. Wouldn't the administrator check that the > > server is still running after they free the space? > > I'd assume it's pretty common that those are separate teams. Given that > we currently don't behave that way for other cases where we *already* > can enter crash-recovery loops I don't think we need to introduce that > here. It's far more common to enter this kind of problem with pg_xlog > filling up the ordinary way. And that can lead to such loops. When we crash-restart, we also go through and clean things up some, no? Seems like that gives us the potential to end up fixing things ourselves and allowing the crash-restart to succeed. Consider unlogged tables, temporary tables, on-disk sorts, etc. It's entirely common for a bad query to run the system out of disk space (but have a write of a regular table be what discovers the out-of-space problem...) and if we crash-restart properly then we'd hopefully clean things out, freeing up space, and allowing us to come back up. Now, of course, ideally admins would set up temp tablespaces and segregate WAL onto its own filesystem, etc, but... Thanks! Stephen
Greetings, * Craig Ringer (craig@2ndquadrant.com) wrote: > On 28 April 2018 at 06:28, Andres Freund <andres@anarazel.de> wrote: > > - Add a pre-checkpoint hook that checks for filesystem errors *after* > > fsyncing all the files, but *before* logging the checkpoint completion > > record. Operating systems, filesystems, etc. all log the error format > > differently, but for larger installations it'd not be too hard to > > write code that checks their specific configuration. > > I looked into using trace event file descriptors for this, btw, but > we'd need CAP_SYS_ADMIN to create one that captured events for other > processes. Plus filtering the events to find only events for the files > / file systems of interest would be far from trivial. And I don't know > what guarantees we have about when events are delivered. > > I'd love to be able to use inotify for this, but again, that'd only be > a new-kernels thing since it'd need an inotify extension to report I/O > errors. > > Presumably mostly this check would land up looking at dmesg. > > I'm not convinced it'd get widely deployed and widely used, or that > it'd be used correctly when people tried to use it. Look at the > hideous mess that most backup/standby creation scripts, > archive_command scripts, etc are. Agree with more-or-less everything you've said here, but a big +1 on this. If we do end up going down this route we have *got* to provide scripts which we know work and have been tested and are well maintained on the popular OS's for the popular filesystems and make it clear that we've tested those and not others. We definitely shouldn't put something in our docs that is effectively an example of the interface but not an actual command that anyone should be using. Thanks! Stephen
On 27 April 2018 at 15:28, Andres Freund <andres@anarazel.de> wrote: > - Add a pre-checkpoint hook that checks for filesystem errors *after* > fsyncing all the files, but *before* logging the checkpoint completion > record. Operating systems, filesystems, etc. all log the error format > differently, but for larger installations it'd not be too hard to > write code that checks their specific configuration. > > While I'm a bit concerned adding user-code before a checkpoint, if > we'd do it as a shell command it seems pretty reasonable. And useful > even without concern for the fsync issue itself. Checking for IO > errors could e.g. also include checking for read errors - it'd not be > unreasonable to not want to complete a checkpoint if there'd been any > media errors. It seems clear that we need to evaluate our compatibility not just with an OS, as we do now, but with an OS/filesystem. Although people have suggested some approaches, I'm more interested in discovering how we can be certain we got it right. And the end result seems to be that PostgreSQL will be forced, in the short term, to declare certain combinations of OS/filesystem unsupported, with clear warning sent out to users. Adding a pre-checkpoint hook encourages people to fix this themselves without reporting issues, so I initially oppose this until we have a clearer argument as to why we need it. The answer is not to make this issue more obscure, but to make it more public. > - Use direct IO. Due to architectural performance issues in PG and the > fact that it'd not be applicable for all installations I don't think > this is a reasonable fix for the issue presented here. Although it's > independently something we should work on. It might be worthwhile to > provide a configuration that allows to force DIO to be enabled for WAL > even if replication is turned on. "Use DirectIO" is roughly same suggestion as "don't trust Linux filesystems". It would be a major admission of defeat for us to take that as our main route to a solution. The people I've spoken to so far have encouraged us to continue working with the filesystem layer, offering encouragement of our decision to use filesystems. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On Sat, Apr 28, 2018 at 11:21:20AM -0400, Stephen Frost wrote: > * Craig Ringer (craig@2ndquadrant.com) wrote: > > On 28 April 2018 at 06:28, Andres Freund <andres@anarazel.de> wrote: > > > - Add a pre-checkpoint hook that checks for filesystem errors *after* > > > fsyncing all the files, but *before* logging the checkpoint completion > > > record. Operating systems, filesystems, etc. all log the error format > > > differently, but for larger installations it'd not be too hard to > > > write code that checks their specific configuration. > > > > I looked into using trace event file descriptors for this, btw, but > > we'd need CAP_SYS_ADMIN to create one that captured events for other > > processes. Plus filtering the events to find only events for the files > > / file systems of interest would be far from trivial. And I don't know > > what guarantees we have about when events are delivered. > > > > I'd love to be able to use inotify for this, but again, that'd only be > > a new-kernels thing since it'd need an inotify extension to report I/O > > errors. > > > > Presumably mostly this check would land up looking at dmesg. > > > > I'm not convinced it'd get widely deployed and widely used, or that > > it'd be used correctly when people tried to use it. Look at the > > hideous mess that most backup/standby creation scripts, > > archive_command scripts, etc are. > > Agree with more-or-less everything you've said here, but a big +1 on > this. If we do end up going down this route we have *got* to provide > scripts which we know work and have been tested and are well maintained > on the popular OS's for the popular filesystems and make it clear that > we've tested those and not others. We definitely shouldn't put > something in our docs that is effectively an example of the interface > but not an actual command that anyone should be using. This dmesg-checking has been mentioned several times now, but IME enterprise distributions (or server ops teams?) seem to tighten access to dmesg and /var/log to non-root users, including postgres. Well, or just vanilla Debian stable apparently: postgres@fock:~$ dmesg dmesg: read kernel buffer failed: Operation not permitted Is it really a useful expectation that the postgres user will be able to trawl system logs for I/O errors? Or are we expecting the sysadmins (in case they are distinct from the DBAs) to set up sudo and/or relax permissions for this everywhere? We should document this requirement properly at least then. The netlink thing from Google that Ted Ts'o mentioned would probably work around that, but if that is opened up it would not be deployed anytime soon either. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Hi, On 2018-04-28 11:10:54 -0400, Stephen Frost wrote: > When we crash-restart, we also go through and clean things up some, no? > Seems like that gives us the potential to end up fixing things ourselves > and allowing the crash-restart to succeed. Sure, there's the potential for that. But it's quite possible to be missing a lot of free space over NFS (this really isn't much of an issue for local FS, at least not on linux) in a workload with rapidly expanding space usage. And even if you recover, you could just hit the issue again shortly afterwards. Greetings, Andres Freund
Hi, On 2018-04-28 17:35:48 +0200, Michael Banck wrote: > This dmesg-checking has been mentioned several times now, but IME > enterprise distributions (or server ops teams?) seem to tighten access > to dmesg and /var/log to non-root users, including postgres. > > Well, or just vanilla Debian stable apparently: > > postgres@fock:~$ dmesg > dmesg: read kernel buffer failed: Operation not permitted > > Is it really a useful expectation that the postgres user will be able to > trawl system logs for I/O errors? Or are we expecting the sysadmins (in > case they are distinct from the DBAs) to setup sudo and/or relax > permissions for this everywhere? We should document this requirement > properly at least then. I'm not a huge fan of this approach, but yes, that'd be necessary. It's not that problematic to have to change /dev/kmsg permissions imo. Adding a read group / acl seems quite doable. > The netlink thing from Google that Tet Ts'O mentioned would probably > work around that, but if that is opened up it would not be deployed > anytime soon either. Yea, that seems irrelevant for now. Greetings, Andres Freund
Hi, On 2018-04-28 08:25:53 -0700, Simon Riggs wrote: > > - Use direct IO. Due to architectural performance issues in PG and the > > fact that it'd not be applicable for all installations I don't think > > this is a reasonable fix for the issue presented here. Although it's > > independently something we should work on. It might be worthwhile to > > provide a configuration that allows to force DIO to be enabled for WAL > > even if replication is turned on. > > "Use DirectIO" is roughly same suggestion as "don't trust Linux filesystems". I want to emphasize that this is NOT a linux only issue. It's a problem across a number of operating systems, including linux. > It would be a major admission of defeat for us to take that as our > main route to a solution. Well, I think we were wrong to not engineer towards DIO. There's just too many issues with buffered IO to not have a supported path for DIO. But given that it's unrealistic to do so without major work, and wouldn't be applicable for all installations (shared_buffer size becomes critical), I don't think it matters that much for the issue discussed here. > The people I've spoken to so far have encouraged us to continue > working with the filesystem layer, offering encouragement of our > decision to use filesystems. There's a lot of people disagreeing with it too. Greetings, Andres Freund
Hi, On 2018-04-28 20:00:25 +0800, Craig Ringer wrote: > On 28 April 2018 at 06:28, Andres Freund <andres@anarazel.de> wrote: > > The second major type of proposal was using direct-IO. That'd generally > > be a desirable feature, but a) would require some significant changes to > > postgres to be performant, b) isn't really applicable for the large > > percentage of installations that aren't tuned reasonably well, because > > at the moment the OS page cache functions as a memory-pressure aware > > extension of postgres' page cache. > > Yeah. I've avoided advocating for O_DIRECT because it's a big job > (understatement). We'd need to pay so much more attention to details > of storage layout if we couldn't rely as much on the kernel neatly > organising and queuing everything for us, too. > > At the risk of displaying my relative ignorance of direct I/O: Does > O_DIRECT without O_SYNC even provide a strong guarantee that when you > close() the file, all I/O has reliably succeeded? It must've gone > through the FS layer, but don't FSes do various caching and > reorganisation too? Can the same issue arise in other ways unless we > also fsync() before close() or write O_SYNC? No, not really. There's generally two categories of IO here: Metadata IO and data IO. The filesystem's metadata IO a) has a lot more error checking (including things like remount-ro, stalling the filesystem on errors etc), b) isn't direct IO itself. For some filesystem metadata operations you'll still need fsyncs, but the *data* is flushed if you use DIO. I'd personally use O_DSYNC | O_DIRECT, and have the metadata operations guaranteed by fsyncs. You'd need the current fsyncs for renaming, and probably some fsyncs for file extensions. The latter to make sure the filesystem has written the metadata change. > At one point I looked into using AIO instead. But last I looked it was > pretty spectacularly quirky when it comes to reliably flushing, and > outright broken on some versions. In any case, our multiprocessing > model would make tracking completions annoying, likely more so than > the sort of FD handoff games we've discussed. AIO pretty much only works sensibly with DIO. > > Another topic brought up in this thread was the handling of ENOSPC > > errors that aren't triggered on a filesystem level, but rather are > > triggered by thin provisioning. On linux that currently apprently lead > > to page cache contents being lost (and errors "eaten") in a lot of > > places, including just when doing a write(). > > ... wow. > > Is that with lvm-thin? I think both dm and lvm (I typed llvm thrice) based thin provisioning. The FS code basically didn't expect ENOSPC being returned from storage, but suddenly the storage layer started returning it... > The thin provisioning I was mainly concerned with is SAN-based thin > provisioning, which looks like a normal iSCSI target or a normal LUN > on a HBA to Linux. Then it starts failing writes with a weird > potentially vendor-specific sense error if it runs out of backing > store. How that's handled likely depends on the specific error, the > driver, which FS you use, etc. In the case I saw, multipath+lvm+xfs, > it resulted in lost writes and fsync() errors being reported once, per > the start of the original thread. I think the concerns are largely the same for that. You'll have to configure the SAN to block in that case.
> > - Matthew Wilcox proposed (and posted a patch) that'd partially revert > > behaviour to the pre v4.13 world, by *also* reporting errors to > > "newer" file-descriptors if the error hasn't previously been > > reported. That'd still not guarantee that the error is reported > > (memory pressure could evict information without open fd), but in most > > situations we'll again get the error in the checkpointer. > > > > This seems largely be agreed upon. It's unclear whether it'll go into > > the stable backports for still-maintained >= v4.13 kernels. > > That seems very sensible. In our case we're very unlikely to have some > other unrelated process come in and fsync() our files for us. > > I'd want to be sure the report didn't get eaten by sync() or syncfs() though. It doesn't. Basically every fd has an errseq_t value copied into it at open. > > - syncfs() will be fixed so it reports errors properly - that'll likely > > require passing it an O_PATH filedescriptor to have space to store the > > errseq_t value that allows discerning already reported and new errors. > > > > No patch has appeared yet, but the behaviour seems largely agreed > > upon. > > Good, but as you noted, of limited use to us unless we want to force > users to manage space for temporary and unlogged relations completely > separately. Well, I think it'd still be ok as a backstop if it had decent error semantics. We don't checkpoint that often, and doing the syncing via syncfs() is considerably more efficient than individual fsync()s. But given it's currently buggy that tradeoff is moot. > I wonder if we could convince the kernel to offer a file_sync_mode > xattr to control this? (Hint: I'm already running away in a mylar fire > suit). Err. I am fairly sure you're not going to get anywhere with that. Given we're concerned about existing kernels, I doubt it'd help us much anyway. > > - Stop inodes with unreported errors from being evicted. This will > > guarantee that a later fsync (without an open FD) will see the > > error. The memory pressure concerns here are lower than with keeping > > all the failed pages in memory, and it could be optimized further. > > > > I read some tentative agreement behind this idea, but I think it's the > > by far most controversial one. > > The main issue there would seem to be cases of whole-FS failure like > the USB-key-yank example. You're going to have to be able to get rid > of them at some point. It's not actually a real problem (despite initially being brought up a number of times by kernel people). There's a separate error for that (ENODEV), and filesystems already handle it differently. Once that's returned, fsyncs() etc are just shortcut to ENODEV. > > What we could do: > > > > - forward file descriptors from backends to checkpointer (using > > SCM_RIGHTS) when marking a segment dirty. That'd require some > > optimizations (see [4]) to avoid doing so repeatedly. That'd > > guarantee correct behaviour in all linux kernels >= 4.13 (possibly > > backported by distributions?), and I think it'd also make it vastly > > more likely that errors are reported in earlier kernels. > > It'd be interesting to see if other platforms that support fd passing > will give us the desired behaviour too. But even if it only helps on > Linux, that's a huge majority of the PostgreSQL deployments these > days. Afaict it'd not help all of them. It does provide guarantees against the inode being evicted on pretty much all OSs, but not all of them have an error counter there... > > - Use direct IO. 
Due to architectural performance issues in PG and the > fact that it'd not be applicable for all installations I don't think > this is a reasonable fix for the issue presented here. Although it's > independently something we should work on. It might be worthwhile to > provide a configuration that allows to force DIO to be enabled for WAL > even if replication is turned on. > > Seems like a long term goal, but you've noted elsewhere that doing it > well would be hard. I suspect we'd need writer threads, we'd need to > know more about the underlying FS/storage layout to make better > decisions about write parallelism, etc. We get away with a lot right > now by letting the kernel and buffered I/O sort that out. We're a *lot* slower due to it. Don't think you would need writer threads, "just" a bgwriter that actually works and provides clean buffers unless the machine is overloaded. I've posted a patch that adds that. On the write side you then additionally need write combining (doing one write for several on-disk-consecutive buffers), which isn't trivial to add currently. The bigger issue than writes is actually doing reads nicely. There's no readahead anymore, and we'd not have the kernel backstopping our bad caching decisions anymore. Greetings, Andres Freund
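As a concrete illustration of the O_DIRECT | O_DSYNC data path discussed in this mail, a minimal sketch (not postgres code; the file name and block size are made up, and real code would still need fsync() for metadata operations such as renames and file extension, as noted above):

#define _GNU_SOURCE             /* for O_DIRECT on Linux/glibc */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192             /* buffer, offset and length must be aligned for DIO */

int
main(void)
{
    void *buf;
    int   fd = open("relation.segment",
                    O_RDWR | O_CREAT | O_DIRECT | O_DSYNC, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* O_DIRECT requires a suitably aligned buffer; 4096 covers most devices. */
    if (posix_memalign(&buf, 4096, BLCKSZ) != 0)
        return 1;
    memset(buf, 0, BLCKSZ);

    /*
     * With O_DSYNC the write() only returns once the data is durable, so a
     * failure is reported here, synchronously, instead of being deferred to
     * a later fsync() that may or may not see it.
     */
    if (pwrite(fd, buf, BLCKSZ, 0) != BLCKSZ)
        perror("pwrite");

    free(buf);
    close(fd);
    return 0;
}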
On 29 April 2018 at 00:15, Andres Freund <andres@anarazel.de> wrote: > Hi, > > On 2018-04-28 08:25:53 -0700, Simon Riggs wrote: >> > - Use direct IO. Due to architectural performance issues in PG and the >> > fact that it'd not be applicable for all installations I don't think >> > this is a reasonable fix for the issue presented here. Although it's >> > independently something we should work on. It might be worthwhile to >> > provide a configuration that allows to force DIO to be enabled for WAL >> > even if replication is turned on. >> >> "Use DirectIO" is roughly same suggestion as "don't trust Linux filesystems". > > I want to emphasize that this is NOT a linux only issue. It's a problem > across a number of operating systems, including linux. > > >> It would be a major admission of defeat for us to take that as our >> main route to a solution. > > Well, I think we were wrong to not engineer towards DIO. There's just > too many issues with buffered IO to not have a supported path for > DIO. But given that it's unrealistic to do so without major work, and > wouldn't be applicable for all installations (shared_buffer size becomes > critical), I don't think it matters that much for the issue discussed > here. 20/20 hindsight, really. Not much to be done now. Even with the work you and others have done on shared_buffers scalability, there's likely still improvement needed there if it becomes more important to evict buffers into per-device queues, etc, too. Personally I'd rather not have to write half the kernel's job because the kernel doesn't feel like doing it :( . I'd kind of hoped to go in the other direction if anything, with some kind of pseudo-write op that let us swap a dirty shared_buffers entry from our shared_buffers into the OS dirty buffer cache (on Linux at least) and let it handle writeback, so we reduce double-buffering. Ha! So much for that! -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 28 April 2018 at 23:25, Simon Riggs <simon@2ndquadrant.com> wrote: > On 27 April 2018 at 15:28, Andres Freund <andres@anarazel.de> wrote: > >> - Add a pre-checkpoint hook that checks for filesystem errors *after* >> fsyncing all the files, but *before* logging the checkpoint completion >> record. Operating systems, filesystems, etc. all log the error format >> differently, but for larger installations it'd not be too hard to >> write code that checks their specific configuration. >> >> While I'm a bit concerned adding user-code before a checkpoint, if >> we'd do it as a shell command it seems pretty reasonable. And useful >> even without concern for the fsync issue itself. Checking for IO >> errors could e.g. also include checking for read errors - it'd not be >> unreasonable to not want to complete a checkpoint if there'd been any >> media errors. > > It seems clear that we need to evaluate our compatibility not just > with an OS, as we do now, but with an OS/filesystem. > > Although people have suggested some approaches, I'm more interested in > discovering how we can be certain we got it right. TBH, we can't be certain, because there are too many failure modes, some of which we can't really simulate in practical ways, or automated ways. But there are definitely steps we can take: - Test the stack of FS, LVM (if any) etc with the dmsetup 'flakey' target and a variety of workloads designed to hit errors at various points. Some form of torture test. - Almost fill up the device and see what happens if we write() then fsync() enough to fill it. - Plug-pull storage and see what happens, especially for multipath/iSCSI/SAN. Experience with pg_test_fsync shows that it can also be hard to reliably interpret the results of tests. Again I'd like to emphasise that this is really only a significant risk for a few configurations. Yes, it could result in Pg not failing a checkpoint when it should if, say, your disk has a bad block it can't repair and remap. But as Andres has pointed out in the past, those sorts of local storage failure cases tend toward "you're kind of screwed anyway". It's only a serious concern when I/O errors are part of the storage's accepted operation, as in multipath with default settings. We _definitely_ need to warn multipath users that the defaults are insane. >> - Use direct IO. Due to architectural performance issues in PG and the >> fact that it'd not be applicable for all installations I don't think >> this is a reasonable fix for the issue presented here. Although it's >> independently something we should work on. It might be worthwhile to >> provide a configuration that allows to force DIO to be enabled for WAL >> even if replication is turned on. > > "Use DirectIO" is roughly same suggestion as "don't trust Linux filesystems". Surprisingly, that seems to be a lot of what's coming out of Linux developers. Reliable buffered I/O? Why would you try to do that? I know that's far from a universal position, though, and it sounds like things were more productive in Andres's discussions at the meet. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 28 April 2018 at 08:25, Simon Riggs <simon@2ndquadrant.com> wrote: > On 27 April 2018 at 15:28, Andres Freund <andres@anarazel.de> wrote: > >> - Add a pre-checkpoint hook that checks for filesystem errors *after* >> fsyncing all the files, but *before* logging the checkpoint completion >> record. Operating systems, filesystems, etc. all log the error format >> differently, but for larger installations it'd not be too hard to >> write code that checks their specific configuration. >> >> While I'm a bit concerned adding user-code before a checkpoint, if >> we'd do it as a shell command it seems pretty reasonable. And useful >> even without concern for the fsync issue itself. Checking for IO >> errors could e.g. also include checking for read errors - it'd not be >> unreasonable to not want to complete a checkpoint if there'd been any >> media errors. > > It seems clear that we need to evaluate our compatibility not just > with an OS, as we do now, but with an OS/filesystem. > > Although people have suggested some approaches, I'm more interested in > discovering how we can be certain we got it right. > > And the end result seems to be that PostgreSQL will be forced, in the > short term, to declare certain combinations of OS/filesystem > unsupported, with clear warning sent out to users. > > Adding a pre-checkpoint hook encourages people to fix this themselves > without reporting issues, so I initially oppose this until we have a > clearer argument as to why we need it. The answer is not to make this > issue more obscure, but to make it more public. Thinking some more, I think I understand, but please explain if not. We need behavior that varies according to OS and filesystem, which varies per tablespace. We could have that variable behavior using * a hook * a set of GUC parameters that can be set at tablespace level * a separate config file for each tablespace My preference would be to avoid a hook. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 28 April 2018 at 09:15, Andres Freund <andres@anarazel.de> wrote: > Hi, > > On 2018-04-28 08:25:53 -0700, Simon Riggs wrote: >> > - Use direct IO. Due to architectural performance issues in PG and the >> > fact that it'd not be applicable for all installations I don't think >> > this is a reasonable fix for the issue presented here. Although it's >> > independently something we should work on. It might be worthwhile to >> > provide a configuration that allows to force DIO to be enabled for WAL >> > even if replication is turned on. >> >> "Use DirectIO" is roughly same suggestion as "don't trust Linux filesystems". > > I want to emphasize that this is NOT a linux only issue. It's a problem > across a number of operating systems, including linux. Yes, of course. >> It would be a major admission of defeat for us to take that as our >> main route to a solution. > > Well, I think we were wrong to not engineer towards DIO. There's just > too many issues with buffered IO to not have a supported path for > DIO. But given that it's unrealistic to do so without major work, and > wouldn't be applicable for all installations (shared_buffer size becomes > critical), I don't think it matters that much for the issue discussed > here. > > >> The people I've spoken to so far have encouraged us to continue >> working with the filesystem layer, offering encouragement of our >> decision to use filesystems. > > There's a lot of people disagreeing with it too. Specific recent verbal feedback from OpenLDAP was that the project adopted DIO and found no benefit in doing so, with regret the other way from having tried. The care we need to use for any technique is the same. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Apr 29, 2018 at 10:42 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 28 April 2018 at 09:15, Andres Freund <andres@anarazel.de> wrote: >> On 2018-04-28 08:25:53 -0700, Simon Riggs wrote: >>> The people I've spoken to so far have encouraged us to continue >>> working with the filesystem layer, offering encouragement of our >>> decision to use filesystems. >> >> There's a lot of people disagreeing with it too. > > Specific recent verbal feedback from OpenLDAP was that the project > adopted DIO and found no benefit in doing so, with regret the other > way from having tried. I'm not sure if OpenLDAP is really comparable. The big three RDBMSs + MySQL started like us and eventually switched to direct IO, I guess at a time when direct IO support matured in OSs and their own IO scheduling was thought to be superior. I'm pretty sure they did that because they didn't like wasting RAM on double buffering and had better ideas about IO scheduling. From some googling this morning: DB2: The Linux/Unix/Windows edition changed its default to DIO ("NO FILESYSTEM CACHING") in release 9.5 in 2007[1], but it can still do buffered IO if you ask for it. Oracle: Around the same time or earlier, in the Linux 2.4 era, Oracle apparently supported direct IO ("FILESYSTEMIO_OPTIONS = DIRECTIO" (or SETALL for DIRECTIO + ASYNCH)) on big iron Unix but didn't yet use it on Linux[2]. There were some amusing emails from Linus Torvalds on this topic[3]. I'm not sure what FILESYSTEMIO_OPTIONS's default value is on each operating system today or when it changed, it's probably SETALL everywhere by now? I wonder if they stuck with buffered IO for a time on Linux despite the availability of direct IO because they thought it was more reliable or more performant. SQL Server: I couldn't find any evidence that they've even kept the option to use buffered IO (which must have existed in the ancestral code base). Can it? It's a different situation though, targeting a reduced set of platforms. MySQL: The default is still buffered ("innodb_flush_method = fsync" as opposed to "O_DIRECT") but O_DIRECT is supported and widely recommended, so it sounds like it's usually a win. Maybe not on smaller systems though? On MySQL, there are anecdotal reports of performance suffering on some systems when you turn on O_DIRECT however. If that's true, it's interesting to speculate about why that might be as it would probably apply also to us in early versions (optimistic explanation: the kernel's stretchy page cache allows people to get away with poorly tuned buffer pool size? pessimistic explanation: the page reclamation or IO scheduling (asynchronous write-back, write clustering, read-ahead etc) is not as good as the OS's, but that effect is hidden by suitably powerful disk subsystem with its own magic caching?) Note that its O_DIRECT setting *also* calls fsync() to flush filesystem meta-data (necessary if the file was extended); I wonder if that is exposed to write-back error loss. [1] https://www.ibm.com/support/knowledgecenter/en/SSEPGG_9.5.0/com.ibm.db2.luw.admin.dbobj.doc/doc/c0051304.html [2] http://www.ixora.com.au/notes/direct_io.htm [3] https://lkml.org/lkml/2002/5/11/58 -- Thomas Munro http://www.enterprisedb.com
On Mon, Apr 30, 2018 at 11:02 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > MySQL: The default is still buffered Someone pulled me up on this off-list: the default is buffered (fsync) on Unix, but it's unbuffered on Windows. That's quite interesting. https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_flush_method https://mariadb.com/kb/en/library/xtradbinnodb-server-system-variables/#innodb_flush_method -- Thomas Munro http://www.enterprisedb.com
On Sun, Apr 29, 2018 at 1:58 PM, Craig Ringer <craig@2ndquadrant.com> wrote: > On 28 April 2018 at 23:25, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 27 April 2018 at 15:28, Andres Freund <andres@anarazel.de> wrote: >>> While I'm a bit concerned adding user-code before a checkpoint, if >>> we'd do it as a shell command it seems pretty reasonable. And useful >>> even without concern for the fsync issue itself. Checking for IO >>> errors could e.g. also include checking for read errors - it'd not be >>> unreasonable to not want to complete a checkpoint if there'd been any >>> media errors. >> >> It seems clear that we need to evaluate our compatibility not just >> with an OS, as we do now, but with an OS/filesystem. >> >> Although people have suggested some approaches, I'm more interested in >> discovering how we can be certain we got it right. > > TBH, we can't be certain, because there are too many failure modes, > some of which we can't really simulate in practical ways, or automated > ways. +1 Testing is good, but unless you have a categorical statement from the relevant documentation or kernel team or you have the source code, I'm not sure how you can ever really be sure about this. I think we have a fair idea now what several open kernels do, but we still haven't got a clue about Windows, AIX, HPUX and Solaris and we only have half the answer for Illumos, and no "negative" test result can prove that they can't throw away write-back errors or data. Considering the variety in interpretation and liberties taken, I wonder if fsync() is underspecified and someone should file an issue over at http://www.opengroup.org/austin/ about that. -- Thomas Munro http://www.enterprisedb.com
On 30 April 2018 at 09:09, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Considering the variety in interpretation and liberties taken, I > wonder if fsync() is underspecified and someone should file an issue > over at http://www.opengroup.org/austin/ about that. All it's going to achieve is adding an "is implementation-defined" caveat, but that's at least a bit of a heads-up. I filed patches for Linux man-pages ages ago. I'll update them and post to LKML; apparently bugzilla has a lot of spam and many people ignore notifications, so they might just bitrot forever otherwise. Meanwhile, do we know if, on Linux 4.13+, if we get a buffered write error due to dirty writeback before we close() a file we don't fsync(), we'll get the error on close()? -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2018-04-30 10:14:23 +0800, Craig Ringer wrote: > Meanwhile, do we know if, on Linux 4.13+, if we get a buffered write > error due to dirty writeback before we close() a file we don't > fsync(), we'll get the error on close()? Not quite sure what you're getting at with "a file we don't fsync" - if we don't, we don't care about durability anyway, no? Or do you mean where we fsync in a different process? Either way, the answer is mostly no: On NFS et al where close() implies an fsync you'll get the error at that time, otherwise you'll get it at the next fsync(). Greetings, Andres Freund
> Not quite sure what you're getting at with "a file we don't fsync" - if > we don't, we don't care about durability anyway, no? Or do you mean > where we fsync in a different process? Right. > Either way, the answer is mostly no: On NFS et al where close() implies > an fsync you'll get the error at that time, otherwise you'll get it at > the next fsync(). Thanks. The reason I ask is that if we got notified of already-detected writeback errors (on 4.13+) on close() too, it'd narrow the window a little for problems, since normal backends could PANIC if close() of a persistent file raised EIO. Otherwise we're less likely to see the error, since the checkpointer won't see it - it happened before the checkpointer open()ed the file. It'd still be no help for dirty writeback that happens after we close() in a user backend / the bgwriter and before we re-open(), but it'd be nice if the kernel would tell us on close() if it knows of a writeback error. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hrm, something else that just came up. On 9.6+ we use sync_file_range. It's surely going to eat errors:

    rc = sync_file_range(fd, offset, nbytes,
                         SYNC_FILE_RANGE_WRITE);

    /* don't error out, this is just a performance optimization */
    if (rc != 0)
    {
        ereport(WARNING,
                (errcode_for_file_access(),
                 errmsg("could not flush dirty data: %m")));
    }

so that has to panic too. I'm very suspicious about the safety of the msync() path too. I'll post an update to my PANIC-everywhere patch that adds these cases.
On 2018-04-30 13:03:24 +0800, Craig Ringer wrote:
> Hrm, something else that just came up. On 9.6+ we use sync_file_range.
> It's surely going to eat errors:
>
> rc = sync_file_range(fd, offset, nbytes,
>                      SYNC_FILE_RANGE_WRITE);
>
> /* don't error out, this is just a performance optimization */
> if (rc != 0)
> {
>     ereport(WARNING,
>             (errcode_for_file_access(),
>              errmsg("could not flush dirty data: %m")));
> }

It's not. Only SYNC_FILE_RANGE_WAIT_{BEFORE,AFTER} eat errors. Which seems sensible, because they could be considered data integrity operations.

fs/sync.c:

    SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes,
                    unsigned int, flags)
    {
    ...
        if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) {
            ret = file_fdatawait_range(f.file, offset, endbyte);
            if (ret < 0)
                goto out_put;
        }

        if (flags & SYNC_FILE_RANGE_WRITE) {
            ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
                                             WB_SYNC_NONE);
            if (ret < 0)
                goto out_put;
        }

        if (flags & SYNC_FILE_RANGE_WAIT_AFTER)
            ret = file_fdatawait_range(f.file, offset, endbyte);

where

    int file_fdatawait_range(struct file *file, loff_t start_byte, loff_t end_byte)
    {
        struct address_space *mapping = file->f_mapping;

        __filemap_fdatawait_range(mapping, start_byte, end_byte);
        return file_check_and_advance_wb_err(file);
    }
    EXPORT_SYMBOL(file_fdatawait_range);

the critical call is file_check_and_advance_wb_err(). That's the call that checks and clears errors. Would be good to add a kernel xfstest (gets used on all relevant FS these days), to make sure that doesn't change.

> I'm very suspicious about the safety of the msync() path too.

That seems justified however:

    SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
    {
    ...
        error = vfs_fsync_range(file, fstart, fend, 1);
    ..
     */
    int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
    {
    ...
        return file->f_op->fsync(file, start, end, datasync);
    }
    EXPORT_SYMBOL(vfs_fsync_range);

    int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
    {
    ...
        ret = file_write_and_wait_range(file, start, end);
        if (ret)
            return ret;
    ...

    STATIC int
    xfs_file_fsync(
        struct file     *file,
        loff_t          start,
        loff_t          end,
        int             datasync)
    {
    ...
        error = file_write_and_wait_range(file, start, end);
        if (error)
            return error;

    int file_write_and_wait_range(struct file *file, loff_t lstart, loff_t lend)
    {
    ...
        err2 = file_check_and_advance_wb_err(file);
        if (!err)
            err = err2;
        return err;
    }

Greetings, Andres Freund
On 1 May 2018 at 00:09, Andres Freund <andres@anarazel.de> wrote: > It's not. Only SYNC_FILE_RANGE_WAIT_{BEFORE,AFTER} eat errors. Which > seems sensible, because they could be considered data integrity > operations. Ah, I misread that. Thank you. >> I'm very suspicious about the safety of the msync() path too. > > That seems justified however: I'll add EIO tests there. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Apr 28, 2018 at 12:28 AM, Andres Freund <andres@anarazel.de> wrote: > Before linux v4.13 errors in kernel writeback would be reported at most > once, without a guarantee that that'd happen (IIUC memory pressure could > lead to the relevant information being evicted) - but it was pretty > likely. After v4.13 (see https://lwn.net/Articles/724307/) errors are > reported exactly once to all open file descriptors for a file with an > error - but never for files that have been opened after the error > occurred. snip > == Proposed Linux Changes == > > - Matthew Wilcox proposed (and posted a patch) that'd partially revert > behaviour to the pre v4.13 world, by *also* reporting errors to > "newer" file-descriptors if the error hasn't previously been > reported. That'd still not guarantee that the error is reported > (memory pressure could evict information without open fd), but in most > situations we'll again get the error in the checkpointer. > > This seems largely be agreed upon. It's unclear whether it'll go into > the stable backports for still-maintained >= v4.13 kernels. This is now merged, if it's not reverted it will appear in v4.17. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fff75eb2a08c2ac96404a2d79685668f3cf5a7a3 The commit is cc-ed to stable so it should get picked up in the near future. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b4678df184b314a2bd47d2329feca2c2534aa12b
Hi, On 2018-04-27 15:28:42 -0700, Andres Freund wrote: > I went to LSF/MM 2018 to discuss [0] and related issues. Overall I'd say > it was a very productive discussion. I'll first try to recap the > current situation, updated with knowledge I gained. Secondly I'll try to > discuss the kernel changes that seem to have been agreed upon. Thirdly > I'll try to sum up what postgres needs to change. LWN summarized the discussion as well: https://lwn.net/SubscriberLink/752952/6825e6a1ddcfb1f3/ - Andres
On 2018-05-01 09:38:03 +0800, Craig Ringer wrote: > On 1 May 2018 at 00:09, Andres Freund <andres@anarazel.de> wrote: > > > It's not. Only SYNC_FILE_RANGE_WAIT_{BEFORE,AFTER} eat errors. Which > > seems sensible, because they could be considered data integrity > > operations. > > Ah, I misread that. Thankyou. > > >> I'm very suspicious about the safety of the msync() path too. > > > > That seems justified however: > > I'll add EIO tests there. Do you have a patchset including that? I didn't find anything after a quick search... Greetings, Andres Freund
On 10 May 2018 at 06:55, Andres Freund <andres@anarazel.de> wrote: > Do you have a patchset including that? I didn't find anything after a > quick search... There was an earlier rev on the other thread but without msync checks. I've added panic for msync in the attached, and tidied the comments a bit. I didn't add comments on why we panic to each individual pg_fsync or FileSync caller that panics; instead pg_fsync points to pg_fsync_no_writethrough which explains it briefly and has links. I looked at callers of pg_fsync, pg_fsync_no_writethrough, pg_fsync_writethrough, mdsync, and FileSync when writing this. WAL writing already PANIC'd on fsync failure, which helps, though we now know that's not sufficient for complete safety. Patch on top of v11 HEAD @ ddc1f32ee507 -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi, On 2018-05-10 09:50:03 +0800, Craig Ringer wrote: > while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL) > { > if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0) > - ereport(ERROR, > + ereport(PANIC, > (errcode_for_file_access(), > errmsg("could not fsync file \"%s\": %m", src->path))); To me this (and the other callers) doesn't quite look right. First, I think we should probably be a bit more restrictive about when we PANIC out. It seems like we should PANIC on ENOSPC and EIO, but possibly not others. Secondly, I think we should centralize the error handling. It seems likely that we'll accrue some platform specific workarounds, and I don't want to copy that knowledge everywhere. Also, don't we need the same on close()? - Andres
On Thu, May 17, 2018 at 12:44 PM, Andres Freund <andres@anarazel.de> wrote: > Hi, > > On 2018-05-10 09:50:03 +0800, Craig Ringer wrote: >> while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL) >> { >> if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0) >> - ereport(ERROR, >> + ereport(PANIC, >> (errcode_for_file_access(), >> errmsg("could not fsync file \"%s\": %m", src->path))); > > To me this (and the other callers) doesn't quite look right. First, I > think we should probably be a bit more restrictive about when PANIC > out. It seems like we should PANIC on ENOSPC and EIO, but possibly not > others. Secondly, I think we should centralize the error handling. It > seems likely that we'll acrue some platform specific workarounds, and I > don't want to copy that knowledge everywhere. Maybe something like: ereport(promote_eio_to_panic(ERROR), ...) ? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, May 17, 2018 at 11:45 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, May 17, 2018 at 12:44 PM, Andres Freund <andres@anarazel.de> wrote: >> Hi, >> >> On 2018-05-10 09:50:03 +0800, Craig Ringer wrote: >>> while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL) >>> { >>> if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0) >>> - ereport(ERROR, >>> + ereport(PANIC, >>> (errcode_for_file_access(), >>> errmsg("could not fsync file \"%s\": %m", src->path))); >> >> To me this (and the other callers) doesn't quite look right. First, I >> think we should probably be a bit more restrictive about when PANIC >> out. It seems like we should PANIC on ENOSPC and EIO, but possibly not >> others. Secondly, I think we should centralize the error handling. It >> seems likely that we'll acrue some platform specific workarounds, and I >> don't want to copy that knowledge everywhere. > > Maybe something like: > > ereport(promote_eio_to_panic(ERROR), ...) Well, searching for places where an error is reported at level PANIC, using the word PANIC, would miss these instances. People will have to remember to search with promote_eio_to_panic. Maybe handle the errors inside FileSync() itself or a wrapper around that. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
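For illustration, the kind of helper being discussed might look roughly like this. The name follows Robert's suggestion, but the function itself and its errno policy are hypothetical and not settled code; it assumes the usual PostgreSQL elog/ereport machinery is in scope:

    /*
     * Hypothetical sketch only: escalate to PANIC for errors the kernel may
     * never report again on a retried fsync(); leave everything else at the
     * caller's level.  Which errnos belong here is exactly the open question.
     */
    static int
    promote_eio_to_panic(int elevel)
    {
        if (errno == EIO || errno == ENOSPC)
            return PANIC;
        return elevel;
    }

A caller would then look like the hunk quoted above, e.g.:

    if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
        ereport(promote_eio_to_panic(ERROR),
                (errcode_for_file_access(),
                 errmsg("could not fsync file \"%s\": %m", src->path)));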
Hi, On 2018-04-27 15:28:42 -0700, Andres Freund wrote: > == Potential Postgres Changes == > > Several operating systems / file systems behave differently (See > e.g. [2], thanks Thomas) than we expected. Even the discussed changes to > e.g. linux don't get to where we thought we are. There's obviously also > the question of how to deal with kernels / OSs that have not been > updated. > > Changes that appear to be necessary, even for kernels with the issues > addressed: > > - Clearly we need to treat fsync() EIO, ENOSPC errors as a PANIC and > retry recovery. While ENODEV (underlying device went away) will be > persistent, it probably makes sense to treat it the same or even just > give up and shut down. One question I see here is whether we just > want to continue crash-recovery cycles, or whether we want to limit > that. Craig has a patch for this, although I'm not yet 100% happy with it. > - We need more aggressive error checking on close(), for ENOSPC and > EIO. In both cases afaics we'll have to trigger a crash recovery > cycle. It's entirely possible to end up in a loop on NFS etc, but I > don't think there's a way around that. This needs to be handled. > - The outstanding fsync request queue isn't persisted properly [3]. This > means that even if the kernel behaved the way we'd expected, we'd not > fail a second checkpoint :(. It's possible that we don't need to deal > with this because we'll henceforth PANIC, but I'd argue we should fix > that regardless. Seems like a time-bomb otherwise (e.g. after moving > to DIO somebody might want to relax the PANIC...). > What we could do: > > - forward file descriptors from backends to checkpointer (using > SCM_RIGHTS) when marking a segment dirty. That'd require some > optimizations (see [4]) to avoid doing so repeatedly. That'd > guarantee correct behaviour in all linux kernels >= 4.13 (possibly > backported by distributions?), and I think it'd also make it vastly > more likely that errors are reported in earlier kernels. > > This should be doable without a noticeable performance impact, I > believe. I don't think it'd be that hard either, but it'd be a bit of > a pain to backport it to all postgres versions, as well as a bit > invasive for that. > > The infrastructure this'd likely end up building (hashtable of open > relfilenodes), would likely be useful for further things (like caching > file size). I've written a patch series for this. Took me quite a bit longer than I had hoped. The attached patchseries consists out of a few preparatory patches: - freespace optimization to not call smgrexists() unnecessarily - register_dirty_segment() optimization to not queue requests for segments that locally are known to already have been dirtied. This seems like a good optimization regardless of further changes. Doesn't yet deal with the mdsync counter wrapping around (which is unlikely to ever happen in practice, it's 32bit). - some fd.c changes, I don't think they're quite right yet - new functions to send/recv data over a unix domain socket, *including* a file descriptor. The main patch guarantees that fsync requests are forwarded from backends to the checkpointer, including the file descriptor. As we do so immediately at mdwrite() time, that guarantees that the fd has been open from before the write started, therefore linux will guarantee that that FD will see errors. The design of the patch went through a few iterations. 
I initially attempted to make the fsync request hashtable shared, but that turned out to be a lot harder to do reliably *and* fast than I was anticipating (we'd need to hold a lock for the entirety of mdsync(), dynahash doesn't allow iteration while other backends modify). So what I instead did was to replace the shared memory fsync request queue with a unix domain socket (created in postmaster, using socketpair()). CheckpointerRequest structs are written to that queue, including the associated file descriptor. The checkpointer absorbs those requests, and updates the local pending requests hashtable in local process memory. To ensure that mdsync() has read all requests from the last cycle, checkpointer self-enqueues a token, which allows it to detect the end of the relevant portion of the queue. The biggest complication in all this scheme is that checkpointer now needs to keep a file descriptor open for every segment. That obviously requires adding a few new fields to the hashtable entry. But the bigger issue is that it's now possible that pending requests need to be processed earlier than the next checkpoint, because of file descriptor limits. To address that, absorbing the fsync request queue will now do an mdsync()-style pass, doing the necessary fsyncs. Because mdsync() (or rather its new workhorse mdsyncpass()) will now not open files itself, there's no need to deal with retries for files that have been deleted. For the cases where we didn't yet receive a fsync cancel request, we'll just fsync the fd. That's unnecessary, but harmless. Obviously this is currently heavily unix specific (according to my research, all our unix platforms say that they support sending fds across unix domain sockets w/ SCM_RIGHTS). It's unclear whether any OS but linux benefits from not closing file descriptors before fsync(). We could make this work for windows, without *too* much trouble (one can just open fds in another process, using that process' handle). I think there's some advantage in using the same approach everywhere. For one, not maintaining two radically different approaches for complicated code. It'd also allow us to offload more fsyncs to checkpointer, not just the ones for normal relation files, which does seem advantageous. Not having ugly retry logic around deleted files in mdsync() also seems nice. But there's cases where this is likely slower, due to the potential of having to wait for checkpointer when the queue is full. I'll note that I think the new mdsync() is considerably simpler. Even if we do not decide to use an approach as presented here, I think we should make some of those changes. Specifically not unlinking the pending requests bitmap in mdsync() seems like it both resolves an existing bug (see upthread) and makes the code simpler. I plan to switch to working on something else for a day or two next week, and then polish this further. I'd greatly appreciate comments till then. I didn't want to do this now, but I think we should also consider removing all awareness of segments from the fsync request queue. Instead it should deal with individual files, and the segmentation should be handled by md.c. That'll allow us to move all the necessary code to smgr.c (or checkpointer?); Thomas said that'd be helpful for further work. I personally think it'd be a lot simpler, because having to have long bitmaps with only the last bit set for large append-only relations isn't a particularly sensible approach imo.
The only thing that that'd make more complicated is that the file/database unlink requests get more expensive (as they'd likely need to search the whole table), but that seems like a sensible tradeoff. Using a tree structure would obviously be an alternative. Personally I was thinking that we should just make the hashtable be over a pathname, that seems most generic. Greetings, Andres Freund
Attachment
- v1-0001-freespace-Don-t-constantly-close-files-when-readi.patch
- v1-0002-Add-functions-to-send-receive-data-FD-over-a-unix.patch
- v1-0003-Make-FileGetRawDesc-ensure-there-s-an-associated-.patch
- v1-0004-WIP-Allow-to-create-a-transient-file-for-a-previo.patch
- v1-0005-WIP-Allow-more-transient-files-and-allow-to-query.patch
- v1-0006-WIP-Optimize-register_dirty_segment-to-not-repeat.patch
- v1-0007-Heavily-WIP-Send-file-descriptors-to-checkpointer.patch
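The core mechanism the send/receive-FD functions above rely on - shipping a file descriptor along with a request over a Unix domain socket via SCM_RIGHTS - can be sketched in standalone POSIX C roughly as follows. This is illustrative only, not the pg_uds_send_with_fd code in the attached patches:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send len bytes of data plus (optionally) one file descriptor. */
    static int
    send_with_fd(int sock, const void *data, size_t len, int fd)
    {
        struct msghdr msg = {0};
        struct iovec  iov;
        union
        {
            struct cmsghdr hdr;
            char           buf[CMSG_SPACE(sizeof(int))];
        }             u;

        iov.iov_base = (void *) data;
        iov.iov_len = len;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;

        if (fd >= 0)
        {
            struct cmsghdr *cmsg;

            msg.msg_control = u.buf;
            msg.msg_controllen = sizeof(u.buf);
            cmsg = CMSG_FIRSTHDR(&msg);
            cmsg->cmsg_level = SOL_SOCKET;
            cmsg->cmsg_type = SCM_RIGHTS;
            cmsg->cmsg_len = CMSG_LEN(sizeof(int));
            memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
        }

        return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
    }

    /* Receive up to len bytes; *fd is set to the passed descriptor or -1. */
    static int
    recv_with_fd(int sock, void *data, size_t len, int *fd)
    {
        struct msghdr   msg = {0};
        struct iovec    iov;
        struct cmsghdr *cmsg;
        union
        {
            struct cmsghdr hdr;
            char           buf[CMSG_SPACE(sizeof(int))];
        }               u;

        iov.iov_base = data;
        iov.iov_len = len;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = u.buf;
        msg.msg_controllen = sizeof(u.buf);

        if (recvmsg(sock, &msg, 0) < 0)
            return -1;

        *fd = -1;
        cmsg = CMSG_FIRSTHDR(&msg);
        if (cmsg != NULL &&
            cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_RIGHTS)
            memcpy(fd, CMSG_DATA(cmsg), sizeof(int));

        return 0;
    }

The receiving end keeps the underlying struct file alive even if the sender closes its copy immediately after sendmsg(), which is the property the patch series depends on for writeback-error reporting.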
Greetings, * Ashutosh Bapat (ashutosh.bapat@enterprisedb.com) wrote: > On Thu, May 17, 2018 at 11:45 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, May 17, 2018 at 12:44 PM, Andres Freund <andres@anarazel.de> wrote: > >> Hi, > >> > >> On 2018-05-10 09:50:03 +0800, Craig Ringer wrote: > >>> while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL) > >>> { > >>> if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0) > >>> - ereport(ERROR, > >>> + ereport(PANIC, > >>> (errcode_for_file_access(), > >>> errmsg("could not fsync file \"%s\": %m", src->path))); > >> > >> To me this (and the other callers) doesn't quite look right. First, I > >> think we should probably be a bit more restrictive about when PANIC > >> out. It seems like we should PANIC on ENOSPC and EIO, but possibly not > >> others. Secondly, I think we should centralize the error handling. It > >> seems likely that we'll acrue some platform specific workarounds, and I > >> don't want to copy that knowledge everywhere. > > > > Maybe something like: > > > > ereport(promote_eio_to_panic(ERROR), ...) > > Well, searching for places where error is reported with level PANIC, > using word PANIC would miss these instances. People will have to > remember to search with promote_eio_to_panic. May be handle the errors > inside FileSync() itself or a wrapper around that. No, that search wouldn't miss those instances- such a search would find promote_eio_to_panic() and then someone would go look up the uses of that function. That hardly seems like a serious issue for folks hacking on PG. I'm not saying that having a wrapper around FileSync() would be bad or having it handle things, but I don't agree with the general notion that we can't have a function which handles the complicated bits about the kind of error because someone grep'ing the source for PANIC might have to do an additional lookup. Thanks! Stephen
Attachment
At 2018-05-18 20:27:57 -0400, sfrost@snowman.net wrote: > > I don't agree with the general notion that we can't have a function > which handles the complicated bits about the kind of error because > someone grep'ing the source for PANIC might have to do an additional > lookup. Or we could just name the function promote_eio_to_PANIC. (I understood the objection to be about how 'grep PANIC' wouldn't find these lines at all, not that there would be an additional lookup.) -- Abhijit
Greetings, * Abhijit Menon-Sen (ams@2ndQuadrant.com) wrote: > At 2018-05-18 20:27:57 -0400, sfrost@snowman.net wrote: > > > > I don't agree with the general notion that we can't have a function > > which handles the complicated bits about the kind of error because > > someone grep'ing the source for PANIC might have to do an additional > > lookup. > > Or we could just name the function promote_eio_to_PANIC. Ugh, I'm not thrilled with that either. > (I understood the objection to be about how 'grep PANIC' wouldn't find > these lines at all, not that there would be an additional lookup.) ... and my point was that 'grep PANIC' would, almost certainly, find the function promote_eio_to_panic(), and someone could trivially look up all the callers of that function then. Thanks! Stephen
Attachment
On Sat, May 19, 2018 at 9:03 AM, Andres Freund <andres@anarazel.de> wrote: > I've written a patch series for this. Took me quite a bit longer than I > had hoped. Great. > I plan to switch to working on something else for a day or two next > week, and then polish this further. I'd greatly appreciate comments till > then. Took it for a spin on macOS and FreeBSD. First problem: + if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, fsync_fds) < 0) SOCK_CLOEXEC isn't portable (FreeBSD yes since 10, macOS no, others who knows). Adding FD_CLOEXEC to your later fcntl() calls is probably the way to do it? I understand from reading the Linux man pages that there are race conditions with threads but that doesn't apply here. Next, make check hangs in initdb on both of my pet OSes when md.c raises an error (fseek fails) and we raise an error while raising an error and deadlock against ourselves. Backtrace here: https://paste.debian.net/1025336/ Apparently the initial error was that mdextend() called _mdnblocks() which called FileSeek() on vfd 43 == fd 30, pathname "base/1/2704", but when I check my operating system open file descriptor table I find that there is no fd 30: there is a 29 and a 31, so it has already been unexpectedly closed. I could dig further and/or provide a shell on a system with dev tools. > I didn't want to do this now, but I think we should also consider > removing all awareness of segments from the fsync request queue. Instead > it should deal with individual files, and the segmentation should be > handled by md.c. That'll allow us to move all the necessary code to > smgr.c (or checkpointer?); Thomas said that'd be helpful for further > work. I personally think it'd be a lot simpler, because having to have > long bitmaps with only the last bit set for large append only relations > isn't a particularly sensible approach imo. The only thing that that'd > make more complicated is that the file/database unlink requests get more > expensive (as they'd likely need to search the whole table), but that > seems like a sensible tradeoff. Alternatively using a tree structure > would be an alternative obviously. Personally I was thinking that we > should just make the hashtable be over a pathname, that seems most > generic. +1 I'll be posting a patch shortly that also needs similar machinery, but can't easily share with md.c due to technical details. I'd love there to be just one of those, and for it to be simpler and general. -- Thomas Munro http://www.enterprisedb.com
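The portable variant Thomas suggests would look roughly like this - plain socketpair() plus fcntl(FD_CLOEXEC). Illustrative only; fsync_fds just mirrors the name in the quoted hunk:

    #include <fcntl.h>
    #include <sys/socket.h>

    static int fsync_fds[2];

    static int
    create_fsync_socketpair(void)
    {
        int i;

        /* No SOCK_CLOEXEC: not available on macOS and some older BSDs. */
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, fsync_fds) < 0)
            return -1;

        /* Set the close-on-exec flag afterwards instead. */
        for (i = 0; i < 2; i++)
        {
            int flags = fcntl(fsync_fds[i], F_GETFD);

            if (flags < 0 ||
                fcntl(fsync_fds[i], F_SETFD, flags | FD_CLOEXEC) < 0)
                return -1;
        }
        return 0;
    }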
On Sat, May 19, 2018 at 4:51 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Next, make check hangs in initdb on both of my pet OSes when md.c > raises an error (fseek fails) and we raise and error while raising and > error and deadlock against ourselves. Backtrace here: > https://paste.debian.net/1025336/ Ah, I see now that something similar is happening on Linux too, so I guess you already knew this. https://travis-ci.org/postgresql-cfbot/postgresql/builds/380913223 -- Thomas Munro http://www.enterprisedb.com
On Sat, May 19, 2018 at 6:31 AM, Stephen Frost <sfrost@snowman.net> wrote: > Greetings, > > * Abhijit Menon-Sen (ams@2ndQuadrant.com) wrote: >> At 2018-05-18 20:27:57 -0400, sfrost@snowman.net wrote: >> > >> > I don't agree with the general notion that we can't have a function >> > which handles the complicated bits about the kind of error because >> > someone grep'ing the source for PANIC might have to do an additional >> > lookup. >> >> Or we could just name the function promote_eio_to_PANIC. > > Ugh, I'm not thrilled with that either. > >> (I understood the objection to be about how 'grep PANIC' wouldn't find >> these lines at all, not that there would be an additional lookup.) > > ... and my point was that 'grep PANIC' would, almost certainly, find the > function promote_eio_to_panic(), and someone could trivially look up all > the callers of that function then. It's not just grep, but tools like cscope and tags. Although I agree that adding a function, if at all necessary, is more important than the convenience of easily finding all the instances of a certain token. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 18 May 2018 at 00:44, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2018-05-10 09:50:03 +0800, Craig Ringer wrote:
> > while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
> > {
> > if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
> > - ereport(ERROR,
> > + ereport(PANIC,
> > (errcode_for_file_access(),
> > errmsg("could not fsync file \"%s\": %m", src->path)));
>
> To me this (and the other callers) doesn't quite look right. First, I
> think we should probably be a bit more restrictive about when PANIC
> out. It seems like we should PANIC on ENOSPC and EIO, but possibly not
> others. Secondly, I think we should centralize the error handling. It
> seems likely that we'll acrue some platform specific workarounds, and I
> don't want to copy that knowledge everywhere.
>
> Also, don't we need the same on close()?
Yes, we do, and that expands the scope a bit.
I agree with Robert that some sort of filter/macro is wise, though naming it clearly will be tricky.
I'll have a look.
On 21 May 2018 at 12:57, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 18 May 2018 at 00:44, Andres Freund <andres@anarazel.de> wrote:
>> Hi,
>>
>> On 2018-05-10 09:50:03 +0800, Craig Ringer wrote:
>> > while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
>> > {
>> > if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
>> > - ereport(ERROR,
>> > + ereport(PANIC,
>> > (errcode_for_file_access(),
>> > errmsg("could not fsync file \"%s\": %m", src->path)));
>>
>> To me this (and the other callers) doesn't quite look right. First, I
>> think we should probably be a bit more restrictive about when PANIC
>> out. It seems like we should PANIC on ENOSPC and EIO, but possibly not
>> others. Secondly, I think we should centralize the error handling. It
>> seems likely that we'll acrue some platform specific workarounds, and I
>> don't want to copy that knowledge everywhere.
>>
>> Also, don't we need the same on close()?
>
> Yes, we do, and that expands the scope a bit.
>
> I agree with Robert that some sort of filter/macro is wise, though
> naming it clearly will be tricky.
>
> I'll have a look.
On the queue for tomorrow.
On 2018-05-19 18:12:52 +1200, Thomas Munro wrote: > On Sat, May 19, 2018 at 4:51 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > Next, make check hangs in initdb on both of my pet OSes when md.c > > raises an error (fseek fails) and we raise and error while raising and > > error and deadlock against ourselves. Backtrace here: > > https://paste.debian.net/1025336/ > > Ah, I see now that something similar is happening on Linux too, so I > guess you already knew this. I didn't. I cleaned something up and only tested installcheck after... Singleuser mode was broken. Attached is a new version. I've changed my previous attempt at using transient files to using File type files, but unlinked from the LRU so that they're kept open. Not sure if that's perfect, but seems cleaner. Greetings, Andres Freund
Attachment
- v2-0001-freespace-Don-t-constantly-close-files-when-readi.patch
- v2-0002-Add-functions-to-send-receive-data-FD-over-a-unix.patch
- v2-0003-Make-FileGetRawDesc-ensure-there-s-an-associated-.patch
- v2-0004-WIP-Add-FileOpenForFd.patch
- v2-0005-WIP-Optimize-register_dirty_segment-to-not-repeat.patch
- v2-0006-Heavily-WIP-Send-file-descriptors-to-checkpointer.patch
> On 22 May 2018 at 03:08, Andres Freund <andres@anarazel.de> wrote: > On 2018-05-19 18:12:52 +1200, Thomas Munro wrote: >> On Sat, May 19, 2018 at 4:51 PM, Thomas Munro >> <thomas.munro@enterprisedb.com> wrote: >> > Next, make check hangs in initdb on both of my pet OSes when md.c >> > raises an error (fseek fails) and we raise and error while raising and >> > error and deadlock against ourselves. Backtrace here: >> > https://paste.debian.net/1025336/ >> >> Ah, I see now that something similar is happening on Linux too, so I >> guess you already knew this. > > I didn't. I cleaned something up and only tested installcheck > after... Singleuser mode was broken. > > Attached is a new version. > > I've changed my previous attempt at using transient files to using File > type files, but unliked from the LRU so that they're kept open. Not sure > if that's perfect, but seems cleaner. Thanks for the patch. Out of curiosity I tried to play with it a bit. `pgbench -i -s 100` actually hang on my machine, because the copy process ended up with waiting after `pg_uds_send_with_fd` had errno == EWOULDBLOCK || errno == EAGAIN as well as the checkpointer process. Looks like with the default configuration and `max_wal_size=1GB` it writes more than reads to a socket, and a buffer eventually becomes full. I've increased SO_RCVBUF/SO_SNDBUF and `max_wal_size` independently to check it, and in both cases the problem disappeared (but I assume only for this particular scale). Is it something that was already considered?
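For anyone wanting to reproduce that experiment, the buffer sizes can be inspected and raised with plain getsockopt()/setsockopt(); as noted just below, though, this only hides the symptom rather than fixing anything:

    #include <stdio.h>
    #include <sys/socket.h>

    static void
    bump_socket_buffers(int fd, int bytes)
    {
        int       cur;
        socklen_t len = sizeof(cur);

        if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &cur, &len) == 0)
            printf("SO_SNDBUF was %d\n", cur);

        /* The kernel may clamp (or, on Linux, double) the requested value. */
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes));
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
    }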
Hi, On 2018-05-22 17:37:28 +0200, Dmitry Dolgov wrote: > Thanks for the patch. Out of curiosity I tried to play with it a bit. Thanks. > `pgbench -i -s 100` actually hang on my machine, because the > copy process ended up with waiting after `pg_uds_send_with_fd` > had Hm, that had worked at some point... > errno == EWOULDBLOCK || errno == EAGAIN > > as well as the checkpointer process. What do you mean with that latest sentence? > Looks like with the default > configuration and `max_wal_size=1GB` it writes more than reads to a > socket, and a buffer eventually becomes full. That's intended to then wake up the checkpointer immediately, so it can absorb the requests. So something isn't right yet. > I've increased SO_RCVBUF/SO_SNDBUF and `max_wal_size` independently to > check it, and in both cases the problem disappeared (but I assume only > for this particular scale). Is it something that was already > considered? It's considered. Tuning up those might help with performance, but shouldn't required from a correctness POV. Hm. Greetings, Andres Freund
On 2018-05-22 08:57:18 -0700, Andres Freund wrote: > Hi, > > > On 2018-05-22 17:37:28 +0200, Dmitry Dolgov wrote: > > Thanks for the patch. Out of curiosity I tried to play with it a bit. > > Thanks. > > > > `pgbench -i -s 100` actually hang on my machine, because the > > copy process ended up with waiting after `pg_uds_send_with_fd` > > had > > Hm, that had worked at some point... > > > > errno == EWOULDBLOCK || errno == EAGAIN > > > > as well as the checkpointer process. > > What do you mean with that latest sentence? > > > > Looks like with the default > > configuration and `max_wal_size=1GB` it writes more than reads to a > > socket, and a buffer eventually becomes full. > > That's intended to then wake up the checkpointer immediately, so it can > absorb the requests. So something isn't right yet. Doesn't hang here, but it's way too slow. Reason for that is that I've wrongly resolved a merge conflict. Attached is a fixup patch - does that address the issue for you? Greetings, Andres Freund
Attachment
> On 22 May 2018 at 18:47, Andres Freund <andres@anarazel.de> wrote: > On 2018-05-22 08:57:18 -0700, Andres Freund wrote: >> Hi, >> >> >> On 2018-05-22 17:37:28 +0200, Dmitry Dolgov wrote: >> > Thanks for the patch. Out of curiosity I tried to play with it a bit. >> >> Thanks. >> >> >> > `pgbench -i -s 100` actually hang on my machine, because the >> > copy process ended up with waiting after `pg_uds_send_with_fd` >> > had >> >> Hm, that had worked at some point... >> >> >> > errno == EWOULDBLOCK || errno == EAGAIN >> > >> > as well as the checkpointer process. >> >> What do you mean with that latest sentence? To investigate what's happening I attached with gdb to two processes, COPY process from pgbench and checkpointer (since I assumed it may be involved). Both were waiting in WaitLatchOrSocket right after SendFsyncRequest. >> > Looks like with the default >> > configuration and `max_wal_size=1GB` it writes more than reads to a >> > socket, and a buffer eventually becomes full. >> >> That's intended to then wake up the checkpointer immediately, so it can >> absorb the requests. So something isn't right yet. > > Doesn't hang here, but it's way too slow. Yep, in my case it was also getting slower, but eventually hang. > Reason for that is that I've wrongly resolved a merge conflict. Attached is a > fixup patch - does that address the issue for you? Hm...is it a correct patch? I see the same committed in 8c3debbbf61892dabd8b6f3f8d55e600a7901f2b, so I can't really apply it.
On 2018-05-22 20:54:46 +0200, Dmitry Dolgov wrote: > > On 22 May 2018 at 18:47, Andres Freund <andres@anarazel.de> wrote: > > On 2018-05-22 08:57:18 -0700, Andres Freund wrote: > >> Hi, > >> > >> > >> On 2018-05-22 17:37:28 +0200, Dmitry Dolgov wrote: > >> > Thanks for the patch. Out of curiosity I tried to play with it a bit. > >> > >> Thanks. > >> > >> > >> > `pgbench -i -s 100` actually hang on my machine, because the > >> > copy process ended up with waiting after `pg_uds_send_with_fd` > >> > had > >> > >> Hm, that had worked at some point... > >> > >> > >> > errno == EWOULDBLOCK || errno == EAGAIN > >> > > >> > as well as the checkpointer process. > >> > >> What do you mean with that latest sentence? > > To investigate what's happening I attached with gdb to two processes, COPY > process from pgbench and checkpointer (since I assumed it may be involved). > Both were waiting in WaitLatchOrSocket right after SendFsyncRequest. Huh? Checkpointer was in SendFsyncRequest()? Could you share the backtrace? > >> > Looks like with the default > >> > configuration and `max_wal_size=1GB` it writes more than reads to a > >> > socket, and a buffer eventually becomes full. > >> > >> That's intended to then wake up the checkpointer immediately, so it can > >> absorb the requests. So something isn't right yet. > > > > Doesn't hang here, but it's way too slow. > > Yep, in my case it was also getting slower, but eventually hang. > > > Reason for that is that I've wrongly resolved a merge conflict. Attached is a > > fixup patch - does that address the issue for you? > > Hm...is it a correct patch? I see the same committed in > 8c3debbbf61892dabd8b6f3f8d55e600a7901f2b, so I can't really apply it. Yea, sorry for that. Too many files in my patch directory... Right one attached. Greetings, Andres Freund
Attachment
> On 22 May 2018 at 20:59, Andres Freund <andres@anarazel.de> wrote:
> On 2018-05-22 20:54:46 +0200, Dmitry Dolgov wrote:
>> > On 22 May 2018 at 18:47, Andres Freund <andres@anarazel.de> wrote:
>> > On 2018-05-22 08:57:18 -0700, Andres Freund wrote:
>> >> Hi,
>> >>
>> >> On 2018-05-22 17:37:28 +0200, Dmitry Dolgov wrote:
>> >> > Thanks for the patch. Out of curiosity I tried to play with it a bit.
>> >>
>> >> Thanks.
>> >>
>> >> > `pgbench -i -s 100` actually hang on my machine, because the
>> >> > copy process ended up with waiting after `pg_uds_send_with_fd`
>> >> > had
>> >>
>> >> Hm, that had worked at some point...
>> >>
>> >> > errno == EWOULDBLOCK || errno == EAGAIN
>> >> >
>> >> > as well as the checkpointer process.
>> >>
>> >> What do you mean with that latest sentence?
>>
>> To investigate what's happening I attached with gdb to two processes, COPY
>> process from pgbench and checkpointer (since I assumed it may be involved).
>> Both were waiting in WaitLatchOrSocket right after SendFsyncRequest.
>
> Huh? Checkpointer was in SendFsyncRequest()? Coudl you share the
> backtrace?

Well, that's what I've got from gdb:

#0  0x00007fae03fae9f3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1  0x000000000077a979 in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffe37529ec0, cur_timeout=-1, set=0x23cddf8) at latch.c:1048
#2  WaitEventSetWait (set=set@entry=0x23cddf8, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffe37529ec0, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=0) at latch.c:1000
#3  0x000000000077ad08 in WaitLatchOrSocket (latch=latch@entry=0x0, wakeEvents=wakeEvents@entry=4, sock=8, timeout=timeout@entry=-1, wait_event_info=wait_event_info@entry=0) at latch.c:385
#4  0x00000000007152cb in SendFsyncRequest (request=request@entry=0x7ffe37529f40, fd=fd@entry=-1) at checkpointer.c:1345
#5  0x0000000000716223 in AbsorbAllFsyncRequests () at checkpointer.c:1207
#6  0x000000000079a5f0 in mdsync () at md.c:1339
#7  0x000000000079c672 in smgrsync () at smgr.c:766
#8  0x000000000076dd53 in CheckPointBuffers (flags=flags@entry=64) at bufmgr.c:2581
#9  0x000000000051c681 in CheckPointGuts (checkPointRedo=722254352, flags=flags@entry=64) at xlog.c:9079
#10 0x0000000000523c4a in CreateCheckPoint (flags=flags@entry=64) at xlog.c:8863
#11 0x0000000000715f41 in CheckpointerMain () at checkpointer.c:494
#12 0x00000000005329f4 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffe3752a220) at bootstrap.c:451
#13 0x0000000000720c28 in StartChildProcess (type=type@entry=CheckpointerProcess) at postmaster.c:5340
#14 0x0000000000721c23 in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2875
#15 <signal handler called>
#16 0x00007fae03fa45b3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
#17 0x0000000000722968 in ServerLoop () at postmaster.c:1679
#18 0x0000000000723cde in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x23a00e0) at postmaster.c:1388
#19 0x000000000068979f in main (argc=3, argv=0x23a00e0) at main.c:228

>> >> > Looks like with the default
>> >> > configuration and `max_wal_size=1GB` it writes more than reads to a
>> >> > socket, and a buffer eventually becomes full.
>> >>
>> >> That's intended to then wake up the checkpointer immediately, so it can
>> >> absorb the requests. So something isn't right yet.
>> >
>> > Doesn't hang here, but it's way too slow.
>>
>> Yep, in my case it was also getting slower, but eventually hang.
>> > Reason for that is that I've wrongly resolved a merge conflict. Attached is a
>> > fixup patch - does that address the issue for you?
>>
>> Hm...is it a correct patch? I see the same committed in
>> 8c3debbbf61892dabd8b6f3f8d55e600a7901f2b, so I can't really apply it.
>
> Yea, sorry for that. Too many files in my patch directory... Right one attached.

Yes, this patch solves the problem, thanks.
On 2018-05-22 21:58:06 +0200, Dmitry Dolgov wrote: > > On 22 May 2018 at 20:59, Andres Freund <andres@anarazel.de> wrote: > > On 2018-05-22 20:54:46 +0200, Dmitry Dolgov wrote: > > Huh? Checkpointer was in SendFsyncRequest()? Coudl you share the > > backtrace? > > Well, that's what I've got from gdb: > #3 0x000000000077ad08 in WaitLatchOrSocket > (latch=latch@entry=0x0, wakeEvents=wakeEvents@entry=4, sock=8, > timeout=timeout@entry=-1, wait_event_info=wait_event_info@entry=0) at > latch.c:385 > #4 0x00000000007152cb in SendFsyncRequest > (request=request@entry=0x7ffe37529f40, fd=fd@entry=-1) at > checkpointer.c:1345 > #5 0x0000000000716223 in AbsorbAllFsyncRequests () at checkpointer.c:1207 Oh, I see. That makes sense. So it's possible to self-deadlock here. Should be easy to fix... Thanks for finding that one. > Yes, this patch solves the problem, thanks. Just avoids it, I'm afraid... It probably can't be hit easily, but the issue is there... - Andres
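The hazard generalizes: a process that is both a writer to and the only reader of a socketpair must drain its own read side rather than block waiting for send space. A generic standalone sketch of that pattern follows - it is not the actual checkpointer fix, MSG_DONTWAIT is assumed to be available (it is on Linux and the BSDs), and it discards rather than processes the drained bytes:

    #include <errno.h>
    #include <sys/socket.h>

    static int
    enqueue_with_self_drain(int write_fd, int read_fd,
                            const void *req, size_t len)
    {
        for (;;)
        {
            if (send(write_fd, req, len, MSG_DONTWAIT) >= 0)
                return 0;
            if (errno != EAGAIN && errno != EWOULDBLOCK)
                return -1;

            /*
             * Send buffer is full and we are the only consumer: make room by
             * absorbing queued requests ourselves.  A real implementation
             * would parse and act on them rather than throw them away.
             */
            {
                char buf[1024];

                while (recv(read_fd, buf, sizeof(buf), MSG_DONTWAIT) > 0)
                    ;
            }
        }
    }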
On 21 May 2018 at 15:50, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 21 May 2018 at 12:57, Craig Ringer <craig@2ndquadrant.com> wrote:
>> On 18 May 2018 at 00:44, Andres Freund <andres@anarazel.de> wrote:
>>> Hi,
>>>
>>> On 2018-05-10 09:50:03 +0800, Craig Ringer wrote:
>>> > while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
>>> > {
>>> > if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
>>> > - ereport(ERROR,
>>> > + ereport(PANIC,
>>> > (errcode_for_file_access(),
>>> > errmsg("could not fsync file \"%s\": %m", src->path)));
>>>
>>> To me this (and the other callers) doesn't quite look right. First, I
>>> think we should probably be a bit more restrictive about when PANIC
>>> out. It seems like we should PANIC on ENOSPC and EIO, but possibly not
>>> others. Secondly, I think we should centralize the error handling. It
>>> seems likely that we'll acrue some platform specific workarounds, and I
>>> don't want to copy that knowledge everywhere.
>>>
>>> Also, don't we need the same on close()?
>>
>> Yes, we do, and that expands the scope a bit.
>>
>> I agree with Robert that some sort of filter/macro is wise, though
>> naming it clearly will be tricky.
>>
>> I'll have a look.
>
> On the queue for tomorrow.
Hi all.
I've revised the fsync patch with the cleanups discussed and gone through the close() calls.
AFAICS either socket closes, temp file closes, or (for WAL) already PANIC on close. It's mainly fd.c that needs amendment. Which I've done per the attached revised patch.
Attachment
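Purely as an illustration of the close() handling being discussed (this is not the attached patch - the wrapper name is made up, while ereport() and errcode_for_file_access() are the usual PostgreSQL reporting routines):

    static void
    close_permanent_file(int fd, const char *path)
    {
        if (close(fd) != 0)
        {
            int elevel = WARNING;

            /*
             * On NFS and similar filesystems close() can report a deferred
             * writeback failure or ENOSPC; treat those like a failed fsync().
             */
            if (errno == EIO || errno == ENOSPC)
                elevel = PANIC;

            ereport(elevel,
                    (errcode_for_file_access(),
                     errmsg("could not close file \"%s\": %m", path)));
        }
    }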
On Wed, May 23, 2018 at 8:02 AM, Andres Freund <andres@anarazel.de> wrote: > [patches] Hi Andres, Obviously there is more work to be done here but the basic idea in your clone-fd-checkpointer branch as of today seems OK to me. I think Craig and I both had similar ideas (sending file descriptors that have an old enough f_wb_err) but we thought rlimit management was going to be hard (shared memory counters + deduplication, bleugh). You made it look easy. Nice work. First, let me describe in my own words what's going on, mostly to make sure I really understand this: 1. We adopt a new fd lifecycle that is carefully tailored to avoid error loss on Linux, and can't hurt on other OSes. By sending the file descriptor used for every write to the checkpointer, we guarantee that (1) the inode stays pinned (the file is "in flight" in the socket, even if the sender closes it before the checkpointer receives it) so Linux won't be tempted to throw away the precious information in i_mapping->wb_err, and (2) the checkpointer finishes up with a file descriptor that points to the very same "struct file" with the f_wb_err value that was originally sampled before the write, by the sender. So we can't miss any write-back errors. Wahey! However, errors are still reported only once, so we probably need to panic. Hmm... Is there any way that the *sender* could finish up in file_check_and_advance_wb_err() for the same struct file, before the checkpointer manages to call fsync() on its dup'd fd? I don't immediately see how (it looks like you have to call one of the various sync functions to reach that, and your patch removes the fallback just-call-FileSync()-myself code from register_dirty_segment()). I guess it would be bad if, say, close() were to do that in some future Linux release because then we'd have no race-free way to tell the checkpointer that the file is borked before it runs fsync() and potentially gets an OK response and reports a successful checkpoint (even if we panicked, with sufficiently bad timing it might manage to report a successful checkpoint). 2. In order to make sure that we don't exceed our soft limit on the number of file descriptors per process, you use a simple backend-local counter in the checkpointer, on the theory that we don't care about fds (or, technically, the files they point to) waiting in the socket buffer, we care only about how many the checkpointer has actually received but not yet closed. As long as we make sure there is space for at least one more before we read one message, everything should be fine. Good and simple. One reason I thought this would be harder is because I had no model of how RLIMIT_NOFILE would apply when you start flinging fds around between processes (considering there can be moments when neither end has the file open), so I feared the worst and thought we would need to maintain a counter in shared memory and have senders and receiver cooperate to respect it. My intuition that this was murky and required pessimism wasn't too far from the truth actually: apparently the kernel didn't do a great job at accounting for that until a patch[1] landed for CVE-2013-4312. The behaviour in older releases is super lax, so no problem there. The behaviour from 4.9 onward (or is it 4.4.1?) is that you get a separate per-user RLIMIT_NOFILE allowance for in-flight fds. So effectively the sender doesn't have to worry about about fds it has sent but closed and the receiver doesn't have to care about fds it hasn't received yet, so your counting scheme seems OK. 
As for exceeding RLIMIT_NOFILE with in-flight fds, it's at least bounded by the fact that the socket would block/EWOULDBLOCK if the receiver isn't draining it fast enough, and can only hold a small and finite amount of data and thus file descriptors, so we can probably just ignore that. If you did manage to exceed it, I think you'd find out about that with ETOOMANYREFS at send time (curiously absent from the sendmsg() man page, but there in black and white in the source for unix_attach_fds()), and then you'd just raise an error (currently FATAL in your patch). I have no idea how the rlimit for SCM-ified files works on other Unixoid systems though.

Some actual code feedback:

+		if (entry->syncfds[forknum][segno] == -1)
+		{
+			char *path = mdpath(entry->rnode, forknum, segno);
+			open_fsync_queue_files++;
+			/* caller must have reserved entry */
+			entry->syncfds[forknum][segno] =
+				FileOpenForFd(fd, path);
+			pfree(path);
+		}
+		else
+		{
+			/*
+			 * File is already open. Have to keep the older fd, errors
+			 * might only be reported to it, thus close the one we just
+			 * got.
+			 *
+			 * XXX: check for errrors.
+			 */
+			close(fd);
+		}

Wait... it's not guaranteed to be older in open() time, is it? It's just older in sendmsg() time. Those might be different:

A: open("base/42/1234")
A: write()
kernel: inode->i_mapping->wb_err++
B: open("base/42/1234")
B: write()
B: sendmsg()
A: sendmsg()
C: recvmsg() /* clone B's fd */
C: recvmsg() /* clone A's fd, throw it away because we already have B's */
C: fsync()

I think that would eat an error under the 4.13 code. I think it might be OK under the new 4.17 code, because the new "must report it to at least one caller" thing would save the day. So now I'm wondering if that's getting backported to the 4.14 long term kernel. Aha, yes it has been already[2]. So... if you don't have to worry about kernels without that patch, I suspect the exact ordering doesn't matter anymore, as long as *someone* held the file open at all times between write and fsync() to keep the inode around, which this patch achieves. (And of course no other process sets the ERRSEQ_SEEN flag, for example some other kind of backend or random other program that opened our data file and called one of the sync syscalls, or any future syscalls that start calling file_check_and_advance_wb_err()).

+	/*
+	 * FIXME: do DuplicateHandle dance for windows - can that work
+	 * trivially?
+	 */

I don't know, I guess something like CreatePipe() and then write_duplicate_handle()? And some magic (maybe our own LWLock) to allow atomic messages? A more interesting question is: how will you cap the number of file handles you send through that pipe? On that OS you call DuplicateHandle() to fling handles into another process's handle table directly. Then you send the handle number as plain old data to the other process via carrier pigeon, smoke signals, a pipe etc. That's interesting because the handle allocation is asynchronous from the point of view of the receiver. Unlike the Unix case where the receiver can count handles and make sure there is space for one more before it reads a potentially-SCM-containing message, here the *senders* will somehow need to make sure they don't create too many in the receiving process. I guess that would involve a shared counter, and a strategy for what to do when the number is too high (probably just wait). Hmm. I wonder if that would be a safer approach on all operating systems.

+	if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, fsync_fds) < 0)
...
+	size = sendmsg(sock, &msg, 0);

Here you are relying on atomic sending and receiving of whole messages over a stream socket. For example, in Linux's unix_stream_sendmsg() it's going to be chopped into buffers of size (sk->sk_sndbuf >> 1) - 64 (I have no clue how big that is) that are appended one at a time to the receiving socket's queue, with no locking in between, so concurrently sent messages can be interleaved with yours. How big is big enough for that to happen? There doesn't seem to be an equivalent of PIPE_BUF for Unix domain sockets. These tiny messages are almost certainly safe, but I wonder if we should be using SOCK_SEQPACKET instead of SOCK_STREAM? Might be a less theoretical problem if we switch to variable-sized messages containing file paths as you once contemplated in an off-list chat. Presumably for EXEC_BACKEND we'll need to open PGDATA/something/something/socket or similar.

+	if (returnCode < 0)
+	{
+		/* XXX: decide on policy */
+		bms_add_member(entry->requests[forknum], segno);

Obviously this is pseudocode (doesn't even keep the return value), but just BTW, I think that if we decide not to PANIC unconditionally on any kind of fsync() failure, we definitely can't use bms_add_member() here (it might fail to allocate, and then we forget the segment, raise an error and won't try again). It's got to be PANIC or no-fail code (like the patch I proposed in another thread).

+SendFsyncRequest(CheckpointerRequest *request, int fd)
...
+	 * Don't think short reads will ever happen in realistic
...
+		ereport(FATAL, (errmsg("could not receive fsync request: %m")));

Short *writes*, could not *send*.

+	 * back, as that'd lead to loosing error reporting guarantees on

s/loosing/losing/

[1] https://github.com/torvalds/linux/commit/712f4aad406bb1ed67f3f98d04c044191f0ff593
[2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/lib/errseq.c?h=linux-4.14.y&id=0799a0ea96e4923f52f85fe315b62e9176a3319c

-- Thomas Munro http://www.enterprisedb.com
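(Continuing the illustrative sketch from earlier: the receiving side described in point 2 above pairs recvmsg() with the same kind of control message, plus the "make sure there is room for one more fd before reading a message" rule. Again, this is invented example code rather than the patch; MAX_OPEN_SYNC_FDS and the counter are placeholders.)

#include <string.h>
#include <sys/socket.h>

#define MAX_OPEN_SYNC_FDS 64       /* invented cap, well under RLIMIT_NOFILE */

static int open_sync_fds = 0;      /* backend-local counter in the checkpointer */

/*
 * Illustrative only: receive one request and, if present, one fd.  The
 * caller must have made sure open_sync_fds < MAX_OPEN_SYNC_FDS (by closing
 * or deduplicating fds it already holds) before calling this.
 */
static ssize_t
recv_fd_with_request(int sock, void *req, size_t reqlen, int *fd_out)
{
    struct msghdr msg = {0};
    struct iovec iov;
    union
    {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } control;
    struct cmsghdr *cmsg;
    ssize_t n;

    iov.iov_base = req;
    iov.iov_len = reqlen;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    n = recvmsg(sock, &msg, 0);
    if (n <= 0)
        return n;

    *fd_out = -1;
    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL; cmsg = CMSG_NXTHDR(&msg, cmsg))
    {
        if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS)
        {
            memcpy(fd_out, CMSG_DATA(cmsg), sizeof(int));
            open_sync_fds++;       /* we now own this descriptor */
        }
    }
    return n;
}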
On Tue, May 29, 2018 at 4:53 AM, Craig Ringer <craig@2ndquadrant.com> wrote: > I've revised the fsync patch with the cleanups discussed and gone through > the close() calls. > > AFAICS either socket closes, temp file closes, or (for WAL) already PANIC on > close. It's mainly fd.c that needs amendment. Which I've done per the > attached revised patch. I think we should have a separate thread for this patch vs. Andres's patch to do magic things with the checkpointer and file-descriptor forwarding. Meanwhile, here's some review. 1. No longer applies cleanly. 2. I don't like promote_ioerr_to_panic() very much, partly because the same pattern gets repeated over and over, and partly because it would be awkwardly-named if we discovered that another 2 or 3 errors needed similar handling (or some other variant handling). I suggest instead having a function like report_critical_fsync_failure(char *path) that does something like this: int elevel = ERROR; if (errno == EIO) elevel = PANIC; ereport(elevel, (errcode_for_file_access(), errmsg("could not fsync file \"%s\": %m", path); And similarly I'd add report_critical_close_failure. In some cases, this would remove wording variations (e.g. in twophase.c) but I think that's fine, and maybe an improvement, as discussed on another recent thread. 3. slru.c calls pg_fsync() but isn't changed by the patch. That looks wrong. 4. The comment changes in snapbuild.c interact with the TODO that immediately follows. I think more adjustment is needed here. 5. It seems odd that you adjusted the comment for pg_fsync_no_writethrough() but not pg_fsync_writethrough() or pg_fsync(). Either pg_fsync_writethrough() doesn't have the same problem, in which case, awesome, but let's add a comment, or it does, in which case it should refer to the other one. And I think pg_fsync() itself needs a comment saying that every caller must be careful to use promote_ioerr_to_panic() or report_critical_fsync_failure() or whatever we end up calling it unless the fsync is not critical for data integrity. 6. In md.c, there's a stray blank line added. But more importantly, the code just above that looks like this: if (!FILE_POSSIBLY_DELETED(errno) || failures > 0) - ereport(ERROR, + ereport(promote_ioerr_to_panic(ERROR), (errcode_for_file_access(), errmsg("could not fsync file \"%s\": %m", path))); else ereport(DEBUG1, (errcode_for_file_access(), errmsg("could not fsync file \"%s\" but retrying: %m", path))); I might be all wet here, but it seems like if we enter the bottom branch, we still need the promote-to-panic behavior. 7. The comment adjustment for SyncDataDirectory mentions an "important" fact about fsync behavior, but then doesn't seem to change any logic on that basis. I think in general a number of these comments need a little more thought, but in this particular case, I think we also need to consider what the behavior should be (and the comment should reflect our considered judgement on that point, and the implementation should match). 8. Andres suggested to me off-list that we should have a GUC to disable the promote-to-panic behavior in case it turns out to be a show-stopper for some user. I think that's probably a good idea. Adding many new ways to PANIC in a minor release without providing any way to go back to the old behavior sounds unfriendly. Surely, anyone who suffers much from this has really serious other problems anyway, but all the same I think we should provide an escape hatch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jul 19, 2018 at 7:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> 2. I don't like promote_ioerr_to_panic() very much, partly because the
> same pattern gets repeated over and over, and partly because it would
> be awkwardly-named if we discovered that another 2 or 3 errors needed
> similar handling (or some other variant handling). I suggest instead
> having a function like report_critical_fsync_failure(char *path) that
> ...

Note that if we don't cover *all* errno values, or ...

> 8. Andres suggested to me off-list that we should have a GUC to
> disable the promote-to-panic behavior in case it turns out to be a
> show-stopper for some user.

... we let the user turn this off, then we also have to fix this:

https://www.postgresql.org/message-id/flat/87y3i1ia4w.fsf@news-spur.riddles.org.uk

-- Thomas Munro http://www.enterprisedb.com
On Thu, Jun 14, 2018 at 5:30 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Wed, May 23, 2018 at 8:02 AM, Andres Freund <andres@anarazel.de> wrote:
>> [patches]
>
> A more interesting question is: how will you cap the number of file
> handles you send through that pipe? On that OS you call
> DuplicateHandle() to fling handles into another process's handle table
> directly. Then you send the handle number as plain old data to the
> other process via carrier pigeon, smoke signals, a pipe etc. That's
> interesting because the handle allocation is asynchronous from the
> point of view of the receiver. Unlike the Unix case where the
> receiver can count handles and make sure there is space for one more
> before it reads a potentially-SCM-containing message, here the
> *senders* will somehow need to make sure they don't create too many in
> the receiving process. I guess that would involve a shared counter,
> and a strategy for what to do when the number is too high (probably
> just wait).
>
> Hmm. I wonder if that would be a safer approach on all operating systems.

As a way of poking this thread, here are some more thoughts.

Buffer stealing currently looks something like this:

Evicting backend:
  lseek(fd)
  write(fd)
  ...enqueue-fsync-request via shm...

Checkpointer:
  ...push into hash table...

With the patch it presumably looks something like this:

Evicting backend:
  lseek(fd)
  write(fd)
  sendmsg(fsync_socket) /* passes fd */

Checkpointer:
  recvmsg(fsync_socket) /* gets a copy of fd */
  ...push into hash table...
  close(fd) /* for all but the first one received for the same file */

That takes us from 2 syscalls to 5 per evicted buffer. I suppose it's possible that on some operating systems that might hurt a bit, given that it's happening at the granularity of 1GB data files that could have a lot of backends working in them concurrently. I have no idea if it's really a problem on any particular OS. Admittedly on Linux it's probably just a bunch of fast atomic ops and RCU stuff... probably only the existing write() actually takes the inode lock or anything that heavy, and that's probably lost in the noise in an evict-heavy workload. I don't know, I guess it's probably not a problem, but I thought I'd mention that.

Contention on the new fsync socket doesn't seem to be a new problem per se since it replaces a contention point we already had: CheckpointerCommLock. If that was acceptable today then perhaps that indicates that any in-kernel contention created by the new syscalls is also OK.

My feeling so far is that I'd probably go for the sender-collapses model (and it might even be necessary on Windows?) if doing this as a new feature, but I fully understand your desire to do it in a much simpler way that could be back-patched more easily. I'm just slightly concerned about the unintended consequence risk that comes with exercising an operating system feature that not all operating system authors probably intended to be used at high frequency. Nothing that can't be assuaged by testing.

 * the queue is full and contains no duplicate entries. In that case, we
 * let the backend know by returning false.
 */
-bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+void
+ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno,
+					File file)

Comment out of date.

-- Thomas Munro http://www.enterprisedb.com
On Sun, Jul 29, 2018 at 6:14 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> As a way of poking this thread, here are some more thoughts.

I am keen to move this forward, not only because it is something we need to get fixed, but also because I have some other pending patches in this area and I want this sorted out first.

Here are some small fix-up patches for Andres's patchset:

1. Use FD_CLOEXEC instead of the non-portable Linuxism SOCK_CLOEXEC.

2. Fix the self-deadlock hazard reported by Dmitry Dolgov. Instead of the checkpointer trying to send itself a CKPT_REQUEST_SYN message through the socket (whose buffer may be full), I included the ckpt_started counter in all messages. When AbsorbAllFsyncRequests() drains the socket, it stops at messages with the current ckpt_started value.

3. Handle postmaster death while waiting.

4. I discovered that macOS would occasionally return EMSGSIZE for sendmsg(), but treating that just like EAGAIN seems to work the next time around. I couldn't make that happen on FreeBSD (I mention that because the implementation is somehow related). So handle that weird case on macOS only for now. Testing on other Unixoid systems would be useful. The case that produced occasional EMSGSIZE on macOS was: shared_buffers=1MB, max_files_per_process=32, installcheck-parallel. Based on man pages that seems to imply an error in the client code but I don't see it. (I also tried to use SOCK_SEQPACKET instead of SOCK_STREAM, but it's not supported on macOS. I also tried to use SOCK_DGRAM, but that produced occasional ENOBUFS errors and retrying didn't immediately succeed, leading to busy syscall churn. This is all rather unsatisfying, since SOCK_STREAM is not guaranteed by any standard to be atomic, and we're writing messages from many backends into the socket so we're assuming atomicity. I don't have a better idea that is portable.)

There are a couple of FIXMEs remaining, and I am aware of three more problems:

* Andres mentioned to me off-list that there may be a deadlock risk where the checkpointer gets stuck waiting for an IO lock. I'm going to look into that.

* Windows. Patch soon.

* The ordering problem that I mentioned earlier: the patchset wants to keep the *oldest* fd, but it's really the oldest it has received. An idea Andres and I discussed is to use a shared atomic counter to assign a number to all file descriptors just before their first write, and send that along with it to the checkpointer. Patch soon.

-- Thomas Munro http://www.enterprisedb.com
Attachment
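(The "treat EMSGSIZE just like EAGAIN" workaround in point 4 above amounts to a retry loop around the send. A hedged, purely illustrative sketch in plain POSIX C follows; the real patch waits on PostgreSQL's latch machinery and watches for postmaster death rather than calling poll() with a timeout, and send_whole_message is an invented name.)

#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Illustrative only: push a whole message into a non-blocking socket. */
static int
send_whole_message(int sock, struct msghdr *msg)
{
    for (;;)
    {
        if (sendmsg(sock, msg, 0) >= 0)
            return 0;

        if (errno == EAGAIN || errno == EWOULDBLOCK
#ifdef __APPLE__
            || errno == EMSGSIZE    /* spurious on macOS, per the report above */
#endif
            )
        {
            struct pollfd pfd = {sock, POLLOUT, 0};

            /*
             * Wait (bounded) for buffer space and retry.  The real patch
             * also has to notice postmaster death here rather than
             * sleeping blindly.
             */
            (void) poll(&pfd, 1, 100);
            continue;
        }

        return -1;              /* genuine error: caller would ereport */
    }
}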
I was looking at the commitfest entry for this feature (https://commitfest.postgresql.org/19/1639/) for the most recent list of patches to try out. The list doesn't look correct/complete. Can someone please check? Asim
On Wed, Aug 15, 2018 at 11:08 AM, Asim R P <apraveen@pivotal.io> wrote: > I was looking at the commitfest entry for feature > (https://commitfest.postgresql.org/19/1639/) for the most recent list > of patches to try out. The list doesn't look correct/complete. Can > someone please check? Hi Asim, This thread is a bit tangled up. There are two related patchsets in it: 1. Craig Ringer's PANIC-on-EIO patch set, to cope with the fact that Linux throws away buffers and errors after reporting an error, so the checkpointer shouldn't retry as it does today. The latest is here: https://www.postgresql.org/message-id/CAMsr%2BYFPeKVaQ57PwHqmRNjPCPABsdbV%3DL85he2dVBcr6yS1mA%40mail.gmail.com 2. Andres Freund's fd-sending fsync queue, to cope with the fact that some versions of Linux only report writeback errors that occurred after you opened the file, and all versions of Linux and some other operating systems might forget about writeback errors while no one has it open. Here is the original patchset: https://www.postgresql.org/message-id/20180522010823.z5bdq7wnlsna5qoo%40alap3.anarazel.de Here is a fix-up you need: https://www.postgresql.org/message-id/20180522185951.5sdudzl46spktyyz%40alap3.anarazel.de Here are some more fix-up patches that I propose: https://www.postgresql.org/message-id/CAEepm%3D2WSPP03-20XHpxohSd2UyG_dvw5zWS1v7Eas8Rd%3D5e4A%40mail.gmail.com I will soon post some more fix-up patches that add EXEC_BACKEND support, Windows support, and a counting scheme to fix the timing issue that I mentioned in my first review. I will probably squash it all down to a tidy patch-set after that. -- Thomas Munro http://www.enterprisedb.com
On 15 August 2018 at 07:32, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Wed, Aug 15, 2018 at 11:08 AM, Asim R P <apraveen@pivotal.io> wrote:
> > I was looking at the commitfest entry for feature
> > (https://commitfest.postgresql.org/19/1639/) for the most recent list
> > of patches to try out. The list doesn't look correct/complete. Can
> > someone please check?
>
> Hi Asim,
>
> This thread is a bit tangled up. There are two related patchsets in it:
>
> 1. Craig Ringer's PANIC-on-EIO patch set, to cope with the fact that
> Linux throws away buffers and errors after reporting an error, so the
> checkpointer shouldn't retry as it does today. The latest is here:
> https://www.postgresql.org/message-id/CAMsr%2BYFPeKVaQ57PwHqmRNjPCPABsdbV%3DL85he2dVBcr6yS1mA%40mail.gmail.com
>
> 2. Andres Freund's fd-sending fsync queue, to cope with the fact that
> some versions of Linux only report writeback errors that occurred
> after you opened the file, and all versions of Linux and some other
> operating systems might forget about writeback errors while no one has
> it open.
>
> Here is the original patchset:
> https://www.postgresql.org/message-id/20180522010823.z5bdq7wnlsna5qoo%40alap3.anarazel.de
>
> Here is a fix-up you need:
> https://www.postgresql.org/message-id/20180522185951.5sdudzl46spktyyz%40alap3.anarazel.de
>
> Here are some more fix-up patches that I propose:
> https://www.postgresql.org/message-id/CAEepm%3D2WSPP03-20XHpxohSd2UyG_dvw5zWS1v7Eas8Rd%3D5e4A%40mail.gmail.com
>
> I will soon post some more fix-up patches that add EXEC_BACKEND
> support, Windows support, and a counting scheme to fix the timing
> issue that I mentioned in my first review. I will probably squash it
> all down to a tidy patch-set after that.
Thanks very much, Thomas.
I've had to back off from this a bit after posting my initial panic-for-safety patch, as the changes Andres proposed are a bit out of my current depth and time capacity.
I still think the panic patch is needed and appropriate, but agree it's not *sufficient*.
On Thu, Aug 30, 2018 at 2:44 PM Craig Ringer <craig@2ndquadrant.com> wrote:
> On 15 August 2018 at 07:32, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> I will soon post some more fix-up patches that add EXEC_BACKEND
>> support, Windows support, and a counting scheme to fix the timing
>> issue that I mentioned in my first review. I will probably squash it
>> all down to a tidy patch-set after that.

I went down a bit of a rabbit hole with the Windows support for Andres's patch set. I have something that works as far as I can tell, but my Windows environment consists of throwing things at Appveyor and seeing what sticks, so I'm hoping that someone with a real Windows system and knowledge will be able to comment.

New patches in this WIP patch set:

0012: Fix for EXEC_BACKEND.

0013: Windows. This involved teaching latch.c to deal with Windows asynchronous IO events, since you can't wait for pipe readiness via WSAEventSelect. Pipes and sockets exist in different dimensions on Windows, and there are no "Unix" domain sockets (well, there are, but they aren't usable yet[1]). An alternative would be to use TCP sockets for this, and then the code would look more like the Unix code, but that seems a bit strange. Note that the Windows version doesn't actually hand off file handles like the Unix code (it could fairly easily, but there is no reason to think that would actually be useful on that platform). I may be way off here... The 0013 patch also fixes a mistake in the 0010 patch: it is not appropriate to call CFI() while waiting to notify the checkpointer of a dirty segment, because then ^C could cause the following checkpoint not to flush dirty data. SendFsyncRequest() is essentially blocking, except that it uses non-blocking IO so that it can multiplex postmaster death detection.

0014: Fix the ordering race condition mentioned upthread[2]. All files are assigned an increasing sequence number after [re]opening (i.e. before their first write), so that the checkpointer process can track the fd that must have the oldest Linux f_wb_err that could be relevant for writes done by PostgreSQL.

The other patches in this tarball are all as posted already, but are now rebased and assembled in one place. Also pushed to https://github.com/macdice/postgres/tree/fsyncgate . Thoughts?

[1] https://blogs.msdn.microsoft.com/commandline/2017/12/19/af_unix-comes-to-windows/
[2] https://www.postgresql.org/message-id/CAEepm%3D04ZCG_8N3m61kXZP-7Ecr02HUNNG-QsAhwyFLim4su2g%40mail.gmail.com

-- Thomas Munro http://www.enterprisedb.com
Attachment
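(For reference, the handle-duplication approach that the Unix-style fd passing would correspond to on Windows looks roughly like the hedged sketch below. As noted above, the 0013 patch deliberately does not hand off handles, so this is only to illustrate the alternative "sender pushes a handle into the receiver's table" scheme; checkpointer_pid and the error handling are placeholders.)

#include <windows.h>
#include <io.h>

/*
 * Illustrative only: duplicate the handle behind CRT file descriptor 'fd'
 * into the checkpointer's handle table, returning the handle value *as seen
 * by the checkpointer*, which would then be sent as plain data over a pipe.
 */
static HANDLE
duplicate_into_checkpointer(int fd, DWORD checkpointer_pid)
{
    HANDLE  src = (HANDLE) _get_osfhandle(fd);
    HANDLE  target_process;
    HANDLE  dup = INVALID_HANDLE_VALUE;

    target_process = OpenProcess(PROCESS_DUP_HANDLE, FALSE, checkpointer_pid);
    if (target_process == NULL)
        return INVALID_HANDLE_VALUE;

    if (!DuplicateHandle(GetCurrentProcess(), src,
                         target_process, &dup,
                         0, FALSE, DUPLICATE_SAME_ACCESS))
        dup = INVALID_HANDLE_VALUE;

    CloseHandle(target_process);
    return dup;     /* send this value to the checkpointer as ordinary bytes */
}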
On Fri, Sep 28, 2018 at 9:37 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > The 0013 patch also fixes a mistake in the 0010 patch: it is not > appropriate to call CFI() while waiting to notify the checkpointer of > a dirty segment, because then ^C could cause the following checkpoint > not to flush dirty data. (Though of course it wouldn't actually do that due to an LWLock being held, but still, I removed the CFI because it was at best misleading). -- Thomas Munro http://www.enterprisedb.com
On Fri, Sep 28, 2018 at 9:37 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > The other patches in this tarball are all as posted already, but are > now rebased and assembled in one place. Also pushed to > https://github.com/macdice/postgres/tree/fsyncgate . Here is a new version that fixes an assertion failure during crash recovery, revealed by cfbot. I also took the liberty of squashing the patch stack into one and writing a new commit message, except the Windows part which seems worth keeping separate until we agree it's the right way forward. -- Thomas Munro http://www.enterprisedb.com
Attachment
Hello hackers,

Let's try to get this issue resolved. Here is my position on the course of action we should take in back-branches:

1. I am -1 on back-patching the fd-transfer code. It's a significant change, and even when sufficiently debugged (I don't think it's there yet), we have no idea what will happen on all the kernels we support under extreme workloads. IMHO there is no way we can spring this on users in a point release.

2. I am +1 on back-patching Craig's PANIC-on-failure logic. Doing nothing is not an option I like. I have some feedback and changes to propose though; see attached.

Responses to a review from Robert:

On Thu, Jul 19, 2018 at 7:23 AM Robert Haas <robertmhaas@gmail.com> wrote:
> 2. I don't like promote_ioerr_to_panic() very much, partly because the
> same pattern gets repeated over and over, and partly because it would
> be awkwardly-named if we discovered that another 2 or 3 errors needed
> similar handling (or some other variant handling). I suggest instead
> having a function like report_critical_fsync_failure(char *path) that
> does something like this:
>
>     int elevel = ERROR;
>
>     if (errno == EIO)
>         elevel = PANIC;
>
>     ereport(elevel,
>             (errcode_for_file_access(),
>              errmsg("could not fsync file \"%s\": %m", path)));
>
> And similarly I'd add report_critical_close_failure. In some cases,
> this would remove wording variations (e.g. in twophase.c) but I think
> that's fine, and maybe an improvement, as discussed on another recent
> thread.

I changed it to look like data_sync_elevel(ERROR) and made it treat all errnos the same. ENOSPC, EIO, EWOK, EIEIO, it makes no difference to the level of faith I have that my data still exists.

> 3. slru.c calls pg_fsync() but isn't changed by the patch. That looks wrong.

Fixed.

> 4. The comment changes in snapbuild.c interact with the TODO that
> immediately follows. I think more adjustment is needed here.

I don't understand this.

> 5. It seems odd that you adjusted the comment for
> pg_fsync_no_writethrough() but not pg_fsync_writethrough() or
> pg_fsync(). Either pg_fsync_writethrough() doesn't have the same
> problem, in which case, awesome, but let's add a comment, or it does,
> in which case it should refer to the other one. And I think
> pg_fsync() itself needs a comment saying that every caller must be
> careful to use promote_ioerr_to_panic() or
> report_critical_fsync_failure() or whatever we end up calling it
> unless the fsync is not critical for data integrity.

I removed these comments and many others; I don't see the point in scattering descriptions of this problem and references to specific versions of Linux and -hackers archive links all over the place. I added a comment in one place, and also added some user documentation of the problem.

> 6. In md.c, there's a stray blank line added. But more importantly,
> the code just above that looks like this:
>
>             if (!FILE_POSSIBLY_DELETED(errno) ||
>                 failures > 0)
> -               ereport(ERROR,
> +               ereport(promote_ioerr_to_panic(ERROR),
>                         (errcode_for_file_access(),
>                          errmsg("could not fsync file \"%s\": %m",
>                                 path)));
>             else
>                 ereport(DEBUG1,
>                         (errcode_for_file_access(),
>                          errmsg("could not fsync file \"%s\" but retrying: %m",
>                                 path)));
>
> I might be all wet here, but it seems like if we enter the bottom
> branch, we still need the promote-to-panic behavior.

That case is only reached if FILE_POSSIBLY_DELETED() holds on the first time through the loop, and it detects an errno value not actually from fsync().
It's from FileSync(), when it tries to reopen a virtual fd and gets ENOENT, before calling fsync(). Code further down then absorbs incoming requests before checking if that was expected, closing a race. The comments could make that clearer, admittedly. > 7. The comment adjustment for SyncDataDirectory mentions an > "important" fact about fsync behavior, but then doesn't seem to change > any logic on that basis. I think in general a number of these > comments need a little more thought, but in this particular case, I > think we also need to consider what the behavior should be (and the > comment should reflect our considered judgement on that point, and the > implementation should match). I updated the comment. I don't think this is too relevant to the fsync() failure case, because we'll be rewriting all changes from the WAL again during recovery; I think this function is mostly useful for switching from fsync = off to fsync = on and restarting, not coping with previous fsync() failures by retrying (which we know to be useless anyway). Someone could argue that if you restarted after changing fsync from off to on, then this may be the first time you learn that write-back failed, and then you're somewhat screwed whether we panic or not, but I don't see any solution to that. Don't run databases with fsync = off. > 8. Andres suggested to me off-list that we should have a GUC to > disable the promote-to-panic behavior in case it turns out to be a > show-stopper for some user. I think that's probably a good idea. > Adding many new ways to PANIC in a minor release without providing any > way to go back to the old behavior sounds unfriendly. Surely, anyone > who suffers much from this has really serious other problems anyway, > but all the same I think we should provide an escape hatch. +1. See the new GUC data_sync_retry, defaulting to false. If set to true, we also need to fix the problem reported in [1], so here's the patch for that too. Other comments: I don't see why sync_file_range(SYNC_FILE_RANGE_WRITE) should get a pass here. Inspection of some version of the kernel might tell us it can't advance the error counter and report failure, but what do we gain by relying on that? Changed. FD_DELETE_AT_CLOSE is not a good way to detect temporary files in recent versions, as it doesn't detect the kind of shared temporary files used by Parallel Hash; FD_TEMP_FILE_LIMIT is a better way. Changed. (We could also just not bother exempting temporary files?) I plan to continue working on the fd-transfer system as part of a larger sync queue redesign project for 12. If we can get an agreement that we can't possibly back-patch the fd-transfer logic, then we can move all future discussion of that topic over to the other thread[2], and this thread can be about consensus to back-patch the PANIC patch. Thoughts? [1] https://www.postgresql.org/message-id/flat/87y3i1ia4w.fsf@news-spur.riddles.org.uk [2] https://www.postgresql.org/message-id/flat/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
Attachment
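(In rough terms, the data_sync_elevel() helper mentioned above, together with the new data_sync_retry GUC, boils down to the following shape. This is a simplified sketch rather than a quote of the patch.)

#include "postgres.h"           /* for PANIC and the other elevels */

/* GUC: if true, keep the old retry-next-checkpoint behaviour (default off). */
bool        data_sync_retry = false;

/*
 * Promote the caller's error level to PANIC for failures that could cause
 * us to lose track of dirty data, unless the administrator opted out.
 */
int
data_sync_elevel(int elevel)
{
    return data_sync_retry ? elevel : PANIC;
}

Call sites then look like:

    ereport(data_sync_elevel(ERROR),
            (errcode_for_file_access(),
             errmsg("could not fsync file \"%s\": %m", path)));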
On Fri, 19 Oct 2018 at 07:27, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> 2. I am +1 on back-patching Craig's PANIC-on-failure logic. Doing
> nothing is not an option I like. I have some feedback and changes to
> propose though; see attached.
Thanks very much for the work on reviewing and revising this.
> I don't see why sync_file_range(SYNC_FILE_RANGE_WRITE) should get a
> pass here. Inspection of some version of the kernel might tell us it
> can't advance the error counter and report failure, but what do we
> gain by relying on that? Changed.
I was sure it made sense at the time, but I can't explain that decision now, and it looks like we should treat it as a failure.
On Fri, Oct 19, 2018 at 6:42 PM Craig Ringer <craig@2ndquadrant.com> wrote:
> On Fri, 19 Oct 2018 at 07:27, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> 2. I am +1 on back-patching Craig's PANIC-on-failure logic. Doing
>> nothing is not an option I like. I have some feedback and changes to
>> propose though; see attached.
>
> Thanks very much for the work on reviewing and revising this.

My plan is to do a round of testing and review of this stuff next week once the dust is settled on the current minor releases (including fixing a few typos I just spotted and some word-smithing). All going well, I will then push the resulting patches to master and all supported stable branches, unless other reviews or objections appear. At some point not too far down the track I hope to be ready to consider committing that other patch that will completely change all of this code in the master branch, but in any case Craig's patch will get almost a full minor release cycle to sit in the stable branches before release.

-- Thomas Munro http://www.enterprisedb.com
On Wed, Nov 7, 2018 at 9:41 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > My plan is do a round of testing and review of this stuff next week > once the dust is settled on the current minor releases (including > fixing a few typos I just spotted and some word-smithing). All going > well, I will then push the resulting patches to master and all > supported stable branches, unless other reviews or objections appear. > At some point not too far down the track I hope to be ready to > consider committing that other patch that will completely change all > of this code in the master branch, but in any case Craig's patch will > get almost a full minor release cycle to sit in the stable branches > before release. I did a read-through of these patches. + new_requests = entry->requests[forknum]; + entry->requests[forknum] = + bms_join(new_requests, requests); What happens if bms_join fails, too? + recover from the WAL after any failure is reported, preferrably preferably. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Nov 9, 2018 at 7:07 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Nov 7, 2018 at 9:41 PM Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > My plan is do a round of testing and review of this stuff next week > > once the dust is settled on the current minor releases (including > > fixing a few typos I just spotted and some word-smithing). All going > > well, I will then push the resulting patches to master and all > > supported stable branches, unless other reviews or objections appear. > > At some point not too far down the track I hope to be ready to > > consider committing that other patch that will completely change all > > of this code in the master branch, but in any case Craig's patch will > > get almost a full minor release cycle to sit in the stable branches > > before release. > > I did a read-through of these patches. > > + new_requests = entry->requests[forknum]; > + entry->requests[forknum] = > + bms_join(new_requests, requests); > > What happens if bms_join fails, too? My reasoning for choosing bms_join() is that it cannot fail, assuming the heap is not corrupted. It simply ORs the two bit-strings into whichever is the longer input string, and frees the shorter input string. (In an earlier version I used bms_union(), this function's non-destructive sibling, but then realised that it could fail to allocate() causing us to lose track of a 1 bit). Philosophical point: if pfree() throws, then bms_join() throws, but (assuming AllocSetFree() implementation) it can only throw if the heap is corrupted, eg elog(ERROR, "could not find block containing chunk %p", chunk) and possibly other errors. Of course it's impossible to make guarantees of any kind in case of arbitrary corruption. But perhaps we could do this in a critical section, so errors are promoted to PANIC. > + recover from the WAL after any failure is reported, preferrably > > preferably. Thanks. -- Thomas Munro http://www.enterprisedb.com
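(The critical-section idea at the end would look something like the sketch below, since any ERROR raised inside a critical section is automatically promoted to PANIC; merge_requests is an invented wrapper, not code from the patch.)

#include "postgres.h"
#include "miscadmin.h"          /* START_CRIT_SECTION / END_CRIT_SECTION */
#include "nodes/bitmapset.h"

/*
 * Illustrative only: merge incoming segment numbers into an entry's set,
 * escalating any unexpected error to PANIC so a dirty segment can't be
 * silently forgotten.
 */
static void
merge_requests(Bitmapset **dst, Bitmapset *incoming)
{
    START_CRIT_SECTION();
    /* bms_join() ORs into the longer input and frees the other; no allocation */
    *dst = bms_join(*dst, incoming);
    END_CRIT_SECTION();
}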
On Thu, Nov 8, 2018 at 3:04 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > My reasoning for choosing bms_join() is that it cannot fail, assuming > the heap is not corrupted. It simply ORs the two bit-strings into > whichever is the longer input string, and frees the shorter input > string. (In an earlier version I used bms_union(), this function's > non-destructive sibling, but then realised that it could fail to > allocate() causing us to lose track of a 1 bit). Oh, OK. I was assuming it was allocating. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Nov 9, 2018 at 9:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 8, 2018 at 3:04 PM Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > My reasoning for choosing bms_join() is that it cannot fail, assuming
> > the heap is not corrupted. It simply ORs the two bit-strings into
> > whichever is the longer input string, and frees the shorter input
> > string. (In an earlier version I used bms_union(), this function's
> > non-destructive sibling, but then realised that it could fail to
> > allocate() causing us to lose track of a 1 bit).
>
> Oh, OK. I was assuming it was allocating.

I did some more testing using throw-away fault injection patch 0003. I found one extra problem: fsync_fname() needed data_sync_elevel() treatment, because it is used in e.g. CheckPointCLOG().

With data_sync_retry = on, if you update a row, touch /tmp/FileSync_EIO and try to checkpoint, then the checkpoint fails and the cluster keeps running. Future checkpoint attempts report the same error about the same file, showing that patch 0001 works (we didn't forget about the dirty file). Then rm /tmp/FileSync_EIO, and the next checkpoint should succeed. With data_sync_retry = off (the default), the same test produces a PANIC, showing that patch 0002 works.

It's similar if you touch /tmp/pg_sync_EIO instead. That shows that cases like fsync_fname("pg_xact") also cause PANIC when data_sync_retry = off, but it hides the bug that 0001 fixes when data_sync_retry = on, hence my desire to test the two different fault injection points.

I think these patches are looking good now. If I don't spot any other problems or hear any objections, I will commit them tomorrow-ish.

-- Thomas Munro http://www.enterprisedb.com
Attachment
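(The throw-away fault-injection patch 0003 isn't shown in the thread; one plausible, entirely hypothetical shape for that kind of hack is a wrapper that reports EIO whenever the marker file mentioned above exists.)

#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical sketch of a throw-away fault injection: if the marker file
 * exists, pretend the kernel reported a writeback error for this fsync.
 */
static int
fsync_with_injected_fault(int fd)
{
    struct stat st;

    if (stat("/tmp/FileSync_EIO", &st) == 0)
    {
        errno = EIO;
        return -1;
    }
    return fsync(fd);
}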
On Fri, Nov 9, 2018 at 9:03 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Fri, Nov 9, 2018 at 7:07 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Wed, Nov 7, 2018 at 9:41 PM Thomas Munro
> > <thomas.munro@enterprisedb.com> wrote:
> > > My plan is to do a round of testing and review of this stuff next week
> > > once the dust is settled on the current minor releases (including
> > > fixing a few typos I just spotted and some word-smithing). All going
> > > well, I will then push the resulting patches to master and all
> > > supported stable branches, unless other reviews or objections appear.
...

On Sun, Nov 18, 2018 at 3:20 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I think these patches are looking good now. If I don't spot any other
> problems or hear any objections, I will commit them tomorrow-ish.

Hearing no objections, pushed to all supported branches. Thank you to Craig for all his work getting to the bottom of this, to Andres for his open source diplomacy, and the Linux guys for their change "errseq: Always report a writeback error once" which came out of that.

Some more comments:

* The promotion of errors from close() to PANIC may or may not be effective considering that it doesn't have interlocking with concurrent checkpoints. I'm not sure if it can really happen on local file systems anyway... this may fall under the category of "making PostgreSQL work reliably on NFS", a configuration that is not recommended currently, and a separate project IMV.

* In 9.4 and 9.5 there is no checking of errors from sync_file_range(), and I didn't add any for now. It was claimed that sync_file_range() without BEFORE/AFTER can't consume errors[1]. Errors are promoted in 9.6+ for consistency because we already looked at the return code, so we won't rely on that knowledge in the long term.

* I personally believe it is safe to run with data_sync_retry = on on any file system on FreeBSD, and ZFS on any operating system... but I see no need to make recommendations about that in the documentation, other than that you should investigate the behaviour of your operating system if you really want to turn it on.

* A PANIC (and possibly an ensuing crash-restart loop if the I/O error is not transient) is of course a very unpleasant failure mode, but it is one that we already had for the WAL and control file. So I'm not sure I'd personally bother to run with the non-default setting even on a system where I believe it to be safe (considering the low likelihood that I/O failure is isolated to certain files); at best it probably gives you a better experience if the fs underneath a non-default tablespace dies.

* The GUC is provided primarily because this patch is so drastic in its effect that it seems like we owe our users a way to disable it on principle, and that seems to outweigh a desire not to add GUCs in back-branches.

* If I/O errors happen, your system is probably toast and you need to fail over or restore from backups, but at least we won't tell you any lies about checkpoints succeeding. In rare scenarios, perhaps involving a transient failure of virtualised storage with thin provisioning as originally described by Craig, the system may actually be able to continue running, and with this change we should now be able to avoid data loss by recovering from the WAL.

* As noted in the commit message, this isn't quite the end of the story. See the fsync queue redesign thread[2], WIP for master only.
[1] https://www.postgresql.org/message-id/20180430160945.5s5qfoqryhtmugxl%40alap3.anarazel.de [2] https://www.postgresql.org/message-id/flat/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com -- Thomas Munro http://www.enterprisedb.com