Re: Two fsync related performance issues? - Mailing list pgsql-hackers
| From | Thomas Munro |
|---|---|
| Subject | Re: Two fsync related performance issues? |
| Date | |
| Msg-id | CA+hUKG+8pys+bLSoN6C-F1Nss2BiJfx43VyO5jL+P-whSMPQpg@mail.gmail.com |
| In response to | Re: Two fsync related performance issues? (Thomas Munro <thomas.munro@gmail.com>) |
| List | pgsql-hackers |
On Wed, Oct 7, 2020 at 6:17 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Oct 5, 2020 at 2:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Wed, Sep 9, 2020 at 3:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > For the record, Andres Freund mentioned a few problems with this
> > > off-list and suggested we consider calling Linux syncfs() for each top
> > > level directory that could potentially be on a different filesystem.
> > > That seems like a nice idea to look into.
>
> ... and for comparison/discussion, here is an alternative patch that
> figures out precisely which files need to be fsync'd using information
> in the WAL.

[...]

Michael Banck reported[1] a system that spent 20 minutes in SyncDataDirectory(). His summary caused me to go back and read the discussions[2][3] that produced the current behaviour via commits 2ce439f3 and d8179b00, and I wanted to add a couple more observations about the two draft patches mentioned above.

About the need to sync files that were dirtied during a previous run:

1. The syncfs() patch has the same ignore-errors-and-press-on behaviour as d8179b00 gave us, though on Linux < 5.8 it would not even report the errors at LOG level (see the sketch below).

2. The "precise" fsync() patch defers the work until after redo, but if you get errors while processing the queued syncs, you would not be able to complete the end-of-recovery checkpoint. This is correct behaviour in my opinion: any such checkpoint that was allowed to complete would be a lie, and would make the corruption permanent by releasing the WAL that was our only hope of recovering. If you want to run a so-damaged system for forensic purposes, you could always bring it up with fsync=off, or consider the idea from a nearby thread of allowing the end-of-recovery checkpoint to be disabled (then the eventual first checkpoint will likely take down your system, but that's also the case with the current ignore-errors-and-hope-for-the-best-after-crash coding for the *next* checkpoint, assuming your damaged filesystem continues to produce errors; it's just more principled, IMHO).

I recognise that this sounds like an absolutist argument that might attract some complaints on practical grounds, but I don't think it really makes too much difference in practice. Consider a typical Linux filesystem: individual errors aren't going to be reported more than once, and full_page_writes must be on with such a filesystem, so we'll be writing out all affected pages again and then trying to fsync them again in the end-of-recovery checkpoint. So, despite our attempt at creating a please-ignore-errors-and-corrupt-my-database-and-play-on mode, you'll likely panic again if the I/O errors persist, or survive and continue without corruption if the error-producing conditions were fleeting and transient (as in the thin provisioning EIO and NFS ENOSPC conditions discussed in other threads).
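For illustration, here is a minimal standalone sketch of what the syncfs() approach boils down to, assuming Linux; the list of top-level directories is hypothetical, and the real patch would also have to discover tablespace mount points under pg_tblspc:

    /*
     * Minimal sketch (not the actual patch): open each top-level
     * directory that might live on a different filesystem and ask the
     * kernel to flush that whole filesystem, pressing on after errors.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static void
    sync_filesystem_of(const char *path)
    {
        int         fd = open(path, O_RDONLY);

        if (fd < 0)
        {
            /* Ignore-errors-and-press-on, as in d8179b00. */
            fprintf(stderr, "could not open \"%s\"\n", path);
            return;
        }

        /*
         * On Linux < 5.8, syncfs() does not report writeback errors,
         * so failures here can be completely silent.
         */
        if (syncfs(fd) < 0)
            fprintf(stderr, "syncfs() failed for \"%s\"\n", path);
        close(fd);
    }

    int
    main(void)
    {
        /* Hypothetical example paths; possibly separate filesystems. */
        const char *dirs[] = {".", "pg_wal", "pg_tblspc/16385"};

        for (int i = 0; i < 3; i++)
            sync_filesystem_of(dirs[i]);

        return 0;
    }

The attraction is that the kernel already knows which pages are dirty, so we don't have to open every file; the trade-off is that unrelated files sharing a filesystem get flushed too.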
About the need to fsync everything in sight on a system that previously ran with fsync=off, as discussed in the earlier threads:

1. The syncfs() patch provides about the same weak guarantee as the current coding. Something like: it can convert all checkpoints that were logged since the kernel started from fiction to fact, except those corrupted by (unlikely) I/O errors, which may be reported only in the kernel logs, if at all.

2. The "precise" fsync() patch provides no such weak guarantee. It takes the last checkpoint at face value, and can't help you with anything that happened before that.

The problem I have with this is that the current coding *only does it for crash scenarios*. So, if you're moving a system from fsync=off to fsync=on with a clean shutdown in between, we already don't help you. Effectively, you need to run sync(1) yourself (but see its man page for caveats, and the kernel logs for errors) to convert your earlier checkpoints from fiction to fact. So all we're really discussing here is what we do with a system that crashed. Why is that a sane time to transition from fsync=off to fsync=on, and, supposing someone does that, why should we offer any more guarantees about it than we do when you make the same transition on a system that shut down cleanly?

[1] https://www.postgresql.org/message-id/flat/4a5d233fcd28b5f1008aec79119b02b5a9357600.camel%40credativ.de
[2] https://www.postgresql.org/message-id/flat/20150114105908.GK5245%40awork2.anarazel.de#1525fab691dbaaef35108016f0b99467
[3] https://www.postgresql.org/message-id/flat/20150523172627.GA24277%40msg.df7cb.de