Re: Two fsync related performance issues? - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Two fsync related performance issues?
Date
Msg-id CA+hUKG+8pys+bLSoN6C-F1Nss2BiJfx43VyO5jL+P-whSMPQpg@mail.gmail.com
In response to Re: Two fsync related performance issues?  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On Wed, Oct 7, 2020 at 6:17 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Oct 5, 2020 at 2:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Wed, Sep 9, 2020 at 3:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > For the record, Andres Freund mentioned a few problems with this
> > > off-list and suggested we consider calling Linux syncfs() for each top
> > > level directory that could potentially be on a different filesystem.
> > > That seems like a nice idea to look into.

> ... and for comparison/discussion, here is an alternative patch that
> figures out precisely which files need to be fsync'd using information
> in the WAL. [...]

Michael Banck reported[1] a system that spent 20 minutes in
SyncDataDirectory().  His summary caused me to go back and read the
discussions[1][2] that produced the current behaviour via commits
2ce439f3 and d8179b00, and I wanted to add a couple more observations
about the two draft patches mentioned above.

About the need to sync files that were dirtied during a previous run:

1.  The syncfs() patch has the same ignore-errors-and-press-on
behaviour as d8179b00 gave us, though on Linux < 5.8 it would not even
report the errors at LOG level.

2.  The "precise" fsync() patch defers the work until after redo, but
if you get errors while processing the queued syncs, you would not be
able to complete the end-of-recovery checkpoint.  This is correct
behaviour in my opinion; any such checkpoint that is allowed to
complete would be a lie, and would make the corruption permanent,
releasing the WAL that was our only hope of recovering.  If you want
to run a so-damaged system for forensic purposes, you could always
bring it up with fsync=off, or consider the idea from a nearby thread
to allow the end-of-recovery checkpoint to be disabled (then the
eventual first checkpoint will likely take down your system; but
assuming your damaged filesystem continues to produce errors, that is
already the case for the *next* checkpoint under the current
ignore-errors-and-hope-for-the-best-after-crash coding, so failing
earlier is just more principled, IMHO).

I recognise that this sounds like an absolutist argument that might
attract some complaints on practical grounds, but I don't think it
makes much difference in practice.  Consider a typical Linux
filesystem: individual errors aren't going to be reported more than
once, and full_page_writes must be enabled on such a system, so we'll
be writing out all affected pages again and then trying to fsync again
in the end-of-recovery checkpoint.  So despite our attempt at creating
a please-ignore-errors-and-corrupt-my-database-and-play-on mode,
you'll likely panic again if I/O errors persist, or survive and
continue without corruption if the error-producing conditions were
fleeting and transient (as in the thin-provisioning EIO or NFS ENOSPC
conditions discussed in other threads).

About the need to fsync everything in sight on a system that
previously ran with fsync=off, as discussed in the earlier threads:

1.  The syncfs() patch provides about the same weak guarantee as the
current coding.  Something like: it can convert all checkpoints that
were logged in the time since the kernel started from fiction to fact,
except those corrupted by (unlikely) I/O errors, which may be
reported only in the kernel logs, if at all.

2.  The "precise" fsync() patch provides no such weak guarantee.  It
takes the last checkpoint at face value, and can't help you with
anything that happened before that.

The problem I have with this is that the current coding *only does it
for crash scenarios*.  So, if you're moving a system from fsync=off to
fsync=on with a clean shutdown in between, we already don't help you.
Effectively, you need to run sync(1) (but see the man page for
caveats, and the kernel logs for errors) to convert your earlier
checkpoints from
fiction to fact.  So all we're discussing here is what we do with a
system that crashed.  Why is that a sane time to transition from
fsync=off to fsync=on, and, supposing someone does that, why should we
offer any more guarantees about that than we do when you make the same
transition on a system that shut down cleanly?

[1] https://www.postgresql.org/message-id/flat/4a5d233fcd28b5f1008aec79119b02b5a9357600.camel%40credativ.de
[2] https://www.postgresql.org/message-id/flat/20150114105908.GK5245%40awork2.anarazel.de#1525fab691dbaaef35108016f0b99467
[3] https://www.postgresql.org/message-id/flat/20150523172627.GA24277%40msg.df7cb.de


