Re: Two fsync related performance issues? - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Two fsync related performance issues?
Date
Msg-id CA+hUKGKHhDNnN6fxf6qrAx9h+mjdNU2Zmx7ztJzFQ0C5=u3QPg@mail.gmail.com
Whole thread Raw
In response to Re: Two fsync related performance issues?  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Two fsync related performance issues?
Re: Two fsync related performance issues?
Re: Two fsync related performance issues?
List pgsql-hackers
On Mon, Oct 5, 2020 at 2:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Sep 9, 2020 at 3:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > For the record, Andres Freund mentioned a few problems with this
> > off-list and suggested we consider calling Linux syncfs() for each top
> > level directory that could potentially be on a different filesystem.
> > That seems like a nice idea to look into.
>
> Here's an experimental patch to try that.  One problem is that before
> Linux 5.8, syncfs() doesn't report failures[1].  I'm not sure what to
> think about that; in the current coding we just log them and carry on
> anyway, but any theoretical problems that causes for BLK_DONE should
> be moot anyway because of FPIs which result in more writes and syncs.
> Another is that it may affect other files that aren't under pgdata as
> collateral damage, but that seems acceptable.  It also makes me a bit
> sad that this wouldn't help other OSes.

... and for comparison/discussion, here is an alternative patch that
figures out precisely which files need to be fsync'd using information
in the WAL.  On a system with full_page_writes=on, this effectively
means that we don't have to do anything at all for relation files, as
described in more detail in the commit message.  You still need to
fsync the WAL files to make sure you can't replay records from the log
that were written but not yet fdatasync'd, addressed in the patch.
I'm not yet sure which other kinds of special files might need special
treatment.

Some thoughts:
1.  Both patches avoid the need to open many files.  With 1 million
tables this can take over a minute even on a fast system with warm
caches and/or fast local storage, before replay begins, and for a cold
system with high latency storage it can be a serious problem.
2.  The syncfs() one is comparatively simple, but it only works on
Linux.  If you used sync() (or sync(); sync()) instead, you'd be
relying on non-POSIX behaviour, because POSIX says it doesn't wait for
completion and indeed many non-Linux systems really are like that.
3.  Though we know of kernel/filesystem pairs that can mark buffers
clean while retaining the dirty contents (which would cause corruption
with the current code in master, or either of these patches),
fortunately those systems can't possibly run with full_page_writes=off
so such problems are handled the same way torn pages are fixed.
4.  Perhaps you could set a flag in the buffer to say 'needs sync' as
a way to avoid repeatedly requesting syncs for the same page, but it
might not be worth the hassle involved.

Some other considerations that have been mentioned to me by colleagues
I talked to about this:
1.  The ResetUnloggedRelations() code still has to visit all relation
files looking for init forks after a crash.  But that turns out to be
OK, it only reads directory entries in a straight line.  It doesn't
stat() or open() files with non-matching names, so unless you have
very many unlogged tables, the problem is already avoided.  (It also
calls fsync() on the result, which could perhaps be replaced with a
deferred request too, not sure, for another day.)
2.  There may be some more directories that need special fsync()
treatment.  SLRUs are already covered (either handled by checkpoint or
they hold ephemeral data), and I think pg_tblspc changes will be
redone.  pg_logical?  I am not sure.

Attachment

pgsql-hackers by date:

Previous
From: Dilip Kumar
Date:
Subject: Re: [HACKERS] Custom compression methods
Next
From: Amit Kapila
Date:
Subject: Re: [HACKERS] logical decoding of two-phase transactions