Re: Two fsync related performance issues? - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Two fsync related performance issues?
Date:
Msg-id: CA+hUKGKHhDNnN6fxf6qrAx9h+mjdNU2Zmx7ztJzFQ0C5=u3QPg@mail.gmail.com
In response to: Re: Two fsync related performance issues? (Thomas Munro <thomas.munro@gmail.com>)
Responses:
  Re: Two fsync related performance issues?
  Re: Two fsync related performance issues?
  Re: Two fsync related performance issues?
List: pgsql-hackers
On Mon, Oct 5, 2020 at 2:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Sep 9, 2020 at 3:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > For the record, Andres Freund mentioned a few problems with this
> > off-list and suggested we consider calling Linux syncfs() for each top
> > level directory that could potentially be on a different filesystem.
> > That seems like a nice idea to look into.
>
> Here's an experimental patch to try that. One problem is that before
> Linux 5.8, syncfs() doesn't report failures[1]. I'm not sure what to
> think about that; in the current coding we just log them and carry on
> anyway, but any theoretical problems that causes for BLK_DONE should
> be moot anyway because of FPIs, which result in more writes and syncs.
> Another is that it may affect other files that aren't under pgdata as
> collateral damage, but that seems acceptable. It also makes me a bit
> sad that this wouldn't help other OSes.

... and for comparison/discussion, here is an alternative patch that figures out precisely which files need to be fsync'd using information in the WAL. On a system with full_page_writes=on, this effectively means that we don't have to do anything at all for relation files, as described in more detail in the commit message. You still need to fsync the WAL files to make sure you can't replay records from the log that were written but not yet fdatasync'd; that is addressed in the patch. I'm not yet sure which other kinds of special files might need special treatment.

Some thoughts:

1. Both patches avoid the need to open many files. With 1 million tables this can take over a minute before replay begins, even on a fast system with warm caches and/or fast local storage, and for a cold system with high-latency storage it can be a serious problem.

2. The syncfs() one is comparatively simple, but it only works on Linux. If you used sync() (or sync(); sync()) instead, you'd be relying on non-POSIX behaviour: POSIX says sync() doesn't wait for completion, and indeed many non-Linux systems really are like that.

3. Though we know of kernel/filesystem pairs that can mark buffers clean while retaining the dirty contents (which would cause corruption with the current code in master, or with either of these patches), fortunately those systems can't possibly run with full_page_writes=off, so such problems are handled the same way torn pages are fixed.

4. Perhaps you could set a flag in the buffer to say 'needs sync' as a way to avoid repeatedly requesting syncs for the same page, but it might not be worth the hassle involved.

Some other considerations that have been mentioned to me by colleagues I talked to about this:

1. The ResetUnloggedRelations() code still has to visit all relation files looking for init forks after a crash. But that turns out to be OK: it only reads directory entries in a straight line, and doesn't stat() or open() files with non-matching names, so unless you have very many unlogged tables the problem is already avoided. (It also calls fsync() on the result, which could perhaps be replaced with a deferred request too; not sure, for another day.)

2. There may be some more directories that need special fsync() treatment. SLRUs are already covered (either handled by checkpoint or they hold ephemeral data), and I think pg_tblspc changes will be redone. pg_logical? I am not sure.