Re: Two fsync related performance issues? - Mailing list pgsql-hackers

From Michael Banck
Subject Re: Two fsync related performance issues?
Date
Msg-id 861c934c113941cd960974b2ca578f8137d656f0.camel@credativ.de
In response to Re: Two fsync related performance issues?  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
Hi,

On Wednesday, 07.10.2020 at 18:17 +1300, Thomas Munro wrote:
> On Mon, Oct 5, 2020 at 2:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Wed, Sep 9, 2020 at 3:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > For the record, Andres Freund mentioned a few problems with this
> > > off-list and suggested we consider calling Linux syncfs() for each top
> > > level directory that could potentially be on a different filesystem.
> > > That seems like a nice idea to look into.
> > 
> > Here's an experimental patch to try that.  One problem is that before
> > Linux 5.8, syncfs() doesn't report failures[1].  I'm not sure what to
> > think about that; in the current coding we just log them and carry on
> > anyway, but any theoretical problems that causes for BLK_DONE should
> > be moot anyway because of FPIs which result in more writes and syncs.
> > Another is that it may affect other files that aren't under pgdata as
> > collateral damage, but that seems acceptable.  It also makes me a bit
> > sad that this wouldn't help other OSes.
> 
> ... and for comparison/discussion, here is an alternative patch that
> figures out precisely which files need to be fsync'd using information
> in the WAL.  On a system with full_page_writes=on, this effectively
> means that we don't have to do anything at all for relation files, as
> described in more detail in the commit message.  You still need to
> fsync the WAL files to make sure you can't replay records from the log
> that were written but not yet fdatasync'd, addressed in the patch.
> I'm not yet sure which other kinds of special files might need special
> treatment.
> 
> Some thoughts:
> 1.  Both patches avoid the need to open many files.  With 1 million
> tables this can take over a minute even on a fast system with warm
> caches and/or fast local storage, before replay begins, and for a cold
> system with high latency storage it can be a serious problem.

You mention "serious problem" and "over a minute", but I don't recall
you mentioning how long it takes with those two patches (or maybe I
missed it), so here goes:

I created an instance with 250 databases on 250 tablespaces on SAN
storage. This is on 12.4; the patches can be back-patched with minimal
changes.
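
For anyone wanting to reproduce a similar layout, here is a minimal
sketch that generates the SQL. It is not the script used for the test
above: the base directory and the ts_NNN/db_NNN names are hypothetical,
and the tablespace directories are assumed to already exist, be empty,
and be owned by the postgres OS user.

```python
def setup_statements(n=250, base_dir="/mnt/san/tblspc"):
    """Yield SQL that creates n tablespaces and one database on each.

    base_dir and the ts_/db_ naming scheme are placeholders chosen for
    illustration; adjust them to the actual mount points.
    """
    for i in range(1, n + 1):
        ts = f"ts_{i:03d}"
        db = f"db_{i:03d}"
        # One tablespace per directory, then one database living on it.
        yield f"CREATE TABLESPACE {ts} LOCATION '{base_dir}/{ts}';"
        yield f"CREATE DATABASE {db} TABLESPACE {ts};"


if __name__ == "__main__":
    # Print the statements so they can be piped into psql.
    for stmt in setup_statements():
        print(stmt)
```

The output can be fed to psql in autocommit mode, since neither
statement can run inside a transaction block.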

After pg_ctl stop -m immediate, a pg_ctl start -w (or rather the time
between the two log messages "database system was interrupted; last
known up at %s" and "database system was not properly shut down;
automatic recovery in progress") took

1. 12-13 seconds on vanilla
2. usually < 10 ms, sometimes 70-80 ms with the syncfs patch
3. 4 ms with the optimized sync patch
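
The interval between the two log messages can be extracted with
something like the sketch below. It assumes that log_line_prefix starts
with %m, so that every line begins with a millisecond timestamp such as
"2020-10-07 10:15:02.123"; the function names are made up for
illustration, and other prefix formats would need a different parser.

```python
from datetime import datetime

START_MSG = "database system was interrupted; last known up at"
END_MSG = ("database system was not properly shut down; "
           "automatic recovery in progress")


def parse_ts(line):
    # Assumption: the line starts with a %m timestamp, 23 characters
    # long, e.g. "2020-10-07 10:15:02.123"; the time zone is ignored.
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")


def startup_sync_seconds(lines):
    """Return seconds between the two startup messages, or None."""
    start = None
    for line in lines:
        if START_MSG in line:
            start = parse_ts(line)
        elif END_MSG in line and start is not None:
            return (parse_ts(line) - start).total_seconds()
    return None
```

Calling startup_sync_seconds(open(path)) on the server log returns the
elapsed time for the first such startup found in the file.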

That's a dramatic improvement, but probably also a best-case scenario,
as no traffic happened since the last checkpoint. I ran some light
pgbench before killing the server again, but couldn't get the optimized
sync patch to take any longer. I might try harder at some point, but
won't have much more time to test today.


Michael

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
VAT ID: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Management: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to the
following provisions: https://www.credativ.de/datenschutz



