Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: checkpointer continuous flushing
Date: 2016-01-20 10:13:26
Msg-id: 20160120101326.rvao4mcuntxxf7wf@alap3.anarazel.de
In response to: Re: checkpointer continuous flushing (Andres Freund <andres@anarazel.de>)
Responses: Re: checkpointer continuous flushing (Andres Freund <andres@anarazel.de>)
           Re: checkpointer continuous flushing (Alvaro Herrera <alvherre@2ndquadrant.com>)
List: pgsql-hackers
On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > This seems like a problem with the WAL writer quite independent of
> > anything else.  It seems likely to be inadvertent fallout from this
> > patch:
> > 
> > Author: Simon Riggs <simon@2ndQuadrant.com>
> > Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +0000
> > 
> >     Wakeup WALWriter as needed for asynchronous commit performance.
> >     Previously we waited for wal_writer_delay before flushing WAL. Now
> >     we also wake WALWriter as soon as a WAL buffer page has filled.
> >     Significant effect observed on performance of asynchronous commits
> >     by Robert Haas, attributed to the ability to set hint bits on tuples
> >     earlier and so reducing contention caused by clog lookups.
> 
> In addition to that, the "powersaving" effort also plays a role - without
> the latch we'd not wake up at any meaningful rate at all atm.
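
For context, the wakeup that commit added amounts to roughly the
following standalone sketch (not the actual tree's code; all names here
are invented):

#define XLOG_BLCKSZ 8192        /* WAL page size, as in PostgreSQL */

/* stand-in for SetLatch() on the wal writer's latch in the real tree */
static void
wake_wal_writer(void)
{
}

/*
 * Called on asynchronous commit with the current WAL write position:
 * wake the wal writer whenever a whole WAL page has filled since the
 * last wakeup, rather than letting it sleep out wal_writer_delay.
 */
static void
note_async_commit(long write_pos, long *last_wakeup_pos)
{
    if (write_pos / XLOG_BLCKSZ > *last_wakeup_pos / XLOG_BLCKSZ)
    {
        wake_wal_writer();
        *last_wakeup_pos = write_pos;
    }
}

Under a heavy async-commit workload pages fill constantly, so the wal
writer ends up being woken, and flushing, essentially nonstop.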

The relevant thread is at
http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
What I didn't remember is that I voiced concern back then about exactly
this:
http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
;)

Simon: CCed you, as the author of the above commit. Quick summary:
The frequent wakeups of the wal writer can lead to significant performance
regressions in workloads that are bigger than shared_buffers, because the
super-frequent fdatasync()s by the wal writer dramatically slow down
concurrent writes (bgwriter, checkpointer, individual backend writes). To
the point that SIGSTOPing the wal writer takes a pgbench workload from
2995 to 10887 tps.  The reason fdatasync()s cause a slowdown is that they
prevent effective use of queuing to the storage devices.
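
The queuing effect is easy to see with a hypothetical standalone
microbenchmark (not PostgreSQL code; the file name and constants are
invented) that times per-write fdatasync() against one fdatasync() per
batch:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCKS  256
#define BLOCKSZ 8192

static double
time_writes(int fd, int sync_every_write)
{
    static char buf[BLOCKSZ];
    struct timespec start, end;

    memset(buf, 'x', sizeof(buf));
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (int i = 0; i < BLOCKS; i++)
    {
        if (pwrite(fd, buf, BLOCKSZ, (off_t) i * BLOCKSZ) != BLOCKSZ)
        {
            perror("pwrite");
            exit(1);
        }
        if (sync_every_write)
            fdatasync(fd);      /* one cache flush per page */
    }
    if (!sync_every_write)
        fdatasync(fd);          /* one cache flush for the whole batch */

    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) +
        (end.tv_nsec - start.tv_nsec) / 1e9;
}

int
main(void)
{
    int         fd = open("syncbench.dat", O_CREAT | O_TRUNC | O_WRONLY, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    /* both runs overwrite the same 2 MB; only the fdatasync() pattern differs */
    printf("per-write fdatasync: %.3f s\n", time_writes(fd, 1));
    printf("batched fdatasync:   %.3f s\n", time_writes(fd, 0));
    close(fd);
    unlink("syncbench.dat");
    return 0;
}

On typical hardware the batched variant wins by a large factor, because
each fdatasync() drains the device queue before the next write can be
issued - exactly what the concurrent bgwriter/checkpointer writes run
into here.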


On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > If I understand correctly, prior to that commit, WAL writer woke up 5
> > times per second and flushed just that often (unless you changed the
> > default settings).  But as the commit message explained, that turned
> > out to suck - you could make performance go up very significantly by
> > radically decreasing wal_writer_delay.  This commit basically lets it
> > flush at maximum velocity - as fast as we finish one flush, we can
> > start the next.  That must have seemed like a win at the time from the
> > way the commit message was written, but you seem to now be seeing the
> > opposite effect, where performance is suffering because flushes are
> > too frequent rather than too infrequent.  I wonder if there's an ideal
> > flush rate and what it is, and how much it depends on what hardware
> > you have got.
> 
> I think the problem isn't really that it's flushing too much WAL in
> total, it's that it's flushing WAL in too granular a fashion. I suspect
> we want something where we attempt a minimum number of flushes per
> second (presumably tied to wal_writer_delay) and, once that rate is
> exceeded, a minimum number of pages per flush. I think we could even
> continue to write() the data at the same rate as today; we would just
> need to reduce the number of fdatasync()s we issue. And we could
> possibly make the eventual fdatasync()s cheaper by hinting the kernel
> to write the data out earlier.
> 
> Now the question of what the minimum number of pages per flush should be
> (wal_writer_delay-triggered flushes aside) isn't easy to answer. A
> simple model would be to statically tie it to the size of wal_buffers;
> say, don't flush unless at least 10% of XLogBuffers have been written
> since the last flush. More complex approaches would be to measure the
> continuous WAL writeout rate.
> 
> By tying it to both a minimum rate under activity (ensuring things go to
> disk fast) and a minimum number of pages to sync (ensuring a reasonable
> number of cache flush operations) we should be able to mostly accommodate
> the different types of workloads. I think.
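
For illustration, that heuristic could look roughly like the following
standalone sketch (not PostgreSQL code; wal_writer_delay_ms and
wal_buffers_pages are invented stand-ins for the real WalWriterDelay and
XLOGbuffers settings):

#include <stdbool.h>
#include <time.h>

static double
elapsed_ms(struct timespec since)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - since.tv_sec) * 1000.0 +
        (now.tv_nsec - since.tv_nsec) / 1e6;
}

/*
 * Decide whether the wal writer should fdatasync() now.  Flush once
 * either the minimum rate (one flush per wal_writer_delay) or the
 * minimum batch size (10% of wal_buffers written since the last flush)
 * is reached; otherwise keep accumulating pages.
 */
static bool
should_flush(int pages_since_flush, struct timespec last_flush,
             int wal_writer_delay_ms, int wal_buffers_pages)
{
    if (pages_since_flush == 0)
        return false;           /* nothing new to flush */

    /* minimum flush rate under activity (async commits become durable soon) */
    if (elapsed_ms(last_flush) >= wal_writer_delay_ms)
        return true;

    /* minimum number of pages per cache flush operation */
    if (pages_since_flush >= wal_buffers_pages / 10)
        return true;

    return false;
}

The 10% threshold is the static wal_buffers-tied model from above; a
rate-measuring approach would replace that second test.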

That approach unfortunately leaves out part of the reasoning for the
above commit: we want WAL to be flushed quickly, so that we can
immediately set hint bits.

One, relatively extreme, approach would be to continue *writing* WAL in
the wal writer as today, but use rules like those suggested above to
guide the actual flushing, additionally using operations like
sync_file_range() (and equivalents on other OSs).  Then, to address the
regression of SetHintBits() having to bail out more often, actually
trigger a WAL flush whenever WAL is already written but not yet flushed.
That has the potential to be bad in a number of other cases though :(
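
FWIW, a minimal sketch of that write-eagerly/flush-lazily split, assuming
Linux (sync_file_range() and SYNC_FILE_RANGE_WRITE are the real syscall
and flag; the surrounding function names are invented):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Write a WAL page and immediately hint the kernel to start writeback,
 * without waiting for completion.  The later fdatasync() then mostly
 * waits on I/O that is already in flight, making the cache flush cheap.
 */
static void
wal_write_and_hint(int fd, const void *buf, size_t len, off_t offset)
{
    if (pwrite(fd, buf, len, offset) != (ssize_t) len)
        return;                 /* error handling elided in this sketch */

    /* initiate writeback of just this range; does not flush the device cache */
    sync_file_range(fd, offset, (off64_t) len, SYNC_FILE_RANGE_WRITE);
}

/* once enough pages have accumulated, pay for a single real flush */
static void
wal_flush_batch(int fd)
{
    fdatasync(fd);
}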

Andres


