Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From Andres Freund
Subject Re: checkpointer continuous flushing
Date 2016-01-20
Msg-id 20160120140220.iidxqnkx73k2ahd5@alap3.anarazel.de
In response to Re: checkpointer continuous flushing  (Andres Freund <andres@anarazel.de>)
Responses Re: checkpointer continuous flushing  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 2016-01-20 11:13:26 +0100, Andres Freund wrote:
> On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> > On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > I think the problem isn't really that it's flushing too much WAL in
> > total, it's that it's flushing WAL in a too granular fashion. I suspect
> > we want something where we attempt a minimum number of flushes per
> > second (presumably tied to wal_writer_delay) and, once exceeded, a
> > minimum number of pages per flush. I think we even could continue to
> > write() the data at the same rate as today, we just would need to reduce
> > the number of fdatasync()s we issue. And possibly could make the
> > eventual fdatasync()s cheaper by hinting the kernel to write them out
> > earlier.
> >
> > Now the question of what minimum number of pages we want to flush
> > (setting wal_writer_delay triggered ones aside) isn't easy to answer. A
> > simple model would be to statically tie it to the size of wal_buffers;
> > say, don't flush unless at least 10% of XLogBuffers have been written
> > since the last flush. More complex approaches would be to measure the
> > continuous WAL writeout rate.
> >
> > By tying it to both a minimum rate under activity (ensuring things go to
> > disk fast) and a minimum number of pages to sync (ensuring a reasonable
> > number of cache flush operations) we should be able to mostly accommodate
> > the different types of workloads. I think.
>
> This unfortunately leaves out part of the reasoning for the above
> commit: We want WAL to be flushed fast, so we can immediately set hint
> bits.
>
> One, relatively extreme, approach would be to continue *writing* WAL in
> the background writer as today, but use rules like suggested above
> guiding the actual flushing. Additionally using operations like
> sync_file_range() (and equivalents on other OSs).  Then, to address the
> regression of SetHintBits() having to bail out more often, actually
> trigger a WAL flush whenever WAL is already written, but not flushed.
> That has the potential to be bad in a number of other cases tho :(
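
To make the flush heuristic above a bit more concrete, here's a minimal
sketch. The helper and constants (flush_now, pages_since_flush, the 10%
batch expressed as XLOGbuffers / 10) are made up for illustration, not
actual walwriter code:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct FlushState
    {
        int64_t last_flush_ms;      /* when we last issued fdatasync() */
        int     pages_since_flush;  /* pages write()n but not yet flushed */
    } FlushState;

    static const int wal_writer_delay_ms = 200; /* GUC: wal_writer_delay */
    static const int XLOGbuffers = 512;         /* GUC: wal_buffers, in pages */

    /*
     * Keep write()ing at the same rate as today, but only fdatasync() once
     * a minimum batch has accumulated -- unless wal_writer_delay has
     * elapsed, which bounds flush latency under low activity.
     */
    static bool
    flush_now(const FlushState *st, int64_t now_ms)
    {
        /* minimum flushes per second, tied to wal_writer_delay */
        if (now_ms - st->last_flush_ms >= wal_writer_delay_ms)
            return st->pages_since_flush > 0;

        /* otherwise require a minimum batch: 10% of wal_buffers */
        return st->pages_since_flush >= XLOGbuffers / 10;
    }

    int
    main(void)
    {
        FlushState st = {0, 20};

        printf("%d\n", flush_now(&st, 100)); /* 0: batch below 10% threshold */
        printf("%d\n", flush_now(&st, 250)); /* 1: wal_writer_delay exceeded */
        return 0;
    }

And the kind of kernel hinting meant by sync_file_range(): asking the
kernel to start writeback on WAL that's already been write()n, so the
eventual fdatasync() has less dirty data to wait for. Linux-only;
hint_wal_writeback is a hypothetical helper:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /*
     * Initiate writeback for a just-written WAL range.  This does not make
     * the data durable -- a later fdatasync() is still required -- it only
     * makes that eventual flush cheaper.
     */
    static int
    hint_wal_writeback(int fd, off_t offset, off_t nbytes)
    {
        return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    }

    int
    main(void)
    {
        int fd = open("dummy_wal_segment", O_WRONLY | O_CREAT, 0600);

        if (fd < 0)
            return 1;
        /* pretend we just write()d 64kB of WAL at offset 0 */
        if (hint_wal_writeback(fd, 0, 64 * 1024) != 0)
            perror("sync_file_range");
        close(fd);
        return 0;
    }

Neither call blocks on durability; durability still comes from the (now
rarer) fdatasync()s.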

Chatting on IM with Heikki, I noticed that we're pretty pessimistic in
SetHintBits(). Namely, we don't set the bit if XLogNeedsFlush(commitLSN),
because we can't easily set the LSN. But it's actually fairly common
that the page's LSN is already newer than the commitLSN - in which case,
afaics, we can just go ahead and set the hint bit, no?

So, instead of

    if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
        return;             /* not flushed yet, so don't set hint */

we do

    if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN)
        && BufferGetLSNAtomic(buffer) < commitLSN)
        return;             /* not flushed yet, so don't set hint */
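
To illustrate why the additional comparison is safe: once the page's LSN
is past the commitLSN, WAL-before-data means the commit record reaches
disk before the page can, so the hint bit can't become visible
prematurely after a crash. A toy model of the check (stubbed LSNs,
nothing here is actual PostgreSQL code):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t XLogRecPtr;

    static XLogRecPtr flushed_upto = 1000; /* stub: WAL durable through here */

    static bool
    XLogNeedsFlush(XLogRecPtr lsn)
    {
        return lsn > flushed_upto;
    }

    /*
     * Old rule: never set the hint while the commit record is unflushed.
     * New rule: additionally allow it when the page LSN is past the
     * commit LSN, since writing the page out forces that WAL to be
     * flushed first anyway.
     */
    static bool
    can_set_hint(XLogRecPtr commit_lsn, XLogRecPtr page_lsn)
    {
        if (XLogNeedsFlush(commit_lsn) && page_lsn < commit_lsn)
            return false;       /* not flushed yet, so don't set hint */
        return true;
    }

    int
    main(void)
    {
        printf("%d\n", can_set_hint(1500, 1200)); /* 0: must still skip */
        printf("%d\n", can_set_hint(1500, 1600)); /* 1: newly allowed */
        return 0;
    }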

In my tests with pgbench -s 100 and 2GB of shared buffers, that recovers
a large portion of the hint bit sets that we currently skip.

Right now, on my laptop, I get (-M prepared -c 32 -j 32):
current wal-writer                              12827 tps, 95 % IO util, 93 % CPU
no flushing in wal writer *                     13185 tps, 46 % IO util, 93 % CPU
no flushing in wal writer & above change        16366 tps, 41 % IO util, 95 % CPU
flushing in wal writer & above change:          14812 tps, 94 % IO util, 95 % CPU

* sometimes the results initially were much lower, with lots of lock
  contention. I can't figure out why that's only sometimes the case. In
  those cases the results were more like 8967 tps.

These aren't meant as thorough benchmarks, just to provide some
orientation.


Now, that solution won't improve every situation. E.g. a workload that
inserts a lot of rows in one transaction, and only does inserts, probably
won't benefit all that much. But it still seems like a pretty good
mitigation strategy. I hope that with a smarter write strategy (getting
that 50% reduction in IO util) and the above change we should be ok.

Andres


