Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: checkpointer continuous flushing
Date:
Msg-id: 20151111073857.GX32209@alap3.anarazel.de
In response to: Re: checkpointer continuous flushing (Fabien COELHO <coelho@cri.ensmp.fr>)
Responses: Re: checkpointer continuous flushing (Fabien COELHO <coelho@cri.ensmp.fr>)
           Re: checkpointer continuous flushing (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On 2015-09-10 17:15:26 +0200, Fabien COELHO wrote:
> Here is a v13, which is just a rebase after 1aba62ec.

And here's v14. It's not entirely ready yet. A lot of details have
changed and I unfortunately don't remember them all. But there are more
important things to discuss than the details of the patch.

I've played *a lot* with this patch. I found a bunch of issues:

1) The FileFlushContext infrastructure isn't actually correct. There
   are two problems: First, using the raw 'fd' number to reference a
   to-be-flushed file isn't meaningful. If there are lots of files open,
   fds get reused within fd.c. That part is easily fixed by referencing
   the File instead of the fd. The bigger problem is that the
   infrastructure doesn't deal with files being closed: smgr
   invalidations, which aren't that hard to trigger, can cause the smgr
   handle and thus the underlying file to be closed. (See the sketch
   after this list.)

   I think this means that the entire flushing infrastructure actually
   needs to be hoisted up, onto the smgr/md level.

2) I noticed that sync_file_range() blocked far more often than I'd
   expected. Reading the kernel code, that turned out to be caused by a
   pessimization introduced into the kernel years ago: in many
   situations SFR_WRITE waited for the writes to complete. A fix for
   this will be in the 4.4 kernel.

3) I found that latency wasn't improved much for workloads that are
   significantly bigger than shared buffers. The problem here is that
   neither bgwriter nor the backends have, so far, done
   sync_file_range() calls. That means the old problem of having
   gigabytes of dirty data that periodically get flushed out still
   exists. Having these processes do flushes as well mostly attacks
   that problem.

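To make 1) a bit more concrete, here's a minimal sketch of the kind of
writeback accumulation I mean. This is not the patch's code; FlushContext
and context_flush() are made-up names, and the real thing would have to
reference a File (or live at the smgr/md level) rather than a raw fd:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

/* accumulated, not-yet-issued writeback request for one file */
typedef struct FlushContext
{
    int         fd;         /* stand-in for a File/smgr reference */
    off_t       offset;     /* start of the accumulated dirty range */
    off_t       nbytes;     /* length of the accumulated dirty range */
} FlushContext;

/*
 * Ask the kernel to start writeback of the accumulated range without
 * waiting for completion.  SYNC_FILE_RANGE_WRITE is supposed to be
 * asynchronous; cf. 2) above for why older kernels block here anyway.
 */
static void
context_flush(FlushContext *ctx)
{
    if (ctx->nbytes == 0)
        return;
    (void) sync_file_range(ctx->fd, ctx->offset, ctx->nbytes,
                           SYNC_FILE_RANGE_WRITE);
    ctx->nbytes = 0;
}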

Benchmarking revealed that for workloads where the hot data set mostly
fits into shared buffers, flushing and sorting are anywhere from a small
to a massive improvement, both in throughput and latency. That's true
even without the kernel fix from 2), although fixing that improves
things further.



What I did not expect, and what confounded me for a long while, is that
for workloads where the hot data set does *NOT* fit into shared buffers,
sorting often led to a noticeable reduction in throughput, up to
30%. The performance was still much more regular than before, i.e. no
more multi-second periods without any transactions happening.

By now I think I know what's going on: before the sorting portion of
the patch, the write loop in BufferSync() started at the current clock
hand, obtained via StrategySyncStart(). After the sorting change that
obviously doesn't happen anymore; buffers are written in their sort
order. By starting at the current clock hand and moving on from there,
the checkpointer basically made it less likely that victim buffers
need to be written either by the backends themselves or by
bgwriter. That means that the sorted checkpoint writes can, indirectly,
increase the number of unsorted writes by other processes :(
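
To illustrate, here's roughly the pre-sorting behaviour, heavily
simplified from BufferSync(); needs_checkpoint_write() is just a
placeholder for the BM_CHECKPOINT_NEEDED check:

/* old behaviour: start at the clock hand and wrap around once */
buf_id = StrategySyncStart(NULL, NULL);
for (i = 0; i < NBuffers; i++)
{
    if (needs_checkpoint_write(buf_id))
        SyncOneBuffer(buf_id, false);
    if (++buf_id >= NBuffers)
        buf_id = 0;
}

/*
 * With sorting the writes instead happen in (tablespace, relation,
 * fork, block) order, so the buffers just in front of the clock hand,
 * i.e. the next eviction victims, are no longer cleaned preferentially.
 */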

My benchmarking suggests that this effect gets larger the shorter the
checkpoint timeout is. That intuitively makes sense, given the above
attempt at an explanation: if the checkpoint takes longer, the clock
hand will almost certainly soon overtake the checkpoint's 'implicit'
hand.

I'm not sure if we can really do anything about this problem. While I'm
pretty jet-lagged, I still spent a fair amount of time thinking about
it. It seems to suggest that we need to bring back the setting to
enable/disable sorting :(


What I think needs to happen next with the patch is:
1) Hoist the FileFlushContext stuff up into the smgr layer, carefully
   handling the issue of smgr invalidations.
2) Replace the boolean checkpoint_flush_to_disk GUC with a list GUC
   that can later contain multiple elements like checkpoint, bgwriter,
   backends, ddl, bulk-writes. That seems better than adding separate
   GUCs for each of these. Then make the flush locations in the patch
   configurable using that (see the sketch after this list).
3) I think we should remove the sort timing from the checkpoint logging
   before commit. It'll always be pretty short.

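For 2), a rough sketch of how such a list value could be interpreted;
the struct, flag names and parse_flush_targets() are made up, and the
real implementation would of course go through the usual GUC
check/assign hooks rather than strtok():

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct FlushTargets
{
    bool        checkpointer;
    bool        bgwriter;
    bool        backends;
} FlushTargets;

/* parse e.g. "checkpoint, bgwriter" into a set of flags */
static FlushTargets
parse_flush_targets(const char *value)
{
    FlushTargets targets = {false, false, false};
    char       *copy = strdup(value);
    char       *tok;

    for (tok = strtok(copy, ", "); tok != NULL; tok = strtok(NULL, ", "))
    {
        if (strcmp(tok, "checkpoint") == 0)
            targets.checkpointer = true;
        else if (strcmp(tok, "bgwriter") == 0)
            targets.bgwriter = true;
        else if (strcmp(tok, "backends") == 0)
            targets.backends = true;
        else
            fprintf(stderr, "unrecognized flush target \"%s\"\n", tok);
    }
    free(copy);
    return targets;
}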

Greetings,

Andres Freund

