Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: checkpointer continuous flushing
Date
Msg-id alpine.DEB.2.10.1511111005220.1004@sto
In response to Re: checkpointer continuous flushing  (Andres Freund <andres@anarazel.de>)
Responses Re: checkpointer continuous flushing
List pgsql-hackers
Hello Andres,

> And here's v14. It's not something entirely ready.

I'm going to have a careful look at it.

> A lot of details have changed, I unfortunately don't remember them all. 
> But there are more important things than the details of the patch.
>
> I've played *a lot* with this patch. I found a bunch of issues:
>
> 1) The FileFlushContext infrastructure isn't actually
>   correct. There are two problems: First, using the actual 'fd' number to
>   reference a to-be-flushed file isn't meaningful. If there are lots
>   of files open, fds get reused within fd.c.

Hmm.

My assumption is that a file being used (i.e. with modified pages, being 
used for writes...) would not be closed before everything is cleared...

After some poking in the code, I think that this issue may indeed be 
there; although the probability of hitting it is close to 0, it is alas not 
0 :-)

To fix it, ISTM that it would be enough to hold a "do not close" lock on the 
file while a flush is in progress (a short time), which would prevent 
mdclose from doing its stuff.
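
To make the idea concrete, here is a rough sketch in C. FileDoNotClose, 
FileAllowClose and FileWriteback are hypothetical names used only for 
illustration, not existing fd.c functions, and the struct is not the 
patch's actual FileFlushContext:

/*
 * Rough sketch of the "do not close" idea -- the helper functions below
 * are hypothetical, not existing fd.c API.  The point: while a range sits
 * in a pending-flush list, a "keep open" counter on the underlying file
 * keeps mdclose() (and fd recycling) from invalidating the handle until
 * the flush has actually been issued.
 */
#include <sys/types.h>

typedef int File;               /* fd.c-style virtual file handle */

typedef struct PendingFlush
{
    File    file;               /* handle kept "pinned" until flushed */
    off_t   offset;             /* start of range to flush */
    off_t   nbytes;             /* length of range to flush */
} PendingFlush;

/* hypothetical fd.c additions, declared here only for the sketch */
extern void FileDoNotClose(File file);   /* bump keep-open counter */
extern void FileAllowClose(File file);   /* drop keep-open counter */
extern void FileWriteback(File file, off_t offset, off_t nbytes);

static void
schedule_flush(PendingFlush *pf, File file, off_t offset, off_t nbytes)
{
    FileDoNotClose(file);       /* file may not be closed/reused from here on */
    pf->file = file;
    pf->offset = offset;
    pf->nbytes = nbytes;
}

static void
perform_flush(PendingFlush *pf)
{
    FileWriteback(pf->file, pf->offset, pf->nbytes);    /* issue the hint */
    FileAllowClose(pf->file);   /* flush issued, closing is safe again */
}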

> That part is easily enough fixed by referencing File instead of the fd. The 
> bigger problem is that the infrastructure doesn't deal with files being closed. 
> There can, and it isn't that hard to trigger, be smgr invalidations 
> causing the smgr handle and thus the file to be closed.
>
> I think this means that the entire flushing infrastructure actually
> needs to be hoisted up, onto the smgr/md level.

Hmmm. I'm not sure that it is necessary, see my suggestion above.

> 2) I noticed that sync_file_range() blocked far more often than I'd
>   expected. Reading the kernel code, that turned out to be caused by a
>   pessimization in the kernel introduced years ago - in many situations
>   SFR_WRITE waited for the writes. A fix for this will be in the 4.4
>   kernel.

Alas, Pg cannot do much about issues in the kernel.
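
For reference, this is roughly the kind of call involved (a minimal 
Linux-only sketch, not the patch's code): SYNC_FILE_RANGE_WRITE is meant 
to start writeback without waiting, which is exactly the behaviour the 
kernel pessimization broke.

/* Minimal Linux-only sketch: ask the kernel to start writeback of a file
 * range without waiting for completion.  Per the above, older kernels may
 * still block in this call; the fix is said to land in 4.4. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

static int
hint_writeback(int fd, off_t offset, off_t nbytes)
{
    /* SYNC_FILE_RANGE_WRITE: initiate writeback, do not wait for it */
    if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) != 0)
    {
        perror("sync_file_range");
        return -1;
    }
    return 0;
}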

> 3) I found that latency wasn't improved much for workloads that are
>   significantly bigger than shared buffers. The problem here is that
>   neither bgwriter nor the backends have, so far, done
>   sync_file_range() calls. That meant that the old problem of having
>   gigabytes of dirty data that periodically get flushed out, still
>   exists. Having these do flushes mostly attacks that problem.

I'm conscious that the patch only addresses *checkpointer* writes, not 
those from the bgwriter or backends. I agree that these will need to be 
addressed at some point as well, but given the time it takes to get a patch 
through (the more complex the slower; sort proposals are 10 years old), 
I think this should be postponed for later.

> Benchmarking revealed that for workloads where the hot data set mostly
> fits into shared buffers, flushing and sorting are anywhere from a small
> to a massive improvement, both in throughput and latency. Even without
> the patch from 2), although fixing that improves things further.

This is consistent with my experiments: sorting improves things, and 
flushing on top of sorting improves things further.

> What I did not expect, and what confounded me for a long while, is that
> for workloads where the hot data set does *NOT* fit into shared buffers,
> sorting often led to a noticeable reduction in throughput. Up to
> 30%.

I did not see such behavior in the many tests I ran. Could you share more 
precise details so that I can try to reproduce this performance 
regression? (available memory, shared buffers, db size, ...).

> The performance was still much more regular than before, i.e. no
> more multi-second periods without any transactions happening.
>
> By now I think I know what's going on: Before the sorting portion of the
> patch the write-loop in BufferSync() starts at the current clock hand,
> by using StrategySyncStart(). But after the sorting that obviously
> doesn't happen anymore - buffers are accessed in their sort order. By
> starting at the current clock hand and moving on from there the
> checkpointer basically makes it less likely that victim buffers
> need to be written either by the backends themselves or by
> bgwriter. That means that the sorted checkpoint writes can, indirectly,
> increase the number of unsorted writes by other processes :(

I'm quite surprised at such a large effect on throughput, though.

This explanation seems to suggest that if bgwriter/backend writes were 
sorted and/or coordinated with the checkpointer somehow, then all would be 
well?

ISTM that this explanation could be checked by looking at whether 
bgwriter/backend writes are especially large compared to checkpointer 
writes in those cases with reduced throughput. The data is in the log.
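
For readers following along, here is a much-simplified paraphrase (not 
actual PostgreSQL source; buffer_needs_checkpoint_write() is a made-up 
placeholder) of the pre-sorting write loop Andres describes:

/*
 * Much-simplified paraphrase of the pre-sorting checkpoint write loop,
 * for illustration only.  Starting the scan at the clock-sweep position
 * returned by StrategySyncStart() means the buffers just ahead of the
 * hand -- the next victim candidates -- tend to get written (and thus
 * cleaned) by the checkpointer before backends or the bgwriter reach
 * them.  Writing in sorted order loses that side effect.
 */
int buf_id = StrategySyncStart(NULL, NULL);     /* current clock hand */
int num_to_scan = NBuffers;

while (num_to_scan-- > 0)
{
    if (buffer_needs_checkpoint_write(buf_id))  /* placeholder check */
        SyncOneBuffer(buf_id, false);           /* write it out */

    if (++buf_id >= NBuffers)
        buf_id = 0;                             /* wrap around the pool */
}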



> My benchmarking suggests that the effect is larger the shorter the
>   checkpoint timeout is.

Hmmm. The shorter the timeout, the more likely sorting is NOT to be 
effective, the more likely we are to go back to random I/Os, and maybe to 
see some effect of the sync strategy stuff.

> That seems to intuitively make sense, given the above explanation 
> attempt. If the checkpoint takes longer, the clock hand will almost 
> certainly soon overtake the checkpoint's 'implicit' hand.
>
> I'm not sure if we can really do anything about this problem. While I'm
> pretty jet lagged, I still spent a fair amount of time thinking about
> it. Seems to suggest that we need to bring back the setting to
> enable/disable sorting :(
>
>
> What I think needs to happen next with the patch is:
> 1) Hoist up the FileFlushContext stuff into the smgr layer. Carefully
>   handling the issue of smgr invalidations.

Not sure that much is necessary, see above.

> 2) Replace the boolean checkpoint_flush_to_disk GUC with a list guc that
>   later can contain multiple elements like checkpoint, bgwriter,
>   backends, ddl, bulk-writes. That seems better than adding GUCs for
>   these separately. Then make the flush locations in the patch
>   configurable using that.

My 0.02€ on this point: I have not seen much of this style of GUC 
elsewhere. The only ones I found while scanning the postgres configuration 
file are *_path and *_libraries. It seems to me that this would depart 
significantly from the usual style, so one GUC per case, or one shared GUC 
with only on/off, would blend in more cleanly with the usual style.
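
For what it's worth, a standalone sketch (plain C, not the real GUC 
machinery; the element and field names are made up for illustration) of 
what parsing such a list value could look like:

/* Hypothetical sketch of parsing a list-style setting such as
 * "checkpointer, bgwriter" into per-writer flags.  Plain C, not the
 * actual GUC code; names are invented for the example. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct flush_targets
{
    bool checkpointer;
    bool bgwriter;
    bool backends;
};

static bool
parse_flush_targets(const char *value, struct flush_targets *out)
{
    char  buf[256];
    char *save = NULL;

    memset(out, 0, sizeof(*out));
    snprintf(buf, sizeof(buf), "%s", value);

    for (char *tok = strtok_r(buf, ", ", &save); tok != NULL;
         tok = strtok_r(NULL, ", ", &save))
    {
        if (strcmp(tok, "checkpointer") == 0)
            out->checkpointer = true;
        else if (strcmp(tok, "bgwriter") == 0)
            out->bgwriter = true;
        else if (strcmp(tok, "backends") == 0)
            out->backends = true;
        else
            return false;       /* unknown element: reject the setting */
    }
    return true;
}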

> 3) I think we should remove the sort timing from the checkpoint logging
>   before commit. It'll always be pretty short.

I added it to show that it was really short, in response to concerns that 
my approach of sorting through indexes, to reduce the memory needed 
instead of copying the data to be sorted, might induce significant 
performance issues. I proved my point, but peer pressure made me switch 
to the larger-memory approach anyway.

I think it should be kept while the features are under testing. I do not 
think that it harms in any way.

-- 
Fabien.
