Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: checkpointer continuous flushing
Date
Msg-id alpine.DEB.2.10.1511121722140.15029@sto
In response to Re: checkpointer continuous flushing  (Andres Freund <andres@anarazel.de>)
Responses Re: checkpointer continuous flushing  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
>> To fix it, ISTM that it is enough to hold a "do not close lock" on the file
>> while a flush is in progress (a short time), which would prevent mdclose from
>> doing its stuff.
>
> Could you expand a bit more on this? You're suggesting something like a
> boolean in the vfd struct?

Basically yes: I'm suggesting a mutex in the vfd struct.

> If that, how would you deal with FileClose() being called?

FileClose() would just wait for the mutex. The mutex would be held while 
flushes are accumulated into the flush context and released once the flush 
has been performed and the fd is no longer needed for that purpose. That 
window is expected to be short (at worst between the wake & sleep of the 
checkpointer, and only one file at a time).
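
To make the idea a little more concrete, here is a minimal sketch (all 
names hypothetical, this is not the actual fd.c code) of what such a 
"do not close" marker on the vfd could look like:

/*
 * Minimal sketch only -- not the actual fd.c layout.  It illustrates the
 * idea above: a "do not close" marker attached to the virtual file
 * descriptor, set while the flush context still needs the kernel fd, so
 * that the close/LRU path waits (or skips the file) until the flush is
 * done.  Field and function names are hypothetical.
 */
#include <stdbool.h>

typedef struct SketchVfd
{
    int     fd;                 /* kernel file descriptor, -1 if closed */
    bool    flush_in_progress;  /* hypothetical "do not close" marker */
    /* ... the real Vfd has many more fields ... */
} SketchVfd;

/* Taken before ranges of this file are accumulated into the flush context. */
static void
flush_lock_acquire(SketchVfd *vfdP)
{
    vfdP->flush_in_progress = true;
}

/* Released once the accumulated ranges have been issued to the kernel. */
static void
flush_lock_release(SketchVfd *vfdP)
{
    vfdP->flush_in_progress = false;
}

/* The close path would check the marker and wait/skip instead of closing. */
static bool
can_close_now(const SketchVfd *vfdP)
{
    return !vfdP->flush_in_progress;
}

Whether a plain boolean or a real lock is needed depends on who may call 
the close path concurrently; the sketch only shows the intent.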

>> I'm conscious that the patch only addresses *checkpointer* writes, not
>> those from the bgwriter or backends. I agree that these will need to be
>> addressed at some point as well, but given the time it takes to get a patch
>> through (the more complex the patch, the slower it moves; sort proposals are
>> 10 years old), I think this should be postponed for later.
>
> I think we need to have at least a PoC of all of the relevant
> changes. We're doing these to fix significant latency and throughput
> issues, and if the approach turns out not to be suitable for
> e.g. bgwriter or backends, that might have influence over checkpointer's
> design as well.

Hmmm. See below.

>>> What I did not expect, and what confounded me for a long while, is that
>>> for workloads where the hot data set does *NOT* fit into shared buffers,
>>> sorting often led to a noticeable reduction in throughput. Up to
>>> 30%.
>>
>> I did not see such behavior in the many tests I ran. Could you share more
>> precise details so that I can try to reproduce this performance regression?
>> (available memory, shared buffers, db size, ...).
>
>
> I generally found that I needed to disable autovacuum's analyze to get
> anything even close to stable numbers. The issue described in
> http://archives.postgresql.org/message-id/20151031145303.GC6064%40alap3.anarazel.de
> otherwise kicks in badly. I basically just set autovacuum_analyze_threshold
> to INT_MAX (2147483647) to prevent that from occurring.
>
> I'll show actual numbers at some point yes. I tried three different systems:
>
> * my laptop, 16 GB Ram, 840 EVO 1TB as storage. With 2GB
>  shared_buffers. Tried checkpoint timeouts from 60 to 300s.

Hmmm. This is quite short. I tend to run tests with much larger timeouts, 
and I would advise against a short timeout, especially on a high-throughput 
system: the whole point of the checkpointer is to accumulate as many 
changes as possible before writing them out.

I'll look into that.

>> This explanation seems to suggest that if bgwriter/worker writes were sorted
>> and/or coordinated with the checkpointer somehow, then all would be well?
>
> Well, you can't easily sort bgwriter/backend writes stemming from cache
> replacement. Unless your access patterns are entirely sequential the
> data in shared buffers will be laid out in a nearly entirely random
> order.  We could try sorting the data, but with any reasonable window,
> for many workloads the likelihood of actually achieving much with that
> seems low.

Maybe the sorting could be shared, so that everybody uses the same order?

That would suggest maintaining one global sort order of the buffers, probably 
in the checkpointer, which could then be used by all processes that need to 
scan the buffers in file order, instead of scanning them in memory order.

For this purpose, I think the initial index-based sorting would suffice. It 
could be re-sorted periodically, with the delay controlled by a GUC, or 
whenever significant buffer changes (reads & writes) have occurred.
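
To sketch what such a shared, file-ordered index over the buffer pool could 
look like (again, hypothetical names, not a patch):

/*
 * Minimal sketch only, with hypothetical names: a shared index over the
 * buffer pool that the checkpointer would maintain and that other
 * processes could scan instead of iterating in memory order.  The sort
 * key is the usual file order (tablespace, database, relation, fork, block).
 */
#include <stdint.h>

typedef struct SketchBufferOrderEntry
{
    uint32_t    spcNode;    /* tablespace */
    uint32_t    dbNode;     /* database */
    uint32_t    relNode;    /* relation file */
    int         forkNum;    /* fork */
    uint32_t    blockNum;   /* block within the fork */
    int         buf_id;     /* index into the shared buffer pool */
} SketchBufferOrderEntry;

typedef struct SketchBufferOrder
{
    uint32_t    generation; /* bumped each time the order is rebuilt */
    uint32_t    nentries;
    SketchBufferOrderEntry entries[];   /* kept sorted in file order */
} SketchBufferOrder;

/* File-order comparator, usable with qsort() when (re)building the index. */
static int
buffer_order_cmp(const void *a, const void *b)
{
    const SketchBufferOrderEntry *x = a;
    const SketchBufferOrderEntry *y = b;

    if (x->spcNode != y->spcNode)
        return x->spcNode < y->spcNode ? -1 : 1;
    if (x->dbNode != y->dbNode)
        return x->dbNode < y->dbNode ? -1 : 1;
    if (x->relNode != y->relNode)
        return x->relNode < y->relNode ? -1 : 1;
    if (x->forkNum != y->forkNum)
        return x->forkNum < y->forkNum ? -1 : 1;
    if (x->blockNum != y->blockNum)
        return x->blockNum < y->blockNum ? -1 : 1;
    return 0;
}

The generation counter is only there so readers can notice when the order 
has been rebuilt; how stale an order is acceptable would be the GUC delay 
mentioned above.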

>> ISTM that this explanation could be checked by looking at whether
>> bgwriter/worker writes are especially large compared to checkpointer writes
>> in the cases with reduced throughput? The data is in the log.
>
> What do you mean with "large"? Numerous?

I mean that the number of buffers written by the bgwriter/workers is greater 
than the number written by the checkpointer. If everything fits in shared 
buffers, the bgwriter/workers mostly do not need to write anything and the 
checkpointer does all the writes.

The larger the working set, the more likely the workers/bgwriter are to kick 
in and generate random I/Os, because nothing sensible is currently done 
about their write order. So this is consistent with your findings, although, 
as already said, I'm surprised that it has such a large effect on throughput.

>> Hmmm. The shorter the timeout, the more likely it is that the sorting will
>> NOT be effective
>
> You mean, as evidenced by the results, or is that what you'd actually
> expect?

What I would expect...

-- 
Fabien.


