From: Fabien COELHO
Subject: Re: checkpointer continuous flushing - V18
Msg-id: alpine.DEB.2.10.1602210746250.3927@sto
In response to: Re: checkpointer continuous flushing - V18 (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers

Hallo Andres,

>> In some previous version I think a warning was shown if the feature was
>> requested but not available.
>
> I think we should either silently ignore it, or error out. Warnings
> somewhere in the background aren't particularly meaningful.

I like "ignoring with a warning" in the log file, because when things do 
not behave as expected that is where I'll be looking. I do not think that 
it should error out.

>> The sgml documentation about "*_flush_after" configuration parameter 
>> talks about bytes, but the actual unit should be buffers.
>
> The unit actually is buffers, but you can configure it using
> bytes. We've done the same for other GUCs (shared_buffers, wal_buffers,
> ...). Referring to bytes is easier because you don't have to explain that
> how much data it actually is depends on compilation settings and such.

So I understand that it works with kB as well. Now I do not think that it 
would need much explanation if you say that it is a number of pages, and I 
think that a number of pages is the meaningful figure here, because it is, 
eventually, the number of IO requests to be coalesced.
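
To make the arithmetic concrete, here is a minimal sketch, assuming the 
default 8kB block size (the names and values below are only illustrative, 
not taken from the patch):

    /* Minimal sketch, assuming the default BLCKSZ of 8kB: a byte-valued
     * *_flush_after setting translates into a buffer count, which is also
     * the number of IO requests that may be coalesced. */
    #include <stdio.h>

    #define BLCKSZ 8192                       /* assumed default block size */

    int main(void)
    {
        int bytes   = 256 * 1024;             /* e.g. "256kB" in the configuration */
        int buffers = bytes / BLCKSZ;

        printf("%d bytes -> %d buffers\n", bytes, buffers);
        return 0;                             /* prints: 262144 bytes -> 32 buffers */
    }

So "256kB" is just another spelling of "32 pages"; my point is only that 
the page count is the figure that matters for coalescing.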

>> In the discussion in the wal section, I'm not sure about the effect of
>> setting writebacks on SSD, [...]
>
> Yea, that paragraph needs some editing. I think we should basically
> remove that last sentence.

Ok, fine with me. Does that mean that flushing has a significant positive 
impact on SSDs in your tests?

>> However it does not address the point that bgwriter and backends 
>> basically issue random writes, [...]
>
> The benefit is primarily that you don't collect large amounts of dirty
> buffers in the kernel page cache. In most cases the kernel will not be
> able to coalesce these writes either...  I've measured *massive*
> performance latency differences for workloads that are bigger than
> shared buffers - because suddenly bgwriter / backends do the majority of
> the writes. Flushing in the checkpoint quite possibly makes nearly no
> difference in such cases.

So I understand that there is a positive impact under some load. Good!
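
For reference, my mental model of the mechanism is something like the 
sketch below (Linux-specific, not the checkpointer code itself; the file 
name and sizes are made up): write a sorted batch of blocks, ask the 
kernel to start writeback on that range, and keep the final fsync.

    /* Sketch of the flushing idea: after writing a sorted batch of blocks,
     * ask the kernel to start writeback on that range so dirty data does
     * not pile up until the final fsync.  Linux-specific illustration. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdlib.h>

    int main(void)
    {
        char  block[8192] = {0};
        int   fd = open("relation.segment", O_WRONLY | O_CREAT, 0600);

        if (fd < 0)
            return EXIT_FAILURE;

        for (int i = 0; i < 32; i++)          /* pretend: 32 adjacent dirty blocks */
            if (write(fd, block, sizeof(block)) != (ssize_t) sizeof(block))
                return EXIT_FAILURE;

        /* ask the kernel to start asynchronous writeback of that range */
        sync_file_range(fd, 0, 32 * 8192, SYNC_FILE_RANGE_WRITE);

        fsync(fd);                            /* the checkpoint still fsyncs at the end */
        close(fd);
        return EXIT_SUCCESS;
    }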

>> Maybe the merging strategy could be more aggressive than just strict
>> neighbors?
>
> I don't think so. If you flush more than neighbouring writes you'll
> often end up flushing buffers dirtied by another backend, causing
> additional stalls.

Ok. Maybe the neighbour definition could be relaxed just a little bit, so 
that small holes are skipped over but not large ones? If there are only a 
few pages in between, even if they were written by another process, then 
writing them together should be better? Well, this can wait for a clear 
case, because hopefully the OS will re-coalesce them behind the scenes anyway.
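
To illustrate what I mean by "small holes", a toy sketch (the 4-page gap 
threshold is an arbitrary value I made up for the example):

    /* Illustration only: coalesce sorted block numbers into ranges,
     * absorbing small holes (here up to 4 pages, an arbitrary threshold)
     * but starting a new range for large ones. */
    #include <stdio.h>

    #define MAX_GAP 4                         /* arbitrary value for the example */

    int main(void)
    {
        int blocks[] = {10, 11, 12, 15, 16, 100, 101};   /* sorted dirty blocks */
        int n = sizeof(blocks) / sizeof(blocks[0]);
        int start = blocks[0], prev = blocks[0];

        for (int i = 1; i <= n; i++)
        {
            if (i == n || blocks[i] - prev > MAX_GAP)
            {
                /* with a strict-neighbour rule, 10..12 and 15..16 would be
                 * two separate ranges; here the 2-page hole is absorbed */
                printf("writeback range: blocks %d..%d\n", start, prev);
                if (i < n)
                    start = blocks[i];
            }
            if (i < n)
                prev = blocks[i];
        }
        return 0;
    }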

>> struct WritebackContext: keeping a pointer to guc variables is a kind of
>> trick, I think it deserves a comment.
>
> It has, it's just in WritebackContextInit(). Can duplicate it.

I missed it, I expected something in the struct definition. Do not 
duplicate, but cross reference it?
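
Something along these lines is what I have in mind; the shape of the 
struct is from my memory of the patch, and PendingWriteback is stubbed out 
so the sketch stands alone:

    /* Sketch only: the "trick" gets a cross reference right in the struct
     * definition instead of a duplicated explanation. */
    #define WRITEBACK_MAX_PENDING_FLUSHES 256   /* placeholder value */

    typedef struct PendingWriteback
    {
        int     dummy;                          /* stand-in for the real fields */
    } PendingWriteback;

    typedef struct WritebackContext
    {
        /*
         * Pointer to the GUC that limits how many requests to accumulate
         * before flushing; see WritebackContextInit() for why this is a
         * pointer (as I understand it, so that a run-time change of the
         * GUC is picked up without re-initializing the context).
         */
        int    *max_pending;

        int     nr_pending;                     /* requests accumulated so far */
        PendingWriteback pending_writebacks[WRITEBACK_MAX_PENDING_FLUSHES];
    } WritebackContext;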

>> IssuePendingWritebacks: I understand that qsort is needed "again"
>> because when balancing writes over tablespaces they may be intermixed.
>
> Also because the infrastructure is used for more than checkpoint
> writes. There's absolutely no ordering guarantees there.

Yep, but there is not much benefit to expect from a few dozen random pages 
either.
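
That is, something like the sort below before looking for mergeable 
ranges; the struct and field names here are mine, for illustration only:

    /* Illustration only: sort pending writeback requests so that requests
     * for the same file end up adjacent and can be coalesced. */
    #include <stdlib.h>

    typedef struct PendingRequest
    {
        unsigned int tablespace;
        unsigned int relation;
        unsigned int block;
    } PendingRequest;

    static int
    pending_cmp(const void *a, const void *b)
    {
        const PendingRequest *pa = a;
        const PendingRequest *pb = b;

        if (pa->tablespace != pb->tablespace)
            return pa->tablespace < pb->tablespace ? -1 : 1;
        if (pa->relation != pb->relation)
            return pa->relation < pb->relation ? -1 : 1;
        if (pa->block != pb->block)
            return pa->block < pb->block ? -1 : 1;
        return 0;
    }

    void
    sort_pending(PendingRequest *reqs, size_t n)
    {
        qsort(reqs, n, sizeof(PendingRequest), pending_cmp);
    }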

>> [...] I do think that this whole writeback logic really does make sense 
>> *per table space*,
>
> Leads to less regular IO, because if your tablespaces are evenly sized
> (somewhat common) you'll sometimes end up issuing sync_file_range's
> shortly after each other.  For latency outside checkpoints it's
> important to control the total amount of dirty buffers, and that's
> obviously independent of tablespaces.

I do not understand/buy this argument.

The underlying IO queue is per device, and tablespaces should be per 
device as well (otherwise what is the point?), so you should want to coalesce 
and "writeback" pages per device as well. Calls to sync_file_range on 
distinct devices will be issued in a more or less arbitrary order anyway, 
and should not interfere with one another.

If you use just one context, then the more tablespaces there are, the 
smaller the performance gain, because there is less and less aggregation, 
and thus fewer sequential writes, per device.

So for me there should really be one context per tablespace. That would 
suggest a hashtable or some other structure to keep and retrieve them, 
which would not be that bad, and I think that it is what is needed.
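
A rough sketch of what I mean, with names of my own invention rather than 
the patch's (a real implementation would probably use a proper hash table 
rather than this toy array):

    /* Rough sketch: one writeback context per tablespace, looked up by
     * tablespace OID, so requests are coalesced per device rather than
     * across all devices. */
    #include <stddef.h>

    typedef unsigned int Oid;

    typedef struct TablespaceWriteback
    {
        Oid     tablespace;        /* 0 marks an empty slot */
        int     nr_pending;        /* stand-in for a full writeback context */
    } TablespaceWriteback;

    #define MAX_TABLESPACES 64     /* arbitrary bound for the sketch */
    static TablespaceWriteback contexts[MAX_TABLESPACES];

    /* Find or create the context for a tablespace; linear probing is
     * enough for the sketch. */
    TablespaceWriteback *
    get_writeback_context(Oid tablespace)
    {
        for (size_t i = 0; i < MAX_TABLESPACES; i++)
        {
            if (contexts[i].tablespace == tablespace)
                return &contexts[i];
            if (contexts[i].tablespace == 0)
            {
                contexts[i].tablespace = tablespace;
                return &contexts[i];
            }
        }
        return NULL;               /* table full: fall back to a shared context */
    }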

>> For the checkpointer, a key aspect is that the scheduling process goes
>> to sleep from time to time, and this sleep time looked like a great
>> opportunity to do this kind of flushing. You choose not to take advantage
>> of the behavior, why?
>
> Several reasons: Most importantly there's absolutely no guarantee that 
> you'll ever end up sleeping, it's quite common to happen only seldomly.

Well, that would be a situation in which pg is already completely 
unresponsive. What is more, that behavior itself *makes* pg unresponsive.

> If you're bottlenecked on IO, you can end up being behind all the time.

Hopefully sorting & flushing should improve this situation a lot.

> But even then you don't want to cause massive latency spikes
> due to gigabytes of dirty data - a slower checkpoint is a much better
> choice.  It'd make the writeback infrastructure less generic.

Sure. It would be enough to have a call that issues the accumulated 
writebacks regardless of how many are in the queue; that does not need any 
change to the infrastructure.

Also, I think that such a call would make sense at the end of the 
checkpoint.
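
Concretely, the suggestion amounts to a hook of this kind, sketched here 
with illustrative names rather than the patch's API:

    /* Sketch of the suggestion: when the checkpointer is ahead of schedule
     * and about to sleep, and once more at the end of the checkpoint,
     * flush whatever writeback requests have accumulated instead of
     * waiting for the queue to fill up. */
    typedef struct FlushQueue
    {
        int     nr_pending;        /* stand-in for the real writeback context */
    } FlushQueue;

    static void
    issue_pending_writebacks(FlushQueue *queue)
    {
        /* here the coalesced ranges would be passed to sync_file_range() */
        queue->nr_pending = 0;
    }

    void
    flush_when_idle(FlushQueue *queue)
    {
        /* call this before sleeping and at the end of the checkpoint */
        if (queue->nr_pending > 0)
            issue_pending_writebacks(queue);
    }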

> I also don't really believe it helps that much, although that's a 
> complex argument to make.

Yep. My thinking is that doing things in the sleeping interval does not 
interfere with the checkpointer scheduling, so it is less likely to go 
wrong and fall behind.

-- 
Fabien.


