Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: checkpointer continuous flushing
Date
Msg-id alpine.DEB.2.10.1601071533020.5278@sto
In response to Re: checkpointer continuous flushing  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hello Andres,

>> I thought of adding a pointer to the current flush structure at the vfd
>> level, so that on closing a file with a flush in progress the flush can be
>> done and the structure properly cleaned up, hence later the checkpointer
>> would see a clean thing and be able to skip it instead of generating flushes
>> on a closed file or on a different file...
>>
>> Maybe I'm missing something, but that is the plan I had in mind.
>
> That might work, although it'd not be pretty (not fatally so
> though).

Alas, any solution has to communicate somehow between the API levels, so 
it cannot be "pretty", although we should avoid the worst.
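To make the idea concrete, here is a minimal sketch of the kind of vfd-level 
hook I have in mind; all names (PendingFlush, VfdSketch, FileCloseWithFlush) 
are illustrative, not the actual fd.c code:

    /*
     * Illustrative sketch only: a pending-flush pointer kept at the vfd
     * level, so that closing the file completes the flush and cleans up
     * the structure.  Names are hypothetical, not fd.c definitions.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdbool.h>
    #include <sys/types.h>

    typedef struct PendingFlush
    {
        off_t   offset;     /* start of the not-yet-flushed range */
        off_t   nbytes;     /* length of the range */
        bool    active;     /* is there anything left to flush? */
    } PendingFlush;

    typedef struct VfdSketch
    {
        int            fd;        /* kernel file descriptor */
        PendingFlush  *pending;   /* non-NULL while a flush is in progress */
    } VfdSketch;

    /*
     * On close, complete any flush in progress and mark the structure
     * clean, so that the checkpointer later sees nothing to do for this
     * (now closed) file instead of issuing a flush on a closed or
     * reused descriptor.
     */
    static void
    FileCloseWithFlush(VfdSketch *vfdP)
    {
        if (vfdP->pending && vfdP->pending->active)
        {
    #ifdef __linux__
            (void) sync_file_range(vfdP->fd, vfdP->pending->offset,
                                   vfdP->pending->nbytes,
                                   SYNC_FILE_RANGE_WRITE);
    #endif
            vfdP->pending->active = false;
            vfdP->pending = NULL;
        }
        close(vfdP->fd);
    }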

> But I'm inclined to go a different way: I think it's a mistake to do 
> flusing based on a single file. It seems better to track a fixed number 
> of outstanding 'block flushes', independent of the file. Whenever the 
> number of outstanding blocks is exceeded, sort that list, and flush all 
> outstanding flush requests after merging neighbouring flushes.

Hmmm. I'm not sure I understand your strategy.

I do not think that flushing without prior sorting would be effective: 
there is no clear reason why buffers written together would end up next to 
each other and thus give sequential write benefits, so we would just get 
flushed random IO. I tested that and it worked badly.

One of the points of aggregating flushes is that the range flush call has a 
significant cost, as shown by preliminary tests I did, probably up in the 
thread, so it makes sense to limit the number of calls, hence the 
aggregation. This removed some performance regressions I had in some cases.

Also, the granularity of the buffer flush call is a file + offset + size, 
so necessarily it should be done this way (i.e. per file).
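For reference, a minimal sketch of such a per-file range flush, assuming 
Linux's sync_file_range with posix_fadvise as a weaker fallback; the 
helper's name is illustrative, the real code in the patch differs in detail:

    /*
     * Illustrative helper: ask the kernel to start writeback of
     * [offset, offset + nbytes) on a single file.  Each call has a
     * noticeable fixed cost, which is why neighbouring ranges are
     * worth merging into one call.  The name is hypothetical.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/types.h>

    static void
    flush_file_range(int fd, off_t offset, off_t nbytes)
    {
    #ifdef __linux__
        /* initiate asynchronous writeback, do not wait for completion */
        (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    #else
        /* weaker portable fallback: hint that these pages can be written out */
        (void) posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
    #endif
    }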

Once buffers are sorted by file and by offset within each file, the written 
buffers are as close as possible one after the other, so the merging is 
very easy to compute (it is done on the fly, with no need to keep a list of 
buffers, for instance) and it is optimally effective. Moreover, once the 
file being checkpointed changes we never go back to the previous one before 
the next checkpoint, so there is no reason not to flush it right then.
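As an illustration of this on-the-fly merging (again with hypothetical 
names, reusing the flush_file_range helper sketched above): a write either 
extends the current pending range, or, when it targets another file or is 
not contiguous, triggers the flush of the pending range and starts a new one.

    /*
     * Illustrative on-the-fly merging over buffer writes sorted by
     * (file, offset); names are hypothetical, not the patch's code.
     */
    #include <sys/types.h>

    typedef struct FlushRange
    {
        int    fd;        /* file of the pending range, -1 if none */
        off_t  offset;    /* start of the pending range */
        off_t  nbytes;    /* accumulated length */
    } FlushRange;

    static void
    note_buffer_write(FlushRange *r, int fd, off_t offset, off_t len)
    {
        if (r->fd == fd && offset == r->offset + r->nbytes)
        {
            /* contiguous with the pending range: just extend it */
            r->nbytes += len;
            return;
        }

        /*
         * Different file or non-contiguous write: flush the pending range
         * now.  A previous file is never revisited before the next
         * checkpoint, so nothing is lost by flushing it right away.
         */
        if (r->fd >= 0)
            flush_file_range(r->fd, r->offset, r->nbytes);

        r->fd = fd;
        r->offset = offset;
        r->nbytes = len;
    }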

So basically I do not see a clear advantage to your suggestion, especially 
when taking into consideration how the checkpointer schedules its writes:

In effect the checkpointer already works in little bursts of activity 
between sleep phases, writing buffers a few at a time, so it may already 
behave more or less as you expect, but not for the same reason.

The closest strategy I experimented with, which may be what you are 
suggesting, was to enforce a minimum number of buffers to write on each 
wakeup and to vary the sleep delay in between, but I had no clear way to 
choose these values, and the experiments I ran did not show a significant 
performance impact when varying these parameters, so I kept that out. If 
you find a magic number of buffers which results in consistently better 
performance, fine with me, but this is independent of aggregating before 
or after.
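For what it is worth, that experiment looked roughly like the sketch below; 
the constants and helper functions are placeholders I made up for 
illustration, not values or code from the patch:

    /*
     * Illustrative sketch of the experiment: write at least a minimum
     * number of buffers per wakeup, then sleep for a tunable delay.
     * Constants and helpers are hypothetical placeholders.
     */
    #include <stdbool.h>

    #define MIN_BUFFERS_PER_WAKEUP 32    /* the "magic number", never tuned */
    #define SLEEP_DELAY_MS         100   /* delay between write bursts */

    extern bool have_buffers_to_write(void);     /* placeholder helpers */
    extern void write_next_sorted_buffer(void);
    extern void sleep_ms(int ms);

    static void
    checkpoint_write_loop(void)
    {
        while (have_buffers_to_write())
        {
            int written = 0;

            /* one burst: write a few buffers in sorted order */
            while (written < MIN_BUFFERS_PER_WAKEUP && have_buffers_to_write())
            {
                write_next_sorted_buffer();
                written++;
            }

            /* then sleep, spreading the I/O over the checkpoint interval */
            sleep_ms(SLEEP_DELAY_MS);
        }
    }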

> Imo that means that we'd better track writes on a relfilenode + block 
> number level.

I do not think that it is a better option. Moreover, the current approach 
has proven to be very effective over hundreds of runs, so redoing it 
differently for the sake of it does not look like good resource allocation.

-- 
Fabien.


