Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From: Fabien COELHO
Subject: Re: checkpointer continuous flushing
Date:
Msg-id: alpine.DEB.2.10.1601071613040.5278@sto
In response to: Re: checkpointer continuous flushing (Andres Freund <andres@anarazel.de>)
Responses: Re: checkpointer continuous flushing
List: pgsql-hackers
Hello Andres,

>> One of the points of aggregating flushes is that the range flush call
>> cost is significant, as shown by preliminary tests I did, probably up
>> in the thread, so it makes sense to limit this cost, hence the
>> aggregation. This removed some performance regressions I had in some
>> cases.
>
> FWIW, my tests show that flushing for clean ranges is pretty cheap.

Yes, I agree that it is quite cheap, but I had a few % tps regressions 
in some cases without aggregating, and aggregating was enough to avoid 
these small regressions.
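
To make the aggregation concrete, here is a minimal sketch of the 
on-the-fly merging idea (hypothetical names, not the actual patch code, 
and assuming Linux sync_file_range is available): consecutive writes to 
the same file extend a single pending range, and the accumulated range 
is only flushed when the next write is not contiguous or hits another 
file.

#define _GNU_SOURCE
#include <fcntl.h>

typedef struct PendingFlush
{
    int     fd;         /* file currently accumulating writes, -1 if none */
    off_t   offset;     /* start of the pending range, in bytes */
    off_t   nbytes;     /* length of the pending range, in bytes */
} PendingFlush;

static void
flush_pending(PendingFlush *pf)
{
    if (pf->fd >= 0 && pf->nbytes > 0)
        (void) sync_file_range(pf->fd, pf->offset, pf->nbytes,
                               SYNC_FILE_RANGE_WRITE);
    pf->fd = -1;
    pf->nbytes = 0;
}

/* called after each buffer write of 'len' bytes at 'offset' in file 'fd' */
static void
record_write(PendingFlush *pf, int fd, off_t offset, off_t len)
{
    if (pf->fd == fd && offset == pf->offset + pf->nbytes)
        pf->nbytes += len;      /* contiguous: extend the pending range */
    else
    {
        flush_pending(pf);      /* issue the accumulated range */
        pf->fd = fd;
        pf->offset = offset;
        pf->nbytes = len;
    }
}

With sorted writes the common case is the "extend" branch, so the number 
of sync_file_range calls stays small even when many buffers are written.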

>> Also, the granularity of the buffer flush call is a file + offset + size, so
>> necessarily it should be done this way (i.e. per file).
>
> What syscalls we issue, and at what level we track outstanding flushes,
> doesn't have to be the same.

Sure. But the current version is simple, efficient and proven by many 
runs, so changing the approach would need a very strong argument showing 
a significant benefit, and I see no such thing in your arguments.

For me the current approach is optimal for the checkpointer, because it 
takes advantage of all available information to perform a better job.

>> Once buffers are sorted per file and offset within file, written
>> buffers are as close as possible to one another, the merging is very
>> easy to compute (it is done on the fly, no need to keep a list of
>> buffers for instance), and it is optimally effective; when the
>> checkpointed file changes we will never go back to it before the next
>> checkpoint, so there is no reason not to flush right then.
>
> Well, that's true if there's only one tablespace, but e.g. not the case
> with two tablespaces of about the same number of dirty buffers.

ISTM that in the version of the patch I sent there was one flushing 
structure per tablespace, each doing its own flushing on its files, so 
it should work the same, only the writing intensity is divided by the 
number of tablespaces? Or am I missing something?
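
For illustration, here is a rough sketch of what such a per-tablespace 
structure could look like (hypothetical names, reusing the PendingFlush 
sketch above; the actual patch keeps more state than this): each 
tablespace has its own write progress and its own pending flush range, 
and the checkpointer keeps tablespaces balanced by always writing for 
the least advanced one.

/* hypothetical per-tablespace write state */
typedef struct TablespaceWrites
{
    unsigned int  tsoid;         /* tablespace oid */
    int           num_to_write;  /* dirty buffers assigned to this tablespace */
    int           num_written;   /* buffers written so far */
    PendingFlush  pending;       /* flush range being accumulated (see above) */
} TablespaceWrites;

/* pick the least advanced tablespace, NULL when all are done */
static TablespaceWrites *
next_tablespace(TablespaceWrites *ts, int nts)
{
    TablespaceWrites *best = NULL;
    double            best_ratio = 2.0;   /* progress ratios are <= 1.0 */

    for (int i = 0; i < nts; i++)
    {
        double  ratio;

        if (ts[i].num_written >= ts[i].num_to_write)
            continue;           /* this tablespace is finished */
        ratio = (double) ts[i].num_written / ts[i].num_to_write;
        if (ratio < best_ratio)
        {
            best = &ts[i];
            best_ratio = ratio;
        }
    }
    return best;
}

The point is that flushing stays per file within each tablespace, while 
the balancing only decides which tablespace's sorted buffer list is 
advanced next.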

>> So basically I do not see a clear positive advantage to your suggestion,
>> especially when taking into consideration the scheduling process of the
>> scheduler:
>
> I don't think it makes a big difference for the checkpointer alone, but
> it makes the interface much more suitable for other processes, e.g. the
> bgwriter, and normal backends.

Hmmm.

ISTM that the requirements are not exactly the same for the bgwriter and 
backends vs the checkpointer. The checkpointer has the advantage of 
being able to plan its IOs over the long term (volume & time are 
known...), and the implementation takes full benefit of this planning by 
sorting, scheduling and flushing buffers so as to generate writes that 
are as sequential as possible.
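
As an illustration of this planning step, here is a small sketch of the 
kind of sort order involved (hypothetical struct and field names, not 
necessarily those of the patch): dirty buffers are ordered by 
tablespace, then relation, then block number, so writes reach each file 
in ascending offset order and the flush ranges can be merged on the fly 
as sketched above.

#include <stdlib.h>

typedef struct CkptSortItem
{
    unsigned int tsId;      /* tablespace */
    unsigned int relNode;   /* relation file node */
    unsigned int forkNum;   /* fork within the relation */
    unsigned int blockNum;  /* block within the fork */
} CkptSortItem;

static int
ckpt_buforder_cmp(const void *pa, const void *pb)
{
    const CkptSortItem *a = (const CkptSortItem *) pa;
    const CkptSortItem *b = (const CkptSortItem *) pb;

    if (a->tsId != b->tsId)
        return a->tsId < b->tsId ? -1 : 1;
    if (a->relNode != b->relNode)
        return a->relNode < b->relNode ? -1 : 1;
    if (a->forkNum != b->forkNum)
        return a->forkNum < b->forkNum ? -1 : 1;
    if (a->blockNum != b->blockNum)
        return a->blockNum < b->blockNum ? -1 : 1;
    return 0;
}

/* usage: qsort(items, nitems, sizeof(CkptSortItem), ckpt_buforder_cmp); */

Once the array is sorted this way, the checkpointer simply walks it, 
which is what makes the writes as sequential as the workload allows.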

The bgwriter and backends have a much shorter vision (a few seconds, or 
just the one query being processed), so the solution will be less 
efficient and probably messier on the coding side. This is life. I do 
not see why we should not take the benefit of full planning in the 
checkpointer just because other processes cannot do the same, especially 
as under many loads the checkpointer does most of the writing and so is 
the limiting factor.

So I do not buy your suggestion for the checkpointer. Maybe it will be 
the way to go for the bgwriter and backends; if so, fine for them.

>>> Imo that means that we'd better track writes on a relfilenode + block
>>> number level.
>>
>> I do not think that it is a better option. Moreover, the current approach
>> has been proven to be very effective on hundreds of runs, so redoing it
>> differently for the sake of it does not look like good resource allocation.
>
> For a subset of workloads, yes.

Hmmm. What I understood is that the workloads that show some performance 
regressions (regressions that I have *not* seen in the many tests I ran) 
are not due to checkpointer IOs, but rather occur in settings where most 
of the writes are done by the backends or the bgwriter.

I do not see the point of rewriting the checkpointer for them, although 
obviously I agree that something has to be done also for the other 
processes.

Maybe if all the writes (bgwriter and checkpointer) were performed by 
the same process then some dynamic mixing, sorting and aggregating would 
make sense, but this is currently not the case, and it would probably 
have quite a limited effect.

Basically I do not understand how changing the flushing organisation as 
you suggest would improve checkpointer performance significantly; to me 
it would only degrade performance compared to the current version, as 
far as the checkpointer is concerned.

-- 
Fabien.


