Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: checkpointer continuous flushing
Date
Msg-id alpine.DEB.2.10.1601091620160.4394@sto
In response to Re: checkpointer continuous flushing  (Andres Freund <andres@anarazel.de>)
Responses Re: checkpointer continuous flushing
List pgsql-hackers
Hello Andres,

> Hm. New theory: The current flush interface does the flushing inside
> FlushBuffer()->smgrwrite()->mdwrite()->FileWrite()->FlushContextSchedule(). The
> problem with that is that at that point we (need to) hold a content lock
> on the buffer!

You are worrying that FlushBuffer is holding a content lock on a buffer 
while the "sync_file_range" call is issued.

Although I agree that this is not ideal, I would be surprised if it were 
the explanation for a performance regression: with the chosen parameters, 
sync_file_range is an asynchronous call. It "advises" the OS to start 
writing out the file range, but it does not wait for the write to 
complete.

Moreover, for this issue to have a significant impact, another backend 
would have to need that very buffer at just that moment. But ISTM that 
the performance regression you are arguing about is on random-IO-bound 
performance, that is a few hundred tps in the best case, for very large 
bases, hence a lot of buffers, so the probability of such a collision is 
very small and would not explain a significant regression.

> Especially on a system that's bottlenecked on IO that means we'll
> frequently hold content locks for a noticeable amount of time, while
> flushing blocks, without any need to.

I'm not that sure it is really noticeable, because sync_file_range does 
not wait for completion.

> Even if that's not the reason for the slowdowns I observed, I think this
> fact gives further credence to the current "pending flushes" tracking
> residing on the wrong level.

ISTM that I put the tracking at the level where the information is 
available without having to recompute it several times, as the flush 
needs to know the fd and offset. Doing it differently would mean more 
code and translating buffer to file/offset several times, I think.

Also, maybe you could answer a question I had about the performance 
regression you observed. I could not find the post where you gave the 
detailed information about it, so that I could try to reproduce it: what 
are the exact settings and conditions (shared_buffers, pgbench scaling, 
host memory, ...)? What is the observed regression (tps? something 
else?), and what is the responsiveness of the database under the 
regression (e.g. % of seconds with 0 tps, or something like that)?

-- 
Fabien.
