Hello Andres,
> Hm. New theory: The current flush interface does the flushing inside
> FlushBuffer()->smgrwrite()->mdwrite()->FileWrite()->FlushContextSchedule(). The
> problem with that is that at that point we (need to) hold a content lock
> on the buffer!
You are worried that FlushBuffer is holding a lock on a buffer at the
moment the "sync_file_range" call is issued.
Although I agree that it is not that good, I would be surprised if that was
the explanation for a performance regression, because sync_file_range
with the chosen parameters is an async call: it "advises" the OS to start
writing the data out, but it does not wait for the write to complete.
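
To illustrate what I mean by an async call, here is a minimal sketch. It is
not the patch code: write_then_hint is a hypothetical helper, and I am
assuming the flag is SYNC_FILE_RANGE_WRITE alone, which asks the kernel to
initiate writeback of the range without waiting for it to finish.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* sync_file_range() is Linux-specific */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): write one block at the given
 * offset, then hint the kernel to start writing that range back.
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 */
int
write_then_hint(int fd, const void *buf, size_t len, off_t offset)
{
    assert(len > 0);

    if (pwrite(fd, buf, len, offset) != (ssize_t) len)
        return -1;

    /*
     * Assumed flag: SYNC_FILE_RANGE_WRITE only. This initiates writeback
     * of the dirty pages in the range; it does not wait for completion
     * (that would need the SYNC_FILE_RANGE_WAIT_* flags as well).
     */
    return sync_file_range(fd, offset, (off_t) len, SYNC_FILE_RANGE_WRITE);
}
```

A caller would pass the same fd and offset it just wrote to, which is the
information available at the FileWrite level.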
Moreover, for this issue to have a significant impact, another backend
would have to need this very buffer at that very moment. ISTM that the
performance regression you are arguing about is on random-IO-bound
performance, that is a few 100 tps in the best case, on very large
databases, hence with a lot of buffers. The probability of such a
collision is thus very small, so it would not explain a significant
regression.
> Especially on a system that's bottlenecked on IO that means we'll
> frequently hold content locks for a noticeable amount of time, while
> flushing blocks, without any need to.
I'm not that sure it is really noticeable, because sync_file_range does
not wait for completion.
> Even if that's not the reason for the slowdowns I observed, I think this
> fact gives further credence to the current "pending flushes" tracking
> residing on the wrong level.
ISTM that I put the tracking at the level where the information is
available without having to recompute it several times, as the flush needs
to know the fd and offset. Doing it differently would mean more code and
translating buffer to file/offset several times, I think.
Also, maybe you could answer a question I had about the performance
regression you observed. I could not find the post where you gave the
detailed information about it, and I would like to try reproducing it: what
are the exact settings and conditions (shared_buffers, pgbench scaling,
host memory, ...), what is the observed regression (tps? other?), and what
is the responsiveness of the database under the regression (e.g. % of
seconds with 0 tps, or something like that)?
--
Fabien.