Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: checkpointer continuous flushing
Msg-id alpine.DEB.2.10.1508171526400.28260@sto
In response to Re: checkpointer continuous flushing  (Fabien COELHO <coelho@cri.ensmp.fr>)
List pgsql-hackers

<Oops, stalled post, sorry wrong "From", resent..>


Hello Andres,

>> +        rc = posix_fadvise(context->fd, context->offset, [...]
> 
> I'm a bit wary that this might cause significant regressions on
> platforms not supporting sync_file_range, but support posix_fadvise()
> for workloads that are bigger than shared_buffers. Consider what happens
> if the workload does *not* fit into shared_buffers but *does* fit into
> the OS's buffer cache. Suddenly reads will go to disk again, no?

That is an interesting question!

My current thinking is "maybe yes, maybe no" :-), as it may depend on the OS 
implementation of posix_fadvise, so it may differ between OSes.

This is one reason why I think that flushing should be kept as a GUC, even 
if the sort GUC is removed and sorting is always on. The sync_file_range 
implementation is clearly always very beneficial on Linux, whereas 
posix_fadvise may or may not induce good behavior, depending on the 
underlying system.
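
To make the two code paths concrete, here is a minimal sketch of how such a 
flush hint might be issued. It is not the code from the patch: the name 
flush_hint and the platform test on __linux__ are assumptions made for 
illustration only. It shows why the two calls are not equivalent: 
sync_file_range only starts writeback, while POSIX_FADV_DONTNEED also tells 
the OS that the pages are no longer needed, so whether they are then dropped 
from the OS cache is up to the implementation.

```c
#define _GNU_SOURCE             /* exposes sync_file_range() on glibc */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not from the patch): ask the OS to start writing
 * back the byte range [offset, offset + nbytes) of fd.
 * Returns 0 on success, or an errno-style error number on failure.
 */
static int
flush_hint(int fd, off_t offset, off_t nbytes)
{
#if defined(__linux__)
    /*
     * Linux: initiate asynchronous writeback of the dirty pages in the
     * range. This call does not ask for the pages to be evicted.
     */
    if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) != 0)
        return errno;           /* -1 was returned, error is in errno */
    return 0;
#elif defined(POSIX_FADV_DONTNEED)
    /*
     * Elsewhere: advise that the range will not be needed soon. What the
     * OS does with this advice is implementation-defined; it may also
     * discard the cached pages. posix_fadvise() returns the error number
     * directly instead of setting errno.
     */
    return posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
#else
    /* No flush hint available on this platform: do nothing. */
    (void) fd;
    (void) offset;
    (void) nbytes;
    return 0;
#endif
}
```

A caller would invoke something like flush_hint(fd, offset, nbytes) after 
writing a run of buffers to the same file. The fallback branch is where the 
concern about re-reading data from disk would apply.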

This is also a reason why the default value for the flush GUC is currently 
set to false in the patch. The documentation should advise turning it on for 
Linux and testing on other systems. Alternatively, if Linux is assumed to be 
a common host, the default could be on, with a note that on some systems it 
may be better to turn it off. (Another reason to keep it "off" is that I'm 
not sure what happens with such HD flushing features on virtual servers.)

Overall, I'm not pessimistic, because I've seen I/O storms on a FreeBSD host, 
and they were as bad as on Linux (namely the database and even the box were 
offline for long minutes...). If you can avoid that, having to read back 
some data may not be that bad a price to pay.

The issue is largely mitigated if the data is not removed from 
shared_buffers, because the OS buffer is then just a copy of data that is 
already held. What I would do on such systems is increase shared_buffers and 
keep flushing on, that is, count less on the system cache and more on 
postgres' own cache.

Overall, I'm not convinced that the practice of relying on the OS cache is a 
good one, given what the OS does with it, at least on Linux.

Now, if someone could provide a dedicated box with posix_fadvise (say 
FreeBSD, maybe others...) for testing, that would allow us to provide data 
instead of speculating... and then maybe to decide to change the flush GUC's 
default value.

-- 
Fabien.


