<Oops, stalled post, sorry wrong "From", resent..>
Hello Andres,
>> + rc = posix_fadvise(context->fd, context->offset, [...]
>
> I'm a bit wary that this might cause significant regressions on
> platforms not supporting sync_file_range, but support posix_fadvise()
> for workloads that are bigger than shared_buffers. Consider what happens
> if the workload does *not* fit into shared_buffers but *does* fit into
> the OS's buffer cache. Suddenly reads will go to disk again, no?
That is an interesting question!
My current thinking is "maybe yes, maybe no" :-), as it depends on how each OS
implements posix_fadvise, so the behavior may differ from one OS to another.
This is one reason why I think that flushing should be kept as a guc, even if
the sort guc is removed and sorting is always on. The sync_file_range
implementation is clearly always very beneficial on Linux, while posix_fadvise
may or may not induce good behavior depending on the underlying system.
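To make the distinction concrete, here is a minimal sketch of the two code
paths. It is not the code from the patch: the helper name flush_range_hint is
made up for illustration, and the compile-time test on SYNC_FILE_RANGE_WRITE
merely stands in for whatever configure check the real code would use.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc exposes sync_file_range only with this */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper, for illustration only: hint the kernel to start
 * writing back the dirty pages in [offset, offset + nbytes) of fd.
 * Returns 0 on success, or an errno value on failure.
 */
static int
flush_range_hint(int fd, off_t offset, off_t nbytes)
{
#if defined(__linux__) && defined(SYNC_FILE_RANGE_WRITE)
    /*
     * Linux: initiate asynchronous write-out of the dirty pages in the
     * range. The pages are not evicted from the OS page cache.
     */
    if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) != 0)
        return errno;
    return 0;
#else
    /*
     * Other systems: POSIX_FADV_DONTNEED is only advice. Whether the OS
     * writes out dirty pages, drops clean ones from its cache, or ignores
     * the call is implementation dependent, which is the uncertainty
     * discussed above. posix_fadvise returns the error number directly.
     */
    return posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
#endif
}
```

The point is that only the first branch has well-defined "write it out but
keep it cached" semantics; the second may also evict the pages, which is where
the re-read concern comes from.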
This is also why the default value for the flush guc is currently set to
false in the patch. The documentation should advise turning it on for Linux
and testing on other systems. Or, if Linux is assumed to be the most common
host, the default could be on, with a note that on some systems it may be
better to turn it off. (Another reason to keep it "off" is that I'm not sure
what happens with such HD flushing features on virtual servers.)
Overall, I'm not pessimistic, because I've seen I/O storms on a FreeBSD host
and they were as bad as on Linux (namely the database and even the box were
offline for long minutes...), and if you can avoid that, having to read back
some data may not be too high a price to pay.
The issue is largely mitigated if the data is not removed from
shared_buffers, because the OS buffer is then just a copy of data already held
there. What I would do on such systems is increase shared_buffers and keep
flushing on, that is, rely less on the system cache and more on postgres'
own cache.
Overall, I'm not convinced that the practice of relying on the OS cache is a
good one, given what it does with it, at least on Linux.
Now, if someone could provide a dedicated box with posix_fadvise (say
FreeBSD, maybe others...) for testing, that would allow us to provide data
instead of speculating... and then maybe to decide to change the default
value.
--
Fabien.