Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: checkpointer continuous flushing
Msg-id alpine.DEB.2.10.1508171911360.5011@sto
In response to Re: checkpointer continuous flushing  (Andres Freund <andres@anarazel.de>)
Responses Re: checkpointer continuous flushing
List pgsql-hackers
Hello Andres,

>>> [...] posix_fadvise().
>>
>> My current thinking is "maybe yes, maybe no":-), as it may depend on the OS
>> implementation of posix_fadvise, so it may differ between OS.
>
> As long as fadvise has no 'undirty' option, I don't see how that
> problem goes away. You're telling the OS to throw the buffer away, so
> unless it ignores it that'll have consequences when you read the page
> back in.

Yep, probably.

Note that we are talking about checkpoints, which "write" buffers out
*but* keep them in shared_buffers nevertheless. As the buffer is kept, the
OS page is a duplicate, and freeing it should do no harm, at least not
immediately.

The situation is different if the memory is reused in between, which is 
the work of the bgwriter I think, based on LRU/LFU heuristics, but such 
writes are not flushed by the current patch.

Now, if a buffer was recently updated it should not be selected by the
bgwriter, if the LRU/LFU heuristics work as expected, which mitigates the
issue somewhat...

To sum up, I agree that it is indeed possible that flushing with 
posix_fadvise could reduce read OS-memory hits on some systems for some 
workloads, although not on Linux, see below.

So the option is best kept as "off" for now, without further data, I'm 
fine with that.

> [...] I'd say it should then be an os-specific default. No point in 
> making people work for it needlessly on linux and/or elsewhere.

Ok. Version 9 attached does that, "on" for Linux, "off" for others because 
of the potential issues you mentioned.
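
As a rough illustration of an OS-specific default, and not the code from the
attached patch: the macro and function names below are hypothetical, only the
compiler-predefined `__linux__` macro is standard.

```c
#include <stdbool.h>

/*
 * Hypothetical names, not taken from the patch: pick the compile-time
 * default of the flush option depending on the target OS.
 */
#if defined(__linux__)
#define DEFAULT_CHECKPOINT_FLUSH true   /* "on" for Linux */
#else
#define DEFAULT_CHECKPOINT_FLUSH false  /* "off" elsewhere, pending data */
#endif

/* Returns the default value of the (hypothetical) flush setting. */
bool
default_checkpoint_flush(void)
{
    return DEFAULT_CHECKPOINT_FLUSH;
}
```

Users on other systems could still switch the option on explicitly; only the
default differs.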

>> (Another reason to keep it "off" is that I'm not sure about what
>> happens with such HD flushing features on virtual servers).
>
> I don't see how that matters? Either the host will entirely ignore
> flushing, and thus the sync_file_range and the fsync won't cost much, or
> fsync will be honored, in which case the pre-flushing is helpful.

Possibly. I know that I do not know:-)  The distance between the database
and the real hardware is so great in a VM that I think it may have any
effect: good, bad or none:-)

>> Overall, I'm not pessimistic, because I've seen I/O storms on a FreeBSD host
>> and it was as bad as Linux (namely the database and even the box was offline
>> for long minutes...), and if you can avoid that, having to read back some
>> data may be not that bad a down payment.
>
> I don't see how that'd alleviate my fear.

I'm trying to mitigate your fears, not to alleviate them:-)

> Sure, the latency for many workloads will be better, but I don't see how
> that argument says anything about the reads?

It just says that there may be a compromise, better in some cases, possibly
not so in others, because posix_fadvise does not really say what the
database would like to say to the OS. This is why I wrote such a large
comment about it in the source file in the first place.

> And we'll not just use this in cases it'd be beneficial...

I'm fine if it is off by default for some systems. If people want to avoid
write stalls they can use the option, but it may have an adverse effect on
the tps in some cases, that's life? Not using the option also has adverse
effects in some cases, because you have write stalls... and currently you
do not have the choice, so having it would be progress.

>> The issue is largely mitigated if the data is not removed from
>> shared_buffers, because the OS buffer is just a copy of already held data.
>> What I would do on such systems is to increase shared_buffers and keep
>> flushing on, that is to count less on the system cache and more on postgres
>> own cache.
>
> That doesn't work that well for a bunch of reasons. For one it's
> completely non-adaptive. With the OS's page cache you can rely on free
> memory being used for caching *and* it be available should a query or
> another program need lots of memory.

Yep. I was thinking about a dedicated database server, not a shared one.

>> Overall, I'm not convinced that the practice of relying on the OS cache is a
>> good one, given what it does with it, at least on Linux.
>
> The alternatives aren't super realistic near-term though. Using direct
> IO efficiently on the set of operating systems we support is
> *hard*. [...]

Sure.  This is not necessarily what I had in mind.

Currently pg "write"s stuff to the OS, and then suddenly calls "fsync" out
of the blue, hoping that in between the OS will actually have done a good
job with the underlying hardware.  This is pretty naive: the fsync
generates write storms, and the database is offline meanwhile. Trying to
improve these things is the motivation for this patch.
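
To make the contrast concrete, here is a minimal sketch, assuming a plain file
descriptor and one block per call; it is not the patch itself, and the helper
name `write_and_hint` is hypothetical. Instead of only writing and leaving
everything to a later fsync, each write is followed by a hint asking the OS to
start writeback: sync_file_range where it is available, posix_fadvise
otherwise.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes sync_file_range() on Linux */
#endif
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, not from the patch: write one block, then hint the
 * OS to start writing it back instead of waiting for the final fsync().
 * Returns -1 if the write failed, 0 if the hint was accepted, 1 if the
 * write succeeded but the hint failed or is unavailable.
 */
int
write_and_hint(int fd, const char *buf, size_t len, off_t offset)
{
    assert(buf != NULL && len > 0);

    /* A short write is treated as an error in this sketch. */
    if (pwrite(fd, buf, len, offset) != (ssize_t) len)
        return -1;

#if defined(__linux__) && defined(SYNC_FILE_RANGE_WRITE)
    /* Starts asynchronous writeback of the range; does not wait for it. */
    return sync_file_range(fd, offset, (off_t) len,
                           SYNC_FILE_RANGE_WRITE) == 0 ? 0 : 1;
#elif defined(POSIX_FADV_DONTNEED)
    /* Advisory only; its effect on dirty pages depends on the OS. */
    return posix_fadvise(fd, offset, (off_t) len,
                         POSIX_FADV_DONTNEED) == 0 ? 0 : 1;
#else
    return 1;                   /* no hint available on this platform */
#endif
}
```

The caller still needs an fsync at the end for durability; the hint is only
meant to spread the I/O over time so that the final fsync has less to do.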

Now if you think of the bgwriter, it does pretty much the same, and
may well generate plenty of random I/Os, because the underlying
LRU/LFU heuristics used to select buffers do not care about the file
structures.

So I think that to get good performance the database must take some 
control over the OS. That does not mean that direct I/O needs to be 
involved, although maybe it could, but this patch shows that it is not 
needed to improve things.

>> Now, if someone could provide a dedicated box with posix_fadvise (say
>> FreeBSD, maybe others...) for testing that would allow to provide data
>> instead of speculating... and then maybe to decide to change its default
>> value.
>
> Testing, as an approximation, how it turns out to work on linux would be
> a good step.

Do you mean testing with posix_fadvise on Linux?

I did think about it, but the documented behavior of this call on Linux is
disappointing: if the buffer has been written to disk, it is freed by the
OS. If not, nothing is done. Given that the flush is called pretty close
after writes, mostly the buffer will not have been written to disk yet,
and the call would just be a no-op... So I concluded that there is no
point in trying that on Linux because it will have no effect other than
losing some time, IMO.
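
For reference, the Linux posix_fadvise(2) man page notes that pages not yet
written out are unaffected by POSIX_FADV_DONTNEED, and suggests calling fsync
or fdatasync first if the pages must really be released. A minimal sketch of
that ordering follows; the helper name `drop_from_os_cache` is hypothetical
and this is not what the patch does.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* exposes posix_fadvise() and fdatasync() */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the OS to drop a file range from its page cache.
 * A len of 0 means "up to the end of the file".
 * Returns 0 on success, otherwise an errno-style error number.
 */
int
drop_from_os_cache(int fd, off_t offset, off_t len)
{
    assert(fd >= 0);

    /* On Linux dirty pages are left alone, so write them out first. */
    if (fdatasync(fd) != 0)
        return errno;

    /* posix_fadvise() returns the error number directly, not -1. */
    return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}
```

Without the fdatasync step, a call issued right after the write would mostly
find dirty pages and do nothing, which is the no-op described above.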

Really, a useful test would be FreeBSD, where posix_fadvise does move
things to disk, although the actual offsets & length are ignored, but I do
not think that it would be a problem. I do not know about other systems
and what they do with posix_fadvise.

-- 
Fabien.
