Re: checkpointer continuous flushing - Mailing list pgsql-hackers
From | Fabien COELHO |
---|---|
Subject | Re: checkpointer continuous flushing |
Date | |
Msg-id | alpine.DEB.2.10.1508171911360.5011@sto |
In response to | Re: checkpointer continuous flushing (Andres Freund <andres@anarazel.de>) |
Responses | Re: checkpointer continuous flushing |
List | pgsql-hackers |
Hello Andres,

>>> [...] posix_fadvise().
>>
>> My current thinking is "maybe yes, maybe no":-), as it may depend on the OS
>> implementation of posix_fadvise, so it may differ between OS.
>
> As long as fadvise has no 'undirty' option, I don't see how that
> problem goes away. You're telling the OS to throw the buffer away, so
> unless it ignores it that'll have consequences when you read the page
> back in.

Yep, probably.

Note that we are talking about checkpoints, which "write" buffers out *but*
keep them nevertheless. As the buffer is kept, the OS page is a duplicate,
and freeing it should not harm, at least immediately.

The situation is different if the memory is reused in between, which is the
work of the bgwriter I think, based on LRU/LFU heuristics, but such writes
are not flushed by the current patch.

Now, if a buffer was recently updated it should not be selected by the
bgwriter, if the LRU/LFU heuristics work as expected, which mitigates the
issue somewhat...

To sum up, I agree that it is indeed possible that flushing with
posix_fadvise could reduce read OS-memory hits on some systems for some
workloads, although not on Linux, see below.

So the option is best kept as "off" for now, without further data, I'm fine
with that.

> [...] I'd say it should then be an os-specific default. No point in
> making people work for it needlessly on linux and/or elsewhere.

Ok. Version 9 attached does that, "on" for Linux, "off" for others because
of the potential issues you mentioned.

>> (Another reason to keep it "off" is that I'm not sure about what
>> happens with such HD flushing features on virtual servers).
>
> I don't see how that matters? Either the host will entirely ignore
> flushing, and thus the sync_file_range and the fsync won't cost much, or
> fsync will be honored, in which case the pre-flushing is helpful.

Possibly.
I know that I do not know:-) The distance between the database and the real
hardware is so great in a VM that I think it may have any effect,
including good, bad or none:-)

>> Overall, I'm not pessimistic, because I've seen I/O storms on a FreeBSD host
>> and it was as bad as Linux (namely the database and even the box was offline
>> for long minutes...), and if you can avoid that having to read back some
>> data may be not that bad a down payment.
>
> I don't see how that'd alleviate my fear.

I'm trying to mitigate your fears, not to alleviate them:-)

> Sure, the latency for many workloads will be better, but I don't how
> that argument says anything about the reads?

It just says that there may be a compromise, better in some cases, possibly
not so in others, because posix_fadvise does not really say what the
database would like to say to the OS. This is why I wrote such a large
comment about it in the source file in the first place.

> And we'll not just use this in cases it'd be beneficial...

I'm fine if it is off by default for some systems. If people want to avoid
write stalls they can use the option, but it may have an adverse effect on
the tps in some cases, that's life? Not using the option also has adverse
effects in some cases, because you have write stalls... and currently you do
not have the choice, so it would be progress.

>> The issue is largely mitigated if the data is not removed from
>> shared_buffers, because the OS buffer is just a copy of already hold data.
>> What I would do on such systems is to increase shared_buffers and keep
>> flushing on, that is to count less on the system cache and more on postgres
>> own cache.
>
> That doesn't work that well for a bunch of reasons. For one it's
> completely non-adaptive. With the OS's page cache you can rely on free
> memory being used for caching *and* it be available should a query or
> another program need lots of memory.

Yep. I was thinking about a dedicated database server, not a shared one.
>> Overall, I'm not convince that the practice of relying on the OS cache is a
>> good one, given what it does with it, at least on Linux.
>
> The alternatives aren't super realistic near-term though. Using direct
> IO efficiently on the set of operating systems we support is
> *hard*. [...]

Sure. This is not necessarily what I had in mind.

Currently pg "write"s stuff to the OS, and then suddenly calls "fsync" out
of the blue, hoping that in between the OS will actually have done a good
job with the underlying hardware. This is pretty naive: the fsync generates
write storms, and the database is offline. Trying to improve these things is
the motivation for this patch.

Now if you think of the bgwriter, it does pretty much the same, and may well
generate plenty of random I/Os, because the underlying LRU/LFU heuristics
used to select buffers do not care about the file structures.

So I think that to get good performance the database must take some control
over the OS. That does not mean that direct I/O needs to be involved,
although maybe it could, but this patch shows that it is not needed to
improve things.

>> Now, if someone could provide a dedicated box with posix_fadvise (say
>> FreeBSD, maybe others...) for testing that would allow to provide data
>> instead of speculating... and then maybe to decide to change its default
>> value.
>
> Testing, as an approximation, how it turns out to work on linux would be
> a good step.

Do you mean testing with posix_fadvise on Linux?

I did think about it, but the documented behavior of this call on Linux is
disappointing: if the buffer has been written to disk, it is freed by the
OS. If not, nothing is done. Given that the flush is called pretty soon
after the writes, mostly the buffer will not have been written to disk yet,
and the call would just be a no-op... So I concluded that there is no point
in trying that on Linux because it will have no effect other than losing
some time, IMO.
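For illustration only, the write-then-hint sequence discussed above can be sketched as follows. This is a minimal Python sketch and not code from the patch: the function name `write_with_flush_hint`, the `hint_every` parameter and the 8 kB `BLOCK` constant are assumptions made for the example. `os.posix_fadvise` and `os.fsync` are thin wrappers around the corresponding system calls and exist only on platforms that provide them.

```python
import os
import tempfile

BLOCK = 8192  # assumed page size for the example


def write_with_flush_hint(path, n_blocks, hint_every=32):
    """Write n_blocks pages, issuing a posix_fadvise hint every hint_every pages.

    Returns the number of hints issued. Hypothetical helper, for illustration.
    """
    hints = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        page = b"\0" * BLOCK
        start = 0  # offset of the first byte not yet covered by a hint
        for i in range(n_blocks):
            os.write(fd, page)
            written = (i + 1) * BLOCK
            if (i + 1) % hint_every == 0:
                # Tell the OS these pages are no longer needed in its cache.
                # Per the behavior described in the message, Linux frees only
                # pages already written to disk, so a hint issued right after
                # write() would mostly be a no-op there.
                os.posix_fadvise(fd, start, written - start,
                                 os.POSIX_FADV_DONTNEED)
                start = written
                hints += 1
        # The final fsync still has to push out whatever is left dirty.
        os.fsync(fd)
    finally:
        os.close(fd)
    return hints
```

If the Linux behavior is as described, the hints in this sketch add system-call overhead without starting any early writeback, and the final fsync still carries the whole load.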
Really, a useful test would be FreeBSD, where posix_fadvise does move things
to disk, although the actual offsets & length are ignored, but I do not
think that it would be a problem. I do not know about other systems and what
they do with posix_fadvise.

-- 
Fabien.