Thread: posix_fadvise missing in the walsender
In access/transam/xlog.c we give the OS buffer caching a hint that we won't need a WAL file any time soon with posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED); before closing the WAL file, but only if we don't have walsenders. That's reasonable because the walsender will reopen that same file shortly after. However the walsender doesn't call posix_fadvise once it's done with the file and I'm proposing to add this to walsender.c for consistency as well. Since there could be multiple walsenders, only the "slowest" one should call this function. Finding out the slowest walsender can be done by inspecting the shared memory and looking at the sentPtr of each walsender. Any comments?
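A minimal sketch of the "only the slowest walsender issues the hint" check proposed above. The types and names here (WalPos, SenderSlot, last_sender_done_with) are hypothetical stand-ins, not PostgreSQL's actual shared-memory structures, and real code would have to take the appropriate lock before reading another walsender's sentPtr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a WAL position; not PostgreSQL's XLogRecPtr. */
typedef uint64_t WalPos;

/* Hypothetical per-walsender slot, as it might be laid out in shared memory. */
typedef struct SenderSlot {
    bool   in_use;    /* slot belongs to a live walsender */
    WalPos sent_ptr;  /* how far this walsender has sent so far */
} SenderSlot;

/*
 * Return true if every other active walsender has already sent past
 * segment_end, i.e. the caller (slot index `self`) is the last one to
 * finish the segment and is the one that should issue the cache hint.
 */
bool
last_sender_done_with(const SenderSlot *slots, size_t nslots,
                      size_t self, WalPos segment_end)
{
    assert(slots != NULL && self < nslots);

    for (size_t i = 0; i < nslots; i++) {
        if (i == self || !slots[i].in_use)
            continue;               /* skip ourselves and unused slots */
        if (slots[i].sent_ptr < segment_end)
            return false;           /* someone slower still needs the file */
    }
    return true;
}
```

Under these assumptions, a walsender would call posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) before closing a finished segment only when this check returns true.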
On 17.02.2013 14:55, Joachim Wieland wrote: > In access/transam/xlog.c we give the OS buffer caching a hint that we > won't need a WAL file any time soon with > > posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED); > > before closing the WAL file, but only if we don't have walsenders. > That's reasonable because the walsender will reopen that same file > shortly after. > > However the walsender doesn't call posix_fadvise once it's done with > the file and I'm proposing to add this to walsender.c for consistency > as well. > > Since there could be multiple walsenders, only the "slowest" one > should call this function. Finding out the slowest walsender can be > done by inspecting the shared memory and looking at the sentPtr of > each walsender. I doubt it's worth it, the OS cache generally does a reasonable job at deciding what to keep. In the non-walsender case, it's pretty clear that we don't need the WAL file anymore, but if we need to work any harder than a one-line posix_fadvise call, meh. I could be convinced otherwise with some performance test results, of course. - Heikki
On Mon, Feb 18, 2013 at 2:16 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 17.02.2013 14:55, Joachim Wieland wrote: >> >> In access/transam/xlog.c we give the OS buffer caching a hint that we >> won't need a WAL file any time soon with >> >> posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED); >> >> before closing the WAL file, but only if we don't have walsenders. >> That's reasonable because the walsender will reopen that same file >> shortly after. >> >> However the walsender doesn't call posix_fadvise once it's done with >> the file and I'm proposing to add this to walsender.c for consistency >> as well. >> >> Since there could be multiple walsenders, only the "slowest" one >> should call this function. Finding out the slowest walsender can be >> done by inspecting the shared memory and looking at the sentPtr of >> each walsender. > > > I doubt it's worth it, the OS cache generally does a reasonable job at > deciding what to keep. In the non-walsender case, it's pretty clear that we > don't need the WAL file anymore, but if we need to work any harder than a > one-line posix_fadvise call, meh. If that's the case, why have the advisory call at all? The OS is being lied to (in some cases)... merlin
On 19 February 2013 20:19, Merlin Moncure <mmoncure@gmail.com> wrote: >>> In access/transam/xlog.c we give the OS buffer caching a hint that we >>> won't need a WAL file any time soon with >>> >>> posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED); >>> > If that's the case, why have the advisory call at all? The OS is > being lied to (in some cases)... I agree with Merlin and Joachim - if we have the call in one place, we should have it in both. This means that if a standby fails it will likely have to re-read these files from disk. Cool, we can live with that. Patch please, -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Feb 19, 2013 at 5:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >>>> In access/transam/xlog.c we give the OS buffer caching a hint that we >>>> won't need a WAL file any time soon with >>>> >>>> posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED); >>>> > > I agree with Merlin and Joachim - if we have the call in one place, we > should have it in both. You could argue that if it's considered beneficial in the case with no walsender, then you should definitely have it if there are walsenders around: The walsenders reopen and read those files which gives the OS reason to believe that other processes might do the same in the near future and hence that it should not evict those pages too early. Joachim
On Tue, Feb 19, 2013 at 5:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > I agree with Merlin and Joachim - if we have the call in one place, we > should have it in both. We might want to assess whether we even want to have it in one place. I've seen cases where the existing call hurts performance, because of WAL file recycling. If we don't flush the WAL file blocks out of cache, then they're still there when we recycle the WAL file and we can overwrite them without further I/O. But if we tell the OS to blow them away, then it has to reread them when we try to overwrite the old files, and so we stall waiting for the I/O. I was able to clearly measure this problem back when I was hacking on write scalability, so it's not a purely hypothetical risk. As for the proposed optimization, I tend to doubt that it's a good idea. We're talking about doing extra work to give the OS cache a hint that may not be right anyway. Color me skeptical... but like Heikki, I'm certainly willing to be proven wrong by some actual benchmark results. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Feb 20, 2013 at 4:54 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Feb 19, 2013 at 5:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> I agree with Merlin and Joachim - if we have the call in one place, we >> should have it in both. > > We might want to assess whether we even want to have it in one place. > I've seen cases where the existing call hurts performance, because of > WAL file recycling. That's interesting, I hadn't thought about WAL recycling. I now agree that this whole thing is even more complicated: you might have an archive_command set as well, like "cp" for instance, that reads in the WAL file again, possibly even right after we called posix_fadvise on it. It appears to me that the right strategy depends on a few factors: a) what ratio of your active dataset fits into RAM? b) how many WAL files do you have? c) how long does it take for them to get recycled? d) archive_command set / wal_senders active? And recommendations for the two extremes would be: If your dataset fits mostly into RAM and you have only a few WAL files that get recycled quickly, then you don't want to evict the WAL file from the buffer cache. On the other hand, if your dataset doesn't fit into RAM and you have many WAL files that take a while until they get recycled, then you should consider hinting to the OS. If you're in that second category (I am) and you're also using the archive_command, you could just piggyback the posix_fadvise call onto your archive_command, assuming that the walsender is already done with the file at that moment. And I'm also pretty certain that the setup Robert used for the write scalability tests fell into the first category. So given the above, I think it's possible to come up with benchmarks that prove whatever you want to prove :-) Joachim
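A rough sketch of the hint that could be piggybacked onto an archive_command, for example from a small helper program run after the copy. The function name drop_cache_hint is made up for illustration; only open, posix_fadvise and close are real calls. Note that the hint is purely advisory, and the kernel is free to ignore it.

```c
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Tell the OS that the cached pages of `path` are not expected to be needed
 * again soon. Returns 0 on success, -1 if the file cannot be opened, or the
 * positive error number returned by posix_fadvise.
 */
int
drop_cache_hint(const char *path)
{
    int fd;
    int rc;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* offset 0 with length 0 means "from the start to the end of the file" */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    close(fd);
    return rc;
}
```

Whether this helps depends on the factors listed above: with few, quickly recycled WAL files it may well hurt rather than help.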
On Wed, Feb 20, 2013 at 9:49 PM, Joachim Wieland <joe@mcknight.de> wrote: > So given the above, I think it's possible to come up with benchmarks > that prove whatever you want to prove :-) Yeah. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Feb 20, 2013 at 7:54 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Feb 19, 2013 at 5:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> I agree with Merlin and Joachim - if we have the call in one place, we >> should have it in both. > > We might want to assess whether we even want to have it in one place. > I've seen cases where the existing call hurts performance, because of > WAL file recycling. If we don't flush the WAL file blocks out of > cache, then they're still there when we recycle the WAL file and we > can overwrite them without further I/O. But if we tell the OS to blow > them away, then it has to reread them when we try to overwrite the old > files, and so we stall waiting for the I/O. Does the kernel really read a data block from disk into memory in order to immediately overwrite it? I would have thought it would optimize that away, at least if the writes are sized and aligned to 512- or 1024-byte blocks (which WAL should be). Well, stranger things than that happen, I guess. (For example on ext4, when a file with dirty pages goes away due to another file getting renamed over the top of it, the disappearing file automatically gets fsynced, or the equivalent.) Cheers, Jeff
On Thu, Feb 21, 2013 at 12:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > Does the kernel really read a data block from disk into memory in > order to immediately overwrite it? I would have thought it would > optimize that away, at least if the writes are sized and aligned to > 512- or 1024-byte blocks (which WAL should be). Now that you mention that I agree it seems strange, but that's what I saw. /me scratches head It does seem pretty odd, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
* Jeff Janes: > Does the kernel really read a data block from disk into memory in > order to immediately overwrite it? I would have thought it would > optimize that away, at least if the writes are sized and aligned to > 512- or 1024-byte blocks (which WAL should be). With Linux, you'd have to use O_DIRECT to get that effect (but don't do that). Otherwise writes happen at page-size granularity, so writing in 512- or 1024-byte blocks should really trigger a read-modify-write cycle.
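To illustrate the page-granularity point, here is a small sketch of the arithmetic only: it reports whether a buffered write leaves some page partially covered, which is the case where an uncached page would have to be read before it can be modified. The function name is hypothetical and the page size is passed in as a parameter; a real program would query it with sysconf(_SC_PAGESIZE).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if a write of `len` bytes at `offset` covers only part of at
 * least one page of size `page_size`, i.e. it does not both start and end on
 * a page boundary.
 */
bool
write_touches_partial_page(uint64_t offset, uint64_t len, uint64_t page_size)
{
    assert(page_size > 0);

    if (len == 0)
        return false;               /* nothing written, nothing to read */

    /* Partial if either edge of the write falls inside a page. */
    return (offset % page_size != 0) || ((offset + len) % page_size != 0);
}
```

With an assumed 4096-byte page, a 512-byte write at offset 0 is a partial-page write, while an 8192-byte write at offset 4096 covers two whole pages.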