Thread: posix_fadvise missing in the walsender

From: Joachim Wieland
In access/transam/xlog.c we give the OS buffer caching a hint that we
won't need a WAL file any time soon with
   posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);

before closing the WAL file, but only if we don't have walsenders.
That's reasonable because the walsender will reopen that same file
shortly after.
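
A minimal sketch of the pattern described above, not the actual xlog.c code. The helper name close_with_hint and its expect_reread flag are hypothetical and exist only for illustration. The hint is issued right before close() and skipped when another process is expected to read the file again soon.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L     /* for posix_fadvise() */
#endif

#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper: close fd, first telling the kernel that its cached
 * pages are not needed, unless a re-read is expected shortly.
 * Returns 0 on success, -1 if close() fails.
 */
static int
close_with_hint(int fd, bool expect_reread)
{
    assert(fd >= 0);

#ifdef POSIX_FADV_DONTNEED
    if (!expect_reread)
    {
        /*
         * offset 0 with len 0 covers the whole file. The call is purely
         * advisory, so a failure is deliberately ignored.
         */
        (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    }
#else
    (void) expect_reread;
#endif

    return close(fd);
}
```

Passing expect_reread = true corresponds to the "walsenders exist" case, where the hint is skipped.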

However the walsender doesn't call posix_fadvise once it's done with
the file and I'm proposing to add this to walsender.c for consistency
as well.

Since there could be multiple walsenders, only the "slowest" one
should call this function. Finding out the slowest walsender can be
done by inspecting the shared memory and looking at the sentPtr of
each walsender.
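
The selection could look roughly like the sketch below. SenderSlot, WalPos and is_slowest_sender are hypothetical stand-ins for the real shared-memory structures, and the locking a real implementation would need is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the shared-memory walsender state. */
typedef uint64_t WalPos;            /* byte position in the WAL stream */

typedef struct
{
    bool   active;                  /* slot currently used by a walsender? */
    WalPos sent_ptr;                /* WAL has been sent up to this position */
} SenderSlot;

/*
 * Return true if the sender in slot `self` is the slowest active one,
 * i.e. no other active sender has a smaller sent_ptr. On a tie the lower
 * slot index wins, so exactly one sender is chosen.
 */
static bool
is_slowest_sender(const SenderSlot *slots, size_t nslots, size_t self)
{
    assert(self < nslots && slots[self].active);

    for (size_t i = 0; i < nslots; i++)
    {
        if (i == self || !slots[i].active)
            continue;
        if (slots[i].sent_ptr < slots[self].sent_ptr)
            return false;
        if (slots[i].sent_ptr == slots[self].sent_ptr && i < self)
            return false;
    }
    return true;
}
```

A walsender would check this when it finishes a file and call posix_fadvise only if the function returns true.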

Any comments?



Re: posix_fadvise missing in the walsender

From: Heikki Linnakangas
On 17.02.2013 14:55, Joachim Wieland wrote:
> In access/transam/xlog.c we give the OS buffer caching a hint that we
> won't need a WAL file any time soon with
>
>      posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
>
> before closing the WAL file, but only if we don't have walsenders.
> That's reasonable because the walsender will reopen that same file
> shortly after.
>
> However the walsender doesn't call posix_fadvise once it's done with
> the file and I'm proposing to add this to walsender.c for consistency
> as well.
>
> Since there could be multiple walsenders, only the "slowest" one
> should call this function. Finding out the slowest walsender can be
> done by inspecting the shared memory and looking at the sentPtr of
> each walsender.

I doubt it's worth it, the OS cache generally does a reasonable job at 
deciding what to keep. In the non-walsender case, it's pretty clear that 
we don't need the WAL file anymore, but if we need to work any harder 
than a one-line posix_fadvise call, meh. I could be convinced otherwise 
with some performance test results, of course.

- Heikki



Re: posix_fadvise missing in the walsender

From: Merlin Moncure
On Mon, Feb 18, 2013 at 2:16 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 17.02.2013 14:55, Joachim Wieland wrote:
>>
>> In access/transam/xlog.c we give the OS buffer caching a hint that we
>> won't need a WAL file any time soon with
>>
>>      posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
>>
>> before closing the WAL file, but only if we don't have walsenders.
>> That's reasonable because the walsender will reopen that same file
>> shortly after.
>>
>> However the walsender doesn't call posix_fadvise once it's done with
>> the file and I'm proposing to add this to walsender.c for consistency
>> as well.
>>
>> Since there could be multiple walsenders, only the "slowest" one
>> should call this function. Finding out the slowest walsender can be
>> done by inspecting the shared memory and looking at the sentPtr of
>> each walsender.
>
>
> I doubt it's worth it, the OS cache generally does a reasonable job at
> deciding what to keep. In the non-walsender case, it's pretty clear that we
> don't need the WAL file anymore, but if we need to work any harder than a
> one-line posix_fadvise call, meh.

If that's the case, why have the advisory call at all?  The OS is
being lied to (in some cases)...

merlin



Re: posix_fadvise missing in the walsender

From: Simon Riggs
On 19 February 2013 20:19, Merlin Moncure <mmoncure@gmail.com> wrote:

>>> In access/transam/xlog.c we give the OS buffer caching a hint that we
>>> won't need a WAL file any time soon with
>>>
>>>      posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
>>>

> If that's the case, why have the advisory call at all?  The OS is
> being lied to (in some cases)...

I agree with Merlin and Joachim - if we have the call in one place, we
should have it in both.

This means that if a standby fails it will likely have to re-read
these files from disk. Cool, we can live with that.

Patch please,

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: posix_fadvise missing in the walsender

From: Joachim Wieland
On Tue, Feb 19, 2013 at 5:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> In access/transam/xlog.c we give the OS buffer caching a hint that we
>>>> won't need a WAL file any time soon with
>>>>
>>>>      posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
>>>>
>
> I agree with Merlin and Joachim - if we have the call in one place, we
> should have it in both.

You could argue that if it's considered beneficial in the case with no
walsender, then you should definitely have it if there are walsenders
around:
The walsenders reopen and read those files which gives the OS reason
to believe that other processes might do the same in the near future
and hence that it should not evict those pages too early.


Joachim



Re: posix_fadvise missing in the walsender

From: Robert Haas
On Tue, Feb 19, 2013 at 5:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I agree with Merlin and Joachim - if we have the call in one place, we
> should have it in both.

We might want to assess whether we even want to have it in one place.
I've seen cases where the existing call hurts performance, because of
WAL file recycling.  If we don't flush the WAL file blocks out of
cache, then they're still there when we recycle the WAL file and we
can overwrite them without further I/O.  But if we tell the OS to blow
them away, then it has to reread them when we try to overwrite the old
files, and so we stall waiting for the I/O.  I was able to clearly
measure this problem back when I was hacking on write scalability, so
it's not a purely hypothetical risk.

As for the proposed optimization, I tend to doubt that it's a good
idea.  We're talking about doing extra work to give the OS cache a
hint that may not be right anyway.  Color me skeptical...  but like
Heikki, I'm certainly willing to be proven wrong by some actual
benchmark results.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: posix_fadvise missing in the walsender

From: Joachim Wieland
On Wed, Feb 20, 2013 at 4:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 19, 2013 at 5:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> I agree with Merlin and Joachim - if we have the call in one place, we
>> should have it in both.
>
> We might want to assess whether we even want to have it in one place.
> I've seen cases where the existing call hurts performance, because of
> WAL file recycling.

That's interesting, I hadn't thought about WAL recycling.

I now agree that this whole thing is even more complicated: you might
have an archive_command set as well, like "cp" for instance, that
reads in the WAL file again, possibly even right after we called
posix_fadvise on it.

It appears to me that the right strategy depends on a few factors:

a) what fraction of your active dataset fits into RAM?
b) how many WAL files do you have?
c) how long does it take for them to get recycled?
d) archive_command set / wal_senders active?

And recommendations for the two extremes would be:

If your dataset mostly fits into RAM and you have only a few WAL
files that get recycled quickly, then you don't want to evict the WAL
files from the buffer cache.
On the other hand, if your dataset doesn't fit into RAM and you have
many WAL files that take a while to get recycled, then you
should consider hinting to the OS.

If you're in that second category (I am) and you're also using the
archive_command you could just piggyback the posix_fadvise call onto
your archive_command, assuming that the walsender is already done with
the file at that moment. And I'm also pretty certain that Robert's
setup that he used for the write scalability tests fell into the first
category.
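
A rough sketch of what such a piggybacked archive step might look like. The function archive_and_drop_cache is hypothetical, not an existing PostgreSQL tool, and it assumes that no other process (such as a walsender) still needs the source file in cache when it runs.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L     /* for posix_fadvise() */
#endif

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical archive helper: copy src to dst, then hint that the cached
 * pages of src are no longer needed. Refuses to overwrite an existing dst.
 * Returns 0 on success, -1 on error.
 */
static int
archive_and_drop_cache(const char *src, const char *dst)
{
    char    buf[8192];
    ssize_t nread = 0;
    int     rc = -1;
    int     in;
    int     out;

    assert(src != NULL && dst != NULL);

    in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    out = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (out < 0)
    {
        close(in);
        return -1;
    }

    while ((nread = read(in, buf, sizeof(buf))) > 0)
    {
        ssize_t off = 0;

        /* write() may be partial, so loop until the chunk is written */
        while (off < nread)
        {
            ssize_t n = write(out, buf + off, (size_t) (nread - off));

            if (n < 0)
                goto done;
            off += n;
        }
    }
    /* success only if we hit EOF cleanly and the copy reached disk */
    if (nread == 0 && fsync(out) == 0)
        rc = 0;

done:
#ifdef POSIX_FADV_DONTNEED
    if (rc == 0)
        (void) posix_fadvise(in, 0, 0, POSIX_FADV_DONTNEED);    /* advisory */
#endif
    close(in);
    if (close(out) != 0)
        rc = -1;
    return rc;
}
```

Wrapped in a small program, something like this could stand in for "cp" in archive_command. The hint is only issued after a successful copy, so a failed attempt leaves the cache untouched for the retry.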

So given the above, I think it's possible to come up with benchmarks
that prove whatever you want to prove :-)


Joachim



Re: posix_fadvise missing in the walsender

From: Robert Haas
On Wed, Feb 20, 2013 at 9:49 PM, Joachim Wieland <joe@mcknight.de> wrote:
> So given the above, I think it's possible to come up with benchmarks
> that prove whatever you want to prove :-)

Yeah.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: posix_fadvise missing in the walsender

From: Jeff Janes
On Wed, Feb 20, 2013 at 7:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 19, 2013 at 5:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> I agree with Merlin and Joachim - if we have the call in one place, we
>> should have it in both.
>
> We might want to assess whether we even want to have it in one place.
> I've seen cases where the existing call hurts performance, because of
> WAL file recycling.  If we don't flush the WAL file blocks out of
> cache, then they're still there when we recycle the WAL file and we
> can overwrite them without further I/O.  But if we tell the OS to blow
> them away, then it has to reread them when we try to overwrite the old
> files, and so we stall waiting for the I/O.

Does the kernel really read a data block from disk into memory in
order to immediately overwrite it?  I would have thought it would
optimize that away, at least if the writes are sized and aligned to
512- or 1024-byte blocks (which WAL should be).  Well, stranger things
than that happen, I guess.  (For example on ext4, when a file with
dirty pages goes away due to another file getting renamed over the top
of it, the disappearing file automatically gets fsynced, or the
equivalent.)

Cheers,

Jeff



Re: posix_fadvise missing in the walsender

From: Robert Haas
On Thu, Feb 21, 2013 at 12:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Does the kernel really read a data block from disk into memory in
> order to immediately overwrite it?  I would have thought it would
> optimize that away, at least if the writes are sized and aligned to
> 512- or 1024-byte blocks (which WAL should be).

Now that you mention that I agree it seems strange, but that's what I saw.

/me scratches head

It does seem pretty odd, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: posix_fadvise missing in the walsender

From: Florian Weimer
* Jeff Janes:

> Does the kernel really read a data block from disk into memory in
> order to immediately overwrite it?  I would have thought it would
> optimize that away, at least if the writes are sized and aligned to
> 512- or 1024-byte blocks (which WAL should be).

With Linux, you'd have to use O_DIRECT to get that effect (but don't
do that). Otherwise writes happen at page-size granularity, so writing
in 512- or 1024-byte blocks should really trigger a read-modify-write
cycle.
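
To make the granularity argument concrete, here is a small sketch that checks whether a buffered write overwrites only whole pages. The function name is hypothetical, and the page size is a parameter (real code would query sysconf(_SC_PAGESIZE); 4096 is a common value). It ignores special cases such as writes at or beyond end of file, where there is nothing to read back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if a write of `len` bytes at `offset` starts and ends on
 * page boundaries, so every page it touches is overwritten completely.
 * Otherwise at least one page is only partly overwritten, and if that
 * page is not already cached it would have to be read in first.
 */
static bool
write_covers_whole_pages(uint64_t offset, uint64_t len, uint64_t page_size)
{
    assert(page_size > 0);
    return len > 0 && offset % page_size == 0 && len % page_size == 0;
}
```

With a 4096-byte page, a 512-byte write at offset 0 returns false, which is the read-modify-write case described above, while an aligned 8192-byte write returns true.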