Thread: posix_fadvise() and pg_receivexlog

From
Fujii Masao
Date:
Hi,

The WAL files that pg_receivexlog writes will basically not be re-read soon,
so we can advise the OS to release any cached pages when a WAL file is
closed. I feel inclined to change pg_receivexlog that way. Thoughts?
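
For illustration, the idea could look roughly like the following C sketch. It is not taken from pg_receivexlog; the helper name close_and_drop_cache is hypothetical. It assumes a POSIX system where posix_fadvise() is available, and it flushes the file first because the kernel generally cannot drop pages that are still dirty.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of pg_receivexlog): flush a finished WAL
 * file, hint that its cached pages may be released, then close it.
 * Returns 0 on success, -1 if fsync() or close() fails.
 */
int
close_and_drop_cache(int fd)
{
    /* Dirty pages are generally not evicted, so write them out first. */
    if (fsync(fd) != 0)
    {
        (void) close(fd);
        return -1;
    }

#if defined(POSIX_FADV_DONTNEED)
    /*
     * offset 0 with len 0 covers the whole file. The call is only a hint,
     * so its result is deliberately ignored.
     */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
#endif

    return close(fd);
}
```

The advice is best-effort: the OS is free to ignore it, and a reader that opens the file later simply pays for a disk read instead of a cache hit.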

Regards,

-- 
Fujii Masao



Re: posix_fadvise() and pg_receivexlog

From
Robert Haas
Date:
On Wed, Aug 6, 2014 at 1:39 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> The WAL files that pg_receivexlog writes will not be re-read soon basically,
> so we can advise the OS to release any cached pages when WAL file is
> closed. I feel inclined to change pg_receivexlog that way. Thought?

How do we know that the user doesn't plan to read them soon?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: posix_fadvise() and pg_receivexlog

From
Heikki Linnakangas
Date:
On 08/06/2014 08:39 PM, Fujii Masao wrote:
> Hi,
>
> The WAL files that pg_receivexlog writes will not be re-read soon basically,
> so we can advise the OS to release any cached pages when WAL file is
> closed. I feel inclined to change pg_receivexlog that way. Thought?

-1. The OS should be smart enough not to thrash the cache with files that
are written sequentially and never read. If we go down this path, we'd
need to sprinkle posix_fadvise calls into many, many places.

Anyway, who are we to say that they won't be re-read soon? You might, e.g.,
have a secondary backup site where you copy the files received by
pg_receivexlog as soon as they're completed.

- Heikki




Re: posix_fadvise() and pg_receivexlog

From
Fujii Masao
Date:
On Thu, Aug 7, 2014 at 3:59 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 08/06/2014 08:39 PM, Fujii Masao wrote:
>>
>> Hi,
>>
>> The WAL files that pg_receivexlog writes will not be re-read soon
>> basically,
>> so we can advise the OS to release any cached pages when WAL file is
>> closed. I feel inclined to change pg_receivexlog that way. Thought?
>
>
> -1. The OS should be smart enough to not thrash the cache by files that are
> written sequentially and never read.

Yep, the OS should be that smart, but I'm not sure it actually is. Maybe not,
which is why I had in mind that the server calls posix_fadvise when it closes
a WAL file.

> If we go down this path, we'd need to
> sprinkle posix_fadvises into many, many places.

Yes, that's a valid concern. But if we can prove that adding posix_fadvise in
a certain place improves performance noticeably, I'm inclined to do that.

> Anyway, who are we to say that they won't be re-read soon? You might e.g
> have a secondary backup site where you copy the files received by
> pg_receivexlog, as soon as they're completed.

So whether posix_fadvise is called or not needs to be exposed as a
user-configurable option. We would need to measure how useful exposing
that is, though.

Regards,

-- 
Fujii Masao



Re: posix_fadvise() and pg_receivexlog

From
Mitsumasa KONDO
Date:
Hi,

2014-08-07 13:47 GMT+09:00 Fujii Masao <masao.fujii@gmail.com>:
> On Thu, Aug 7, 2014 at 3:59 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> On 08/06/2014 08:39 PM, Fujii Masao wrote:
>>> The WAL files that pg_receivexlog writes will not be re-read soon
>>> basically,
>>> so we can advise the OS to release any cached pages when WAL file is
>>> closed. I feel inclined to change pg_receivexlog that way. Thought?
>>
>> -1. The OS should be smart enough to not thrash the cache by files that are
>> written sequentially and never read.

The OS's buffer strategy is optimized for the general case. Have you
forgotten the OS hackers' discussion over the last half year?

> Yep, the OS should be so smart, but I'm not sure if it actually is. Maybe not,
> so I was thinking that posix_fadvise is called when the server closes WAL file.

That's right.

>> If we go down this path, we'd need to
>> sprinkle posix_fadvises into many, many places.

Why do you aim to be perfect from the beginning?
That is how PostgreSQL itself has evolved, so your concern doesn't make sense.

>> Anyway, who are we to say that they won't be re-read soon? You might e.g
>> have a secondary backup site where you copy the files received by
>> pg_receivexlog, as soon as they're completed.
>
> So whether posix_fadvise is called or not needs to be exposed as an
> user-configurable option. We would need to measure how useful exposing
> that is, though.

By the way, does the pg_receivexlog process call fsync() at every WAL commit?
If yes, I think we need an option for no or fewer fsync() calls, for better
performance. That is common in NoSQL stores.
If no, we need an fsync() option for more reliability and data integrity.


Best regards,
--
Mitsumasa KONDO

Re: posix_fadvise() and pg_receivexlog

From
Heikki Linnakangas
Date:
On 08/07/2014 10:10 AM, Mitsumasa KONDO wrote:
> 2014-08-07 13:47 GMT+09:00 Fujii Masao <masao.fujii@gmail.com>:
>
>> On Thu, Aug 7, 2014 at 3:59 AM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>> On 08/06/2014 08:39 PM, Fujii Masao wrote:
>>>> The WAL files that pg_receivexlog writes will not be re-read soon
>>>> basically,
>>>> so we can advise the OS to release any cached pages when WAL file is
>>>> closed. I feel inclined to change pg_receivexlog that way. Thought?
>>>
>>>
>>> -1. The OS should be smart enough to not thrash the cache by files that
>> are
>>> written sequentially and never read.
>>
> OS's buffer strategy is optimized for general situation. Do you forget OS
> hackers discussion last a half of year?
>
>> Yep, the OS should be so smart, but I'm not sure if it actually is. Maybe
>> not,
>> so I was thinking that posix_fadvise is called when the server closes WAL
>> file.
>
> That's right.

Well, I'd like to hear someone from the field complaining that 
pg_receivexlog is thrashing the cache and thus reducing the performance 
of some other process. Or at least a synthetic test case that 
demonstrates that happening.

> By the way, does pg_receivexlog process have fsync() in every WAL commit?

It fsyncs each file after it finishes writing it, i.e. each WAL file is
fsync'd once.

> If yes, I think that we need no or less fsync() option for the better
> performance. It is general in NOSQL storages.
> If no, we need fsync() option for more getting reliability and data
> integrity.

Hmm. An fsync=off style option might make sense, although I doubt the 
one fsync at end of file is causing a performance problem for anyone in 
practice. Haven't heard any complaints, anyway.

An option to fsync after every commit record might make sense if you use 
pg_receivexlog with synchronous replication. Doing that would require 
parsing the WAL, though, to see where the commit records are. But then 
again, the fsync's wouldn't need to correspond to commit records. We 
could fsync just before we go to sleep to wait for more WAL to be received.
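
A rough C sketch of that last idea, fsyncing only when the receiver is about to block, might look as follows. It is not pg_receivexlog code: receive_stream and write_all are hypothetical names, a plain file descriptor stands in for the replication connection, and EINTR handling is omitted for brevity.

```c
#include <fcntl.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

/* Write all n bytes, retrying on short writes. Returns 0 or -1. */
static int
write_all(int fd, const char *buf, size_t n)
{
    while (n > 0)
    {
        ssize_t w = write(fd, buf, n);

        if (w < 0)
            return -1;
        buf += w;
        n -= (size_t) w;
    }
    return 0;
}

/*
 * Hypothetical receive loop: copy everything from in_fd into out_path and
 * fsync only when no more input is pending, i.e. just before sleeping.
 * Returns the number of fsync() calls made, or -1 on error.
 */
long
receive_stream(int in_fd, const char *out_path)
{
    char buf[8192];
    bool dirty = false;
    long nsyncs = 0;
    int out_fd = open(out_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (out_fd < 0)
        return -1;

    for (;;)
    {
        struct pollfd pfd = {.fd = in_fd, .events = POLLIN};
        int ready = poll(&pfd, 1, 0);   /* non-blocking check */
        ssize_t n;

        if (ready < 0)
            goto fail;
        if (ready == 0)
        {
            /* Nothing pending: make received data durable, then sleep. */
            if (dirty)
            {
                if (fsync(out_fd) != 0)
                    goto fail;
                dirty = false;
                nsyncs++;
            }
            if (poll(&pfd, 1, -1) < 0)  /* block until input arrives */
                goto fail;
        }

        n = read(in_fd, buf, sizeof(buf));
        if (n < 0)
            goto fail;
        if (n == 0)
            break;                      /* end of stream */
        if (write_all(out_fd, buf, (size_t) n) != 0)
            goto fail;
        dirty = true;
    }

    /* Flush whatever is still unsynced at end of stream. */
    if (dirty)
    {
        if (fsync(out_fd) != 0)
            goto fail;
        nsyncs++;
    }
    return close(out_fd) == 0 ? nsyncs : -1;

fail:
    (void) close(out_fd);
    return -1;
}
```

Under a steady stream the loop never pays for an fsync between reads; the flush happens only in the gaps, which is also when a synchronous sender would be waiting for confirmation. No WAL parsing is needed.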

- Heikki




Re: posix_fadvise() and pg_receivexlog

From
Fujii Masao
Date:
On Thu, Aug 7, 2014 at 5:02 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 08/07/2014 10:10 AM, Mitsumasa KONDO wrote:
>>
>> 2014-08-07 13:47 GMT+09:00 Fujii Masao <masao.fujii@gmail.com>:
>>
>>> On Thu, Aug 7, 2014 at 3:59 AM, Heikki Linnakangas
>>> <hlinnakangas@vmware.com> wrote:
>>>>
>>>> On 08/06/2014 08:39 PM, Fujii Masao wrote:
>>>>>
>>>>> The WAL files that pg_receivexlog writes will not be re-read soon
>>>>> basically,
>>>>> so we can advise the OS to release any cached pages when WAL file is
>>>>> closed. I feel inclined to change pg_receivexlog that way. Thought?
>>>>
>>>>
>>>>
>>>> -1. The OS should be smart enough to not thrash the cache by files that
>>>
>>> are
>>>>
>>>> written sequentially and never read.
>>>
>>>
>> OS's buffer strategy is optimized for general situation. Do you forget OS
>> hackers discussion last a half of year?
>>
>>> Yep, the OS should be so smart, but I'm not sure if it actually is. Maybe
>>> not,
>>> so I was thinking that posix_fadvise is called when the server closes WAL
>>> file.
>>
>>
>> That's right.
>
>
> Well, I'd like to hear someone from the field complaining that
> pg_receivexlog is thrashing the cache and thus reducing the performance of
> some other process. Or at least a synthetic test case that demonstrates that
> happening.

Yeah, I will test that by measuring the performance of a PostgreSQL instance
running on the same server as pg_receivexlog. We can simply compare the
performance with the normal pg_receivexlog against that with the patched
one (i.e., with posix_fadvise called).

>
>
>> By the way, does pg_receivexlog process have fsync() in every WAL commit?
>
>
> It fsync's each file after finishing to write it. Ie. each WAL file is
> fsync'd once.
>
>
>> If yes, I think that we need no or less fsync() option for the better
>> performance. It is general in NOSQL storages.
>> If no, we need fsync() option for more getting reliability and data
>> integrity.
>
>
> Hmm. An fsync=off style option might make sense, although I doubt the one
> fsync at end of file is causing a performance problem for anyone in
> practice. Haven't heard any complaints, anyway.
>
> An option to fsync after every commit record might make sense if you use
> pg_receivexlog with synchronous replication. Doing that would require
> parsing the WAL, though, to see where the commit records are. But then
> again, the fsync's wouldn't need to correspond to commit records. We could
> fsync just before we go to sleep to wait for more WAL to be received.

That's what Furuya-san proposed in the last CommitFest.

Regards,

-- 
Fujii Masao



Re: posix_fadvise() and pg_receivexlog

From
didier
Date:
Hi

> Well, I'd like to hear someone from the field complaining that
> pg_receivexlog is thrashing the cache and thus reducing the performance of
> some other process. Or at least a synthetic test case that demonstrates that
> happening.
It's not with pg_receivexlog, but it's related.

On a small box with no replication server connected, performance was good
enough, but not with a replication server connected: there was 1GB
worth of WAL sitting in RAM vs. next to nothing without a slave!
Setup:
8GB RAM
2GB shared_buffers (smaller has other issues)
checkpoint_segments 40 (smaller values trigger too many xlog checkpoints)
checkpoints spread over 10 min, writing 30 to 50% of shared buffers.
live data set fits in RAM.
constant load.

At startup (once or twice an hour), applications were running requests on
cold data, which were now saturating I/O.
I'm not sure it's an OS bug, as the WAL files were 'hotter' than the cold data.

A cron task running vmtouch -e every minute to evict old WAL files
from memory has solved the issue.
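
To check how much of a WAL file is sitting in RAM without a separate tool, something like the following C sketch could be used. It is only an illustration of the measurement side (counting resident pages with mmap() and mincore()), not vmtouch's actual code; the function name resident_pages is hypothetical, and the mincore() signature used here is the Linux one.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: count how many pages of the file at `path` are
 * resident in the OS page cache. The file's total page count is stored
 * in *total. Returns the resident page count, or -1 on error.
 */
long
resident_pages(const char *path, long *total)
{
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    long resident = -1;
    size_t npages = 0;
    size_t len = 0;
    unsigned char *vec = NULL;
    void *map = MAP_FAILED;
    struct stat st;
    int fd = open(path, O_RDONLY);

    *total = 0;
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0)
        goto done;
    len = (size_t) st.st_size;
    if (len == 0)
    {
        resident = 0;           /* empty file: nothing to be cached */
        goto done;
    }

    npages = (len + page - 1) / page;
    vec = malloc(npages);
    /* Mapping the file does not by itself read its pages into memory. */
    map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (vec == NULL || map == MAP_FAILED)
        goto done;

    /* mincore() sets the low bit of vec[i] when page i is resident. */
    if (mincore(map, len, vec) != 0)
        goto done;

    resident = 0;
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;
    *total = (long) npages;

done:
    if (map != MAP_FAILED)
        munmap(map, len);
    free(vec);
    close(fd);
    return resident;
}
```

Summing the result over the files in pg_xlog before and after connecting a standby would give a figure comparable to the 1GB observation above.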

Regards



Re: posix_fadvise() and pg_receivexlog

From
Fujii Masao
Date:
On Tue, Sep 9, 2014 at 9:07 PM, didier <did447@gmail.com> wrote:
> Hi
>
>> Well, I'd like to hear someone from the field complaining that
>> pg_receivexlog is thrashing the cache and thus reducing the performance of
>> some other process. Or at least a synthetic test case that demonstrates that
>> happening.
> It's not with pg_receivexlog but it's related.
>
> On a small box without replication server connected perfs were good
> enough but not so with a replication server connected, there was 1GB
> worth of WAL sitting in RAM vs next to nothing without slave!

After a WAL file is filled up and closed, it will not be re-read
if wal_level is set to minimal (i.e., neither archiving nor
replication is enabled). So, in this case, PostgreSQL advises the OS
to release any cached pages of that WAL file. But it does not do so if
archiving or replication is enabled, and then the WAL file stays cached
even after it's closed. This is probably the cause of what you
observed.
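
The decision described above can be sketched in C roughly as follows. This is an illustration only, not the server's source: the WalLevel enum and both function names are hypothetical stand-ins, and the file is assumed to have been flushed already, since dirty pages are generally not dropped.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Illustrative stand-ins for wal_level settings, not PostgreSQL's own. */
typedef enum
{
    WAL_MINIMAL,
    WAL_ARCHIVE,
    WAL_HOT_STANDBY,
    WAL_LOGICAL
} WalLevel;

/*
 * Above "minimal", an archiver or a replication sender may read the
 * segment again after it has been closed.
 */
bool
wal_reread_expected(WalLevel level)
{
    return level > WAL_MINIMAL;
}

/*
 * Hypothetical close routine: hint the OS to drop the segment's cached
 * pages only when nobody is expected to read it again.
 */
int
close_wal_segment(int fd, WalLevel level)
{
#if defined(POSIX_FADV_DONTNEED)
    if (!wal_reread_expected(level))
        (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
#endif
    return close(fd);
}
```

With archiving or replication enabled the hint is skipped entirely, which matches the closed segments staying in the page cache.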

Regards,

-- 
Fujii Masao



Re: posix_fadvise() and pg_receivexlog

From
Robert Haas
Date:
On Tue, Sep 9, 2014 at 8:07 AM, didier <did447@gmail.com> wrote:
>> Well, I'd like to hear someone from the field complaining that
>> pg_receivexlog is thrashing the cache and thus reducing the performance of
>> some other process. Or at least a synthetic test case that demonstrates that
>> happening.
> It's not with pg_receivexlog but it's related.
>
> On a small box without replication server connected perfs were good
> enough but not so with a replication server connected, there was 1GB
> worth of WAL sitting in RAM vs next to nothing without slave!
> setup:
> 8GB RAM
> 2GB shared_buffers (smaller has other issues)
> checkpoint_segments 40 (smaller value trigger too much xlog checkpoint)
> checkpoints spread over 10 mn and write 30 to 50% of shared buffers.
> live data set fit in RAM.
> constant load.
>
> On startup (1 or 2/hour) applications were running requests on cold
> data which were now saturating IO.
> I'm not sure it's an OS bug as the WAL were 'hotter' than the cold data.
>
> A cron task every minute with vmtouch -e for evicting old WAL files
> from memory has solved the issue.

That seems like pretty good evidence that it might be worth doing
something here.  But I still think maybe it should be optional,
because if the user plans to reread those files and, say, copy them
somewhere else, then they won't want this behavior.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company