Thread: Free WAL caches on switching segments

Free WAL caches on switching segments

From
ITAGAKI Takahiro
Date:
Hi,

Here is a small patch to prevent undesired WAL file caching by kernel.
posix_fadvise(POSIX_FADV_DONTNEED) attempts to free cached pages and
the kernel will discard them in preference to other data caches.

I think it works just like as O_DIRECT in terms of cache control.
O_DIRECT may be a better solution than posix_fadvise, but
posix_fadvise may be used on platforms where O_DIRECT is not supported.

pgbench results are as follows:
  wal_sync_method
  - open_sync         : 156.0 tps
  - fdatasync         : 126.3 tps
  - fdatasync+fadvise : 161.2 tps
 (8.1beta1 on Linux 2.6.8-24)

I'll appreciate any comments and advices,
Takahiro



*** xlog.c    Mon Aug 29 11:51:19 2005
--- xlog-with-fadvise.c    Tue Aug 30 15:08:24 2005
***************
*** 1352,1357 ****
--- 1352,1358 ----
              Assert(npages == 0);
              if (openLogFile >= 0)
              {
+                 posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
                  if (close(openLogFile))
                      ereport(PANIC,
                              (errcode_for_file_access(),
***************
*** 1535,1540 ****
--- 1536,1542 ----
              if (openLogFile >= 0 &&
               !XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
              {
+                 posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
                  if (close(openLogFile))
                      ereport(PANIC,
                              (errcode_for_file_access(),


Re: Free WAL caches on switching segments

From
Tom Lane
Date:
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
> Here is a small patch to prevent undesired WAL file caching by kernel.
> posix_fadvise(POSIX_FADV_DONTNEED) attempts to free cached pages and
> the kernel will discard them in preference to other data caches.

On plenty of platforms, this won't even compile ...

            regards, tom lane

Re: Free WAL caches on switching segments

From
ITAGAKI Takahiro
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> > Here is a small patch to prevent undesired WAL file caching by kernel.
> > posix_fadvise(POSIX_FADV_DONTNEED) attempts to free cached pages and
> > the kernel will discard them in preference to other data caches.
>
> On plenty of platforms, this won't even compile ...


Do you mean simply following code? or more pretty way?

#ifdef POSIX_FADV_DONTNEED
    posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
#endif


---
ITAGAKI Takahiro
NTT Cyber Space Laboratories



Re: Free WAL caches on switching segments

From
Bruce Momjian
Date:
ITAGAKI Takahiro wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > > Here is a small patch to prevent undesired WAL file caching by kernel.
> > > posix_fadvise(POSIX_FADV_DONTNEED) attempts to free cached pages and
> > > the kernel will discard them in preference to other data caches.
> >
> > On plenty of platforms, this won't even compile ...
>
>
> Do you mean simply following code? or more pretty way?
>
> #ifdef POSIX_FADV_DONTNEED
>     posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
> #endif

Yes, but we probably have to have a configure test to see if
posix_fadvise exists too.  I will keep this for 8.2.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Free WAL caches on switching segments

From
ITAGAKI Takahiro
Date:
Bruce Momjian <pgman@candle.pha.pa.us> wrote:

> > > > Here is a small patch to prevent undesired WAL file caching by kernel.
> > > > posix_fadvise(POSIX_FADV_DONTNEED) attempts to free cached pages and
> > > > the kernel will discard them in preference to other data caches.
> > >
> > > On plenty of platforms, this won't even compile ...
>
> Yes, but we probably have to have a configure test to see if
> posix_fadvise exists too.  I will keep this for 8.2.

I think we can use _POSIX_ADVISORY_INFO to test if posix_fadvise exists.
Also, I added the check on whether WAL archiving is enabled, because
archivers might use the caches to read the WAL segment.


By the way, should we put posix_fadvise on the separated place with renaming
pg_fadvise? If we use posix_fadvise in other purposes, for example,
read-ahead control, the separation would be good to keep codes clean.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories


Attachment

Re: Free WAL caches on switching segments

From
Bruce Momjian
Date:
I looked this over and I am unsure what this does for us that isn't
already accomplished using the wal_sync_method settings.  See xlog.c for
a description of O_DIRECT and when it is used.

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> wrote:
>
> > > > > Here is a small patch to prevent undesired WAL file caching by kernel.
> > > > > posix_fadvise(POSIX_FADV_DONTNEED) attempts to free cached pages and
> > > > > the kernel will discard them in preference to other data caches.
> > > >
> > > > On plenty of platforms, this won't even compile ...
> >
> > Yes, but we probably have to have a configure test to see if
> > posix_fadvise exists too.  I will keep this for 8.2.
>
> I think we can use _POSIX_ADVISORY_INFO to test if posix_fadvise exists.
> Also, I added the check on whether WAL archiving is enabled, because
> archivers might use the caches to read the WAL segment.
>
>
> By the way, should we put posix_fadvise on the separated place with renaming
> pg_fadvise? If we use posix_fadvise in other purposes, for example,
> read-ahead control, the separation would be good to keep codes clean.
>
> ---
> ITAGAKI Takahiro
> NTT Cyber Space Laboratories
>

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Free WAL caches on switching segments

From
ITAGAKI Takahiro
Date:
Bruce Momjian <pgman@candle.pha.pa.us> wrote:

> I looked this over and I am unsure what this does for us that isn't
> already accomplished using the wal_sync_method settings.  See xlog.c for
> a description of O_DIRECT and when it is used.

I proposed it to supplement the cache control. There are some OSes that
supports posix_fadvise but not O_DIRECT, for example, NetBSD 4.0
(http://www.netbsd.org/Changes/changes-4.0.html).

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories



Re: Free WAL caches on switching segments

From
Bruce Momjian
Date:
ITAGAKI Takahiro wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> wrote:
>
> > I looked this over and I am unsure what this does for us that isn't
> > already accomplished using the wal_sync_method settings.  See xlog.c for
> > a description of O_DIRECT and when it is used.
>
> I proposed it to supplement the cache control. There are some OSes that
> supports posix_fadvise but not O_DIRECT, for example, NetBSD 4.0
> (http://www.netbsd.org/Changes/changes-4.0.html).

Oh, that makes sense then.   Let me re-add it to the queue.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Free WAL caches on switching segments

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> ITAGAKI Takahiro wrote:
>> Bruce Momjian <pgman@candle.pha.pa.us> wrote:
>>> I looked this over and I am unsure what this does for us that isn't
>>> already accomplished using the wal_sync_method settings.  See xlog.c for
>>> a description of O_DIRECT and when it is used.
>>
>> I proposed it to supplement the cache control. There are some OSes that
>> supports posix_fadvise but not O_DIRECT, for example, NetBSD 4.0
>> (http://www.netbsd.org/Changes/changes-4.0.html).

> Oh, that makes sense then.   Let me re-add it to the queue.

Could we see some performance measurements *from those OSes*?  The given
test on Linux certainly does not justify adding another operating
system dependency to the WAL code.  For that matter, even if it is a big
win on some versions of NetBSD, I'm not sure I'd want to accept it ...
how many NetBSD users do we have who would care?

Depending on OS features we have never depended on before is a *huge*
ongoing maintenance cost, and I have not seen an argument that I think
justifies this one.

            regards, tom lane

Re: Free WAL caches on switching segments

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > ITAGAKI Takahiro wrote:
> >> Bruce Momjian <pgman@candle.pha.pa.us> wrote:
> >>> I looked this over and I am unsure what this does for us that isn't
> >>> already accomplished using the wal_sync_method settings.  See xlog.c for
> >>> a description of O_DIRECT and when it is used.
> >>
> >> I proposed it to supplement the cache control. There are some OSes that
> >> supports posix_fadvise but not O_DIRECT, for example, NetBSD 4.0
> >> (http://www.netbsd.org/Changes/changes-4.0.html).
>
> > Oh, that makes sense then.   Let me re-add it to the queue.
>
> Could we see some performance measurements *from those OSes*?  The given
> test on Linux certainly does not justify adding another operating
> system dependency to the WAL code.  For that matter, even if it is a big
> win on some versions of NetBSD, I'm not sure I'd want to accept it ...
> how many NetBSD users do we have who would care?
>
> Depending on OS features we have never depended on before is a *huge*
> ongoing maintenance cost, and I have not seen an argument that I think
> justifies this one.

I disagree.  It is a localized change and seems like a win, and it uses
a standard POSIX feature, rather than an OS-specific one.

I fact the patch cleans up our code by centralizing the WAL close() code
and error message.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Free WAL caches on switching segments

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> Depending on OS features we have never depended on before is a *huge*
>> ongoing maintenance cost, and I have not seen an argument that I think
>> justifies this one.

> I disagree.  It is a localized change and seems like a win, and it uses
> a standard POSIX feature, rather than an OS-specific one.

It's still gonna need a configure test and so on.  "POSIX" does not mean
"exists everywhere".  Moreover, the submitter has not even proven that
the code works (or even builds, much less does anything useful) on the
platforms it's supposedly for.

            regards, tom lane

Re: Free WAL caches on switching segments

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> Depending on OS features we have never depended on before is a *huge*
> >> ongoing maintenance cost, and I have not seen an argument that I think
> >> justifies this one.
>
> > I disagree.  It is a localized change and seems like a win, and it uses
> > a standard POSIX feature, rather than an OS-specific one.
>
> It's still gonna need a configure test and so on.  "POSIX" does not mean
> "exists everywhere".  Moreover, the submitter has not even proven that
> the code works (or even builds, much less does anything useful) on the
> platforms it's supposedly for.

The submitter believes the C macro test is sufficient:

    I think we can use _POSIX_ADVISORY_INFO to test if posix_fadvise exists.
    Also, I added the check on whether WAL archiving is enabled, because
    archivers might use the caches to read the WAL segment.

I assume if it follows the POSIX spec it will work on all platforms that
support this feature.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Free WAL caches on switching segments

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> It's still gonna need a configure test and so on.

> The submitter believes the C macro test is sufficient:

Do I get to revert the patch the moment that fails in buildfarm?

More to the point, the utility of the patch remains unproven.
We are not in the habit of adding OS dependencies on speculation.

            regards, tom lane

Re: Free WAL caches on switching segments

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> It's still gonna need a configure test and so on.
>
> > The submitter believes the C macro test is sufficient:
>
> Do I get to revert the patch the moment that fails in buildfarm?

You bet.  In fact I liked his centralizing the WAL close() irrespective
of the feature itself.

> More to the point, the utility of the patch remains unproven.
> We are not in the habit of adding OS dependencies on speculation.

He ran tests, though it is speculation because non-caching is a pretty
hard thing to find a benefit from except under low memory situations.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Free WAL caches on switching segments

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> It's still gonna need a configure test and so on.
>
> > The submitter believes the C macro test is sufficient:
>
> Do I get to revert the patch the moment that fails in buildfarm?

In fact, I will revert it when it fails in the build farm.  :-)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Free WAL caches on switching segments

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> More to the point, the utility of the patch remains unproven.
>> We are not in the habit of adding OS dependencies on speculation.

> He ran tests, though it is speculation because non-caching is a pretty
> hard thing to find a benefit from except under low memory situations.

Well, the tests (a) didn't show any particularly good speedup, and
(b) were not on the platforms that this is speculated to be useful on
(ie, those without O_DIRECT).

I really don't think that an adequate case has been made for adding
a new OS dependency.

            regards, tom lane

Re: Free WAL caches on switching segments

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> More to the point, the utility of the patch remains unproven.
> >> We are not in the habit of adding OS dependencies on speculation.
>
> > He ran tests, though it is speculation because non-caching is a pretty
> > hard thing to find a benefit from except under low memory situations.
>
> Well, the tests (a) didn't show any particularly good speedup, and
> (b) were not on the platforms that this is speculated to be useful on
> (ie, those without O_DIRECT).
>
> I really don't think that an adequate case has been made for adding
> a new OS dependency.

Well, I think the patch should be applied, and the submitter does too,
so unless I hear other votes, it is going in.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Free WAL caches on switching segments

From
daveg
Date:
On Mon, Feb 13, 2006 at 02:59:43PM -0500, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Tom Lane wrote:
> > >> More to the point, the utility of the patch remains unproven.
> > >> We are not in the habit of adding OS dependencies on speculation.
> >
> > > He ran tests, though it is speculation because non-caching is a pretty
> > > hard thing to find a benefit from except under low memory situations.
> >
> > Well, the tests (a) didn't show any particularly good speedup, and
> > (b) were not on the platforms that this is speculated to be useful on
> > (ie, those without O_DIRECT).
> >
> > I really don't think that an adequate case has been made for adding
> > a new OS dependency.
>
> Well, I think the patch should be applied, and the submitter does too,
> so unless I hear other votes, it is going in.

I vote no for whatever that is worth.  A "performance" change needs to
actually demonstrate improved performance.  If the change is really
desireable to clean up some messy code, then add it as a cleanup change
without the extra system calls. Otherwise it just adds one more bit of
mystery for future maintainers who may be decieved into thinking that posix
advise calls are important voodoo.

-dg

--
David Gould                      daveg@sonic.net
If simplicity worked, the world would be overrun with insects.

Re: Free WAL caches on switching segments

From
Bruce Momjian
Date:
daveg wrote:
> On Mon, Feb 13, 2006 at 02:59:43PM -0500, Bruce Momjian wrote:
> > Tom Lane wrote:
> > > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > > Tom Lane wrote:
> > > >> More to the point, the utility of the patch remains unproven.
> > > >> We are not in the habit of adding OS dependencies on speculation.
> > >
> > > > He ran tests, though it is speculation because non-caching is a pretty
> > > > hard thing to find a benefit from except under low memory situations.
> > >
> > > Well, the tests (a) didn't show any particularly good speedup, and
> > > (b) were not on the platforms that this is speculated to be useful on
> > > (ie, those without O_DIRECT).
> > >
> > > I really don't think that an adequate case has been made for adding
> > > a new OS dependency.
> >
> > Well, I think the patch should be applied, and the submitter does too,
> > so unless I hear other votes, it is going in.
>
> I vote no for whatever that is worth.  A "performance" change needs to
> actually demonstrate improved performance.  If the change is really
> desireable to clean up some messy code, then add it as a cleanup change
> without the extra system calls. Otherwise it just adds one more bit of
> mystery for future maintainers who may be decieved into thinking that posix
> advise calls are important voodoo.

Yes, your vote counts very much.  What if I apply the patch, but mark
the posix_advise() call in a NOT_USED macro block, so it will be ready
for people to test, but will not be used until we are sure.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Free WAL caches on switching segments

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Yes, your vote counts very much.  What if I apply the patch, but mark
> the posix_advise() call in a NOT_USED macro block, so it will be ready
> for people to test, but will not be used until we are sure.

Sounds like a recipe for ensuring it never will be tested.  What's
needed here is some actual tests, not preparation...

            regards, tom lane

Re: Free WAL caches on switching segments

From
Mark Kirkwood
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>
>>Yes, your vote counts very much.  What if I apply the patch, but mark
>>the posix_advise() call in a NOT_USED macro block, so it will be ready
>>for people to test, but will not be used until we are sure.
>
>
> Sounds like a recipe for ensuring it never will be tested.  What's
> needed here is some actual tests, not preparation...
>

Gotta second that.

Does the OP have a test scenario that those of us with appropriate OS's
could try? Come to think of it, what are the appropriate OS's? (I see
NetBSD mentioned so I suppose all the *BSDs, but what others?).

Cheers

Mark

Re: Free WAL caches on switching segments

From
Mark Kirkwood
Date:
Mark Kirkwood wrote:
> Come to think of it, what are the appropriate OS's? (I see
> NetBSD mentioned so I suppose all the *BSDs, but what others?).
>

FWIW FreeBSD (6.0) does *not* have posix_fadvise, only posix_madvise.

regards

Mark


Re: Free WAL caches on switching segments

From
Simon Riggs
Date:
On Mon, 2006-02-13 at 23:33 -0500, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Yes, your vote counts very much.  What if I apply the patch, but mark
> > the posix_advise() call in a NOT_USED macro block, so it will be ready
> > for people to test, but will not be used until we are sure.
>
> Sounds like a recipe for ensuring it never will be tested.  What's
> needed here is some actual tests, not preparation...

Without discussing this particular patch, IMHO we need a clear checklist
of items that are required before a patch is accepted onto the patches
awaiting application list.

We seem to have had a few patches rejected on grounds that could have
been picked up by other people, not just Tom and Bruce. This wastes
everybody's time because the patch writer might have fixed whatever was
wrong with it a while back if the patch had been rejected earlier. It
will also eventually bring the process into disrepute if one accepts
patches onto the queue and then another raises major objections that
could easily have been rectified earlier.

So let's agree a checklist beforehand. That way patch submitters can be
told to resubmit much earlier by other list watchers, with less wasted
time (elapsed and from core folk) and perhaps a gentler experience for
first-time submitters. Control can and should still lie with committers.

Suggested checklist:
1. has patch been discussed previously? Y/N
- if Y, give direct link to archive of message, and/or bug#
- if N discuss on appropriate list, or expect to be rejected
2. patch target: cvstip or stated branch(es)
3. patch application location: root directory only
4. patch format: diff -c only
5. confirm licence is BSD: Y/N
6. port specific: Y/N, if Y list ports
7. confirm passes make check (on listed ports)
8. provide implementation overview, preferably in code comments
9. if it is a performance patch, provide confirming test results. It is
OK to post patches without these, though the patch will not be applied
until *somebody* has tested the patches and found a valuable performance
effect directly attributable to the patch.
10. If it is a new feature patch, confirm that it has been tested for
all desired scenarios. If it has not, this should be clearly stated as a
request for a particular kind of test to be performed. Note that the
patch will go no further until that test has been performed.
11. if it is a new feature patch, does it break any existing defaults?
Explain why this is *required* or patch will be rejected. New feature
patches should be accompanied by doc patches also.
12. Even if you pass all of the above, the patch may still be rejected
for other technical reasons. You should be prepared to listen to
comments received and perform any agreed rework. Even if you have
received positive comments from some community members, others may spot
problems with your approach, coding style or many other issues.
13. Successful patches will be notified to you by email and you will be
credited for that work in the next set of release notes.

I would also suggest that we have two patch queues:
- patches awaiting performance testing
- patches awaiting application (current one)

That way anybody wanting to test new performance add-ons can do so and
reply to the list with confirmation that the patch can now be added to
the second (main) list of patches.

Of course, this suggestion will be immediately rejected because it
wasn't discussed on -hackers ;-)

Best Regards, Simon Riggs


Re: Free WAL caches on switching segments

From
Tom Lane
Date:
Mark Kirkwood <markir@paradise.net.nz> writes:
> Tom Lane wrote:
>> Sounds like a recipe for ensuring it never will be tested.  What's
>> needed here is some actual tests, not preparation...

> Does the OP have a test scenario that those of us with appropriate OS's
> could try? Come to think of it, what are the appropriate OS's? (I see
> NetBSD mentioned so I suppose all the *BSDs, but what others?).

The test run by the OP was just pgbench, which is probably not the
greatest scenario for showing the benefits of this patch, but at least
it's neutral ground.  You need a situation in which the kernel is under
memory stress, else early free of disk cache buffers isn't going to make
any difference whatever --- so choose a pgbench scale factor that makes
the database noticeably larger than the test machine's RAM.  Other than
that, follow the usual guidelines for producing trustworthy pgbench
numbers: number of clients smaller than scale factor, number of
transactions per client at least 1000 or so (to eliminate startup
transients), repeat test a couple times to make sure numbers are
reproducible.

            regards, tom lane

Re: Free WAL caches on switching segments

From
Simon Riggs
Date:
On Tue, 2006-02-14 at 12:54 +0000, Simon Riggs wrote:
> On Mon, 2006-02-13 at 23:33 -0500, Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Yes, your vote counts very much.  What if I apply the patch, but mark
> > > the posix_advise() call in a NOT_USED macro block, so it will be ready
> > > for people to test, but will not be used until we are sure.
> >
> > Sounds like a recipe for ensuring it never will be tested.  What's
> > needed here is some actual tests, not preparation...
>
> Without discussing this particular patch, IMHO we need a clear checklist
> of items that are required before a patch is accepted onto the patches
> awaiting application list.

This was supposed to be a serious suggestion, so apologies if this came
across stronger than it was meant.

The onus is of course upon the patch submitter to improve their game,
but there seems only benefit in setting out the (simpler) rules of the
game to show people what is unacceptable, even before they submit.

Best Regards, Simon Riggs



Re: Free WAL caches on switching segments

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> This was supposed to be a serious suggestion, so apologies if this came
> across stronger than it was meant.

If you want to have a serious discussion about it, you need to start a
thread on pghackers under a more suitable subject line.  The people
reading this thread are going to be a relatively small group.

            regards, tom lane

Re: Free WAL caches on switching segments

From
Mark Kirkwood
Date:
Tom Lane wrote:
> Mark Kirkwood <markir@paradise.net.nz> writes:
>
>>Tom Lane wrote:
>>
>>>Sounds like a recipe for ensuring it never will be tested.  What's
>>>needed here is some actual tests, not preparation...
>
>
>>Does the OP have a test scenario that those of us with appropriate OS's
>>could try? Come to think of it, what are the appropriate OS's? (I see
>>NetBSD mentioned so I suppose all the *BSDs, but what others?).
>
>
> The test run by the OP was just pgbench,

Ah - right, missed that sorry.

> which is probably not the
> greatest scenario for showing the benefits of this patch, but at least
> it's neutral ground.  You need a situation in which the kernel is under
> memory stress, else early free of disk cache buffers isn't going to make
> any difference whatever --- so choose a pgbench scale factor that makes
> the database noticeably larger than the test machine's RAM.  Other than
> that, follow the usual guidelines for producing trustworthy pgbench
> numbers: number of clients smaller than scale factor, number of
> transactions per client at least 1000 or so (to eliminate startup
> transients), repeat test a couple times to make sure numbers are
> reproducible.
>

Thinking about this, presumably any write intensive, multi-user
benchmark would seem to be suitable, so would something like OSDL's
DBT-2 actually be better to try?

Cheers

Mark

(P.s - academic in my case, unless I try out the latest NetBSD or Linux
on one of my FreeBSD boxes....)

Re: Free WAL caches on switching segments

From
Tom Lane
Date:
Mark Kirkwood <markir@paradise.net.nz> writes:
> Thinking about this, presumably any write intensive, multi-user
> benchmark would seem to be suitable, so would something like OSDL's
> DBT-2 actually be better to try?

I'm certainly not wedded to pgbench, give it a try.

BTW, I forgot to mention that it would be useful to try different
wal_sync_methods along with this.  The reason why it seems unlikely
the patch is useful on Linux is that the sync methods that use O_DIRECT
probably dominate using the patch anyway.  There may or may not be
a similar dependence on sync method on other kernels ...

            regards, tom lane

Re: Free WAL caches on switching segments

From
Bruce Momjian
Date:
Tom Lane wrote:
> Mark Kirkwood <markir@paradise.net.nz> writes:
> > Thinking about this, presumably any write intensive, multi-user
> > benchmark would seem to be suitable, so would something like OSDL's
> > DBT-2 actually be better to try?
>
> I'm certainly not wedded to pgbench, give it a try.
>
> BTW, I forgot to mention that it would be useful to try different
> wal_sync_methods along with this.  The reason why it seems unlikely
> the patch is useful on Linux is that the sync methods that use O_DIRECT
> probably dominate using the patch anyway.  There may or may not be
> a similar dependence on sync method on other kernels ...

I am thinking the only way to test this would be to do one heavy update
session to generate a lot of WAL traffic, and another session that is
doing a sequential scan on a table that fills most of the cache.  It
isn't an easy test to make, which was why I was thinking we just add the
patch, but the community disagrees.

--
  Bruce Momjian   http://candle.pha.pa.us
  SRA OSS, Inc.   http://www.sraoss.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Free WAL caches on switching segments

From
Bruce Momjian
Date:
I have modified your patch (attached) and will apply soon, unless there
are more community comments.  Thanks.

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> wrote:
>
> > > > > Here is a small patch to prevent undesired WAL file caching by kernel.
> > > > > posix_fadvise(POSIX_FADV_DONTNEED) attempts to free cached pages and
> > > > > the kernel will discard them in preference to other data caches.
> > > >
> > > > On plenty of platforms, this won't even compile ...
> >
> > Yes, but we probably have to have a configure test to see if
> > posix_fadvise exists too.  I will keep this for 8.2.
>
> I think we can use _POSIX_ADVISORY_INFO to test if posix_fadvise exists.
> Also, I added the check on whether WAL archiving is enabled, because
> archivers might use the caches to read the WAL segment.
>
>
> By the way, should we put posix_fadvise on the separated place with renaming
> pg_fadvise? If we use posix_fadvise in other purposes, for example,
> read-ahead control, the separation would be good to keep codes clean.
>
> ---
> ITAGAKI Takahiro
> NTT Cyber Space Laboratories
>

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match

--
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.237
diff -c -c -r1.237 xlog.c
*** src/backend/access/transam/xlog.c    20 Apr 2006 04:07:38 -0000    1.237
--- src/backend/access/transam/xlog.c    14 Jun 2006 17:38:10 -0000
***************
*** 478,483 ****
--- 478,484 ----
                         bool use_lock);
  static int    XLogFileOpen(uint32 log, uint32 seg);
  static int    XLogFileRead(uint32 log, uint32 seg, int emode);
+ static void    XLogFileClose(void);
  static bool RestoreArchivedFile(char *path, const char *xlogfname,
                      const char *recovername, off_t expectedSize);
  static int    PreallocXlogFiles(XLogRecPtr endptr);
***************
*** 1384,1397 ****
               */
              Assert(npages == 0);
              if (openLogFile >= 0)
!             {
!                 if (close(openLogFile))
!                     ereport(PANIC,
!                             (errcode_for_file_access(),
!                         errmsg("could not close log file %u, segment %u: %m",
!                                openLogId, openLogSeg)));
!                 openLogFile = -1;
!             }
              XLByteToPrevSeg(LogwrtResult.Write, openLogId, openLogSeg);

              /* create/use new log file */
--- 1385,1391 ----
               */
              Assert(npages == 0);
              if (openLogFile >= 0)
!                 XLogFileClose();
              XLByteToPrevSeg(LogwrtResult.Write, openLogId, openLogSeg);

              /* create/use new log file */
***************
*** 1567,1580 ****
          {
              if (openLogFile >= 0 &&
                  !XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
!             {
!                 if (close(openLogFile))
!                     ereport(PANIC,
!                             (errcode_for_file_access(),
!                         errmsg("could not close log file %u, segment %u: %m",
!                                openLogId, openLogSeg)));
!                 openLogFile = -1;
!             }
              if (openLogFile < 0)
              {
                  XLByteToPrevSeg(LogwrtResult.Write, openLogId, openLogSeg);
--- 1561,1567 ----
          {
              if (openLogFile >= 0 &&
                  !XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
!                 XLogFileClose();
              if (openLogFile < 0)
              {
                  XLByteToPrevSeg(LogwrtResult.Write, openLogId, openLogSeg);
***************
*** 2153,2158 ****
--- 2140,2173 ----
  }

  /*
+  * Close the current logfile segment for writing.
+  */
+ static void
+ XLogFileClose(void)
+ {
+     Assert(openLogFile >= 0);
+
+ #ifdef _POSIX_ADVISORY_INFO
+     /*
+      * WAL caches will not be accessed in the future, so we advise OS to
+      * free them. But we will not do so if WAL archiving is active,
+      * because archivers might use the caches to read the WAL segment.
+      * While O_DIRECT works for O_SYNC, posix_fadvise() works for fsync()
+      * and O_SYNC, and some platforms only have posix_fadvise().
+      */
+     if (!XLogArchivingActive())
+         posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
+ #endif
+
+     if (close(openLogFile))
+         ereport(PANIC,
+             (errcode_for_file_access(),
+             errmsg("could not close log file %u, segment %u: %m",
+                    openLogId, openLogSeg)));
+     openLogFile = -1;
+ }
+
+ /*
   * Attempt to retrieve the specified file from off-line archival storage.
   * If successful, fill "path" with its complete path (note that this will be
   * a temp file name that doesn't follow the normal naming convention), and
***************
*** 5609,5622 ****
                           errmsg("could not fsync log file %u, segment %u: %m",
                                  openLogId, openLogSeg)));
              if (open_sync_bit != new_sync_bit)
!             {
!                 if (close(openLogFile))
!                     ereport(PANIC,
!                             (errcode_for_file_access(),
!                         errmsg("could not close log file %u, segment %u: %m",
!                                openLogId, openLogSeg)));
!                 openLogFile = -1;
!             }
          }
          sync_method = new_sync_method;
          open_sync_bit = new_sync_bit;
--- 5624,5630 ----
                           errmsg("could not fsync log file %u, segment %u: %m",
                                  openLogId, openLogSeg)));
              if (open_sync_bit != new_sync_bit)
!                 XLogFileClose();
          }
          sync_method = new_sync_method;
          open_sync_bit = new_sync_bit;

Re: Free WAL caches on switching segments

From
Bruce Momjian
Date:
Patch applied.  Thanks.

---------------------------------------------------------------------------


Bruce Momjian wrote:
>
> I have modified your patch (attached) and will apply soon, unless there
> are more community comments.  Thanks.
>
> ---------------------------------------------------------------------------
>
> ITAGAKI Takahiro wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> wrote:
> >
> > > > > > Here is a small patch to prevent undesired WAL file caching by kernel.
> > > > > > posix_fadvise(POSIX_FADV_DONTNEED) attempts to free cached pages and
> > > > > > the kernel will discard them in preference to other data caches.
> > > > >
> > > > > On plenty of platforms, this won't even compile ...
> > >
> > > Yes, but we probably have to have a configure test to see if
> > > posix_fadvise exists too.  I will keep this for 8.2.
> >
> > I think we can use _POSIX_ADVISORY_INFO to test if posix_fadvise exists.
> > Also, I added the check on whether WAL archiving is enabled, because
> > archivers might use the caches to read the WAL segment.
> >
> >
> > By the way, should we put posix_fadvise on the separated place with renaming
> > pg_fadvise? If we use posix_fadvise in other purposes, for example,
> > read-ahead control, the separation would be good to keep codes clean.
> >
> > ---
> > ITAGAKI Takahiro
> > NTT Cyber Space Laboratories
> >
>
> [ Attachment, skipping... ]
>
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 9: In versions below 8.0, the planner will ignore your desire to
> >        choose an index scan if your joining column's datatypes do not
> >        match
>
> --
>   Bruce Momjian   http://candle.pha.pa.us
>   EnterpriseDB    http://www.enterprisedb.com
>
>   + If your life is a hard drive, Christ can be your backup. +

> Index: src/backend/access/transam/xlog.c
> ===================================================================
> RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
> retrieving revision 1.237
> diff -c -c -r1.237 xlog.c
> *** src/backend/access/transam/xlog.c    20 Apr 2006 04:07:38 -0000    1.237
> --- src/backend/access/transam/xlog.c    14 Jun 2006 17:38:10 -0000
> ***************
> *** 478,483 ****
> --- 478,484 ----
>                          bool use_lock);
>   static int    XLogFileOpen(uint32 log, uint32 seg);
>   static int    XLogFileRead(uint32 log, uint32 seg, int emode);
> + static void    XLogFileClose(void);
>   static bool RestoreArchivedFile(char *path, const char *xlogfname,
>                       const char *recovername, off_t expectedSize);
>   static int    PreallocXlogFiles(XLogRecPtr endptr);
> ***************
> *** 1384,1397 ****
>                */
>               Assert(npages == 0);
>               if (openLogFile >= 0)
> !             {
> !                 if (close(openLogFile))
> !                     ereport(PANIC,
> !                             (errcode_for_file_access(),
> !                         errmsg("could not close log file %u, segment %u: %m",
> !                                openLogId, openLogSeg)));
> !                 openLogFile = -1;
> !             }
>               XLByteToPrevSeg(LogwrtResult.Write, openLogId, openLogSeg);
>
>               /* create/use new log file */
> --- 1385,1391 ----
>                */
>               Assert(npages == 0);
>               if (openLogFile >= 0)
> !                 XLogFileClose();
>               XLByteToPrevSeg(LogwrtResult.Write, openLogId, openLogSeg);
>
>               /* create/use new log file */
> ***************
> *** 1567,1580 ****
>           {
>               if (openLogFile >= 0 &&
>                   !XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
> !             {
> !                 if (close(openLogFile))
> !                     ereport(PANIC,
> !                             (errcode_for_file_access(),
> !                         errmsg("could not close log file %u, segment %u: %m",
> !                                openLogId, openLogSeg)));
> !                 openLogFile = -1;
> !             }
>               if (openLogFile < 0)
>               {
>                   XLByteToPrevSeg(LogwrtResult.Write, openLogId, openLogSeg);
> --- 1561,1567 ----
>           {
>               if (openLogFile >= 0 &&
>                   !XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
> !                 XLogFileClose();
>               if (openLogFile < 0)
>               {
>                   XLByteToPrevSeg(LogwrtResult.Write, openLogId, openLogSeg);
> ***************
> *** 2153,2158 ****
> --- 2140,2173 ----
>   }
>
>   /*
> +  * Close the current logfile segment for writing.
> +  */
> + static void
> + XLogFileClose(void)
> + {
> +     Assert(openLogFile >= 0);
> +
> + #ifdef _POSIX_ADVISORY_INFO
> +     /*
> +      * WAL caches will not be accessed in the future, so we advise OS to
> +      * free them. But we will not do so if WAL archiving is active,
> +      * because archivers might use the caches to read the WAL segment.
> +      * While O_DIRECT works for O_SYNC, posix_fadvise() works for fsync()
> +      * and O_SYNC, and some platforms only have posix_fadvise().
> +      */
> +     if (!XLogArchivingActive())
> +         posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
> + #endif
> +
> +     if (close(openLogFile))
> +         ereport(PANIC,
> +             (errcode_for_file_access(),
> +             errmsg("could not close log file %u, segment %u: %m",
> +                    openLogId, openLogSeg)));
> +     openLogFile = -1;
> + }
> +
> + /*
>    * Attempt to retrieve the specified file from off-line archival storage.
>    * If successful, fill "path" with its complete path (note that this will be
>    * a temp file name that doesn't follow the normal naming convention), and
> ***************
> *** 5609,5622 ****
>                            errmsg("could not fsync log file %u, segment %u: %m",
>                                   openLogId, openLogSeg)));
>               if (open_sync_bit != new_sync_bit)
> !             {
> !                 if (close(openLogFile))
> !                     ereport(PANIC,
> !                             (errcode_for_file_access(),
> !                         errmsg("could not close log file %u, segment %u: %m",
> !                                openLogId, openLogSeg)));
> !                 openLogFile = -1;
> !             }
>           }
>           sync_method = new_sync_method;
>           open_sync_bit = new_sync_bit;
> --- 5624,5630 ----
>                            errmsg("could not fsync log file %u, segment %u: %m",
>                                   openLogId, openLogSeg)));
>               if (open_sync_bit != new_sync_bit)
> !                 XLogFileClose();
>           }
>           sync_method = new_sync_method;
>           open_sync_bit = new_sync_bit;

>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq

--
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +