Re: Linux kernel impact on PostgreSQL performance - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: Linux kernel impact on PostgreSQL performance
Date
Msg-id CAMkU=1zDtxQyF+f1HU+ArMdBQRi=xv8p=1o11wjmyJX6uoaWnw@mail.gmail.com
In response to Re: Linux kernel impact on PostgreSQL performance  (Mel Gorman <mgorman@suse.de>)
Responses Re: Linux kernel impact on PostgreSQL performance  (Jim Nasby <jim@nasby.net>)
Re: Linux kernel impact on PostgreSQL performance  (Mel Gorman <mgorman@suse.de>)
List pgsql-hackers
On Mon, Jan 13, 2014 at 2:36 PM, Mel Gorman <mgorman@suse.de> wrote:
On Mon, Jan 13, 2014 at 06:27:03PM -0200, Claudio Freire wrote:
> On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby <jim@nasby.net> wrote:
> > On 1/13/14, 2:19 PM, Claudio Freire wrote:
> >>
> >> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas <robertmhaas@gmail.com>
> >> wrote:
> >>>
> >>> On a related note, there's also the problem of double-buffering.  When
> >>> we read a page into shared_buffers, we leave a copy behind in the OS
> >>> buffers, and similarly on write-out.  It's very unclear what to do
> >>> about this, since the kernel and PostgreSQL don't have intimate
> >>> knowledge of what each other are doing, but it would be nice to solve
> >>> somehow.
> >>
> >>
> >>
> >> There you have a much harder algorithmic problem.
> >>
> >> You can basically control duplication with fadvise and DONTNEED. The
> >> problem here is not the kernel and whether or not it allows postgres
> >> to be smart about it. The problem is... what kind of smarts
> >> (algorithm) to use.
> >
> >
> > Isn't this a fairly simple matter of when we read a page into shared buffers
> > tell the kernel to forget that page? And a corollary to that for when we
> > dump a page out of shared_buffers (here kernel, please put this back into
> > your cache).
>
>
> That's my point. In terms of kernel-postgres interaction, it's fairly simple.
>
> What's not so simple, is figuring out what policy to use. Remember,
> you cannot tell the kernel to put some page in its page cache without
> reading it or writing it. So, once you make the kernel forget a page,
> evicting it from shared buffers becomes quite expensive.

posix_fadvise(POSIX_FADV_WILLNEED) is meant to cover this case by
forcing readahead.

But telling the kernel to forget a page and then, because the page might be needed again soon, telling it to read that page back in from disk is itself very expensive.  We would need a way to hand the page to the kernel so that it has the contents without going to disk to get them.
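
For concreteness, here is a minimal sketch of the two hints under discussion, using Python's os.posix_fadvise wrapper (Unix only). The helper names and the 8192-byte block size are assumptions made for illustration; this is not how PostgreSQL's buffer manager works. It shows why the round trip is costly: POSIX_FADV_WILLNEED can only ask the kernel to read the range from the file again, and there is no call here that hands it the bytes we already hold.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # assumed block size for this sketch


def read_block_and_drop(fd, block_no, block_size=BLOCK_SIZE):
    """Read one block, then hint that the kernel may drop its cached copy."""
    offset = block_no * block_size
    data = os.pread(fd, block_size, offset)
    # Advisory only: the kernel is free to ignore the hint.
    os.posix_fadvise(fd, offset, block_size, os.POSIX_FADV_DONTNEED)
    return data


def prefetch_block(fd, block_no, block_size=BLOCK_SIZE):
    """Ask the kernel to read the block back into its page cache."""
    offset = block_no * block_size
    # This triggers readahead from the file; if the page was dropped,
    # that means real disk IO, which is the cost described above.
    os.posix_fadvise(fd, offset, block_size, os.POSIX_FADV_WILLNEED)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"a" * BLOCK_SIZE + b"b" * BLOCK_SIZE)
        f.flush()
        os.fsync(f.fileno())  # dirty pages are not dropped by DONTNEED
        block = read_block_and_drop(f.fileno(), 1)
        prefetch_block(f.fileno(), 1)
        print(len(block))
```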
 
If you evict it prematurely then you do get kinda
screwed because you pay the IO cost to read it back in again even if you
had enough memory to cache it. Maybe this is the type of kernel-postgres
interaction that is annoying you.

If you don't evict, the kernel eventually steps in and evicts the wrong
thing. If you do evict and it was unnecessary, you pay an IO cost.

That could be something we look at. There are cases buried deep in the
VM where pages get shuffled to the end of the LRU and get tagged for
reclaim as soon as possible. Maybe you need access to something like
that via posix_fadvise to say "reclaim this page if you need memory but
leave it resident if there is no memory pressure" or something similar.
Not exactly sure what that interface would look like or offhand how it
could be reliably implemented.

I think the "reclaim this page if you need memory but leave it resident if there is no memory pressure" hint would be more useful for temporary working files than for what was being discussed above (shared buffers).  When I do work that needs large temporary files, I often see physical write IO spike while physical read IO does not.  I interpret that to mean that the temporary data is being written to disk to satisfy either dirty_expire_centisecs or dirty_*bytes, but the data remains in the FS cache, so no disk reads are needed when it is read back.  So it would be good to have a hint that says "this file will never be fsynced, so please ignore dirty_*bytes and dirty_expire_centisecs.  I will need it again relatively soon (but not after a reboot), and will read it mostly sequentially, so please don't evict it without need, but if you do need to evict something then it is a good candidate".
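
For anyone who wants to check the writeback thresholds mentioned above on their own machine, here is a small sketch that reads them from /proc/sys/vm on Linux. The function name is hypothetical, and reading dirty_bytes and dirty_background_bytes as the members of "dirty_*bytes" is my assumption. A value of None means the file could not be read, for example on a non-Linux system.

```python
from pathlib import Path

VM_DIR = Path("/proc/sys/vm")
# Assumed expansion of "dirty_*bytes", plus the expiry timer.
TUNABLES = ("dirty_expire_centisecs", "dirty_bytes", "dirty_background_bytes")


def read_dirty_tunables(vm_dir=VM_DIR):
    """Return the current writeback tunables as a dict of name -> int or None."""
    values = {}
    for name in TUNABLES:
        try:
            values[name] = int((vm_dir / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None  # not available on this system
    return values


if __name__ == "__main__":
    for name, value in read_dirty_tunables().items():
        print(f"{name} = {value}")
```

Dirty data older than dirty_expire_centisecs (in hundredths of a second) becomes eligible for writeback even when there is no memory pressure, which is consistent with writes spiking while reads stay flat.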

Cheers,

Jeff
