Re: Clock sweep not caching enough B-Tree leaf pages? - Mailing list pgsql-hackers

From Jim Nasby
Subject Re: Clock sweep not caching enough B-Tree leaf pages?
Msg-id 534CBBA8.8080900@nasby.net
In response to Re: Clock sweep not caching enough B-Tree leaf pages?  (Stephen Frost <sfrost@snowman.net>)
List pgsql-hackers
On 4/14/14, 7:43 PM, Stephen Frost wrote:
> * Jim Nasby (jim@nasby.net) wrote:
>> I think it's important to mention that OS implementations (at least all I know of) have multiple page pools, each of
>> which has its own clock. IIRC one of the arguments for us supporting a count > 1 was we could get the benefits of
>> multiple page pools without the overhead. In reality I believe that argument is false, because the clocks for each
>> page pool in an OS *run at different rates* based on system demands.
>
> They're also maintained in *parallel*, no?  That's something that I've
> been talking over with a few folks at various conferences- that we
> should consider breaking up shared buffers and then have new backend
> processes which work through each pool independently and in parallel.

I suspect that varies based on the OS, but it certainly happens in a separate process from user processes. The
expectation is that there should always be pages on the free list, so requests for memory can be served quickly.

http://www.freebsd.org/doc/en/articles/vm-design/freeing-pages.html contains a good overview of what FreeBSD does. See
http://www.freebsd.org/doc/en/articles/vm-design/allen-briggs-qa.html#idp62990256 as well.
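
To make the free-list part concrete, here's a toy sketch of the structure I'm describing (this is NOT the actual FreeBSD or Postgres code; every symbol in it is made up for illustration): allocation is just a pop off a free list, and a separate daemon is the only thing that ever does reclaim work, waking up when the list runs low.

/*
 * Toy sketch only -- not FreeBSD or Postgres code, and none of these
 * symbols are real.  The point is the division of labor: user-facing
 * allocation is a cheap pop off a free list, and a background daemon
 * (think pagedaemon/bgwriter) is the only thing that refills it.
 */
#include <stddef.h>
#include <stdlib.h>
#include <pthread.h>

typedef struct page { struct page *next; } page_t;

static page_t *free_list;                 /* pages ready to hand out */
static size_t  free_count;
static const size_t low_water = 128;      /* wake the daemon below this */
static pthread_mutex_t free_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  need_pages = PTHREAD_COND_INITIALIZER;

/*
 * Stand-in for the real reclaim work: run the clock, find clean inactive
 * pages, push them on the free list.  Here we just fake it with malloc.
 */
static void
reclaim_some_pages(void)
{
    pthread_mutex_lock(&free_lock);
    while (free_count < low_water * 2)
    {
        page_t *p = malloc(sizeof(page_t));
        if (p == NULL)
            break;
        p->next = free_list;
        free_list = p;
        free_count++;
    }
    pthread_mutex_unlock(&free_lock);
}

/* Fast path: user processes only ever pop the free list. */
page_t *
page_alloc(void)
{
    page_t *p;

    pthread_mutex_lock(&free_lock);
    p = free_list;
    if (p != NULL)
    {
        free_list = p->next;
        free_count--;
    }
    if (free_count < low_water)
        pthread_cond_signal(&need_pages); /* kick the daemon */
    pthread_mutex_unlock(&free_lock);

    return p;                             /* may be NULL; caller must cope */
}

/* Runs in its own thread/process; all eviction work happens here. */
void *
page_daemon(void *arg)
{
    (void) arg;
    for (;;)
    {
        pthread_mutex_lock(&free_lock);
        while (free_count >= low_water)
            pthread_cond_wait(&need_pages, &free_lock);
        pthread_mutex_unlock(&free_lock);

        reclaim_some_pages();
    }
    return NULL;
}

In a scheme like that the backends never pay the eviction cost themselves; the worst case is getting NULL back, which is exactly the "simply return NULL" behavior in point 6 below.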
 

>> I don't know if multiple buffer pools would be good or bad for Postgres, but I do think it's important to remember
>> this difference any time we look at what OSes do.
>
> It's my suspicion that the one-big-pool is exactly why we see many cases
> where PG performs worse when the pool is more than a few gigs.  Of
> course, this is all speculation and proper testing needs to be done..

I think there are some critical take-aways from FreeBSD that apply here (in no particular order):

1: The system is driven by memory pressure. No pressure means no processing.
2: It sounds like the active list is LFU, not LRU. The cache list is LRU.
3: *The use counter is maintained by a clock.* Because the clock only runs so often, there is no run-away
incrementing like we see in Postgres. (See the sketch after this list.)
4: Once a page is determined to not be active it goes onto a separate list depending on whether it's clean or dirty.
5: Dirty pages are only written to maintain a certain clean/dirty ratio and again, only when there's actual memory
pressure.
6: The system maintains a list of free pages to serve memory requests quickly. In fact, lower level functions (ie:
http://www.leidinger.net/FreeBSD/dox/vm/html/d4/d65/vm__phys_8c_source.html#l00862) simply return NULL if they can't
find pages on the free list.
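
To make #3 concrete, here's the difference as I understand it, as a sketch (again, not the real FreeBSD or Postgres code; the names and constants are invented, and FreeBSD's actual act_count bookkeeping uses different increments):

/*
 * Sketch of the two counter schemes -- made-up names and constants,
 * just to show the shape of the difference.
 */
#define MAX_USAGE 5     /* Postgres caps usage_count at 5 */
#define ACT_MAX   64    /* cap on the clock-maintained activity count */

typedef struct buf
{
    int usage_count;    /* Postgres-style: bumped on every pin */
    int act_count;      /* FreeBSD-style: only moved by the clock */
    int referenced;     /* "touched since the last clock visit?" */
} buf_t;

/*
 * Postgres today: every access increments (up to the cap), so a page
 * pinned thousands of times between sweeps races straight to the cap.
 */
void
pg_style_access(buf_t *b)
{
    if (b->usage_count < MAX_USAGE)
        b->usage_count++;
}

/* FreeBSD-style: an access only sets a reference flag... */
void
bsd_style_access(buf_t *b)
{
    b->referenced = 1;
}

/*
 * ...and the counter itself only moves when the clock visits the page,
 * so it can rise by at most one step per scan interval, and it decays
 * while the page sits idle.  That's what kills run-away incrementing.
 */
void
clock_tick(buf_t *b)
{
    if (b->referenced)
    {
        b->referenced = 0;
        if (b->act_count < ACT_MAX)
            b->act_count++;
    }
    else if (b->act_count > 0)
        b->act_count--;
}

The eviction decision then looks at act_count rather than raw access counts, so a page that was hammered once and then went cold ages out at the same rate as anything else.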
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net


