Re: Bug: Buffer cache is not scan resistant - Mailing list pgsql-hackers

From Hannu Krosing
Subject Re: Bug: Buffer cache is not scan resistant
Date
Msg-id 1173252124.3708.30.camel@localhost.localdomain
Whole thread Raw
In response to Re: Bug: Buffer cache is not scan resistant  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Bug: Buffer cache is not scan resistant  ("Marko Kreen" <markokr@gmail.com>)
List pgsql-hackers
Ühel kenal päeval, T, 2007-03-06 kell 18:28, kirjutas Jeff Davis:
> On Tue, 2007-03-06 at 18:29 +0000, Heikki Linnakangas wrote:
> > Jeff Davis wrote:
> > > On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
> > >> On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> > >>> Another approach I proposed back in December is to not have a  
> > >>> variable like that at all, but scan the buffer cache for pages  
> > >>> belonging to the table you're scanning to initialize the scan.  
> > >>> Scanning all the BufferDescs is a fairly CPU and lock heavy  
> > >>> operation, but it might be ok given that we're talking about large  
> > >>> I/O bound sequential scans. It would require no DBA tuning and  
> > >>> would work more robustly in varying conditions. I'm not sure where  
> > >>> you would continue after scanning the in-cache pages. At the  
> > >>> highest in-cache block number, perhaps.
> > >> If there was some way to do that, it'd be what I'd vote for.
> > >>
> > > 
> > > I still don't know how to make this take advantage of the OS buffer
> > > cache. 

Maybe it should not ?

Mostly there can be use of OS cache only if it is much bigger than
shared buffer cache. It may make sense to forget about OS cache and just
tell those who can make use of sync scans to set most of memory aside
for shared buffers.

Then we can do better predictions/lookups of how much of a table is
actually in memory.

Dual caching is usually not very beneficial anyway, not to mention about
difficulties in predicting any doual-caching effects.

> > Yep, I don't see any way to do that. I think we could live with that, 
> > though. If we went with the sync_scan_offset approach, you'd have to 
> > leave a lot of safety margin in that as well.
> > 
> 
> Right, there would certainly have to be a safety margin with
> sync_scan_offset. However, your plan only works when the shared buffers
> are dominated by this sequential scan. Let's say you have 40% of
> physical memory for shared buffers, and say that 50% are being used for
> hot pages in other parts of the database. That means you have access to
> only 20% of physical memory to optimize for this sequential scan, and
> 20% of the physical memory is basically unavailable (being used for
> other parts of the database).

The simplest thing in case table si much bigger than buffer cache usable
for it is to start the second scan at the point the first scan is
traversing *now*, and hope that the scans will stay together. Or start
at some fixed lag, which makes the first scan to be always the one
issuing reads and second just freerides on buffers already in cache. It
may even be a good idea to throttle the second scan to stay N pages
behind if the OS readahead gets confused when same file is read from
multiple processes.

If the table is smaller than the cache, then just scan it without
syncing.

Trying to read buffers in the same order starting from near the point
where ppages are still in shared buffer cache seems good mostly for case
where table is as big as or just a little larger than cache.

> In my current implementation, you could set sync_scan_offset to 1.0
> (meaning 1.0 x shared_buffers), giving you 40% of physical memory that
> would be used for starting this sequential scan. In this case, that
> should be a good margin of error, considering that as much as 80% of the
> physical memory might actually be in cache (OS or PG cache).
> 
> This all needs to be backed up by testing, of course. I'm just
> extrapolating some numbers that look vaguely reasonable to me.

If there is an easy way to tell PG "give me this page only if it is in
shared cache already", then a good approach might be to start 2nd scan
at the point where 1st is now, and move in both directions
simultabeously, like this:

First scan is at page N.

Second scan:

M=N-1
WHILE NOT ALL PAGES ARE READ:    IF PAGE N IS IN CACHE :   -- FOLLOW FIRST READER       READ PAGE N       N++   ELSE IF
M>=0AND PAGE M IS IN CACHE : -- READ OLDER CACHED PAGES       READ PAGE M       M--   ELSE IF FIRST READER STILL GOING:
--NO OLDER PAGES, WAIT FOR 1st       WAIT FOR PAGE N TO BECOME AVAILABLE       READ PAGE N       N++   ELSE:
--BECOME 1st reader      READ PAGE N      N++   PROCESS PAGE   --   IF N > PAGES_IF_TABLE: N=0   IF M < 0:
M=PAGES_IF_TABLE


This should work reasonably well for LRU caches and it may be made to
work with clock sweep scheme if the sweep arranges pages to purge in
file order.

If we could make the IF PAGE x IS IN CACHE part also know about OS cache
this could also make use of os cache.


Do any of you know about a way to READ PAGE ONLY IF IN CACHE in *nix
systems ?

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




pgsql-hackers by date:

Previous
From: Stefan Kaltenbrunner
Date:
Subject: Re: Auto creation of Partitions
Next
From: "Joshua D. Drake"
Date:
Subject: Re: Auto creation of Partitions