Re: Bug: Buffer cache is not scan resistant - Mailing list pgsql-hackers
From | Hannu Krosing |
---|---|
Subject | Re: Bug: Buffer cache is not scan resistant |
Date | |
Msg-id | 1173252124.3708.30.camel@localhost.localdomain Whole thread Raw |
In response to | Re: Bug: Buffer cache is not scan resistant (Jeff Davis <pgsql@j-davis.com>) |
Responses |
Re: Bug: Buffer cache is not scan resistant
|
List | pgsql-hackers |
Ühel kenal päeval, T, 2007-03-06 kell 18:28, kirjutas Jeff Davis: > On Tue, 2007-03-06 at 18:29 +0000, Heikki Linnakangas wrote: > > Jeff Davis wrote: > > > On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote: > > >> On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote: > > >>> Another approach I proposed back in December is to not have a > > >>> variable like that at all, but scan the buffer cache for pages > > >>> belonging to the table you're scanning to initialize the scan. > > >>> Scanning all the BufferDescs is a fairly CPU and lock heavy > > >>> operation, but it might be ok given that we're talking about large > > >>> I/O bound sequential scans. It would require no DBA tuning and > > >>> would work more robustly in varying conditions. I'm not sure where > > >>> you would continue after scanning the in-cache pages. At the > > >>> highest in-cache block number, perhaps. > > >> If there was some way to do that, it'd be what I'd vote for. > > >> > > > > > > I still don't know how to make this take advantage of the OS buffer > > > cache. Maybe it should not ? Mostly there can be use of OS cache only if it is much bigger than shared buffer cache. It may make sense to forget about OS cache and just tell those who can make use of sync scans to set most of memory aside for shared buffers. Then we can do better predictions/lookups of how much of a table is actually in memory. Dual caching is usually not very beneficial anyway, not to mention about difficulties in predicting any doual-caching effects. > > Yep, I don't see any way to do that. I think we could live with that, > > though. If we went with the sync_scan_offset approach, you'd have to > > leave a lot of safety margin in that as well. > > > > Right, there would certainly have to be a safety margin with > sync_scan_offset. However, your plan only works when the shared buffers > are dominated by this sequential scan. Let's say you have 40% of > physical memory for shared buffers, and say that 50% are being used for > hot pages in other parts of the database. That means you have access to > only 20% of physical memory to optimize for this sequential scan, and > 20% of the physical memory is basically unavailable (being used for > other parts of the database). The simplest thing in case table si much bigger than buffer cache usable for it is to start the second scan at the point the first scan is traversing *now*, and hope that the scans will stay together. Or start at some fixed lag, which makes the first scan to be always the one issuing reads and second just freerides on buffers already in cache. It may even be a good idea to throttle the second scan to stay N pages behind if the OS readahead gets confused when same file is read from multiple processes. If the table is smaller than the cache, then just scan it without syncing. Trying to read buffers in the same order starting from near the point where ppages are still in shared buffer cache seems good mostly for case where table is as big as or just a little larger than cache. > In my current implementation, you could set sync_scan_offset to 1.0 > (meaning 1.0 x shared_buffers), giving you 40% of physical memory that > would be used for starting this sequential scan. In this case, that > should be a good margin of error, considering that as much as 80% of the > physical memory might actually be in cache (OS or PG cache). > > This all needs to be backed up by testing, of course. I'm just > extrapolating some numbers that look vaguely reasonable to me. If there is an easy way to tell PG "give me this page only if it is in shared cache already", then a good approach might be to start 2nd scan at the point where 1st is now, and move in both directions simultabeously, like this: First scan is at page N. Second scan: M=N-1 WHILE NOT ALL PAGES ARE READ: IF PAGE N IS IN CACHE : -- FOLLOW FIRST READER READ PAGE N N++ ELSE IF M>=0AND PAGE M IS IN CACHE : -- READ OLDER CACHED PAGES READ PAGE M M-- ELSE IF FIRST READER STILL GOING: --NO OLDER PAGES, WAIT FOR 1st WAIT FOR PAGE N TO BECOME AVAILABLE READ PAGE N N++ ELSE: --BECOME 1st reader READ PAGE N N++ PROCESS PAGE -- IF N > PAGES_IF_TABLE: N=0 IF M < 0: M=PAGES_IF_TABLE This should work reasonably well for LRU caches and it may be made to work with clock sweep scheme if the sweep arranges pages to purge in file order. If we could make the IF PAGE x IS IN CACHE part also know about OS cache this could also make use of os cache. Do any of you know about a way to READ PAGE ONLY IF IN CACHE in *nix systems ? -- ---------------- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com
pgsql-hackers by date: