Thread: our buffer replacement strategy is kind of lame
While I was poking around at the index-only scans patch, I couldn't help noticing that our buffer replacement algorithm leaves something to be desired. Here's the query again:

select sum(aid) from sample_data a1 where exists
  (select * from pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);

As sample_data is not indexed, this sequentially scans that table and does an index-only scan of pgbench_accounts for each aid. When this is done, here's what pg_buffercache has to say:

rhaas=# select usagecount, sum(1) from pg_buffercache group by 1 order by 1;
 usagecount |  sum
------------+-------
          1 |   136
          2 |    12
          3 |    72
          4 |     7
          5 | 13755
            | 37218
(6 rows)

Only 96 of the 14286 buffers in sample_data are in shared_buffers, despite the fact that we have 37,218 *completely unused* buffers lying around. That sucks, because it means that the sample query did a whole lot of buffer eviction that was completely needless. The ring buffer is there to prevent trashing the shared buffer arena, but here it would have been fine to trash the arena: there wasn't anything there we would have minded losing (or, indeed, anything at all).

On the other hand, the buffer manager has *no problem at all* trashing the buffer arena if we're faulting in pages for an index scan rather than a sequential scan. If you manage to get all of sample_data into memory (by running many copies of the above query in parallel, you can get each one to allocate its own ring buffer, and eventually pull in all the pages), and then run some query that probes an index which is too large to fit in shared_buffers, it cheerfully blows the whole sample_data table out without a second thought. Had you sequentially scanned a big table, of course, there would be some protection, but an index scan can stomp all over everything with complete impunity.

The general problem here is that we are not very smart about handling workloads with weak locality - i.e. the working set is larger than shared buffers.
If the working set fits in shared_buffers, we will keep it there, and it will be strongly protected against eviction. But if the working set *doesn't* fit in shared_buffers, then we fall back on some not-very-smart heuristics to decide what to do: keep pages involved in index scans, ditch those involved in sequential scans.

It seems to me that we need a smarter algorithm. NetBSD has apparently adopted Clock-Pro, and there are some discussions of it being used in Linux as well, though I'm not sure whether that's actually (still?) the case. The paper is here, although I find the description of the algorithm to be approximately as clear as mud:

http://www.cse.ohio-state.edu/~fchen/paper/papers/usenix05.pdf

One of the important notions which many of the algorithms in the literature seem to hit on in one way or another is that you mustn't go and decide that all of your buffers - or nearly all of your buffers - are hot. That's really equivalent to saying that none of your buffers are hot; and it's also pretty easy to see that such a scenario would be disastrous for our current algorithm, which - if everything in shared_buffers has usage count 5 all the time - will have to clock sweep a long way each time it needs to allocate a buffer. In fact, the more of your buffers want to be hot (and thus hard to evict), the fewer of them should actually be allowed to acquire that status.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
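The pathology described above - every buffer at the maximum usage count, so the clock hand must circle the arena repeatedly before finding a victim - is easy to see in a toy model. This is an illustrative Python sketch, not the actual bufmgr.c code; `sweep_for_victim` is an invented name for the example.

```python
# Toy model of the clock sweep described above (illustrative only).
# Each buffer has a usage count capped at 5; the sweep decrements
# counts as the hand passes until it finds a zero, which becomes the
# eviction victim.

MAX_USAGE = 5

def sweep_for_victim(usage_counts, hand):
    """Return (victim_index, buffers_inspected, new_hand)."""
    n = len(usage_counts)
    inspected = 0
    while True:
        i = hand
        hand = (hand + 1) % n
        inspected += 1
        if usage_counts[i] == 0:
            return i, inspected, hand
        usage_counts[i] -= 1  # the passing hand decays the count

# Worst case: every buffer sits at the maximum count, so the hand must
# decrement all of them five times before any victim appears.
n_buffers = 8
counts = [MAX_USAGE] * n_buffers
_, inspected, _ = sweep_for_victim(counts, 0)
print(inspected)  # 41 = 5 * n_buffers + 1 inspections for one eviction
```

With everything pinned hot, one eviction costs on the order of 5 * shared_buffers inspections, which is the worst case discussed later in the thread.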
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On the other hand, the buffer manager has *no problem at all* trashing
> the buffer arena if we're faulting in pages for an index scan rather
> than a sequential scan. If you manage to get all of sample_data into
> memory (by running many copies of the above query in parallel, you can
> get each one to allocate its own ring buffer, and eventually pull in
> all the pages), and then run some query that probes an index which is
> too large to fit in shared_buffers, it cheerfully blows the whole
> sample_data table out without a second thought. Had you sequentially
> scanned a big table, of course, there would be some protection, but an
> index scan can stomp all over everything with complete impunity.

That's a good observation, and I think we should do this:

* Make an IndexScan use a ring buffer once it has used 32 blocks. The vast majority won't do that, so we avoid overhead on the common path.

* Make a BitmapIndexScan use a ring buffer when we know that the index is larger than 32 blocks. (Ignore upper parts of the tree for that calc.)

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> rhaas=# select usagecount, sum(1) from pg_buffercache group by 1 order by 1;
>  usagecount |  sum
> ------------+-------
>           1 |   136
>           2 |    12
>           3 |    72
>           4 |     7
>           5 | 13755
>             | 37218
> (6 rows)
>
> Only 96 of the 14286 buffers in sample_data are in shared_buffers,
> despite the fact that we have 37,218 *completely unused* buffers lying
> around. That sucks, because it means that the sample query did a
> whole lot of buffer eviction that was completely needless. The ring
> buffer is there to prevent trashing the shared buffer arena, but here
> it would have been fine to trash the arena: there wasn't anything
> there we would have minded losing (or, indeed, anything at all).

You're missing an important point. The SeqScan is measurably faster when using the ring buffer because of the effects of L2 caching on the buffers.

Also, your logic is a little off, since you're doing the scan on an otherwise quiet machine soon after startup, when there are some completely unused buffers. That isn't the typical state of the buffer cache, and so spoiling the cache is not acceptable in the general case. The above case doesn't suck, because it carefully reserved parts of shared buffers for other users at a time when spoiling the cache would be of no benefit to the user - so it worked great from my perspective.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The general problem here is that we are not very smart about handling
> workloads with weak locality - i.e. the working set is larger than
> shared buffers. If the working set fits in shared_buffers, we will
> keep it there, and it will be strongly protected against eviction.
> But if the working set *doesn't* fit in shared buffers, then we fall
> back on some not-very-smart heuristics to decide what to do: keep
> pages involved in index scans, ditch those involved in sequential
> scans.
>
> It seems to me that we need a smarter algorithm. It seems that NetBSD
> has adopted the use of Clock-Pro, and there are some discussions of it
> being used in Linux as well, though I'm not sure whether that's
> actually (still?) the case. Paper is here, although I find the
> description of the algorithm to be approximately clear as mud:
>
> http://www.cse.ohio-state.edu/~fchen/paper/papers/usenix05.pdf
>
> One of the important notions which many of the algorithms in the
> literature seem to hit on in one way or another is that you mustn't go
> and decide that all of your buffers - or nearly all your buffers - are
> hot. That's really equivalent to saying that none of your buffers are
> hot; and it's also pretty easy to see that such a scenario would be
> disastrous for our current algorithm, which - if everything in
> shared_buffers has usage-count 5 all the time - will have to clock
> sweep a long way each time it needs to allocate a buffer. In fact,
> the more of your buffers want to be hot (and thus hard to evict), the
> fewer of them should actually be allowed to acquire that status.

This is something I've been actively working on. I was on the trail of possible reasons that increasing the size of shared buffers would slow down PostgreSQL.

The worst case behaviour of the current freelist code is that it can take up to 5 * shared_buffers checks before identifying a victim buffer.
That occurs when we have a working set exactly matching the size of shared buffers. There are problems when large portions of shared buffers are very popular, so that the free-ish buffers are a small proportion of the total buffers - this case gets worse if the distribution of buffer allocations is non-random, so you get say 10 free-ish blocks together, followed by N-10 very popular ones. That makes every 11th freelist request take ~O(shared_buffers). These problems will show themselves as sporadic BufFreelistLock contention. Sporadic is the problem here, since it may make the contention hard to measure in aggregate, yet with real impact on performance.

For us to benefit from increased shared_buffers we need an algorithm that is O(k) in its worst case behaviour and O(1) in its best case behaviour - the latter aspect is provided by a correctly functioning bgwriter. At the moment, we do nothing to enforce O(k) behaviour.

The following patch implements O(k) behaviour using a heuristic limit. My first attempt was a fixed heuristic limit, but that was less suitable because there are actually 2 requirements. The value of k must be small to have a measurable impact on the smoothness of StrategyGetBuffer(), but k must also be fairly large to ensure the choice of victim is a relatively good one. So the way to balance those conflicting objectives is to have a progressively increasing search window. And just to repeat, k is not a % of shared buffers, which would still give O(N) behaviour.

I've not been reading "the literature", given the problems we had in 2004/5 regarding patents in this area. I also think that since we rely on the underlying filesystem for caching, we don't have exactly the same problem as other systems.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
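The patch itself is attached rather than quoted, so the following Python sketch is only a guess at the shape of the "progressively increasing search window" idea described above: inspect at most a window of k buffers per round, widen the window when nothing good turns up, and settle for the least-used buffer seen instead of sweeping the whole arena. `bounded_sweep` and its thresholds are invented for illustration.

```python
# Guess at a bounded, O(k) victim search (illustrative only; not the
# attached patch). The window doubles on each failed round, balancing
# the two requirements quoted above: small k for smoothness, large k
# for victim quality.

def bounded_sweep(usage_counts, hand, start_window=16):
    """Return (victim_index, new_hand) after a bounded search."""
    n = len(usage_counts)
    window = start_window
    while True:
        best_i, best_count = None, None
        for _ in range(min(window, n)):
            i = hand
            hand = (hand + 1) % n
            if usage_counts[i] == 0:
                return i, hand            # ideal victim: evict immediately
            usage_counts[i] -= 1
            if best_count is None or usage_counts[i] < best_count:
                best_i, best_count = i, usage_counts[i]
        if best_count is not None and best_count <= 1:
            return best_i, hand           # good-enough victim within k checks
        window *= 2                       # progressively widen the search

victim, _ = bounded_sweep([0, 4, 4, 4], 0)
print(victim)  # 0: a free buffer found within the first window
```

Each call touches at most a few windows' worth of buffers rather than the whole arena, which is where the O(k) bound comes from.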
On Fri, Aug 12, 2011 at 11:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I've not been reading "the literature", given the problems we had in
> 2004/5 regarding patents in this area. I also think that since we rely
> on the underlying filesystem for caching that we don't have exactly
> the same problem as other systems.

Reading the link you gave, I'm unimpressed by Clock-Pro. The measurements made are all on caches of a few MB, and the results show that Clock is roughly the same or better for large caches, though one might read them multiple ways. The zone of interest for us is above 1GB, and the issues we care about are contention, so even if we think Clock-Pro has a chance of improving things, we really don't have any measurements that support the cases we care about.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 12, 2011 at 4:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> You're missing an important point. The SeqScan is measurably faster
> when using the ring buffer because of the effects of L2 caching on
> the buffers.

I hadn't thought of that, but I think that's only true if the relation won't fit in shared_buffers (or whatever portion of shared_buffers is reasonably available, given the other activity on the machine). In this particular case, it's almost 20% faster if the relation is all in shared_buffers; I tested it. I think what's going on here is that initscan() has a heuristic that tries to use a BufferAccessStrategy if the relation is larger than 1/4 of shared_buffers. That's probably a pretty good heuristic in general, but in this case I have a relation which just so happens to be 27.9% of shared_buffers but will still fit. As you say below, that may not be typical in real life, although there are probably data warehousing systems where it's normal to have only one big query running at a time.

> Also, your logic is a little off, since you're doing the scan on an
> otherwise quiet machine soon after startup and there are some
> completely unused buffers. That isn't the typical state of the buffer
> cache and so spoiling the cache is not acceptable in the general case.
> The above case doesn't suck because it carefully reserved parts of
> shared buffers for other users when spoiling the cache would be of no
> benefit to the user, so it worked great from my perspective.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 12, 2011 at 4:36 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> On the other hand, the buffer manager has *no problem at all* trashing
>> the buffer arena if we're faulting in pages for an index scan rather
>> than a sequential scan. If you manage to get all of sample_data into
>> memory (by running many copies of the above query in parallel, you can
>> get each one to allocate its own ring buffer, and eventually pull in
>> all the pages), and then run some query that probes an index which is
>> too large to fit in shared_buffers, it cheerfully blows the whole
>> sample_data table out without a second thought. Had you sequentially
>> scanned a big table, of course, there would be some protection, but an
>> index scan can stomp all over everything with complete impunity.
>
> That's a good observation and I think we should do this
>
> * Make an IndexScan use a ring buffer once it has used 32 blocks. The
> vast majority won't do that, so we avoid overhead on the common path.
>
> * Make a BitmapIndexScan use a ring buffer when we know that the
> index is larger than 32 blocks. (Ignore upper parts of the tree for
> that calc.)

We'd need to think about what happens to the internal pages of the btree, the leaf pages, and then the heap pages from the underlying relation; those probably shouldn't all be treated the same.

Also, I think the tricky part is figuring out when to apply the optimization in the first place. Once we decide we need a ring buffer, a very small one (like 32 blocks) is probably the way to go. But it will be a loser to apply the optimization to data sets that would otherwise have fit in shared_buffers. This is a classic case of the LRU/MRU problem.
You want to evict buffers in LRU fashion until the working set gets larger than you can cache, and then you want to switch to MRU to avoid uselessly caching pages that you'll never manage to revisit before they're evicted. The point of algorithms like Clock-Pro is to try to have the system work that out on the fly, based on the actual workload, rather than using heuristics.

I agree with you that there's no convincing evidence that Clock-Pro would be better for us; I mostly thought it was interesting because the NetBSD and Linux guys seem to find it interesting, and they're managing much larger caches than the ones we're dealing with.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 12, 2011 at 1:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Aug 12, 2011 at 4:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> You're missing an important point. The SeqScan is measurably faster
>> when using the ring buffer because of the effects of L2 caching on
>> the buffers.
>
> I hadn't thought of that, but I think that's only true if the relation
> won't fit in shared_buffers (or whatever portion of shared_buffers is
> reasonably available, given the other activity on the machine). In
> this particular case, it's almost 20% faster if the relation is all in
> shared_buffers; I tested it. I think what's going on here is that
> initscan() has a heuristic that tries to use a BufferAccessStrategy if
> the relation is larger than 1/4 of shared_buffers. That's probably a
> pretty good heuristic in general, but in this case I have a relation
> which just so happens to be 27.9% of shared_buffers but will still
> fit. As you say below, that may not be typical in real life, although
> there are probably data warehousing systems where it's normal to have
> only one big query running at a time.

I think there are reasonable arguments to make:

* prefer_cache = off (default) | on, a table-level storage parameter; =on will disable the use of BufferAccessStrategy

* cache_spoil_threshold, a parameter with default 0.25

Considering the world of very large RAMs in which we now live, some control of the above makes sense.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 12, 2011 at 6:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The worst case behaviour of the current freelist code is that it can
> take up to 5 * shared_buffers checks before identifying a victim
> buffer. That occurs when we have a working set exactly matching size
> of shared buffers. There are problems when large portions of shared
> buffers are very popular, so that the free-ish buffers are a small
> proportion of the total buffers - this case gets worse if the
> distribution of buffer allocations is non-random so you get say 10
> free-ish blocks together, followed by N-10 very popular ones. That
> makes every 11th freelist request take ~O(shared_buffers). These
> problems will show themselves as sporadic BufFreelistLock contention.
> Sporadic is the problem here since it may make the contention hard to
> measure in aggregate, yet with real impact on performance.

I've been thinking about this exact same set of problems.

> For us to benefit from increased shared_buffers we need to have an
> algorithm that is O(k) in its worst case behaviour and O(1) in its
> best case behaviour - the latter aspect is provided by a correctly
> functioning bgwriter. At the moment, we do nothing to enforce O(k)
> behaviour.

Check.

> The following patch implements O(k) behaviour using a heuristic limit.
> My first attempt was a fixed heuristic limit, but that was less
> suitable because there are actually 2 requirements. The value of k
> must be small to have a measurable impact on the smoothness of
> StrategyGetBuffer(), but k must also be fairly large to ensure the
> choice of victim is a relatively good one. So the way to balance those
> conflicting objectives is to have a progressively increasing search
> window. And just to repeat, k is not a % of shared buffers, which
> would still give O(N) behaviour.

This is a very interesting idea, and I think it has a lot of potential.
One additional thought I had is that right now I think it's far too easy for buffers to get pushed all the way up to usage count 5, because we bump the reference count every time the buffer is pinned, which can easily happen many times per clock sweep. I think we would be better off having a "used" flag on each buffer. Each pin sets the used bit, but does nothing if it is already set. The usage count is only changed by the clock sweep, which does the following on each buffer:

- If the "used" flag is set, then the flag is cleared and the usage count increases by one, to a maximum of 5.

- If the "used" flag is not set, then the usage count decreases by one.

That way, buffers can't get hot because of a fast flurry of accesses followed by nothing - to get up to usage_count 5, they've actually got to stay interesting for a period of time.

As a side effect, I believe we could get rid of the spinlock that we currently take and release for every buffer the clock sweep hits, because we could regard the usage_count as protected by the BufFreelistLock rather than by the buffer header spinlock; and the "used" flag doesn't require perfect synchronization. We'd only pin the buffer we actually plan to evict (and could recheck the "used" flag before doing so, in case a memory-ordering effect made us miss the fact that it had been recently set).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
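The proposal above can be sketched in a few lines of Python (illustrative only - the real implementation would be C in the buffer manager): a pin merely sets a flag, and only the clock sweep converts flags into usage-count changes.

```python
# Direct sketch of the two sweep rules described above. `Buffer`,
# `pin` and `sweep_visit` are invented names for the example.

MAX_USAGE = 5

class Buffer:
    def __init__(self):
        self.used = False   # set on every pin, cleared by the sweep
        self.usage = 0      # changed only by the clock sweep

def pin(buf):
    buf.used = True         # repeated pins between sweeps add nothing

def sweep_visit(buf):
    """One clock-hand visit, per the two rules above."""
    if buf.used:
        buf.used = False
        buf.usage = min(buf.usage + 1, MAX_USAGE)
    else:
        buf.usage = max(buf.usage - 1, 0)

# A burst of pins between two sweeps counts as one "interesting" tick:
b = Buffer()
for _ in range(100):
    pin(b)                  # hot flurry of accesses...
sweep_visit(b)
print(b.usage)  # 1, not 5: the flurry bought only a single increment
```

To reach the maximum count, a buffer must be touched in each of five successive sweep intervals, which is exactly the "stay interesting for a period of time" property being argued for.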
On Fri, Aug 12, 2011 at 1:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> But it will be
> a loser to apply the optimization to data sets that would otherwise
> have fit in shared_buffers.

Spoiling the cache is a bad plan, even if it makes the current query faster.

I think we should make the optimisation stronger still and use POSIX_FADV_DONTNEED on blocks that fall from the ring buffer. Spoiling the OS cache is a problem as well.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 12, 2011 at 8:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Fri, Aug 12, 2011 at 1:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Aug 12, 2011 at 4:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> You're missing an important point. The SeqScan is measurably faster
>>> when using the ring buffer because of the effects of L2 caching on
>>> the buffers.
>>
>> I hadn't thought of that, but I think that's only true if the relation
>> won't fit in shared_buffers (or whatever portion of shared_buffers is
>> reasonably available, given the other activity on the machine). In
>> this particular case, it's almost 20% faster if the relation is all in
>> shared_buffers; I tested it. I think what's going on here is that
>> initscan() has a heuristic that tries to use a BufferAccessStrategy if
>> the relation is larger than 1/4 of shared_buffers. That's probably a
>> pretty good heuristic in general, but in this case I have a relation
>> which just so happens to be 27.9% of shared_buffers but will still
>> fit. As you say below, that may not be typical in real life, although
>> there are probably data warehousing systems where it's normal to have
>> only one big query running at a time.
>
> I think there are reasonable arguments to make
>
> * prefer_cache = off (default) | on a table level storage parameter,
> =on will disable the use of BufferAccessStrategy
>
> * make cache_spoil_threshold a parameter, with default 0.25

Yeah, something like that might make sense. Of course, a completely self-tuning system would be better, but I'm not sure we're going to find one of those.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 12, 2011 at 8:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Fri, Aug 12, 2011 at 1:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> But it will be
>> a loser to apply the optimization to data sets that would otherwise
>> have fit in shared_buffers.
>
> Spoiling the cache is a bad plan, even if it makes the current query faster.
>
> I think we should make the optimisation stronger still and use
> POSIX_FADV_DONTNEED on blocks that fall from the ring buffer. Spoiling
> the OS cache is a problem as well.

That would probably be better for really big tables, but it will be worse for those where the entire table fits in the OS cache. Cache spoiling means you're putting things into the cache which won't still be there the next time they're needed. If the data in question fits in cache (and I don't think we can regard that as uncommon, particularly for the OS cache, which can be half a terabyte), then you don't want the anti-spoiling logic to kick in.

One thing we could consider for large sequential scans is doing POSIX_FADV_SEQUENTIAL before beginning the scan (and maybe one more time if the scan wraps). That's basically throwing the problem of whether to go LRU or MRU back on the OS, but the OS may well have a better idea whether there's enough cache available to hold the whole thing than we do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
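For reference, the two hints discussed in this exchange map onto posix_fadvise(). Here is a hedged, Unix-only Python sketch (Python exposes the call as os.posix_fadvise; `read_sequentially` is an invented helper name, and PostgreSQL itself would make the equivalent posix_fadvise() calls from C):

```python
import os

def read_sequentially(path):
    """Read a file front to back with the kernel hints discussed above."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Advise the kernel that the file will be read sequentially, so
        # it can ramp up readahead (length 0 means "to end of file").
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        data = b"".join(iter(lambda: os.read(fd, 65536), b""))
        # Afterwards, drop the pages so a big scan doesn't spoil the
        # OS cache - the analogue of the ring-buffer suggestion.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return data
    finally:
        os.close(fd)
```

Note the trade-off argued above applies directly: issuing POSIX_FADV_DONTNEED unconditionally is a loss whenever the file would otherwise have stayed resident in the OS cache.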
On Fri, Aug 12, 2011 at 01:28:49PM +0100, Simon Riggs wrote:
> I think there are reasonable arguments to make
>
> * prefer_cache = off (default) | on a table level storage parameter,
> =on will disable the use of BufferAccessStrategy
>
> * make cache_spoil_threshold a parameter, with default 0.25
>
> Considering the world of very large RAMs in which we now live, some
> control of the above makes sense.

As long as we are discussing cache settings for tables, I have a client who would like to be able to lock specific tables and indexes into cache, as they have strict response-time requirements for particular queries. At the moment they are running postgres with a tablespace on ramfs and taking frequent backups, but this is not optimal.

-dg

--
David Gould       daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Only 96 of the 14286 buffers in sample_data are in shared_buffers,
> despite the fact that we have 37,218 *completely unused* buffers lying
> around. That sucks, because it means that the sample query did a
> whole lot of buffer eviction that was completely needless. The ring
> buffer is there to prevent trashing the shared buffer arena, but here
> it would have been fine to trash the arena: there wasn't anything
> there we would have minded losing (or, indeed, anything at all).

I don't disagree with the general thrust of your point, but I just wanted to point out one case where not using free buffers, even though they're available, might make sense:

If you execute a large batch delete or update, or even just set lots of hint bits, you'll dirty a lot of buffers. The ring buffer forces the query that is actually dirtying all these buffers to also do the I/O to write them out. Otherwise you leave them behind to slow down other queries. This was one of the problems with the old vacuum code, which the ring buffer replaced: it left behind lots of dirtied buffers for other queries to do I/O on.

--
greg
On Fri, Aug 12, 2011 at 10:51 PM, Greg Stark <stark@mit.edu> wrote:
> On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Only 96 of the 14286 buffers in sample_data are in shared_buffers,
>> despite the fact that we have 37,218 *completely unused* buffers lying
>> around. That sucks, because it means that the sample query did a
>> whole lot of buffer eviction that was completely needless. The ring
>> buffer is there to prevent trashing the shared buffer arena, but here
>> it would have been fine to trash the arena: there wasn't anything
>> there we would have minded losing (or, indeed, anything at all).
>
> I don't disagree with the general thrust of your point, but I just
> wanted to point out one case where not using free buffers even though
> they're available might make sense:
>
> If you execute a large batch delete or update or even just set lots of
> hint bits you'll dirty a lot of buffers. The ring buffer forces the
> query that is actually dirtying all these buffers to also do the i/o
> to write them out. Otherwise you leave them behind to slow down other
> queries. This was one of the problems with the old vacuum code which
> the ring buffer replaced. It left behind lots of dirtied buffers for
> other queries to do i/o on.

Interesting point.

After thinking about this some more, I'm coming around to the idea that we need to distinguish between:

1. Ensuring a sufficient supply of evictable buffers, and

2. Evicting a buffer.

The second obviously needs to be done only when needed, but the first one should really be done as background work. Currently, the clock sweep serves both functions, and that's not good. We shouldn't ever let ourselves get to the point where there are no buffers at all with reference count zero, so that the next guy who needs a buffer has to spin the clock hand around until the reference counts get low enough.
Maintaining a sufficient supply of refcount-zero buffers should be done as a background task; and possibly we ought to put them all in a linked list so that the next guy who needs a buffer can just pop one off.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
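The division of labour proposed above - a background task keeps a list of usage-count-zero buffers topped up, and the foreground allocator just pops one off - might look roughly like this Python sketch. `refill_freelist` and `pop_victim` are invented names, not PostgreSQL functions, and the recheck in `pop_victim` handles an entry going stale because the buffer was pinned again after being listed.

```python
from collections import deque

def refill_freelist(usage_counts, freelist, hand, target):
    """Background work: scan forward, listing evictable buffers."""
    n = len(usage_counts)
    scanned = 0
    while len(freelist) < target and scanned < n:
        if usage_counts[hand] == 0:
            freelist.append(hand)
        hand = (hand + 1) % n
        scanned += 1
    return hand

def pop_victim(usage_counts, freelist):
    """Foreground allocation: pop entries, skipping any gone stale."""
    while freelist:
        i = freelist.popleft()
        if usage_counts[i] == 0:    # still evictable?
            return i
        # stale: the buffer got hot again after being listed; skip it
    return None                      # empty list: fall back to a sweep

counts = [3, 0, 0, 5, 0]
fl = deque()
refill_freelist(counts, fl, 0, target=3)   # background work lists 1, 2, 4
counts[1] = 2                              # buffer 1 gets pinned again
print(pop_victim(counts, fl))  # 2: the stale entry for buffer 1 is skipped
```

The foreground path is O(1) when the list is well stocked; the cost of passing over unevictable buffers is paid in the background instead.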
On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> and possibly we ought to put them all in a
> linked list so that the next guy who needs a buffer can just pop one

The whole point of the clock sweep algorithm is to approximate an LRU without needing to maintain a linked list. The problem with a linked list is that you need to serialize access to it, so every time you reference a buffer you need to wait on a lock for the list so you can move that buffer around in it.

It does kind of seem like your numbers indicate we're missing part of the picture, though. The idea with the clock sweep algorithm is that you keep approximately 1/nth of the buffers at each of the n values. If we're allowing nearly all the buffers to reach a reference count of 5, then you're right that we've lost any information about which buffers have been referenced most recently. I wonder if we're supposed to be doing something like advancing the clock hand each time we increment a reference count as well?

--
greg
On Sat, Aug 13, 2011 at 4:40 PM, Greg Stark <stark@mit.edu> wrote:
> On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> and possibly we ought to put them all in a
>> linked list so that the next guy who needs a buffer can just pop one
>
> The whole point of the clock sweep algorithm is to approximate an LRU
> without needing to maintain a linked list. The problem with a linked
> list is that you need to serialize access to it so every time you
> reference a buffer you need to wait on a lock for the list so you can
> move that buffer around in the list.

Right, but the same doesn't apply to what I'm talking about. You just put each guy into the linked list when his reference count goes down to 0. When you want to allocate a buffer, you pop somebody off. If his reference count is still 0, you forget about him and pop the next guy, and keep going until you find a suitable victim.

The problem with just running the clock sweep every time you need a victim buffer is that you may have to pass over a large number of unevictable buffers before you find someone you can actually kick out. That's work that should really be getting done in the background, not at the absolute last moment when we discover, hey, we need a buffer.

> It does kind of seem like your numbers indicate we're missing part of
> the picture though. The idea with the clock sweep algorithm is that
> you keep approximately 1/nth of the buffers with each of the n values.
> If we're allowing nearly all the buffers to reach a reference count of
> 5 then you're right that we've lost any information about which
> buffers have been referenced most recently.

It doesn't seem right to have 1/nth of the buffers at each of n values, because that might not match the actual workload. For example, you might have lightning-hot buffers that fill 50% of your available cache. Trying to artificially force some of those down to usage count 4 or 3 doesn't actually get you anywhere.
In fact, AFAICS, there's no direct reason to care about how many buffers have usage count 1 vs. usage count 5. What IS important for performance is that there are enough with usage count 0. Any other distinction is only useful to the extent that it allows us to make a better decision about which buffers we should push down to 0 next.

> I wonder if we're supposed to be doing something like advancing the
> clock hand each time we increment a reference count as well?

That precise thing seems far too expensive, but I agree that something's missing. Right now we decrease usage counts when the clock hand moves (which is driven by buffer allocation), and we increase them when buffers are pinned (which is driven by access to already-resident buffers). I feel like that's comparing apples and oranges. If some subset of shared_buffers is very hot relative to the uncached pages, then everything's going to converge to 5 (or whatever arbitrary maximum we care to set in lieu of 5).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Aug 13, 2011 at 4:40 PM, Greg Stark <stark@mit.edu> wrote:
>> On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> and possibly we ought to put them all in a
>>> linked list so that the next guy who needs a buffer can just pop one
>>
>> The whole point of the clock sweep algorithm is to approximate an LRU
>> without needing to maintain a linked list. The problem with a linked
>> list is that you need to serialize access to it so every time you
>> reference a buffer you need to wait on a lock for the list so you can
>> move that buffer around in the list.
>
> Right, but the same doesn't apply to what I'm talking about. You just
> put each guy into the linked list when his reference count goes down
> to 0. When you want to allocate a buffer, you pop somebody off. If
> his reference count is still 0, you forget about him and pop the next
> guy, and keep going until you find a suitable victim.
>
> The problem with just running the clock sweep every time you need a
> victim buffer is that you may have to pass over a large number of
> unevictable buffers before you find someone you can actually kick out.
> That's work that should really be getting done in the background, not
> at the absolute last moment when we discover, hey, we need a buffer.

I think Greg is right here. Those suggested changes just bring back the LRU. If you keep a separate list then you need to serialize access to it.

usage_count == 0 and "unevictable" are dynamic states, so if the bgwriter sees those states, they can change before anyone uses the buffer. The only state which is unchangeable is a recently filled block, such as from a COPY, which is why I originally suggested a filled-block list - though I don't think we need it now, because of the buffer cycle strategy.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I agree that
> something's missing.

I'm quoting you completely out of context here, but yes, something is missing.

We can't credibly do one test on usage count in shared buffers and then start talking about how buffer management is all wrong. My own patch requires more test evidence before we can commit it, which is why I hadn't published it before now. I'll endeavour to do that now it's on the table.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Aug 14, 2011 at 6:57 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Sat, Aug 13, 2011 at 4:40 PM, Greg Stark <stark@mit.edu> wrote: >>> On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> and possibly we ought to put them all in a >>>> linked list so that the next guy who needs a buffer can just pop one >>> >>> The whole point of the clock sweep algorithm is to approximate an LRU >>> without needing to maintain a linked list. The problem with a linked >>> list is that you need to serialize access to it so every time you >>> reference a buffer you need to wait on a lock for the list so you can >>> move that buffer around in the list. >> >> Right, but the same doesn't apply to what I'm talking about. You just >> put each guy into the linked list when his reference count goes down >> to 0. When you want to allocate a buffer, you pop somebody off. If >> his reference count is still 0, you forget about him and pop the next >> guy, and keep going until you find a suitable victim. >> >> The problem with just running the clock sweep every time you need a >> victim buffer is that you may have to pass over a large number of >> unevictable buffers before you find someone you can actually kick out. >> That's work that should really be getting done in the background, not >> at the absolute last moment when we discover, hey, we need a buffer. > > I think Greg is right here. Those suggested changes just bring back the LRU. Since the problem with LRU is that it requires moving each buffer to the front of the list every time it's touched, and since the idea that I proposed does not require that, I don't know what you mean by this. > If you keep a separate list then you need to serialize access to it. That is true. However, there are some reasons to think it would be better than what we're doing right now. 
Right now, we acquire the BufFreelistLock, and then take and release some number of spinlocks. It seems fairly easy to construct cases where that number is routinely over 100; even on my laptop, I could construct cases where it was routinely in the 50-100 range. A linked list that we can just dequeue things from could be protected by a single spinlock, which would hopefully be somewhat faster. (In the interest of credit where credit is due, this is pretty much the point Jim Nasby has been making every time this argument comes up, and I've been dismissing it, but now that I understand how much work gets done in that clock sweep code, I'm starting to think he might be right.)

However, if it turns out that there's still too much contention over that spinlock, we can probably fix that by maintaining N lists instead of one, distributing buffers between them in round-robin fashion, and having each backend choose a list based on its backend ID modulo N. The testing I've done so far seems to indicate that spinlock contention doesn't really become noticeable until you have 32-40 CPUs pounding hard on the same spinlock, so even N = 4 is probably enough to make the problem go away. But I don't even think we're going to have to go that far right at the outset, because there's another major source of contention in the buffer eviction path: the buffer mapping locks.

> usage_count == 0 and "unevictable" are dynamic states, so if the
> bgwriter sees those states they can change before anyone uses the
> buffer.

Yep. That's already a problem. When we pin the buffer to be evicted, we're not yet holding the relevant buffer mapping locks. By the time we do, someone else can have pinned the page. So there's already retry logic here. In theory, you can imagine a backend getting starved for an arbitrarily long period of time, because it keeps picking a victim buffer that someone else wrenches away before we actually manage to kick it out.
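A minimal sketch of the pop-and-revalidate scheme described above (the class and field names are invented for illustration, and a Python lock stands in for the spinlock; this is not code from any posted patch):

```python
import threading
from collections import deque

class Buffer:
    """Toy stand-in for a shared buffer descriptor."""
    def __init__(self, buf_id):
        self.id = buf_id
        self.usage_count = 0
        self.pinned = False

class VictimList:
    """Side-list of eviction candidates: buffers are enqueued when their
    usage count drops to zero; a consumer pops entries until it finds one
    that is still evictable, since the state may have changed since the
    buffer was listed."""
    def __init__(self):
        self.lock = threading.Lock()   # stand-in for a single spinlock
        self.queue = deque()

    def note_zero(self, buf):
        # called when a buffer's usage count reaches 0
        with self.lock:
            self.queue.append(buf)

    def pop_victim(self):
        with self.lock:
            while self.queue:
                buf = self.queue.popleft()
                # revalidate: forget buffers that were touched again
                if buf.usage_count == 0 and not buf.pinned:
                    return buf
            return None
```

The N-way variant is just N instances of this structure, with each backend choosing one by backend ID modulo N.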
I haven't yet seen any evidence that that's a problem in practice, but if it is, this idea will make it worse. It seems hard to know without testing it, though. The big problem with this idea is that it pretty much requires that the work you mentioned in one of your other emails - separating the background writer and checkpoint machinery into two separate processes - to happen first. So I'm probably going to have to code that up to see whether this works. If you're planning to post that patch soon I'll wait for it. Otherwise, I'm going to have to cobble together something that is at least good enough for testing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Aug 13, 2011 at 09:40:15PM +0100, Greg Stark wrote: > On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > > and possibly we ought to put them all in a > > linked list so that the next guy who needs a buffer can just pop one > > The whole point of the clock sweep algorithm is to approximate an LRU > without needing to maintain a linked list. The problem with a linked > list is that you need to serialize access to it so every time you > reference a buffer you need to wait on a lock for the list so you can > move that buffer around in the list. Well, there are such things as lock-free linked lists. Whether they'd help here is the question though. http://www.cse.yorku.ca/~ruppert/papers/lfll.pdf Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
On Sun, Aug 14, 2011 at 1:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: > The big problem with this idea is that it pretty much requires that > the work you mentioned in one of your other emails - separating the > background writer and checkpoint machinery into two separate processes > - to happen first. So I'm probably going to have to code that up to > see whether this works. If you're planning to post that patch soon > I'll wait for it. Otherwise, I'm going to have to cobble together > something that is at least good enough for testing. No, the big problem with the idea is that regrettably it is just an idea on your part and has no basis in external theory or measurement. I would not object to you investigating such a path and I think you are someone that could invent something new and original there, but it seems much less likely to be fruitful, or at least would require significant experimental results to demonstrate an improvement in a wide range of use cases to the rest of the hackers. As to you not being able to work on your idea until I've split bgwriter/checkpoint, that's completely unnecessary, and you know it. A single ifdef is sufficient there, if at all. The path I was working on (as shown in the earlier patch) was to apply some corrections to the existing algorithm to reduce its worst case behaviour. That's something I've seen mention of people doing for RedHat kernels. Overall, I think a minor modification is the appropriate path. If Linux or other OS already use ClockPro then we already benefit from it. It seems silly to track blocks that recently left shared buffers when they are very likely still actually in memory in the filesystem cache. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Aug 14, 2011 at 1:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Aug 14, 2011 at 6:57 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Sat, Aug 13, 2011 at 4:40 PM, Greg Stark <stark@mit.edu> wrote: >>>> On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>>> and possibly we ought to put them all in a >>>>> linked list so that the next guy who needs a buffer can just pop one >>>> >>>> The whole point of the clock sweep algorithm is to approximate an LRU >>>> without needing to maintain a linked list. The problem with a linked >>>> list is that you need to serialize access to it so every time you >>>> reference a buffer you need to wait on a lock for the list so you can >>>> move that buffer around in the list. >>> >>> Right, but the same doesn't apply to what I'm talking about. You just >>> put each guy into the linked list when his reference count goes down >>> to 0. When you want to allocate a buffer, you pop somebody off. If >>> his reference count is still 0, you forget about him and pop the next >>> guy, and keep going until you find a suitable victim. >>> >>> The problem with just running the clock sweep every time you need a >>> victim buffer is that you may have to pass over a large number of >>> unevictable buffers before you find someone you can actually kick out. >>> That's work that should really be getting done in the background, not >>> at the absolute last moment when we discover, hey, we need a buffer. >> >> I think Greg is right here. Those suggested changes just bring back the LRU. > > Since the problem with LRU is that it requires moving each buffer to > the front of the list every time it's touched, and since the idea that > I proposed does not require that, I don't know what you mean by this. > >> If you keep a separate list then you need to serialize access to it. > > That is true. 
However, there are some reasons to think it would be > better than what we're doing right now. Right now, we acquire the > BufFreelistLock, and then take and release some number of spinlocks. > It seems fairly easy to construct cases where that number is routinely > over 100; even on my laptop, I could construct cases where it was > routinely 50-100 range. A linked list that we can just dequeue things > from could be protected by a single spinlock, which would hopefully be > somewhat faster. (In the interest of credit where credit is due, this > is pretty much the point Jim Nasby has been making every time this > argument comes up, and I've been dismissing it, but now that I > understand how much work gets done in that clock sweep code, I'm > starting to think he might be right.) > > However, if it turns out that that there's still too much contention > over that spinlock, we can probably fix that by maintaining N lists > instead of one, distributing buffers between them in round-robin > fashion, and having each backend choose a list based on its backend ID > modulo N. The testing I've done so far seems to indicate that > spinlock contention doesn't really become noticeable until you have > 32-40 CPUs pounding hard on the same spinlock, so even N = 4 is > probably enough to make the problem go away. But I don't even think > we're going to have to go that far right at the outset, because > there's another major source of contention in the buffer eviction > path: the buffer mapping locks. All of the above presumes that we have a list of best-next-victims. We don't have that list, so describing how we would optimise access to it means nothing. Yes, if we had it, we could dequeue them quickly - that is not the problem. The problem is that creating the best-next-victim list AND making it accurate is the act that causes contention. Yes, we can maintain an inaccurate list cheaply. 
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Aug 14, 2011 at 10:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sun, Aug 14, 2011 at 1:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> The big problem with this idea is that it pretty much requires that
>> the work you mentioned in one of your other emails - separating the
>> background writer and checkpoint machinery into two separate processes
>> - to happen first. So I'm probably going to have to code that up to
>> see whether this works. If you're planning to post that patch soon
>> I'll wait for it. Otherwise, I'm going to have to cobble together
>> something that is at least good enough for testing.
>
> No, the big problem with the idea is that regrettably it is just an
> idea on your part and has no basis in external theory or measurement.
> I would not object to you investigating such a path and I think you
> are someone that could invent something new and original there, but it
> seems much less likely to be fruitful, or at least would require
> significant experimental results to demonstrate an improvement in a
> wide range of use cases to the rest of the hackers.

All right, well, I'll mull over whether it's worth pursuing. Unless I or someone else comes up with an idea I like better, I think it probably is.

> As to you not being able to work on your idea until I've split
> bgwriter/checkpoint, that's completely unnecessary, and you know it. A
> single ifdef is sufficient there, if at all.

Hmm. Well, it might be unnecessary, but if I knew it were unnecessary, I wouldn't have said that I thought it was necessary.

> The path I was working on (as shown in the earlier patch) was to apply
> some corrections to the existing algorithm to reduce its worst case
> behaviour. That's something I've seen mention of people doing for
> RedHat kernels.

Yeah. Your idea is appealing because it bounds the amount of time spent searching for a victim. There is some chance that you might kick out a really hot buffer if there is a long series of such buffers in a row.
Without testing, I don't know whether that's a serious problem or not.

> Overall, I think a minor modification is the appropriate path. If
> Linux or other OS already use ClockPro then we already benefit from
> it. It seems silly to track blocks that recently left shared buffers
> when they are very likely still actually in memory in the filesystem
> cache.

You may be right. Basically, my concern is that buffer eviction is too slow. On a 32-core system, it's easy to construct a workload where the whole system bottlenecks on the rate at which buffers can be evicted and replaced - not because the system is fundamentally incapable of copying data around that quickly, but because everything piles up behind BufFreelistLock, and to a lesser extent the buffer mapping locks. Your idea may help with that, but I doubt that it's a complete solution.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Simon Riggs <simon@2ndQuadrant.com> writes: > On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I agree that something's missing. > I'm quoting you completely out of context here, but yes, something is missing. > We can't credibly do one test on usage count in shared buffers and > then start talking about how buffer management is all wrong. More generally: the originally presented facts suggest that there might be value in improving the "buffer access strategy" code that keeps particular operations from using all of shared_buffers. It seems to me to be a giant and unsubstantiated leap from that to the conclusion that there's anything wrong with the clock sweep algorithm. Moreover, several of the proposed "fixes" amount to reversion to methods that we already know are less good than the clock sweep, because we already tried them years ago. So I've been quite unimpressed with the quality of discussion in this thread. regards, tom lane
On Sun, Aug 14, 2011 at 1:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I agree that something's missing.
>
>> I'm quoting you completely out of context here, but yes, something is missing.
>
>> We can't credibly do one test on usage count in shared buffers and
>> then start talking about how buffer management is all wrong.
>
> More generally: the originally presented facts suggest that there might
> be value in improving the "buffer access strategy" code that keeps
> particular operations from using all of shared_buffers.

Possibly. Since I've realized that we only switch to a ring buffer when the size of the relation is more than 25% of shared buffers, I'm less concerned about that problem. My test case only demonstrated a ~20% performance improvement from getting rid of the ring buffer, and that was with a relation that happened to be 27% of the size of shared_buffers, so it was a bit unlucky. I think it'd be interesting to try to make this smarter in some way, but it's not bugging me as much now that I've realized that I was unlucky to fall down that particular well.

> It seems to me
> to be a giant and unsubstantiated leap from that to the conclusion that
> there's anything wrong with the clock sweep algorithm. Moreover,
> several of the proposed "fixes" amount to reversion to methods that
> we already know are less good than the clock sweep, because we already
> tried them years ago. So I've been quite unimpressed with the quality
> of discussion in this thread.

Well, here's the problem I'm worried about: if 99% of shared_buffers is filled with a very hot working set, every new page that gets brought in will need to scan, on average, 100 buffers before finding something to evict. That seems slow.
Simon is proposing to bound the really bad case where you flip through the entire ring multiple times before you find a buffer, and that may well be worth doing. But I think even scanning 100 buffers every time you need to bring something in is too slow. What's indisputable is that a SELECT-only workload which is larger than shared_buffers can be very easily rate-limited by the speed at which BufFreelistLock can be taken and released. If you have a better idea for solving that problem, I'm all ears... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
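The "scan 100 buffers on average" figure follows from a simple geometric argument, sketched below (a deliberate simplification: it treats the fraction of unevictable buffers as fixed and ignores the decrements the sweep itself performs):

```python
def expected_sweep_length(hot_fraction):
    """If a fraction p of buffers is unevictable at any instant, each
    buffer the clock hand touches is a usable victim with probability
    (1 - p), so the number of buffers touched per eviction is
    geometrically distributed with mean 1 / (1 - p)."""
    if not 0 <= hot_fraction < 1:
        raise ValueError("hot_fraction must be in [0, 1)")
    return 1.0 / (1.0 - hot_fraction)
```

At 99% hot this gives 100 buffers per eviction, matching the estimate above; even at 90% it is still 10 buffers, and therefore 10 spinlock acquire/release cycles, all performed while holding BufFreelistLock.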
On 08/12/2011 10:51 PM, Greg Stark wrote:
> If you execute a large batch delete or update or even just set lots of
> hint bits you'll dirty a lot of buffers. The ring buffer forces the
> query that is actually dirtying all these buffers to also do the i/o
> to write them out. Otherwise you leave them behind to slow down other
> queries. This was one of the problems with the old vacuum code which
> the ring buffer replaced. It left behind lots of dirtied buffers for
> other queries to do i/o on.

I ran into the other side of this when trying to use Linux's relatively new dirty_background_bytes tunable to constrain the OS write cache. When running with the current VACUUM ring buffer code, if there isn't also a large OS write cache backing it, performance suffers quite a bit. I've been adding test rigging to quantify this into pgbench-tools recently, and I fear that one of the possible outcomes could be pressure toward making the database's ring buffer bigger. Just a theory--waiting on some numbers still.

Anyway, I think every idea thrown out here so far needs about an order of magnitude more types of benchmarking test cases before it can be evaluated at all. The case I just mentioned is a good example of why. Every other test I ran showed a nice improvement with the kernel tuning I tried, but VACUUM was massively detuned in the process.

I have an entire file folder filled with notes on ways to reorganize the buffer cache, from my background writer work for 8.3. In my mind they're all sitting stuck behind the problem of getting enough standardized benchmark workloads to have a performance regression suite. The idea of touching any of this code without a look at a large number of different tests is a bit optimistic. What I expect to happen here is that all initially proposed changes will end up tuning for one workload at the expense of other, unmeasured ones.
-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
Greg Smith <greg@2ndQuadrant.com> wrote:
> Anyway, I think every idea thrown out here so far needs about an
> order of magnitude more types of benchmarking test cases before it
> can be evaluated at all.

Right. I'm very excited about all the optimizations going in, and can't see where the ones committed so far are very likely to cause problems for workloads other than those tested, but here we're heading into territory where the performance farm is very desperately needed.

That said, I'll throw out one more idea, as something that worked very well for me in a similar situation, and played well with external caching which knew lots about the storage media but nothing about the database side. It's very similar to the clock sweep algorithm, but uses a weighted average rather than a simple count. It didn't tend to leave as many buffers clustered at one end or the other, and the gradation did seem to be useful.

I started out this post trying to describe it more generally, but it all came out sounding too vague and hand-wavy; so I'll give a description with more particulars which could, of course, be tweaked without tossing the concept. This particular description modifies the techniques I've used to try to better fit the particulars of PostgreSQL. It may fall down on contention issues, but since it benchmarked better for me than the alternatives, I thought it might at least help spin off other useful ideas.

(1) Each buffer would have a one-byte count, incremented on access, similar to the current count. Of course, 255 would not wrap on access, but be left unchanged.

(2) We would have 256 int counters for how many buffers were available with each access count. These would be maintained when the access counts changed.

(3) While sweeping, we wouldn't decrement the access counters for buffers we couldn't use; rather, there would be a boolean flag to say whether to divide the access counts for non-chosen buffers by two.
(4) The "reduce" flag would be set when any access count goes to 255, and cleared after one complete sweep through the buffers. (So, obviously, we note the location when we set the flag.) (5) To find the next buffer to evict, we would find out what is the lowest count in use, and sweep forward until we found one. (6) We would have a background process doing this sweeping and count reduction which would try to keep a few evicted buffers in a queue for quick pickup. This queue would be the only source for getting a buffer. Getting a buffer would always be a very lightweight operation if there's something in the queue. The background task would try very hard to keep the queue non-empty, only coming to rest when the queue was "full" -- whatever that was chosen to mean (possibly based on the sort of adaptive calculations currently used by the background writer). There are a lot of interesting ways this could interact with the background writer. One of the more intriguing to me would be for the "sweep" process to estimate what current access count levels will need to be used on its next sweep and queue up any at those levels which are dirty for the background writer. This is vaguely conceptually similar to the "wash marker" used in some LRU lists. -Kevin
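Points (1) through (6) can be put in a rough single-threaded sketch (the names are invented; as simplifications of the description above, the reduce flag here is never cleared after a full sweep, and an evicted slot is simply reset to a count of zero):

```python
class WeightedClock:
    """Toy sketch of the weighted clock scheme: a one-byte access count
    per buffer (1), a histogram of counts (2), halving of non-chosen
    counts during a sweep when the flag is set (3)/(4), and eviction of
    the lowest count in use (5)."""
    MAX = 255

    def __init__(self, n):
        self.counts = [0] * n      # (1) one-byte access count per buffer
        self.hist = [0] * 256      # (2) buffers at each access count
        self.hist[0] = n
        self.reduce = False        # (3)/(4) halve counts during next sweep
        self.hand = 0

    def access(self, i):
        c = self.counts[i]
        if c < self.MAX:           # 255 does not wrap; it stays put
            self.hist[c] -= 1
            self.hist[c + 1] += 1
            self.counts[i] = c + 1
            if c + 1 == self.MAX:
                self.reduce = True # (4) saturated: schedule reduction

    def evict_one(self):
        # (5) find the lowest access count in use, sweep forward to it
        lowest = next(c for c in range(256) if self.hist[c] > 0)
        n = len(self.counts)
        for _ in range(n):
            i = self.hand
            self.hand = (self.hand + 1) % n
            c = self.counts[i]
            if c <= lowest:        # victim found: reset its slot
                self.hist[c] -= 1
                self.hist[0] += 1
                self.counts[i] = 0
                return i
            if self.reduce:        # (3) halve non-chosen counts
                self.hist[c] -= 1
                self.hist[c // 2] += 1
                self.counts[i] = c // 2
        return None
```

The background process of point (6) would simply call evict_one() ahead of demand and keep the results in a queue for backends to pop.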
On Aug 13, 2011, at 3:40 PM, Greg Stark wrote:
> It does kind of seem like your numbers indicate we're missing part of
> the picture though. The idea with the clock sweep algorithm is that
> you keep approximately 1/nth of the buffers with each of the n values.
> If we're allowing nearly all the buffers to reach a reference count of
> 5 then you're right that we've lost any information about which
> buffers have been referenced most recently.

One possible missing piece here is that OS clock sweeps depend on the clock hand to both increment and decrement the usage count. The hardware sets a bit any time a page is accessed; as the clock sweeps, it increases the usage count if the bit is set and decreases it if it's clear. I believe someone else in the thread suggested this, and I definitely think it's worth an experiment. Presumably this would also ease some lock contention issues.

There is another piece that might be relevant... many (most?) OSes keep multiple lists of pages. FreeBSD for example contains these page lists (http://www.freebsd.org/doc/en/articles/vm-design/article.html). Full description follows, but I think the biggest take-away is that there is a difference in how pages are handled once they are no longer active, based on whether the page is dirty or not.

Active: These pages are actively in use and are not currently under consideration for eviction. This is roughly equivalent to all of our buffers with a usage count of 5. When an active page's usage count drops to its minimum value, it will get unmapped from process space and moved to one of two queues:

Inactive: DIRTY pages that are eligible for eviction once they've been written out.

Cache: CLEAN pages that may be immediately reclaimed.

Free: A small set of pages that are basically the tail of the Cache list. The OS *must* maintain some pages on this list to support memory needed during interrupt handling. The size of this list is typically kept very small, and I'm not sure if non-interrupt processing will pull from this list.

It's important to note that the OS can pull a page back out of the Inactive and Cache lists back into Active very cheaply.

I think there are two interesting points here. First: after a page has been determined to no longer be in active use, it goes into Inactive or Cache based on whether it's dirty. ISTM that allows for much better scheduling of the flushing of dirty pages. That said, I'm not sure how much that would help us due to checkpoint requirements.

Second: AFAIK only the Active list has a clock sweep. I believe the others are LRUs (the mentioned URL refers to them as queues). I believe this works well because if a page faults it just needs to be removed from whichever queue it is in, added to the Active queue, and mapped back into process space.

-- 
Jim C. Nasby, Database Architect
jim@nasby.net
512.569.9461 (cell)
http://jim.nasby.net
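The queue transitions in that description can be summarized in a few lines (a simplified sketch of the FreeBSD behavior as described above, not of any PostgreSQL code; the function names are mine):

```python
from enum import Enum, auto

class Q(Enum):
    ACTIVE = auto()     # in use, clock-swept, not an eviction candidate
    INACTIVE = auto()   # dirty pages awaiting writeback before reclaim
    CACHE = auto()      # clean pages, immediately reclaimable
    FREE = auto()       # small reserve for interrupt-time allocation

def deactivate(dirty):
    """Where a page goes when its usage count bottoms out: dirty pages
    queue for writeback, clean pages become immediately reclaimable."""
    return Q.INACTIVE if dirty else Q.CACHE

def writeback_complete(q):
    # once an inactive page is flushed it is clean, so it joins Cache
    return Q.CACHE if q is Q.INACTIVE else q

def fault(q):
    # pulling a page back from Inactive or Cache is cheap: just requeue
    return Q.ACTIVE
```

The key property for the discussion here is the first function: eviction candidates are split by dirtiness up front, so writeback scheduling never has to hunt through a mixed pool.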
On Sun, Aug 14, 2011 at 7:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Simon is proposing to bound the
> really bad case where you flip through the entire ring multiple times
> before you find a buffer, and that may well be worth doing. But I
> think even scanning 100 buffers every time you need to bring something
> in is too slow. What's indisputable is that a SELECT-only workload
> which is larger than shared_buffers can be very easily rate-limited by
> the speed at which BufFreelistLock can be taken and released. If you
> have a better idea for solving that problem, I'm all ears...

I felt we were on the right track here for a while.

Does anyone dispute that BufFreelistLock is a problem? Shared buffer replacement is *not* O(1) and it definitely needs to be.

Does anyone have a better idea for reducing BufFreelistLock contention? Something simple that will work for 9.2?

What steps are there between here and committing the freelist_ok.v2.patch?

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes: > Does anyone have a better idea for reducing BufFreelistLock > contention? Something simple that will work for 9.2? Get rid of the freelist? Once shared buffers are full, it's just about useless anyway. But you'd need to think about the test cases that you pay attention to, as there might be scenarios where it remains useful. regards, tom lane
On Mon, Jan 2, 2012 at 5:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> Does anyone have a better idea for reducing BufFreelistLock
>> contention? Something simple that will work for 9.2?
>
> Get rid of the freelist? Once shared buffers are full, it's just about
> useless anyway. But you'd need to think about the test cases that you
> pay attention to, as there might be scenarios where it remains useful.

Agree the freelist is mostly useless, but startup and dropping objects require it. When it's full it's just an integer > 0 test, which is cheap, and it's on the same cacheline as the nextVictimBuffer anyway, so we have to fetch it.

The clock sweep is where all the time goes, in its current form.

We have 2 problems:

1. StrategySyncStart() requests exclusive lock on BufFreelistLock, so bgwriter has to fight with backends to find out where it should clean. As a result it spends lots of time waiting and insufficient time cleaning.

2. When a backend can't find a free buffer, it spins for a long time while holding the lock. This makes the buffer strategy O(N) in its worst case, which slows everything down. Notably, while this is happening the bgwriter sits doing nothing at all, right at the moment when it is most needed.

The Clock algorithm is an approximation of an LRU, so is already suboptimal as a "perfect cache". Tweaking to avoid worst case behaviour makes sense. How much to tweak? Well,...

(1) is fixed by buffreelistlock_reduction.v1.patch
(2) is fixed by freelist_ok.v2.patch

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jan 2, 2012 at 2:53 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> Get rid of the freelist? Once shared buffers are full, it's just about >> useless anyway. But you'd need to think about the test cases that you >> pay attention to, as there might be scenarios where it remains useful. > > Agree freelist is mostly useless, but startup and dropping objects requires it. Not really. If someone drops an object, we must invalidate all the buffers immediately, but adding them to the free list is just an optimization to make sure we reuse the invalidated buffers in preference to evicting buffers that still have valid contents. But I suspect that in practice this is not a very important optimization, because (1) most people probably don't drop permanent tables or databases very often and (2) buffers that are being heavily used should have a positive reference count, in which case the clock sweep will pass over them and land on one of the newly-invalidated buffers anyway. > When its full its just an integer > 0 test, which is cheap and its on > the same cacheline as the nextVictimBuffer anyway, so we have to fetch > it. > > The clock sweep is where all the time goes, in its current form. ...but I agree with this. In its current form, the clock sweep has to acquire a spinlock for every buffer it touches. That's really expensive, and I think we need to either get rid of it altogether or at least find some way to move it into the background. Ideally, in the common case, a backend that wants to allocate a buffer shouldn't need to do more than acquire a spinlock, pop a buffer off some sort of linked list of allocation targets, and release the spinlock. Whatever other computation is needed should get pushed out of foreground processing. > We have 2 problems > > 1. StrategySyncStart() requests exclusive lock on BufFreelistLock, so > bgwriter has to fight with backends to find out where it should clean. 
> As a result it spends lots of time waiting and insufficient time > cleaning. I have trouble accepting that this is really the problem we should be solving. There's only one background writer process, and that process is waking up every 200ms by default. Many people probably tune that down quite a bit, but unless I'm quite mistaken, it should still be a drop in the bucket next to what the backends are doing. > 2. When a backend can't find a free buffer, it spins for a long time > while holding the lock. This makes the buffer strategy O(N) in its > worst case, which slows everything down. Notably, while this is > happening the bgwriter sits doing nothing at all, right at the moment > when it is most needed. The Clock algorithm is an approximation of an > LRU, so is already suboptimal as a "perfect cache". Tweaking to avoid > worst case behaviour makes sense. How much to tweak? Well,... I generally agree with this analysis, but I don't think the proposed patch is going to solve the problem. It may have some merit as a way of limiting the worst case behavior. For example, if every shared buffer has a reference count of 5, the first buffer allocation that misses is going to have a lot of work to do before it can actually come up with a victim. But I don't think it's going to provide good scaling in general. Even if the background writer only spins through, on average, ten or fifteen buffers before finding one to evict, that still means we're acquiring ten or fifteen spinlocks while holding BufFreelistLock. I don't currently have the measurements to prove that's too expensive, but I bet it is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 3, 2012 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> The clock sweep is where all the time goes, in its current form. > > ...but I agree with this. In its current form, the clock sweep has to > acquire a spinlock for every buffer it touches. That's really > expensive, and I think we need to either get rid of it altogether or > at least find some way to move it into the background. Ideally, in > the common case, a backend that wants to allocate a buffer shouldn't > need to do more than acquire a spinlock, pop a buffer off some sort of > linked list of allocation targets, and release the spinlock. Whatever > other computation is needed should get pushed out of foreground > processing. So you don't think a freelist is worth having, but you want a list of allocation targets. What is the practical difference? >> We have 2 problems >> >> 1. StrategySyncStart() requests exclusive lock on BufFreelistLock, so >> bgwriter has to fight with backends to find out where it should clean. >> As a result it spends lots of time waiting and insufficient time >> cleaning. > > I have trouble accepting that this is really the problem we should be > solving. There's only one background writer process, and that process > is waking up every 200ms by default. Many people probably tune that > down quite a bit, but unless I'm quite mistaken, it should still be a > drop in the bucket next to what the backends are doing. That doesn't follow. Forcing the bgwriter to wait makes no sense. Yes, most of the contention is caused by the other backends, but the bgwriter experiences that contention currently when it has no need to do so. Presumably if you did have an allocation list maintained in the background by the bgwriter you wouldn't expect the bgwriter to wait behind a lock for no reason when it could be doing useful work. >> 2. When a backend can't find a free buffer, it spins for a long time >> while holding the lock. 
This makes the buffer strategy O(N) in its >> worst case, which slows everything down. Notably, while this is >> happening the bgwriter sits doing nothing at all, right at the moment >> when it is most needed. The Clock algorithm is an approximation of an >> LRU, so is already suboptimal as a "perfect cache". Tweaking to avoid >> worst case behaviour makes sense. How much to tweak? Well,... > > I generally agree with this analysis, but I don't think the proposed > patch is going to solve the problem. It may have some merit as a way > of limiting the worst case behavior. For example, if every shared > buffer has a reference count of 5, the first buffer allocation that > misses is going to have a lot of work to do before it can actually > come up with a victim. But I don't think it's going to provide good > scaling in general. Even if the background writer only spins through, > on average, ten or fifteen buffers before finding one to evict, that > still means we're acquiring ten or fifteen spinlocks while holding > BufFreelistLock. I don't currently have the measurements to prove > that's too expensive, but I bet it is. I think it's worth reducing the cost of scanning, but that has little to do with solving the O(N) problem. I think we need both. I've left the way open for you to redesign freelist management in many ways. Please take the opportunity and go for it, though we must realise that major overhauls require significantly more testing to prove their worth. Reducing spinlocking alone sounds like a good way to proceed for this release. If you don't have time in 9.2, then these two small patches are worth having. The bgwriter locking patch needs less formal evidence to show its worth. We simply don't need to have a bgwriter that just sits waiting doing nothing. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jan 3, 2012 at 10:56 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Tue, Jan 3, 2012 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> The clock sweep is where all the time goes, in its current form. >> >> ...but I agree with this. In its current form, the clock sweep has to >> acquire a spinlock for every buffer it touches. That's really >> expensive, and I think we need to either get rid of it altogether or >> at least find some way to move it into the background. Ideally, in >> the common case, a backend that wants to allocate a buffer shouldn't >> need to do more than acquire a spinlock, pop a buffer off some sort of >> linked list of allocation targets, and release the spinlock. Whatever >> other computation is needed should get pushed out of foreground >> processing. > > So you don't think a freelist is worth having, but you want a list of > allocation targets. > What is the practical difference? I think that our current freelist is practically useless, because it is almost always empty, and the cases where it's not empty (startup, and after a table or database drop) are so narrow that we don't really get any benefit out of having it. However, I'm not opposed to the idea of a freelist in general: I think that if we actually put in some effort to keep the freelist in a non-empty state it would help a lot, because backends would then have much less work to do at buffer allocation time. >> I have trouble accepting that this is really the problem we should be >> solving. There's only one background writer process, and that process >> is waking up every 200ms by default. Many people probably tune that >> down quite a bit, but unless I'm quite mistaken, it should still be a >> drop in the bucket next to what the backends are doing. > > That doesn't follow. Forcing the bgwriter to wait makes no sense. 
Yes, > most of the contention is caused by the other backends, but the > bgwriter experiences that contention currently when it has no need to > do so. Sure, but if that contention is a negligible fraction of the total and doesn't cost anything measurable, then it doesn't make sense to add code to eliminate it. There are all sorts of places in the system where we have contention at a level that doesn't affect performance. Many of those could probably be fixed by adding more complicated locking, but that would just complicate the code to no purpose. If there's a demonstrable performance benefit to this change then it's worth a harder look, but my gut feeling is that there won't be. >>> 2. When a backend can't find a free buffer, it spins for a long time >>> while holding the lock. This makes the buffer strategy O(N) in its >>> worst case, which slows everything down. Notably, while this is >>> happening the bgwriter sits doing nothing at all, right at the moment >>> when it is most needed. The Clock algorithm is an approximation of an >>> LRU, so is already suboptimal as a "perfect cache". Tweaking to avoid >>> worst case behaviour makes sense. How much to tweak? Well,... >> >> I generally agree with this analysis, but I don't think the proposed >> patch is going to solve the problem. It may have some merit as a way >> of limiting the worst case behavior. For example, if every shared >> buffer has a reference count of 5, the first buffer allocation that >> misses is going to have a lot of work to do before it can actually >> come up with a victim. But I don't think it's going to provide good >> scaling in general. Even if the background writer only spins through, >> on average, ten or fifteen buffers before finding one to evict, that >> still means we're acquiring ten or fifteen spinlocks while holding >> BufFreelistLock. I don't currently have the measurements to prove >> that's too expensive, but I bet it is. 
> > I think it's worth reducing the cost of scanning, but that has little > to do with solving the O(N) problem. I think we need both. Maybe. I would really like a unified solution to both problems, but I'd be willing to accept a solution for one of them if we're confident that we've actually solved it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Jan 3, 2012, at 11:15 AM, Robert Haas wrote: >> So you don't think a freelist is worth having, but you want a list of >> allocation targets. >> What is the practical difference? > > I think that our current freelist is practically useless, because it > is almost always empty, and the cases where it's not empty (startup, > and after a table or database drop) are so narrow that we don't really > get any benefit out of having it. However, I'm not opposed to the > idea of a freelist in general: I think that if we actually put in some > effort to keep the freelist in a non-empty state it would help a lot, > because backends would then have much less work to do at buffer > allocation time. This is exactly what the FreeBSD VM system does (which is at least one of the places where the idea of a clock sweep for PG came from ages ago). There is a process that does nothing but attempt to keep X amount of memory on the free list, where it can immediately be grabbed by anything that needs memory. Pages on the freelist are guaranteed to be clean (as in not dirty), but not zero'd. In fact, IIRC if a page on the freelist gets referenced again it can be pulled back out of the free list and put back into an active state. The one downside I see to this is that we'd need some heuristic to determine how many buffers we want to keep on the freelist. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
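[Editor's illustration] The division of labor Jim describes - a background process keeping the freelist stocked so the foreground path is a bare pop - can be sketched like this. It is a hypothetical simplification, not PostgreSQL's actual freelist code: locking, victim selection, and write-back of dirty pages are all elided, and the names are invented.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 16
#define FREELIST_TARGET 4   /* the "X amount" heuristic Jim mentions */

/* Invented, simplified page descriptor. */
typedef struct Page
{
    int on_freelist;
    struct Page *next_free;
} Page;

static Page pool[POOL_SIZE];
static Page *freelist_head = NULL;
static int freelist_len = 0;

/* Background maintainer: push clean pages until the freelist holds
 * FREELIST_TARGET entries.  Victim selection here is a stub; the real
 * work (clock sweep, writing dirty pages) would happen in this
 * process, off the allocation fast path. */
static void
maintain_freelist(void)
{
    for (int i = 0; i < POOL_SIZE && freelist_len < FREELIST_TARGET; i++)
    {
        Page *p = &pool[i];

        if (!p->on_freelist)
        {
            p->on_freelist = 1;
            p->next_free = freelist_head;
            freelist_head = p;
            freelist_len++;
        }
    }
}

/* Foreground fast path: in Robert's sketch this is just a spinlock
 * acquire, a pop, and a release -- O(1) no matter how many buffers a
 * sweep would otherwise have had to scan. */
static Page *
allocate_page(void)
{
    Page *p = freelist_head;

    if (p == NULL)
        return NULL;            /* fall back to the slow path */
    freelist_head = p->next_free;
    p->on_freelist = 0;
    freelist_len--;
    return p;
}
```

A page "pulled back out of the free list" on re-reference, as in FreeBSD, would just be unlinked and have on_freelist cleared, which is cheap in this structure.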
On Tue, Jan 3, 2012 at 6:22 PM, Jim Nasby <jim@nasby.net> wrote: > On Jan 3, 2012, at 11:15 AM, Robert Haas wrote: >>> So you don't think a freelist is worth having, but you want a list of >>> allocation targets. >>> What is the practical difference? >> >> I think that our current freelist is practically useless, because it >> is almost always empty, and the cases where it's not empty (startup, >> and after a table or database drop) are so narrow that we don't really >> get any benefit out of having it. However, I'm not opposed to the >> idea of a freelist in general: I think that if we actually put in some >> effort to keep the freelist in a non-empty state it would help a lot, >> because backends would then have much less work to do at buffer >> allocation time. > > This is exactly what the FreeBSD VM system does (which is at least one of the places where the idea of a clock sweep for PG came from ages ago). There is a process that does nothing but attempt to keep X amount of memory on the free list, where it can immediately be grabbed by anything that needs memory. Pages on the freelist are guaranteed to be clean (as in not dirty), but not zero'd. In fact, IIRC if a page on the freelist gets referenced again it can be pulled back out of the free list and put back into an active state. > > The one downside I see to this is that we'd need some heuristic to determine how many buffers we want to keep on the freelist. Fortuitously, I believe the background writer already has most of the necessary logic: it attempts to predict how many buffers are about to be needed - I think based on a decaying average. Actually, I think that logic could use some improvement, because I believe I've heard Greg Smith comment that it's often necessary to tune bgwriter_delay downward. It'd be nice to make the delay adaptive somehow, to avoid the need for manual tuning (and unnecessary wake-ups when the system goes idle). But possibly the existing logic is good enough for a first cut. 
However, in the interest of full disclosure, I'll admit that I've done no testing in this area at all and am talking mostly out of my posterior. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
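[Editor's illustration] The decaying average Robert alludes to can be sketched as simple exponential smoothing. This is only in the spirit of what BgBufferSync() does - the real logic is more elaborate, and the function and parameter names here are invented. smoothing_samples plays the role of the decay horizon: larger values react more slowly to changes in demand.

```c
#include <assert.h>

/* Fold one interval's observed allocation count into a running
 * exponentially decaying estimate of buffers needed per interval. */
static double
update_smoothed_alloc(double smoothed, int recent_allocs,
                      int smoothing_samples)
{
    /* Each update moves the estimate 1/smoothing_samples of the way
     * toward the latest observation, so old intervals decay away
     * geometrically. */
    return smoothed + ((double) recent_allocs - smoothed) / smoothing_samples;
}
```

The background writer (or a freelist maintainer) would then aim to keep roughly that many clean buffers ahead of the backends each cycle, which is exactly the heuristic Jim said the freelist approach would need.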
On 01/03/2012 06:22 PM, Jim Nasby wrote: > On Jan 3, 2012, at 11:15 AM, Robert Haas wrote: > >> I think that our current freelist is practically useless, because it >> is almost always empty, and the cases where it's not empty (startup, >> and after a table or database drop) are so narrow that we don't really >> get any benefit out of having it. However, I'm not opposed to the >> idea of a freelist in general: I think that if we actually put in some >> effort to keep the freelist in a non-empty state it would help a lot, >> because backends would then have much less work to do at buffer >> allocation time. >> > This is exactly what the FreeBSD VM system does (which is at least one of the places where the idea of a clock sweep for PG came from ages ago). There is a process that does nothing but attempt to keep X amount of memory on the free list, where it can immediately be grabbed by anything that needs memory. Pages on the freelist are guaranteed to be clean (as in not dirty), but not zero'd. In fact, IIRC if a page on the freelist gets referenced again it can be pulled back out of the free list and put back into an active state. > > The one downside I see to this is that we'd need some heuristic to determine how many buffers we want to keep on the freelist. > http://wiki.postgresql.org/wiki/Todo#Background_Writer has "Consider adding buffers the background writer finds reusable to the free list" and "Automatically tune bgwriter_delay based on activity rather than using a fixed interval", which both point to my 8.3 musing and other suggestions starting at http://archives.postgresql.org/pgsql-hackers/2007-04/msg00781.php I could write both those in an afternoon. The auto-tuning stuff already in the background writer was originally intended to tackle this issue, but I dropped that in favor of shipping something simpler first. There's even a prototype somewhere on an old drive here. 
The first missing piece needed before this was useful was separating out the background writer and checkpointer processes. Once I realized the checkpoints were monopolizing so much time, especially when they hit bad states, it was obvious the writer couldn't be relied upon for this job. That's much better now since Simon's 806a2aee3791244bf0f916729bfdb5489936e068 "Split work of bgwriter between 2 processes: bgwriter and checkpointer", which just became available in November to build on. The second missing piece blocking this work in my mind was how exactly we're going to benchmark the result, mainly to prove it doesn't hurt some workloads. I haven't fully internalized the implications of Robert's upthread comments, in terms of being able to construct a benchmark stressing both the best and worst case situation here. That's really the hardest part of this whole thing, by a lot. Recent spending has brought me an 8 HyperThread core laptop that can also run DTrace, so I expect to have better visibility into this soon too. I think here in 2011 the idea of having a background writer process that could potentially occupy most of a whole core doing work so backends don't have to is an increasingly attractive one. So long as that comes along with an auto-tuning delay, it shouldn't hurt the power management work either. Might even help really, by allowing larger values of bgwriter_delay than you'd want to use during busy periods. I was planning to mimic the sort of fast attack/slow decay logic already used for the auto-tuned timing, so that you won't fall behind by more than one bgwriter_delay worth of activity. Then it should realize a burst is here and the writer has to start moving faster. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
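[Editor's illustration] Greg's fast attack/slow decay idea might look roughly like this - the constants and names are invented for illustration, not taken from the bgwriter. A burst of allocations snaps the delay to its minimum immediately (so the writer never lags a burst by more than one interval), while idleness only creeps the delay back up toward the 200ms default, saving wake-ups on a quiet system.

```c
#include <assert.h>

#define MIN_DELAY_MS 10     /* busy: wake up often */
#define MAX_DELAY_MS 200    /* idle: the default bgwriter_delay */

/* Compute the next sleep interval from the current one and the number
 * of buffer allocations seen during the last cycle. */
static int
next_bgwriter_delay(int current_delay_ms, int buffers_allocated)
{
    if (buffers_allocated > 0)
        return MIN_DELAY_MS;                    /* fast attack */

    /* Slow decay: lengthen the sleep by 25% per idle cycle, capped. */
    current_delay_ms += current_delay_ms / 4;
    return current_delay_ms > MAX_DELAY_MS ? MAX_DELAY_MS
                                           : current_delay_ms;
}
```

One activity spike drops an idle 200ms writer straight to 10ms; it then takes around fifteen consecutive idle cycles to drift back to 200ms, which is the asymmetry the fast attack/slow decay phrase describes.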
On 03.01.2012 17:56, Simon Riggs wrote: > On Tue, Jan 3, 2012 at 3:18 PM, Robert Haas<robertmhaas@gmail.com> wrote: > >>> 2. When a backend can't find a free buffer, it spins for a long time >>> while holding the lock. This makes the buffer strategy O(N) in its >>> worst case, which slows everything down. Notably, while this is >>> happening the bgwriter sits doing nothing at all, right at the moment >>> when it is most needed. The Clock algorithm is an approximation of an >>> LRU, so is already suboptimal as a "perfect cache". Tweaking to avoid >>> worst case behaviour makes sense. How much to tweak? Well,... >> >> I generally agree with this analysis, but I don't think the proposed >> patch is going to solve the problem. It may have some merit as a way >> of limiting the worst case behavior. For example, if every shared >> buffer has a reference count of 5, the first buffer allocation that >> misses is going to have a lot of work to do before it can actually >> come up with a victim. But I don't think it's going to provide good >> scaling in general. Even if the background writer only spins through, >> on average, ten or fifteen buffers before finding one to evict, that >> still means we're acquiring ten or fifteen spinlocks while holding >> BufFreelistLock. I don't currently have the measurements to prove >> that's too expensive, but I bet it is. > > I think its worth reducing the cost of scanning, but that has little > to do with solving the O(N) problem. I think we need both. > > I've left the way open for you to redesign freelist management in many > ways. Please take the opportunity and go for it, though we must > realise that major overhauls require significantly more testing to > prove their worth. Reducing spinlocking only sounds like a good way to > proceed for this release. > > If you don't have time in 9.2, then these two small patches are worth > having. The bgwriter locking patch needs less formal evidence to show > its worth. 
We simply don't need to have a bgwriter that just sits > waiting doing nothing. I'd like to see some benchmarks that show a benefit from these patches, before committing something like this that complicates the code. These patches are fairly small, but nevertheless. Once we have a test case, we can argue whether the benefit we're seeing is worth the extra code, or if there's some better way to achieve it. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, Jan 20, 2012 at 9:29 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > I'd like to see some benchmarks that show a benefit from these patches, > before committing something like this that complicates the code. These > patches are fairly small, but nevertheless. Once we have a test case, we can > argue whether the benefit we're seeing is worth the extra code, or if > there's some better way to achieve it. Agreed. I don't really like either of these patches stylistically: when you start sometimes locking things and sometimes not locking things, it gets hard to reason about how the code will behave, and you limit what you can do in the future inside the critical section because you have to keep in mind that you don't really have full mutual exclusion. There are places where it may be worth making that trade-off if we can show a large performance benefit, but I am wary of quick hacks that may get in the way of more complete solutions we want to implement later, especially if the performance benefit is marginal. I'm in the process of trying to pull together benchmark numbers for the various performance-related patches that are floating around this CommitFest, but a big problem is that many of them need to be tested in different ways, which are frequently not explicitly laid out in the emails in which they are submitted. Some of them are calculated to improve latency, and others throughput. Some require pgbench run with a small scale factor, some require pgbench with a large scale factor, and some need some completely other type of testing altogether. Some need short tests, others need long tests, still others require testing with very specific parameters, and some don't really need more testing so much as they need profiling. Many (like the double-write patch) need more thought about worst case scenarios, like a sequential scan on a large, unhinted table. 
Some only really need testing in one or two scenarios, but others could affect a broad variety of workloads and really ought to be tested in many different ways to fully understand the behavior. Some of them may be interdependent in complex ways. My current belief, based on the testing I did before Christmas and my gut feelings about the remaining patches, is that the patches which have the best chance of making a major difference on general workloads are your XLOG insert scaling patch and the group commit patch. Those are, in fact, virtually the only ones that have had more than zero test results posted to the list so far, and those test results are extremely promising. The various checkpoint-related patches also seem somewhat promising to me, despite the lack (so far) of any test results. Everything else strikes me as likely to make a difference that is either very small or only applicable in a fairly circumscribed set of cases. This preliminary opinion is subject to change as more evidence comes in, of course. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company