Thread: Thinking about breaking up the BufMgrLock
We've been speculating for a while about ways to reduce contention for the BufMgrLock, in hopes of fixing the "context swap storm" problem exhibited by certain patterns of concurrent queries. The traditional way to fix such problems is to divide what is protected by the lock into finer-grain lockable objects. (I'm not totally convinced that this approach can help us, because more locks means more low-level locking overhead. But let's pursue it and see where it goes; we sure don't have any other ideas at the moment.)

Neil Conway took a stab at this about a year ago:
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00173.php
but that approach had some problems IMHO, and in any case he never got the patch to the point of working. Here's a bit of fresh thinking about the problem.

To fix the test cases we have been looking at, there are two things we need to fix the performance of: ReadBuffer for the case where the requested page is already in a shared buffer, and ReleaseBuffer. The test cases essentially put multiple backends into tight loops doing these operations. These cases do not require I/O or any kernel call, so ideally they should be fast, but what we are seeing is that they suffer from excessive contention for the BufMgrLock.

ReadBuffer needs to do a lookup to map the page ID to a buffer ID, which in principle requires only a shared lock on the page-to-buffer mapping (embodied in the buf_table hash table). Assuming success, it also needs to mark the buffer pinned and update the LRU-list position of the buffer. Marking pinned is certainly a buffer-local change, so we could imagine splitting up the BufMgrLock like this:

1. A global LWLock for the page-to-buffer mapping, call it say BufMappingLock. Share lock on this is sufficient to allow reading the hash table; must get exclusive lock when reassigning any buffer's page identity.

2. A global LWLock for the LRU strategy information, say BufLRULock or BufStrategyLock.

3. Per-buffer LWLocks (or maybe just spinlocks) protecting the state of each buffer header; you need this lock to examine or change a buffer header.

ReleaseBuffer has no locking problems in this formulation: it just grabs the per-buffer lock, decrements its refcount, releases the lock.

ReadBuffer looks like:

* Acquire share lock on BufMappingLock.
* Search hash table for desired ID. (We assume success.)
* Acquire per-buffer lock.
* Increment buffer refcount.
* Release per-buffer lock.
* Release share lock on BufMappingLock.
* Update the LRU state.

(We have to bump the pin count on the target buffer before releasing the BufMappingLock, otherwise someone could reassign the buffer as soon as we release BufMappingLock.)
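Expressed as code, the fast path would look roughly like this. This is a sketch only: the helper names buf_table_lookup, hdr_lock, and lru_move_to_front are invented for illustration, not existing bufmgr interfaces.

    /*
     * Sketch of the proposed ReadBuffer fast path.  Assumes a
     * BufferDesc with a refcount field and a per-buffer spinlock
     * named hdr_lock; all helper names here are invented.
     */
    BufferDesc *
    read_buffer_hit_path(BufferTag *tag)
    {
        BufferDesc *buf;

        LWLockAcquire(BufMappingLock, LW_SHARED);
        buf = buf_table_lookup(tag);        /* assume the page is cached */

        SpinLockAcquire(&buf->hdr_lock);    /* per-buffer header lock */
        buf->refcount++;                    /* pin before dropping the
                                             * mapping lock, so the buffer
                                             * can't be reassigned */
        SpinLockRelease(&buf->hdr_lock);

        LWLockRelease(BufMappingLock);

        /* This last step is the problem: still a global exclusive
         * lock taken on every buffer hit. */
        LWLockAcquire(BufLRULock, LW_EXCLUSIVE);
        lru_move_to_front(buf);
        LWLockRelease(BufLRULock);

        return buf;
    }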
This would be pretty good from a locking point of view, except that "update the LRU state" seems to require taking an exclusive lock on a global data structure, which puts us about back where we were. Contention for a global BufLRULock might be a bit better than for the existing overall BufMgrLock, but it'll still cripple the performance of ReadBuffer.

Perhaps someone knows of a neat data structure that can maintain an LRU list with only local updates? I don't though.

The best idea I've had about it is to add flag bits to the per-buffer state that essentially indicate "pending move to head of T1" or "pending move to head of T2". We would make ReadBuffer set the appropriate bit when it grabs a buffer, but not actually update the global LRU state. Operations such as the periodic bgwriter scan for work to do, or an attempt to locate a reusable buffer, would scan the LRU list from tail to front. When they hit a buffer that's got one of these bits set, they would not process it immediately, but would move it to the list head and clear the bit. Doing this would require write lock on the LRU list, as well as per-buffer locks on each buffer as it's examined or changed, but *not* any lock on BufMappingLock. So these operations wouldn't interfere too badly with the success path in ReadBuffer.

This would convert the existing "strict LRU" behavior into an "approximate LRU". I'm worried that the change might be penny-wise and pound-foolish: if a poorer cache management algorithm causes us to have to do more I/O, it's probably a net loss to save some lock contention. But without a smart idea about data structures I don't see how to do better.

Thoughts?

			regards, tom lane
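A sketch of that reclaim-side scan, using the same invented helper names as the fast-path sketch above; the point to notice is that BufMappingLock is never taken here.

    /*
     * Sketch of the tail-to-front reclaim scan described above.
     * The list primitives and the pending_move_to_front flag are
     * invented names standing in for the proposed per-buffer bits.
     */
    BufferDesc *
    find_reusable_buffer(void)
    {
        BufferDesc *buf;

        LWLockAcquire(BufLRULock, LW_EXCLUSIVE);
        buf = lru_tail();

        while (buf != NULL)
        {
            BufferDesc *prev = lru_prev(buf);   /* capture before any move */

            SpinLockAcquire(&buf->hdr_lock);
            if (buf->pending_move_to_front)
            {
                /* Honor the deferred promotion: the buffer gets a
                 * second trip through the list before eviction. */
                buf->pending_move_to_front = false;
                SpinLockRelease(&buf->hdr_lock);
                lru_move_to_front(buf);
            }
            else if (buf->refcount == 0)
            {
                /* Candidate victim.  The caller must still take
                 * BufMappingLock exclusively (and re-check) before
                 * reassigning the buffer's page identity. */
                SpinLockRelease(&buf->hdr_lock);
                LWLockRelease(BufLRULock);
                return buf;
            }
            else
                SpinLockRelease(&buf->hdr_lock);

            buf = prev;
        }

        LWLockRelease(BufLRULock);
        return NULL;        /* nothing evictable right now */
    }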
On Sun, 2005-02-06 at 19:30 -0500, Tom Lane wrote:
> This would be pretty good from a locking point of view, except that
> "update the LRU state" seems to require taking an exclusive lock on a
> global data structure, which puts us about back where we were.

We're only concerned with a buffer's access recency when it is on the free list, right? (That is certainly true with naive LRU; I confess I haven't had a chance to grok the 2Q paper yet.) If so:

- when ReadBuffer() is called on an in-cache but previously-free buffer, increment the buffer's ref count; this logically removes it from the LRU, but does not actually do the physical list removal.

- when we need to fault in a page from disk, acquire the LRU lock and select a buffer for replacement. At this point we can do some physical cleanup of the free list, by removing buffers with a non-zero refcount from the free list. We can do this cleanup relatively cheaply now, because we had to acquire the LRU lock anyway, and this is the slow path of ReadBuffer(). This work might also be done by the bgwriter (removing clean but pinned buffers from close to the LRU end of the free list).

- we need only update a pinned buffer's LRU position when its shared refcount drops to zero, not on every ReleaseBuffer().

This means we need to acquire the LRU lock once for each acquire/release pair on a previously-unpinned buffer (rather than twice, if we had to acquire the LRU lock on acquire as well). Not sure if that's enough to remove the contention problem; on the plus side, it doesn't change the actual behavior of the replacement policy.

> Perhaps someone knows of a neat data structure that can maintain an LRU
> list with only local updates? I don't though.

What operations does 2Q require on the shared lists? (Assuming that's the replacement policy we're going with.) Depending on how complex the list modifications are, non-blocking algorithms might be worth considering. For example, removing a node from the middle of a linked list can be done via atomic CAS; you need to redo the CAS in the presence of a concurrent modification of the particular node you're trying to modify, but that means we are contending over individual list nodes, not the list as a whole.

-Neil
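A minimal sketch of the release side under this scheme, assuming (as in the earlier sketches) invented helper names, and that the head of the list is the MRU end:

    /*
     * Sketch of ReleaseBuffer under the "update LRU only on last
     * unpin" idea above.  All names are illustrative assumptions.
     */
    void
    release_buffer_sketch(BufferDesc *buf)
    {
        bool last_unpin;

        SpinLockAcquire(&buf->hdr_lock);
        last_unpin = (--buf->refcount == 0);
        SpinLockRelease(&buf->hdr_lock);

        if (last_unpin)
        {
            /* Only the final unpin pays for the global lock.  A
             * buffer that gets re-pinned before the replacement scan
             * reaches it is simply skipped (and later pruned) there,
             * since pinned buffers were never physically unlinked. */
            LWLockAcquire(BufLRULock, LW_EXCLUSIVE);
            lru_move_to_front(buf);
            LWLockRelease(BufLRULock);
        }
    }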
Neil Conway <neilc@samurai.com> writes:
> We're only concerned with a buffer's access recency when it is on the
> free list, right?

Right; pinned buffers are not logically part of the freelist. (Whether they are so physically is open to choice.) Another free variable, AFAICS, is whether to do the move-to-front during ReadBuffer or during ReleaseBuffer. You could possibly argue that the latter would be slightly more accurate, but it probably doesn't matter much.

> This means we need to acquire the LRU lock once for each acquire/release
> pair on a previously-unpinned buffer (rather than twice, if we had to
> acquire the LRU lock on acquire as well).

I think this is true of any scheme that separates the lookup table from LRU management: you hit the LRU overhead during acquire, or during release, but not both. I suspect that we need more than a 2X reduction in contention to fix the problem, though.

> What operations does 2Q require on the shared lists?

Remove from list, insert at head of list (often but not always paired as a move-to-front operation). I think we also have an insert-at-tail case for VACUUM. One difficulty is that we need to maintain a list length counter, which is a central point of contention except for the move-to-front-of-same-list case.

> For example, removing a node from the middle of a linked
> list can be done via atomic CAS; you need to redo the CAS in the
> presence of a concurrent modification of the particular node you're
> trying to modify, but that means we are contending over individual list
> nodes, not the list as a whole.

Unfortunately, most of the time you need to touch the list head, so I'm unclear that this buys us much, at least if we try to preserve true "move to front" behavior. It might be interesting to consider "move ahead one place" instead of "move to front" --- that would avoid the pressure on the list head. Again you have to worry whether you're blowing off too much performance in the cache management algorithm...

			regards, tom lane
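To make "move ahead one place" concrete, here is a self-contained sketch on an invented doubly-linked node type. Promoting a node swaps it with its predecessor, touching only four neighboring links and never the shared list head (except in the rare case where the node reaches the front):

    #include <stddef.h>

    /* Invented freelist node, not the real bufmgr structures. */
    typedef struct LRUNode
    {
        struct LRUNode *prev;   /* toward the head (MRU end) */
        struct LRUNode *next;   /* toward the tail (LRU end) */
    } LRUNode;

    static void
    lru_promote_one(LRUNode *n)
    {
        LRUNode *p = n->prev;

        if (p == NULL)
            return;             /* already at the head */

        /* Unlink n and re-link it in front of its old predecessor.
         * Only p->prev, p, n, and n->next are touched, so backends
         * promoting buffers in different parts of the list need not
         * contend (given suitable per-node locking).  If p was the
         * list head, the caller must also update the head pointer;
         * that is exactly the rare case this scheme avoids. */
        p->next = n->next;
        if (n->next)
            n->next->prev = p;

        n->prev = p->prev;
        n->next = p;
        if (p->prev)
            p->prev->next = n;
        p->prev = n;
    }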
Tom Lane <tgl@sss.pgh.pa.us> writes:
> ReadBuffer needs to do a lookup to map the page ID to a buffer ID,
> which in principle requires only a shared lock on the page-to-buffer
> mapping (embodied in the buf_table hash table). Assuming success, it
> also needs to mark the buffer pinned and update the LRU-list position
> of the buffer. Marking pinned is certainly a buffer-local change,
> so we could imagine splitting up the BufMgrLock like this:

So the only reason you need the global lock is for the LRU updates? This is a well understood problem. I remember it from my Systems class in school. And searching on Google finds lecture notes that match my memory that there are other schemes generally preferred to LRU precisely because they don't require extensive list management in the critical path. For example, these slides:

http://www.cs.utexas.edu/users/lorenzo/corsi/cs372/03F/notes/11-20.pdf

They describe a few that could be relevant. I recall the clock sweep having been recommended in class as having most of the best properties of LRU with very low cost in the critical path. To use a buffer you only have to set a single local flag indicating the buffer has been used. A separate step advances circularly one buffer at a time, clearing the flags. If it finds any buffer that hasn't been used for a complete cycle through the list then it can remove it, either entirely or merely to a second list.

-- 
greg
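A minimal, self-contained sketch of such a clock sweep (invented names; not code from any actual patch):

    /*
     * Clock-sweep sketch over a fixed buffer array.  Using a buffer
     * touches only that buffer's own flag: no global lock, no list
     * surgery.  Races on the flag are benign; the worst case is one
     * extra trip of the hand.
     */
    #define NBUFFERS 1024

    typedef struct
    {
        int refcount;   /* pins; per-buffer lock in a real system */
        int used;       /* "referenced since the hand last passed" */
    } ClockBuf;

    static ClockBuf buffers[NBUFFERS];
    static int      clock_hand = 0;   /* in reality, advanced under
                                       * the sweep's own lock */

    /* Critical path: mark a buffer recently used. */
    static void
    mark_used(int buf_id)
    {
        buffers[buf_id].used = 1;
    }

    /* Slow path: a buffer gets a "second chance" -- the first pass
     * clears its flag, and it is evicted only if the hand comes all
     * the way around again with nobody having set the flag. */
    static int
    clock_sweep(void)
    {
        for (;;)
        {
            ClockBuf *buf = &buffers[clock_hand];
            int       victim = clock_hand;

            clock_hand = (clock_hand + 1) % NBUFFERS;

            if (buf->refcount > 0)
                continue;           /* pinned; skip */
            if (buf->used)
                buf->used = 0;      /* give it one more cycle */
            else
                return victim;      /* unused for a full cycle */
        }
    }

The shared state contended over shrinks to a single hand position, while the common case (a buffer hit) writes nothing but the buffer's own flag.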
Greg Stark <gsstark@mit.edu> writes:
> I recall the clock sweep having
> been recommended in class as having most of the best properties of LRU with
> very low cost in the critical path.

Right. The "pending move to front" idea that I suggested is basically a variant of a clock algorithm: it takes two trips through the LRU list before a page falls off and becomes eligible for replacement. (It's even closer to the "second chance" clock algorithm.)

The $64 question with any of these things is: how much performance at the cache-management level are we willing to give up in order to gain low-level efficiency? We probably don't want to go very far in that direction. But maybe a clock scan will be a good compromise.

			regards, tom lane
On Sun, Feb 06, 2005 at 10:53:38PM -0500, Greg Stark wrote:
> This is a well understood problem. I remember it from my Systems class in
> school. And searching on Google finds lecture notes that match my memory that
> there are other schemes generally preferred to LRU precisely because they
> don't require extensive list management in the critical path.
>
> For example, these slides:
>
> http://www.cs.utexas.edu/users/lorenzo/corsi/cs372/03F/notes/11-20.pdf

It might be worth taking a gander at the appropriate sections of either The Design and Implementation of the 4.4BSD Operating System (http://lnk.nu/amazon.com/1ds) or the 4.3 version (http://lnk.nu/amazon.com/1dt). I've read the 4.3 one, and there's a lot of discussion of the buffer management algorithms chosen, and why they were chosen.

-- 
Jim C. Nasby, Database Consultant		decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
> Neil Conway writes
> We're only concerned with a buffer's access recency when it is on the
> free list, right? (That is certainly true with naive LRU; I confess I
> haven't had a chance to grok the 2Q paper yet.) If so:
>
> - we need only update a pinned buffer's LRU position when its shared
> refcount drops to zero, not on every ReleaseBuffer()
>
> This means we need to acquire the LRU lock once for each acquire/release
> pair on a previously-unpinned buffer (rather than twice, if we had to
> acquire the LRU lock on acquire as well). Not sure if that's enough to
> remove the contention problem; on the plus side, it doesn't change the
> actual behavior of the replacement policy.

This is a fantastic idea, well done.

> - when we need to fault in a page from disk, acquire the LRU lock and
> select a buffer for replacement. At this point we can do some physical
> cleanup of the free list, by removing buffers with a non-zero refcount
> from the free list. We can do this cleanup relatively cheaply now,
> because we had to acquire the LRU lock anyway, and this is the slow path
> of ReadBuffer(). This work might also be done by the bgwriter (removing
> clean but pinned buffers from close to the LRU end of the free list).

I'm not sure I understand this. Removing clean buffers doesn't cost much; it's the removal of dirty ones that costs, surely? There is no work to remove a clean buffer.

Best Regards,

Simon Riggs
> Tom Lane
> But without a smart idea about data structures I don't see how to do
> better.
>
> Thoughts?

Hmm, seems like you summed up the lack of algorithmic choices very well. After much thought, I believe the best way is to implement bufferpools (BPs). That is, we don't just have one bufferhash and one LRU, we have many pairs. We then work out some mapping by which a block can be known to be in a particular BP, then acquire the lock for that BP.

Using BPs means that each block is only ever associated with one BP. Hence, there is still only a single BPBufMgrLock to Acquire/Release, but there could be considerably less contention for a specific lock.

BPs can be implemented manually or automatically. Commercial systems provide facilities for manual placement of specific tables into specific bufferpools. That is more work, so I would favour automatically assigned BPs, though leaving the door open to further extension. It seems simple to use one or more of the numbers associated with a block as input to a hash algorithm to identify the BP in which to locate the block. After some thinking, relationId seems to be the best candidate. That would mean an index or sequential scan would be more likely to re-acquire the same lock.

There is some possibility that the multiple BP LRUs could become unbalanced, so very short LRUs would not benefit as much from splitting up in this way. So, I suggest linking the number of BPs to the number of shared buffers, say 1 BP per 10,000 shared_buffers, up to a maximum of 8. I believe a self-balancing algorithm between the multiple lists would be a fairly straightforward addition also.

ISTM that there is no chance of deadlock in using BPs. The current code assumes that all blocks are accessed via a single BufMgrLock. There is never an attempt to acquire BufMgrLock when it is already held, so no chance therefore of deadlock.

Overall, this idea would reduce BufMgrLock contention by as many multiples as you choose, as well as allowing more practical use of larger shared_buffers settings, which would further drive up performance. The costs need not be incurred at all for smaller shared_buffers, and the current situation (num_bufferpools = 1) can remain the default. Also, the current code already allows for multiple lists, so extending the number from 4 (or 3) to multiples seems more straightforward.

One other consideration is the eviction of dirty buffers, but more detail on that in another post. What should be observed there is that dirty buffer evictions perform Acquire/Release of the BufMgrLock at least twice. Improving the bgwriter will also have an effect on BufMgrLock contention.

Best Regards,

Simon Riggs
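A sketch of that automatic assignment (all names invented): hash the relation's ID to pick a pool, where each pool carries its own lock, hash table, and LRU:

    /* Sketch of partitioned bufferpools; types and names invented. */
    #define N_BUFFERPOOLS 8

    typedef struct
    {
        void *lock;        /* per-pool BPBufMgrLock stand-in */
        void *buf_hash;    /* per-pool page-to-buffer hash table */
        void *lru_head;    /* per-pool LRU list */
    } BufferPool;

    static BufferPool pools[N_BUFFERPOOLS];

    /* Map a block to its pool by hashing the relation id, so all
     * blocks of one relation (one index or seqscan) hit one pool. */
    static BufferPool *
    pool_for_block(unsigned relation_id)
    {
        /* Knuth-style multiplicative hash; any cheap hash would do. */
        unsigned h = relation_id * 2654435761u;

        return &pools[h % N_BUFFERPOOLS];
    }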
"Simon Riggs" <simon@2ndquadrant.com> writes: > After much thought, I believe the best way is to implement bufferpools > (BPs). That is, we don't just have one bufferhash and one LRU, we have > many pairs. We then work out some mapping by which a block can be known > to be in a particular BP, then acquire the lock for that BP. I think this is basically wrongheaded, because it achieves its reduction in contention by a one-for-one sacrifice of cache allocation efficiency; that is, any individual page is now fighting for survival in a pool only 1/Nth as large as before. We're going to lose more in I/O than we can hope to save at the processor level. I think a clock algorithm or something similar may be a reasonable way to go, though. regards, tom lane
> What operations does 2Q require on the shared lists? (Assuming that's
> the replacement policy we're going with.) Depending on how complex the
> list modifications are, non-blocking algorithms might be worth
> considering. For example, removing a node from the middle of a linked
> list can be done via atomic CAS; you need to redo the CAS in the
> presence of a concurrent modification of the particular node you're
> trying to modify, but that means we are contending over individual list
> nodes, not the list as a whole.

If you plan to use CAS for a lock-free algorithm, there was a thread about concurrent lock-free algorithms a few days ago:

http://archives.postgresql.org/pgsql-hackers/2005-01/msg00736.php

There I gave some references to a new paper I found on wait-free algorithms. I think we could adapt the wait-free GC described here to the cache list:

http://www.cs.chalmers.se/~phs/TechnicalReports/Sun04_WaitFreeRef.pdf

More generally, I think a deep study of these recent works on wait-free algorithms would be a big win.

Regards,
Jean-Gérard Pailloncy
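To illustrate the CAS retry pattern under discussion, here is a deliberately minimal sketch of a lock-free push onto a shared list head, using a GCC-style atomic builtin for brevity. Note that safe *removal* from such a list is much harder (the ABA problem), which is precisely what the cited wait-free papers address:

    /* Minimal CAS retry loop: lock-free push onto a shared list
     * head.  Illustrative only; popping/removing from such a list
     * has ABA hazards needing the techniques in the papers above. */
    typedef struct Node
    {
        struct Node *next;
    } Node;

    static Node *volatile list_head;

    static void
    lockfree_push(Node *node)
    {
        Node *old;

        do
        {
            old = list_head;
            node->next = old;
            /* Redo the CAS if another backend moved the head. */
        } while (!__sync_bool_compare_and_swap(&list_head, old, node));
    }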
> Tom Lane
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > After much thought, I believe the best way is to implement bufferpools
> > (BPs). That is, we don't just have one bufferhash and one LRU, we have
> > many pairs. We then work out some mapping by which a block can be known
> > to be in a particular BP, then acquire the lock for that BP.
>
> I think this is basically wrongheaded, because it achieves its reduction
> in contention by a one-for-one sacrifice of cache allocation efficiency;
> that is, any individual page is now fighting for survival in a pool only
> 1/Nth as large as before. We're going to lose more in I/O than we can
> hope to save at the processor level.

Well, as you might expect, I disagree. I also expected that reaction: ISTM that when all the good ideas are used up - as you carefully explained was the case - it must be a more offbeat idea that is the next good one (to paraphrase Sherlock Holmes).

I do agree that the goal is to reduce contention without increasing I/O. With BPs, yes, an individual block is fighting for survival in a smaller pool, but the number of blocks that might want to go into that pool is also reduced by the same ratio. With a small shared_buffers, shorter lists might be a problem, but sizing it large enough would take away some of those issues also. I think a straightforward list balancing algorithm would reduce any imbalance across lists.

BPs would not give much benefit on a single CPU - my goal here is to increase SMP performance. Overall, performance results must be the final arbiter of what is best.

Best Regards,

Simon Riggs
On Sun, Feb 06, 2005 at 07:30:37PM -0500, Tom Lane wrote:
> ReadBuffer needs to do a lookup to map the page ID to a buffer ID,
> which in principle requires only a shared lock on the page-to-buffer
> mapping (embodied in the buf_table hash table). Assuming success, it
> also needs to mark the buffer pinned and update the LRU-list position
> of the buffer. Marking pinned is certainly a buffer-local change,
> so we could imagine splitting up the BufMgrLock like this:
>
> 1. A global LWLock for the page-to-buffer mapping, call it say
> BufMappingLock. Share lock on this is sufficient to allow reading the
> hash table; must get exclusive lock when reassigning any buffer's page
> identity.
>
> 2. A global LWLock for the LRU strategy information, say BufLRULock
> or BufStrategyLock.
>
> 3. Per-buffer LWLocks (or maybe just spinlocks) protecting the state of
> each buffer header; you need this lock to examine or change a buffer
> header.
>
> ReleaseBuffer has no locking problems in this formulation: it just grabs
> the per-buffer lock, decrements its refcount, releases the lock.

For the per-buffer case, a latch would provide a lightweight method of updating the contents of the buffer header without hampering read-only access. A latch is comprised of a latch bit and a sequence number that can be set in an atomic action. The flow for the two cases is simple:

Write:
1. Get latch.
2. Update the buffer.
3. Increment the sequence number.
4. Release the latch.

Read:
1. Read the sequence number.
2. Read the buffer.
3. Check the latch. If latched, go to 1.
4. If the sequence number has changed, go to 1.

By using this process, readers will only ever see a consistent state of the buffer. Also, since a read does not entail a write operation, it will not cause a cache line update and contribute to a cache-update "storm". The "get latch" operation can be implemented using an atomic operation such as TAS (test-and-set) or CAS (compare-and-swap). This would give readers extremely lightweight access to the buffer - no cache line update hit. If you need sequenced access to the buffer, then you would need to use LWLocks, but in many cases, such as 3. in Tom's list, a latch would work well.
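That read/write protocol is essentially a seqlock. A simplified, self-contained sketch follows (invented names; a production version would need explicit memory barriers around the payload accesses, which are omitted here):

    /* Seqlock-style latch sketch: the latch bit is folded into the
     * low bit of the sequence counter (odd = latched, even = stable).
     * Illustrative only; real code needs memory barriers. */
    typedef struct
    {
        volatile unsigned seq;  /* low bit doubles as the latch bit */
        int payload;            /* stand-in for buffer header fields */
    } Latched;

    static void
    latch_write(Latched *l, int new_payload)
    {
        unsigned s;

        /* Get latch: CAS the counter from even to odd; spin while
         * another writer holds it. */
        do
        {
            s = l->seq & ~1u;
        } while (!__sync_bool_compare_and_swap(&l->seq, s, s | 1));

        l->payload = new_payload;       /* update the buffer */
        l->seq = s + 2;                 /* bump sequence, drop latch */
    }

    static int
    latch_read(const Latched *l)
    {
        unsigned s;
        int      v;

        do
        {
            s = l->seq;                 /* 1. read sequence number */
            v = l->payload;             /* 2. read the buffer      */
            /* 3+4. retry if latched (odd) or changed underneath us */
        } while ((s & 1) || l->seq != s);

        return v;                       /* readers write nothing shared,
                                         * so no cache line ping-pong */
    }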
> ReadBuffer looks like:
>
> * Acquire share lock on BufMappingLock.
> * Search hash table for desired ID. (We assume success.)
> * Acquire per-buffer lock.
> * Increment buffer refcount.
> * Release per-buffer lock.
> * Release share lock on BufMappingLock.
> * Update the LRU state.
>
> (We have to bump the pin count on the target buffer before releasing the
> BufMappingLock, otherwise someone could reassign the buffer as soon as
> we release BufMappingLock.)
>
> This would be pretty good from a locking point of view, except that
> "update the LRU state" seems to require taking an exclusive lock on a
> global data structure, which puts us about back where we were.
> Contention for a global BufLRULock might be a bit better than for the
> existing overall BufMgrLock, but it'll still cripple the performance
> of ReadBuffer.
>
> Perhaps someone knows of a neat data structure that can maintain an LRU
> list with only local updates? I don't though.

The clock algorithm is pretty close to this: it provides an approximation to LRU that eliminates the need to move buffers to the MRU position, by using a reference bit instead.

> This would convert the existing "strict LRU" behavior into an
> "approximate LRU". I'm worried that the change might be penny-wise and
> pound-foolish: if a poorer cache management algorithm causes us to have
> to do more I/O, it's probably a net loss to save some lock contention.
> But without a smart idea about data structures I don't see how to do
> better.

One alternative to an approximate LRU, such as the clock algorithm, would be to have multiple buffer pools as we discussed in the previous thread. The contention would be reduced to 1/N, where N is the number of pools. Of course, buffers should be allocated in a fashion that would maximize locality and minimize the effect of scan cache pollution.

More food for thought.

Ken