Thread: Re: [PATCHES] update i386 spinlock for hyperthreading
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Manfred Spraul wrote:
>> I remember there were patches that tried other algorithms instead of the
>> simple LRU for the buffer manager. Has anyone tried to change the
>> locking of the buffer manager?

> CVS HEAD already has an Adaptive Replacement Cache (ARC) for buffer
> replacement.

That's irrelevant to the problem, though. Unless the ARC code uses data
structures that are more amenable to localized locking than the old
global buffer freelist. (Jan?)

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Manfred Spraul wrote:
>>> I remember there were patches that tried other algorithms instead of the
>>> simple LRU for the buffer manager. Has anyone tried to change the
>>> locking of the buffer manager?
>
>> CVS HEAD already has an Adaptive Replacement Cache (ARC) for buffer
>> replacement.
>
> That's irrelevant to the problem, though. Unless the ARC code uses data
> structures that are more amenable to localized locking than the old
> global buffer freelist. (Jan?)

Not that I know of. The new strategy uses one shared hash table like the
old, and one buffer pool as well. It grabs the same old BufMgr lock
during the lookup+replacement decision process, gives it up during
eventual IO, grabs it again when done with the IO.

As a matter of fact, the strategy itself does no locking at all. Like the
old LRU code it simply assumes that the buffer manager holds the lock
during calls.

Jan
Jan Wieck <JanWieck@Yahoo.com> writes:
>> That's irrelevant to the problem, though. Unless the ARC code uses data
>> structures that are more amenable to localized locking than the old
>> global buffer freelist. (Jan?)

> the strategy itself does no locking at all. Like the old LRU code it
> simply assumes that the buffer manager holds the lock during calls.

Okay, I suspected as much but wasn't sure.

Manfred's numbers definitely say that we need to find a way to break
down the BufMgrLock into multiple finer-grain locks. We already have
all those per-buffer LWLocks, but I don't see how to apply those to
the problem of managing the global lookup and replacement datastructures.

Anyone see an attack path here?

			regards, tom lane
Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>>> That's irrelevant to the problem, though. Unless the ARC code uses data
>>> structures that are more amenable to localized locking than the old
>>> global buffer freelist. (Jan?)
>
>> the strategy itself does no locking at all. Like the old LRU code it
>> simply assumes that the buffer manager holds the lock during calls.
>
> Okay, I suspected as much but wasn't sure.
>
> Manfred's numbers definitely say that we need to find a way to break
> down the BufMgrLock into multiple finer-grain locks. We already have
> all those per-buffer LWLocks, but I don't see how to apply those to
> the problem of managing the global lookup and replacement datastructures.
>
> Anyone see an attack path here?

Should we have one lock per hash bucket rather than one for the entire
hash?
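As a minimal illustration of the per-bucket locking idea, here is a rough
sketch using plain pthreads rather than PostgreSQL's LWLocks; all names
(BufTag, buf_hash, NUM_BUCKETS) are hypothetical and not taken from the
actual bufmgr code:

/*
 * Hypothetical sketch of one-lock-per-hash-bucket lookup, as suggested
 * above.  Plain pthreads instead of PostgreSQL LWLocks; names and
 * structures are illustrative only.
 */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BUCKETS 1024

typedef struct BufTag { uint32_t rel; uint32_t block; } BufTag;

typedef struct HashEntry {
    BufTag            tag;
    int               buf_id;
    struct HashEntry *next;
} HashEntry;

static HashEntry       *buckets[NUM_BUCKETS];
static pthread_mutex_t  bucket_lock[NUM_BUCKETS];

void buftable_init(void)
{
    for (int i = 0; i < NUM_BUCKETS; i++)
        pthread_mutex_init(&bucket_lock[i], NULL);
}

static uint32_t buf_hash(const BufTag *tag)
{
    return (tag->rel * 2654435761u ^ tag->block) % NUM_BUCKETS;
}

/* Look up a buffer id while holding only the lock of its bucket. */
int buftable_lookup(const BufTag *tag)
{
    uint32_t h = buf_hash(tag);
    int      result = -1;

    pthread_mutex_lock(&bucket_lock[h]);
    for (HashEntry *e = buckets[h]; e != NULL; e = e->next)
    {
        if (e->tag.rel == tag->rel && e->tag.block == tag->block)
        {
            result = e->buf_id;
            break;
        }
    }
    pthread_mutex_unlock(&bucket_lock[h]);
    return result;
}

Two backends probing different buckets never contend, which covers the
lookup side; as the follow-ups below point out, the replacement
bookkeeping is the harder part.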
Bruce Momjian wrote:
>> Anyone see an attack path here?
>
> Should we have one lock per hash bucket rather than one for the entire
> hash?

That's the simple part. The problem is the aging strategy: we need a
strategy that doesn't rely on a global list that's updated after every
lookup. If I understand the ARC code correctly, there is a
STRAT_MRU_INSERT(cdb, STRAT_LIST_T2) that happens on every lookup.

--
    Manfred
Manfred Spraul wrote: > Bruce Momjian wrote: > >>>Anyone see an attack path here? >>> >>> >> >>Should we have one lock per hash bucket rather than one for the entire >>hash? >> >> > That's the simple part. The problem is the aging strategy: we need a > strategy that doesn't rely on a global list that's updated after every > lookup. If I understand the ARC code correctly, there is a > STRAT_MRU_INSERT(cdb, STRAT_LIST_T2) that happen in every lookup. Moving the Cache Directory Block (cdb) on a hit to the MRU position of the appropriate queue "is the bookkeeping" of this strategy. The whole algorithm is based on it, and I don't see yet how to avoid that without opening a huge can of worms that look like deadlocks. But I'll think about it for a while. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Jan Wieck wrote:
> Manfred Spraul wrote:
>> That's the simple part. The problem is the aging strategy: we need a
>> strategy that doesn't rely on a global list that's updated after every
>> lookup. If I understand the ARC code correctly, there is a
>> STRAT_MRU_INSERT(cdb, STRAT_LIST_T2) that happens on every lookup.
>
> Moving the Cache Directory Block (cdb) on a hit to the MRU position of
> the appropriate queue "is the bookkeeping" of this strategy. The whole
> algorithm is based on it, and I don't see yet how to avoid that without
> opening a huge can of worms that look like deadlocks. But I'll think
> about it for a while.

If we can't eliminate the global lock, can we at least reduce its
duration?
Jan Wieck wrote:
> Moving the Cache Directory Block (cdb) on a hit to the MRU position of
> the appropriate queue "is the bookkeeping" of this strategy. The whole
> algorithm is based on it, and I don't see yet how to avoid that without
> opening a huge can of worms that look like deadlocks. But I'll think
> about it for a while.

I feared that. Are there strategies that do not rely on a global lock?

The Linux kernel uses a lazy LRU with referenced bits: on access, the
referenced bit is set. The freespace logic takes pages from the end of a
linked list, and checks that bit: if it's set, then the page is moved
back to the top of the list. Otherwise it's a candidate for replacement.
Pages start at the head of that pseudo-LRU list, with the reference bit
clear: that way a page that is accessed only once has a lower priority
than a frequently accessed page. At least that's how I understand the
algorithm.

--
    Manfred
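A minimal sketch of the referenced-bit ("second chance") scheme Manfred
describes, with hypothetical names and kept single-threaded for brevity;
the key property is that an ordinary access only sets a flag and never
touches the list, so only the reclaim path needs the list lock:

/*
 * Sketch of a lazy LRU with referenced bits.  Hypothetical names; a
 * real version would make 'referenced' an atomic flag and lock the
 * list in reclaim().
 */
#include <stdbool.h>
#include <stddef.h>

typedef struct Page {
    struct Page *prev, *next;
    bool         referenced;
} Page;

typedef struct PageList { Page *head, *tail; } PageList;

/* An ordinary access: set the bit, never touch the list. */
void page_touch(Page *p)
{
    p->referenced = true;
}

static void list_remove(PageList *l, Page *p)
{
    if (p->prev) p->prev->next = p->next; else l->head = p->next;
    if (p->next) p->next->prev = p->prev; else l->tail = p->prev;
    p->prev = p->next = NULL;
}

static void list_push_head(PageList *l, Page *p)
{
    p->prev = NULL;
    p->next = l->head;
    if (l->head) l->head->prev = p; else l->tail = p;
    l->head = p;
}

/* New pages enter at the head with the bit clear, so a page touched
 * only once ranks below one that gets touched again later. */
void page_insert_new(PageList *l, Page *p)
{
    p->referenced = false;
    list_push_head(l, p);
}

/* The reclaim path: take pages from the tail; a set bit buys the page
 * one more trip through the list instead of eviction. */
Page *reclaim(PageList *l)
{
    Page *p;

    while ((p = l->tail) != NULL)
    {
        list_remove(l, p);
        if (p->referenced)
        {
            p->referenced = false;   /* second chance */
            list_push_head(l, p);
        }
        else
            return p;                /* candidate for replacement */
    }
    return NULL;
}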
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> If we can't eliminate the global lock, can we at least reduce its
> duration?

It'd be a big win even if we could just arrange that ReadBuffer and
ReleaseBuffer don't *both* grab the same global lock.

Can we logically separate the buffer lookup hashtable from the freelist
maintenance code? I am wondering about having two locks, one for each of
those areas. In principle I think a fast-path ReadBuffer (one that
doesn't need to do I/O because the desired page is in a buffer already)
would need to get only a share lock on the lookup table, and never lock
the freelist structure at all. ReleaseBuffer would need a write lock on
the freelist, but need not touch the lookup table. Only in the case
where ReadBuffer has to do I/O, and therefore needs to acquire a free
buffer to read the page into, do you need locks on both structures ---
and with sufficient care, you might not need to lock them both at the
same time, even in that path.

To make this work, we would have to recognize that the "freelist" might
be out of date --- that is, a page that had been in the freelist because
it has zero pins would be left in the freelist by ReadBuffer, and anyone
trying to remove it from the freelist for reuse would have to notice
that it has positive pin count and realize that it's not really free.
But that should be pretty simple.

Beyond that we could think about locking just parts of each of these
structures (for instance, doesn't ARC really use multiple freelists?)
but I think we ought to focus first on that basic division.

			regards, tom lane
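The following is a sketch of the two-lock split Tom outlines, assuming a
pthread rwlock for the lookup table and a mutex for the freelist; the
names, the atomic pin counter, and the on_freelist flag are illustrative
assumptions, not the actual bufmgr design:

/*
 * Sketch of the two-lock split described above: a shared-lockable lock
 * for the lookup table and a separate lock for the freelist, with the
 * freelist allowed to go stale.  Hypothetical names and structures.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Buffer {
    int            id;
    atomic_int     pin_count;
    bool           on_freelist;
    struct Buffer *free_next;
} Buffer;

static pthread_rwlock_t lookup_lock   = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t  freelist_lock = PTHREAD_MUTEX_INITIALIZER;
static Buffer          *freelist_head;

/* Hypothetical helper: hashtable probe; caller holds lookup_lock. */
extern Buffer *hash_lookup_page(int rel, int block);

/*
 * Fast path of ReadBuffer: page already resident.  Only a shared lock
 * on the lookup table; the freelist is never touched, even if the
 * buffer is still (stalely) listed there.
 */
Buffer *read_buffer_fastpath(int rel, int block)
{
    Buffer *buf;

    pthread_rwlock_rdlock(&lookup_lock);
    buf = hash_lookup_page(rel, block);
    if (buf)
        atomic_fetch_add(&buf->pin_count, 1);
    pthread_rwlock_unlock(&lookup_lock);

    return buf;     /* NULL: caller falls back to the I/O slow path */
}

/* ReleaseBuffer: touches only the freelist, never the lookup table. */
void release_buffer(Buffer *buf)
{
    if (atomic_fetch_sub(&buf->pin_count, 1) == 1)
    {
        pthread_mutex_lock(&freelist_lock);
        if (!buf->on_freelist)
        {
            buf->free_next = freelist_head;
            freelist_head = buf;
            buf->on_freelist = true;
        }
        pthread_mutex_unlock(&freelist_lock);
    }
}

/*
 * Slow-path helper: get a buffer to read a new page into.  Because the
 * fast path never removes re-pinned buffers from the freelist, entries
 * may be stale; simply skip anything that is pinned again.
 */
Buffer *get_free_buffer(void)
{
    Buffer *victim = NULL;

    pthread_mutex_lock(&freelist_lock);
    while (freelist_head != NULL)
    {
        Buffer *buf = freelist_head;

        freelist_head = buf->free_next;
        buf->on_freelist = false;
        if (atomic_load(&buf->pin_count) == 0)
        {
            victim = buf;       /* genuinely free */
            break;
        }
        /* stale entry: pinned again since it was listed, keep looking */
    }
    pthread_mutex_unlock(&freelist_lock);

    return victim;
}

Note that the remaining coordination is hidden in get_free_buffer():
before the victim can be reused it must also be dropped from the lookup
table, which is the one place where both structures are involved.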
Manfred Spraul <manfred@colorfullife.com> writes:
> Are there strategies that do not rely on a global lock? The Linux kernel
> uses a lazy LRU with referenced bits: on access, the referenced bit is
> set. The freespace logic takes pages from the end of a linked list, and
> checks that bit: if it's set, then the page is moved back to the top of
> the list. Otherwise it's a candidate for replacement.

I think this is the same idea as what I was just suggesting: add an
extra check when looking for a free page, and thereby avoid having to
lock/update the global datastructure during ReadBuffer.

			regards, tom lane
Tom Lane wrote:
> Manfred Spraul <manfred@colorfullife.com> writes:
>> Are there strategies that do not rely on a global lock? The Linux kernel
>> uses a lazy LRU with referenced bits: on access, the referenced bit is
>> set. The freespace logic takes pages from the end of a linked list, and
>> checks that bit: if it's set, then the page is moved back to the top of
>> the list. Otherwise it's a candidate for replacement.
>
> I think this is the same idea as what I was just suggesting: add an
> extra check when looking for a free page, and thereby avoid having to
> lock/update the global datastructure during ReadBuffer.

Hmmmm ... speaking without having given this in-depth thought:

If there were an easy way to determine the approximate position of a CDB
inside the queue, one could just forget about moving it to the MRU
position if it's in the upper third or so anyway.

The theory of the whole ARC strategy (which is nothing more than four
LRUs with a twist) is that the access frequency of blocks is inversely
proportional to their number (a few are used with high frequency, some
are requested moderately often, the vast majority rather seldom). That
means that the buffers in the upper third or quarter of the T1 queue
(and that's the critical one) are basically circling around each other,
with frequent visitors from the lower regions or other queues.

So if there were a way to make a good guess about that relative queue
position, the expensive CDB shuffling could often be avoided.

Jan
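One cheap way to get the "good guess" Jan is asking for would be a
per-queue move counter stamped onto each CDB whenever it is moved to MRU;
this mechanism is an assumption for illustration only, not something in
the actual ARC code:

/*
 * Sketch of a cheap queue-position estimate.  Hypothetical names and
 * fields; the elided relinking would happen under the queue lock.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct CDB {
    uint64_t last_move_stamp;   /* queue->move_counter at last MRU move */
    /* ... list links, buffer tag, ... */
} CDB;

typedef struct Queue {
    uint64_t move_counter;      /* incremented on every MRU insertion */
    int      length;
} Queue;

/*
 * Each MRU insertion pushes this CDB back by at most one place, so
 * (move_counter - last_move_stamp) bounds its distance from the MRU
 * end: if that is below length/3, it should still be in the upper third.
 */
static bool probably_near_mru(const Queue *q, const CDB *cdb)
{
    uint64_t age = q->move_counter - cdb->last_move_stamp;

    return age < (uint64_t) (q->length / 3);
}

void on_hit(Queue *q, CDB *cdb)
{
    if (probably_near_mru(q, cdb))
        return;                     /* skip the expensive relinking */

    /* ... unlink and re-insert at MRU under the queue lock ... */
    cdb->last_move_stamp = ++q->move_counter;
}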
> Tom Lane wrote:
> Manfred's numbers definitely say that we need to find a way to break
> down the BufMgrLock into multiple finer-grain locks. We already have
> all those per-buffer LWLocks, but I don't see how to apply those to
> the problem of managing the global lookup and replacement datastructures.
>
> Anyone see an attack path here?

Not specifically, but here are some ideas that might lead to something
useful... Taking a leaf from somebody else's book is always the best way
- much as has been done with the ARC algorithm itself.

1. Scaling Oracle8i: Building Highly Scalable OLTP System Architectures,
James Morle, ISBN 0-201-32574-8, seems like an interesting purchase in
this regard (I have no connection with the author) - but I can't vouch
for its usefulness.
http://www.aw-bc.com/catalog/academic/product/0,4096,0201325748-TOC,00.html
This has nothing to do with OPS or RAC, which I definitely do not
advocate as an approach to the current challenges.

2. DB2 has long offered multiple BUFFERPOOLs. Why not explicitly define
multiple pools and then assign specific database objects to them - each
would have its own MRU list.

...that spurs some further off-the-wall ideas that follow the more usual
PostgreSQL line of "design for automated operation" - how to make use of
multiple buffer pools without the need to manually assign database
objects to them?

First a question: am I right in thinking that the problem is proportional
to the number of independent actors (i.e. CPUs), but not dependent at all
on the actual size of the cache?

3. Split the T1 list into 2 (maybe more?) pieces, completely independent
of one another - T1a and T1b. When a T2, T1a, or T1b hit occurs, we
*somehow* pick one of the T1 lists in a pseudo-random fashion and move
the block to the top of that list (see the sketch after this mail). This
would clearly not be the best that ARC can achieve, but if the buffer
(and therefore the list) is big enough, then the skew between the two
lists might well be less than the loss of performance from having a hot
single MRU list. This might be regarded as vertical partitioning.

> Jan Wieck writes:
> The theory of the whole ARC strategy (which is nothing more than four
> LRUs with a twist) is that the access frequency of blocks is inversely
> proportional to their number (a few are used with high frequency, some
> are requested moderately often, the vast majority rather seldom). That
> means that the buffers in the upper third or quarter of the T1 queue
> (and that's the critical one) are basically circling around each other,
> with frequent visitors from the lower regions or other queues.

4. Taking the analogy from (3) differently: partition T1 horizontally,
like the floors of a building, or the league divisions in football
(Major/Minor? in the US, or 1st, 2nd, 3rd etc. in the UK). The ARC
algorithm is identical, but T1 effectively has multiple MRU points, or
put another way, multiple lists all chained together. Say we give it 3
levels - T1.1 thru T1.3. When a block is first promoted from T2, it goes
to the MRU of T1.3. When a hit occurs while it is in T1.3, it gets punted
to the MRU of T1.2, and when a hit occurs while it is in T1.2, it gets
promoted to the T1.1 MRU. On the way down, blocks falling off of T1.1
would go to T1.2, then T1.3. The T1.1 MRU would still be hotter than the
rest. The way I am thinking about this now is that the MRU points are
evenly spaced, so the ARC algorithm doesn't dynamically resize them as it
does with T1/T2, but that could change also.
Doing things this way might mean that ARC doesn't have to remember which
transaction id added it (a specific ARC tailoring feature added for
PostgreSQL), since even if an UPDATE/DELETE touches it twice, it won't
move very far up the list. Would that save time on the list reordering
code?

4a) Perhaps the MRU points should be spaced at 1/6, 1/3, 1/2 of the list,
or in some other fairly simple but non-linear way, in an attempt to
balance the hit frequency of the MRU points. Perhaps use four levels and
split 1/16, 2/16, 4/16, 8/16 (so the last level is really 9/16ths of the
list).

4b) Combine this with idea (3) above to split that list into two or more
equal-sized lists, T1.1a and T1.1b.

4c) Promote blocks differently, according to their content type? Put
index non-leaf blocks (from fastroot down) and the rightmost leaf block
that are in T1 straight into T2.1, put permanent table heap blocks and
index leaf blocks into T2.2, and put temporary objects into T2.3. This
would require tasting the block or enquiring about details of its parent
object, rather than handling it blind - without regard to its contents,
as ARC does now - and that sounds expensive.

Hope some of that helps.

Best Regards,

Simon Riggs
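A rough sketch of idea (3) above, with hypothetical names; the point is
only that a hit takes at most two sublist locks (acquired in index order)
instead of a single hot global lock. Re-validating c->sublist after the
locks are taken, and the eviction side of the lists, are deliberately
glossed over:

/*
 * Sketch of T1 split into independently locked sublists, with the
 * target sublist picked pseudo-randomly on each hit.  Names are
 * illustrative; not the actual ARC code.
 */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define T1_SPLITS 2

typedef struct CDB {
    struct CDB *prev, *next;
    int         sublist;            /* which T1 sublist holds it now */
} CDB;

typedef struct SubList {
    pthread_mutex_t lock;
    CDB            *mru, *lru;
    int             length;
} SubList;

static SubList t1[T1_SPLITS];

void t1_init(void)
{
    for (int i = 0; i < T1_SPLITS; i++)
        pthread_mutex_init(&t1[i].lock, NULL);
}

static void sublist_unlink(SubList *l, CDB *c)
{
    if (c->prev) c->prev->next = c->next; else l->mru = c->next;
    if (c->next) c->next->prev = c->prev; else l->lru = c->prev;
    c->prev = c->next = NULL;
    l->length--;
}

static void sublist_push_mru(SubList *l, CDB *c)
{
    c->prev = NULL;
    c->next = l->mru;
    if (l->mru) l->mru->prev = c; else l->lru = c;
    l->mru = c;
    l->length++;
}

/* Cheap pseudo-random pick; a per-backend counter would do as well. */
static int pick_sublist(uint32_t hint)
{
    return (int) (hint % T1_SPLITS);
}

/* On a hit, relink the CDB at the MRU end of a pseudo-randomly chosen
 * sublist, taking at most two sublist locks in index order. */
void t1_hit(CDB *c, uint32_t hint)
{
    int from = c->sublist;
    int to   = pick_sublist(hint);
    int lo   = from < to ? from : to;
    int hi   = from < to ? to : from;

    pthread_mutex_lock(&t1[lo].lock);
    if (hi != lo)
        pthread_mutex_lock(&t1[hi].lock);

    sublist_unlink(&t1[from], c);
    c->sublist = to;
    sublist_push_mru(&t1[to], c);

    if (hi != lo)
        pthread_mutex_unlock(&t1[hi].lock);
    pthread_mutex_unlock(&t1[lo].lock);
}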