Re: [PATCHES] update i386 spinlock for hyperthreading - Mailing list pgsql-hackers
From: Simon Riggs
Subject: Re: [PATCHES] update i386 spinlock for hyperthreading
Date:
Msg-id: 000c01c3ee9b$7178ce70$8ac887d9@LaptopDellXP
In response to: Re: [PATCHES] update i386 spinlock for hyperthreading (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
> Tom Lane wrote:
> Manfred's numbers definitely say that we need to find a way to break
> down the BufMgrLock into multiple finer-grain locks. We already have
> all those per-buffer LWLocks, but I don't see how to apply those to
> the problem of managing the global lookup and replacement
> datastructures.
>
> Anyone see an attack path here?

Not specifically, but here are some ideas that might lead to something useful...

Taking a leaf from somebody else's book is always the best way - much as has been done with the ARC algorithm itself.

1. Scaling Oracle8i: Building Highly Scalable OLTP System Architectures, by James Morle (ISBN 0-201-32574-8), seems like an interesting purchase in this regard (I have no connection with the author), though I can't vouch for its usefulness:
http://www.aw-bc.com/catalog/academic/product/0,4096,0201325748-TOC,00.html
This has nothing to do with OPS or RAC, which I definitely do not advocate as an approach to the current challenges.

2. DB2 has long offered multiple BUFFERPOOLs. Why not explicitly define multiple pools and then assign specific database objects to them? Each would have its own MRU list.

...which spurs some further off-the-wall ideas along the more usual PostgreSQL line of "design for automated operation": how do we make use of multiple buffer pools without the need to manually assign database objects to them?

First, a question: am I right in thinking that the problem is proportional to the number of independent actors (i.e. CPUs), but not dependent at all on the actual size of the cache?

3. Split the T1 list into two (maybe more?) pieces, completely independent of one another - T1a and T1b. When a T2, T1a, or T1b hit occurs, we *somehow* pick one of the T1 lists in a pseudo-random fashion and move the buffer to the top of that list. This would clearly not be the best that ARC can achieve, but if the buffer cache (and therefore each list) is big enough, the skew between the two lists might well cost less than a single hot MRU point does. This might be regarded as vertical partitioning. (A rough C sketch of this follows below, after idea 4.)

> Jan Wieck writes:
> The theory of the whole ARC strategy (which is nothing more than four
> LRUs with a twist) is that the access frequency of blocks is inversely
> proportional to their number (a few are used at high frequency, some
> are requested moderately often, the vast majority rather seldom). That
> means that the buffers in the upper third or quarter of the T1 queue
> (and that's the critical one) are basically circling around each
> other, with frequent visitors from the lower regions or other queues.

4. Taking the analogy from (3) differently: partition T1 horizontally, like the floors of a building, or the league divisions in football (Major/Minor in the US, or 1st, 2nd, 3rd etc. in the UK). The ARC algorithm stays identical, but T1 effectively has multiple MRU points - or, put another way, multiple lists all chained together. Say we give it three levels, T1.1 through T1.3. When a block is first promoted from T2, it goes to the MRU of T1.3. A hit while it is in T1.3 punts it to the MRU of T1.2, and a hit while in T1.2 promotes it to the T1.1 MRU. On the way down, blocks falling off T1.1 drop to T1.2, then T1.3. The T1.1 MRU would still be hotter than the rest. The way I am thinking about this now, the MRU points are evenly spaced, so the ARC algorithm doesn't dynamically resize them as it does with T1/T2, but that could change too. (This one is also sketched below.)
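To make idea 3 concrete, here is a minimal C sketch of a T1 list split into independently locked partitions with a pseudo-random pick on each touch. It is purely illustrative - none of the names or structures come from the PostgreSQL source:

    /*
     * Hypothetical sketch of idea 3: T1 split into two independent LRU
     * lists, with the target list picked pseudo-randomly on each hit so
     * that no single MRU point becomes a hot spot under many CPUs.
     */
    #include <stdlib.h>

    #define T1_NPARTS 2

    typedef struct BufNode
    {
        struct BufNode *prev;
        struct BufNode *next;
        int             buf_id;
    } BufNode;

    typedef struct LruList
    {
        BufNode    *mru;        /* head: most recently used */
        BufNode    *lru;        /* tail: next eviction candidate */
        /* in a real implementation a per-partition spinlock lives here */
    } LruList;

    static LruList t1_part[T1_NPARTS];

    /* Unlink a node from the list that currently holds it. */
    static void
    lru_unlink(LruList *list, BufNode *node)
    {
        if (node->prev)
            node->prev->next = node->next;
        else
            list->mru = node->next;
        if (node->next)
            node->next->prev = node->prev;
        else
            list->lru = node->prev;
    }

    /* Push a node at the MRU end of a list. */
    static void
    lru_push_mru(LruList *list, BufNode *node)
    {
        node->prev = NULL;
        node->next = list->mru;
        if (list->mru)
            list->mru->prev = node;
        list->mru = node;
        if (list->lru == NULL)
            list->lru = node;
    }

    /*
     * On a hit, move the buffer to the MRU of a pseudo-randomly chosen
     * partition.
     */
    static void
    t1_touch(LruList *current, BufNode *node)
    {
        int         part = rand() % T1_NPARTS;

        lru_unlink(current, node);
        lru_push_mru(&t1_part[part], node);
    }

Each partition keeping its own lock is the point: two CPUs touching different partitions never serialise on the same MRU pointer.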
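And an equally hypothetical sketch of the idea 4 ladder - the list mechanics are elided since they are the same as above, and again the names are mine:

    /*
     * Idea 4 (hypothetical): T1 as a ladder of levels, T1.1 (hottest)
     * down to T1.3.  A block promoted from T2 enters at the coldest
     * level's MRU; each further hit climbs one level; a block pushed
     * off a level's LRU end drops one level, and leaves T1 entirely
     * once it falls off the last level.
     */
    #define T1_LEVELS 3         /* level 0 = T1.1, the hottest */

    typedef struct LevelBuf
    {
        int         level;      /* which T1.x list holds this buffer */
        int         buf_id;
    } LevelBuf;

    /* Entry from T2: start at the coldest level's MRU. */
    static void
    t1_enter(LevelBuf *buf)
    {
        buf->level = T1_LEVELS - 1;
    }

    /*
     * A hit climbs one level; the caller then moves the buffer to that
     * level's MRU point.  A buffer already in T1.1 simply returns to
     * the T1.1 MRU, as in plain ARC.
     */
    static int
    t1_hit(LevelBuf *buf)
    {
        if (buf->level > 0)
            buf->level--;
        return buf->level;
    }

    /*
     * A full level demotes its LRU victim one level down.  A result of
     * T1_LEVELS means the block has left T1 altogether.
     */
    static int
    t1_overflow(LevelBuf *victim)
    {
        victim->level++;
        return victim->level;
    }

The level capacities are a free parameter here, which is all that the non-linear spacing suggested in 4a below would change.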
Doing things this way might mean that ARC doesn't have to remember which transaction id added a block - a specific tailoring of ARC added for PostgreSQL - since even if an UPDATE/DELETE touches the block twice, it won't move very far up the list. Would that save time in the list-reordering code?

4a) Perhaps the MRU points should be spaced at 1/6, 1/3, 1/2 of the list, or in some other fairly simple but non-linear way, in an attempt to balance the hit frequency at each MRU point. Perhaps use four levels, split 1/16, 2/16, 4/16 and the remainder (so the last level is 9/16ths of the list).

4b) Combine this with idea (3) above and split each level into two or more equal-sized lists, T1.1a and T1.1b.

4c) Promote blocks differently, according to their content type? Put index non-leaf blocks (from the fastroot down) and the rightmost leaf block that are in T1 straight into T2.1, put permanent table heap blocks and index leaf blocks into T2.2, and put temporary objects into T2.3. That would require tasting the block, or enquiring about its parent object, rather than handling it blind - without regard to its contents, as ARC does now - which sounds expensive.

Hope some of that helps..

Best Regards,

Simon Riggs