Re: Further reduction of bufmgr lock contention - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: Further reduction of bufmgr lock contention
Msg-id: 9019.1153424488@sss.pgh.pa.us
In response to: Re: Further reduction of bufmgr lock contention (Simon Riggs <simon@2ndquadrant.com>)
List: pgsql-hackers
A couple months ago we were discussing partitioning the buffer mapping table so as to reduce contention for the BufMappingLock. The discussion stalled after I complained about the uncertainty of shared memory usage arising from having a separate hashtable for each partition (since different numbers of buffers might fall into each partition at different times). I've been thinking about it more after seeing an example from Tatsuo that seems to be exactly the same problem Gavin saw.

We could fix the uncertain-usage objection if we stick with a single hashtable and ensure that concurrent access to different partitions of it is safe. I believe this is easily doable if we make hashtables used in this mode allocate the maximum allowed number of buckets immediately during hashtable initialization. (Since shared hashtables already have a fixed maximum directory size, they already have an upper bound on the number of buckets, so this loses no flexibility.) Then there will be no on-the-fly bucket splits, which means that accesses to different hash buckets operate completely independently. Therefore, we can regard the hashtable as logically partitioned on the basis of any classification of entries that uniquely assigns hash buckets to classes --- taking the low-order bits of entry hash codes will do fine.

The only changeable state that is shared across all buckets is the entry freelist and the "nentries" counter. We could protect these with a spinlock (one spinlock is sufficient, since changes to nentries go along with addition or removal of freelist entries).

Usage of a partitioned hash table would then look like

    compute hashtable lookup key;
    entryhashcode = calc_hash(lookup key);
    partitionnumber = entryhashcode % NumPartitions;
    LWLockAcquire(PartitionLock[partitionnumber], ...);
    manipulate hashtable;
    LWLockRelease(PartitionLock[partitionnumber]);

We could do this without changing the API of hash_search, but then we'd be computing the entry hashcode twice, so I'm inclined to provide an additional entry point that takes a precalculated hashcode (sketched below).

Potential downsides of applying this idea to the buffer mapping table:

1. Reassigning a buffer to a new page will (usually) require two cycles of LWLockAcquire/Release for the two different partitions involved. Since this path also requires at least a read() kernel call (maybe a write() too), I don't think there'll be any meaningful slowdown.

2. The current logic for reassigning a buffer attempts to make a hashtable entry for its new page number (to detect collisions) before releasing the old hashtable entry. That would only be safe if we held both partition LWLocks concurrently, which seems bad for concurrency; moreover, avoiding deadlock would complicate the code significantly. I'm inclined to release the old entry and then try to insert the new one, holding only one lock at a time (see the last sketch below). If the insertion fails (because someone was concurrently loading the same page), we'd have to throw the buffer onto the freelist rather than allowing it to retain its previous valid data. That is slightly annoying, but in practice the case probably doesn't happen often enough to be worth worrying about.

3. Taking the freelist spinlock is new computation that wasn't there before. But, again, it's only needed in code paths that will also be doing a kernel call.

If we do this we should probably also handle the lmgr lock tables the same way (partially reverting my partition-the-LockMgrLock patch of a couple months ago). However, downside #3 might be a stronger objection for lmgr, since it can create or destroy lock objects without necessarily doing any I/O.
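To make this concrete, here are some sketches of what I'm imagining. None of this is existing code, and all of the names (HASH_PARTITION, hash_search_with_hash_value(), NUM_BUFFER_PARTITIONS, BufMappingPartitionLock, and so on) are invented for illustration. First, table creation would preallocate the maximum number of buckets, so that bucket splits can never happen:

    /* Sketch: create the shared table with all buckets preallocated. */
    HASHCTL     info;

    info.keysize = sizeof(BufferTag);
    info.entrysize = sizeof(BufferLookupEnt);
    info.hash = tag_hash;
    info.num_partitions = NUM_BUFFER_PARTITIONS;    /* hypothetical field */

    BufMappingTable = ShmemInitHash("Shared Buffer Lookup Table",
                                    size, size,     /* init == max: no splits */
                                    &info,
                                    HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);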
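Next, a partition-aware lookup, computing the hash code once and reusing it for both partition selection and the table probe; hash_search_with_hash_value() here is the proposed extra dynahash entry point:

    /* Illustrative only: pick the partition lock from the hash code. */
    #define NUM_BUFFER_PARTITIONS  16   /* must be a power of 2 */
    #define BufMappingPartitionLock(hashcode) \
        (BufPartitionLocks[(hashcode) % NUM_BUFFER_PARTITIONS])

    int
    BufTableLookup(BufferTag *tagPtr)
    {
        uint32      hashcode = calc_hash(tagPtr);   /* compute once */
        LWLockId    partitionLock = BufMappingPartitionLock(hashcode);
        BufferLookupEnt *result;
        int         id;

        LWLockAcquire(partitionLock, LW_SHARED);

        /* Proposed entry point taking the precalculated hash code */
        result = (BufferLookupEnt *)
            hash_search_with_hash_value(BufMappingTable,
                                        (void *) tagPtr,
                                        hashcode,
                                        HASH_FIND,
                                        NULL);
        id = result ? result->id : -1;

        LWLockRelease(partitionLock);
        return id;
    }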
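Third, the freelist-and-counter protection; "mutex", "freeList", and "nentries" would live in the shared hashtable header:

    /* Sketch: allocate an entry off the shared freelist under one spinlock. */
    static HASHELEMENT *
    get_hash_entry(HASHHDR *hctl)
    {
        HASHELEMENT *newElement;

        SpinLockAcquire(&hctl->mutex);
        newElement = hctl->freeList;
        if (newElement != NULL)
        {
            hctl->freeList = newElement->link;
            hctl->nentries++;       /* one spinlock protects both fields */
        }
        SpinLockRelease(&hctl->mutex);

        return newElement;
    }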
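Last, the one-lock-at-a-time buffer reassignment from point #2. As in the other sketches the helper names are illustrative; BufTableDelete() and BufTableInsert() are assumed to take the precalculated hashcode, with BufTableInsert() returning false on a collision, and error recovery is omitted:

    /* Sketch: remap a buffer holding only one partition lock at a time. */
    static void
    remap_buffer(BufferDesc *buf, BufferTag *oldTag, BufferTag *newTag)
    {
        uint32      oldHash = calc_hash(oldTag);
        uint32      newHash = calc_hash(newTag);

        /* Drop the old mapping first, under its own partition lock. */
        LWLockAcquire(BufMappingPartitionLock(oldHash), LW_EXCLUSIVE);
        BufTableDelete(oldTag, oldHash);
        LWLockRelease(BufMappingPartitionLock(oldHash));

        /* Then try to enter the new mapping under its partition lock. */
        LWLockAcquire(BufMappingPartitionLock(newHash), LW_EXCLUSIVE);
        if (!BufTableInsert(newTag, newHash, buf->buf_id))
        {
            /*
             * Someone loaded the same page concurrently.  The buffer's
             * old mapping is already gone, so it can't keep its previous
             * contents; throw it on the freelist instead.
             */
            LWLockRelease(BufMappingPartitionLock(newHash));
            StrategyFreeBuffer(buf);
            return;
        }
        LWLockRelease(BufMappingPartitionLock(newHash));
    }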
Thoughts, objections?

			regards, tom lane