Thread: Further reduction of bufmgr lock contention
I've been looking into Gavin Hamill's recent report of poor performance with PG 8.1 on an 8-way IBM PPC64 box. strace'ing backends shows a lot of semop() calls, indicating blocking at the LWLock or lmgr-lock levels, but not a lot of select() delays, suggesting we don't have too much of a problem at the hardware spinlock level. A typical breakdown of different kernel call types is:

   566 _llseek
    10 brk
    10 gettimeofday
     4 mmap
     4 munmap
   562 read
     4 recv
     8 select
  3014 semop
    12 send
     1 time
     3 write

(I'm hoping to get some oprofile results to confirm there's nothing strange going on at the hardware level, but no luck yet on getting oprofile to work on Debian/PPC64 ... anyone know anything about suitable kernels to use for that?)

Instrumenting LWLockAcquire (with a patch I had developed last fall, but just now got around to cleaning up and committing to CVS) shows that the contention is practically all for the BufMappingLock:

$ grep ^PID postmaster.log | sort +9nr | head -20
PID 23820 lwlock 0: shacq 2446470 exacq 6154 blk 12755
PID 23823 lwlock 0: shacq 2387597 exacq 4297 blk 9255
PID 23824 lwlock 0: shacq 1678694 exacq 4433 blk 8692
PID 23826 lwlock 0: shacq 1221221 exacq 3224 blk 5893
PID 23821 lwlock 0: shacq 1892453 exacq 1665 blk 5766
PID 23835 lwlock 0: shacq 2390685 exacq 1453 blk 5511
PID 23822 lwlock 0: shacq 1669419 exacq 1615 blk 4926
PID 23830 lwlock 0: shacq 1039468 exacq 1248 blk 2946
PID 23832 lwlock 0: shacq 698622 exacq 397 blk 1818
PID 23836 lwlock 0: shacq 544472 exacq 530 blk 1300
PID 23839 lwlock 0: shacq 497505 exacq 46 blk 885
PID 23842 lwlock 0: shacq 305281 exacq 1 blk 720
PID 23840 lwlock 0: shacq 317554 exacq 226 blk 355
PID 23840 lwlock 2: shacq 0 exacq 2872 blk 7
PID 23835 lwlock 2: shacq 0 exacq 3434 blk 6
PID 23835 lwlock 1: shacq 0 exacq 1452 blk 4
PID 23822 lwlock 1: shacq 0 exacq 1614 blk 3
PID 23820 lwlock 2: shacq 0 exacq 3582 blk 2
PID 23821 lwlock 1: shacq 0 exacq 1664 blk 2
PID 23830 lwlock 1: shacq 0 exacq 1247 blk 2

These numbers show that our rewrite of the bufmgr has done a great job of cutting down the amount of potential contention --- most of the traffic on this lock is shared rather than exclusive acquisitions --- but it seems that if you have enough CPUs it's still not good enough. (My best theory as to why Gavin is seeing better performance from a dual Opteron is simply that 2 processors will have 1/4th as much contention as 8 processors.)

I have an idea about how to improve matters: I think we could break the buffer tag to buffer mapping hashtable into multiple partitions based on some hash value of the buffer tags, and protect each partition under a separate LWLock, similar to what we did with the lmgr lock table not long ago. Anyone have a comment on this strategy, or a better idea?

regards, tom lane
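To make the partitioning idea concrete, here is a minimal standalone C sketch; the struct layout, hash function, and NUM_PARTITIONS value are illustrative assumptions for this sketch only, not PostgreSQL's actual BufferTag or hashing code.

#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-in for the buffer tag; not PostgreSQL's real BufferTag. */
typedef struct
{
    uint32_t rel;      /* relation identifier */
    uint32_t block;    /* block number within the relation */
} DemoBufferTag;

#define NUM_PARTITIONS 16          /* assumed power of two so we can mask */

/* Toy hash; the real code would hash the full tag (e.g. with hash_any). */
static uint32_t demo_hash(const DemoBufferTag *tag)
{
    return (tag->rel * 2654435761u) ^ (tag->block * 2246822519u);
}

/* Every buffer tag maps to exactly one partition, hence one LWLock, so two
 * backends whose tags land in different partitions never touch the same lock. */
static int partition_for(const DemoBufferTag *tag)
{
    return demo_hash(tag) & (NUM_PARTITIONS - 1);
}

int main(void)
{
    DemoBufferTag t = { 16384, 42 };

    printf("tag (%u, %u) -> partition %d\n",
           (unsigned) t.rel, (unsigned) t.block, partition_for(&t));
    return 0;
}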
On 4/21/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I've been looking into Gavin Hamill's recent report of poor performance
> with PG 8.1 on an 8-way IBM PPC64 box.

We have recently encountered some odd performance with 8.2dev on a 16-way Opteron. In the next few days we'll look into it and see if it may be related.

Otherwise, it sounds good to me; a trial of the new strategy should definitely prove its value one way or the other.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324
On Fri, 2006-04-21 at 13:01 -0400, Tom Lane wrote:
> I've been looking into Gavin Hamill's recent report of poor performance
> with PG 8.1 on an 8-way IBM PPC64 box.

Ah good.

> Instrumenting LWLockAcquire (with a patch I had developed last fall,
> but just now got around to cleaning up and committing to CVS) shows
> that the contention is practically all for the BufMappingLock:
> $ grep ^PID postmaster.log | sort +9nr | head -20
> PID 23820 lwlock 0: shacq 2446470 exacq 6154 blk 12755
> PID 23823 lwlock 0: shacq 2387597 exacq 4297 blk 9255
> PID 23824 lwlock 0: shacq 1678694 exacq 4433 blk 8692
> PID 23826 lwlock 0: shacq 1221221 exacq 3224 blk 5893

BufMappingLock contention can be made worse by a poorly tuned bgwriter or if the cache hit rate is low. Perhaps in this case, increasing shared_buffers (again) might be enough to further reduce the contention? When we discussed this before
http://archives.postgresql.org/pgsql-hackers/2005-02/msg00702.php
ISTM then that a low shared_buffers cache hit rate combined with a high OS cache hit rate will cause high contention in an SMP environment.

> These numbers show that our rewrite of the bufmgr has done a great job
> of cutting down the amount of potential contention --- most of the
> traffic on this lock is shared rather than exclusive acquisitions ---
> but it seems that if you have enough CPUs it's still not good enough.
> (My best theory as to why Gavin is seeing better performance from a
> dual Opteron is simply that 2 processors will have 1/4th as much
> contention as 8 processors.)

Jonah mentions some 16-way CPU testing we have just begun. There are some interesting effects to decode, but most importantly all the CPUs do stay at 100% for much of the time (when other tuning has been done). So my feeling is that the BufMappingLock contention seen by Gavin is much worse than what we see. (...and I had been thinking to investigate further with him on that point, though I have just arrived back in the UK.)

Another difference is the amount of read/write. My understanding is that Gavin's workload is mostly read-only, which will greatly increase the buffer request rate since backends will spend proportionally more time consuming data and less time in xlog (etc). My understanding is also that contention increases geometrically with the number of potential lock holders (i.e. CPUs).

> I have an idea about how to improve matters: I think we could break the
> buffer tag to buffer mapping hashtable into multiple partitions based on
> some hash value of the buffer tags, and protect each partition under a
> separate LWLock, similar to what we did with the lmgr lock table not
> long ago. Anyone have a comment on this strategy, or a better idea?

I think this is the right way to go
http://archives.postgresql.org/pgsql-hackers/2005-02/msg00240.php
though the work for 8.1 was right to have been performed first.

The earlier lmgr lock partitioning had a hard-coded number of partitions, which was sensible because of the reduced likelihood of effectiveness beyond a certain number of partitions. That doesn't follow here, since the BufMappingLock contention will vary with the size of shared_buffers and with the number of CPUs in use (for a given workload). I'd like to see the partitioning calculated at server startup, either directly from shared_buffers or via a parameter. We may not be restricted to only using a hash function as we were with lmgr; perhaps we could use simple range partitioning.
Test-wise: May be able to trial something next week, though system access not yet confirmed and I'm not sure we'll see an improvement on the workload we're testing on currently. I'll have a think about a pure test that we can run on both systems to measure the contention.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com/
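To make the startup-sizing and range-vs-hash discussion above concrete, here is a rough standalone C sketch; the sizing rule, constants, and function names are illustrative assumptions only, not something proposed or benchmarked in this thread.

#include <stdint.h>
#include <stdio.h>

/* Illustrative sizing rule only: pick a power-of-two partition count from
 * shared_buffers, capped by max_partitions. Any real rule would need
 * benchmarking; nothing in this thread fixes these numbers. */
static int choose_num_partitions(int shared_buffers, int max_partitions)
{
    int n = 1;

    while (n < max_partitions && shared_buffers / (n * 2) >= 4096)
        n *= 2;
    return n;
}

/* Hash partitioning: low-order bits of the tag's hash code
 * (num_partitions assumed to be a power of two). */
static int hash_partition(uint32_t hashcode, int num_partitions)
{
    return (int) (hashcode & (uint32_t) (num_partitions - 1));
}

/* Range partitioning: contiguous block-number ranges, one possible reading
 * of the "simple range partitioning" suggestion above. */
static int range_partition(uint32_t block, uint32_t blocks_per_partition,
                           int num_partitions)
{
    return (int) ((block / blocks_per_partition) % (uint32_t) num_partitions);
}

int main(void)
{
    int parts = choose_num_partitions(40000, 16);

    printf("shared_buffers=40000 -> %d partitions\n", parts);
    printf("hash partition of 0xdeadbeef: %d\n",
           hash_partition(0xdeadbeefu, parts));
    printf("range partition of block 9000: %d\n",
           range_partition(9000, 1024, parts));
    return 0;
}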
Simon Riggs <simon@2ndquadrant.com> writes:
> On Fri, 2006-04-21 at 13:01 -0400, Tom Lane wrote:
>> I've been looking into Gavin Hamill's recent report of poor performance
>> with PG 8.1 on an 8-way IBM PPC64 box.

> BufMappingLock contention can be made worse by a poorly tuned bgwriter
> or if the cache hit rate is low. Perhaps in this case, increasing
> shared_buffers (again) might be enough to further reduce the contention?

Well, the cache hit rate is evidently pretty high, since so few of the BufMappingLock accesses are exclusive (any miss would result in an exclusive access). I did try increasing shared_buffers (from 40000 to 80000) but it didn't change performance noticeably.

> Another difference is the amount of read/write. My understanding is that
> Gavin's workload is mostly read-only which will greatly increase the
> buffer request rate since backends will spend proportionally more time
> consuming data and less time in xlog (etc).

I believe the particular test case being looked at here is read-only (Gavin, is that correct?)

> The earlier lmgr lock partitioning had a hard-coded number of
> partitions, which was sensible because of the reduced likelihood of
> effectiveness beyond a certain number of partitions. That doesn't follow
> here since the BufMappingLock contention will vary with the size of
> shared_buffers and with the number of CPUs in use (for a given
> workload). I'd like to see the partitioning calculated at server startup
> either directly from shared_buffers or via a parameter. We may not be
> restricted to only using a hash function as we were with lmgr, perhaps
> using a simple range partitioning.

I don't think any of that follows; and a large number of partitions is risky because it increases the probability of exhausting shared memory (due to transient variations in the actual size of the hashtables for different partitions).

> Test-wise: May be able to trial something next week, though system
> access not yet confirmed and I'm not sure we'll see an improvement on
> the workload we're testing on currently. I'll have a think about a pure
> test that we can run on both systems to measure the contention.

Keep in mind that Gavin's 8-way turns back into a pumpkin on Monday :-(

regards, tom lane
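To put a number on that: taking the busiest backend from the earlier trace, 6154 exclusive acquisitions out of 2446470 + 6154 = 2452624 total is roughly 0.25%, i.e. only about 1 in 400 BufMappingLock acquisitions is exclusive, which is consistent with a very high buffer-cache hit rate.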
On Fri, 2006-04-21 at 17:38 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > The earlier lmgr lock partitioning had a hard-coded number of
> > partitions, which was sensible because of the reduced likelihood of
> > effectiveness beyond a certain number of partitions. That doesn't follow
> > here since the BufMappingLock contention will vary with the size of
> > shared_buffers and with the number of CPUs in use (for a given
> > workload). I'd like to see the partitioning calculated at server startup
> > either directly from shared_buffers or via a parameter. We may not be
> > restricted to only using a hash function as we were with lmgr, perhaps
> > using a simple range partitioning.
>
> I don't think any of that follows; and a large number of partitions is
> risky because it increases the probability of exhausting shared memory
> (due to transient variations in the actual size of the hashtables for
> different partitions).

lmgr partitioning uses either 4 or 16 partitions, restricted by the hash function, for various reasons. I see no similar restriction on using a hash function here - we could equally well use range partitioning. That relieves the restriction on the number of partitions, allowing us either more or fewer partitions, according to need. We can place a limit on that if you see a problem - at what level do you see a problem?

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com/
Simon Riggs <simon@2ndquadrant.com> writes:
> lmgr partitioning uses either 4 or 16, restricted by the hash function,
> for various reasons. I see no similar restriction on using a hash
> function here - we could equally well use range partitioning.

I don't really see any difference at all between the two cases as far as what hashing we use.

The thing that's nagging at me at the moment is the realization that a partitioned hashtable will eat more shared memory than a single hashtable. It wasn't that long ago that we had to do some hacking to ensure that the buffer hashtable couldn't run out of memory after startup, and I'm afraid of re-introducing that failure mode. The lock manager can run out of memory without crashing the system, but the bufmgr can't (or at least could not in the recent past...)

Now that we're considering using partitioning methods for both the buffer and lock hashtables, I wonder if there is any value in teaching dynahash.c itself about concurrent access --- that is, find a way to use a single shared hashtable instead of separate hashtables for the different partitions, by having the hashtable itself deal with concurrency to some extent. This is just handwaving at the moment...

regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> wrote > > The thing that's nagging at me at the moment is the realization that a > partitioned hashtable will eat more shared memory than a single > hashtable. It wasn't that long ago that we had to do some hacking to > ensure that the buffer hashtable couldn't run out of memory after > startup, and I'm afraid of re-introducing that failure mode. The lock > manager can run out of memory without crashing the system, but the > bufmgr can't (or at least could not in the recent past...) > IHMO overflow is not avoidable no matter we use hash or range. Theoretically seems we could have a data structure like this: (1) a set of k partition tables, each is with a LWLock and size NBuffers/k; (2) a set of k overflow tables (actually we only need k-1) plus a LWLock protecting them, each is with size NBuffers/k. If any partition table overflows, we can assign a overflow table for it to contain extra hash elements. At run time, the hash tables for buffer pool may look like this: [partition 0] [partition 1][overflow 2] [partition 2][overflow 0] [partition 3] But I am not sure how difficult to implement it in current hash code - another handwaiving ... Regards, Qingqing
On Fri, 21 Apr 2006 17:38:01 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I believe the particular test case being looked at here is read-only
> (Gavin, is that correct?)

Yes - I made sure the devels made it readonly so I could farm search requests out to Slony-replicated machines (ended up running live searches for the whole site on a host hundreds of miles away :)

> Keep in mind that Gavin's 8-way turns back into a pumpkin on
> Monday :-(

Aye, it would've been gone earlier today but the rental company were being a bit slack so pushed it back to Monday. The pickup is already arranged so I can't stall them at this stage. I guess I could play the 'help the greater good by lending your kit for open source devel' card with them once they get it back to their office.

Otherwise, I've seen at least one offer of pSeries hardware on the performance list - I'm sure I could make available a version of our data with all sensitive stuff removed so that it could be tested on other machines.

.. and to top it all off, I didn't even get to go to the ball - and I doubt there'll be a glass slipper on offer...

gdh
Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>> On Fri, 2006-04-21 at 13:01 -0400, Tom Lane wrote:
>>> I've been looking into Gavin Hamill's recent report of poor performance
>>> with PG 8.1 on an 8-way IBM PPC64 box.
>
> Keep in mind that Gavin's 8-way turns back into a pumpkin on Monday :-(

I have a 4-way dual-core POWER5 system available...

Mark
Tom Lane wrote:
> I've been looking into Gavin Hamill's recent report of poor performance
> with PG 8.1 on an 8-way IBM PPC64 box.
[...]

Hullo again :)

I'm unfamiliar with postgres development practices, so this is more a request for information than anything else. It's been about a month since the last activity on bufmgr as documented on the hackers list and I was just concerned that this issue had been filed as an interesting toy at the time, but now left for the circular filing cabinet :)

Tom + Simon were able to see a fairly easy 25% performance boost against our dataset and I'd obv. be very keen to see this work make it into 8.1.4 or 8.2.0 :)

Cheers, Gavin.
Gavin Hamill <gdh@acentral.co.uk> writes:
> It's been about a month since the last activity on bufmgr as documented
> on the hackers list and I was just concerned that this issue had been
> filed as an interesting toy at the time, but now left for the circular
> filing cabinet :)
> Tom + Simon were able to see a fairly easy 25% performance boost against
> our dataset and I'd obv. be very keen to see this work make it into
> 8.1.4 or 8.2.0 :)

We're certainly not putting any such thing into 8.1.*. The proposed patch for 8.2 is stalled ATM because of the problem of not having a predictable size for the per-partition hash tables. Fixed-size shared memory is a harsh mistress :-(

regards, tom lane
Tom Lane wrote:
> We're certainly not putting any such thing into 8.1.*. The proposed
> patch for 8.2 is stalled ATM because of the problem of not having a
> predictable size for the per-partition hash tables. Fixed-size shared
> memory is a harsh mistress :-(

Fair enough :) Just wanted to ascertain that it was still a going concern - I have full confidence that you'll have a brainwave one morning as to the perfect solution =)

gdh
Tom,

BTW, we're going to be testing this patch on Sun Niagara servers. What's the outstanding bug with it? I don't quite follow. I think I can get some of the Sun MDEs to take a stab at it if I can understand the issue. Links are fine in case I've not found part of this thread in the archives.

--
--Josh
Josh Berkus
PostgreSQL @ Sun
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
> BTW, we're going to be testing this patch on Sun Niagara servers. What's
> the outstanding bug with it? I don't quite follow.

It's not acceptable as-is because of the risk of running out of shared memory for hashtable entries. In the existing code, there's a clear upper bound on the number of entries in the block-number-to-buffer hash table, ie, shared_buffers + 1 (the +1 because we acquire the new entry before releasing the old when reassigning a buffer). With multiple hashtables serving subsets of the buffers, the different tables might at different times need different numbers of entries, and that makes it a lot harder to be sure you won't run out of memory. I don't say it's insoluble, but the current patch wasn't even claimed to be safe by its author...

regards, tom lane
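As a concrete illustration of the sizing concern: with shared_buffers = 40000 (the value in use earlier in this thread), a single table needs at most 40001 entries. Split into, say, 16 separate per-partition tables, an even split would suggest about 2500 entries each, but nothing bounds how unevenly the buffers present at any instant can hash across partitions, so each per-partition table would in principle have to be sized well above its "fair share" to be provably safe.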
Is this a TODO?

---------------------------------------------------------------------------

Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > BTW, we're going to be testing this patch on Sun Niagara servers. What's
> > the outstanding bug with it? I don't quite follow.
>
> It's not acceptable as-is because of the risk of running out of shared
> memory for hashtable entries. In the existing code, there's a clear
> upper bound on the number of entries in the block-number-to-buffer hash
> table, ie, shared_buffers + 1 (the +1 because we acquire the new entry
> before releasing the old when reassigning a buffer). With multiple
> hashtables serving subsets of the buffers, the different tables might
> at different times need different numbers of entries, and that makes it
> a lot harder to be sure you won't run out of memory. I don't say it's
> insoluble, but the current patch wasn't even claimed to be safe by its
> author...
>
> regards, tom lane

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
A couple months ago we were discussing partitioning the buffer mapping table so as to reduce contention for the BufMappingLock. The discussion stalled after I complained about the uncertainty of shared memory usage arising from having a separate hashtable for each partition (since different numbers of buffers might fall into each partition at different times). I've been thinking about it more after seeing an example from Tatsuo that seems to be exactly the same problem as Gavin saw.

We could fix the uncertain-usage objection if we stick with a single hashtable and ensure that concurrent access to different partitions of it is safe. I believe this is easily doable, if we make hashtables used in this mode allocate the maximum allowed number of buckets immediately during hashtable initialization. (Since shared hashtables already have a fixed maximum directory size, they already have an upper bound on the number of buckets, so this loses no flexibility.) Then there will be no on-the-fly bucket splits, and that means that accesses to different hash buckets operate completely independently. Therefore, we can regard the hashtable as logically partitioned on the basis of any classification of entries that will uniquely assign hash buckets to classes --- taking the low order bits of entry hash codes will do fine.

The only changeable state that is shared across all buckets is the entry freelist and the "nentries" counter. We could protect these with a spinlock (one spinlock is sufficient since changes to nentries go along with addition or removal of freelist entries).

Usage of a partitioned hash table would then be like

	compute hashtable lookup key;
	entryhashcode = calc_hash(lookup key);
	partitionnumber = entryhashcode % NumPartitions;
	LWLockAcquire(PartitionLock[partitionnumber], ...);
	manipulate hashtable;
	LWLockRelease(PartitionLock[partitionnumber]);

We could do this without changing the API of hash_search, but then we'd be computing the entry hashcode twice, so I'm inclined to provide an additional entry point that takes a precalculated hashcode.

Potential downsides of applying this idea to the buffer mapping table:

1. Reassigning a buffer to a new page will (usually) require two cycles of LWLockAcquire/Release for the two different partitions involved. Since this path also requires at least a read() kernel call (maybe a write() too), I don't think there'll be any meaningful slowdown.

2. The current logic for reassigning a buffer attempts to make a hashtable entry for its new page number (to detect collisions) before releasing the old hashtable entry. This would only be safe if we held both partition LWLocks concurrently; which seems bad for concurrency, plus avoiding deadlock would complicate the code significantly. I'm inclined to release the old entry and then try to insert the new one, holding only one lock at a time. If the insertion fails (because someone was concurrently loading the same page), we'd have to throw the buffer onto the freelist rather than allowing it to retain its previous valid data. Which is slightly annoying, but in practice the case probably doesn't happen often enough to be worth worrying about.

3. Taking the freelist spinlock is new computation that wasn't there before. But, again, it's only needed in code paths that will also be doing a kernel call.

If we do this we should probably also handle the lmgr lock tables the same way (partially reverting my partition-the-LockMgrLock patch of a couple months ago).
However, downside #3 might be a stronger objection for lmgr, since it can create or destroy lock objects without necessarily doing any I/O. Thoughts, objections? regards, tom lane
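Expanding the usage pattern above into a standalone, runnable C model may help; everything here (names, the direct-mapped table, the counter "locks") is an illustrative stand-in rather than the real dynahash/LWLock code.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Standalone model of the calling pattern proposed above. */

#define NUM_BUFFER_PARTITIONS 16
#define NUM_BUCKETS 1024        /* fixed at "startup", so no bucket splits ever
                                 * move an entry into a different partition */

typedef struct { uint32_t rel; uint32_t block; } DemoTag;
typedef struct { bool used; DemoTag tag; int buf_id; } DemoEntry;

static DemoEntry bucket[NUM_BUCKETS];               /* single shared "hashtable" */
static int partition_lock[NUM_BUFFER_PARTITIONS];   /* stand-ins for the LWLocks */

static uint32_t calc_hash(const DemoTag *t)
{
    return (t->rel * 2654435761u) ^ (t->block * 2246822519u);
}

static void lock_partition(int p)   { partition_lock[p]++; }    /* placeholder */
static void unlock_partition(int p) { partition_lock[p]--; }

/* Key points: the hash is computed once, the partition comes from the
 * low-order bits of that hash, and only that partition's lock is held while
 * the shared table is touched. Since NUM_BUCKETS is a multiple of
 * NUM_BUFFER_PARTITIONS, each bucket belongs to exactly one partition. */
static int lookup(const DemoTag *t)
{
    uint32_t   hashcode  = calc_hash(t);
    int        partition = (int) (hashcode % NUM_BUFFER_PARTITIONS);
    DemoEntry *e         = &bucket[hashcode % NUM_BUCKETS];
    int        result    = -1;

    lock_partition(partition);
    if (e->used && e->tag.rel == t->rel && e->tag.block == t->block)
        result = e->buf_id;
    unlock_partition(partition);
    return result;
}

int main(void)
{
    DemoTag t = { 16384, 7 };

    bucket[calc_hash(&t) % NUM_BUCKETS] = (DemoEntry){ true, t, 123 };
    printf("lookup -> buffer %d\n", lookup(&t));
    return 0;
}

The real implementation would of course go through dynahash, using a hash_search entry point that accepts a precalculated hash code plus the spinlock-protected shared freelist, as described in the message above.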
On Thu, 2006-07-20 at 15:41 -0400, Tom Lane wrote:
> Usage of a partitioned hash table would then be like
>
> 	compute hashtable lookup key;
> 	entryhashcode = calc_hash(lookup key);
> 	partitionnumber = entryhashcode % NumPartitions;
> 	LWLockAcquire(PartitionLock[partitionnumber], ...);
> 	manipulate hashtable;
> 	LWLockRelease(PartitionLock[partitionnumber]);
>
> We could do this without changing the API of hash_search, but then we'd
> be computing the entry hashcode twice, so I'm inclined to provide an
> additional entry point that takes a precalculated hashcode.

That should be an additional win anyway, since hash_any() uses about 1% CPU on tests I've seen - so we will hold locks for a slightly shorter duration.

> Potential downsides of applying this idea to the buffer mapping table:
>
> 1. Reassigning a buffer to a new page will (usually) require two cycles
> of LWLockAcquire/Release for the two different partitions involved.
> Since this path also requires at least a read() kernel call (maybe a
> write() too), I don't think there'll be any meaningful slowdown.
>
> 3. Taking the freelist spinlock is new computation that wasn't there
> before. But, again, it's only needed in code paths that will also be
> doing a kernel call.

...So the additional overhead sounds acceptable, given that we will save somewhat on the hash_any().

> If we do this we should probably also handle the lmgr lock tables the
> same way (partially reverting my partition-the-LockMgrLock patch of a
> couple months ago). However, downside #3 might be a stronger objection
> for lmgr, since it can create or destroy lock objects without necessarily
> doing any I/O.

We should be in a position to test this soon.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Is this being kept for 8.3?

---------------------------------------------------------------------------

Tom Lane wrote:
> A couple months ago we were discussing partitioning the buffer mapping
> table so as to reduce contention for the BufMappingLock. The discussion
> stalled after I complained about the uncertainty of shared memory usage
> arising from having a separate hashtable for each partition (since
> different numbers of buffers might fall into each partition at different
> times). I've been thinking about it more after seeing an example from
> Tatsuo that seems to be exactly the same problem as Gavin saw.
> [...]
> Thoughts, objections?
>
> regards, tom lane

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
> Is this being kept for 8.3?

No, it was committed a month ago.

regards, tom lane