Thread: patch: avoid heavyweight locking on hash metapage
I developed the attached patch to avoid taking a heavyweight lock on
the metapage of a hash index.  Instead, an exclusive buffer content
lock is viewed as sufficient permission to modify the metapage, and a
shared buffer content lock is used when such modifications need to be
prevented.  For the most part this is a trivial change, because we
were already taking these locks: we were just taking the heavyweight
locks in addition.

The only sticking point is that, when we're searching or inserting, we
previously locked the bucket before releasing the heavyweight metapage
lock, which is unworkable when holding only a buffer content lock
because (1) we might deadlock and (2) buffer content locks can't be
held for long periods of time even when there's no deadlock risk.  To
fix this, I implemented a simple loop-and-retry system: we release the
metapage content lock, acquire the heavyweight lock on the target
bucket, and then reacquire the metapage content lock and check that
the bucket mapping has not changed.  Normally it hasn't, and we're
done.  But if by chance it has, we simply unlock the metapage, release
the heavyweight lock we acquired previously, lock the new bucket, and
loop around again.  Even in the worst case we cannot loop very many
times here, since we don't split the same bucket again until we've
split all the other buckets, and 2^N gets big pretty fast.

I tested the effect of this by setting up a series of 5-minute
read-only pgbench runs at scale factor 300 with 8GB of shared buffers
on the IBM POWER7 machine.  For these runs, I dropped the primary key
constraint on pgbench_accounts (aid) and created a hash index on that
column instead.  I ran each test three times and took the median
result.

Here are the results on unpatched master, at various client counts:

m01 tps = 9004.070422 (including connections establishing)
m04 tps = 34838.126542 (including connections establishing)
m08 tps = 70584.356826 (including connections establishing)
m16 tps = 128726.248198 (including connections establishing)
m32 tps = 123639.248172 (including connections establishing)
m64 tps = 104650.296143 (including connections establishing)
m80 tps = 88412.736416 (including connections establishing)

And here are the results with the patch:

h01 tps = 9110.561413 (including connections establishing) [+1.2%]
h04 tps = 36012.787524 (including connections establishing) [+3.4%]
h08 tps = 72606.302993 (including connections establishing) [+2.9%]
h16 tps = 141938.762793 (including connections establishing) [+10%]
h32 tps = 205325.232316 (including connections establishing) [+66%]
h64 tps = 274156.881975 (including connections establishing) [+162%]
h80 tps = 291224.012066 (including connections establishing) [+229%]

Obviously, even with this change, there's a lot not to like about hash
indexes: they still won't be crash-safe, and they still won't perform
as well under high concurrency as btree indexes.  But neither of those
problems seems like a good reason not to fix this problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
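For reference, the loop-and-retry logic comes out roughly like this
(a paraphrased sketch of the _hash_first changes, built from the
existing hash AM helpers; rel and hashkey come from the surrounding
scan code, and the attached patch is the authoritative version):

    Buffer      metabuf;
    HashMetaPage metap;
    Bucket      bucket;
    BlockNumber blkno;
    BlockNumber oldblkno = InvalidBlockNumber;
    bool        retry = false;

    /* Read the metapage, taking only a shared buffer content lock. */
    metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
    metap = HashPageGetMeta(BufferGetPage(metabuf));

    /* Loop until we hold the heavyweight lock on the right bucket. */
    for (;;)
    {
        /* Compute the target bucket, and convert to a block number. */
        bucket = _hash_hashkey2bucket(hashkey,
                                      metap->hashm_maxbucket,
                                      metap->hashm_highmask,
                                      metap->hashm_lowmask);
        blkno = BUCKET_TO_BLKNO(metap, bucket);

        /* Release the metapage content lock, but keep the pin. */
        _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);

        /*
         * If the previous iteration locked what is still the correct
         * target bucket, we're done.  Otherwise, drop the now-useless
         * lock before taking the new one.
         */
        if (retry)
        {
            if (oldblkno == blkno)
                break;
            _hash_droplock(rel, oldblkno, HASH_SHARE);
        }
        _hash_getlock(rel, blkno, HASH_SHARE);

        /*
         * Reacquire the metapage content lock and loop around to check
         * whether a bucket split invalidated our choice while we were
         * waiting for the bucket lock.
         */
        _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
        oldblkno = blkno;
        retry = true;
    }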
2012/5/30 Robert Haas <robertmhaas@gmail.com>:
> I tested the effect of this by setting up a series of 5-minute
> read-only pgbench runs at scale factor 300 with 8GB of shared buffers
> on the IBM POWER7 machine.

I know it doesn't matter, but out of curiosity, what OS did you use?

best regards,
--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br
On Thu, May 31, 2012 at 7:07 AM, Dickson S. Guedes
<listas@guedesoft.net> wrote:
> 2012/5/30 Robert Haas <robertmhaas@gmail.com>:
>> I tested the effect of this by setting up a series of 5-minute
>> read-only pgbench runs at scale factor 300 with 8GB of shared buffers
>> on the IBM POWER7 machine.
>
> I know it doesn't matter, but out of curiosity, what OS did you use?

Fedora 16, Linux 3.2.6-3.fc16.ppc64

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 30, 2012 at 3:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I developed the attached patch to avoid taking a heavyweight lock on
> the metapage of a hash index.  Instead, an exclusive buffer content
> lock is viewed as sufficient permission to modify the metapage, and a
> shared buffer content lock is used when such modifications need to be
> prevented.  For the most part this is a trivial change, because we
> were already taking these locks: we were just taking the heavyweight
> locks in addition.  The only sticking point is that, when we're
> searching or inserting, we previously locked the bucket before
> releasing the heavyweight metapage lock, which is unworkable when
> holding only a buffer content lock because (1) we might deadlock and
> (2) buffer content locks can't be held for long periods of time even
> when there's no deadlock risk.  To fix this, I implemented a simple
> loop-and-retry system: we release the metapage content lock, acquire
> the heavyweight lock on the target bucket, and then reacquire the
> metapage content lock and check that the bucket mapping has not
> changed.  Normally it hasn't, and we're done.  But if by chance it
> has, we simply unlock the metapage, release the heavyweight lock we
> acquired previously, lock the new bucket, and loop around again.
> Even in the worst case we cannot loop very many times here, since we
> don't split the same bucket again until we've split all the other
> buckets, and 2^N gets big pretty fast.

Do we need the retry flag (applies to two places)?  If oldblkno is
still InvalidBlockNumber then it can't equal blkno.  I think the extra
variable might be clearer than the magic value, but since we already
have the magic value, do we want both a flag variable and a magic
value?

+		if (retry)
+		{
+			if (oldblkno == blkno)
+				break;
+			_hash_droplock(rel, oldblkno, HASH_SHARE);
+		}

In the README, the pseudocode probably needs to mention re-locking the
meta page in the loop.

Also, "page" is used to mean either the disk page or the shared buffer
currently holding that page, depending on context.  This is confusing.
Maybe we should clarify "Lock the meta page buffer".  Of course this
gripe precedes your patch and applies to other parts of the code as
well, but since we mingle LW locks (on buffers) and heavy locks (on
names of disk pages) it might make sense to be more meticulous here.

"exclusive-lock page 0 (assert the right to begin a split)" is no
longer true, nor is "release X-lock on page 0".

Also in the README, the section "To prevent deadlock we enforce these
coding rules:" would need to be changed, as those rules are being
changed.  But should we change them at all?

In _hash_expandtable, the claim "But since we are only trylocking here
it should be OK" doesn't completely satisfy me.  Even a conditional
heavy-lock acquire still takes a transient non-conditional LWLock on
the lock manager partition.  Unless there is a global rule that no one
can take a buffer content lock while holding a lock manager partition
lock, this seems dangerous.  Could this be redone to follow the
pattern of heavy locking with no content lock, then reacquiring the
buffer content lock to check that we locked the correct things?  I
don't know whether it would be better to loop, or just give up, if the
meta page changed underneath us.

Cheers,

Jeff
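For concreteness, the flag-free variant Jeff is describing would
presumably look something like this (an untested sketch, not code
from the patch):

    if (oldblkno != InvalidBlockNumber)
    {
        /* A previous iteration locked a bucket; is it still right? */
        if (oldblkno == blkno)
            break;
        _hash_droplock(rel, oldblkno, HASH_SHARE);
    }
    _hash_getlock(rel, blkno, HASH_SHARE);

This leans on blkno never coming back from BUCKET_TO_BLKNO as
InvalidBlockNumber, which is exactly the assumption Robert addresses
in his reply below.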
On Fri, Jun 15, 2012 at 1:58 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Do we need the retry flag (applies to two places)?  If oldblkno is
> still InvalidBlockNumber then it can't equal blkno.

I was a bit concerned that blkno = BUCKET_TO_BLKNO(metap, bucket)
might set blkno to InvalidBlockNumber, thus possibly messing up the
algorithm.  This way seemed better in terms of not requiring any
assumption in that area.

> In the README, the pseudocode probably needs to mention re-locking
> the meta page in the loop.

Good point.  Fixed.

> Also, "page" is used to mean either the disk page or the shared buffer
> currently holding that page, depending on context.  This is confusing.
> Maybe we should clarify "Lock the meta page buffer".  Of course this
> gripe precedes your patch and applies to other parts of the code as
> well, but since we mingle LW locks (on buffers) and heavy locks (on
> names of disk pages) it might make sense to be more meticulous here.

I attempted to improve this somewhat in the README, but I think it may
be more than I can do completely.

> "exclusive-lock page 0 (assert the right to begin a split)" is no
> longer true, nor is "release X-lock on page 0".

I think this is fixed.

> Also in the README, the section "To prevent deadlock we enforce these
> coding rules:" would need to be changed, as those rules are being
> changed.  But should we change them at all?
>
> In _hash_expandtable, the claim "But since we are only trylocking here
> it should be OK" doesn't completely satisfy me.  Even a conditional
> heavy-lock acquire still takes a transient non-conditional LWLock on
> the lock manager partition.  Unless there is a global rule that no one
> can take a buffer content lock while holding a lock manager partition
> lock, this seems dangerous.  Could this be redone to follow the
> pattern of heavy locking with no content lock, then reacquiring the
> buffer content lock to check that we locked the correct things?  I
> don't know whether it would be better to loop, or just give up, if the
> meta page changed underneath us.

Hmm.  That was actually a gloss I added on existing code to try to
convince myself that it was safe; I don't think that the changes I
made make that any more or less safe than it was before.  But the
question of whether or not it is safe is an interesting one.  I do
believe that your statement is true, though: lock manager locks - or
backend locks, for the fast-path - should be the last thing we lock.
In some cases we lock more than one lock manager partition lock, but
we never lock anything else afterwards, AFAIK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
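For anyone who wants to pursue Jeff's suggested restructuring of
_hash_expandtable later, its shape would be roughly the following (a
hypothetical sketch only; the committed patch keeps the existing
trylock arrangement):

    Bucket      old_bucket;
    BlockNumber start_oblkno;

    /* Compute, from the metapage, the bucket we intend to split. */
    old_bucket = (metap->hashm_maxbucket + 1) & metap->hashm_lowmask;
    start_oblkno = BUCKET_TO_BLKNO(metap, old_bucket);

    /*
     * Drop the metapage content lock (keeping the pin) before any
     * lock manager traffic, so that we never hold a buffer content
     * lock while touching a lock manager partition.
     */
    _hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);

    /* Conditionally take the heavyweight bucket lock. */
    if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
        return;                 /* split already in progress; give up */

    /* Re-lock the metapage and verify our choice is still current. */
    _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
    if (old_bucket != ((metap->hashm_maxbucket + 1) & metap->hashm_lowmask))
    {
        /* The metapage changed underneath us: give up (or loop). */
        _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
        _hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
        return;
    }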
On Mon, Jun 18, 2012 at 5:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Hmm.  That was actually a gloss I added on existing code to try to
> convince myself that it was safe; I don't think that the changes I
> made make that any more or less safe than it was before.

Right, sorry.  I thought there was some strength reduction going on
there as well.

Thanks for the various explanations; they address my concerns.  I see
that v2 applies over v1.

I've verified the performance improvements using 8 cores with my
proposed pgbench -P benchmark, at a scale that fits in shared_buffers.
It brings hash indexes most of the way, but not quite, up to btree
performance.

I've marked this as ready for committer.

Cheers,

Jeff
On Mon, Jun 25, 2012 at 11:05 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, Jun 18, 2012 at 5:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> Hmm.  That was actually a gloss I added on existing code to try to
>> convince myself that it was safe; I don't think that the changes I
>> made make that any more or less safe than it was before.
>
> Right, sorry.  I thought there was some strength reduction going on
> there as well.
>
> Thanks for the various explanations; they address my concerns.  I see
> that v2 applies over v1.
>
> I've verified the performance improvements using 8 cores with my
> proposed pgbench -P benchmark, at a scale that fits in shared_buffers.
> It brings hash indexes most of the way, but not quite, up to btree
> performance.
>
> I've marked this as ready for committer.

Thanks for the review; committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company