Thread: patch: avoid heavyweight locking on hash metapage

patch: avoid heavyweight locking on hash metapage

From
Robert Haas
Date:
I developed the attached patch to avoid taking a heavyweight lock on
the metapage of a hash index.  Instead, an exclusive buffer content
lock is viewed as sufficient permission to modify the metapage, and a
shared buffer content lock is used when such modifications need to be
prevented.  For the most part this is a trivial change, because we
were already taking these locks: we were just taking the heavyweight
locks in addition.  The only sticking point is that, when we're
searching or inserting, we previously locked the bucket before
releasing the heavyweight metapage lock, which is unworkable when
holding only a buffer content lock because (1) we might deadlock and
(2) buffer content locks can't be held for long periods of time even
when there's no deadlock risk.  To fix this, I implemented a simple
loop-and-retry system: we release the metapage content lock, acquire
the heavyweight lock on the target bucket, and then reacquire the
metapage content lock and check that the bucket mapping has not
changed.  Normally it hasn't, and we're done.  But if by chance it
has, we simply unlock the metapage, release the heavyweight lock we
acquired previously, lock the new bucket, and loop around again.  Even
in the worst case we cannot loop very many times here, since we don't
split the same bucket again until we've split all the other buckets,
and 2^N gets big pretty fast.
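
In rough outline, the retry loop looks something like this (a simplified
sketch built from the existing hash AM helpers; the attached patch is the
authoritative version):

    Buffer      metabuf;
    HashMetaPage metap;
    Bucket      bucket;
    BlockNumber blkno;
    BlockNumber oldblkno = InvalidBlockNumber;
    bool        retry = false;
    uint32      hashkey;        /* already computed from the index key */

    /* Pin the metapage and take a shared content lock on it. */
    metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
    metap = HashPageGetMeta(BufferGetPage(metabuf));

    for (;;)
    {
        /* Map the hash key to its bucket under the current metapage state. */
        bucket = _hash_hashkey2bucket(hashkey,
                                      metap->hashm_maxbucket,
                                      metap->hashm_highmask,
                                      metap->hashm_lowmask);
        blkno = BUCKET_TO_BLKNO(metap, bucket);

        /* Release the metapage content lock before heavyweight locking. */
        _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);

        /* If we already hold a lock on the (still-correct) bucket, done. */
        if (retry)
        {
            if (oldblkno == blkno)
                break;
            _hash_droplock(rel, oldblkno, HASH_SHARE);
        }

        /* Lock the target bucket, then re-lock the metapage and recheck. */
        _hash_getlock(rel, blkno, HASH_SHARE);
        _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
        oldblkno = blkno;
        retry = true;
    }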

I tested the effect of this by setting up a series of 5-minute
read-only pgbench runs at scale factor 300 with 8GB of shared buffers
on the IBM POWER7 machine.  For these runs, I dropped the primary
key constraint on pgbench_accounts (aid) and created a hash index on
that column instead.  I ran each test three times and took the median
result.  Here are the results on unpatched master, at various client
counts:

m01 tps = 9004.070422 (including connections establishing)
m04 tps = 34838.126542 (including connections establishing)
m08 tps = 70584.356826 (including connections establishing)
m16 tps = 128726.248198 (including connections establishing)
m32 tps = 123639.248172 (including connections establishing)
m64 tps = 104650.296143 (including connections establishing)
m80 tps = 88412.736416 (including connections establishing)

And here are the results with the patch:

h01 tps = 9110.561413 (including connections establishing) [+1.2%]
h04 tps = 36012.787524 (including connections establishing) [+3.4%]
h08 tps = 72606.302993 (including connections establishing) [+2.9%]
h16 tps = 141938.762793 (including connections establishing) [+10%]
h32 tps = 205325.232316 (including connections establishing) [+66%]
h64 tps = 274156.881975 (including connections establishing) [+162%]
h80 tps = 291224.012066 (including connections establishing) [+229%]

Obviously, even with this change, there's a lot not to like about hash
indexes: they still won't be crash-safe, and they still won't perform
as well under high concurrency as btree indexes.  But neither of those
problems seems like a good reason not to fix this problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: patch: avoid heavyweight locking on hash metapage

From
"Dickson S. Guedes"
Date:
2012/5/30 Robert Haas <robertmhaas@gmail.com>:
> I tested the effect of this by setting up a series of 5-minute
> read-only pgbench runs at scale factor 300 with 8GB of shared buffers
> on the IBM POWER7 machine.

I know it doesn't matter, but out of curiosity, what OS did you use?

best regards,
-- 
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br


Re: patch: avoid heavyweight locking on hash metapage

From
Robert Haas
Date:
On Thu, May 31, 2012 at 7:07 AM, Dickson S. Guedes <listas@guedesoft.net> wrote:
> 2012/5/30 Robert Haas <robertmhaas@gmail.com>:
>> I tested the effect of this by setting up a series of 5-minute
>> read-only pgbench runs at scale factor 300 with 8GB of shared buffers
>> on the IBM POWER7 machine.
>
> I know it doesn't matter, but out of curiosity, what OS did you use?

Fedora 16, Linux 3.2.6-3.fc16.ppc64

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: patch: avoid heavyweight locking on hash metapage

From
Jeff Janes
Date:
On Wed, May 30, 2012 at 3:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I developed the attached patch to avoid taking a heavyweight lock on
> the metapage of a hash index.  Instead, an exclusive buffer content
> lock is viewed as sufficient permission to modify the metapage, and a
> shared buffer content lock is used when such modifications need to be
> prevented.  For the most part this is a trivial change, because we
> were already taking these locks: we were just taking the heavyweight
> locks in addition.  The only sticking point is that, when we're
> searching or inserting, we previously locked the bucket before
> releasing the heavyweight metapage lock, which is unworkable when
> holding only a buffer content lock because (1) we might deadlock and
> (2) buffer content locks can't be held for long periods of time even
> when there's no deadlock risk.  To fix this, I implemented a simple
> loop-and-retry system: we release the metapage content lock, acquire
> the heavyweight lock on the target bucket, and then reacquire the
> metapage content lock and check that the bucket mapping has not
> changed.   Normally it hasn't, and we're done.  But if by chance it
> has, we simply unlock the metapage, release the heavyweight lock we
> acquired previously, lock the new bucket, and loop around again.  Even
> in the worst case we cannot loop very many times here, since we don't
> split the same bucket again until we've split all the other buckets,
> and 2^N gets big pretty fast.

Do we need the retry flag (applies to two places)?  If oldblkno is
still InvalidBlockNumber then it can't equal blkno.
I think the extra variable might be clearer than the magic value, but
we already have the magic value so do we want to have both a flag
variable and a magic value?

+       if (retry)
+       {
+           if (oldblkno == blkno)
+               break;
+           _hash_droplock(rel, oldblkno, HASH_SHARE);
+       }
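
For example (untested sketch), I was thinking the check could simply be:

    if (oldblkno != InvalidBlockNumber)
    {
        if (oldblkno == blkno)
            break;
        _hash_droplock(rel, oldblkno, HASH_SHARE);
    }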


In the README, the pseudocode probably needs to mention re-locking
the meta page in the loop.

Also, "page" is used to mean either the disk page or the shared buffer
currently holding that page, depending on context.  This is confusing.Maybe we should clarify "Lock the meta page
buffer". Of course this 
gripe precedes your patch and applies to other parts of the code as
well, but since we mingle LW locks (on buffers) and heavy locks (on
names of disk pages) it might make sense to be more meticulous here.

"exclusive-lock page 0 (assert the right to begin a split)" is no
longer true, nor is "release X-lock on page 0"


Also in the README, section "To prevent deadlock we enforce these
coding rules:" would need to be changed as those rules are being
changed.  But, should we change them at all?

In _hash_expandtable, the claim "But since we are only trylocking here
it should be OK" doesn't completely satisfy me.  Even a conditional
heavy-lock acquire still takes a transient non-conditional LW Lock on
the lock manager partition.  Unless there is a global rule that no one
can take a buffer content lock while holding a lock manager partition
lock, this seems dangerous.  Could this be redone to follow the
pattern of heavy locking with no content lock, then reacquiring the
buffer content lock to check to make sure we locked the correct
things?  I don't know if it would be better to loop, or just give up,
if the meta page changed underneath us.
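
Roughly the following is what I'm imagining (just a sketch, not patch
text, and glossing over how old_bucket and start_oblkno get computed
earlier in the function):

    /* Drop the metapage content lock before taking any heavyweight lock. */
    _hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);

    /* Now it is safe to wait for the bucket lock unconditionally. */
    _hash_getlock(rel, start_oblkno, HASH_EXCLUSIVE);

    /* Re-lock the metapage and verify we still want to split that bucket. */
    _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
    if (old_bucket != (Bucket) ((metap->hashm_maxbucket + 1) & metap->hashm_lowmask))
    {
        /* A concurrent split got there first: back out (or loop and retry). */
        _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
        goto fail;
    }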

Cheers,

Jeff


Re: patch: avoid heavyweight locking on hash metapage

From
Robert Haas
Date:
On Fri, Jun 15, 2012 at 1:58 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Do we need the retry flag (applies to two places)?  If oldblkno is
> still InvalidBlockNumber then it can't equal blkno.

I was a bit concerned that blkno = BUCKET_TO_BLKNO(metap, bucket)
might set blkno to InvalidBlockNumber, thus possibly messing up the
algorithm.  This way seemed better in terms of not requiring any
assumption in that area.

> In the README, the pseudocode probably needs to mention re-locking
> the meta page in the loop.

Good point.  Fixed.

> Also, "page" is used to mean either the disk page or the shared buffer
> currently holding that page, depending on context.  This is confusing.
>  Maybe we should clarify "Lock the meta page buffer".  Of course this
> gripe precedes your patch and applies to other parts of the code as
> well, but since we mingle LW locks (on buffers) and heavy locks (on
> names of disk pages) it might make sense to be more meticulous here.

I attempted to improve this somewhat in the README, but I think it may
be more than I can do completely.

> "exclusive-lock page 0 (assert the right to begin a split)" is no
> longer true, nor is "release X-lock on page 0"

I think this is fixed.

> Also in the README, section "To prevent deadlock we enforce these
> coding rules:" would need to be changed as those rules are being
> changed.  But, should we change them at all?
>
> In _hash_expandtable, the claim "But since we are only trylocking here
> it should be OK" doesn't completely satisfy me.  Even a conditional
> heavy-lock acquire still takes a transient non-conditional LW Lock on
> the lock manager partition.  Unless there is a global rule that no one
> can take a buffer content lock while holding a lock manager partition
> lock, this seems dangerous.  Could this be redone to follow the
> pattern of heavy locking with no content lock, then reacquiring the
> buffer content lock to check to make sure we locked the correct
> things?  I don't know if it would be better to loop, or just give up,
> if the meta page changed underneath us.

Hmm.  That was actually a gloss I added on existing code to try to
convince myself that it was safe; I don't think that the changes I
made make that any more or less safe than it was before.  But the
question of whether or not it is safe is an interesting one.  I do
believe that your statement is true, though: lock manager locks - or
backend locks, for the fast-path - should be the last thing we lock.
In some cases we lock more than one lock manager partition lock, but
we never lock anything else afterwards, AFAIK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: patch: avoid heavyweight locking on hash metapage

From
Jeff Janes
Date:
On Mon, Jun 18, 2012 at 5:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Hmm.  That was actually a gloss I added on existing code to try to
> convince myself that it was safe; I don't think that the changes I
> made make that any more or less safe than it was before.

Right, sorry.  I thought there was some strength reduction going on
there as well.

Thanks for the various explanations, they address my concerns.  I see
that v2 applies over v1.

I've verified performance improvements using 8 cores with my proposed
pgbench -P benchmark, with a scale that fits in shared_buffers.
It brings it most of the way, but not quite, up to the btree performance.


I've marked this as ready for committer.

Cheers,

Jeff


Re: patch: avoid heavyweight locking on hash metapage

From
Robert Haas
Date:
On Mon, Jun 25, 2012 at 11:05 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, Jun 18, 2012 at 5:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> Hmm.  That was actually a gloss I added on existing code to try to
>> convince myself that it was safe; I don't think that the changes I
>> made make that any more or less safe than it was before.
>
> Right, sorry.  I thought there was some strength reduction going on
> there as well.
>
> Thanks for the various explanations, they address my concerns.  I see
> that v2 applies over v1.
>
> I've verified performance improvements using 8 cores with my proposed
> pgbench -P benchmark, with a scale that fits in shared_buffers.
> It brings it most of the way, but not quite, up to the btree performance.
>
>
> I've marked this as ready for committer.

Thanks for the review; committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company