Thread: hash index concurrency

hash index concurrency

From

Robert Haas

Date:

29 May 2012, 21:19:47

I ran a SELECT-only pgbench test today on the IBM POWER7 box with 64
concurrent clients and got roughly 305,000 tps.  Then, I created a
hash index on pgbench_accounts (aid), dropped the primary key, and
reran the test.  I got roughly 104,000 tps. 'perf -g -e cs' suggested
lock contention in _hash_first(), so I whacked out the calls to
_hash_getlock(rel, 0, HASH_SHARE) and _hash_droplock(rel, 0,
HASH_SHARE).  With that change, I got roughly 270,000 TPS.  Taking a
little further, I then changed the definition of USELOCKING() to 0,
effectively removing all the heavyweight page locks.  With that
change, I got 324,000 tps.  Also, with this change, the CPU is fully
saturated - about 77% user time, 23% system time.

I don't immediately have a proposal to deal with this, but it seems
that if we want good hash index performance under high concurrency, we
need to work a bit harder.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: hash index concurrency

From

Jeff Janes

Date:

30 May 2012, 00:21:19

On Tue, May 29, 2012 at 5:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I ran a SELECT-only pgbench test today on the IBM POWER7 box with 64
> concurrent clients and got roughly 305,000 tps.  Then, I created a
> hash index on pgbench_accounts (aid), dropped the primary key, and
> reran the test.  I got roughly 104,000 tps. 'perf -g -e cs' suggested
> lock contention in _hash_first(), so I whacked out the calls to
> _hash_getlock(rel, 0, HASH_SHARE) and _hash_droplock(rel, 0,
> HASH_SHARE).  With that change, I got roughly 270,000 TPS.  Taking a
> little further, I then changed the definition of USELOCKING() to 0,
> effectively removing all the heavyweight page locks.  With that
> change, I got 324,000 tps.  Also, with this change, the CPU is fully
> saturated - about 77% user time, 23% system time.

Did the data fit in shared_buffers, or only in RAM?

>
> I don't immediately have a proposal to deal with this, but it seems
> that if we want good hash index performance under high concurrency, we
> need to work a bit harder.

This was a hobby horse of mine a couple of years ago, but I never got
much traction.  The main question I have is, what do we even want hash
indexes to be?  NBTree is very good, has been extensively optimized,
and extensively tested.  If there is a niche left for hash indexes,
what is it?  Is it just very large keys which don't do well in BTrees,
or something else?

If we really want to embrace a niche other than very large keys, I
would say rip out the dynamic bucket splitting altogether.  Specify
the number of buckets when the index is built, and if you later change
your mind then use concurrent index to make a new larger one and then
drop the old one.  Not needing to deal with dynamic buffer splitting
would make both highly concurrent access and Write-Ahead Logging far
simpler to implement.  (It seems to me that solving either concurrency
or WAL is not going to get very far, they both need to be solved
together--or at least the design of the first to be implemented needs
to take into account the future implementation of the other).

Barring that:

1) on each main bucket page, record what the level of split for that
bucket is.  That way you don't need to interlock the heavy locks
between the meta page and the main bucket page.  Rather,  when you
leave the meta page you remember not just what bucket you are heading
for but also how far it has been split.  Then when you get there, if
the split level you find doesn't match what you remember, you have to
go back to the meta page and try again.  Indeed, this way you not only
don't need to heavy-lock the meta page, you don't even need to visit
it at all.  Just store the look-up table in the relcache, and then you
only need to reload it from the metapage when you realize it is out of
date.  But, you still need to heavy lock the bucket itself.  But
implementing this does require an on-disk change to the hash index
format.

2) Only support bitmap scans and not ordinary tid scans (the way gin
indexes already do).  Now you never need to pass execution back out of
the hash am while you are still in a bucket, so you don't need heavy
weight locks at all.  You would still need to interlock LWLocks as you
move down the chain of overflow pages, but that should be much more
concurrent.

Cheers,

Jeff

Re: hash index concurrency

From

Robert Haas

Date:

30 May 2012, 00:54:35

On Tue, May 29, 2012 at 11:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, May 29, 2012 at 5:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I ran a SELECT-only pgbench test today on the IBM POWER7 box with 64
>> concurrent clients and got roughly 305,000 tps.  Then, I created a
>> hash index on pgbench_accounts (aid), dropped the primary key, and
>> reran the test.  I got roughly 104,000 tps. 'perf -g -e cs' suggested
>> lock contention in _hash_first(), so I whacked out the calls to
>> _hash_getlock(rel, 0, HASH_SHARE) and _hash_droplock(rel, 0,
>> HASH_SHARE).  With that change, I got roughly 270,000 TPS.  Taking a
>> little further, I then changed the definition of USELOCKING() to 0,
>> effectively removing all the heavyweight page locks.  With that
>> change, I got 324,000 tps.  Also, with this change, the CPU is fully
>> saturated - about 77% user time, 23% system time.
>
> Did the data fit in shared_buffers, or only in RAM?

shared_buffers.

>> I don't immediately have a proposal to deal with this, but it seems
>> that if we want good hash index performance under high concurrency, we
>> need to work a bit harder.
>
> This was a hobby horse of mine a couple of years ago, but I never got
> much traction.  The main question I have is, what do we even want hash
> indexes to be?  NBTree is very good, has been extensively optimized,
> and extensively tested.  If there is a niche left for hash indexes,
> what is it?  Is it just very large keys which don't do well in BTrees,
> or something else?

Well, TBH, I was hoping they'd be faster than btree.

> If we really want to embrace a niche other than very large keys, I
> would say rip out the dynamic bucket splitting altogether.

Yeah, maybe, although that's got some disadvantages.

> Barring that:
>
> 1) on each main bucket page, record what the level of split for that
> bucket is.  That way you don't need to interlock the heavy locks
> between the meta page and the main bucket page.  Rather,  when you
> leave the meta page you remember not just what bucket you are heading
> for but also how far it has been split.  Then when you get there, if
> the split level you find doesn't match what you remember, you have to
> go back to the meta page and try again.  Indeed, this way you not only
> don't need to heavy-lock the meta page, you don't even need to visit
> it at all.  Just store the look-up table in the relcache, and then you
> only need to reload it from the metapage when you realize it is out of
> date.  But, you still need to heavy lock the bucket itself.  But
> implementing this does require an on-disk change to the hash index
> format.

This occurred to me almost at once when I looked at the code, and I
think it might be a good idea, though I also think it will not solve
the main problem here.

> 2) Only support bitmap scans and not ordinary tid scans (the way gin
> indexes already do).  Now you never need to pass execution back out of
> the hash am while you are still in a bucket, so you don't need heavy
> weight locks at all.  You would still need to interlock LWLocks as you
> move down the chain of overflow pages, but that should be much more
> concurrent.

-1 on losing amgettuple.  I regret that we lost that for GIN and I
shall regret it more if we lose it anywhere else.

But... even without doing either of the above, how about trying to
eliminate the heavyweight locking around the metapage?  I think the
only reason we're using it there is for deadlock detection - we can't
unlock the metapage until we've got the bucket lock, and we can't hold
an lwlock on the metapage while acquiring a heavyweight lock for fear
of deadlock (though I'm slightly unclear on what the deadlock scenario
is).  Instead, suppose we take a shared buffer content lock on the
metapage, compute the target bucket, and release the buffer content
lock on the metapage.  Next, we acquire the heavyweight lock on the
bucket.   Then, we reacquire the buffer content lock on the metapage
and double-check that no split has occurred.  If it has, we call a
do-over; else, we release the lwlock and proceed.

This would actually be fewer lwlock acquisitions than we have now,
because losing the heavyweight acquire-and-release would save us two
lwlock acquire-and-release cycles, both in exclusive mode, and instead
we'd add the double-check code which would be only one lwlock
acquire-and-release cycle, and that in shared mode.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: hash index concurrency

From

Tom Lane

Date:

30 May 2012, 01:21:52

Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, May 29, 2012 at 11:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> 2) Only support bitmap scans and not ordinary tid scans (the way gin
>> indexes already do).

> -1 on losing amgettuple.  I regret that we lost that for GIN and I
> shall regret it more if we lose it anywhere else.

Not sure that's all that big a deal for hash.  IIRC the only reasons to
want it are for index-only scans (not possible anyway with hash) and
exclusion constraints (for which you might as well use a btree, or plain
index-supported uniqueness if hash had that).

> But... even without doing either of the above, how about trying to
> eliminate the heavyweight locking around the metapage?  I think the
> only reason we're using it there is for deadlock detection

I'm too lazy to go reread the README file, but my recollection is that
the potential deadlock is between different operations that're holding
locks on different buckets.  It seems possible that we could use an
LWLock for the metapage while using heavyweight locks for buckets;
though it's not clear how much that buys us.  But if you want to get rid
of heavyweight locks altogether, I think you have to do what Jeff said
and ditch amgettuple scans.  The issue fundamentally is that if we
suspend an indexscan and return control to the executor while still
holding a lock, we risk deadlock because the executor could start up
some other scan that will want some other bucket lock, and meanwhile
some other backend could try to get the same two bucket locks in the
other order.

I guess another possibility would be to try to manage intra-bucket scans
similarly to the way btree does, where you stop "between" pages and
invent arcane page-splitting rules to ensure no loss of consistency.
Then maybe you don't need any lock while suspended; though it's not at
all clear how bucket splits could be allowed in such an environment.
        regards, tom lane

Re: hash index concurrency

From

Simon Riggs

Date:

30 May 2012, 05:50:24

On 30 May 2012 04:54, Robert Haas <robertmhaas@gmail.com> wrote:

>> This was a hobby horse of mine a couple of years ago, but I never got
>> much traction.  The main question I have is, what do we even want hash
>> indexes to be?  NBTree is very good, has been extensively optimized,
>> and extensively tested.  If there is a niche left for hash indexes,
>> what is it?  Is it just very large keys which don't do well in BTrees,
>> or something else?
>
> Well, TBH, I was hoping they'd be faster than btree.

They are faster than btree in terms of response time, just not as concurrent.

Right now if you have a table bigger than RAM with direct access then
hash indexes will be faster, but I agree that the use case is not
large enough to be worth spending the time to improve hash indexes.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: hash index concurrency

From

Merlin Moncure

Date:

30 May 2012, 10:34:06

On Wed, May 30, 2012 at 3:49 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 30 May 2012 04:54, Robert Haas <robertmhaas@gmail.com> wrote:
>
>>> This was a hobby horse of mine a couple of years ago, but I never got
>>> much traction.  The main question I have is, what do we even want hash
>>> indexes to be?  NBTree is very good, has been extensively optimized,
>>> and extensively tested.  If there is a niche left for hash indexes,
>>> what is it?  Is it just very large keys which don't do well in BTrees,
>>> or something else?
>>
>> Well, TBH, I was hoping they'd be faster than btree.
>
> They are faster than btree in terms of response time, just not as concurrent.
>
> Right now if you have a table bigger than RAM with direct access then
> hash indexes will be faster, but I agree that the use case is not
> large enough to be worth spending the time to improve hash indexes.

Yeah -- for i/o bound lookups hash indexes can be around 2x faster for
large tables than a btree-wrapping-digest index.  I confirmed this a
few months back when openly conjecturing if the hash index code should
be in fact replaced with a btree wrapper.  I've never used a hash
index in a production database.

merlin

Re: hash index concurrency

From

David Fetter

Date:

31 May 2012, 12:35:08

On Wed, May 30, 2012 at 12:21:33AM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Tue, May 29, 2012 at 11:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> >> 2) Only support bitmap scans and not ordinary tid scans (the way gin
> >> indexes already do).
> 
> > -1 on losing amgettuple.  I regret that we lost that for GIN and I
> > shall regret it more if we lose it anywhere else.
> 
> Not sure that's all that big a deal for hash.  IIRC the only reasons to
> want it are for index-only scans (not possible anyway with hash) and
> exclusion constraints (for which you might as well use a btree, or plain
> index-supported uniqueness if hash had that).

It does via EXCLUDE constraints, so it could with what as far as I've
been able to tell would be some relatively small amount of coding.

dfetter@dfetter:5492=# CREATE TABLE foo(i TEXT, EXCLUDE USING HASH(i WITH =));
NOTICE:  CREATE TABLE / EXCLUDE will create implicit index "foo_i_excl" for table "foo"
CREATE TABLE
dfetter@dfetter:5492=# insert into foo VALUES (1),(1);
ERROR:  conflicting key value violates exclusion constraint "foo_i_excl"
DETAIL:  Key (i)=(1) conflicts with existing key (i)=(1).

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate