Thread: hash index concurrency
I ran a SELECT-only pgbench test today on the IBM POWER7 box with 64 concurrent clients and got roughly 305,000 tps. Then, I created a hash index on pgbench_accounts (aid), dropped the primary key, and reran the test. I got roughly 104,000 tps. 'perf -g -e cs' suggested lock contention in _hash_first(), so I whacked out the calls to _hash_getlock(rel, 0, HASH_SHARE) and _hash_droplock(rel, 0, HASH_SHARE). With that change, I got roughly 270,000 TPS. Taking a little further, I then changed the definition of USELOCKING() to 0, effectively removing all the heavyweight page locks. With that change, I got 324,000 tps. Also, with this change, the CPU is fully saturated - about 77% user time, 23% system time. I don't immediately have a proposal to deal with this, but it seems that if we want good hash index performance under high concurrency, we need to work a bit harder. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, May 29, 2012 at 5:19 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I ran a SELECT-only pgbench test today on the IBM POWER7 box with 64 > concurrent clients and got roughly 305,000 tps. Then, I created a > hash index on pgbench_accounts (aid), dropped the primary key, and > reran the test. I got roughly 104,000 tps. 'perf -g -e cs' suggested > lock contention in _hash_first(), so I whacked out the calls to > _hash_getlock(rel, 0, HASH_SHARE) and _hash_droplock(rel, 0, > HASH_SHARE). With that change, I got roughly 270,000 TPS. Taking a > little further, I then changed the definition of USELOCKING() to 0, > effectively removing all the heavyweight page locks. With that > change, I got 324,000 tps. Also, with this change, the CPU is fully > saturated - about 77% user time, 23% system time. Did the data fit in shared_buffers, or only in RAM? > > I don't immediately have a proposal to deal with this, but it seems > that if we want good hash index performance under high concurrency, we > need to work a bit harder. This was a hobby horse of mine a couple of years ago, but I never got much traction. The main question I have is, what do we even want hash indexes to be? NBTree is very good, has been extensively optimized, and extensively tested. If there is a niche left for hash indexes, what is it? Is it just very large keys which don't do well in BTrees, or something else? If we really want to embrace a niche other than very large keys, I would say rip out the dynamic bucket splitting altogether. Specify the number of buckets when the index is built, and if you later change your mind then use concurrent index to make a new larger one and then drop the old one. Not needing to deal with dynamic buffer splitting would make both highly concurrent access and Write-Ahead Logging far simpler to implement. (It seems to me that solving either concurrency or WAL is not going to get very far, they both need to be solved together--or at least the design of the first to be implemented needs to take into account the future implementation of the other). Barring that: 1) on each main bucket page, record what the level of split for that bucket is. That way you don't need to interlock the heavy locks between the meta page and the main bucket page. Rather, when you leave the meta page you remember not just what bucket you are heading for but also how far it has been split. Then when you get there, if the split level you find doesn't match what you remember, you have to go back to the meta page and try again. Indeed, this way you not only don't need to heavy-lock the meta page, you don't even need to visit it at all. Just store the look-up table in the relcache, and then you only need to reload it from the metapage when you realize it is out of date. But, you still need to heavy lock the bucket itself. But implementing this does require an on-disk change to the hash index format. 2) Only support bitmap scans and not ordinary tid scans (the way gin indexes already do). Now you never need to pass execution back out of the hash am while you are still in a bucket, so you don't need heavy weight locks at all. You would still need to interlock LWLocks as you move down the chain of overflow pages, but that should be much more concurrent. Cheers, Jeff
On Tue, May 29, 2012 at 11:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Tue, May 29, 2012 at 5:19 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I ran a SELECT-only pgbench test today on the IBM POWER7 box with 64 >> concurrent clients and got roughly 305,000 tps. Then, I created a >> hash index on pgbench_accounts (aid), dropped the primary key, and >> reran the test. I got roughly 104,000 tps. 'perf -g -e cs' suggested >> lock contention in _hash_first(), so I whacked out the calls to >> _hash_getlock(rel, 0, HASH_SHARE) and _hash_droplock(rel, 0, >> HASH_SHARE). With that change, I got roughly 270,000 TPS. Taking a >> little further, I then changed the definition of USELOCKING() to 0, >> effectively removing all the heavyweight page locks. With that >> change, I got 324,000 tps. Also, with this change, the CPU is fully >> saturated - about 77% user time, 23% system time. > > Did the data fit in shared_buffers, or only in RAM? shared_buffers. >> I don't immediately have a proposal to deal with this, but it seems >> that if we want good hash index performance under high concurrency, we >> need to work a bit harder. > > This was a hobby horse of mine a couple of years ago, but I never got > much traction. The main question I have is, what do we even want hash > indexes to be? NBTree is very good, has been extensively optimized, > and extensively tested. If there is a niche left for hash indexes, > what is it? Is it just very large keys which don't do well in BTrees, > or something else? Well, TBH, I was hoping they'd be faster than btree. > If we really want to embrace a niche other than very large keys, I > would say rip out the dynamic bucket splitting altogether. Yeah, maybe, although that's got some disadvantages. > Barring that: > > 1) on each main bucket page, record what the level of split for that > bucket is. That way you don't need to interlock the heavy locks > between the meta page and the main bucket page. Rather, when you > leave the meta page you remember not just what bucket you are heading > for but also how far it has been split. Then when you get there, if > the split level you find doesn't match what you remember, you have to > go back to the meta page and try again. Indeed, this way you not only > don't need to heavy-lock the meta page, you don't even need to visit > it at all. Just store the look-up table in the relcache, and then you > only need to reload it from the metapage when you realize it is out of > date. But, you still need to heavy lock the bucket itself. But > implementing this does require an on-disk change to the hash index > format. This occurred to me almost at once when I looked at the code, and I think it might be a good idea, though I also think it will not solve the main problem here. > 2) Only support bitmap scans and not ordinary tid scans (the way gin > indexes already do). Now you never need to pass execution back out of > the hash am while you are still in a bucket, so you don't need heavy > weight locks at all. You would still need to interlock LWLocks as you > move down the chain of overflow pages, but that should be much more > concurrent. -1 on losing amgettuple. I regret that we lost that for GIN and I shall regret it more if we lose it anywhere else. But... even without doing either of the above, how about trying to eliminate the heavyweight locking around the metapage? I think the only reason we're using it there is for deadlock detection - we can't unlock the metapage until we've got the bucket lock, and we can't hold an lwlock on the metapage while acquiring a heavyweight lock for fear of deadlock (though I'm slightly unclear on what the deadlock scenario is). Instead, suppose we take a shared buffer content lock on the metapage, compute the target bucket, and release the buffer content lock on the metapage. Next, we acquire the heavyweight lock on the bucket. Then, we reacquire the buffer content lock on the metapage and double-check that no split has occurred. If it has, we call a do-over; else, we release the lwlock and proceed. This would actually be fewer lwlock acquisitions than we have now, because losing the heavyweight acquire-and-release would save us two lwlock acquire-and-release cycles, both in exclusive mode, and instead we'd add the double-check code which would be only one lwlock acquire-and-release cycle, and that in shared mode. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, May 29, 2012 at 11:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> 2) Only support bitmap scans and not ordinary tid scans (the way gin >> indexes already do). > -1 on losing amgettuple. I regret that we lost that for GIN and I > shall regret it more if we lose it anywhere else. Not sure that's all that big a deal for hash. IIRC the only reasons to want it are for index-only scans (not possible anyway with hash) and exclusion constraints (for which you might as well use a btree, or plain index-supported uniqueness if hash had that). > But... even without doing either of the above, how about trying to > eliminate the heavyweight locking around the metapage? I think the > only reason we're using it there is for deadlock detection I'm too lazy to go reread the README file, but my recollection is that the potential deadlock is between different operations that're holding locks on different buckets. It seems possible that we could use an LWLock for the metapage while using heavyweight locks for buckets; though it's not clear how much that buys us. But if you want to get rid of heavyweight locks altogether, I think you have to do what Jeff said and ditch amgettuple scans. The issue fundamentally is that if we suspend an indexscan and return control to the executor while still holding a lock, we risk deadlock because the executor could start up some other scan that will want some other bucket lock, and meanwhile some other backend could try to get the same two bucket locks in the other order. I guess another possibility would be to try to manage intra-bucket scans similarly to the way btree does, where you stop "between" pages and invent arcane page-splitting rules to ensure no loss of consistency. Then maybe you don't need any lock while suspended; though it's not at all clear how bucket splits could be allowed in such an environment. regards, tom lane
On 30 May 2012 04:54, Robert Haas <robertmhaas@gmail.com> wrote: >> This was a hobby horse of mine a couple of years ago, but I never got >> much traction. The main question I have is, what do we even want hash >> indexes to be? NBTree is very good, has been extensively optimized, >> and extensively tested. If there is a niche left for hash indexes, >> what is it? Is it just very large keys which don't do well in BTrees, >> or something else? > > Well, TBH, I was hoping they'd be faster than btree. They are faster than btree in terms of response time, just not as concurrent. Right now if you have a table bigger than RAM with direct access then hash indexes will be faster, but I agree that the use case is not large enough to be worth spending the time to improve hash indexes. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, May 30, 2012 at 3:49 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 30 May 2012 04:54, Robert Haas <robertmhaas@gmail.com> wrote: > >>> This was a hobby horse of mine a couple of years ago, but I never got >>> much traction. The main question I have is, what do we even want hash >>> indexes to be? NBTree is very good, has been extensively optimized, >>> and extensively tested. If there is a niche left for hash indexes, >>> what is it? Is it just very large keys which don't do well in BTrees, >>> or something else? >> >> Well, TBH, I was hoping they'd be faster than btree. > > They are faster than btree in terms of response time, just not as concurrent. > > Right now if you have a table bigger than RAM with direct access then > hash indexes will be faster, but I agree that the use case is not > large enough to be worth spending the time to improve hash indexes. Yeah -- for i/o bound lookups hash indexes can be around 2x faster for large tables than a btree-wrapping-digest index. I confirmed this a few months back when openly conjecturing if the hash index code should be in fact replaced with a btree wrapper. I've never used a hash index in a production database. merlin
On Wed, May 30, 2012 at 12:21:33AM -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Tue, May 29, 2012 at 11:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > >> 2) Only support bitmap scans and not ordinary tid scans (the way gin > >> indexes already do). > > > -1 on losing amgettuple. I regret that we lost that for GIN and I > > shall regret it more if we lose it anywhere else. > > Not sure that's all that big a deal for hash. IIRC the only reasons to > want it are for index-only scans (not possible anyway with hash) and > exclusion constraints (for which you might as well use a btree, or plain > index-supported uniqueness if hash had that). It does via EXCLUDE constraints, so it could with what as far as I've been able to tell would be some relatively small amount of coding. dfetter@dfetter:5492=# CREATE TABLE foo(i TEXT, EXCLUDE USING HASH(i WITH =)); NOTICE: CREATE TABLE / EXCLUDE will create implicit index "foo_i_excl" for table "foo" CREATE TABLE dfetter@dfetter:5492=# insert into foo VALUES (1),(1); ERROR: conflicting key value violates exclusion constraint "foo_i_excl" DETAIL: Key (i)=(1) conflicts with existing key (i)=(1). Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate