Re: Hash Indexes - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Hash Indexes
Msg-id CAA4eK1+VLQzBXyrYQpL3=GCQW39vCfxmWgmQB=1Y2GB4X1w64Q@mail.gmail.com
In response to Re: Hash Indexes  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Hash Indexes  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Tue, Jun 21, 2016 at 9:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jun 16, 2016 at 3:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> Incomplete splits can be completed either by vacuum or by insert, as both
> >> need an exclusive lock on the bucket.  If vacuum finds the split-in-progress
> >> flag on a bucket, it will complete the split operation; vacuum won't see this
> >> flag while a split is actually in progress on that bucket, because vacuum
> >> needs a cleanup lock and the split retains a pin till the end of the
> >> operation.  To make it work for the insert operation, one simple idea could
> >> be that if insert finds the split-in-progress flag, it releases its current
> >> exclusive lock on the bucket and tries to acquire a cleanup lock on the
> >> bucket; if it gets the cleanup lock, it can complete the split and then
> >> insert the tuple, else it retakes the exclusive lock on the bucket and just
> >> inserts the tuple.  The disadvantage of trying to complete the split in
> >> vacuum is that the split might require new pages, and allocating new pages
> >> at the time of vacuum is not advisable.  The disadvantage of doing it at the
> >> time of insert is that insert might skip it even when some scan on the
> >> bucket is going on, as the scan will also retain a pin on the bucket, but I
> >> think that is not a big deal.  The actual completion of the split can be
> >> done in two ways: (a) scan the new bucket and build a hash table with all of
> >> the TIDs you find there; when copying tuples from the old bucket, first
> >> probe the hash table, and if you find a match, just skip that tuple (idea
> >> suggested by Robert Haas offlist), or (b) delete all the tuples that are
> >> marked moved_by_split in the new bucket and perform the split operation from
> >> the beginning using the old bucket.
> >
> > I have completed the patch with respect to incomplete splits and delayed
> > cleanup of garbage tuples.  For incomplete splits, I have used option
> > (a) as mentioned above.  An incomplete split is completed if an
> > insertion sees the split-in-progress flag in a bucket.
>
> It seems to me that there is a potential performance problem here.  If
> the split is still being performed, every insert will see the
> split-in-progress flag set.  The in-progress split retains only a pin
> on the primary bucket, so other backends could also get an exclusive
> lock, which is all they need for an insert.  It seems that under this
> algorithm they will now take the exclusive lock, release the exclusive
> lock, try to take a cleanup lock, fail, again take the exclusive lock.
> That seems like a lot of extra monkeying around.  Wouldn't it be
> better to take the exclusive lock and then afterwards check if the pin
> count is 1?  If so, even though we only intended to take an exclusive
> lock, it is actually a cleanup lock.  If not, we can simply proceed
> with the insertion.  This way you avoid unlocking and relocking the
> buffer repeatedly.
>

We can do it the way you are suggesting, but there is another thing we need to consider here.  As of now, the patch tries to finish the split if it finds the split-in-progress flag in either the old or the new bucket.  We need to lock both old and new buckets to finish the split, so it is quite possible that two different backends try to lock them in opposite orders, leading to a deadlock.  I think the correct way to handle this is to always try to lock the old bucket first and then the new bucket.  To achieve that, if an insertion into the new bucket finds the split-in-progress flag set, it needs to release its lock, acquire the lock first on the old bucket and ensure its pin count is 1, and then lock the new bucket again and ensure that its pin count is 1.  I have already maintained this order of locks in scans (old bucket first and then new bucket; refer to the changes in _hash_first()).  Alternatively, we can try to finish the splits only when someone tries to insert into the old bucket.

> > The second major thing this new version of the patch achieves is cleanup
> > of garbage tuples, i.e. the tuples that are left in the old bucket during
> > a split.  Currently (in HEAD), as part of a split operation, we clean the
> > tuples from the old bucket after moving them to the new bucket, as we hold
> > heavy-weight locks on both old and new buckets till the end of the whole
> > split operation.  In the new design, we need to take a cleanup lock on the
> > old bucket and an exclusive lock on the new bucket to perform the split
> > operation, and we don't retain those locks till the end (we release the
> > lock as we move on to overflow buckets).  Now, to clean up the tuples we
> > need a cleanup lock on a bucket, which we might not have at split-end.  So
> > I chose to perform the cleanup of garbage tuples during vacuum and when a
> > re-split of the bucket happens, as during both those operations we do hold
> > a cleanup lock.  We can extend the cleanup of garbage to other operations
> > as well if required.
>
> I think it's OK for the squeeze phase to be deferred until vacuum or a
> subsequent split, but simply removing dead tuples seems like it should
> be done earlier if possible.

Yes, probably we can do it at the time of insertion into a bucket, provided there is no concurrent scan holding a pin on it.


--

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
