Home > mailing lists

Re: Hash Indexes - Mailing list pgsql-hackers

From	Amit Kapila
Subject	Re: Hash Indexes
Date	November 10, 2016 04:04:17
Msg-id	CAA4eK1JiqkEqqShc4B3SAFm_st6MN91R78DZq3=VkQCw-4=njw@mail.gmail.com Whole thread
In response to	Re: Hash Indexes (Robert Haas <robertmhaas@gmail.com>)
List	pgsql-hackers

Tree view

On Thu, Nov 10, 2016 at 2:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Some more review:
>
> The API contract of _hash_finish_split seems a bit unfortunate.   The
> caller is supposed to have obtained a cleanup lock on both the old and
> new buffers, but the first thing it does is walk the entire new bucket
> chain, completely ignoring the old one.  That means holding a cleanup
> lock on the old buffer across an unbounded number of I/O operations --
> which also means that you can't interrupt the query by pressing ^C,
> because an LWLock (on the old buffer) is held.
>

I see the problem you are talking about, but it was done to ensure
locking order, old bucket first and then new bucket, else there could
be a deadlock risk.  However, I think we can avoid holding the cleanup
lock on old bucket till we scan the new bucket to form a hash table of
TIDs.

>  Moreover, the
> requirement to hold a lock on the new buffer isn't convenient for
> either caller; they both have to go do it, so why not move it into the
> function?
>

Yeah, we can move the locking of new bucket entirely into new function.

>  Perhaps the function should be changed so that it
> guarantees that a pin is held on the primary page of the existing
> bucket, but no locks are held.
>

Okay, so we can change the locking order as follows:
a. ensure a cleanup lock on old bucket and check if the bucket (old)
has pending split.
b. if there is a pending split, release the lock on old bucket, but not pin.

below steps will be performed by _hash_finish_split():

c. acquire the read content lock on new bucket and form the hash table
of TIDs and in the process of forming hash table, we need to traverse
whole bucket chain.  While traversing bucket chain, release the lock
on previous bucket (both lock and pin if not a primary bucket page).
d. After the hash table is formed, acquire cleanup lock on old and new
buckets conditionaly; if we are not able to get cleanup lock on
either, then just return from there.
e. Perform split operation.
f. release the lock and pin on new bucket
g. release the lock on old bucket.  We don't want to release the pin
on old bucket as the caller has ensure it before passing it to
_hash_finish_split(), so releasing pin should be resposibility of
caller.

Now, both the callers need to ensure that they restart the operation
from begining as after we release lock on old bucket, somebody might
have split the bucket.

Does the above change in locking strategy sounds okay?

> Where _hash_finish_split does LockBufferForCleanup(bucket_nbuf),
> should it instead be trying to get the lock conditionally and
> returning immediately if it doesn't get the lock?  Seems like a good
> idea...
>

Yeah, we can take a cleanup lock conditionally, but it would waste the
effort of forming hash table, if we don't get cleanup lock
immediately.  Considering incomplete splits to be a rare operation,
may be this the wasted effort is okay, but I am not sure.  Don't you
think we should avoid that effort?

>      * We're at the end of the old bucket chain, so we're done partitioning
>      * the tuples.  Mark the old and new buckets to indicate split is
>      * finished.
>      *
>      * To avoid deadlocks due to locking order of buckets, first lock the old
>      * bucket and then the new bucket.
>
> These comments have drifted too far from the code to which they refer.
> The first part is basically making the same point as the
> slightly-later comment /* indicate that split is finished */.
>

I think we can remove the second comment /* indicate that split is
finished */.  Apart from that, I think the above comment you have
quoted seems to be inline with current code.  At that point, we have
finished partitioning the tuples, so I don't understand what makes you
think that it is drifted from the code? Is it because of second part
of comment (To avoid deadlocks ...)?  If so, I think we can move it to
few lines down where we actually performs the locking on old and new
bucket?

> The use of _hash_relbuf, _hash_wrtbuf, and _hash_chgbufaccess is
> coming to seem like a horrible idea to me.  That's not your fault - it
> was like this before - but maybe in a followup patch we should
> consider ripping all of that out and just calling MarkBufferDirty(),
> ReleaseBuffer(), LockBuffer(), UnlockBuffer(), and/or
> UnlockReleaseBuffer() as appropriate.  As far as I can see, the
> current style is just obfuscating the code.
>

Okay, we can do some study and try to change it in the way you are
suggesting.  It seems partially this has been derived from btree code
where we have function _bt_relbuf().  I am sure that we don't need
_hash_wrtbuf after WAL patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

pgsql-hackers by date:

From: Peter Geoghegan
Date: 10 November 2016, 03:19:05
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Robert Haas
Date: 10 November 2016, 04:17:06
Subject: Re: Re: [COMMITTERS] pgsql: pgbench: Allow the transaction log file prefix to be changed.

Re: Hash Indexes - Mailing list pgsql-hackers

Previous

Next