Re: Hash Indexes - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Hash Indexes |
Date | |
Msg-id | CAA4eK1JiqkEqqShc4B3SAFm_st6MN91R78DZq3=VkQCw-4=njw@mail.gmail.com Whole thread Raw |
In response to | Re: Hash Indexes (Robert Haas <robertmhaas@gmail.com>) |
List | pgsql-hackers |
On Thu, Nov 10, 2016 at 2:57 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > Some more review: > > The API contract of _hash_finish_split seems a bit unfortunate. The > caller is supposed to have obtained a cleanup lock on both the old and > new buffers, but the first thing it does is walk the entire new bucket > chain, completely ignoring the old one. That means holding a cleanup > lock on the old buffer across an unbounded number of I/O operations -- > which also means that you can't interrupt the query by pressing ^C, > because an LWLock (on the old buffer) is held. > I see the problem you are talking about, but it was done to ensure locking order, old bucket first and then new bucket, else there could be a deadlock risk. However, I think we can avoid holding the cleanup lock on old bucket till we scan the new bucket to form a hash table of TIDs. > Moreover, the > requirement to hold a lock on the new buffer isn't convenient for > either caller; they both have to go do it, so why not move it into the > function? > Yeah, we can move the locking of new bucket entirely into new function. > Perhaps the function should be changed so that it > guarantees that a pin is held on the primary page of the existing > bucket, but no locks are held. > Okay, so we can change the locking order as follows: a. ensure a cleanup lock on old bucket and check if the bucket (old) has pending split. b. if there is a pending split, release the lock on old bucket, but not pin. below steps will be performed by _hash_finish_split(): c. acquire the read content lock on new bucket and form the hash table of TIDs and in the process of forming hash table, we need to traverse whole bucket chain. While traversing bucket chain, release the lock on previous bucket (both lock and pin if not a primary bucket page). d. After the hash table is formed, acquire cleanup lock on old and new buckets conditionaly; if we are not able to get cleanup lock on either, then just return from there. e. Perform split operation. f. release the lock and pin on new bucket g. release the lock on old bucket. We don't want to release the pin on old bucket as the caller has ensure it before passing it to _hash_finish_split(), so releasing pin should be resposibility of caller. Now, both the callers need to ensure that they restart the operation from begining as after we release lock on old bucket, somebody might have split the bucket. Does the above change in locking strategy sounds okay? > Where _hash_finish_split does LockBufferForCleanup(bucket_nbuf), > should it instead be trying to get the lock conditionally and > returning immediately if it doesn't get the lock? Seems like a good > idea... > Yeah, we can take a cleanup lock conditionally, but it would waste the effort of forming hash table, if we don't get cleanup lock immediately. Considering incomplete splits to be a rare operation, may be this the wasted effort is okay, but I am not sure. Don't you think we should avoid that effort? > * We're at the end of the old bucket chain, so we're done partitioning > * the tuples. Mark the old and new buckets to indicate split is > * finished. > * > * To avoid deadlocks due to locking order of buckets, first lock the old > * bucket and then the new bucket. > > These comments have drifted too far from the code to which they refer. > The first part is basically making the same point as the > slightly-later comment /* indicate that split is finished */. > I think we can remove the second comment /* indicate that split is finished */. Apart from that, I think the above comment you have quoted seems to be inline with current code. At that point, we have finished partitioning the tuples, so I don't understand what makes you think that it is drifted from the code? Is it because of second part of comment (To avoid deadlocks ...)? If so, I think we can move it to few lines down where we actually performs the locking on old and new bucket? > The use of _hash_relbuf, _hash_wrtbuf, and _hash_chgbufaccess is > coming to seem like a horrible idea to me. That's not your fault - it > was like this before - but maybe in a followup patch we should > consider ripping all of that out and just calling MarkBufferDirty(), > ReleaseBuffer(), LockBuffer(), UnlockBuffer(), and/or > UnlockReleaseBuffer() as appropriate. As far as I can see, the > current style is just obfuscating the code. > Okay, we can do some study and try to change it in the way you are suggesting. It seems partially this has been derived from btree code where we have function _bt_relbuf(). I am sure that we don't need _hash_wrtbuf after WAL patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
pgsql-hackers by date: