Re: Next Steps with Hash Indexes - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Next Steps with Hash Indexes |
Date | |
Msg-id | CAA4eK1++CKz8r-zRBp+m6ygPPk=jiygHsHcdXpKGcvFe_tXpDA@mail.gmail.com Whole thread Raw |
In response to | Re: Next Steps with Hash Indexes (Simon Riggs <simon.riggs@enterprisedb.com>) |
Responses |
Re: Next Steps with Hash Indexes
|
List | pgsql-hackers |
On Tue, Jul 20, 2021 at 6:32 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > > On Tue, Jul 20, 2021 at 1:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Jul 15, 2021 at 10:11 PM Simon Riggs > > <simon.riggs@enterprisedb.com> wrote: > > > > I think the main thing to think about for uniqueness check during > > split (where we scan both the old and new buckets) was whether we need > > to lock both the old (bucket_being_split) and new > > (bucket_being_populated) buckets or just holding locks on one of them > > (the current bucket in which we are inserting) is sufficient? During a > > scan of the new bucket, we just retain pins on both the buckets (see > > comments in _hash_first()) but if we need to retain locks on both > > buckets then we need to do something different then we do it for > > scans. But, I think it is sufficient to just hold an exclusive lock on > > the primary bucket page in the bucket we are trying to insert and pin > > on the other bucket (old bucket as we do for scans). Because no > > concurrent inserter should try to insert into the old bucket and new > > bucket the same tuple as before starting the split we always update > > the metapage for hashm_lowmask and hashm_highmask which decides the > > routing of the tuples. > > During an incomplete split, we need to scan both old and new. So > during insert, we need to scan both old and new, while holding > exclusive locks on both bucket pages. > It will surely work if we have an exclusive lock on both the buckets (old and new) in this case but I think it is better if we can avoid exclusive locking the old bucket (bucket_being_split) unless it is really required. We need an exclusive lock on the primary bucket where we are trying to insert to avoid any other inserter with the same key but I think we don't need it for the old bucket because no inserter with the same key can try to insert the key in an old bucket which would belong to the new bucket. > I've spent a few days looking at > the split behavior and this seems a complete approach. I'm working on > a patch now, still at hacking stage. > > (BTW, my opinion of the split mechanism has now changed from bad to > very good. It works really well for unique data, but can be completely > ineffective for badly skewed data). > > > Now, I think here the other problem we need to think about is that for > > the hash index after finding the tuple in the index, we need to always > > recheck in the heap as we don't store the actual value in the hash > > index. For that in the scan, we get the tuple(s) from the index > > (release locks) and then match qual after fetching tuple from the > > heap. But we can't do that for uniqueness check because if we release > > the locks on the index bucket page then another inserter could come > > before we match it in heap. I think we need some mechanism that after > > fetching TID from the index, we recheck the actual value in heap > > before releasing the lock on the index bucket page. > > I don't think btree does that, so I'm not sure we do need that for > hash. Yes, there is a race condition, but someone will win. Do we care > who? Do we care enough to take the concurrency hit? > I think if we don't care we might end up with duplicates. I might be missing something here but let me explain the problem I see. Say, while doing a unique check we found the same hash value in the bucket we are trying to insert, we can't say unique key violation at this stage and error out without checking the actual value in heap. This is because there is always a chance that two different key values can map to the same hash value. Now, after checking in the heap if we found that the actual value doesn't match so we decide to insert the value in the hash index, and in the meantime, another insert of the same key value already performed these checks and ends up inserting the value in hash index and that would lead to a duplicate value in the hash index. I think btree doesn't have a similar risk so we don't need such a mechanism for btree. -- With Regards, Amit Kapila.
pgsql-hackers by date: