Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location
Msg-id CAA4eK1+sH9L2FhgvTXrC7COJvfNvwMkc7=Seok5_fTyowZtCkQ@mail.gmail.com
In response to Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location  (Claudio Freire <klaussfreire@gmail.com>)
Responses Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location  (Claudio Freire <klaussfreire@gmail.com>)
List pgsql-hackers
On Sat, Aug 20, 2016 at 10:23 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Sat, Aug 20, 2016 at 1:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, Aug 18, 2016 at 8:24 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>>
>>> A couple of points make me uneasy about this patch, yet I can think of
>>> no better alternative, so I seek feedback:
>>>
>>>  - introducing a different format for inner index tuples makes for an
>>> invasive patch and code that is quite difficult to reason about (it's
>>> easy to forget whether a page is leaf or inner, and that now causes
>>> assertion failures or segfaults)
>>>  - the ascent-descent to check for uniqueness when there are large
>>> dead tuple runs could have locking subtleties that escape me. Perhaps
>>> there are better ways to solve this.
>>
>> I have tried to study this part of your patch, and it seems to be a
>> somewhat non-trivial and risky part of the proposal.
>>
>> + } else {
>> + /*
>> + * We found the actual first item on another block, so we have to perform
>> + * a two-step search - first half, until our write-locked buffer, then another
>> + * starting from our write-locked buffer.
>> + */
>> + LockBuffer(buf, BUFFER_LOCK_UNLOCK);
>> + LockBuffer(buf, BT_WRITE);
>> +
>> + buf = _bt_moveright(rel, buf, natts, itup_scankey, &(itup->t_tid), false,
>> + true, stack, BT_WRITE, NULL);
>> +
>> + first_offset = _bt_binsrch(rel, buf, natts, itup_scankey, NULL, false, NULL);
>> +
>> + xwait = _bt_check_unique(rel, itup, heapRel, nbuf, buf, first_offset,
>> + itup_scankey, checkUnique, &is_unique, &speculativeToken);
>> +
>> + _bt_relbuf(rel, nbuf);
>> + }
>>
>> The idea behind the uniqueness check is that we hold the write lock on
>> the buffer/page on which we are trying to operate (we do take read locks
>> on the consecutive pages during this check).  Here, in your changed
>> proposal, you have two buffers (one to which the search led you and
>> one buffer previous in the chain), and before checking uniqueness, on
>> one of the buffers you seem to have a write lock and on the other a
>> read lock.  This seems to change the way the uniqueness check has
>> worked until now.  Can you explain how this works (can we safely assume
>> that all searches for the uniqueness check will lead to the same buffer
>> first)?
>
> All uniqueness checks will find the same "nbuf" (read-locked), which
> is the first one that matches the key.
>
> They could have different "buf" buffers (the write-locked one). But
> since read locks cannot be held while a write-lock is held, I believe
> it doesn't matter. All uniqueness checks will try to read-lock nbuf
> unless the uniqueness check for that insertion is done. When they do,
> they will block until the other insertion is finished, and when that
> happens, either the insert happened, or it returned an xid to wait for
> and retry. If it succeeded, the check attempting the read-lock will
> see the inserted tuple and return the xid of the earlier transaction
> and wait on it. If it failed, no insert happened because there's a
> visible tuple there, and the read-lock will succeed, find the visible
> tuple, and the insert will fail too.
>
> In essence, it is still the case, I believe, that only one insert can
> succeed at a time, all the others will retry. It only happens that
> with the patch, the contention will be between a read lock and a write
> lock, and not two write locks, and there's a little less contention
> (some more work can be done in parallel).
>
>> With this new mechanism, do we have two type of search interfaces
>> where one would work for keys (as now) and other for key-ctid or it
>> will be just a single interface which works both ways?  I think either
>> way there are pros and cons.
>
> The patch doesn't add another interface, but the idea is to add it.
> The implementation will probably fall on a common function that can do
> both
>

That makes sense, but it means there is a chance that the searches
could lead to different buffers in the case of uniqueness checks (the
search with key-ctid could lead to a different buffer than the search
with just the key).  I am not clear whether we need this uniqueness
check for the key-ctid search API, because, as I understand it, you
want it mainly for vacuum and WARM tuples.  Vacuum won't add new
tuples, so is this required for WARM tuples?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


