Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location - Mailing list pgsql-hackers

From Claudio Freire
Subject Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location
Date
Msg-id CAGTBQpagrctiOhOccAzM-h0969eBdFUFy1ZRm7MjOEc3OUSZ5A@mail.gmail.com
In response to Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location  (Claudio Freire <klaussfreire@gmail.com>)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Sat, Aug 20, 2016 at 1:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Aug 18, 2016 at 8:24 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>
>> A couple of points make me uneasy about this patch, yet I can think of
>> no better alternative, so I seek feedback:
>>
>>  - introducing a different format for inner index tuples makes for an
>> invasive patch and quite difficult-to-reason code (it's easy to forget
>> whether a page is leaf or inner and that now causes assertion failures
>> or segfaults)
>>  - the ascent-descent to check for uniqueness when there are large
>> dead tuple runs could have locking subtleties that escape me. Perhaps
>> there's better ways to solve this.
>
> I have tried to study this part of your patch, and it seems a somewhat
> non-trivial and risky part of the proposal.
>
> + } else {
> +     /*
> +      * We found the actual first item on another block, so we have to
> +      * perform a two-step search - first half, until our write-locked
> +      * buffer, then another starting from our write-locked buffer.
> +      */
> +     LockBuffer(buf, BUFFER_LOCK_UNLOCK);
> +     LockBuffer(buf, BT_WRITE);
> +
> +     buf = _bt_moveright(rel, buf, natts, itup_scankey, &(itup->t_tid),
> +                         false, true, stack, BT_WRITE, NULL);
> +
> +     first_offset = _bt_binsrch(rel, buf, natts, itup_scankey, NULL,
> +                                false, NULL);
> +
> +     xwait = _bt_check_unique(rel, itup, heapRel, nbuf, buf,
> +                              first_offset, itup_scankey,
> +                              checkUnique, &is_unique, &speculativeToken);
> +
> +     _bt_relbuf(rel, nbuf);
> + }
>
> The idea for the uniqueness check is that we hold the write lock on the
> buffer/page on which we are trying to operate (we do take read locks
> on the consecutive pages during this check).  Here, in your changed
> proposal, you have two buffers (one to which the search led you and
> one buffer previous in the chain), and before checking uniqueness you
> seem to hold a write lock on one of the buffers and a read lock on the
> other.  This seems to change the way the uniqueness check has worked
> till now; can you explain how this works (can we safely assume that
> all searches for the uniqueness check will lead to the same buffer
> first)?

All uniqueness checks will find the same "nbuf" (read-locked), which
is the first one that matches the key.

They could have different "buf" buffers (the write-locked one), but
since read locks cannot be held while a write lock is held, I believe
it doesn't matter. Every uniqueness check will try to read-lock nbuf
until the check for its insertion is done. When one does, it will
block until the other insertion is finished; by that point, either the
insert happened, or it returned an xid to wait for and retry. If it
succeeded, the check attempting the read lock will see the inserted
tuple, return the xid of the earlier transaction, and wait on it. If
it failed, no insert happened because there's a visible tuple there,
so the read lock will succeed, find the visible tuple, and the insert
will fail too.

In essence, I believe it is still the case that only one insert can
succeed at a time; all the others will retry. The difference with the
patch is that the contention is between a read lock and a write lock
rather than between two write locks, so there's a little less
contention (some more work can be done in parallel).

> With this new mechanism, do we have two type of search interfaces
> where one would work for keys (as now) and other for key-ctid or it
> will be just a single interface which works both ways?  I think either
> way there are pros and cons.

The patch doesn't add another interface, but the idea is to add one.
The implementation will probably fall to a common function that can do
both, but some index AMs (hash indexes) can't implement the key-ctid
operation, so there will have to be separate interfaces to properly
represent the separate capabilities (besides, other implementations
may well use completely different paths for the two operations).

> Also, in the thread below you are talking about using the last bit in
> t_info, I want to bring it to your notice that there is a patch of
> mine [1] in which I have used it to avoid changing on-disk
> compatibility of hash indexes.  I am not saying that we should not
> plan it for some other use, but just to keep in mind that there are
> other proposals to use it.  We can decide the best way to proceed if
> we are aware of all the proposals that want to use it.
>
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com

I wasn't aware of that particular thread, but there's another one (the
WARM thread) with yet another proposed use of that bit.

IMHO, any use of the bit that doesn't leave room for further extension
of the bit space is a bad idea, because it closes the door on (easy)
further flags being added. The fact that there are already several
proposals to use that bit for different purposes only reinforces that
point.

The idea I'm working on allows further extension and, although it uses
more space, I believe it's a better way. But I guess we'll discuss that
when I have the implementation.


