Re: Optimize single tuple fetch from nbtree index - Mailing list pgsql-hackers

From Floris Van Nee
Subject Re: Optimize single tuple fetch from nbtree index
Date
Msg-id 1566890597643.9842@Optiver.com
Whole thread Raw
In response to Re: Optimize single tuple fetch from nbtree index  (Floris Van Nee <florisvannee@Optiver.com>)
List pgsql-hackers
>> It seems that it contradicts the very idea of your patch, so probably we
>> should look for other ways to optimize this use-case.
>> Maybe this restriction can be relaxed for write only tables, that never
>> have to reread the page because of visibility, or something like that.
>> Also we probably can add to IndexScanDescData info about expected number
>> of tuples, to allow index work more optimal
>> and avoid the overhead for other loads.=

> The idea of the patch is exactly to relax this limitation. I forgot to update that README file though. The current
implementationof the patch should be correct like this - that's why I added the look-back code on the page if the tuple
couldn'tbe found anymore on the same location on the page. Similarly, it'll look on the page to the right if it
detecteda page split. These two measures combined should give a correct implementation of the 'it's possible that a
scanstops in the middle of a page' relaxation. However, as Peter and Tom pointed out earlier, they feel that the
performanceadvantage that this approach gives, does not outweigh the extra complexity at this time. I'd be open to
othersuggestions though. 

Although now that I think of it - do you mean the case where the tuple that we returned to the caller after _bt_first
actuallygets deleted (not moved) from the page? I guess that can theoretically happen if _bt_first returns a
non-visibletuple (but not DEAD yet in the index at the time of _bt_first). For my understanding, would a situation like
thefollowing lead to this (in my patch)? 
1) Backend 1 does an index scan and returns the first tuple on _bt_first - this tuple is actually deleted in the heap
already,however it's not marked dead yet in the index. 
2) Backend 1 does a heap fetch to check actual visibility and determines the tuple is actually dead
3) While backend 1 is busy doing the heap fetch (so in between _bt_first and _bt_next) backend 2 comes in and manages
tosomehow do 1) a _bt_killitems on the page to mark tuples dead as well as 2) compact items on the page, thereby
actuallyremoving this item from the page. 
4) Now backend 1 tries to find the next tuple in _bt_next - it first tries to locate the tuple where it left off, but
cannotfind it anymore because it got removed completely by backend 2. 

If this is indeed possible then it's a bad issue unfortunately, and quite hard to try to reproduce, as a lot of things
needto happen concurrently while doing a visiblity check. 

As for your patch, I've had some time to take a look at it. For the two TODOs:

+        /* TODO Is it possible that currPage is not valid anymore? */
+        Assert(BTScanPosIsValid(so->currPos))

This Assert exists already a couple of lines earlier at the start of this function.

+ * TODO It is not clear to me
+ * why to check scanpos validity based on currPage value.
+ * I wonder, if we need currPage at all? Is there any codepath that
+ * assumes that currPage is not the same as BufferGetBlockNumber(buf)?
+ */

The comments in the source mention the following about this:
         * We note the buffer's block number so that we can release the pin later.
         * This allows us to re-read the buffer if it is needed again for hinting.
         */
        so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);

As we figured out earlier, so->currPos.buf gets set to invalid when we release the pin by the unpin macro. So, if we
don'tstore currPage number somewhere else, we cannot obtain the pin again if we need it during killitems. I think
that'sthe reason that currPage is stored. 

Other than the two TODOs in the code, I think the comments really help clarifying what's going on in the code - I'd be
happyif this gets added. 

-Floris



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Fault injection framework
Next
From: Richard Guo
Date:
Subject: Re: A problem about partitionwise join