Re: Optimize single tuple fetch from nbtree index - Mailing list pgsql-hackers
From | Floris Van Nee |
---|---|
Subject | Re: Optimize single tuple fetch from nbtree index |
Date | |
Msg-id | 1566890597643.9842@Optiver.com Whole thread Raw |
In response to | Re: Optimize single tuple fetch from nbtree index (Floris Van Nee <florisvannee@Optiver.com>) |
List | pgsql-hackers |
>> It seems that it contradicts the very idea of your patch, so probably we >> should look for other ways to optimize this use-case. >> Maybe this restriction can be relaxed for write only tables, that never >> have to reread the page because of visibility, or something like that. >> Also we probably can add to IndexScanDescData info about expected number >> of tuples, to allow index work more optimal >> and avoid the overhead for other loads.= > The idea of the patch is exactly to relax this limitation. I forgot to update that README file though. The current implementationof the patch should be correct like this - that's why I added the look-back code on the page if the tuple couldn'tbe found anymore on the same location on the page. Similarly, it'll look on the page to the right if it detecteda page split. These two measures combined should give a correct implementation of the 'it's possible that a scanstops in the middle of a page' relaxation. However, as Peter and Tom pointed out earlier, they feel that the performanceadvantage that this approach gives, does not outweigh the extra complexity at this time. I'd be open to othersuggestions though. Although now that I think of it - do you mean the case where the tuple that we returned to the caller after _bt_first actuallygets deleted (not moved) from the page? I guess that can theoretically happen if _bt_first returns a non-visibletuple (but not DEAD yet in the index at the time of _bt_first). For my understanding, would a situation like thefollowing lead to this (in my patch)? 1) Backend 1 does an index scan and returns the first tuple on _bt_first - this tuple is actually deleted in the heap already,however it's not marked dead yet in the index. 2) Backend 1 does a heap fetch to check actual visibility and determines the tuple is actually dead 3) While backend 1 is busy doing the heap fetch (so in between _bt_first and _bt_next) backend 2 comes in and manages tosomehow do 1) a _bt_killitems on the page to mark tuples dead as well as 2) compact items on the page, thereby actuallyremoving this item from the page. 4) Now backend 1 tries to find the next tuple in _bt_next - it first tries to locate the tuple where it left off, but cannotfind it anymore because it got removed completely by backend 2. If this is indeed possible then it's a bad issue unfortunately, and quite hard to try to reproduce, as a lot of things needto happen concurrently while doing a visiblity check. As for your patch, I've had some time to take a look at it. For the two TODOs: + /* TODO Is it possible that currPage is not valid anymore? */ + Assert(BTScanPosIsValid(so->currPos)) This Assert exists already a couple of lines earlier at the start of this function. + * TODO It is not clear to me + * why to check scanpos validity based on currPage value. + * I wonder, if we need currPage at all? Is there any codepath that + * assumes that currPage is not the same as BufferGetBlockNumber(buf)? + */ The comments in the source mention the following about this: * We note the buffer's block number so that we can release the pin later. * This allows us to re-read the buffer if it is needed again for hinting. */ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf); As we figured out earlier, so->currPos.buf gets set to invalid when we release the pin by the unpin macro. So, if we don'tstore currPage number somewhere else, we cannot obtain the pin again if we need it during killitems. I think that'sthe reason that currPage is stored. Other than the two TODOs in the code, I think the comments really help clarifying what's going on in the code - I'd be happyif this gets added. -Floris
pgsql-hackers by date: