Re: Optimize single tuple fetch from nbtree index - Mailing list pgsql-hackers

From Anastasia Lubennikova
Subject Re: Optimize single tuple fetch from nbtree index
Date
Msg-id 4b660433-d74e-2075-28d9-907e09df9b9a@postgrespro.ru
Whole thread Raw
In response to Re: Optimize single tuple fetch from nbtree index  (Floris Van Nee <florisvannee@Optiver.com>)
Responses Re: Optimize single tuple fetch from nbtree index
Re: Optimize single tuple fetch from nbtree index
List pgsql-hackers
05.08.2019 14:34, Floris Van Nee wrote:
> I have one further question about these index offsets. There are several comments in master that indicate that it's
impossiblethat an item moves 'left' on a page, if we continuously hold a pin on the page. For example, _bt_killitems
hasa comment like this:
 
>   
> * Note that if we hold a pin on the target page continuously from initially
>   * reading the items until applying this function, VACUUM cannot have deleted
>   * any items from the page, and so there is no need to search left from the
>   * recorded offset.  (This observation also guarantees that the item is still
>   * the right one to delete, which might otherwise be questionable since heap
>   * TIDs can get recycled.)    This holds true even if the page has been modified
>   * by inserts and page splits, so there is no need to consult the LSN.
>   
> Still, exactly this case happens in practice. In my tests I was able to get behavior like:
> 1) pin + lock a page in _bt_first
> 2) read a tuple, record indexOffset (for example offset=100) and heap tid
> 3) unlock page, but*keep*  the pin (end of _bt_first of my patch)
> 4) lock page again in _bt_next (we still hold the pin, so vacuum shouldn't have occurred)
> 5) look inside the current page for the heap Tid that we registered earlier
> 6) we find that we can now find this tuple at indexOffset=98, eg. it moved left. This should not be possible.
> This case sometimes randomly happens when running 'make check', which is why I added code in my patch to also look
lefton the page from the previous indexOffset.
 
>
> However, this is in contradiction with the comments (and code) of _bt_killitems.
> Is the comment incorrect/outdated or is there a bug in vacuum or any other part of Postgres that might move index
itemsleft even though there are others holding a pin?
 

Hello,
welcome to hackers with your first patch)

As far as I understood from the thread above, the design of this 
optimization is under discussion, so I didn't review the proposed patch 
itself.
Though, I got interested in the comment inconsistency you have found.
I added debug message into this code branch of the patch and was able to 
see it in regression.diffs after 'make check':
Speaking of your patch, it seems that the buffer was unpinned and pinned 
again between two reads,
and the condition of holding it continuously has not been met.

I didn't dig into the code, but this line looks suspicious (see my 
findings about BTScanPosIsPinned below):

         /* bump pin on current buffer for assignment to mark buffer */
         if (BTScanPosIsPinned(so->currPos))
             IncrBufferRefCount(so->currPos.buf);


While reading the code to answer your question, I noticed that 
BTScanPosIsPinned macro name is misleading.
It calls BufferIsValid(), not BufferIsPinned() as one could expect.
And BufferIsValid in bufmgr.h comment explicitly states that it 
shouldn't be confused with BufferIsPinned.
The same goes for BTScanPosUnpinIfPinned().

I propose that we update BTScanPosIsPinned macro. Or, at least write a 
comment, why its current behavior is fine.
There are a few existing callers, that are definitely expecting that 
this macro checks a pin, which it doesn't do.
I don't quite understand if that already causes any subtle bug, or the 
current algorithm is fine.

Peter, Tom, what do you think?

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




pgsql-hackers by date:

Previous
From: Merlin Moncure
Date:
Subject: Re: Hstore OID bigger than an integer
Next
From: Fabien COELHO
Date:
Subject: Re: pg_checksums --help synopsis is quite long