Re: Optimize single tuple fetch from nbtree index - Mailing list pgsql-hackers
From | Anastasia Lubennikova |
---|---|
Subject | Re: Optimize single tuple fetch from nbtree index |
Date | |
Msg-id | 4e6943f5-3890-1e00-0ff1-a607540f1a4b@postgrespro.ru Whole thread Raw |
In response to | Re: Optimize single tuple fetch from nbtree index (Floris Van Nee <florisvannee@Optiver.com>) |
Responses |
Re: Optimize single tuple fetch from nbtree index
|
List | pgsql-hackers |
25.08.2019 0:59, Floris Van Nee wrote: >> Though, I got interested in the comment inconsistency you have found. >> I added debug message into this code branch of the patch and was able to >> see it in regression.diffs after 'make check': >> Speaking of your patch, it seems that the buffer was unpinned and pinned >> again between two reads, >> and the condition of holding it continuously has not been met. > May I ask what makes you conclude that the condition of holding the pin continuously has not been met? > Your reply encouraged me to dig a little bit more into this today. First, I wanted to check if indeed the pin was continuouslyheld by the backend or not. I added some debug info to ReleaseBuffer for this: it turned out that the pin onthe buffer was definitely never released by the backend between the calls to _bt_first and _bt_next. So the buffer gotcompacted while the backend held a pin on it. > After some more searching I found the following code: _bt_vacuum_one_page in nbtinsert.c > This function compacts one single page without taking a super-exclusive lock. It is used during inserts to make room ona page. I verified that if I comment out the calls to this function, the compacting never happens while I have a pin onthe buffer. > So I guess that answers my own question: cleaning up garbage during inserts is one of the cases where compacting may happeneven while other backends hold a pin to the buffer. Perhaps this should also be more clearly phrased in the commentsin eg. _bt_killitems? Because currently those comments make it look like this case never occurs. You're right, the pin was not released between page reads. I also added debug to UnpinBuffer, but now I see that I had interpreted it wrongly. As far as I understand, the issue with your patch is that it breaks the *scan stops "between" pages* assumption and thus it unsafely interacts with _bt_vacuum_one_page() cleanup. See README: >Page read locks are held only for as long as a scan is examining a page. To minimize lock/unlock traffic, an index scan always searches a leaf page to identify all the matching items at once, copying their heap tuple IDs into backend-local storage. The heap tuple IDs are then processed while not holding any page lock within the index. We do continue to hold a pin on the leaf page in some circumstances, to protect against concurrent deletions (see below). In this state the scan is effectively stopped "between" pages, either before or after the page it has pinned. This is safe in the presence of concurrent insertions and even page splits, because items are never moved across pre-existing page boundaries --- so the scan cannot miss any items it should have seen, nor accidentally return the same item twice. and >Once an index tuple has been marked LP_DEAD it can actually be removed from the index immediately; since index scans only stop "between" pages, no scan can lose its place from such a deletion. It seems that it contradicts the very idea of your patch, so probably we should look for other ways to optimize this use-case. Maybe this restriction can be relaxed for write only tables, that never have to reread the page because of visibility, or something like that. Also we probably can add to IndexScanDescData info about expected number of tuples, to allow index work more optimal and avoid the overhead for other loads.= >> While reading the code to answer your question, I noticed that >> BTScanPosIsPinned macro name is misleading. >> It calls BufferIsValid(), not BufferIsPinned() as one could expect. >> And BufferIsValid in bufmgr.h comment explicitly states that it >> shouldn't be confused with BufferIsPinned. >> The same goes for BTScanPosUnpinIfPinned(). > I agree the name is misleading. It clearly does something else than how it's named. However, I don't believe this introducesproblems in these particular pieces of code, as long as the macro's are always used. BTScanPosIsPinned actuallychecks whether it's valid and not necessarily whether it's pinned, as you mentioned. However, any time the buffergets unpinned using the macro BTScanPosUnpin, the buffer gets set to Invalid by the macro as well. Therefore, any consecutivecall to BTScanPosIsPinned should indeed return false. It'd definitely be nice if this gets clarified in commentsthough. That's true. It took me quite some time to understand that existing code is correct. There is a comment for the structure's field that claims that BufferIsValid is the same that BufferIsPinned in ScanPos context. Attached patch contains some comments' updates. Any suggestions on how to improve them are welcome. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
pgsql-hackers by date: