Re: Returning nbtree posting list TIDs in DESC order during backwards scans - Mailing list pgsql-hackers
| From | Peter Geoghegan |
|---|---|
| Subject | Re: Returning nbtree posting list TIDs in DESC order during backwards scans |
| Date | |
| Msg-id | CAH2-WzmRsiCOZaL9HzEvtr_yNHjX0GsujLgz=9599sSjNVkR8Q@mail.gmail.com Whole thread Raw |
| In response to | Re: Returning nbtree posting list TIDs in DESC order during backwards scans (Chao Li <li.evan.chao@gmail.com>) |
| List | pgsql-hackers |
On Wed, Dec 3, 2025 at 7:32 AM Chao Li <li.evan.chao@gmail.com> wrote: > The old code only sets so->numKilled to 0 and reuse memory of so->killedItems[], now the new code always bms_free(so->killedItems)and re-alloc memory when next time adding a member to bms. > > I am think that, if there were bms_clear(), then we could have just cleared the bitmap and reused the memory next time.But unfortunately, there is no such a bms_clear() now. What do you think to add bms_clear() and use it here? I don’twant to do that, I can try that once you push this patch. I don't think that it's worth the complexity. We can rely on palloc() to recycle the memory that was freed by the most recent bms_clear(). I say this in part because the complexity of reusing the same Bitmapset would be rather high. The idea that the only valid representation of an empty set is a NULL pointer is baked into the Bitmapset API. Note also that we'll use much less memory for killedItems by representing it as a Bitmapset. We'll use at most one bit per so->currPos.items[] item, whereas before we used 4 bytes per item. > I wonder if the comment “must be at lease one” should apply to the assignment of tupleOffset? The “for” loop starts fromnitems-2, will it still must be at lease one item? By definition, a posting list tuple has at least 2 TIDs -- that's a posting list invariant. If there was only 1 TID, then it wouldn't be a posting list in the first place. (Unlike within GIN, where single TID posting lists are possible.) > /* > - * Don't bother advancing the outermost loop's int iterator to > - * avoid processing killed items that relate to the same > - * offnum/posting list tuple. This micro-optimization hardly > - * seems worth it. (Further iterations of the outermost loop > - * will fail to match on this same posting list's first heap > - * TID instead, so we'll advance to the next offnum/index > - * tuple pretty quickly.) > + * Don't advance itemIndex for outermost loop, no matter how > + * nextIndex was advanced. It's possible that items whose > + * TIDs weren't matched in posting list can still be killed > + * (there might be a later tuple whose TID is a match). > */ > if (j == nposting) > killtuple = true; > ``` > > I really don’t get what "Don't bother advancing the outermost loop's int iterator” means? Here we only set killtuple totrue, then if (killtuple && !ItemIdIsDead(iid)), it breaks the inner while loop, in that case, outer while loop "while((itemIndex = bms_next_member(so->killedItems, itemIndex)) >= 0)” will advance itemIndex. The old comment should probably have been written more like the new one (that I propose to replace it with). It's saying "don't try to be clever by remembering that we already determined that all the TIDs that we tried to compare to this posting list aren't matches for *any* TID". But I don't think that that's accurate; we *haven't* determined that those TIDs aren't matches for *any and all* TIDs on the page (only for those now in the posting list specifically). We might still be able to match those TIDs to later tuples, immediately after the posting list. Note that this is all a bit academic and theoretical; in practice it rarely comes up. During so->dropPin scans (which is the common case), we'll give up/get scared at the start of _bt_killitems if the page has changed at all since it was initially examined within _bt_readpage anyway -- no matter how the page was modified. It doesn't matter that the page probably *won't* have been modified by VACUUM when _bt_kilitems gets scared of modifying the page like this (VACUUM is the only thing that truly makes it unsafe for _bt_killitems to run, but _bt_killitems is conservative/easily scared). So while it is true that "We might still be able to match those TIDs to later tuples, immediately after the posting list" (as I said in the paragraph before the previous paragraph), we can only do so during !so->dropPin scans. In other words, only during index-only scans, or scans of an unlogged relation, or when we don't use an MVCC snapshot. All of which are rare. Which makes all this fairly academic/theoretical (mostly it's historical, 10+ years ago *all* scans were !so->dropPin scans). -- Peter Geoghegan
pgsql-hackers by date: