Home > mailing lists

Re: Returning nbtree posting list TIDs in DESC order during backwards scans - Mailing list pgsql-hackers

From	Peter Geoghegan
Subject	Re: Returning nbtree posting list TIDs in DESC order during backwards scans
Date	December 3 20:31:24
Msg-id	CAH2-WzmRsiCOZaL9HzEvtr_yNHjX0GsujLgz=9599sSjNVkR8Q@mail.gmail.com Whole thread Raw
In response to	Re: Returning nbtree posting list TIDs in DESC order during backwards scans (Chao Li <li.evan.chao@gmail.com>)
List	pgsql-hackers

Tree view

On Wed, Dec 3, 2025 at 7:32 AM Chao Li <li.evan.chao@gmail.com> wrote:
> The old code only sets so->numKilled to 0 and reuse memory of so->killedItems[], now the new code always
bms_free(so->killedItems)and re-alloc memory when next time adding a member to bms. 
>
> I am think that, if there were bms_clear(), then we could have just cleared the bitmap and reused the memory next
time.But unfortunately, there is no such a bms_clear() now. What do you think to add bms_clear() and use it here? I
don’twant to do that, I can try that once you push this patch. 

I don't think that it's worth the complexity. We can rely on palloc()
to recycle the memory that was freed by the most recent bms_clear().

I say this in part because the complexity of reusing the same
Bitmapset would be rather high. The idea that the only valid
representation of an empty set is a NULL pointer is baked into the
Bitmapset API.

Note also that we'll use much less memory for killedItems by
representing it as a Bitmapset. We'll use at most one bit per
so->currPos.items[] item, whereas before we used 4 bytes per item.

> I wonder if the comment “must be at lease one” should apply to the assignment of tupleOffset? The “for” loop starts
fromnitems-2, will it still must be at lease one item? 

By definition, a posting list tuple has at least 2 TIDs -- that's a
posting list invariant. If there was only 1 TID, then it wouldn't be a
posting list in the first place. (Unlike within GIN, where single TID
posting lists are possible.)

>                                 /*
> -                                * Don't bother advancing the outermost loop's int iterator to
> -                                * avoid processing killed items that relate to the same
> -                                * offnum/posting list tuple.  This micro-optimization hardly
> -                                * seems worth it.  (Further iterations of the outermost loop
> -                                * will fail to match on this same posting list's first heap
> -                                * TID instead, so we'll advance to the next offnum/index
> -                                * tuple pretty quickly.)
> +                                * Don't advance itemIndex for outermost loop, no matter how
> +                                * nextIndex was advanced.  It's possible that items whose
> +                                * TIDs weren't matched in posting list can still be killed
> +                                * (there might be a later tuple whose TID is a match).
>                                  */
>                                 if (j == nposting)
>                                         killtuple = true;
> ```
>
> I really don’t get what "Don't bother advancing the outermost loop's int iterator” means? Here we only set killtuple
totrue, then if (killtuple && !ItemIdIsDead(iid)), it breaks the inner while loop, in that case, outer while loop
"while((itemIndex = bms_next_member(so->killedItems, itemIndex)) >= 0)” will advance itemIndex. 

The old comment should probably have been written more like the new
one (that I propose to replace it with). It's saying "don't try to be
clever by remembering that we already determined that all the TIDs
that we tried to compare to this posting list aren't matches for *any*
TID". But I don't think that that's accurate; we *haven't* determined
that those TIDs aren't matches for *any and all* TIDs on the page
(only for those now in the posting list specifically). We might still
be able to match those TIDs to later tuples, immediately after the
posting list.

Note that this is all a bit academic and theoretical; in practice it
rarely comes up. During so->dropPin scans (which is the common case),
we'll give up/get scared at the start of _bt_killitems if the page has
changed at all since it was initially examined within _bt_readpage
anyway -- no matter how the page was modified. It doesn't matter that
the page probably *won't* have been modified by VACUUM when
_bt_kilitems gets scared of modifying the page like this (VACUUM is
the only thing that truly makes it unsafe for _bt_killitems to run,
but _bt_killitems is conservative/easily scared).

So while it is true that "We might still be able to match those TIDs
to later tuples, immediately after the posting list" (as I said in the
paragraph before the previous paragraph), we can only do so during
!so->dropPin scans. In other words, only during index-only scans, or
scans of an unlogged relation, or when we don't use an MVCC snapshot.
All of which are rare. Which makes all this fairly
academic/theoretical (mostly it's historical, 10+ years ago *all*
scans were !so->dropPin scans).

--
Peter Geoghegan

pgsql-hackers by date:

From: Álvaro Herrera
Date: 03 December, 20:25:27
Subject: Re: pgindent versus struct members and typedefs

From: Heikki Linnakangas
Date: 03 December, 20:49:26
Subject: Re: IPC/MultixactCreation on the Standby server

Re: Returning nbtree posting list TIDs in DESC order during backwards scans - Mailing list pgsql-hackers

Previous

Next