Re: Add a berief general comment on BTScanInsertData's nextkey and backward - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Add a berief general comment on BTScanInsertData's nextkey and backward
Date
Msg-id CAH2-Wzk3VivXOSDz-e9fhATUwzxOpmkHGUN=u9gX-j6b8urfwQ@mail.gmail.com
Whole thread Raw
In response to Add a berief general comment on BTScanInsertData's nextkey and backward  (Yugo Nagata <nagata@sraoss.co.jp>)
List pgsql-hackers
On Tue, Nov 18, 2025 at 2:28 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> I've attached a patch that adds the following comment:
>
> + * nextkey determines how the scankey's boundary is interpreted, and backward
> + * indicates a backward scan.  See comments in _bt_first for a more detailed
> + * explanation of these fields.
>
> What do think?

Seems reasonable to me.

I wonder if we also do something about these existing _bt_binsrch comments:

 * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
 * of the last key < given scankey, or last key <= given scankey if nextkey
 * is true.  (Since _bt_compare treats the first data key of such a page as
 * minus infinity, there will be at least one key < scankey, so the result
 * always points at one of the keys on the page.)

Here we describe what happens on an internal page. This is correct,
but *seems* to contradict the higher level comments at the end of
_bt_first. There is actually no contradiction between the two -- not
when you understand that _bt_first describes the whole end-to-end
process of calling _bt_search and then calling _bt_binsrch on the leaf
page (not of calling _bt_binsrch once, against an internal page).

One could think of this _bt_binsrch internal page behavior as
compensating for the fact that internal pages have pivot tuples that
consist of a separator key (except for the first key, which is just
has a -inf key/no key), plus a downlink that points to the *next* page
after that separator key one level down (except for the internal page
high key, which has no downlink at all). Might be useful to say
something like that instead.

This is hard to explain without an example. Right now, an internal
page might have pivot tuples that look like this:

Separator: -inf, Downlink: 1
Separator: 'a', Downlink: 2
Separator: 'c', Downlink: 3
Separator: 'f', Downlink: (none, this is the high key)

But _bt_binsrch "pretends" that our internal page actually contains:

Downlink: 1
Separator: 'a'
Downlink: 2
Separator: 'c'
Downlink: 3
Separator: 'f'

So if our = scan key is (say) 'c', we should descend using the
downlink that points to block 2 (not the one that points to block 3,
as might be expected from looking at the real physical representation
consisting of pivot tuples). That is what the scan needs to do to get
to the page one level down whose high key is also 'c' -- that's where
the scan needs to look. (If the next level down is the leaf level,
then the leaf page we descend to might also contain a *non*-pivot
tuple with the key value 'c', "right before" the high key with 'c',
which the scan will need to return in _bt_readpage. Lehman & Yao allow
the key before a leaf page's high key to be equal to the high key, but
otherwise forbid all duplicate keys.)

I find it very hard to know what explanation will be the least
confusing to other people, at least in this area. Since you're
interested in this area, I thought I'd ask what you think.

(Here I'm pretending that the keys in the tree are unique without
needing a heap TID, per classic L&Y, which is unrealistic but
simplifies the example.)

--
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: Issue with logical replication slot during switchover
Next
From: Chao Li
Date:
Subject: Re: Post-release followup: pg_add_size_overflow()