Thread: Re: Making Row Comparison NULL row member handling more robust during skip scans

On Wed, Jun 18, 2025 at 8:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
> In general, when we end a primitive index scan, the code that sets
> continuescan=false (any such code, not just _bt_check_rowcompare NULL
> row member code) has to make sure that starting the next primitive
> index scan will actually allow the top-level scan to make forward
> progress -- the new primscan needs to land on some later leaf page.
> Right now, _bt_check_rowcompare doesn't guarantee it in a way that I'd
> consider truly robust. We need that guarantee to avoid these
> cycles/infinite looping. AFAIK there are no bugs of that general
> nature on Postgres 18

Correction: there *is* a live bug like this on Postgres 18/HEAD, which
involves simple scalar inequalities with an incomplete opfamily.
Attached v3 fixes the bug in its new 0001-* patch. (Note that v3's
0002-* patch previously appeared as 0001-* in v2 and v1).

The bug in question involves redundant keys that could not be
eliminated by preprocessing due to a lack of available cross-type
support. Quals with certain combinations of redundant scalar keys can
cycle through the index being scanned ceaselessly. The scan fails to
make any real progress, again and again. Unlike the row compare case,
we lack even kludgey handling that avoids the problem within
_bt_set_startikey.

There's a live bug here, so I have to act. It seems to make sense to
apply everything on HEAD/Postgres 18, and be done with any question of
getting stuck like this/infinite cycling from resetting the scan's
arrays to get a clean slate. But I would certainly welcome other
opinions on this.

Problem statement
=================

I have confirmed the existence of the live bug using fuzz-testing,
combined with hard-coding that makes _bt_compare_scankey_args return
false unconditionally at the point where we're supposed to compare
each sk_argument from a pair of scan keys against the same attribute
(that's the easiest way to test problems in this area). With the right
combination of contradictory runtime keys, and with the right index
scan/index, we can get stuck.

When we get stuck its because we apply forcenonrequired mode in
_bt_readpage, reset the arrays at the end of the same _bt_readpage,
start another primitive index scan when pstate.finaltup is considered,
and then arrive at the same leaf page we ended on last time. We'll
then start the cycle anew, without any hope of ever making useful
progress. Not good.

Fundamentally, the problem here is that _bt_advance_array_keys expects
certain things from _bt_preprocess_keys and _bt_first that they won't
always honor. My approach in 0001-* more or less makes things always
work in the way that _bt_advance_array_keys already expects, rather
than changing _bt_advance_array_keys (or related code such as
_bt_set_startikey). More on that below, under "bugfix".

Background info on _bt_advance_array_keys' expectations
=======================================================

Notably, _bt_advance_array_keys would like to be able to determine if
_bt_first will be able to perform another primitive index scan based
on lower-order required inequalities marked required in the
opposite-to-scan direction only (i.e. an > or >= key when scanning
forwards, a < or <= key when scanning backwards). That detail is
important to my repro of the bug. _bt_advance_array_keys expects to be
able to apply required-in-opposite-direction inequalities to reason
about how _bt_first will behave the next time it is called.

During a forwards scan with a qual "WHERE a IN(1, 2, 3) AND b > 5",
_bt_advance_array_keys needs to know how "b > 5" will affect _bt_first
should it choose to schedule another primscan. It needs to decide
whether or not to actually schedule such a primscan. It might make
sense for _bt_advance_array_keys to stick to the leaf level for the
time being upon reaching the threshold between "a = 1" and "a = 2"
tuples -- the first "a = 2 AND b > 5" might be on the same leaf page.
OTOH, scheduling a new primscan is far preferable when it'll allow
_bt_first to land the scan on a much later leaf page (we'll skip over
many irrelevant leaf pages).

In general, the first leaf page/position that the first tuple "a = 2
AND b > 5" appears on might be many leaf pages after the first leaf
page that the first tuple "a =2" appears on. It all depends.

This scheme works well if there are no redundant > keys (which is very
much the common case, since incomplete opfamilies are rare). It
tacitly assumes that _bt_first shares the same definition of "required
in the opposite-to-scan direction key". But that isn't quite true
right now.

Bugfix
======

My 0001-* bugfix makes nbtree preprocessing deal with a qual like
"WHERE a IN(1, 2, 3) AND b > 5 AND c > 10_0000::bigint" with an
incomplete opfamily on "b" by leaving so->keyData[] in a state that
reflects the redundancy. _bt_preprocess_keys will now make sure that
keys left in so->keyData[] still appear in the most useful order. This
avoids creating confusion elsewhere.

Preprocessing will arbitrarily decide that only the first "b >"
condition gets to be marked required, while *unmarking* the
requiredness marking on the second "b >" condition. That way _bt_first
doesn't get to make its own independent choice about initial
positioning keys, based on ever-so-slightly different rules. Rather,
_bt_first has to agree with everybody else about which > or >= key
should be used, and about which < or <= key should be used (and even
about which "=" key should be used).

_bt_first does what preprocessing tells it to do. Just like
_bt_advance_array_keys will. No ifs, no buts.

Placing nonrequired keys at the end of so->keyData[]
----------------------------------------------------

Preprocessing also reorders so->keyData[] such that the "unlucky" keys
that _bt_preprocess_keys *doesn't* pick go to the end of the array --
we want to keep them out of the way, at least until they're needed.
Note that this makes it possible for keys to appear out of index
attribute order, but that's okay. Nonrequired keys don't need to be in
any particular order.

Reordering so->keyData[] in this way eliminates any risk that some
early nonrequired key on "b" will stop us from getting to some later
required key of interest on "c". Not in _bt_first, not in
_bt_checkkeys or _bt_advance_array_keys. Nowhere.

In general, the rule going forward is that all nbtree code (barring
preprocessing code) can assume that redundant keys don't really exist,
as long as they're marked required (which virtually all keys are now,
thanks to the skip scan/skip array work). There can be at most one
inequality key marked SK_BT_REQFWD and one inequality key marked
SK_BT_REQBKWD per index attribute -- no matter what.

nbtree is blasé about keeping around redundant scan keys right now.
That seems like it has the potential to cause more bugs in the future.
It's easy to forget about redundant required keys, and they're rare
(except perhaps when row compares are used, since preprocessing has no
hardly any smarts about redundancy that happens to involve row
compares right now). For a recent example a bug caused by such an
oversight, see bugfix commit 7e6fb5da.

New _bt_first behavior with row compares
----------------------------------------

Currently, on HEAD, _bt_first is prepared to use a seemingly random
mix of the first row compare member followed by some later
redudant-ish scalar on some later column (not the key on the same
column that appears in the row compare). This v3's row compare patch
now avoids that behavior. That happens automatically in v3, by virtue
of the fact that there can't be a row compare that's marked required
in the opposite-to-scan scan direction followed by some other key
that's also marked required (skip arrays don't accept row compare
inequalities, so it's just not possible).

In short, if _bt_first gets to use a row compare to build its
insertion scan key at all, then it *must* use all of the row members;
it cannot use any additional lower-order keys. (Actually, _bt_first
can only do that with individual row members marked required).

Again, there needs to be symmetry here (perhaps it isn't strictly
necessary to go exactly this far, but the row compare change to
_bt_first *is* necessary). Again, the rules that decide when we'll
start an index scan when scanning in one particular scan direction
need to be symmetric with the rules that decide when we'll end a index
scan with the same qual moving in the opposite scan direction.

Impact on Postgres 17
=====================

On Postgres 17, even a qual with redundancy such as "WHERE a IN(1, 2,
3) AND b > 5 AND c > 10_0000::bigint" shouldn't get infinite cycling
with an incomplete opfamily on "b". That can only happen on HEAD
because recent bugfix commit 5f4d98d4 added a _bt_start_array_keys
call to _bt_readpage, to reset the scan's arrays after we used
forcenonrequired mode to read all the tuples on the page.

Postgres 17 *does* reset the array keys through a call to
_bt_start_array_keys, to get a clean slate. But that only happens in
the context of mark/restore for merge joins, which seems robust
against infinite cycling of the kind I'm seeing on HEAD.

--
Peter Geoghegan

Attachment