Re: Making Row Comparison NULL row member handling more robust during skip scans - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Making Row Comparison NULL row member handling more robust during skip scans |
Date | |
Msg-id | CAH2-Wz=ds4M+3NXMgwxYxqU8MULaLf696_v5g=9WNmWL2=Uo2A@mail.gmail.com Whole thread Raw |
List | pgsql-hackers |
On Wed, Jun 18, 2025 at 8:41 PM Peter Geoghegan <pg@bowt.ie> wrote: > In general, when we end a primitive index scan, the code that sets > continuescan=false (any such code, not just _bt_check_rowcompare NULL > row member code) has to make sure that starting the next primitive > index scan will actually allow the top-level scan to make forward > progress -- the new primscan needs to land on some later leaf page. > Right now, _bt_check_rowcompare doesn't guarantee it in a way that I'd > consider truly robust. We need that guarantee to avoid these > cycles/infinite looping. AFAIK there are no bugs of that general > nature on Postgres 18 Correction: there *is* a live bug like this on Postgres 18/HEAD, which involves simple scalar inequalities with an incomplete opfamily. Attached v3 fixes the bug in its new 0001-* patch. (Note that v3's 0002-* patch previously appeared as 0001-* in v2 and v1). The bug in question involves redundant keys that could not be eliminated by preprocessing due to a lack of available cross-type support. Quals with certain combinations of redundant scalar keys can cycle through the index being scanned ceaselessly. The scan fails to make any real progress, again and again. Unlike the row compare case, we lack even kludgey handling that avoids the problem within _bt_set_startikey. There's a live bug here, so I have to act. It seems to make sense to apply everything on HEAD/Postgres 18, and be done with any question of getting stuck like this/infinite cycling from resetting the scan's arrays to get a clean slate. But I would certainly welcome other opinions on this. Problem statement ================= I have confirmed the existence of the live bug using fuzz-testing, combined with hard-coding that makes _bt_compare_scankey_args return false unconditionally at the point where we're supposed to compare each sk_argument from a pair of scan keys against the same attribute (that's the easiest way to test problems in this area). With the right combination of contradictory runtime keys, and with the right index scan/index, we can get stuck. When we get stuck its because we apply forcenonrequired mode in _bt_readpage, reset the arrays at the end of the same _bt_readpage, start another primitive index scan when pstate.finaltup is considered, and then arrive at the same leaf page we ended on last time. We'll then start the cycle anew, without any hope of ever making useful progress. Not good. Fundamentally, the problem here is that _bt_advance_array_keys expects certain things from _bt_preprocess_keys and _bt_first that they won't always honor. My approach in 0001-* more or less makes things always work in the way that _bt_advance_array_keys already expects, rather than changing _bt_advance_array_keys (or related code such as _bt_set_startikey). More on that below, under "bugfix". Background info on _bt_advance_array_keys' expectations ======================================================= Notably, _bt_advance_array_keys would like to be able to determine if _bt_first will be able to perform another primitive index scan based on lower-order required inequalities marked required in the opposite-to-scan direction only (i.e. an > or >= key when scanning forwards, a < or <= key when scanning backwards). That detail is important to my repro of the bug. _bt_advance_array_keys expects to be able to apply required-in-opposite-direction inequalities to reason about how _bt_first will behave the next time it is called. During a forwards scan with a qual "WHERE a IN(1, 2, 3) AND b > 5", _bt_advance_array_keys needs to know how "b > 5" will affect _bt_first should it choose to schedule another primscan. It needs to decide whether or not to actually schedule such a primscan. It might make sense for _bt_advance_array_keys to stick to the leaf level for the time being upon reaching the threshold between "a = 1" and "a = 2" tuples -- the first "a = 2 AND b > 5" might be on the same leaf page. OTOH, scheduling a new primscan is far preferable when it'll allow _bt_first to land the scan on a much later leaf page (we'll skip over many irrelevant leaf pages). In general, the first leaf page/position that the first tuple "a = 2 AND b > 5" appears on might be many leaf pages after the first leaf page that the first tuple "a =2" appears on. It all depends. This scheme works well if there are no redundant > keys (which is very much the common case, since incomplete opfamilies are rare). It tacitly assumes that _bt_first shares the same definition of "required in the opposite-to-scan direction key". But that isn't quite true right now. Bugfix ====== My 0001-* bugfix makes nbtree preprocessing deal with a qual like "WHERE a IN(1, 2, 3) AND b > 5 AND c > 10_0000::bigint" with an incomplete opfamily on "b" by leaving so->keyData[] in a state that reflects the redundancy. _bt_preprocess_keys will now make sure that keys left in so->keyData[] still appear in the most useful order. This avoids creating confusion elsewhere. Preprocessing will arbitrarily decide that only the first "b >" condition gets to be marked required, while *unmarking* the requiredness marking on the second "b >" condition. That way _bt_first doesn't get to make its own independent choice about initial positioning keys, based on ever-so-slightly different rules. Rather, _bt_first has to agree with everybody else about which > or >= key should be used, and about which < or <= key should be used (and even about which "=" key should be used). _bt_first does what preprocessing tells it to do. Just like _bt_advance_array_keys will. No ifs, no buts. Placing nonrequired keys at the end of so->keyData[] ---------------------------------------------------- Preprocessing also reorders so->keyData[] such that the "unlucky" keys that _bt_preprocess_keys *doesn't* pick go to the end of the array -- we want to keep them out of the way, at least until they're needed. Note that this makes it possible for keys to appear out of index attribute order, but that's okay. Nonrequired keys don't need to be in any particular order. Reordering so->keyData[] in this way eliminates any risk that some early nonrequired key on "b" will stop us from getting to some later required key of interest on "c". Not in _bt_first, not in _bt_checkkeys or _bt_advance_array_keys. Nowhere. In general, the rule going forward is that all nbtree code (barring preprocessing code) can assume that redundant keys don't really exist, as long as they're marked required (which virtually all keys are now, thanks to the skip scan/skip array work). There can be at most one inequality key marked SK_BT_REQFWD and one inequality key marked SK_BT_REQBKWD per index attribute -- no matter what. nbtree is blasé about keeping around redundant scan keys right now. That seems like it has the potential to cause more bugs in the future. It's easy to forget about redundant required keys, and they're rare (except perhaps when row compares are used, since preprocessing has no hardly any smarts about redundancy that happens to involve row compares right now). For a recent example a bug caused by such an oversight, see bugfix commit 7e6fb5da. New _bt_first behavior with row compares ---------------------------------------- Currently, on HEAD, _bt_first is prepared to use a seemingly random mix of the first row compare member followed by some later redudant-ish scalar on some later column (not the key on the same column that appears in the row compare). This v3's row compare patch now avoids that behavior. That happens automatically in v3, by virtue of the fact that there can't be a row compare that's marked required in the opposite-to-scan scan direction followed by some other key that's also marked required (skip arrays don't accept row compare inequalities, so it's just not possible). In short, if _bt_first gets to use a row compare to build its insertion scan key at all, then it *must* use all of the row members; it cannot use any additional lower-order keys. (Actually, _bt_first can only do that with individual row members marked required). Again, there needs to be symmetry here (perhaps it isn't strictly necessary to go exactly this far, but the row compare change to _bt_first *is* necessary). Again, the rules that decide when we'll start an index scan when scanning in one particular scan direction need to be symmetric with the rules that decide when we'll end a index scan with the same qual moving in the opposite scan direction. Impact on Postgres 17 ===================== On Postgres 17, even a qual with redundancy such as "WHERE a IN(1, 2, 3) AND b > 5 AND c > 10_0000::bigint" shouldn't get infinite cycling with an incomplete opfamily on "b". That can only happen on HEAD because recent bugfix commit 5f4d98d4 added a _bt_start_array_keys call to _bt_readpage, to reset the scan's arrays after we used forcenonrequired mode to read all the tuples on the page. Postgres 17 *does* reset the array keys through a call to _bt_start_array_keys, to get a clean slate. But that only happens in the context of mark/restore for merge joins, which seems robust against infinite cycling of the kind I'm seeing on HEAD. -- Peter Geoghegan
Attachment
pgsql-hackers by date: