Re: Adding skip scan (including MDAM style range skip scan) to nbtree - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Adding skip scan (including MDAM style range skip scan) to nbtree |
Date | |
Msg-id | CAH2-Wz=G+amASrOh3FJ+T1N0B_2bhAKbB5aq-f3YUCXk-sVVLQ@mail.gmail.com Whole thread Raw |
In response to | Re: Adding skip scan (including MDAM style range skip scan) to nbtree (Matthias van de Meent <boekewurm+postgres@gmail.com>) |
Responses |
Re: Adding skip scan (including MDAM style range skip scan) to nbtree
|
List | pgsql-hackers |
On Mon, Mar 17, 2025 at 6:51 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > This hasn't changed meaningfully in this patch, but I noticed that > pstate.finaltup is never set for the final page of the scan direction > (i.e. P_RIGHTMOST or P_LEFTMOST for forward or backward, > respectively). If it isn't used more than once after the first element > of non-P_RIGHTMOST/LEFTMOST pages, why is it in pstate? Or, if it is > used more than once, why shouldn't it be used in We don't set pstate.finaltup on either the leftmost or rightmost page (nothing new, this is all from the Postgres 17 SAOP patch) because we cannot possibly need to start another primitive index scan from that point. We only use pstate.finaltup to determine if we should start a new primitive index scan, every time the scan's array keys advance (except during "sktrig_required=false" array advancement, and barring the case where we don't have to check pstate.finaltup because we see that the new set of array keys are already an exact match for the tuple passed to _bt_advance_array_keys). In short, pstate.finaltup is used during a significant fraction of all calls to _bt_advance_array_keys, and there is no useful limit on the number of times that pstate.finaltup can be used per _bt_readpage. Although you didn't say it in your email (you said it to me on our most recent call), I believe that you're asking about pstate.finaltup for reasons that are more related to 0003-* than to 0001-*. As I recall, you asked me about why it was that the 0003-* patch avoids _bt_readpage's call to _bt_skip_ikeyprefix on the leftmost or rightmost leaf page (i.e. a page that lacks a pstate.finaltup). You were skeptical about that -- understandably so. To recap, 0003-* avoids calling _bt_skip_ikeyprefix on the rightmost (and leftmost) page because it reasons that activating that optimization is only okay when we know that we'll be able to "recover" afterwards, during the finaltup _bt_checkkeys call. By "recover", I mean restore the invariant for required array keys: that they track our progress through the key space. It felt safer and easier for 0003-* to just not call _bt_skip_ikeyprefix on a page that has no finaltup -- that way we won't have anything that we need to recover from, when we lack the pstate.finaltup that we generally expect to use for this purpose. This approach allowed me to add the following assertions a little bit later on, right at the end of _bt_readpage (quoting 0003-* here): @@ -1993,6 +2031,16 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum, so->currPos.itemIndex = MaxTIDsPerBTreePage - 1; } + /* + * As far as our caller is concerned, the scan's arrays always track its + * progress through the index's key space. + * + * If _bt_skip_ikeyprefix told us to temporarily treat all scan keys as + * nonrequired (during a skip scan), then we must recover afterwards by + * advancing our arrays using finaltup (with !pstate.forcenonrequired). + */ + Assert(pstate.ikey == 0 && !pstate.forcenonrequired); + return (so->currPos.firstItem <= so->currPos.lastItem); } I've actually come around to your point of view on this (or what I thought was your PoV from our call). That is, I now *think* that it would be better if the code added by 0003-* called _bt_skip_ikeyprefix, without regard for whether or not we'll have a finaltup _bt_checkkeys call to "recover" (i.e. whether we're on the leftmost or rightmost page shouldn't matter). My change in perspective on this question is related to another change of perspective, on the question of whether we actually need to call _bt_start_array_keys as part of "recovering/restoring the array invariant", just ahead of the finaltup _bt_checkkeys call. As you know, 0003-* calls _bt_start_array_keys in this way, but that now seems like overkill. It can have undesirable side-effects when the array keys spuriously appear to advance, when in fact they were restarted via the _bt_start_array_keys call, only to have their original values restored via the finaltup call that immediately follows. Here's what I mean about side-effects: We shouldn't allow _bt_advance_array_keys' primitive index scan scheduling logic to falsely believe that the array keys have advanced, when they didn't really advance in any practical sense. The commit message of 0003-* promises that its optimization cannot influence primitive index scan scheduling at all, which seems like a good general principle for us to follow. But I recently discovered that that promise was subtly broken, which I tied to the way that 0003-* calls _bt_start_array_keys (this didn't produce wrong answers, and didn't even really hurt performance, but it seems wonky and hard to test). So I now think that I need to expect more from _bt_skip_ikeyprefix and its pstate.forcenonrequired optimization, so that my promise about primscan scheduling is honored. Tying it back to your concern, once I do that (once I stop calling _bt_start_array_keys in 0003-* to "hard reset" the arrays), I can also stop caring about finaltup being set on the rightmost page, at the point where we decide if _bt_skip_ikeyprefix should be called. Here's how I think that this will be safe: Obviously, we can't expect _bt_skip_ikeyprefix/pstate.forcenonrequired mode to maintain the scan's required arrays in the usual way -- the whole idea in 0003-* is to stop properly maintaining the arrays, until right at the end of the _bt_readpage call, so as to save cycles. However, it should be possible to teach _bt_skip_ikeyprefix to not leave the array keys in an irredeemably bad state -- even when no finaltup call is possible during the same _bt_readpage. And, even when the scan direction changes at an inconvenient time. The next revision (v29) is likely to strengthen the guarantees that _bt_skip_ikeyprefix makes about the state that it'll leave the scan's array keys in, so that its _bt_readpage caller can be laxer about "fully undoing" its call to _bt_skip_ikeyprefix. The invariants we need to restore only really apply when the scan needs to continue in the same scan direction, at least one more page. In short, as long as the array keys can never "go backwards" (relative to the current scan direction), then we'll be able to recover during the next conventional call to _bt_checkkeys (meaning the next !pstate.forcenonrequired call) -- even if that next call should happen on some other page (in practice it is very unlikely that it will, but we can still be prepared for that). While it's true that _bt_advance_array_keys (with sktrig_required=true) always promises to advance the array keys to the maximum possible extent that it can know to be safe, based on the caller's tuple alone, that in itself doesn't obligate _bt_readpage to make sure that _bt_advance_array_keys will be called when the top-level scan is over. We have never expected _bt_advance_array_keys to *reliably* reach the final set of array keys at the end of the scan (e.g., this won't happen when the index is completely empty, since we'll never call _bt_readpage in the first place). > In forward scan mode, recovery from forcenonrequired happens after the > main loop over all page items. In backward mode, it's in the loop: > > > + if (offnum == minoff && pstate.forcenonrequired) > > + { > > + Assert(so->skipScan); > > I think there's a comment missing that details _why_ we do this; > probably something like: > > /* > * We're about to process the final item on the page. > * Un-set forcenonrequired, so the next _bt_checkkeys will > * evaluate required scankeys and signal an end to this > * primitive scan if we've reached a stopping point. > */ I think that the right place to talk about this is above _bt_skip_ikeyprefix itself. > In line with that, could you explain a bit more about the > pstate.forcenonrequired optimization? I _think_ it's got something to > do with "required" scankeys adding some overhead per scankey, which > can be significant with skipscan evaluations and ignoring the > requiredness can thus save some cycles, but the exact method doesn't > seem to be very well articulated. The main benefit of the _bt_skip_ikeyprefix optimization is that it allows us to skip a pstate.ikey prefix of scan keys in many important cases. But that is not compatible with certain existing _bt_advance_array_keys "sktrig_required=true" optimizations. Most notably, we cannot assume that the array keys perfectly track our progress through the index's key space when calling _bt_advance_array_keys with "sktrig_required=false". In particular, it would be wrong to allow the SAOP array binary search cur_elem_trig=true optimization (see _bt_binsrch_array_skey) to be used. We also don't want to attempt to end the primscan during "sktrig_required=false" calls to _bt_advance_array_keys (nothing about that is new here, it's just that _bt_skip_ikeyprefix now temporarily forces the scan to behave this way, dynamically, rather than it being static behavior that is fixed for the whole scan). The underlying problem is that "sktrig_required=false" array advancement cannot precisely reason about the relationship between the scan's progress and the current required array key positions. For example, with a query "WHERE a BETWEEN 0 AND 100 AND b in (42, 44)", on a page whose "a" attribute values all satisfy the range qual on "a" (i.e. with pstate.ikey = 1), our "a" skip array won't advance at all (if we didn't use the _bt_skip_ikeyprefix optimization then we'd only ever do "sktrig_required=false" advancement, and the skip array might advance several times within a page, but we're ignoring "a" here). We cannot reuse work across "b" SAOP binary searches, because in general we're not paying attention to "a" at all -- and so we won't even try to roll over "b" when the value of "a" advances (we're just looking at "b", never "a"). > > _bt_skip_ikeyprefix > > I _think_ it's worth special-casing firstchangingattnum=1, as in that > case we know in advance there is no (immediate) common ground between > the index tuples and thus any additional work we do towards parsing > the scankeys would be wasted - except for matching inequality bounds > for firstchangingatt, or matching "open" skip arrays for a prefix of > attributes starting at firstchangingattnum (as per the > array->null_elem case). Not sure what you mean by "special-casing firstchangingattnum=1"? What "additional work we do towards parsing the scankeys" are you concerned about? It's fairly common for firstchangingattnum=1, even when the _bt_skip_ikeyprefix optimization is working well. For example, that's what'd happen with the example query I just gave (a query "WHERE a BETWEEN 0 AND 100 AND b in (42, 44)" can skip "a" by setting pstate.ikey=1, provided all of the "a" attribute values on the page are within the range of the skip array). > I also notice somed some other missed opportunities for optimizing > page accesses: > > > + if (key->sk_strategy != BTEqualStrategyNumber) > > The code halts optimizing "prefix prechecks" when we notice a > non-equality key. It seems to me that we can do the precheck on shared > prefixes with non-equality keys just the same as with equality keys; > and it'd improve performance in those cases, too. Yeah, I was thinking of doing this already (though not for RowCompare inequalities, which would be hard to evaluate from here). It makes sense because it's exactly the same case as the range skip array case, really -- why not just do it the same way? > > + if (!(key->sk_flags & SK_SEARCHARRAY)) > > + if (key->sk_attno < firstchangingattnum) > > + { > > + if (result == 0) > > + continue; /* safe, = key satisfied by every tuple */ > > + } > > + break; /* pstate.ikey to be set to scalar key's ikey */ > > This code finds out that no tuple on the page can possibly match the > scankey (idxtup=scalar returns non-0 value) but doesn't (can't) use it > to exit the scan. I think that's a missed opportunity for > optimization; now we have to figure that out for every tuple in the > scan. Same applies to the SAOP -array case (i.e. non-skiparray). Maybe, but it's not much of a missed opportunity. It doesn't guarantee that the scan can end in the case of a SAOP (the very next leaf page could satisfy the same scan key, given a SAOP array with "gaps" in the elements). So it can only really end the scan with a scalar = key -- though never when it is preceded by a skip array (doesn't matter if the skip array is definitely satisfied/has only one distinct attribute value on the page). Is this idea related to your previous idea involving "special-casing firstchangingattnum=1"? If I was going to do something like this, I think that it'd work by backing out of applying the optimization entirely. Right now, 0003-* applies the optimization whenever _bt_readpage decides to call _bt_skip_ikeyprefix, regardless of the details after that (it would be easy to teach _bt_skip_ikeyprefix to decide against applying its optimization entirely, but testing seems to show that it always makes sense to go ahead when _bt_skip_ikeyprefix is called, regardless of what _bt_skip_ikeyprefix sees on the page). > Thank you for working on this. Thanks for the review! -- Peter Geoghegan
pgsql-hackers by date: