Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Msg-id: CAH2-WzkUaZesPygO4DCv4F7QG3BpMMfDmB42S+C_KPn-jaa+yA@mail.gmail.com
In response to: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses:
  Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
  Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
List: pgsql-hackers
On Fri, Mar 8, 2024 at 9:00 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> I've attached v14, where 0001 is v13, 0002 is a patch with small
> changes + some large comment suggestions, and 0003 which contains
> sorted merge join code for _bt_merge_arrays.

I have accepted your changes from 0003. Agreed that it's better that
way. It's at least a little faster, but not meaningfully more
complicated. This is part of my next revision, v15, which I've
attached (along with a test case that you might find useful, explained
below).

> I'll try to work a bit on v13/14's _bt_preprocess_keys, and see what I
> can make of it.

That's been the big focus of this new v15, which now goes all out with
teaching _bt_preprocess_keys how to deal with arrays. We now do
comprehensive normalization of even very complicated combinations of
arrays and non-array scan keys in this version. For example, consider
this query:

select * from multi_test
where
  a = any('{1, 2, 3, 4, 5}'::int[]) and a > 2::bigint
  and a = any('{2, 3, 4, 5, 6}'::bigint[]) and a < 6::smallint
  and a = any('{2, 3, 4, 5, 6}'::smallint[]) and a < 4::int;

This has a total of 6 input scankeys -- 3 of which are arrays. But by
the time v15's _bt_preprocess_keys is done with it, it'll have only 1
scan key -- which doesn't even have an array (not anymore). And so we
won't even need to advance the array keys one single time -- there'll
simply be no array left to advance. In other words, it'll be just as
if the query had been written this way from the start:

select * from multi_test where a = 3::int;

(Though of course the original query will spend more cycles on
preprocessing, compared to this manually simplified variant.)

In general, preprocessing can now simplify queries like this to the
maximum extent possible (without bringing the optimizer into it), no
matter how much crud like this is added -- even including adversarial
cases, with massive arrays that have some amount of
redundancy/contradictoriness to them.
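To make the normalization concrete, here's a minimal Python sketch of
what happens to the example qual above. This is not the actual C code
in _bt_preprocess_keys; simplify_qual is an invented name, and it only
models the single-column case (intersect the = ANY() arrays, then
apply the tightest > and < bounds):

```python
def simplify_qual(arrays, lower_exclusive, upper_exclusive):
    """Intersect all = ANY(...) arrays, then drop elements that fail
    the combined strict bounds; assumes all keys are on one column."""
    remaining = set(arrays[0])
    for arr in arrays[1:]:
        remaining &= set(arr)
    lo = max(lower_exclusive)   # tightest "a > x" bound
    hi = min(upper_exclusive)   # tightest "a < y" bound
    return sorted(v for v in remaining if lo < v < hi)

# The 6 input scan keys from the example query:
keys = simplify_qual(
    arrays=[[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [2, 3, 4, 5, 6]],
    lower_exclusive=[2],        # a > 2::bigint
    upper_exclusive=[6, 4],     # a < 6::smallint, a < 4::int
)
print(keys)  # [3] -- equivalent to "WHERE a = 3"
```

A single remaining value means the array degenerates to a plain
equality key, which matches the "no array left to advance" outcome
described above.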
It turned out to not be terribly difficult to teach
_bt_preprocess_keys everything it could possibly need to know about
arrays, so that it can operate on them directly, as a variant of the
standard equality strategy (we do still need _bt_preprocess_array_keys
for basic preprocessing of arrays, mostly just merging them). This is
better overall (in that it gets every little subtlety right), but it
also simplified a number of related issues. For example, there is no
longer any need to maintain a mapping between so->keyData[]-wise scan
keys (output scan keys) and scan->keyData[]-wise scan keys (input scan
keys). We can just add a step to fix up the references at the end of
_bt_preprocess_keys, to make life easier within _bt_advance_array_keys.

This preprocessing work should all be happening during planning, not
during query execution -- that's the design that makes the most sense.
This is something we've discussed in the past in the context of skip
scan (see my original email to this thread for the reference). It
would be especially useful for the very fancy kinds of preprocessing
that are described by the MDAM paper, like using an index scan for a
NOT IN() list/array (this can actually make sense with a low
cardinality index). The structure for preprocessing that I'm working
towards (especially in v15) sets the groundwork for making those
shifts in the planner, because we'll no longer treat each array
constant as its own primitive index scan during preprocessing. Right
now, on HEAD, preprocessing with arrays kinda treats each array
constant like the parameter of an imaginary inner index scan, from an
imaginary nested loop join. But the planner really needs to operate on
the whole qual, all at once, including any arrays.
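The array-merging step mentioned above (sorted merge join style, per
0003) can be pictured with a short Python sketch. This is an
illustrative stand-in, not the real _bt_merge_arrays signature; it
assumes both inputs are already sorted and de-duplicated, so one
linear two-pointer pass suffices (no binary searches, no nested
loops):

```python
def merge_arrays(a, b):
    """Return the sorted intersection of two sorted, duplicate-free
    lists, advancing whichever cursor points at the smaller value."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

print(merge_arrays([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))  # [2, 3, 4, 5]
```

The merge is O(len(a) + len(b)), which is why it comes out at least a
little faster than repeated lookups while staying simple.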
An actual nestloop join's inner index scan naturally cannot do that,
and so might still require runtime/dynamic preprocessing in a world
where that mostly happens in the planner -- but that's clearly not
appropriate for arrays ("WHERE a = 5" and "WHERE a in(4, 5)" are
almost the same thing, and so should be handled in almost the same way
by preprocessing).

> > +_bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
> [...]
> > +        if (!(cur->sk_flags & SK_SEARCHARRAY) &&
> > +            cur->sk_strategy != BTEqualStrategyNumber)
> > +            continue;
>
> This should use ||, not &&

Fixed. Yeah, that was a bug.

> > +        if (readpagetup || result != 0)
> > +        {
> > +            Assert(result != 0);
> > +            return false;
> > +        }
>
> I'm confused about this. By asserting !readpagetup after this exit, we
> could save a branch condition for the !readpagetup result != 0 path.
> Can't we better assert the inverse just below, or is this specifically
> for defense-in-depth against bugs? E.g.
>
> +        if (result != 0)
> +            return false;
> +
> +        Assert(!readpagetup);

Yeah, that's what it was -- defensiveness. It seems slightly better
as-is, because you'll get an assertion failure if a "readpagetup"
caller gets "result == 0". That's never supposed to happen (if it did
happen then our ORDER proc wouldn't be in agreement with what our =
operator indicated about the same tuple attribute value moments
earlier, inside _bt_check_compare).

> > +    /*
> > +     * By here we have established that the scan's required arrays were
> > +     * advanced, and that they haven't become exhausted.
> > +     */
> > +    Assert(arrays_advanced || !arrays_exhausted);
>
> Should use &&, based on the comment.

Fixed by getting rid of the arrays_exhausted variable, which wasn't
adding much anyway.

> > +    * We generally permit primitive index scans to continue onto the next
> > +    * sibling page when the page's finaltup satisfies all required scan keys
> > +    * at the point where we're between pages.
> This should probably describe that we permit primitive scans with
> array keys to continue until we get to the sibling page, rather than
> this rather obvious and generic statement that would cover even the
> index scan for id > 0 ORDER BY id asc; or this paragraph can be
> removed.

It's not quite obvious. The scan's array keys change as the scan makes
progress, up to once per tuple read. But even if it really was
obvious, it's really supposed to frame the later discussion --
discussion of those less common cases where this isn't what happens.
The exceptions. These exceptions are:

1. When a required scan key is deemed "satisfied" only because its
value was truncated in a high key finaltup.

(Technically this is what _bt_checkkeys has always done, but we're
much more sensitive to this stuff now, because we won't necessarily
get to make another choice about starting a new primitive index scan
for a long time.)

2. When we apply the has_required_opposite_direction_only stuff, and
decide to start a new primitive index scan, even though technically
all of our required-in-this-direction scan keys are still satisfied.

Separately, there is also the potential for undesirable interactions
between 1 and 2, which is why we don't let them mix. (We have the "if
(so->scanBehind && has_required_opposite_direction_only) goto
new_prim_scan" gating condition.)

> Further notes:
>
> I have yet to fully grasp what so->scanBehind is supposed to mean. "/*
> Scan might be behind arrays? */" doesn't give me enough hints here.

Yes, it is complicated. The best explanation is the one I've added to
_bt_readpage, next to the precheck. But that does need more work.

Note that the so->scanBehind thing solves two distinct problems for
the patch (related, but still clearly distinct). These problems are:

1.
It makes the existing precheck/continuescan optimization in
_bt_readpage safe -- we'd sometimes get wrong answers to queries if we
didn't limit application of the optimization to !so->scanBehind cases.

I have a test case that proves that this is true -- the one I
mentioned in my introduction. I'm attaching that as
precheck_testcase.sql now. It might help you to understand
so->scanBehind, particularly this point 1 about the basic correctness
of the precheck thing (possibly point 2 also).

2.
It is used to speculatively visit the next leaf page in corner cases
where truncated -inf attributes from the high key are deemed
"satisfied".

Generally speaking, we don't want to visit the next leaf page unless
we're already 100% sure that it might have matches (if any page has
matches for our current array keys at all, that is). But in a certain
sense we're only guessing. It isn't guaranteed (and fundamentally
can't be guaranteed) to work out once on the next sibling page. We've
merely assumed that our array keys satisfy truncated -inf columns,
without really knowing what it is that we'll find on the next page
when we actually visit it (we're not going to see -inf in the
non-pivot tuples on the next page; we'll see some real/non-sentinel
low-sorting value).

We have good practical reasons to not want to treat them as
non-matching (though that would be correct), and to take this limited
gamble (we can discuss that some more if you'd like). Once you accept
that we have to do this, it follows that we need to be prepared to:

A. Notice that we're really just guessing in the sense I've described
(before leaving for the next sibling leaf page), by setting
so->scanBehind. We'll remember that it happened.

and:

B. Notice that that limited gamble didn't pay off once on the
next/sibling leaf page, so that we can cut our losses and start a new
primitive index scan at that point.
We do this by checking so->scanBehind (along with the sibling page's
high key), once on the sibling page. (We don't need explicit handling
for the case when it works out, as it almost always will.)

If you want to include discussion of problem 1 here too (not just
problem 2), then I should add a third thing that we need to notice:

C. Notice that we can't do the precheck thing once in _bt_readpage,
because it'd be wrong to allow it.

> I find it weird that we call _bt_advance_array_keys for non-required
> sktrig. Shouldn't that be as easy as doing a binary search through the
> array? Why does this need to hit the difficult path?

What difficult path? Advancing non-required arrays isn't that
different to advancing required ones. We will never advance required
arrays when called just to advance a non-required one, obviously. But
we can do the opposite. In fact, we absolutely have to advance
non-required arrays (as best we can) when advancing a required one (or
when the call was triggered by a non-array required scan key).

That said, it could have been clearer than it was in earlier versions.
v15 makes the difference between the non-required scan key trigger and
required scan key trigger cases clearer within _bt_advance_array_keys.

--
Peter Geoghegan