Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Msg-id: CAH2-Wz=dbKvQ-wB24VhHzF_gjyckZ531yYDJ6JNfU8aBC+2e6w@mail.gmail.com
In response to: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
List: pgsql-hackers
On Mon, Nov 27, 2023 at 5:39 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> - +1 on the general idea. Hard to see any downsides if implemented right.

Glad you think so. The "no possible downside" perspective is one that the planner sort of relies on, so this isn't just a nice-to-have -- it might actually be a condition of committing the patch. It's important that the planner can be very aggressive about using SAOP index quals, without us suffering any real downside at execution time.

> - This changes the meaning of amsearcharray==true to mean that the
> ordering is preserved with ScalarArrayOps, right? You change B-tree to
> make that true, but what about any out-of-tree index AM extensions? I
> don't know if any such extensions exist, and I don't think we should
> jump through any hoops to preserve backwards compatibility here, but
> probably deserves a notice in the release notes if nothing else.

My interpretation is that the planner changes affect amcanorder + amsearcharray index AMs, but have no impact on mere amsearcharray index AMs. If anything this is a step *away* from knowing about nbtree implementation details in the planner (though the planner's definition of amcanorder is very close to the behavior from nbtree, down to things like knowing all about nbtree strategy numbers).

The planner changes from the patch are all subtractive -- I'm removing kludges that were added by bug fix commits, things that weren't in the original feature commit at all.

I used the term "my interpretation" here because it seems hard to think of this in abstract terms, and to write a compatibility note for this imaginary audience. I'm happy to go along with whatever you want, though. Perhaps you can suggest a wording for this?

> - You use the term "primitive index scan" a lot, but it's not clear to
> me what it means. Does one ScalarArrayOps turn into one "primitive index
> scan"? Or each element in the array turns into a separate primitive
> index scan? Or something in between? Maybe add a new section to the
> README explaining how that works.

The term "primitive index scan" refers to the thing that happens each time _bt_first() is called -- with and without the patch. In other words, it's what happens when pg_stat_all_indexes.idx_scan is incremented.

You could argue that that's not quite the right thing to be focusing on with this new design. But it has precedent going for it. As I said, it's the thing that pg_stat_all_indexes.idx_scan counts, which is a pretty exact and tangible thing. So it's consistent with historical practice, but also with what other index AMs do when executing ScalarArrayOps non-natively.

> - _bt_preprocess_array_keys() is called for each btrescan(). It performs
> a lot of work like cmp function lookups and deconstructing and merging
> the arrays, even if none of the SAOP keys change in the rescan. That
> could make queries with nested loop joins like this slower than before:
> "select * from generate_series(1, 50) g, tenk1 WHERE g = tenk1.unique1
> and tenk1.two IN (1,2);".

But that's nothing new. _bt_preprocess_array_keys() isn't a new function, and the way that it's called isn't new in any way.

That said, I certainly agree that we should be worried about any added overhead in _bt_first for nested loop joins with an inner index scan. In my experience that can be an important issue. I actually have a TODO item about this already. It needs to be included in my work on performance validation, on general principle.

> - nbtutils.c is pretty large now. Perhaps create a new file
> nbtpreprocesskeys.c or something?

Let me get back to you on this.

> - You and Matthias talked about an implicit state machine. I wonder if
> this could be refactored to have a more explicit state machine. The
> state transitions and interactions between _bt_checkkeys(),
> _bt_advance_array_keys() and friends feel complicated.

I agree that it's complicated. That's the main problem that the patch has, by far.
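[Editor's note: for readers following along, here is a rough idea of what a "more explicit state machine" could look like. This is a toy Python sketch with invented names (ArrayScanState, next_state); it only loosely mirrors the transitions between _bt_checkkeys() and _bt_advance_array_keys(), and is not taken from the patch.]

```python
# Toy sketch of an explicit state machine for array-key advancement.
# All names are invented for illustration; this is not nbtree code.
from enum import Enum, auto

class ArrayScanState(Enum):
    MATCHING = auto()        # current tuple satisfies all array keys
    ADVANCE_ARRAYS = auto()  # tuple is past the current keys; advance them in place
    NEW_PRIMITIVE = auto()   # keys advanced far enough to warrant a new _bt_first()
    SCAN_DONE = auto()       # arrays exhausted in the current scan direction

def next_state(tuple_matches, arrays_exhausted, needs_reposition):
    """Pick the next state from flags that the key-checking code computes."""
    if tuple_matches:
        return ArrayScanState.MATCHING
    if arrays_exhausted:
        return ArrayScanState.SCAN_DONE
    if needs_reposition:
        return ArrayScanState.NEW_PRIMITIVE
    return ArrayScanState.ADVANCE_ARRAYS

print(next_state(False, False, True).name)
```

Making each transition a named state like this is what would let invariants be asserted at each step, per the discussion above.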
It used to be even more complicated, but it's hard to see a way to make it a lot simpler at this point. If you can think of a way to simplify it then I'll definitely give it a go.

Can you elaborate on "more explicit state machine"? That seems like it could have value by adding more invariants, and making things a bit more explicit in one or two areas. It could also help us to verify that they hold from assertions. But that isn't really the same thing as simplification. I wouldn't use that word, at least.

> > + <note>
> > + <para>
> > + Every time an index is searched, the index's
> > + <structname>pg_stat_all_indexes</structname>.<structfield>idx_scan</structfield>
> > + field is incremented. This usually happens once per index scan node
> > + execution, but might take place several times during execution of a scan
> > + that searches for multiple values together. Only queries that use certain
> > + <acronym>SQL</acronym> constructs to search for rows matching any value
> > + out of a list (or an array) of multiple scalar values are affected. See
> > + <xref linkend="functions-comparisons"/> for details.
> > + </para>
> > + </note>
>
> Is this true even without this patch? Maybe commit this separately.

Yes, it is. The patch doesn't actually change anything in this area.

However, something in this area is new: it's a bit weird (but still perfectly consistent and logical) that the count shown by pg_stat_all_indexes.idx_scan will now often be influenced by low-level implementation details. Things that are fairly far removed from the SQL query now affect that count -- that part is new. That's what I had in mind when I wrote this, FWIW.

> The "Only queries ..." sentence feels difficult. Maybe something like
> "For example, queries using IN (...) or = ANY(...) constructs.".

I'll get back to you on this.

> > * _bt_preprocess_keys treats each primitive scan as an independent piece of
> > * work. That structure pushes the responsibility for preprocessing that must
> > * work "across array keys" onto us. This division of labor makes sense once
> > * you consider that we're typically called no more than once per btrescan,
> > * whereas _bt_preprocess_keys is always called once per primitive index scan.
>
> "That structure ..." is a garden-path sentence. I kept parsing "that
> must work" as one unit, the same way as "that structure", and it didn't
> make sense. Took me many re-reads to parse it correctly. Now that I get
> it, it doesn't bother me anymore, but maybe it could be rephrased.

I'll do that ahead of the next revision.

> Is there _any_ situation where _bt_preprocess_array_keys() is called
> more than once per btrescan?

No.

Note that we don't know the scan direction within _bt_preprocess_array_keys(). We need a separate function to set up the array keys to their initial positions (this is nothing new).

> > /*
> > * Look up the appropriate comparison operator in the opfamily.
> > *
> > * Note: it's possible that this would fail, if the opfamily is
> > * incomplete, but it seems quite unlikely that an opfamily would omit
> > * non-cross-type comparison operators for any datatype that it supports
> > * at all. ...
> > */
>
> I agree that's unlikely. I cannot come up with an example where you
> would have cross-type operators between A and B, but no same-type
> operators between B and B. For any real-world opfamily, that would be an
> omission you'd probably want to fix.
>
> Still I wonder if we could easily fall back if it doesn't exist? And
> maybe add a check in the 'opr_sanity' test for that.

I'll see about an opr_sanity test.

> In _bt_readpage():
>
> > if (!so->firstPage && !numArrayKeys && minoff < maxoff)
> > {
>
> It's sad to disable this optimization completely for array keys. It's
> actually a regression from current master, isn't it? There's no
> fundamental reason we couldn't do it for array keys so I think we should
> do it.
I'd say whether or not there's any kind of regression in this area is quite ambiguous, though in a way that isn't really worth discussing now. If it makes sense to extend something like this optimization to array keys (or to add a roughly equivalent optimization), then we should do it. Otherwise we shouldn't.

Note that the patch actually disables two distinct and independent optimizations when the scan has array keys. Both were added by recent commit e0b1ee17, but they are still independent things. They are:

1. This skipping thing inside _bt_readpage, which you've highlighted.

This is only applied on the second or subsequent leaf page read by the scan. Right now, in the case of a scan with array keys, that means the second or subsequent page from the current primitive index scan -- which doesn't seem particularly principled to me. I'd need to invent a heuristic that works with my design to adapt the optimization. Plus I'd need to be able to invalidate the precheck whenever the array keys advanced. And I'd probably need a way of guessing ahead of time whether or not the arrays are likely to advance, so that the precheck doesn't almost always go to waste.

Note that all required scan keys are relevant here. I like to think of plain required equality strategy scan keys without any array as "degenerate single value arrays". Something similar can be said of inequality strategy required scan keys (those required in the *current* scan direction), too. So it's not as if I can "just do the precheck stuff for the non-array scan keys". All required scan keys are virtually the same thing as required array-type scan keys -- they can trigger "roll over", affecting array key advancement for the scan keys that are associated with arrays.

2. The optimization that has _bt_checkkeys skip non-required scan keys that are *only* required in the direction *opposite* the current scan direction -- this can work even without any precheck from _bt_readpage.

Note that this second optimization relies on various behaviors in _bt_first() that make it impossible for _bt_checkkeys() to ever see a tuple that could fail to satisfy such a scan key -- we must always have passed over non-matching tuples, thanks to _bt_first(). That presents my patch with a problem: the logic for determining whether or not we need a new primitive index scan only promises to never require the scan to grovel through many leaf pages that _bt_first() could and should just skip over instead. This new logic makes no promises about skipping over small numbers of tuples. So it's possible that _bt_checkkeys() will see a handful of tuples "after the end of the _bt_first-wise primitive index scan", but "before the _bt_first-wise start of the next would-be primitive index scan".

Note that this stuff matters even without bringing optimization 2 into it. There are similar considerations for required equality strategy scan keys, which (by definition) must be required in both scan directions. The new mechanism must never act as if it's past the end of matches in the current scan direction when in fact it's really before the beginning of matches (that would lead to totally ignoring a group of equal matches). The existing _bt_checkkeys() logic can't really tell the difference on its own, since it only has an = operator to work with (well, I guess it knows about this context, since there are very old comments in _bt_checkkeys on HEAD about the dependency on _bt_first behaviors).

> _bt_checkkeys() is called in an assertion in _bt_readpage, but it has
> the side-effect of advancing the array keys. Side-effects from an
> assertion seems problematic.

I agree that that's a concern, but just to be clear: there are no side-effects presently.
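[Editor's note: to illustrate the general hazard being raised, here is a toy Python sketch, unrelated to the actual nbtree code. Assertions can be compiled out -- `python -O` here stands in for a non-assert-enabled C build -- and any side effects inside them disappear with them.]

```python
# Toy illustration of why side effects inside assertions are risky.
# Invented names; not PostgreSQL code.
class ToyScan:
    """Stand-in for a scan whose key-checking routine also advances state."""
    def __init__(self):
        self.array_pos = 0  # hypothetical array-key cursor

    def check_keys_and_advance(self):
        # Side effect: advances the cursor (loosely analogous to a
        # key-check routine that also advances the array keys).
        self.array_pos += 1
        return True

scan = ToyScan()
# In an assert-enabled build this advances the cursor; under
# "python -O" (or with NDEBUG defined in C), the call never happens
# and the scan state silently diverges.
assert scan.check_keys_and_advance()
print(scan.array_pos)
```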
You can't mix the array stuff with the optimization stuff, so we won't actually call _bt_checkkeys() in an assertion when it could cause side-effects.

Assuming we ultimately conclude that the optimizations *aren't* worth preserving in any form, it might still be worth making it obvious that the assertions have no side-effects. But that question is unsettled right now.

Thanks for the review! I'll try to get the next revision out soon. It'll also have bug fixes for mark + restore, and for a similar issue seen when the scan changes direction in just the wrong way. (In short, the array key state machine can be confused about scan direction in certain corner cases.)

--
Peter Geoghegan