Re: execExprInterp() questions / How to improve scalar array op expr eval? - Mailing list pgsql-hackers
From | James Coleman |
---|---|
Subject | Re: execExprInterp() questions / How to improve scalar array op expr eval? |
Date | |
Msg-id | CAAaqYe_zVVOURfdPbAhssijw7yV0uKi350gQ=_QGDz7R=HpGGQ@mail.gmail.com Whole thread Raw |
In response to | Re: execExprInterp() questions / How to improve scalar array op expr eval? (James Coleman <jtc331@gmail.com>) |
List | pgsql-hackers |
On Mon, Apr 13, 2020 at 10:40 AM James Coleman <jtc331@gmail.com> wrote: > > I've read through all of the previous discussions related to stable > subexpression caching, and I'm planning to send a summary email with > all of those links in one place. > > But I also happened to stumble upon mention in the TODO of some email > discussion way back in 2007 where Tom suggested [1] we should really > try planning scalar array ops (particularly those with large IN lists) > as `IN (VALUES ...)`. > > That actually would solve the specific case I'd had this problem with > (seq scan on a large constant array IN expression). Ideally any query > with forms like: > select * from t where a in (1, 2,...) > select * from t where a in ((select i from x)) > would always be isomorphic in planning. But thinking about this > overnight and scanning through things quickly this morning, I have a > feeling that'd be 1.) a pretty significant undertaking, and 2.) likely > to explode the number of plans considered. > > Also I don't know if there's a good place to slot that into planning. > Do either of you happen to have any pointers into places that do > similar kinds of rewrites I could look at? And in those cases do we > normally always rewrite or do we consider both styles independently? > ... > [1]: https://www.postgresql.org/message-id/19001.1178823208%40sss.pgh.pa.us I've kept reading the code and thinking this over some more. and it seems like implementing this would require one of: 1. Tracking on a path whether or not it handles a given IN qual, generating paths with and without that qual, tracking the cheapest for each of those, and then running the join search with and without the VALUES RTEs in play...in short that seems...not really feasible. 2. Teaching set_plain_rel_pathlist to run with and without the IN quals, and run a join search for just that base rel and the VALUES RTE, and determine which is lower cost. This seems like it'd be easier to get working, would have the limitation of not fully considering all JOIN combinations including VALUES RTEs, and would still build more than twice as many base rel paths as we do now. I suppose it could predicate all of this on some kind of heuristic about saop cost (e.g., size of the array). But the purist in me finds this attractive -- it means people shouldn't have to know to try both IN (<list>) and IN (VALUES ...) -- it seems like it'd be a large effort with relatively little gain. My conclusion currently is that I believe we'd get 90% of the speed improvement (at least) merely by optimizing scalar array ops, and, possibly (to expand beyond constants) work on the broader effort of caching stable subexpressions with preprocessing support. And I have to assume the IN (<list>) case is a far more commonly used SQL clause anyway. I do wonder though if we could automatically convert some IN clauses with subqueries into saops (parameterized, or even better, in some cases, quasi-consts lazily evaluated)... James
pgsql-hackers by date: