Re: execExprInterp() questions / How to improve scalar array op expr eval? - Mailing list pgsql-hackers

From James Coleman
Subject Re: execExprInterp() questions / How to improve scalar array op expr eval?
Date
Msg-id CAAaqYe_zVVOURfdPbAhssijw7yV0uKi350gQ=_QGDz7R=HpGGQ@mail.gmail.com
Whole thread Raw
In response to Re: execExprInterp() questions / How to improve scalar array op expr eval?  (James Coleman <jtc331@gmail.com>)
List pgsql-hackers
On Mon, Apr 13, 2020 at 10:40 AM James Coleman <jtc331@gmail.com> wrote:
>
> I've read through all of the previous discussions related to stable
> subexpression caching, and I'm planning to send a summary email with
> all of those links in one place.
>
> But I also happened to stumble upon mention in the TODO of some email
> discussion way back in 2007 where Tom suggested [1] we should really
> try planning scalar array ops (particularly those with large IN lists)
> as `IN (VALUES ...)`.
>
> That actually would solve the specific case I'd had this problem with
> (seq scan on a large constant array IN expression). Ideally any query
> with forms like:
> select * from t where a in (1, 2,...)
> select * from t where a in ((select i from x))
> would always be isomorphic in planning. But thinking about this
> overnight and scanning through things quickly this morning, I have a
> feeling that'd be 1.) a pretty significant undertaking, and 2.) likely
> to explode the number of plans considered.
>
> Also I don't know if there's a good place to slot that into planning.
> Do either of you happen to have any pointers into places that do
> similar kinds of rewrites I could look at? And in those cases do we
> normally always rewrite or do we consider both styles independently?
> ...
> [1]: https://www.postgresql.org/message-id/19001.1178823208%40sss.pgh.pa.us

I've kept reading the code and thinking this over some more. and it
seems like implementing this would require one of:

1. Tracking on a path whether or not it handles a given IN qual,
generating paths with and without that qual, tracking the cheapest for
each of those, and then running the join search with and without the
VALUES RTEs in play...in short that seems...not really feasible.

2. Teaching set_plain_rel_pathlist to run with and without the IN
quals, and run a join search for just that base rel and the VALUES
RTE, and determine which is lower cost. This seems like it'd be easier
to get working, would have the limitation of not fully considering all
JOIN combinations including VALUES RTEs, and would still build more
than twice as many base rel paths as we do now. I suppose it could
predicate all of this on some kind of heuristic about saop cost (e.g.,
size of the array).

But the purist in me finds this attractive -- it means people
shouldn't have to know to try both IN (<list>) and IN (VALUES ...) --
it seems like it'd be a large effort with relatively little gain. My
conclusion currently is that I believe we'd get 90% of the speed
improvement (at least) merely by optimizing scalar array ops, and,
possibly (to expand beyond constants) work on the broader effort of
caching stable subexpressions with preprocessing support. And I have
to assume the IN (<list>) case is a far more commonly used SQL clause
anyway.

I do wonder though if we could automatically convert some IN clauses
with subqueries into saops (parameterized, or even better, in some
cases, quasi-consts lazily evaluated)...

James



pgsql-hackers by date:

Previous
From: James Coleman
Date:
Subject: Summary: State of Caching Stable Subexpressions
Next
From: Masahiko Sawada
Date:
Subject: Re: Race condition in SyncRepGetSyncStandbysPriority