Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not? - Mailing list pgsql-hackers
From | Dmitry Astapov |
---|---|
Subject | Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not? |
Date | |
Msg-id | CAFQUnFiQJJctiZd_vO5Tt8yN=H9Pwpg4AfzcyRCCmEz6jnmAvQ@mail.gmail.com Whole thread Raw |
In response to | Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not? (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?
|
List | pgsql-hackers |
On Wed, May 12, 2021 at 4:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dmitry Astapov <dastapov@gmail.com> writes:
> I am trying to understand the behaviour of the query planner regarding the
> push-down of the conditions "through" the join.
I think your mental model is wrong. What's actually happening here is
that the planner uses equivalence classes to deduce implied conditions.
That is, we have the join condition a.adate = b.bdate and then you've
added the where condition a.adate = '2021-05-12'. Transitivity implies
that b.bdate = '2021-05-12', so we deduce that condition and are able
to apply it at the relation scan of b. Furthermore, having restricted
both a.adate and b.bdate to the same constant value at the scan level,
we no longer need to apply the join condition a.adate = b.bdate at all.
This is important not only to avoid the (probably minor) inefficiency
of rechecking the join condition, but because if we believed that all
three conditions were independently applicable, we'd come out with a
serious underestimate of the size of the join result.
Thank you very much, my mental model was indeed incorrect, and the above is very helpful.
Am I right in thinking that elimination the join condition is actually quite important part of the process?
Could it possibly be the main reason for =ANY/(x IN (..)) not to be optimized the same way?
> In my experiments, I was never able to get an execution plan that "pushes
> down" any condition apart from (=) through to the right side of the join,
None of the argument sketched above works for non-equality conditions.
There are some situations where you could probably figure out how to
use transitivity to deduce some implied condition, but cleaning things
up so that you don't have redundant conditions fouling up the join
size estimates seems like a hard problem.
I agree about inequality conditions, this problem seems to be rather hard to tackle in the general case.
Is it still hard when one thinks about =ANY or (column in (val1, val2, val3, ...)) as well?
I am thinking that =ANY would be a decent workaround for (x BETWEEN a AND b) in quite a lot of cases, if it was propagated to all the columns in the equivalence class.
> Equally surprising is that I was unable to find documentation or past
> mailing list discussions of this or similar topic, which leads me to
> believe that I am just not familiar with the proper terminology and can't
> come up with the right search terms.
src/backend/optimizer/README has a discussion of equivalence classes.
Thank you, this gives me a plethora of keywords for further searches.
I realize that it is possibly off-topic here, but what about workarounds for inequality constraints, joins and views? Maybe you could give me some pointers here as well?
My tables are large to huge (think OLAP, not OLTP). I found out when I have a view that joins several (2 to 10) tables on the column that is semantically the same in all of them (let's say it is ID and we join on ID), I do not have many avenues to efficiently select from such view for a list of IDs at the same time.
I could:
1) Do lots of fast queries and union them:
select * from vw where id=ID1 union all select * from vw where id=ID2 ....., which is only really feasible if the query is generated by the program
2)expose all ID columns from all the tables used in the view body and do:
select * from vw where id=ANY() and id1=ANY() and id2=ANY() and id3=ANY() .....
This only works well if the view hierarchy is flat (no views on views). If there are other views that use this use, re-exports of extra columns quickly snowballs, you might need column renaming if same view ends up being used more than once through two different dependency paths. Plus people not familiar with the problem tend to omit "clearly superfluous" columns from the new views they build on top.
3)forbid views that join tables larger than a certain size/dismantle views that become inefficient (this only works if the problem is detected fast enough and the view did not become popular yet)
So all of the workarounds I see in front of me right now are somewhat sad, but they are necessary, as not doing them means that queries would take hours or days instead of minutes.
Is there anything better that I have not considered in terms of workarounds?
D. Astapov
pgsql-hackers by date: