Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not? - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?
Date
Msg-id CA+TgmoYYMvZm1EnHrD8sBovZ=x2Piy9_yew3J1LDyCW39Ey-HQ@mail.gmail.com
Whole thread Raw
In response to Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?
List pgsql-hackers
On Thu, Feb 17, 2022 at 4:17 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> IMHO the whole problem is we're unable to estimate the join clause as a
> conditional probability, i.e.
>
>    P(A.x = B.x | (A.x < 42) & (B.x < 42))
>
> so maybe instead of trying to generate additional RelOptInfo items we
> should think about improving that. The extra RelOptInfos don't really
> solve this, because even if you decide to join A|x<42 to B|x<42 it does
> nothing to improve the join clause estimate.

I guess I hadn't considered that angle. I think the extra RelOptInfos
(or whatever) actually do solve a problem, because enforcing a
high-selectivity join qual against both sides is potentially quite
wasteful, and you need some way to decide whether to do it on one
side, the other, or both. But it's also true that I was wrong to
assume independence ... and if we could avoid assuming that, then the
join selectivity would work itself out without any of the machinery
that I just proposed.

> It actually deals with a more general form of this case, because the
> clauses don't need to reference the same attribute - so for example this
> would work too, assuming there is extended stats object on the columns
> on each side:
>
>   P(A.c = B.d | (A.e < 42) & (B.f < 42))

That'd be cool.

> Not sure. In my experience queries with both a join clause and other
> clauses referencing the same attribute are pretty rare. But I agree if
> we can do the expensive stuff only when actually needed, with no cost in
> the 99.999% other cases, I don't see why not. Of course, code complexity
> is a cost too.

Right. I mean, we could have a planner GUC to control whether the
optimization is used even in cases where we see that it's possible.
But Tom keeps arguing that it is possible in many queries and would
benefit few queries, and I'm not seeing why that should be so. I think
it's likely to benefit many of the queries to which it applies.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM
Next
From: Peter Geoghegan
Date:
Subject: Re: Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM