On 2/17/22 23:16, Robert Haas wrote:
> On Thu, Feb 17, 2022 at 4:17 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> IMHO the whole problem is we're unable to estimate the join clause as a
>> conditional probability, i.e.
>>
>> P(A.x = B.x | (A.x < 42) & (B.x < 42))
>>
>> so maybe instead of trying to generate additional RelOptInfo items we
>> should think about improving that. The extra RelOptInfos don't really
>> solve this, because even if you decide to join A|x<42 to B|x<42 it does
>> nothing to improve the join clause estimate.
>
> I guess I hadn't considered that angle. I think the extra RelOptInfos
> (or whatever) actually do solve a problem, because enforcing a
> high-selectivity join qual against both sides is potentially quite
> wasteful, and you need some way to decide whether to do it on one
> side, the other, or both. But it's also true that I was wrong to
> assume independence ... and if we could avoid assuming that, then the
> join selectivity would work itself out without any of the machinery
> that I just proposed.
>
True. We kinda already have this issue for the equality clauses, and
having paths with the condition pushed down (or not) seems like a
natural approach.
>> It actually deals with a more general form of this case, because the
>> clauses don't need to reference the same attribute - so for example this
>> would work too, assuming there is extended stats object on the columns
>> on each side:
>>
>> P(A.c = B.d | (A.e < 42) & (B.f < 42))
>
> That'd be cool.
>
Yeah, but the patch implementing this still needs more work.
>> Not sure. In my experience queries with both a join clause and other
>> clauses referencing the same attribute are pretty rare. But I agree if
>> we can do the expensive stuff only when actually needed, with no cost in
>> the 99.999% other cases, I don't see why not. Of course, code complexity
>> is a cost too.
>
> Right. I mean, we could have a planner GUC to control whether the
> optimization is used even in cases where we see that it's possible.
> But Tom keeps arguing that it is possible in many queries and would
> benefit few queries, and I'm not seeing why that should be so. I think
> it's likely to benefit many of the queries to which it applies.
>
Maybe. Although the example I linked some time ago shows a pretty
dramatic improvement, due to picking merge join + index scan, and not
realizing we'll have to skip a lot of data. But that's just one
anecdotal example.
Anyway, I think the best way to deal with these (perfectly legitimate)
concerns is to show how expensive it is for queries not not having such
join/restriction clauses, with the cost being close to 0. And then for
queries with such clauses but not benefiting from the change (a bit like
a worst case).
regards
[1]
https://www.postgresql.org/message-id/CA%2B1Wm9U_sP9237f7OH7O%3D-UTab71DWOO4Qc-vnC78DfsJQBCwQ%40mail.gmail.com
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company