Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE - Mailing list pgsql-hackers

From Tom Lane
Subject Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE
Date
Msg-id 4905.1542993454@sss.pgh.pa.us
Whole thread Raw
In response to Re: BUG #15160: planner overestimates number of rows in join whenthere are more than 200 rows coming from CTE  (Melanie Plageman <melanieplageman@gmail.com>)
List pgsql-hackers
Melanie Plageman <melanieplageman@gmail.com> writes:
> Given that you have addressed all of my feedback and that it's a pretty
> low-risk change, I will change the status to "ready for committer".

Thanks for reviewing!

> Has there been discussion in the past about adding a planner test
> extension similar to those in src/test/modules for cardinality
> estimation? I am imagining something that is a "soft" check that either
> the rows estimation that comes out of calc_joinrel_size_estimate is
> within an expected range (given differing estimates across machines) or
> that the selectivity estimate that comes out of eqjoinsel is within an
> expected range. The former seems relatively easy to do in a manner
> similar to the test_predtest extension and the latter seems like it
> could be done even more trivially.

No, I don't recall any discussion about that.  The regression tests in
general embody a lot of checking that the planner makes expected plan
choices: obviously the cases where we do an explicit EXPLAIN do that,
but even where we don't, we'd be likely to get artifacts such as varying
row order if an unexpected plan were chosen.  Perhaps there's a use-case
for a lower-level test harness such as you suggest, but I haven't really
felt a need for it.

> On Sat, Nov 17, 2018 at 12:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> (Here, both "outer rel's size" and "inner rel's size" mean the size after
>> earlier filtering steps.)  So that's why we only clamp nd2 and only do so
>> in eqjoinsel_semi: in the other three cases, we'd be double-counting the
>> selectivity of earlier filters if we did that.

> I just want to make sure I am understanding what the comment is saying: So,
> after we calculate the selectivity for inner join, when we return from
> calc_joinrel_size_estimate we do this math:
>     nrows = outer_rows * inner_rows * fkselec * jselec;
> and in that equation, the outer and inner rows have been adjusted to account
> for any restrictions on the tables, so we don't clamp the ndvs for inner
> join in eqjoinsel_inner. However, for semi-join, that calculation is
>     nrows = outer_rows * fkselec * jselec;
> Which means that we have to adjust the rows of the inner side before we get
> here?

Yeah.  Basically the point is that if we have some WHERE clause that
eliminates rows from the inner side of a semijoin, we can expect that
that means the size of the semijoin result will be smaller than if the
WHERE clause hadn't been there --- because some of the outer-rel rows
only had matches among those excluded rows.  But the equation in
calc_joinrel_size_estimate provides no way to factor that in, except
by adjusting the selectivity value, so that's what we do.

> You wrote:
>> Hm.  Maybe the "Psemi" and "Pinner" notation is not helpful ... would
>> "Ssemi" and "Sinner" be better?

> I think Ssemi and Sinner might be more clear--mostly because we haven't used
> P/predicate here or in most of the other selectivity estimation comments
> that I
> read. Also, in some cases when we have super limited information and make a
> guess, the selectivity feels pretty detached from the join predicate.

OK, thanks --- I'll have another go at writing that comment.

            regards, tom lane


pgsql-hackers by date:

Previous
From: David Fetter
Date:
Subject: Re: row filtering for logical replication
Next
From: Petr Jelinek
Date:
Subject: Re: row filtering for logical replication