Re: Why am I getting great/terrible estimates with these CTE queries? - Mailing list pgsql-performance

From Tomas Vondra
Subject Re: Why am I getting great/terrible estimates with these CTE queries?
Date
Msg-id 5074BF14.4030505@fuzzy.cz
Whole thread Raw
In response to Re: Why am I getting great/terrible estimates with these CTE queries?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
On 10.10.2012 01:09, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> I've been fighting with some CTE queries recently, and in the end I've
>> ended up with two basic cases. In one case the CTEs work absolutely
>> great, making the estimates much more precise, while in the other the
>> results are pretty terrible. And I'm not sure why both of these behave
>> the way they do.
>
> You're assuming the case where the estimate is better is better for a
> reason ... but it's only better as a result of blind dumb luck.  The
> outer-level query planner doesn't know anything about the CTE's output
> except the estimated number of rows --- in particular, it doesn't drill
> down to find any statistics about the join column.  So what you're
> getting there is a default selectivity estimate that just happens to
> match reality in this case.  (If you work through the math in
> eqjoinsel_inner for the case where the relation sizes are grossly
> different and we don't have MCV stats, you'll find that it comes out to
> be assuming that each row in the larger relation has one join partner in
> the smaller one, which indeed is your situation here.)  In the other
> example, you're likewise getting a default selectivity estimate, only it
> doesn't match so well, because the default does not include assuming
> that the join keys are unique on both sides.  Without the CTEs, the
> optimizer can see the keys are unique so it makes the right selectivity
> estimate.

Thanks for explaining, now it finally makes some sense.


> In principle we could make the optimizer try to drill down for stats,
> which would make these examples work the same with or without the CTE
> layers.  I'm not sure it's worth the trouble though --- I'm dubious that
> people would use a CTE for cases that are simple enough for the stats
> estimates to be worth anything.

I don't think we need this to be improved with CTEs, we've used them
mostly as an attempt to make the queries faster (and it worked by luck,
as it turned out). If we could get better estimates with plain CTE-free
queries, that'd definitely be the preferred solution.

Actually we need to improve only the first query, as the second one
(joining over PK) is rather crazy.

I'll check the eqjoinsel_inner and the other join estimates, but I'd bet
this all boils down to estimating selectivity of two correlated columns
(because we're querying one and joining over another).

thanks
Tomas


pgsql-performance by date:

Previous
From: Craig James
Date:
Subject: Re: Hyperthreading (was: Two identical systems, radically different performance)
Next
From: Strahinja Kustudić
Date:
Subject: shared_buffers/effective_cache_size on 96GB server