Why am I getting great/terrible estimates with these CTE queries?

From: Tomas Vondra

Hi,

I've been fighting with some CTE queries recently, and in the end I've
ended up with two basic cases. In one case the CTEs work absolutely
great, making the estimates much more precise, while in the other the
results are pretty terrible. And I'm not sure why both of these behave
the way they do.

I'm aware of the "optimization fence" and I suspect it's the cause here,
but I wonder why it improves the estimates so much in one case while it
performs so badly in the other.

If anyone could explain what's happening, it'd be great.

The application uses a 'calendar table' with one row for each day
between the years 1900 and 2050 (i.e. ~55000 rows) and some additional
columns describing the date (month, quarter, week, ...). All of these
are sequences of numbers counted from 1900-01-01. So the table looks
like this:

  id | month | week
 ----+-------+------
   1 |     0 |    0
   2 |     0 |    0
  ...
   7 |     0 |    1   <- switch to the 2nd week
  ...
  31 |     0 |    4
  32 |     1 |    4   <- switch to February 1900

and so on. Let's go with 'month' only, so the table looks like this

  CREATE TABLE calendar_table (
      id INT PRIMARY KEY,
      month INT
  );

and let's fill it with data - for simplicity let's suppose all months
have 30 days:

  INSERT INTO calendar_table(id, month)
       SELECT i, (i/30) FROM generate_series(1,55000) s(i);

The second table is a "fact" table containing a primary key, a date (a FK
to the calendar table) and some additional data (the facts), but we don't
need those here, so let's use just this:

  CREATE TABLE fact_table (
      id INT PRIMARY KEY,
      date_id INT REFERENCES calendar_table (id)
  );

Now let's fill the fact table with data for a very short period of time,
e.g. one month. First, let's find the min/max day IDs for month 1000
(which is roughly 1983):

  SELECT min(id), max(id) FROM calendar_table WHERE month = 1000;

    min  |  max
  -------+-------
   30000 | 30029

and let's fill some data randomly (1000000 rows):

  INSERT INTO fact_table(id, date_id)
       SELECT i, floor(30000 + random()*30)
         FROM generate_series(1,1000000) s(i);

Analyze the tables, and we're ready for the two queries.
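
The plans below reference an index named calendar_table_idx with a condition
on the month column; that index isn't shown above, so (assuming that name)
the remaining setup would be roughly:

  -- implied by the "Index Cond: (month = 1000)" lines in the plans below
  CREATE INDEX calendar_table_idx ON calendar_table (month);

  -- refresh the planner statistics on both tables
  ANALYZE calendar_table;
  ANALYZE fact_table;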

(A) significant improvement of estimates / performance

A typical query is "do something with the data for month X", so let's
select all the rows from the fact table that join to month 1000.

SELECT * FROM  fact_table f JOIN calendar_table c ON (date_id = c.id)
         WHERE month = 1000;

We know it should return all 1000000 rows, but the explain plan shows an
estimate of just 527 (http://explain.depesz.com/s/Q4oy):

                             QUERY PLAN
---------------------------------------------------------------------
 Hash Join  (cost=9.18..23110.45 rows=527 width=49)
   Hash Cond: (f.date_id = c.id)
   ->  Seq Scan on fact_table f  (cost=0.00..19346.00 rows=1000000 ...)
   ->  Hash  (cost=8.82..8.82 rows=29 width=8)
         ->  Index Scan using calendar_table_idx on calendar_table c
                                      (cost=0.00..8.82 rows=29 width=8)
               Index Cond: (month = 1000)
(6 rows)

Now, I'm pretty sure I understand where the estimate comes from. The 29
estimated calendar rows for the month are ~0.05% of the 55000-row calendar
table, and applying that selectivity to the 1000000-row fact table gives
almost exactly the estimate.
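
For reference, the back-of-the-envelope version of that arithmetic (not
necessarily the planner's exact formula) is:

  -- estimated matching calendar rows / calendar rows, applied to the fact table
  SELECT round(29.0 / 55000 * 1000000);   -- 527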

Now, let's move the calendar table to a CTE (incl. the condition):

WITH c as (SELECT * from calendar_table WHERE month = 1000)
SELECT * FROM fact_table f JOIN c ON (f.date_id = c.id);

This gives us this plan (http://explain.depesz.com/s/5k9)

                             QUERY PLAN
---------------------------------------------------------------------
 Hash Join  (cost=9.76..32772.43 rows=966667 width=49)
   Hash Cond: (f.date_id = c.id)
   CTE c
     ->  Index Scan using calendar_table_idx on calendar_table
                                    (cost=0.00..8.82 rows=29 width=8)
           Index Cond: (month = 1000)
   ->  Seq Scan on fact_table f  (cost=0.00..19346.00 rows=1000000 ...)
   ->  Hash  (cost=0.58..0.58 rows=29 width=8)
         ->  CTE Scan on c  (cost=0.00..0.58 rows=29 width=8)
(8 rows)

Now, this gives us a much better estimate (almost exactly the actual number
of rows). The real queries are usually more complex (more joins,
aggregations etc.), and this precise estimate significantly improves their
performance.

I don't get it - how can the estimate be so much more precise? I'd expect
such behavior e.g. from temporary tables (because they can be analyzed),
but that's not how CTEs work internally - AFAIK the materialization does
not allow gathering stats prior to planning.

Now let's see the opposite direction.

(B) significant deviation of estimates / performance

Let's work with the fact table only - we've seen queries where narrow
"noodles" of the fact table (basically the PK plus one column) were
selected and then joined using the PK. Yes, it's a dumb thing to do, but
that's not the point. Let's see how this performs with CTEs:

WITH a AS (SELECT * from fact_table), b AS (SELECT * from fact_table)
 SELECT * from a JOIN b on (a.id = b.id);

The explain plan is here (http://explain.depesz.com/s/zGy) - notice the
estimate, which overestimates the actual row count (1000000) by a factor
of ~5000 (because we're joining on a PK, so each row has exactly one match).

                             QUERY PLAN
-----------------------------------------------------------------------
 Merge Join  (cost=278007.69..75283007.69 rows=5000000000 width=80)
   Merge Cond: (a.id = b.id)
   CTE a
     ->  Seq Scan on fact_table  (cost=0.00..19346.00 rows=1000000 ...)
   CTE b
     ->  Seq Scan on fact_table  (cost=0.00..19346.00 rows=1000000 ...)
   ->  Sort  (cost=119657.84..122157.84 rows=1000000 width=40)
         Sort Key: a.id
         ->  CTE Scan on a  (cost=0.00..20000.00 rows=1000000 width=40)
   ->  Sort  (cost=119657.84..122157.84 rows=1000000 width=40)
         Sort Key: b.id
         ->  CTE Scan on b  (cost=0.00..20000.00 rows=1000000 width=40)
(12 rows)

Now let's try without the CTEs:

 SELECT * from (SELECT * from fact_table) a
                  JOIN (SELECT * from fact_table) b on (a.id = b.id);

That gives us this plan (http://explain.depesz.com/s/Fmv)

                              QUERY PLAN
-----------------------------------------------------------------------
 Merge Join  (cost=0.00..85660.70 rows=1000000 width=82)
   Merge Cond: (public.fact_table.id = public.fact_table.id)
   ->  Index Scan using fact_table_pkey on fact_table
                            (cost=0.00..35330.35 rows=1000000 width=41)
   ->  Index Scan using fact_table_pkey on fact_table
                            (cost=0.00..35330.35 rows=1000000 width=41)
(4 rows)

Now we're getting much better estimates - exactly the right number.

Both queries might be improved (especially the second one, which could be
rewritten as a plain select without a join - see the sketch below), but
that's not the point here. Why does the CTE improve the estimates so much
in the first example and hurt them so badly in the second?
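
(The join-free rewrite mentioned above would look roughly like this -
joining fact_table to itself on its PK returns each row exactly once, so a
plain scan gives the same result; the b_* aliases are just to mimic the
duplicated columns of the original join:)

  SELECT f.id, f.date_id,
         f.id AS b_id, f.date_id AS b_date_id
    FROM fact_table f;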

thanks
Tomas


Re: Why am I getting great/terrible estimates with these CTE queries?

From: Tom Lane

Tomas Vondra <tv@fuzzy.cz> writes:
> I've been fighting with some CTE queries recently, and in the end I've
> ended up with two basic cases. In one case the CTEs work absolutely
> great, making the estimates much more precise, while in the other the
> results are pretty terrible. And I'm not sure why both of these behave
> the way they do.

You're assuming the case where the estimate is better is better for a
reason ... but it's only better as a result of blind dumb luck.  The
outer-level query planner doesn't know anything about the CTE's output
except the estimated number of rows --- in particular, it doesn't drill
down to find any statistics about the join column.  So what you're
getting there is a default selectivity estimate that just happens to
match reality in this case.  (If you work through the math in
eqjoinsel_inner for the case where the relation sizes are grossly
different and we don't have MCV stats, you'll find that it comes out to
be assuming that each row in the larger relation has one join partner in
the smaller one, which indeed is your situation here.)  In the other
example, you're likewise getting a default selectivity estimate, only it
doesn't match so well, because the default does not include assuming
that the join keys are unique on both sides.  Without the CTEs, the
optimizer can see the keys are unique so it makes the right selectivity
estimate.
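
Working through that arithmetic for the two examples (a rough sketch,
assuming the no-MCV branch of eqjoinsel_inner, which divides by the larger
of the two ndistinct estimates, and the fallbacks used when one side has no
stats at all - the rel's estimated row count if it's small, else the
default of 200):

  -- (A) nd(f.date_id) = 30 from the stats; the CTE has no stats, so its
  -- ndistinct falls back to its estimated size (29).  Selectivity ~ 1/30:
  SELECT round(1000000 * 29 / 30.0);        -- ~966667, as in the CTE plan

  -- (B) both CTEs lack stats and are too big for that fallback, so both
  -- get the default ndistinct of 200.  Selectivity ~ 1/200:
  SELECT 1000000::numeric * 1000000 / 200;  -- 5000000000, as in the CTE plan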

In principle we could make the optimizer try to drill down for stats,
which would make these examples work the same with or without the CTE
layers.  I'm not sure it's worth the trouble though --- I'm dubious that
people would use a CTE for cases that are simple enough for the stats
estimates to be worth anything.

            regards, tom lane


Re: Why am I getting great/terrible estimates with these CTE queries?

From: Tomas Vondra

On 10.10.2012 01:09, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> I've been fighting with some CTE queries recently, and in the end I've
>> ended up with two basic cases. In one case the CTEs work absolutely
>> great, making the estimates much more precise, while in the other the
>> results are pretty terrible. And I'm not sure why both of these behave
>> the way they do.
>
> You're assuming the case where the estimate is better is better for a
> reason ... but it's only better as a result of blind dumb luck.  The
> outer-level query planner doesn't know anything about the CTE's output
> except the estimated number of rows --- in particular, it doesn't drill
> down to find any statistics about the join column.  So what you're
> getting there is a default selectivity estimate that just happens to
> match reality in this case.  (If you work through the math in
> eqjoinsel_inner for the case where the relation sizes are grossly
> different and we don't have MCV stats, you'll find that it comes out to
> be assuming that each row in the larger relation has one join partner in
> the smaller one, which indeed is your situation here.)  In the other
> example, you're likewise getting a default selectivity estimate, only it
> doesn't match so well, because the default does not include assuming
> that the join keys are unique on both sides.  Without the CTEs, the
> optimizer can see the keys are unique so it makes the right selectivity
> estimate.

Thanks for explaining, now it finally makes some sense.


> In principle we could make the optimizer try to drill down for stats,
> which would make these examples work the same with or without the CTE
> layers.  I'm not sure it's worth the trouble though --- I'm dubious that
> people would use a CTE for cases that are simple enough for the stats
> estimates to be worth anything.

I don't think this needs to be improved for CTEs - we've used them mostly
as an attempt to make the queries faster (and that worked only by luck, as
it turns out). If we could get better estimates with plain CTE-free
queries, that'd definitely be the preferred solution.

Actually we only need to improve the first query, as the second one
(joining over the PK) is rather crazy.

I'll check eqjoinsel_inner and the other join estimators, but I'd bet this
all boils down to estimating the selectivity of two correlated columns
(because we're filtering on one and joining over another).

thanks
Tomas