BUG #5539: Generally incorrect planner estimations on join with UNION ALL - Mailing list pgsql-bugs

From Maksym Boguk
Subject BUG #5539: Generally incorrect planner estimations on join with UNION ALL
Date
Msg-id 201007040047.o640lNpD039563@wwwmaster.postgresql.org
Whole thread Raw
Responses Re: BUG #5539: Generally incorrect planner estimations on join with UNION ALL
List pgsql-bugs
The following bug has been logged online:

Bug reference:      5539
Logged by:          Maksym Boguk
Email address:      Maxim.Boguk@gmail.com
PostgreSQL version: 8.4.3
Operating system:   FreeBSD 7.2
Description:        Generally incorrect planner estimations on join with
UNION ALL
Details:

If query contains an join with '(t1 union all t2) on something' planner
estimated rows in results widely incorrect, thus leads to very bad plans if
query contained more then that single join. However, looking on explain
analyze output, planner have all required information to produce correct
estimation.

The test case.

Step1, initial data preparation:
mboguk_billing=# CREATE TABLE t1 AS SELECT v as id FROM
generate_series(1,100000) as g(v);
SELECT
mboguk_billing=# CREATE TABLE t2 AS SELECT v as id FROM
generate_series(1,100000) as g(v);
SELECT
mboguk_billing=# CREATE TABLE t3 AS SELECT v as id,v as value FROM
generate_series(1,100000) as g(v);
SELECT
mboguk_billing=# CREATE INDEX t1_id_key on t1(id);
CREATE INDEX
mboguk_billing=# CREATE INDEX t2_id_key on t2(id);
CREATE INDEX
mboguk_billing=# CREATE INDEX t3_id_key on t3(id);
CREATE INDEX
mboguk_billing=# CREATE INDEX t3_value_key on t3(value);
CREATE INDEX
mboguk_billing=# ANALYZE t1;
ANALYZE
mboguk_billing=# ANALYZE t2;
ANALYZE
mboguk_billing=# ANALYZE t3;
ANALYZE

Ok now we have 3 table with 100000 rows each.

Step 2: test query:

EXPLAIN ANALYZE SELECT t3.*,t_all.* FROM t3
JOIN (
SELECT * FROM t1
UNION ALL
SELECT * FROM t2
) as t_all ON t_all.id=t3.id
WHERE t3.value=100;
                                                        QUERY PLAN
----------------------------------------------------------------------------
----------------------------------------------
 Nested Loop  (cost=0.00..1.49 rows=1000 width=12) (actual time=0.076..0.116
rows=2 loops=1)
   Join Filter: (t3.id = t1.id)
   ->  Index Scan using t3_value_key on t3  (cost=0.00..0.49 rows=1 width=8)
(actual time=0.038..0.040 rows=1 loops=1)
         Index Cond: (value = 100)
   ->  Append  (cost=0.00..0.98 rows=2 width=4) (actual time=0.018..0.047
rows=2 loops=1)
         ->  Index Scan using t1_id_key on t1  (cost=0.00..0.49 rows=1
width=4) (actual time=0.015..0.019 rows=1 loops=1)
               Index Cond: (t1.id = t3.id)
         ->  Index Scan using t2_id_key on t2  (cost=0.00..0.49 rows=1
width=4) (actual time=0.015..0.017 rows=1 loops=1)
               Index Cond: (t2.id = t3.id)
 Total runtime: 0.235 ms
(10 rows)

Here problems already easy visible:
Nested Loop  (cost=0.00..1.49 rows=1000 width=12)
vs actual:
(actual time=0.076..0.116 rows=2 loops=1)

1000 rows predicted (look like hard-coded heuristic value) vs 2 row in
reality, but if careful examine key parts of the plan:

 Nested Loop  (cost=0.00..1.49 rows=1000 width=12) (actual time=0.076..0.116
rows=2 loops=1)
   Join Filter: (t3.id = t1.id)
   ->  Index Scan using t3_value_key on t3  (cost=0.00..0.49 rows=1 width=8)
(actual time=0.038..0.040 rows=1 loops=1)
         Index Cond: (value = 100)
   ->  Append  (cost=0.00..0.98 rows=2 width=4) (actual time=0.018..0.047
rows=2 loops=1)

notable planner had correctly estimated amount of rows in one side of the
join:
(Append  (cost=0.00..0.98 rows=2 width=4))
and planner hard correctly estimated amount of rows in the another side of
the join:
Index Scan using t3_value_key on t3  (cost=0.00..0.49 rows=1 width=8)

And instead of use 2(from append)*1(from scan on t3)=2 as final row
estimation, planner chooses use heuristic value 1000.

That isn't problem in this simple case, but once you try to add another join
on t_all with some big table you screwed because planner choose slow hash or
merge join with that big table instead of fast inner loop over two values.

PS: another small question... are
   Join Filter: (t3.id = t1.id)
is really required in that case?
And if yes, probably in explain analyze it should produce something like
   Join Filter: (t3.id = t_all.id)
instead.

Thanks for your time and attention.

pgsql-bugs by date:

Previous
From: pritesh modi
Date:
Subject: postgresql 8.3 on ubuntu 8.10 uppgrade,remove,or reinstall problem
Next
From: Craig Ringer
Date:
Subject: Re: postgresql 8.3 on ubuntu 8.10 uppgrade,remove,or reinstall problem