Union+group by planner estimates way off? - Mailing list pgsql-performance

From Arthur Ward
Subject Union+group by planner estimates way off?
Date
Msg-id 23123.68.62.129.152.1068748095.squirrel@award.gotdns.org
Whole thread Raw
Responses Re: Union+group by planner estimates way off?
List pgsql-performance
On 7.4 RC2, I'm seeing a case where the query planner estimates are way
out of line after grouping the result of a union. I've tried adjusting the
statistics targets up to 200, and it made no difference in the planner's
estimates. The point of the full query this came from is that it also has
an aggregate function that produces a space-delimited list of commodity &
fak for each id. Does anyone have any suggestions on tweaks to apply or
ways to rewrite this? Is this one of those ugly corners where the query
planner doesn't have a clue how to estimate this (seeing the nice round
200 estimate makes me suspicious)?

EXPLAIN ANALYZE SELECT id FROM
(SELECT id, commodity FROM commodities WHERE commodity IS NOT NULL
   UNION
 SELECT id, fak FROM commodities WHERE fak IS NOT NULL
) all_commodities GROUP BY id;
                                                                           QUERY
PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=15939.16..15939.16 rows=200 width=4) (actual
time=3537.281..3680.418 rows=83306 loops=1)
   ->  Subquery Scan all_commodities  (cost=14002.00..15697.02 rows=96858
width=4) (actual time=2268.052..3214.996 rows=95715 loops=1)
         ->  Unique  (cost=14002.00..14728.44 rows=96858 width=15) (actual
time=2268.043..2881.688 rows=95715 loops=1)
               ->  Sort  (cost=14002.00..14244.15 rows=96858 width=15)
(actual time=2268.037..2527.083 rows=100008 loops=1)
                     Sort Key: id, commodity
                     ->  Append  (cost=0.00..5034.42 rows=96858 width=15)
(actual time=7.402..1220.320 rows=100008 loops=1)
                           ->  Subquery Scan "*SELECT* 1"
(cost=0.00..2401.23 rows=36831 width=15)
(actual time=7.398..590.004 rows=39772 loops=1)
                                 ->  Seq Scan on commodities
(cost=0.00..2032.92 rows=36831 width=15)
(actual time=7.388..468.415 rows=39772
loops=1)
                                       Filter: (commodity IS NOT NULL)
                           ->  Subquery Scan "*SELECT* 2"
(cost=0.00..2633.19 rows=60027 width=14)
(actual time=0.016..408.160 rows=60236 loops=1)
                                 ->  Seq Scan on commodities
(cost=0.00..2032.92 rows=60027 width=14)
(actual time=0.010..221.635 rows=60236
loops=1)
                                       Filter: (fak IS NOT NULL)
 Total runtime: 3783.009 ms
(13 rows)


pgsql-performance by date:

Previous
From: Andrew Sullivan
Date:
Subject: strange estimate for number of rows
Next
From: Tom Lane
Date:
Subject: Re: Union+group by planner estimates way off?