Thread: Union+group by planner estimates way off?
On 7.4 RC2, I'm seeing a case where the query planner's estimates are way
out of line after grouping the result of a union. I've tried raising the
statistics targets as high as 200, and it made no difference in the
planner's estimates. The full query this comes from also applies an
aggregate function that produces a space-delimited list of commodity & fak
for each id. Does anyone have suggestions for tweaks to apply or ways to
rewrite this? Or is this one of those ugly corners where the query planner
simply has no clue how to estimate the result? (The suspiciously round
rows=200 estimate is what makes me wonder.)

EXPLAIN ANALYZE SELECT id FROM
 (SELECT id, commodity FROM commodities WHERE commodity IS NOT NULL
  UNION
  SELECT id, fak FROM commodities WHERE fak IS NOT NULL
 ) all_commodities GROUP BY id;

                                  QUERY PLAN
------------------------------------------------------------------------------
 HashAggregate  (cost=15939.16..15939.16 rows=200 width=4) (actual time=3537.281..3680.418 rows=83306 loops=1)
   ->  Subquery Scan all_commodities  (cost=14002.00..15697.02 rows=96858 width=4) (actual time=2268.052..3214.996 rows=95715 loops=1)
         ->  Unique  (cost=14002.00..14728.44 rows=96858 width=15) (actual time=2268.043..2881.688 rows=95715 loops=1)
               ->  Sort  (cost=14002.00..14244.15 rows=96858 width=15) (actual time=2268.037..2527.083 rows=100008 loops=1)
                     Sort Key: id, commodity
                     ->  Append  (cost=0.00..5034.42 rows=96858 width=15) (actual time=7.402..1220.320 rows=100008 loops=1)
                           ->  Subquery Scan "*SELECT* 1"  (cost=0.00..2401.23 rows=36831 width=15) (actual time=7.398..590.004 rows=39772 loops=1)
                                 ->  Seq Scan on commodities  (cost=0.00..2032.92 rows=36831 width=15) (actual time=7.388..468.415 rows=39772 loops=1)
                                       Filter: (commodity IS NOT NULL)
                           ->  Subquery Scan "*SELECT* 2"  (cost=0.00..2633.19 rows=60027 width=14) (actual time=0.016..408.160 rows=60236 loops=1)
                                 ->  Seq Scan on commodities  (cost=0.00..2032.92 rows=60027 width=14) (actual time=0.010..221.635 rows=60236 loops=1)
                                       Filter: (fak IS NOT NULL)
 Total runtime: 3783.009 ms
(13 rows)
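A side note on the reduced query above: because the GROUP BY keeps only
the ids, an equivalent form exists that avoids the union's
Append/Sort/Unique machinery entirely, using a single scan with an OR
filter. It doesn't extend to the full query, which needs the commodity
and fak values for the list-building aggregate, but it makes a useful
baseline for comparison:

    EXPLAIN ANALYZE SELECT DISTINCT id
      FROM commodities
     WHERE commodity IS NOT NULL
        OR fak IS NOT NULL;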
"Arthur Ward" <award@dominionsciences.com> writes: > EXPLAIN ANALYZE SELECT id FROM > (SELECT id, commodity FROM commodities WHERE commodity IS NOT NULL > UNION > SELECT id, fak FROM commodities WHERE fak IS NOT NULL > ) all_commodities GROUP BY id; > QUERY > PLAN > ---------------------------------------------------------------------------------------------------------------------------------------------------------------- > HashAggregate (cost=15939.16..15939.16 rows=200 width=4) (actual > time=3537.281..3680.418 rows=83306 loops=1) > -> Subquery Scan all_commodities (cost=14002.00..15697.02 rows=96858 > width=4) (actual time=2268.052..3214.996 rows=95715 loops=1) It's falling back to a default estimate because it doesn't know how to find any statistics for the output of a sub-select. I have a TODO somewhere about burrowing down into sub-selects to see if the output maps directly to a column that we'd have stats for ... but it's not done yet. In this particular case the inaccurate estimate doesn't matter too much, I think, although it might be encouraging the system to select hash aggregation since it thinks the hashtable will be pretty small. If the estimate were getting used to plan higher-up plan steps then it could be a bigger problem. regards, tom lane
> In this particular case the inaccurate estimate doesn't matter too much,
> I think, although it might be encouraging the system to select hash
> aggregation since it thinks the hashtable will be pretty small. If the
> estimate were getting used to plan higher-up plan steps then it could
> be a bigger problem.

That's exactly my problem: this is a sub-select feeding into a larger
query, and the wrong estimate causes the planner to select a nested loop
at the next step up. At 83,000 rows, the word is "ouch!"

I discovered this while dissecting a giant & slow query, so while
disabling nested-loop joins avoids this particular pitfall, it's no good
for the bigger picture. I think I'm going to end up splitting the larger
query into smaller parts and reassembling the pieces in the application,
so I can push some smarts past the other sub-select boundaries. For my
purposes, that should skirt the issue of union+group estimates not being
calculated.

As always, thanks for the fast answers!
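A possible middle ground between disabling nested loops and restructuring
in the application, sketched here with the names from the thread:
materialize the union into a temporary table and ANALYZE it, so the rest
of the query plans against real statistics rather than the default
estimate.

    -- Materialize the sub-select so the planner sees actual row counts
    -- and n_distinct statistics for its output.
    CREATE TEMP TABLE all_commodities AS
      SELECT id, commodity FROM commodities WHERE commodity IS NOT NULL
      UNION
      SELECT id, fak FROM commodities WHERE fak IS NOT NULL;

    ANALYZE all_commodities;

    -- The larger query can now reference all_commodities instead of the
    -- inline union, with accurate estimates at the join step.
    SELECT id FROM all_commodities GROUP BY id;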