Thread: Union+group by planner estimates way off?

Union+group by planner estimates way off?

From: "Arthur Ward"
On 7.4 RC2, I'm seeing a case where the query planner's estimates are way
out of line after grouping the result of a union. I've tried raising the
statistics targets to 200, and it made no difference in the estimates. For
context, the full query this came from also applies an aggregate function
that produces a space-delimited list of commodity & fak values for each
id. Does anyone have suggestions on tweaks to apply or ways to rewrite
this? Or is this one of those ugly corners where the query planner simply
has no clue how to estimate (the nice round 200 estimate makes me
suspicious)?
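
For reference, the statistics bump was along these lines (give or take
exactly which columns I raised):

ALTER TABLE commodities ALTER COLUMN id SET STATISTICS 200;
ALTER TABLE commodities ALTER COLUMN commodity SET STATISTICS 200;
ALTER TABLE commodities ALTER COLUMN fak SET STATISTICS 200;
ANALYZE commodities;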

EXPLAIN ANALYZE SELECT id FROM
(SELECT id, commodity FROM commodities WHERE commodity IS NOT NULL
   UNION
 SELECT id, fak FROM commodities WHERE fak IS NOT NULL
) all_commodities GROUP BY id;
                                                                    QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=15939.16..15939.16 rows=200 width=4) (actual time=3537.281..3680.418 rows=83306 loops=1)
   ->  Subquery Scan all_commodities  (cost=14002.00..15697.02 rows=96858 width=4) (actual time=2268.052..3214.996 rows=95715 loops=1)
         ->  Unique  (cost=14002.00..14728.44 rows=96858 width=15) (actual time=2268.043..2881.688 rows=95715 loops=1)
               ->  Sort  (cost=14002.00..14244.15 rows=96858 width=15) (actual time=2268.037..2527.083 rows=100008 loops=1)
                     Sort Key: id, commodity
                     ->  Append  (cost=0.00..5034.42 rows=96858 width=15) (actual time=7.402..1220.320 rows=100008 loops=1)
                           ->  Subquery Scan "*SELECT* 1"  (cost=0.00..2401.23 rows=36831 width=15) (actual time=7.398..590.004 rows=39772 loops=1)
                                 ->  Seq Scan on commodities  (cost=0.00..2032.92 rows=36831 width=15) (actual time=7.388..468.415 rows=39772 loops=1)
                                       Filter: (commodity IS NOT NULL)
                           ->  Subquery Scan "*SELECT* 2"  (cost=0.00..2633.19 rows=60027 width=14) (actual time=0.016..408.160 rows=60236 loops=1)
                                 ->  Seq Scan on commodities  (cost=0.00..2032.92 rows=60027 width=14) (actual time=0.010..221.635 rows=60236 loops=1)
                                       Filter: (fak IS NOT NULL)
 Total runtime: 3783.009 ms
(13 rows)


Re: Union+group by planner estimates way off?

From: Tom Lane
"Arthur Ward" <award@dominionsciences.com> writes:
> EXPLAIN ANALYZE SELECT id FROM
> (SELECT id, commodity FROM commodities WHERE commodity IS NOT NULL
>    UNION
>  SELECT id, fak FROM commodities WHERE fak IS NOT NULL
> ) all_commodities GROUP BY id;
>                                  QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>  HashAggregate  (cost=15939.16..15939.16 rows=200 width=4) (actual time=3537.281..3680.418 rows=83306 loops=1)
>    ->  Subquery Scan all_commodities  (cost=14002.00..15697.02 rows=96858 width=4) (actual time=2268.052..3214.996 rows=95715 loops=1)

It's falling back to a default estimate because it doesn't know how to
find any statistics for the output of a sub-select.  I have a TODO
somewhere about burrowing down into sub-selects to see if the output maps
directly to a column that we'd have stats for ... but it's not done yet.
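
As a quick illustration (an untested sketch; the UNION is what keeps the
sub-select from being flattened, so the GROUP BY is looking at a column
with no statistics behind it):

-- a plain column reference: the planner can use n_distinct from ANALYZE
EXPLAIN SELECT id FROM commodities GROUP BY id;

-- a UNION output: no stats available, so the grouping estimate falls
-- back to the default (the suspiciously round 200)
EXPLAIN SELECT id FROM
  (SELECT id FROM commodities UNION SELECT id FROM commodities) u
GROUP BY id;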

In this particular case the inaccurate estimate doesn't matter too much,
I think, although it might be encouraging the system to select hash
aggregation since it thinks the hashtable will be pretty small.  If the
estimate were getting used to plan higher-up plan steps then it could
be a bigger problem.

            regards, tom lane

Re: Union+group by planner estimates way off?

From: "Arthur Ward"
> In this particular case the inaccurate estimate doesn't matter too much,
> I think, although it might be encouraging the system to select hash
> aggregation since it thinks the hashtable will be pretty small.  If the
> estimate were getting used to plan higher-up plan steps then it could
> be a bigger problem.

That's exactly my problem: this is a subselect feeding into a larger
query, and the wrong estimate causes the planner to select a nested-loop
join at the next step up. At 83,000 rows, the word is "ouch!"
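
(For the record, "disabling nested-loop joins" below means the usual
session-level switch:

SET enable_nestloop TO off;

It dodges the bad plan here, but it also steers every other join in the
session, hence the bigger-picture problem.)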

At any rate, I discovered this while dissecting a giant, slow query, so
while disabling nested-loop joins avoids this particular pitfall, it's not
good for the bigger picture. I think I'm going to end up splitting the
larger query into smaller parts and reassembling the pieces in the
application, so I can push some smarts past the other subselect
boundaries; a rough sketch is below. For my purposes, that should skirt
the issue of union+group estimates not being calculated.
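
Roughly what I have in mind, as a hypothetical sketch of just the grouping
step (the list-building aggregate would ride along in the same SELECT, and
big_table stands in for the rest of the real query): materialize the
union+group result, then ANALYZE it so the next join is planned with real
row counts instead of the default 200.

CREATE TEMP TABLE grouped_commodities AS
SELECT id FROM
  (SELECT id, commodity FROM commodities WHERE commodity IS NOT NULL
     UNION
   SELECT id, fak FROM commodities WHERE fak IS NOT NULL
  ) all_commodities GROUP BY id;

ANALYZE grouped_commodities;  -- the planner now sees ~83,000 rows, not 200

-- then, in the larger query, join against the temp table in place of
-- the subselect:
--   ... FROM big_table JOIN grouped_commodities USING (id) ...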

As always, thanks for the fast answers!