Query planner question (7.4.5)

From: Arthur Ward

I was looking at some stubborn queries in one of our applications, and,
not knowing the internals of the query planner, thought I might ask
whether this planner improvement is possible at all. We have queries
with the general form of IN (SELECT FROM AA JOIN (SELECT foo UNION ALL
SELECT bar)) clauses. (There's a view here, but that's not the planner's
concern.)
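
To make the shape concrete, here is a rough sketch (the table and column
names are invented, not our real schema):

    SELECT st.*
    FROM some_table st
    WHERE st.some_id IN (
        SELECT BB.val
        FROM AA
        JOIN (SELECT f.val FROM foo f WHERE f.flag            -- single-table filter
              UNION ALL
              SELECT b.val FROM bar b
                   JOIN baz z ON z.bar_id = b.id              -- two-table join
             ) AS BB ON BB.val = AA.id
    );

The corresponding part of the plan looks like this: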
   ->  HashAggregate  (cost=135960.26..135960.26 rows=200 width=4)
         ->  Nested Loop  (cost=1.14..130852.49 rows=2043108 width=4)
               ->  Index Scan using AA_pkey on AA  (cost=0.00..85.01 rows=1 width=12)
               ->  Subquery Scan BB  (cost=1.14..110336.40 rows=2043108 width=36)
                     ->  Append  (cost=1.14..89905.32 rows=2043108 width=16)
                           ->  Subquery Scan "*SELECT* 1"  (cost=1.14..75017.28 rows=1634954 width=16)
                           ->  Subquery Scan "*SELECT* 2"  (cost=0.00..14888.04 rows=408154 width=16)

The problem for us is that the default estimate at the HashAggregate is
absurdly low, undercutting the other available join candidates' row
estimates and resulting in _bad_ plans. What I was wondering is whether
the planner has enough information available when looking at the
HashAggregate step (the IN clause that started this whole mess) to
consider the statistics of the inputs to the Append. In our case, I know
that the second Subquery Scan ("*SELECT* 2") generates strictly unique
values, so the HashAggregate cannot produce fewer than 408154 rows
itself. Additionally, those subquery scans are simple queries with
nothing that should cause the table statistics to be thrown away if they
can be reused in later steps (or is the rowcount the only statistic used
after a node is created?) -- one subquery is a single-table filter, the
other a two-table join, nothing special.

Essentially, does the data exist for the planner to say at the 
HashAggregate, "a-ha! Subquery 2 estimates 400000 distinct rows!" and 
estimate something >= 400000?
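
Stated as a rule (my notation, just to make explicit what I'm hoping
for):

    estimated_rows(HashAggregate) >= max(ndistinct(input_1), ..., ndistinct(input_N))

In our case ndistinct("*SELECT* 2") = rows("*SELECT* 2") = 408154, which
would be a much better floor than the default of 200.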

If this is improved in 8.0, my apologies; we haven't had the opportunity 
to begin a migration, and the release notes only say there are a lot of 
small improvements. If this is a special case of a long-standing 
limitation, I'm not seeing the general case this fits into, and I'll
happily sit down and be quiet. :-)


Re: Query planner question (7.4.5)

From: Tom Lane

Arthur Ward <award@dominionsciences.com> writes:
> The problem for us is that the default estimate at the HashAggregate is
> absurdly low, undercutting the other available join candidates' row
> estimates and resulting in _bad_ plans. What I was wondering is whether
> the planner has enough information available when looking at the
> HashAggregate step (the IN clause that started this whole mess) to
> consider the statistics of the inputs to the Append.

Not at the moment ... UNION (and other set-operation nodes) are pretty
much terra incognita to the planner's estimation routines.  If you feel
motivated to work on it, step right up ...

> Additionally, those subquery scans are simple queries with nothing that
> should cause the table statistics to be thrown away if they can be
> reused in later steps (or is the rowcount the only statistic used after
> a node is created?) -- one subquery is a single-table filter, the other
> a two-table join, nothing special.

The rowcount does propagate up just fine, but guesstimates about numbers
of distinct values don't know anything about these structures.  Right
offhand I'm not sure what a suitable framework for handling this would
look like.
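
(For what it's worth, both effects are visible in your plan: the Append's
rows=2043108 is just the sum of its inputs, 1634954 + 408154, while the
HashAggregate's rows=200 is the fallback default used when the number of
distinct values is unknown.)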
        regards, tom lane