Query planner question (7.4.5) - Mailing list pgsql-hackers

From: Arthur Ward
Subject: Query planner question (7.4.5)
Date:
Msg-id: 42080AAB.4080109@dominionsciences.com
Responses: Re: Query planner question (7.4.5)
List: pgsql-hackers
I was looking at some stubborn queries in one of our applications and, 
not knowing the internals of the query planner, thought I might ask 
whether this planner improvement is possible at all. We have queries 
containing clauses of the general form IN (SELECT ... FROM AA JOIN 
(SELECT foo UNION ALL SELECT bar)). (There's a view in there, but that's 
not the planner's concern.) The corresponding part of the plan looks 
like this:
   ->  HashAggregate  (cost=135960.26..135960.26 rows=200 width=4)
         ->  Nested Loop  (cost=1.14..130852.49 rows=2043108 width=4)
               ->  Index Scan using AA_pkey on AA  (cost=0.00..85.01 rows=1 width=12)
               ->  Subquery Scan BB  (cost=1.14..110336.40 rows=2043108 width=36)
                     ->  Append  (cost=1.14..89905.32 rows=2043108 width=16)
                           ->  Subquery Scan "*SELECT* 1"  (cost=1.14..75017.28 rows=1634954 width=16)
                           ->  Subquery Scan "*SELECT* 2"  (cost=0.00..14888.04 rows=408154 width=16)
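
For concreteness, here is a rough sketch of the query shape that produces 
a plan like the one above; every table and column name other than AA is 
made up for illustration:

   SELECT t.*
   FROM outer_table t
   WHERE t.val IN (
       -- "bb" stands in for the view mentioned above
       SELECT bb.val
       FROM AA
       JOIN (SELECT aa_id, val FROM foo_source    -- "*SELECT* 1" in the plan
             UNION ALL
             SELECT aa_id, val FROM bar_source    -- "*SELECT* 2" in the plan
            ) AS bb ON bb.aa_id = AA.id
   );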

The problem for us is that the default estimate at the HashAggregate is 
absurdly low, undercutting the other available join candidates' row 
estimates and resulting in _bad_ plans. What I was wondering is whether 
the planner has enough information available, when looking at the 
HashAggregate step (the IN clause that started this whole mess), to 
consider the statistics of the inputs to the Append. In our case, I know 
that the second Subquery Scan generates strictly unique values, so the 
HashAggregate cannot produce fewer than 408154 rows itself. 
Additionally, those subquery scans are simple queries with nothing that 
would cause the table statistics to be thrown away if they could be 
reused in later steps (or is the row count the only statistic used after 
a node is created?) -- one subquery is a single-table filter, the other 
is a two-table join, also nothing special.

Essentially, does the data exist for the planner to say at the 
HashAggregate, "a-ha! Subquery 2 estimates 400000 distinct rows!" and 
estimate something >= 400000?
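
For what it's worth, the per-column distinct-value estimate I'm hoping 
the planner could reuse is already visible in pg_stats; something like 
this (table and column names made up again) shows it:

   SELECT tablename, attname, n_distinct
   FROM pg_stats
   WHERE tablename = 'bar_source'
     AND attname = 'val';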

If this is improved in 8.0, my apologies; we haven't had the opportunity 
to begin a migration, and the release notes only say there are a lot of 
small improvements. If this is a special case of a long-standing 
limitation, I'm not seeing the general case it fits into, and I'll 
happily sit down and be quiet. :-)

