Thread: Cost estimation problem on seq scan in a loop
While doing ad hoc queries I've seen several different problems that all seem to be variations on a theme.
The plan comes out looking like this:
Nested Loop (cost=826867.95..877038.04 rows=1 width=125)
Join Filter: (foo.bar = smallish_table.bar)
-> Something Complicated (cost=826867.95..876800.28 rows=1 width=81)
.....
-> Seq Scan on smallish_table (cost=0.00..142.89 rows=7389 width=44)
The estimate of rows=1 for Something Complicated is wrong; in reality it produces 1,000 or 100,000 rows. That means the seq scan on smallish_table gets repeated once per outer row, and the time really adds up.
It would be great if Something Complicated had the correct row estimate, but since I've seen this situation arise with a lot of different Something Complicateds that don't have much to do with each other (although usually an antijoin of some kind is involved), there is little reason to think we can squash every one of them.
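For concreteness, the shape is roughly like the following (everything here except foo, smallish_table, and bar is a made-up name):

    SELECT sc.*, st.other_col
    FROM (
        SELECT f.bar, f.payload
        FROM foo f
        WHERE NOT EXISTS (            -- the antijoin that gets misestimated
            SELECT 1 FROM baz b WHERE b.foo_id = f.id
        )
    ) sc
    JOIN smallish_table st ON sc.bar = st.bar;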
Is there some principled way to go about teaching the planner that hashing smallish_table on the join filter key is a cheap insurance policy against underestimating the row count of the outer loop?
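To be clear, the blunt hammer already exists (enable_nestloop is a stock planner GUC), but it swings at the whole query and you have to know in advance which queries will go wrong:

    SET enable_nestloop = off;   -- planner then prefers a hash join here
    -- ... run the problem query ...
    RESET enable_nestloop;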
Cheers,
Jeff
On Mon, Dec 16, 2013 at 11:41 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Is there some principled way to go about teaching the planner that hashing
> smallish_table on the join filter key is a cheap insurance policy against
> underestimating the row count of the outer loop?

The problem is that cheap protection can end up being very expensive when it's the most expensive part of the query and it's repeated many times. In this query the seq scan is expected to return thousands of rows. The executor goes to some effort to avoid unnecessary copies of data, so that it can be used straight out of the disk buffers; having to copy it one extra time into a hash table would be annoying.

What's more likely, I think, is having plan nodes that make decisions at run time. There's been some movement in this direction already, and lots of discussion about it. Having a join type that retrieves the first few rows from the lhs and then decides whether to do a hash or a nested loop on the rhs based on how many it finds might be more tractable than most other strategies.
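A crude way to emulate that decision by hand today, just to illustrate the idea (sketch only; it reuses the hypothetical foo/baz names from upthread, and unlike a real adaptive node it scans the outer side twice):

    BEGIN;
    DO $$
    DECLARE
      n bigint;
    BEGIN
      -- peek at the first rows of the misestimated outer side
      SELECT count(*) INTO n
      FROM (SELECT 1
            FROM foo f
            WHERE NOT EXISTS (SELECT 1 FROM baz b WHERE b.foo_id = f.id)
            LIMIT 100) s;
      -- plenty of outer rows: steer the planner toward a hash join
      IF n >= 100 THEN
        SET LOCAL enable_nestloop = off;
      END IF;
    END $$;
    -- ... run the problem query here; the setting reverts at COMMIT ...
    COMMIT;

A real executor node would make the same choice without re-reading the lhs, which is what makes it attractive.

--
greg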