Bradley Baetz <bbaetz@acm.org> writes:
> On Sun, Jan 26, 2003 at 02:09:49PM -0500, Tom Lane wrote:
>> This isn't really anything to do with the new IN code, but is a
>> long-standing problem: cost_mergejoin doesn't apply any penalty factor
> Hmm. I'm not sure that that is the entire story. For the non-DISTINCT
> version, the top of the tree has an estimated cost of 3485675.30, whilst
> for the DISTINCT case, the estimated cost is 3033.30.
But the DISTINCT case is a different query. The question that's really
at hand is why the planner mistakenly prefers merge join to hash join
in the non-DISTINCT case.
> If the planner
> tried to treat ... IN (SELECT foo ...) as ... IN (SELECT DISTINCT foo
> ...) then presumably the second plan would have been chosen.
You are confusing cost estimation with number-of-resulting-tuples
estimation. They are different things.
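
To make the distinction concrete, here's a minimal sketch (invented
names, nothing to do with the actual planner structs) of the two
separate estimates a plan carries. A plan that's expected to emit
fewer rows is not automatically the cheaper plan:

    /* Hypothetical, simplified analogue of a planner path node.  The
     * point is only that "how expensive is this plan to run" and "how
     * many tuples will it emit" are tracked as separate estimates. */
    typedef struct PathEstimate
    {
        double total_cost;  /* estimated cost to run the plan to completion */
        double rows;        /* estimated number of tuples the plan emits */
    } PathEstimate;

    /* Plan choice compares costs, not row counts. */
    static const PathEstimate *
    cheaper_path(const PathEstimate *a, const PathEstimate *b)
    {
        return (a->total_cost <= b->total_cost) ? a : b;
    }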
> What appears to be happening is that the optimiser doesn't realise
> that the real cost is less because of:
Because the real cost is NOT less because of that. See above.
> -> Hash (cost=795.00..795.00 rows=50000 width=4) (actual
> time=146.34..146.34 rows=0 loops=1)
> Why is the actual number of rows 0?
Because a Hash node doesn't return any tuples in the normal fashion;
it's only called to build a hash table, which is then probed by the
HashJoin node. (No, I don't know why the Berkeley boys designed it that
way, and not as one combined plan node. Maybe they had visions of using
Hash for other things.)
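
Roughly, the division of labor looks like this (an invented sketch,
not the actual executor code): the Hash node's output is a side
effect, the filled hash table, so EXPLAIN ANALYZE sees it return zero
rows, and the HashJoin node does the probing.

    #include <stddef.h>

    /* Invented types and helpers standing in for executor machinery. */
    typedef struct Tuple Tuple;
    typedef struct HashTable HashTable;
    extern Tuple *fetch_next(void *child);              /* pull one tuple */
    extern void   hash_insert(HashTable *ht, Tuple *t);
    extern Tuple *hash_probe(HashTable *ht, Tuple *t);
    extern void   emit_joined(Tuple *outer, Tuple *inner);

    /* "Hash node": drain the inner child into the table.  It emits no
     * tuples of its own, hence rows=0 in EXPLAIN ANALYZE output. */
    static void
    hash_build(void *inner_child, HashTable *ht)
    {
        Tuple *t;
        while ((t = fetch_next(inner_child)) != NULL)
            hash_insert(ht, t);
    }

    /* "HashJoin node": probe the prebuilt table with each outer tuple.
     * (A real join would loop over all matches, not just the first.) */
    static void
    hash_join(void *outer_child, HashTable *ht)
    {
        Tuple *outer, *match;
        while ((outer = fetch_next(outer_child)) != NULL)
            if ((match = hash_probe(ht, outer)) != NULL)
                emit_joined(outer, match);
    }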
> While I'm going through the stats, the number of rows for a JOIN_IN
> merge is wrong - it's currently |outer_rel->rows * selec;|, but should
> probably be |outer_rel->rows * selec * inner_rel->num_distinct_rows;| -
> notice the estimated-vs-actual for the final join, of 5580 vs 50000.
Not sure about that. If the row count were computed that way, the
selectivity would have to be estimated differently to match. It's
probably bogus as-is (note the comments), but I haven't had time to
think about it.
The major reason for writing the equation this way is that, given a
selectivity ranging from 0 to 1, multiplying by the outer row count
alone keeps the result in the correct range of possible outputs (0 to
the number of outer rows). Multiplying by num_distinct_rows as well
could yield an impossible result.
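
Plugging in the numbers quoted above makes the point (the selectivity
here is just back-derived from the 5580 estimate, for illustration):

    #include <stdio.h>

    int main(void)
    {
        double outer_rows = 50000.0;
        double selec      = 5580.0 / 50000.0;   /* 0.1116, back-derived */
        double ndistinct  = 50000.0;            /* inner distinct count */

        /* Current formula: guaranteed to land in [0, outer_rows]. */
        printf("outer * selec             = %.0f\n", outer_rows * selec);

        /* Proposed formula: can wildly exceed outer_rows, which is
         * impossible for an IN-style join that emits each outer row at
         * most once. */
        printf("outer * selec * ndistinct = %.0f\n",
               outer_rows * selec * ndistinct);
        return 0;
    }

which prints 5580 and 279000000 respectively.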
> IOW, should the JOIN_IN and JOIN_REVERSE_IN be using the same
> cost calcs as JOIN_INNER_UNIQUE and JOIN_OUTER_UNIQUE in
> set_joinrel_size_estimates ?
Maybe; from a mechanical point of view, the existing code does the right
thing given 0-to-1 selectivity estimates in each case.
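
Schematically, each join type's estimate has the same shape: a 0-to-1
selectivity times a base row count whose upper bound matches what the
join can actually produce. A simplified sketch (invented names, not
the real set_joinrel_size_estimates):

    /* Hypothetical per-jointype row estimation, much simplified. */
    typedef enum { JOINTYPE_INNER, JOINTYPE_IN } JoinType;

    static double
    estimate_join_rows(JoinType jt, double outer_rows, double inner_rows,
                       double selec)   /* assumed to be in [0, 1] */
    {
        switch (jt)
        {
            case JOINTYPE_INNER:
                /* Filtered cross product: range 0 .. outer * inner. */
                return outer_rows * inner_rows * selec;
            case JOINTYPE_IN:
                /* Each outer row emitted at most once: range 0 .. outer. */
                return outer_rows * selec;
        }
        return 0;                      /* not reached */
    }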
regards, tom lane