Re: New hashed IN code ignores distinctiveness of subquery - Mailing list pgsql-bugs
From | Bradley Baetz |
---|---|
Subject | Re: New hashed IN code ignores distinctiveness of subquery |
Date | |
Msg-id | 20030126223739.GA1193@mango.home Whole thread Raw |
In response to | Re: New hashed IN code ignores distinctiveness of subquery (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: New hashed IN code ignores distinctiveness of subquery
|
List | pgsql-bugs |
On Sun, Jan 26, 2003 at 02:09:49PM -0500, Tom Lane wrote: > > This isn't really anything to do with the new IN code, but is a > long-standing problem: cost_mergejoin doesn't apply any penalty factor > for the case where there are lots of duplicates in both inner and outer > relation (causing rescans of big chunks of the inner relation). You can > see the rescanning happening in the EXPLAIN ANALYZE output: > > > -> Merge Join (cost=0.00..3485661.38 rows=5570 width=8) (actual > > time=0.15..1429696.69 rows=50000 loops=1) > > Merge Cond: ("outer".product_id = "inner".product_id) > > -> Index Scan using bugs_product_id_idx on bugs > > (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.12..358.43 > > rows=50000 loops=1) > > -> Index Scan using bugs_product_id_idx on bugs > > (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.01..1152455.44 > > rows=277884160 loops=1) > ^^^^^^^^^^^^^^ > > 277884160 rows pulled from a 50000-row relation means a heck of a lot of > rescanning went on :-( Hmm. I'm not sure that that is the entire story. For the non-DISTINCT version, the top of the tree has an estimated cost of 3485675.30, whilst for the DISTINCT case, the estimated cost is 3033.30. If the planner tried to treat ... IN (SELECT foo ...) as ... IN (SELECT DISTINCT foo ...) then presumably the second plan would have been chosen. IOW, the cost estimate is roughly right (although its actually probably low, because the number of rows is wrong - see below), but the DISTINCTiveness of IN isn't being considered. > > The good news is that the system *is* aware of the small number of > distinct values in the table (note the dead-on estimate of the number of > distinct rows in your other query; which I think is from new-for-7.4 > code, though the required stats have been available since 7.2). > > I think it'd probably be possible to make some reasonable estimate of > the amount of rescanning required, and then inflate the mergejoin cost > estimate proportionally. I have not gotten around to looking at the > problem though. Care to take a shot at it? See above - the estimated merge_join cost appears to be being inflated already (to 3485661.38). That may be off by an order of magnitude, I guess, but its higher than the hash_join version. What appears not to be happening is that the optimiser doesn't realise that the real cost is less because of: /* * If we're doing an IN join, we want to return at most one row per * outer tuple; so we can stop scanning the inner scan if we matched on * the previous try. */ if (node->js.jointype == JOIN_IN && node->hj_MatchedOuter) node->hj_NeedNewOuter = true; Actually, from the last plan in my previous mail: -> Hash (cost=795.00..795.00 rows=50000 width=4) (actual time=146.34..146.34 rows=0 loops=1) Why is the actual number of rows 0? Would it be easier to have the hashing stage discard duplicates for an IN hash? I guess thats the same as doing a hash-based DISTINCT, though, although the DISTINCT plan has: -> Hash -> Subquery Scan "IN_subquery" -> Unique -> Index scan using bugs_product_id_idx so I guess the unique and the hash aren't combined (Not that the unique is needed - the indexscan should only be returning unique values anyway) While I'm going through the stats, the number of rows for a JOIN_IN merge is wrong - its currently |outer_rel->rows * selec;|, but should probably be |outer_rel->rows * selec * inner_rel->num_distinct_rows;| - notice the estimated-vs-actual for the final join, of 5580 vs 50000. 5580 would be correct if the IN only returned one value, but thats not the case. IOW, should the JOIN_IN and JOIN_REVERSE_IN be using the same cost calcs as JOIN_INNER_UNIQUE and JOIN_OUTER_UNIQUE in set_joinrel_size_estimates ? > > regards, tom lane Thanks, Bradley
pgsql-bugs by date: