Thread: New hashed IN code ignores distinctiveness of subquery
I've been trying out the new hashed subselect code from CVS. It appears
that the planner isn't taking the distinctiveness of the values from the
subselect into account:

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN (SELECT product_id FROM bugs);
                                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3485675.30..3485675.30 rows=1 width=8) (actual time=1430065.54..1430065.55 rows=1 loops=1)
   ->  Merge Join  (cost=0.00..3485661.38 rows=5570 width=8) (actual time=0.15..1429696.69 rows=50000 loops=1)
         Merge Cond: ("outer".product_id = "inner".product_id)
         ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.12..358.43 rows=50000 loops=1)
         ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.01..1152455.44 rows=277884160 loops=1)
 Total runtime: 1430102.08 msec
(6 rows)

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN (SELECT distinct product_id FROM bugs);
                                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3033.30..3033.30 rows=1 width=8) (actual time=505.17..505.17 rows=1 loops=1)
   ->  Hash Join  (cost=1959.54..3032.67 rows=251 width=8) (actual time=282.19..456.66 rows=50000 loops=1)
         Hash Cond: ("outer".product_id = "inner".product_id)
         ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4) (actual time=0.01..68.94 rows=50000 loops=1)
         ->  Hash  (cost=1959.52..1959.52 rows=9 width=4) (actual time=282.14..282.14 rows=0 loops=1)
               ->  Subquery Scan "IN_subquery"  (cost=0.00..1959.52 rows=9 width=4) (actual time=0.13..282.08 rows=9 loops=1)
                     ->  Unique  (cost=0.00..1959.52 rows=9 width=4) (actual time=0.13..282.03 rows=9 loops=1)
                           ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..245.46 rows=50000 loops=1)
 Total runtime: 505.30 msec
(9 rows)

bbaetz=# set enable_mergejoin=false;
SET
bbaetz=# explain analyze select count(*) FROM bugs where product_id IN (SELECT product_id FROM bugs);
                                                        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4281600.45..4281600.45 rows=1 width=8) (actual time=486.80..486.80 rows=1 loops=1)
   ->  Hash Join  (cost=1091.00..4281586.50 rows=5580 width=8) (actual time=146.58..425.92 rows=50000 loops=1)
         Hash Cond: ("outer".product_id = "inner".product_id)
         ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4) (actual time=0.04..75.73 rows=50000 loops=1)
         ->  Hash  (cost=795.00..795.00 rows=50000 width=4) (actual time=146.34..146.34 rows=0 loops=1)
               ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4) (actual time=0.06..64.98 rows=50000 loops=1)
 Total runtime: 486.91 msec
(7 rows)

bugs is a table with 50000 rows, and products has 10 rows. (Only 9 of the
products are actually used in bugs.product_id, due to an off-by-one error
in the script I used to generate the table.) I still haven't tuned the
various optimiser settings, which may explain part of the
enable_mergejoin=false result, and the DISTINCT probably takes some time
too. (Side note - is it possible to notice that a DISTINCT on a column
with a unique index doesn't need a Unique pass?) However, 23 minutes vs
0.5 seconds isn't due to that.
This is a fairly useless and silly query though - I was just playing
around. The tables have been analyzed, and there are separate unique
indexes on products.id, bugs.bug_id and bugs.product_id. FWIW:

bbaetz=# select n_distinct, correlation FROM pg_stats WHERE tablename='products' AND attname='product_id';
 n_distinct | correlation
------------+-------------
          9 |   0.0919474
(1 row)

Thanks,

Bradley

Bradley Baetz <bbaetz@acm.org> writes:
> I've been trying out the new hashed subselect code from CVS. It appears
> that the planner isn't taking the distinctiveness of the values from the
> subselect into account:

This isn't really anything to do with the new IN code, but is a
long-standing problem: cost_mergejoin doesn't apply any penalty factor
for the case where there are lots of duplicates in both the inner and
outer relations (causing rescans of big chunks of the inner relation).
You can see the rescanning happening in the EXPLAIN ANALYZE output:

>    ->  Merge Join  (cost=0.00..3485661.38 rows=5570 width=8) (actual time=0.15..1429696.69 rows=50000 loops=1)
>          Merge Cond: ("outer".product_id = "inner".product_id)
>          ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.12..358.43 rows=50000 loops=1)
>          ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.01..1152455.44 rows=277884160 loops=1)
                                                                                                                                  ^^^^^^^^^^^^^^

277884160 rows pulled from a 50000-row relation means a heck of a lot of
rescanning went on :-(

The good news is that the system *is* aware of the small number of
distinct values in the table (note the dead-on estimate of the number of
distinct rows in your other query, which I think comes from new-for-7.4
code, though the required stats have been available since 7.2).

I think it'd probably be possible to make some reasonable estimate of
the amount of rescanning required, and then inflate the mergejoin cost
estimate proportionally. I have not gotten around to looking at the
problem though. Care to take a shot at it?

            regards, tom lane

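Tom's proposed fix isn't implemented anywhere in this thread; as a rough
illustration only, a standalone C sketch of the kind of adjustment he is
describing might look like the following. The model and every name in it
are assumptions made for the example - this is not the real
cost_mergejoin.

    #include <stdio.h>

    /*
     * Standalone sketch, NOT PostgreSQL source: estimate how many
     * inner-tuple fetches a mergejoin does when the inner input has
     * heavy duplication, and scale the inner-scan cost by that rescan
     * factor.
     */
    static double
    inner_cost_with_rescans(double inner_rows, double inner_scan_cost,
                            double inner_distinct, double outer_rows)
    {
        /* average number of inner tuples sharing one join key */
        double inner_group = inner_rows / inner_distinct;

        /*
         * Each outer tuple has to walk its whole group of matching
         * inner tuples, so total inner fetches ~= outer_rows * inner_group.
         */
        double total_inner_fetches = outer_rows * inner_group;
        double rescan_factor = total_inner_fetches / inner_rows;

        return inner_scan_cost * rescan_factor;
    }

    int
    main(void)
    {
        /* numbers loosely taken from the thread: 50000 rows, 9 distinct keys */
        double fetches = 50000.0 * (50000.0 / 9.0);
        double cost = inner_cost_with_rescans(50000.0, 2313.33, 9.0, 50000.0);

        printf("predicted inner fetches: %.0f\n", fetches);
        printf("inflated inner-scan cost: %.2f\n", cost);
        return 0;
    }

With the thread's numbers this predicts roughly 2.8e8 inner fetches,
which is in line with the rows=277884160 reported for the rescanned
index scan above.
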
On Sun, Jan 26, 2003 at 02:09:49PM -0500, Tom Lane wrote:
>
> This isn't really anything to do with the new IN code, but is a
> long-standing problem: cost_mergejoin doesn't apply any penalty factor
> for the case where there are lots of duplicates in both the inner and
> outer relations (causing rescans of big chunks of the inner relation).
> You can see the rescanning happening in the EXPLAIN ANALYZE output:
>
> >    ->  Merge Join  (cost=0.00..3485661.38 rows=5570 width=8) (actual time=0.15..1429696.69 rows=50000 loops=1)
> >          Merge Cond: ("outer".product_id = "inner".product_id)
> >          ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.12..358.43 rows=50000 loops=1)
> >          ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.01..1152455.44 rows=277884160 loops=1)
>                                                                                                                                   ^^^^^^^^^^^^^^
>
> 277884160 rows pulled from a 50000-row relation means a heck of a lot of
> rescanning went on :-(

Hmm. I'm not sure that that is the entire story. For the non-DISTINCT
version, the top of the tree has an estimated cost of 3485675.30, whilst
for the DISTINCT case the estimated cost is 3033.30. If the planner
tried to treat ... IN (SELECT foo ...) as ... IN (SELECT DISTINCT foo ...)
then presumably the second plan would have been chosen. IOW, the cost
estimate is roughly right (although it's actually probably low, because
the number of rows is wrong - see below), but the DISTINCTiveness of IN
isn't being considered.

> The good news is that the system *is* aware of the small number of
> distinct values in the table (note the dead-on estimate of the number of
> distinct rows in your other query, which I think comes from new-for-7.4
> code, though the required stats have been available since 7.2).
>
> I think it'd probably be possible to make some reasonable estimate of
> the amount of rescanning required, and then inflate the mergejoin cost
> estimate proportionally. I have not gotten around to looking at the
> problem though. Care to take a shot at it?

See above - the estimated mergejoin cost appears to be inflated already
(to 3485661.38). That may be off by an order of magnitude, I guess, but
it's higher than the hashjoin version. What does seem to be happening is
that the optimiser doesn't realise that the real cost is less because
of:

    /*
     * If we're doing an IN join, we want to return at most one row per
     * outer tuple; so we can stop scanning the inner scan if we matched
     * on the previous try.
     */
    if (node->js.jointype == JOIN_IN && node->hj_MatchedOuter)
        node->hj_NeedNewOuter = true;

Actually, from the last plan in my previous mail:

   ->  Hash  (cost=795.00..795.00 rows=50000 width=4) (actual time=146.34..146.34 rows=0 loops=1)

Why is the actual number of rows 0?

Would it be easier to have the hashing stage discard duplicates for an
IN hash? I guess that's the same as doing a hash-based DISTINCT, though,
although the DISTINCT plan has:

   ->  Hash
         ->  Subquery Scan "IN_subquery"
               ->  Unique
                     ->  Index Scan using bugs_product_id_idx

so I guess the Unique and the Hash aren't combined. (Not that the Unique
is needed - the index scan should only be returning unique values
anyway.)

While I'm going through the stats, the number of rows for a JOIN_IN
merge is wrong - it's currently |outer_rel->rows * selec;|, but should
probably be |outer_rel->rows * selec * inner_rel->num_distinct_rows;| -
notice the estimated-vs-actual for the final join, of 5580 vs 50000.
5580 would be correct if the IN only returned one value, but that's not
the case. IOW, should JOIN_IN and JOIN_REVERSE_IN be using the same cost
calcs as JOIN_UNIQUE_INNER and JOIN_UNIQUE_OUTER in
set_joinrel_size_estimates?

>             regards, tom lane

Thanks,

Bradley

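The "discard duplicates for an IN hash" idea above (revisited later in
the thread as folding the uniq-ification into the Hash node) can be
illustrated with a self-contained C toy. This is not PostgreSQL's Hash
node - the fixed-size table and all names are assumptions - it only
shows the principle of skipping already-seen keys during the build:

    #include <stdbool.h>
    #include <stdio.h>

    #define NBUCKETS 64      /* toy size; a real build would size this from stats */

    typedef struct HashEntry
    {
        int               key;
        struct HashEntry *next;
    } HashEntry;

    static HashEntry *buckets[NBUCKETS];
    static HashEntry  pool[NBUCKETS];  /* plenty for the few distinct keys here */
    static int        npool = 0;

    /* Insert key unless an equal key is already present; true if added. */
    static bool
    hash_insert_unique(int key)
    {
        int        b = (unsigned int) key % NBUCKETS;
        HashEntry *e;

        for (e = buckets[b]; e != NULL; e = e->next)
            if (e->key == key)
                return false;      /* duplicate: dropped during the build */

        e = &pool[npool++];
        e->key = key;
        e->next = buckets[b];
        buckets[b] = e;
        return true;
    }

    int
    main(void)
    {
        /* 50000 inner tuples drawn from 9 distinct product_ids */
        int inserted = 0;

        for (int i = 0; i < 50000; i++)
            if (hash_insert_unique(i % 9 + 1))
                inserted++;

        printf("hash table holds %d entries\n", inserted);   /* prints 9 */
        return 0;
    }

Whether doing this inside the Hash node beats a separate Unique or
HashAggregate step is the trade-off discussed in the following messages.
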
Bradley Baetz <bbaetz@acm.org> writes:
> On Sun, Jan 26, 2003 at 02:09:49PM -0500, Tom Lane wrote:
>> This isn't really anything to do with the new IN code, but is a
>> long-standing problem: cost_mergejoin doesn't apply any penalty factor

> Hmm. I'm not sure that that is the entire story. For the non-DISTINCT
> version, the top of the tree has an estimated cost of 3485675.30, whilst
> for the DISTINCT case the estimated cost is 3033.30.

But the DISTINCT case is a different query. The question that's really
at hand is why the planner mistakenly prefers merge to hash join in the
non-DISTINCT case.

> If the planner
> tried to treat ... IN (SELECT foo ...) as ... IN (SELECT DISTINCT foo
> ...) then presumably the second plan would have been chosen.

You are confusing cost estimation with number-of-resulting-tuples
estimation. They are different things.

> What does seem to be happening is that the optimiser doesn't realise
> that the real cost is less because of:

Because the real cost is NOT less because of that. See above.

>    ->  Hash  (cost=795.00..795.00 rows=50000 width=4) (actual time=146.34..146.34 rows=0 loops=1)
>
> Why is the actual number of rows 0?

Because a Hash node doesn't return any tuples in the normal fashion;
it's only called to build a hash table, which is then probed by the
HashJoin node. (No, I don't know why the Berkeley boys designed it that
way, and not as one combined plan node. Maybe they had visions of using
Hash for other things.)

> While I'm going through the stats, the number of rows for a JOIN_IN
> merge is wrong - it's currently |outer_rel->rows * selec;|, but should
> probably be |outer_rel->rows * selec * inner_rel->num_distinct_rows;| -
> notice the estimated-vs-actual for the final join, of 5580 vs 50000.

Not sure about that. The selectivity would need to be calculated
differently if we based the calculation on that. It's probably bogus
as-is (note the comments), but I haven't had time to think about it.
The major reason why the equation is stated the way it is is that, given
a selectivity ranging from 0 to 1, multiplying by outer rows only yields
the correct range of possible results (0 to the number of outer rows).
Multiplying by num_distinct_rows could yield an impossible result.

> IOW, should JOIN_IN and JOIN_REVERSE_IN be using the same
> cost calcs as JOIN_UNIQUE_INNER and JOIN_UNIQUE_OUTER in
> set_joinrel_size_estimates?

Maybe; from a mechanical point of view, the existing code does the right
thing given 0-to-1 selectivity estimates in each case.

            regards, tom lane

On Sun, Jan 26, 2003 at 06:51:18PM -0500, Tom Lane wrote:
> Bradley Baetz <bbaetz@acm.org> writes:
> > On Sun, Jan 26, 2003 at 02:09:49PM -0500, Tom Lane wrote:
> >> This isn't really anything to do with the new IN code, but is a
> >> long-standing problem: cost_mergejoin doesn't apply any penalty factor
>
> > Hmm. I'm not sure that that is the entire story. For the non-DISTINCT
> > version, the top of the tree has an estimated cost of 3485675.30, whilst
> > for the DISTINCT case the estimated cost is 3033.30.
>
> But the DISTINCT case is a different query. The question that's really
> at hand is why the planner mistakenly prefers merge to hash join in the
> non-DISTINCT case.

Well, it's a different input, but the output is the same, so it may as
well be the same query. IN (1,1,1) is the same as IN (1).

As I mentioned, I think the code prefers a merge join because the
calculations assume that all of the 50000 rows returned from the
subquery are unique. If that were the case, then a mergejoin probably
would be correct. Given that most of the values are the same, however,
it's quicker to just do a hash probe for each entry. This probably ties
into the 'long-standing problem' you mentioned earlier, but while fixing
that would possibly increase the cost to the stage where the hashjoin is
picked instead, it wouldn't fix the fact that the hashjoin cost is way
too high to start with.

> > If the planner
> > tried to treat ... IN (SELECT foo ...) as ... IN (SELECT DISTINCT foo
> > ...) then presumably the second plan would have been chosen.
>
> You are confusing cost estimation with number-of-resulting-tuples
> estimation. They are different things.

I mentioned them both because they are related, although not necessarily
for the bug(s) here. IN (SELECT foo) will return more tuples, but it's
semantically identical to the IN (SELECT DISTINCT foo) form. For a
subselect with a high selectivity it's better to filter later on, but
where the result will have lots of repeated tuples it may be better to
filter first, which reduces the number of tuples to join, and possibly
lets an index scan be used to get the list instead of a seqscan.

> Because a Hash node doesn't return any tuples in the normal fashion;
> it's only called to build a hash table, which is then probed by the
> HashJoin node. (No, I don't know why the Berkeley boys designed it that
> way, and not as one combined plan node. Maybe they had visions of using
> Hash for other things.)

Well, in this case the hash node could prune duplicates as they went
into the table :). Not sure if that would be a win, though.

> > While I'm going through the stats, the number of rows for a JOIN_IN
> > merge is wrong - it's currently |outer_rel->rows * selec;|, but should
> > probably be |outer_rel->rows * selec * inner_rel->num_distinct_rows;| -
> > notice the estimated-vs-actual for the final join, of 5580 vs 50000.
>
> Not sure about that. The selectivity would need to be calculated
> differently if we based the calculation on that. It's probably bogus
> as-is (note the comments), but I haven't had time to think about it.
> The major reason why the equation is stated the way it is is that, given
> a selectivity ranging from 0 to 1, multiplying by outer rows only yields
> the correct range of possible results (0 to the number of outer rows).
> Multiplying by num_distinct_rows could yield an impossible result.

I don't follow that. We have 50000 outer rows, with a selectivity of say
0.1 (to make the maths nicer). That gives roughly the 5000 rows printed
above.
However, that's not the whole story. The result of the join will be 5000
rows _for each (unique) tuple in the subselect result_. For the 10
distinct values I have, that gives 50000 total rows, which is correct.
(I'm not familiar enough with the postgres source to know if that
approach will work with histograms; I presume there's a helper func
somewhere to handle merging that stuff.) Can you give an example where
an impossible result would be given?

> > IOW, should JOIN_IN and JOIN_REVERSE_IN be using the same
> > cost calcs as JOIN_UNIQUE_INNER and JOIN_UNIQUE_OUTER in
> > set_joinrel_size_estimates?
>
> Maybe; from a mechanical point of view, the existing code does the right
> thing given 0-to-1 selectivity estimates in each case.

See above. What is the purpose of the _UNIQUE forms, anyway?

>             regards, tom lane

Thanks,

Bradley

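For what it's worth, the two row estimates being argued about can be
compared with a trivial standalone calculation. The selectivity below is
just back-calculated from the plans earlier in the thread (5580
estimated rows over 50000 outer rows); it is not a real planner figure,
and all the names are made up:

    #include <stdio.h>

    /* Toy comparison of the two JOIN_IN row-estimate formulas. */
    int
    main(void)
    {
        double outer_rows     = 50000.0;
        double inner_distinct = 9.0;              /* distinct product_ids in use */
        double selec          = 5580.0 / 50000.0; /* ~0.1116, back-calculated */

        double rows_current  = outer_rows * selec;                   /* what the code does now */
        double rows_proposed = outer_rows * selec * inner_distinct;  /* Bradley's suggestion */

        printf("current estimate:  %.0f rows\n", rows_current);   /* 5580  */
        printf("proposed estimate: %.0f rows\n", rows_proposed);  /* 50220 */
        printf("actual join size:  50000 rows\n");
        return 0;
    }

The proposed figure lines up with the 50218-row estimate reported
further down the thread once the change is actually tried, and with the
50000 rows the join really produces.
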
Bradley Baetz <bbaetz@acm.org> writes:
> On Sun, Jan 26, 2003 at 06:51:18PM -0500, Tom Lane wrote:
>> But the DISTINCT case is a different query. The question that's really
>> at hand is why the planner mistakenly prefers merge to hash join in the
>> non-DISTINCT case.

> Well, it's a different input, but the output is the same, so it may as
> well be the same query. IN (1,1,1) is the same as IN (1).

That's a good observation with respect to the problem of estimating the
number of output tuples; but it's irrelevant to the problem of
estimating execution cost. The plan that we're trying to cost here is
specifically a plan that runs a mergejoin *without* first uniq-ifying
the righthand relation. As such, the cost of the mergejoin is nearly the
same as the cost of an ordinary mergejoin of the two relations. We save
the cost of outputting join tuples that aren't wanted by the IN, but we
still have to run the mergejoin over those tuples.

> Well, in this case the hash node could prune duplicates as they went
> into the table :). Not sure if that would be a win, though.

We're already checking that as a separate plan alternative. The
implementation could be improved a little, though --- we could combine
the uniq-ification into the Hash node.

> I don't follow that. We have 50000 outer rows, with a selectivity of say
> 0.1 (to make the maths nicer). That gives roughly the 5000 rows printed
> above. However, that's not the whole story. The result of the join will
> be 5000 rows _for each (unique) tuple in the subselect result_. For the
> 10 distinct values I have, that gives 50000 total rows, which is
> correct.

AFAICS there are two or three different concepts of selectivity being
tossed about here. If we want to continue defining "selectivity" as the
ratio of the number of output tuples to the cross-product size, then you
need different selectivity numbers depending on whether you take the
cross-product size to be num_outer * num_inner, or
num_outer * num_distinct_inner, or just num_outer. Only in the last case
will the valid range of selectivity run from 0 to 1; if the selectivity
estimator returns something approaching 1, then the first two equations
will give very wrong answers.

As an example, consider the case where all the entries in both tables
are the same value (say 42). The present selectivity estimator *will*
say 1.0, as it should for an ordinary join. If you multiply by anything
but num_outer you get the wrong answer for the number of output tuples.

It might be that we have to bite the bullet and set up a different
selectivity estimation procedure for IN. I'd prefer to avoid that but
haven't really figured out if it's necessary or not.

In the meantime, I think you are right that it's bogus that
set_joinrel_size_estimates uses different equations for JOIN_IN and
JOIN_UNIQUE_INNER. Whatever the decision is about how to do the
estimation, these should be done the same way.

            regards, tom lane

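Tom's range argument can be restated as a small standalone calculation
(illustrative numbers and names only: 1000 rows of the value 42 on each
side, so the estimator reports a selectivity of 1.0 while an IN join can
return at most one row per outer tuple):

    #include <stdio.h>

    int
    main(void)
    {
        double num_outer          = 1000.0;
        double num_inner          = 1000.0;
        double num_distinct_inner = 1.0;    /* every inner value is 42 */
        double selec              = 1.0;    /* what the estimator says */

        /* correct here, and can never exceed num_outer for any selec in [0,1] */
        printf("selec * num_outer                      = %.0f\n",
               selec * num_outer);

        /* right here only because num_distinct_inner is 1; with the same
         * selec and several distinct inner values it would exceed
         * num_outer, which is impossible for an IN join */
        printf("selec * num_outer * num_distinct_inner = %.0f\n",
               selec * num_outer * num_distinct_inner);

        /* the full cross product: wildly wrong for an IN join */
        printf("selec * num_outer * num_inner          = %.0f\n",
               selec * num_outer * num_inner);
        return 0;
    }
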
On Sun, Jan 26, 2003 at 09:43:18PM -0500, Tom Lane wrote:
> We're already checking that as a separate plan alternative. The
> implementation could be improved a little, though --- we could combine
> the uniq-ification into the Hash node.

Right, or skip it entirely when selecting stuff with unique constraints.

> AFAICS there are two or three different concepts of selectivity being
> tossed about here. <snip>

Ah, OK. Yeah, I think I was confusing myself there.

> It might be that we have to bite the bullet and set up a different
> selectivity estimation procedure for IN. I'd prefer to avoid that
> but haven't really figured out if it's necessary or not.

I don't think it is. The number of rows is correct if you do
product_id IN (1) vs product_id IN (1,2) vs product_id IN (1,2,3) and so
on. The only difference for a subselect is that the number of values
generated is only known approximately, rather than exactly, but I don't
think that that's relevant.

> In the meantime, I think you are right that it's bogus that
> set_joinrel_size_estimates uses different equations for JOIN_IN and
> JOIN_UNIQUE_INNER. Whatever the decision is about how to do the
> estimation, these should be done the same way.

Hmm. So, I made that change, and now get an estimate of 50218 rows for
the join result, which is about right (although capping the result at
the number of input rows is valid for an inner join), but the cost for
the hashjoin is still 4281586.50. cost_hashjoin probably needs to be
taught about the short circuiting done for _IN, but I'm not sure where
to do that from a quick glance through the code.

What is the point of JOIN_UNIQUE_{INNER,OUTER}, though? What does it do
that JOIN_IN doesn't? executor/* doesn't appear to use it.

>             regards, tom lane

Bradley

Bradley Baetz <bbaetz@acm.org> writes:
> On Sun, Jan 26, 2003 at 09:43:18PM -0500, Tom Lane wrote:
>> We're already checking that as a separate plan alternative. The
>> implementation could be improved a little, though --- we could combine
>> the uniq-ification into the Hash node.

> Right, or skip it entirely when selecting stuff with unique constraints.

I'm hesitant to do that until we have some scheme in place for
invalidating cached plans. Right now, if you drop an index that is used
by some cached plan, you'll hear about it next time the plan is used.
But if the plan doesn't directly use the index, yet depends on its
existence to be correct, you'll get no error and subtly(?) wrong
answers. I don't mind depending on such assumptions in estimating costs,
but I do mind depending on them for correct answers.

> I don't think it is. The number of rows is correct if you do
> product_id IN (1) vs product_id IN (1,2) vs product_id IN (1,2,3) and so
> on.

But that's a completely different code path; it doesn't even enter the
routines we're concerned about here.

> cost_hashjoin probably needs to be taught about the short circuiting
> done for _IN,

Yeah, I think so. Stepping through cost_hashjoin shows that the major
component of the inflated cost is coming from the per-tuple CPU costs,
which are inflated because we're not allowing for the IN shortcircuit.

> What is the point of JOIN_UNIQUE_{INNER,OUTER}, though? What does it do
> that JOIN_IN doesn't?

Uniqify the inner/outer path and then do a normal inner join. See
joinpath.c.

> executor/* doesn't appear to use it.

No; the executor never sees JOIN_REVERSE_IN, either.

            regards, tom lane

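To make the per-tuple CPU cost point concrete, here is a rough
standalone sketch of how the IN short-circuit changes the number of
hash-probe comparisons that would get charged. This is not cost_hashjoin
itself; the cost constant, the bucket-size model and all names are
stand-ins for the real planner inputs:

    #include <stdio.h>

    int
    main(void)
    {
        double outer_rows        = 50000.0;
        double inner_rows        = 50000.0;
        double inner_distinct    = 9.0;
        double cpu_operator_cost = 0.0025;   /* assumed per-comparison charge */

        /* duplicates per hash key: a plain join probe walks all of them */
        double bucket_size = inner_rows / inner_distinct;   /* ~5555 */

        /* plain join: each outer tuple is compared against its whole bucket */
        double cmp_plain = outer_rows * bucket_size;

        /* IN join: the executor stops after the first match for each outer
         * tuple (the short-circuit quoted earlier in the thread), so
         * roughly one comparison per outer tuple */
        double cmp_in = outer_rows * 1.0;

        printf("plain-join comparisons: %.0f  (cpu cost ~%.0f)\n",
               cmp_plain, cmp_plain * cpu_operator_cost);
        printf("IN-join comparisons:    %.0f  (cpu cost ~%.0f)\n",
               cmp_in, cmp_in * cpu_operator_cost);
        return 0;
    }

The orders-of-magnitude gap between the two comparison counts is the
component Tom identifies as inflating the hashjoin estimate.
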
On Sun, Jan 26, 2003 at 11:18:31PM -0500, Tom Lane wrote:
> Bradley Baetz <bbaetz@acm.org> writes:
> > Right, or skip it entirely when selecting stuff with unique constraints.
>
> I'm hesitant to do that until we have some scheme in place for
> invalidating cached plans.

By cached, do you mean PREPARE stuff, or something else?

> > I don't think it is. The number of rows is correct if you do
> > product_id IN (1) vs product_id IN (1,2) vs product_id IN (1,2,3) and
> > so on.
>
> But that's a completely different code path; it doesn't even enter the
> routines we're concerned about here.

Yes, but it's the same concept. Although we seem to be agreeing about
that now :)

> > What is the point of JOIN_UNIQUE_{INNER,OUTER}, though? What does it do
> > that JOIN_IN doesn't?
>
> Uniqify the inner/outer path and then do a normal inner join. See
> joinpath.c.

Ah, OK. If I comment out line 547 of joinrels.c (which adds JOIN_IN to
the set of join paths) so that the UNIQUE joins are all that are left to
try, then I get:

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN (SELECT product_id FROM bugs);
                                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3494816.98..3494816.98 rows=1 width=8) (actual time=579.71..579.71 rows=1 loops=1)
   ->  Merge Join  (cost=5169.41..3494691.43 rows=50218 width=8) (actual time=111.41..530.16 rows=50000 loops=1)
         Merge Cond: ("outer".product_id = "inner".product_id)
         ->  Index Scan using bugs_product_id_idx on bugs  (cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..249.57 rows=50000 loops=1)
         ->  Sort  (cost=920.14..920.17 rows=9 width=4) (actual time=111.25..143.42 rows=44476 loops=1)
               Sort Key: public.bugs.product_id
               ->  HashAggregate  (cost=920.00..920.00 rows=9 width=4) (actual time=111.17..111.18 rows=9 loops=1)
                     ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4) (actual time=0.00..67.41 rows=50000 loops=1)
 Total runtime: 579.84 msec
(9 rows)

(This isn't picked without my hack, because the cost is slightly higher
than the JOIN_IN version.)

However, it's much faster (although not as fast as sticking the DISTINCT
in there myself), but the actual row count coming from the Sort is
really odd - where is that number coming from? How can sorting 9 rows
take 44476 anythings? The final mergejoin cost is still way off, too.

>             regards, tom lane

Thanks,

Bradley

Bradley Baetz <bbaetz@acm.org> writes:
> By cached, do you mean PREPARE stuff, or something else?

PREPARE, and cached plans in plpgsql, and cached plans in SPI, and
cached plans for foreign keys (though probably those never use IN), and
perhaps other places I'm not thinking of at the moment.

> However, it's much faster (although not as fast as sticking the DISTINCT
> in there myself), but the actual row count coming from the Sort is
> really odd - where is that number coming from? How can sorting 9 rows
> take 44476 anythings?

We're back full circle to my original comment about rescans in
mergejoin. The EXPLAIN ANALYZE instrumentation counts a re-fetch as
another returned row.

            regards, tom lane

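To see how a 9-row input can show up as rows=44476, here is a purely
illustrative C toy of the behaviour Tom describes. It is not
PostgreSQL's instrumentation code, and the fetch pattern is invented;
the point is only that the counter is bumped on every fetch, re-fetches
included:

    #include <stdio.h>

    typedef struct ToyInstrumentation
    {
        double ntuples;   /* tuples handed back, re-fetches included */
    } ToyInstrumentation;

    static int
    fetch_inner(ToyInstrumentation *instr, int pos, const int *sorted, int nsorted)
    {
        instr->ntuples += 1;          /* counted again even if seen before */
        return sorted[pos % nsorted];
    }

    int
    main(void)
    {
        ToyInstrumentation instr = {0};
        int sorted[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};

        /* invented access pattern: lots of re-fetches from the 9-row
         * sorted inner stream as the merge join marks and restores */
        for (int outer_group = 0; outer_group < 5000; outer_group++)
            for (int i = 0; i < 9; i++)
                (void) fetch_inner(&instr, i, sorted, 9);

        printf("instrumented rows: %.0f (from a 9-row input)\n", instr.ntuples);
        return 0;
    }

The exact count depends on how the merge join happens to mark and
restore, but tens of thousands of re-fetches from a tiny sorted stream
are what push the reported row count far past the input size.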