Thread: Thinking about IN/EXISTS optimization
I've been thinking about how to convert "x IN (subselect)" and EXISTS constructs into join-like processing, and I've run into a small problem in getting the planner to do it nicely. The issue is that I need to take the subselect and push it into the jointree --- essentially, make it look like a subselect-in-FROM --- so that the join planner can deal with it. Basically, I need to rearrange SELECT ... FROM ... WHERE ... AND x IN (SELECT y FROM ...) into SELECT ... FROM ..., (SELECT y FROM ...) ss WHERE ... AND x =* ss.y where =* represents some specially-marked RestrictInfo node. (NOT IN is the same except that the RestrictInfo node will be marked differently.) The difficulty is that there's no good place to do this in subquery_planner(). We should push the subselect into FROM before we run the pull_up_subqueries() and preprocess_jointree() operations; if we don't pull up the subselect into the main query then we won't have accomplished very much. But the WHERE clause isn't simplified into a form that makes it easy to spot top-level IN() expressions until after that. We can't simply switch the order of the subselect and WHERE-clause processing, because pulling up subqueries typically adds conditions to the WHERE clause. I haven't been able to think of a solution to this that doesn't involve wasting a lot of cycles by repeating some of these processing steps, or missing some optimization possibilities. (For example, if we pull up a subquery that came from a view, it might contain an IN where-clause, which ideally we'd want to be able to optimize. It almost seems like we need to be able to loop around the whole operation; but most of the time this will just waste cycles.) Anyone see a nice way to do this? regards, tom lane
This sounds like one of those classic optimizer problems we have had to deal with in the past. I suggest you go through the optimizer pass and set a boolean in Query whenever you do something that may require another loop through, then at the end, you check the boolean and loop if required. I think the rules system has to do something similar. I don't see any way around that, but because you are setting the boolean you only loop when you need to. --------------------------------------------------------------------------- Tom Lane wrote: > I've been thinking about how to convert "x IN (subselect)" and EXISTS > constructs into join-like processing, and I've run into a small problem > in getting the planner to do it nicely. The issue is that I need to > take the subselect and push it into the jointree --- essentially, make > it look like a subselect-in-FROM --- so that the join planner can deal > with it. Basically, I need to rearrange > > SELECT ... FROM ... WHERE ... AND x IN (SELECT y FROM ...) > > into > > SELECT ... FROM ..., (SELECT y FROM ...) ss > WHERE ... AND x =* ss.y > > where =* represents some specially-marked RestrictInfo node. (NOT IN is the > same except that the RestrictInfo node will be marked differently.) > > The difficulty is that there's no good place to do this in > subquery_planner(). We should push the subselect into FROM before we > run the pull_up_subqueries() and preprocess_jointree() operations; > if we don't pull up the subselect into the main query then we won't have > accomplished very much. But the WHERE clause isn't simplified into a > form that makes it easy to spot top-level IN() expressions until after > that. We can't simply switch the order of the subselect and > WHERE-clause processing, because pulling up subqueries typically adds > conditions to the WHERE clause. > > I haven't been able to think of a solution to this that doesn't involve > wasting a lot of cycles by repeating some of these processing steps, > or missing some optimization possibilities. (For example, if we pull up > a subquery that came from a view, it might contain an IN where-clause, > which ideally we'd want to be able to optimize. It almost seems like > we need to be able to loop around the whole operation; but most of the > time this will just waste cycles.) > > Anyone see a nice way to do this? > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Tom, I'm just curious, will your proposed in/exists optimizations help for queries like: db=# explain delete from dns_expired_domains where domain_id in (select domain_id from dns_expired_domains group by domain_id having count(*)=14 ); NOTICE: QUERY PLAN: Seq Scan on dns_expired_domains (cost=0.00..55448724329.92 rows=324754 width=6) SubPlan -> Materialize (cost=85370.33..85370.33 rows=64951 width=4) -> Aggregate (cost=82122.79..85370.33rows=64951 width=4) -> Group (cost=82122.79..83746.56 rows=649508 width=4) -> Sort (cost=82122.79..82122.79 rows=649508 width=4) -> Seq Scan on dns_expired_domains (cost=0.00..10316.08 rows=649508 width=4) EXPLAIN I usually end up having to make a little script that runs the subquery, splits the domain_id's up in to chunks of 1000 or so, then executes several queries similar to: delete from dns_expired_domains where domain_id in (1,2,3,4,5,6,7,8,9,10...) This method seems to work fairly well and executes in a reasonable amount of time, unlike the original query with an estimated cost of 55,448,724,329.92. I attempted to use EXISTS in the same query but it seemed it wanted to delete all the rows in the table, I wasn't able to get it to delete only the ones that occured 14 times in the table. I may have overlooked something though. In any case, it would definately be nice if a query like this worked efficiently. Thanks, and congrats to all the people involved with the 7.3 release, all your hardwork is greatly appreciated. On Tue, 2002-10-22 at 16:18, Tom Lane wrote: > I've been thinking about how to convert "x IN (subselect)" and EXISTS > constructs into join-like processing, and I've run into a small problem > in getting the planner to do it nicely. The issue is that I need to > take the subselect and push it into the jointree -- essentially, make > it look like a subselect-in-FROM -- so that the join planner can deal > with it. Basically, I need to rearrange > > SELECT ... FROM ... WHERE ... AND x IN (SELECT y FROM ...) > > into > > SELECT ... FROM ..., (SELECT y FROM ...) ss > WHERE ... AND x =* ss.y > > where =* represents some specially-marked RestrictInfo node. (NOT IN is the > same except that the RestrictInfo node will be marked differently.) > > The difficulty is that there's no good place to do this in > subquery_planner(). We should push the subselect into FROM before we > run the pull_up_subqueries() and preprocess_jointree() operations; > if we don't pull up the subselect into the main query then we won't have > accomplished very much. But the WHERE clause isn't simplified into a > form that makes it easy to spot top-level IN() expressions until after > that. We can't simply switch the order of the subselect and > WHERE-clause processing, because pulling up subqueries typically adds > conditions to the WHERE clause. > > I haven't been able to think of a solution to this that doesn't involve > wasting a lot of cycles by repeating some of these processing steps, > or missing some optimization possibilities. (For example, if we pull up > a subquery that came from a view, it might contain an IN where-clause, > which ideally we'd want to be able to optimize. It almost seems like > we need to be able to loop around the whole operation; but most of the > time this will just waste cycles.) > > Anyone see a nice way to do this? > > regards, tom lane > > --(end of broadcast)-- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) -- Best Regards, Mike Benoit NetNation Communication Inc. Systems Engineer Tel: 604-684-6892 or 888-983-6600--Disclaimer: Opinions expressed here are my own and not necessarily those of my employer
Mike Benoit <mikeb@netnation.com> writes: > I'm just curious, will your proposed in/exists optimizations help for > queries like: > db=# explain delete from dns_expired_domains where domain_id in (select > domain_id from dns_expired_domains group by domain_id having count(*)=14 > ); Probably, but I'm more than a tad curious about why you're concerned about the efficiency of this particular example. Why would "count=14" be an interesting condition for deleting groups? > Seq Scan on dns_expired_domains (cost=0.00..55448724329.92 rows=324754 > width=6) > SubPlan > -> Materialize (cost=85370.33..85370.33 rows=64951 width=4) > -> Aggregate (cost=82122.79..85370.33 rows=64951 width=4) > -> Group (cost=82122.79..83746.56 rows=649508 width=4) > -> Sort (cost=82122.79..82122.79 rows=649508 > width=4) > -> Seq Scan on dns_expired_domains > (cost=0.00..10316.08 rows=649508 width=4) What are the *actual*, not estimated, row counts here --- ie, how many rows in the table, and how many distinct domain_ids are you typically deleting? regards, tom lane
On Fri, 2002-11-29 at 13:22, Tom Lane wrote: > Mike Benoit <mikeb@netnation.com> writes: > > I'm just curious, will your proposed in/exists optimizations help for > > queries like: > > > db=# explain delete from dns_expired_domains where domain_id in (select > > domain_id from dns_expired_domains group by domain_id having count(*)=14 > > ); > > Probably, but I'm more than a tad curious about why you're concerned > about the efficiency of this particular example. Why would "count=14" > be an interesting condition for deleting groups? The count=14 isn't really that significate, basically I'm just looking for faster execution of queries like: (delete|select) from table where id in (select id from large_table2) For cases where EXISTS won't work properly, and large_table2 has more then ~50,000 rows. > > > Seq Scan on dns_expired_domains (cost=0.00..55448724329.92 rows=324754 > > width=6) > > SubPlan > > -> Materialize (cost=85370.33..85370.33 rows=64951 width=4) > > -> Aggregate (cost=82122.79..85370.33 rows=64951 width=4) > > -> Group (cost=82122.79..83746.56 rows=649508 width=4) > > -> Sort (cost=82122.79..82122.79 rows=649508 > > width=4) > > -> Seq Scan on dns_expired_domains > > (cost=0.00..10316.08 rows=649508 width=4) > > What are the *actual*, not estimated, row counts here --- ie, how many > rows in the table, and how many distinct domain_ids are you typically > deleting? 650,000 actual rows in the table. 40,000 or so are returned by the subquery. About 500,000 rows should end up being deleted. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/users-lounge/docs/faq.html -- Best Regards, Mike Benoit NetNation Communication Inc. Systems Engineer Tel: 604-684-6892 or 888-983-6600---------------------------------------Disclaimer: Opinions expressed here are my own andnot necessarily those of my employer