Thread: Thinking about IN/EXISTS optimization

Thinking about IN/EXISTS optimization

From
Tom Lane
Date:
I've been thinking about how to convert "x IN (subselect)" and EXISTS
constructs into join-like processing, and I've run into a small problem
in getting the planner to do it nicely.  The issue is that I need to
take the subselect and push it into the jointree --- essentially, make
it look like a subselect-in-FROM --- so that the join planner can deal
with it.  Basically, I need to rearrange
SELECT ... FROM ... WHERE ... AND x IN (SELECT y FROM ...)

into
SELECT ... FROM ..., (SELECT y FROM ...) ss       WHERE ... AND x =* ss.y

where =* represents some specially-marked RestrictInfo node.  (NOT IN is the
same except that the RestrictInfo node will be marked differently.)

The difficulty is that there's no good place to do this in
subquery_planner().  We should push the subselect into FROM before we
run the pull_up_subqueries() and preprocess_jointree() operations;
if we don't pull up the subselect into the main query then we won't have
accomplished very much.  But the WHERE clause isn't simplified into a
form that makes it easy to spot top-level IN() expressions until after
that.  We can't simply switch the order of the subselect and
WHERE-clause processing, because pulling up subqueries typically adds
conditions to the WHERE clause.

I haven't been able to think of a solution to this that doesn't involve
wasting a lot of cycles by repeating some of these processing steps,
or missing some optimization possibilities.  (For example, if we pull up
a subquery that came from a view, it might contain an IN where-clause,
which ideally we'd want to be able to optimize.  It almost seems like
we need to be able to loop around the whole operation; but most of the
time this will just waste cycles.)

Anyone see a nice way to do this?
        regards, tom lane


Re: Thinking about IN/EXISTS optimization

From
Bruce Momjian
Date:
This sounds like one of those classic optimizer problems we have had to
deal with in the past.  I suggest you go through the optimizer pass and
set a boolean in Query whenever you do something that may require
another loop through, then at the end, you check the boolean and loop if
required.

I think the rules system has to do something similar.  I don't see any
way around that, but because you are setting the boolean you only loop
when you need to.

---------------------------------------------------------------------------

Tom Lane wrote:
> I've been thinking about how to convert "x IN (subselect)" and EXISTS
> constructs into join-like processing, and I've run into a small problem
> in getting the planner to do it nicely.  The issue is that I need to
> take the subselect and push it into the jointree --- essentially, make
> it look like a subselect-in-FROM --- so that the join planner can deal
> with it.  Basically, I need to rearrange
> 
>     SELECT ... FROM ... WHERE ... AND x IN (SELECT y FROM ...)
> 
> into
> 
>     SELECT ... FROM ..., (SELECT y FROM ...) ss
>         WHERE ... AND x =* ss.y
> 
> where =* represents some specially-marked RestrictInfo node.  (NOT IN is the
> same except that the RestrictInfo node will be marked differently.)
> 
> The difficulty is that there's no good place to do this in
> subquery_planner().  We should push the subselect into FROM before we
> run the pull_up_subqueries() and preprocess_jointree() operations;
> if we don't pull up the subselect into the main query then we won't have
> accomplished very much.  But the WHERE clause isn't simplified into a
> form that makes it easy to spot top-level IN() expressions until after
> that.  We can't simply switch the order of the subselect and
> WHERE-clause processing, because pulling up subqueries typically adds
> conditions to the WHERE clause.
> 
> I haven't been able to think of a solution to this that doesn't involve
> wasting a lot of cycles by repeating some of these processing steps,
> or missing some optimization possibilities.  (For example, if we pull up
> a subquery that came from a view, it might contain an IN where-clause,
> which ideally we'd want to be able to optimize.  It almost seems like
> we need to be able to loop around the whole operation; but most of the
> time this will just waste cycles.)
> 
> Anyone see a nice way to do this?
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Thinking about IN/EXISTS optimization

From
Mike Benoit
Date:
Tom,

I'm just curious, will your proposed in/exists optimizations help for
queries like:

db=# explain delete from dns_expired_domains where domain_id in (select
domain_id from dns_expired_domains group by domain_id having count(*)=14
);
NOTICE:  QUERY PLAN:

Seq Scan on dns_expired_domains  (cost=0.00..55448724329.92 rows=324754
width=6) SubPlan   ->  Materialize  (cost=85370.33..85370.33 rows=64951 width=4)         ->  Aggregate
(cost=82122.79..85370.33rows=64951 width=4)               ->  Group  (cost=82122.79..83746.56 rows=649508 width=4)
              ->  Sort  (cost=82122.79..82122.79 rows=649508
 
width=4)                           ->  Seq Scan on dns_expired_domains 
(cost=0.00..10316.08 rows=649508 width=4)

EXPLAIN

I usually end up having to make a little script that runs the subquery,
splits the domain_id's up in to chunks of 1000 or so, then executes
several queries similar to:

delete from dns_expired_domains where domain_id in
(1,2,3,4,5,6,7,8,9,10...)

This method seems to work fairly well and executes in a reasonable
amount of time, unlike the original query with an estimated cost of
55,448,724,329.92. I attempted to use EXISTS in the same query but it
seemed it wanted to delete all the rows in the table, I wasn't able to
get it to delete only the ones that occured 14 times in the table. I may
have overlooked something though.

In any case, it would definately be nice if a query like this worked
efficiently. 

Thanks, and congrats to all the people involved with the 7.3 release,
all your hardwork is greatly appreciated. 


On Tue, 2002-10-22 at 16:18, Tom Lane wrote: 
> I've been thinking about how to convert "x IN (subselect)" and EXISTS
> constructs into join-like processing, and I've run into a small problem
> in getting the planner to do it nicely.  The issue is that I need to
> take the subselect and push it into the jointree -- essentially, make
> it look like a subselect-in-FROM -- so that the join planner can deal
> with it.  Basically, I need to rearrange
> 
>     SELECT ... FROM ... WHERE ... AND x IN (SELECT y FROM ...)
> 
> into
> 
>     SELECT ... FROM ..., (SELECT y FROM ...) ss
>         WHERE ... AND x =* ss.y
> 
> where =* represents some specially-marked RestrictInfo node.  (NOT IN is the
> same except that the RestrictInfo node will be marked differently.)
> 
> The difficulty is that there's no good place to do this in
> subquery_planner().  We should push the subselect into FROM before we
> run the pull_up_subqueries() and preprocess_jointree() operations;
> if we don't pull up the subselect into the main query then we won't have
> accomplished very much.  But the WHERE clause isn't simplified into a
> form that makes it easy to spot top-level IN() expressions until after
> that.  We can't simply switch the order of the subselect and
> WHERE-clause processing, because pulling up subqueries typically adds
> conditions to the WHERE clause.
> 
> I haven't been able to think of a solution to this that doesn't involve
> wasting a lot of cycles by repeating some of these processing steps,
> or missing some optimization possibilities.  (For example, if we pull up
> a subquery that came from a view, it might contain an IN where-clause,
> which ideally we'd want to be able to optimize.  It almost seems like
> we need to be able to loop around the whole operation; but most of the
> time this will just waste cycles.)
> 
> Anyone see a nice way to do this?
> 
>             regards, tom lane
> 
> --(end of broadcast)--
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)



-- 
Best Regards,
Mike Benoit
NetNation Communication Inc.
Systems Engineer
Tel: 604-684-6892 or 888-983-6600--Disclaimer: Opinions expressed here are my own and not necessarily those of my
employer



Re: Thinking about IN/EXISTS optimization

From
Tom Lane
Date:
Mike Benoit <mikeb@netnation.com> writes:
> I'm just curious, will your proposed in/exists optimizations help for
> queries like:

> db=# explain delete from dns_expired_domains where domain_id in (select
> domain_id from dns_expired_domains group by domain_id having count(*)=14
> );

Probably, but I'm more than a tad curious about why you're concerned
about the efficiency of this particular example.  Why would "count=14"
be an interesting condition for deleting groups?

> Seq Scan on dns_expired_domains  (cost=0.00..55448724329.92 rows=324754
> width=6)
>   SubPlan
>     ->  Materialize  (cost=85370.33..85370.33 rows=64951 width=4)
>           ->  Aggregate  (cost=82122.79..85370.33 rows=64951 width=4)
>                 ->  Group  (cost=82122.79..83746.56 rows=649508 width=4)
>                       ->  Sort  (cost=82122.79..82122.79 rows=649508
> width=4)
>                             ->  Seq Scan on dns_expired_domains 
> (cost=0.00..10316.08 rows=649508 width=4)

What are the *actual*, not estimated, row counts here --- ie, how many
rows in the table, and how many distinct domain_ids are you typically
deleting?
        regards, tom lane


Re: Thinking about IN/EXISTS optimization

From
Mike Benoit
Date:
On Fri, 2002-11-29 at 13:22, Tom Lane wrote:
> Mike Benoit <mikeb@netnation.com> writes:
> > I'm just curious, will your proposed in/exists optimizations help for
> > queries like:
> 
> > db=# explain delete from dns_expired_domains where domain_id in (select
> > domain_id from dns_expired_domains group by domain_id having count(*)=14
> > );
> 
> Probably, but I'm more than a tad curious about why you're concerned
> about the efficiency of this particular example.  Why would "count=14"
> be an interesting condition for deleting groups?

The count=14 isn't really that significate, basically I'm just looking
for faster execution of queries like:

(delete|select) from table where id in (select id from large_table2)

For cases where EXISTS won't work properly, and large_table2 has more
then ~50,000 rows.

> 
> > Seq Scan on dns_expired_domains  (cost=0.00..55448724329.92 rows=324754
> > width=6)
> >   SubPlan
> >     ->  Materialize  (cost=85370.33..85370.33 rows=64951 width=4)
> >           ->  Aggregate  (cost=82122.79..85370.33 rows=64951 width=4)
> >                 ->  Group  (cost=82122.79..83746.56 rows=649508 width=4)
> >                       ->  Sort  (cost=82122.79..82122.79 rows=649508
> > width=4)
> >                             ->  Seq Scan on dns_expired_domains 
> > (cost=0.00..10316.08 rows=649508 width=4)
> 
> What are the *actual*, not estimated, row counts here --- ie, how many
> rows in the table, and how many distinct domain_ids are you typically
> deleting?

650,000 actual rows in the table. 40,000 or so are returned by the
subquery. About 500,000 rows should end up being deleted. 

> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
> 
> http://www.postgresql.org/users-lounge/docs/faq.html
-- 
Best Regards,
Mike Benoit
NetNation Communication Inc.
Systems Engineer
Tel: 604-684-6892 or 888-983-6600---------------------------------------Disclaimer: Opinions expressed here are my own
andnot necessarily those of my employer