Thread: Subselect performance

Subselect performance

From
Daniel Lopez
Date:
Hi,

I am having the following problem with a subselect query. Basically if I do
(the following is some kind of pseudocde)

a)
$list = select d from c
select b from a where b in ( $list )

is  5 seconds

If I try:
b)
select b from a where b in (select d from c) 
is 3 minutes!!  (although it shouldbe at least as fast as a)!)


How can I improve b) so it takes 5 seconds?

(I attach the queries I am making)

a)
( select d from c)

select distinct product_id from purchase where customer_id=17

Unique  (cost=4799.93 rows=114777 width=4) ->  Sort  (cost=4799.93 rows=114777 width=4)         ->  Seq Scan on
purchase (cost=4799.93 rows=114777 width=4)    
 
(the preivous gives me some values, which are put in a string, then the query
is constructed:
select distinct product_id, name, date,
application, description from product where product_id in
( 1, 3 , 8 , 9 ... 47)
  
Unique  (cost=43.05 rows=41 width=45) ->  Sort  (cost=43.05 rows=41 width=45)         ->  Index Scan using product_idx,
product_idx,product_idx,
 
product_idx, product_idx, product_idx, product_idx, product_idx,
product_idx, product_idx, product_idx, product_idx, product_idx,
product_idx, product_idx, product_idx, product_idx, product_idx,
product_idx, product_idx, product_idx on product  (cost=43.05 rows=41
width=45)


b) All in one (much slower)

select  distinct product_id, name, date,
application, description from product where product_id in
(select distinct product_id from purchase where customer_id
=17)

NOTICE:  QUERY PLAN:

Seq Scan on product  (cost=66.38 rows=648 width=45) SubPlan     ->  Unique  (cost=4799.93 rows=114777 width=4)
    ->  Sort  (cost=4799.93 rows=114777 width=4)                    ->  Seq Scan on purchase  (cost=4799.93
 
rows=114777 width=4)B



Re: [SQL] Subselect performance

From
Tom Lane
Date:
Daniel Lopez <ridruejo@atm9.com.dtu.dk> writes:
> $list = select d from c
> select b from a where b in ( $list )
> is  5 seconds

> select b from a where b in (select d from c) 
> is 3 minutes!!  (although it should be at least as fast as a)!

Not necessarily.  Your first example is depending on the fact that
the "list" (number of values selected from c) is short.  Try it with
10000 or 100000 values from c, if you want to see the backend crash ;-)

The second case works OK even if the sub-select result is large, because
it re-executes the sub-select for each row from a (essentially,
rescanning c to see if b is matched).  However, this means the runtime
is proportional to the product of the number of rows in a and c.  Ugh.

Try rewriting your query as a join:
select a.b from a, c where a.b = c.d

(you might want "select distinct" here, if b can match many rows from d).
If the system can't figure out anything better than a nested-loop join,
then it'll probably end up taking about the same amount of time, but
this form at least allows the possibility of a smarter join method.

I believe we have a TODO list item to perform this sort of
transformation automatically when the sub-select is of a form that
allows it.  We need to get the left/outer join stuff working first
in order to have an exact match of the behavior.
        regards, tom lane


Re: [SQL] Subselect performance

From
Stuart Rison
Date:
On Tue, 21 Sep 1999, Tom Lane wrote:

> Daniel Lopez <ridruejo@atm9.com.dtu.dk> writes:
> > $list = select d from c
> > select b from a where b in ( $list )
> > is  5 seconds
> 
> > select b from a where b in (select d from c) 
> > is 3 minutes!!  (although it should be at least as fast as a)!
> 
> Not necessarily.  Your first example is depending on the fact that
> the "list" (number of values selected from c) is short.  Try it with
> 10000 or 100000 values from c, if you want to see the backend crash ;-)

I've encoutered this sort of issue myself where I just wanted the
sub-select to be performed once.  Granted it would not work if you wanted
to select 10000 or 100000 but what if you have a very larged table a and a
very small table c (using the example above).  As you pointed out,
currently you're looking at 'a x c' runtime... Ugh indeed; whereas just
executing the subselect once and cut and pasting that you have an order of
'a' runtime...

So would it be possible to somehow have a switch, option, function,
something that might tell then backend to execute the sub-select only
once?  I know that for concurrent access databases that might mean a
dangerous loss of integrity (because the data in table c may change
between each execution of the subselect-- yes? no?) but with that caveat
in mind it would be a very useful switch!!  

cheers,

S.

Stuart C. G. Rison
Department of Biochemistry and Molecular Biology
6th floor, Darwin Building, University College London (UCL)
Gower Street, London, WC1E 6BT, United Kingdom
Tel. 0207 504 2303, Fax. 0207 380 7033
e-mail: rison@biochem.ucl.ac.uk





Re: [SQL] Subselect performance

From
Bruce Momjian
Date:
> On Tue, 21 Sep 1999, Tom Lane wrote:
> 
> > Daniel Lopez <ridruejo@atm9.com.dtu.dk> writes:
> > > $list = select d from c
> > > select b from a where b in ( $list )
> > > is  5 seconds
> > 
> > > select b from a where b in (select d from c) 
> > > is 3 minutes!!  (although it should be at least as fast as a)!
> > 
> > Not necessarily.  Your first example is depending on the fact that
> > the "list" (number of values selected from c) is short.  Try it with
> > 10000 or 100000 values from c, if you want to see the backend crash ;-)
> 
> I've encoutered this sort of issue myself where I just wanted the
> sub-select to be performed once.  Granted it would not work if you wanted
> to select 10000 or 100000 but what if you have a very larged table a and a
> very small table c (using the example above).  As you pointed out,
> currently you're looking at 'a x c' runtime... Ugh indeed; whereas just
> executing the subselect once and cut and pasting that you have an order of
> 'a' runtime...

OK, I am jumping in here, because it seems we have some strange
behavour.

The only subselect problem I know of is that:
select b from a where b in (select d from c) 

will execute the subquery just once, but will do a sequential scan for
of the subquery results for each row of 'a' looking for 'b' that is in
the set of result rows.

This is a major performance problem, one that is known, and one that
should be fixed, but I am sounding like a broken record.

The solution is to allow the subquery results to be mergejoined(sorted),
or hashjoined with the outer query.

Am I correct, or is something else going on here?

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [SQL] Subselect performance

From
Stuart Rison
Date:
On Tue, 21 Sep 1999, Bruce Momjian wrote:
> OK, I am jumping in here, because it seems we have some strange
> behavour.
> 
> The only subselect problem I know of is that:
> 
>     select b from a where b in (select d from c) 
> 
> will execute the subquery just once, but will do a sequential scan for
> of the subquery results for each row of 'a' looking for 'b' that is in
> the set of result rows.

Oh, OK, that's very possible.  I was always under the impression (very
possibly missguided) that the reason it took a long time to do a "in
(select...)" was that the sub-select was actually executed for every row
in 'a' so that you ended up doing:

1x sequential scan of a
ax select on c

whereas if you did the sub-select ide[endently and cut-and-pasted the
obtained set into the "in (...)" you were in point of fact just doing:

1X sequential scan of a (each of them with loads of OR statements).

therefore saving "ax select" time.

Bruce, I appologise if I've completely missunderstood what's going on and
that your e-mail was all about correcting me.  I don't have a good grasp
of seq-scan vs. (nested-)joins vs. hash joins vs. mergejoins etc.
(although any pointers on where to get a crash course in these would be
greatly appreciated).  
> This is a major performance problem, one that is known, and one that
> should be fixed, but I am sounding like a broken record.

yeah, again appologise if this has been discussed to death in the past and
I missed it all (or it went over my head ;) )

> The solution is to allow the subquery results to be mergejoined(sorted),
> or hashjoined with the outer query.

erm...

> Am I correct, or is something else going on here?

most probably correct... :)

regards,

S.





Re: [SQL] Subselect performance

From
Bruce Momjian
Date:
> Oh, OK, that's very possible.  I was always under the impression (very
> possibly missguided) that the reason it took a long time to do a "in
> (select...)" was that the sub-select was actually executed for every row
> in 'a' so that you ended up doing:
> 
> 1x sequential scan of a
> ax select on c
> 
> whereas if you did the sub-select ide[endently and cut-and-pasted the
> obtained set into the "in (...)" you were in point of fact just doing:
> 
> 1X sequential scan of a (each of them with loads of OR statements).
> 
> therefore saving "ax select" time.
> 
> Bruce, I appologise if I've completely missunderstood what's going on and
> that your e-mail was all about correcting me.  I don't have a good grasp
> of seq-scan vs. (nested-)joins vs. hash joins vs. mergejoins etc.
> (although any pointers on where to get a crash course in these would be
> greatly appreciated).  

See src/backend/optimizer/README for more info on join types.

>  
> > This is a major performance problem, one that is known, and one that
> > should be fixed, but I am sounding like a broken record.
> 
> yeah, again appologise if this has been discussed to death in the past and
> I missed it all (or it went over my head ;) )
> 
> > The solution is to allow the subquery results to be mergejoined(sorted),
> > or hashjoined with the outer query.
> 

It is not your fault.  We frequently get reports of this type, and the
behavior of the subquery is very non-intuitive.  You would think that a
subquery and a join would have the same performance, but because of the
limitation of subqueries as being nested loop joined, this is not the
case, and subqueries are slower.  We tell people to rewrite their query
as EXISTS, but by the time we tell them that, they have already spent
much time trying to figure out why the query is so slow, and I am sure
many people don't even know about the EXISTS workaround.

I hate to have a feature that is not 100% optimized, but this is what we
have with subqueries at this time.

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [SQL] Subselect performance

From
Daniel Lopez
Date:
> It is not your fault.  We frequently get reports of this type, and the
> behavior of the subquery is very non-intuitive.  You would think that a
> subquery and a join would have the same performance, but because of the
> limitation of subqueries as being nested loop joined, this is not the
> case, and subqueries are slower.  We tell people to rewrite their query
> as EXISTS, but by the time we tell them that, they have already spent
> much time trying to figure out why the query is so slow, and I am sure
> many people don't even know about the EXISTS workaround.

You are right: I spend some time scratching my head, then some time
searching the mailing lists and I finally made the query with a EXISTS,
which works great for  me :) Thanks
Can this be a candidate to include in the FAQ? 

On the same idea, is there any good document out there with all the SQL
"recipes" or common practice for things like : "Give me all the rows which
have this value in this column more than once, etc"
I do it with: 
select my_index, count(my_index) from my_table  group by my_index having
count(my_index) > 1;

But this is a common query and would be interested in knowing which is the
commonly accepted way of doing this

Regards

Daniel


Re: [SQL] Subselect performance

From
Bruce Momjian
Date:
> 
> > It is not your fault.  We frequently get reports of this type, and the
> > behavior of the subquery is very non-intuitive.  You would think that a
> > subquery and a join would have the same performance, but because of the
> > limitation of subqueries as being nested loop joined, this is not the
> > case, and subqueries are slower.  We tell people to rewrite their query
> > as EXISTS, but by the time we tell them that, they have already spent
> > much time trying to figure out why the query is so slow, and I am sure
> > many people don't even know about the EXISTS workaround.
> 
> You are right: I spend some time scratching my head, then some time
> searching the mailing lists and I finally made the query with a EXISTS,
> which works great for  me :) Thanks
> Can this be a candidate to include in the FAQ? 
> 
> On the same idea, is there any good document out there with all the SQL
> "recipes" or common practice for things like : "Give me all the rows which
> have this value in this column more than once, etc"
> I do it with: 
> select my_index, count(my_index) from my_table  group by my_index having
> count(my_index) > 1;
> 
> But this is a common query and would be interested in knowing which is the
> commonly accepted way of doing this

New FAQ:

4.23) Why are my subqueries using IN so slow?

Currently, we join subqueries to outer queries by sequential scanning
the result of the subquery for each row of the outer query. A workaround
is to replace IN with EXISTS. For example, change: 

       SELECT *       FROM tab       WHERE col1 IN (SELECT col2 FROM TAB2)

to: 

       SELECT *       FROM tab       WHERE EXISTS (SELECT col2 FROM TAB2 WHERE col1 = col2)

We hope to fix this limitation in a future release. 


--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


[SQL] How about a postgreSQL cookbook? (was [SQL] Subselect performance)

From
Stuart Rison
Date:
Dear All,

I think Daniel's idea is brilliant.  Why don't we set up some sort of
'postgreSQL cookbook' where everybody puts in their "tricks of the trade"
(for the moment I'm thinking mostly SQL related)?

A first pass of the FAQ should generate quite a few of these; like the
"SELECT my_index, count(my_index) from my_table  group by my_index having 
count(my_index) > 1" example below or Bruce's new FAQ item re: getting a
more efficient IN subquery equivalent.

Perhaps we could even have a (moderated?) newsgroup where we could cc:
cookbook like entries that appear in the other pgsql newsgroups.

Just ideas, for the moment I saddly have no time to make a start on this
(paper deadline looming ominously close) but I'm holding this thought!

Any comments?

Cheers,

Stuart. 

On Tue, 28 Sep 1999, Bruce Momjian wrote:

> > 
> > > as EXISTS, but by the time we tell them that, they have already spent
> > > much time trying to figure out why the query is so slow, and I am sure
> > > many people don't even know about the EXISTS workaround.
> > 
> > You are right: I spend some time scratching my head, then some time
> > searching the mailing lists and I finally made the query with a EXISTS,
> > which works great for  me :) Thanks
> > Can this be a candidate to include in the FAQ? 
> > 
> > On the same idea, is there any good document out there with all the SQL
> > "recipes" or common practice for things like : "Give me all the rows which
> > have this value in this column more than once, etc"
> > I do it with: 
> > select my_index, count(my_index) from my_table  group by my_index having
> > count(my_index) > 1;
> > 
> New FAQ:
> 
> 4.23) Why are my subqueries using IN so slow?
> 

<snip - an entry which could also be put into a cookbook>


Stuart C. G. Rison
Department of Biochemistry and Molecular Biology
6th floor, Darwin Building, University College London (UCL)
Gower Street, London, WC1E 6BT, United Kingdom
Tel. 0207 504 2303, Fax. 0207 380 7033
e-mail: rison@biochem.ucl.ac.uk




Re: [SQL] How about a postgreSQL cookbook? (was [SQL] Subselect performance)

From
Rich Shepard
Date:
On Wed, 29 Sep 1999, Stuart Rison wrote:

> I think Daniel's idea is brilliant.  Why don't we set up some sort of
> 'postgreSQL cookbook' where everybody puts in their "tricks of the trade"
> (for the moment I'm thinking mostly SQL related)?
 I like this idea. Every implementation of SQL has its individual
personality, and postgres is no exception. A solutions-oriented list (or
cookbook, if you want) will help everyone. It can be organized by
categories/types of queries (e.g., single table, multiple table; finding
missing values, finding duplicate values) and may be better hosted on the
web site as a table or dynamic (but moderated) list.
 Please add my vote.

Rich

Dr. Richard B. Shepard, President
                      Applied Ecosystem Services, Inc. (TM)             Making environmentally-responsible mining
happen.(SM)                               --------------------------------           2404 SW 22nd Street | Troutdale,
OR97060-1247 | U.S.A.+ 1 503-667-4517 (voice) | + 1 503-667-8863 (fax) | rshepard@appl-ecosys.com
 



Re: [SQL] How about a postgreSQL cookbook? (was [SQL] Subselect performance)

From
Clayton Cottingham
Date:
Stuart Rison wrote:

> Dear All,
>
> I think Daniel's idea is brilliant.  Why don't we set up some sort of
> 'postgreSQL cookbook' where everybody puts in their "tricks of the trade"
> (for the moment I'm thinking mostly SQL related)?
>
> A first pass of the FAQ should generate quite a few of these; like the
> "SELECT my_index, count(my_index) from my_table  group by my_index having
> count(my_index) > 1" example below or Bruce's new FAQ item re: getting a
> more efficient IN subquery equivalent.
>
> Perhaps we could even have a (moderated?) newsgroup where we could cc:
> cookbook like entries that appear in the other pgsql newsgroups.
>
> Just ideas, for the moment I saddly have no time to make a start on this
> (paper deadline looming ominously close) but I'm holding this thought!
>
> Any comments?
>
> Cheers,
>
> Stuart.
>
> On Tue, 28 Sep 1999, Bruce Momjian wrote:
>
> > >
> > > > as EXISTS, but by the time we tell them that, they have already spent
> > > > much time trying to figure out why the query is so slow, and I am sure
> > > > many people don't even know about the EXISTS workaround.
> > >
> > > You are right: I spend some time scratching my head, then some time
> > > searching the mailing lists and I finally made the query with a EXISTS,
> > > which works great for  me :) Thanks
> > > Can this be a candidate to include in the FAQ?
> > >
> > > On the same idea, is there any good document out there with all the SQL
> > > "recipes" or common practice for things like : "Give me all the rows which
> > > have this value in this column more than once, etc"
> > > I do it with:
> > > select my_index, count(my_index) from my_table  group by my_index having
> > > count(my_index) > 1;
> > >
> > New FAQ:
> >
> > 4.23) Why are my subqueries using IN so slow?
> >
>
> <snip - an entry which could also be put into a cookbook>
>
> Stuart C. G. Rison
> Department of Biochemistry and Molecular Biology
> 6th floor, Darwin Building, University College London (UCL)
> Gower Street, London, WC1E 6BT, United Kingdom
> Tel. 0207 504 2303, Fax. 0207 380 7033
> e-mail: rison@biochem.ucl.ac.uk
>
> ************

this is a great idea!

ive got the irc.stampede.org #dbms we could discuss this on

or right here!