Thread: Subselect performance
Hi, I am having the following problem with a subselect query. Basically if I do (the following is some kind of pseudocde) a) $list = select d from c select b from a where b in ( $list ) is 5 seconds If I try: b) select b from a where b in (select d from c) is 3 minutes!! (although it shouldbe at least as fast as a)!) How can I improve b) so it takes 5 seconds? (I attach the queries I am making) a) ( select d from c) select distinct product_id from purchase where customer_id=17 Unique (cost=4799.93 rows=114777 width=4) -> Sort (cost=4799.93 rows=114777 width=4) -> Seq Scan on purchase (cost=4799.93 rows=114777 width=4) (the preivous gives me some values, which are put in a string, then the query is constructed: select distinct product_id, name, date, application, description from product where product_id in ( 1, 3 , 8 , 9 ... 47) Unique (cost=43.05 rows=41 width=45) -> Sort (cost=43.05 rows=41 width=45) -> Index Scan using product_idx, product_idx,product_idx, product_idx, product_idx, product_idx, product_idx, product_idx, product_idx, product_idx, product_idx, product_idx, product_idx, product_idx, product_idx, product_idx, product_idx, product_idx, product_idx, product_idx, product_idx on product (cost=43.05 rows=41 width=45) b) All in one (much slower) select distinct product_id, name, date, application, description from product where product_id in (select distinct product_id from purchase where customer_id =17) NOTICE: QUERY PLAN: Seq Scan on product (cost=66.38 rows=648 width=45) SubPlan -> Unique (cost=4799.93 rows=114777 width=4) -> Sort (cost=4799.93 rows=114777 width=4) -> Seq Scan on purchase (cost=4799.93 rows=114777 width=4)B
Daniel Lopez <ridruejo@atm9.com.dtu.dk> writes: > $list = select d from c > select b from a where b in ( $list ) > is 5 seconds > select b from a where b in (select d from c) > is 3 minutes!! (although it should be at least as fast as a)! Not necessarily. Your first example is depending on the fact that the "list" (number of values selected from c) is short. Try it with 10000 or 100000 values from c, if you want to see the backend crash ;-) The second case works OK even if the sub-select result is large, because it re-executes the sub-select for each row from a (essentially, rescanning c to see if b is matched). However, this means the runtime is proportional to the product of the number of rows in a and c. Ugh. Try rewriting your query as a join: select a.b from a, c where a.b = c.d (you might want "select distinct" here, if b can match many rows from d). If the system can't figure out anything better than a nested-loop join, then it'll probably end up taking about the same amount of time, but this form at least allows the possibility of a smarter join method. I believe we have a TODO list item to perform this sort of transformation automatically when the sub-select is of a form that allows it. We need to get the left/outer join stuff working first in order to have an exact match of the behavior. regards, tom lane
On Tue, 21 Sep 1999, Tom Lane wrote: > Daniel Lopez <ridruejo@atm9.com.dtu.dk> writes: > > $list = select d from c > > select b from a where b in ( $list ) > > is 5 seconds > > > select b from a where b in (select d from c) > > is 3 minutes!! (although it should be at least as fast as a)! > > Not necessarily. Your first example is depending on the fact that > the "list" (number of values selected from c) is short. Try it with > 10000 or 100000 values from c, if you want to see the backend crash ;-) I've encoutered this sort of issue myself where I just wanted the sub-select to be performed once. Granted it would not work if you wanted to select 10000 or 100000 but what if you have a very larged table a and a very small table c (using the example above). As you pointed out, currently you're looking at 'a x c' runtime... Ugh indeed; whereas just executing the subselect once and cut and pasting that you have an order of 'a' runtime... So would it be possible to somehow have a switch, option, function, something that might tell then backend to execute the sub-select only once? I know that for concurrent access databases that might mean a dangerous loss of integrity (because the data in table c may change between each execution of the subselect-- yes? no?) but with that caveat in mind it would be a very useful switch!! cheers, S. Stuart C. G. Rison Department of Biochemistry and Molecular Biology 6th floor, Darwin Building, University College London (UCL) Gower Street, London, WC1E 6BT, United Kingdom Tel. 0207 504 2303, Fax. 0207 380 7033 e-mail: rison@biochem.ucl.ac.uk
> On Tue, 21 Sep 1999, Tom Lane wrote: > > > Daniel Lopez <ridruejo@atm9.com.dtu.dk> writes: > > > $list = select d from c > > > select b from a where b in ( $list ) > > > is 5 seconds > > > > > select b from a where b in (select d from c) > > > is 3 minutes!! (although it should be at least as fast as a)! > > > > Not necessarily. Your first example is depending on the fact that > > the "list" (number of values selected from c) is short. Try it with > > 10000 or 100000 values from c, if you want to see the backend crash ;-) > > I've encoutered this sort of issue myself where I just wanted the > sub-select to be performed once. Granted it would not work if you wanted > to select 10000 or 100000 but what if you have a very larged table a and a > very small table c (using the example above). As you pointed out, > currently you're looking at 'a x c' runtime... Ugh indeed; whereas just > executing the subselect once and cut and pasting that you have an order of > 'a' runtime... OK, I am jumping in here, because it seems we have some strange behavour. The only subselect problem I know of is that: select b from a where b in (select d from c) will execute the subquery just once, but will do a sequential scan for of the subquery results for each row of 'a' looking for 'b' that is in the set of result rows. This is a major performance problem, one that is known, and one that should be fixed, but I am sounding like a broken record. The solution is to allow the subquery results to be mergejoined(sorted), or hashjoined with the outer query. Am I correct, or is something else going on here? -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
On Tue, 21 Sep 1999, Bruce Momjian wrote: > OK, I am jumping in here, because it seems we have some strange > behavour. > > The only subselect problem I know of is that: > > select b from a where b in (select d from c) > > will execute the subquery just once, but will do a sequential scan for > of the subquery results for each row of 'a' looking for 'b' that is in > the set of result rows. Oh, OK, that's very possible. I was always under the impression (very possibly missguided) that the reason it took a long time to do a "in (select...)" was that the sub-select was actually executed for every row in 'a' so that you ended up doing: 1x sequential scan of a ax select on c whereas if you did the sub-select ide[endently and cut-and-pasted the obtained set into the "in (...)" you were in point of fact just doing: 1X sequential scan of a (each of them with loads of OR statements). therefore saving "ax select" time. Bruce, I appologise if I've completely missunderstood what's going on and that your e-mail was all about correcting me. I don't have a good grasp of seq-scan vs. (nested-)joins vs. hash joins vs. mergejoins etc. (although any pointers on where to get a crash course in these would be greatly appreciated). > This is a major performance problem, one that is known, and one that > should be fixed, but I am sounding like a broken record. yeah, again appologise if this has been discussed to death in the past and I missed it all (or it went over my head ;) ) > The solution is to allow the subquery results to be mergejoined(sorted), > or hashjoined with the outer query. erm... > Am I correct, or is something else going on here? most probably correct... :) regards, S.
> Oh, OK, that's very possible. I was always under the impression (very > possibly missguided) that the reason it took a long time to do a "in > (select...)" was that the sub-select was actually executed for every row > in 'a' so that you ended up doing: > > 1x sequential scan of a > ax select on c > > whereas if you did the sub-select ide[endently and cut-and-pasted the > obtained set into the "in (...)" you were in point of fact just doing: > > 1X sequential scan of a (each of them with loads of OR statements). > > therefore saving "ax select" time. > > Bruce, I appologise if I've completely missunderstood what's going on and > that your e-mail was all about correcting me. I don't have a good grasp > of seq-scan vs. (nested-)joins vs. hash joins vs. mergejoins etc. > (although any pointers on where to get a crash course in these would be > greatly appreciated). See src/backend/optimizer/README for more info on join types. > > > This is a major performance problem, one that is known, and one that > > should be fixed, but I am sounding like a broken record. > > yeah, again appologise if this has been discussed to death in the past and > I missed it all (or it went over my head ;) ) > > > The solution is to allow the subquery results to be mergejoined(sorted), > > or hashjoined with the outer query. > It is not your fault. We frequently get reports of this type, and the behavior of the subquery is very non-intuitive. You would think that a subquery and a join would have the same performance, but because of the limitation of subqueries as being nested loop joined, this is not the case, and subqueries are slower. We tell people to rewrite their query as EXISTS, but by the time we tell them that, they have already spent much time trying to figure out why the query is so slow, and I am sure many people don't even know about the EXISTS workaround. I hate to have a feature that is not 100% optimized, but this is what we have with subqueries at this time. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> It is not your fault. We frequently get reports of this type, and the > behavior of the subquery is very non-intuitive. You would think that a > subquery and a join would have the same performance, but because of the > limitation of subqueries as being nested loop joined, this is not the > case, and subqueries are slower. We tell people to rewrite their query > as EXISTS, but by the time we tell them that, they have already spent > much time trying to figure out why the query is so slow, and I am sure > many people don't even know about the EXISTS workaround. You are right: I spend some time scratching my head, then some time searching the mailing lists and I finally made the query with a EXISTS, which works great for me :) Thanks Can this be a candidate to include in the FAQ? On the same idea, is there any good document out there with all the SQL "recipes" or common practice for things like : "Give me all the rows which have this value in this column more than once, etc" I do it with: select my_index, count(my_index) from my_table group by my_index having count(my_index) > 1; But this is a common query and would be interested in knowing which is the commonly accepted way of doing this Regards Daniel
> > > It is not your fault. We frequently get reports of this type, and the > > behavior of the subquery is very non-intuitive. You would think that a > > subquery and a join would have the same performance, but because of the > > limitation of subqueries as being nested loop joined, this is not the > > case, and subqueries are slower. We tell people to rewrite their query > > as EXISTS, but by the time we tell them that, they have already spent > > much time trying to figure out why the query is so slow, and I am sure > > many people don't even know about the EXISTS workaround. > > You are right: I spend some time scratching my head, then some time > searching the mailing lists and I finally made the query with a EXISTS, > which works great for me :) Thanks > Can this be a candidate to include in the FAQ? > > On the same idea, is there any good document out there with all the SQL > "recipes" or common practice for things like : "Give me all the rows which > have this value in this column more than once, etc" > I do it with: > select my_index, count(my_index) from my_table group by my_index having > count(my_index) > 1; > > But this is a common query and would be interested in knowing which is the > commonly accepted way of doing this New FAQ: 4.23) Why are my subqueries using IN so slow? Currently, we join subqueries to outer queries by sequential scanning the result of the subquery for each row of the outer query. A workaround is to replace IN with EXISTS. For example, change: SELECT * FROM tab WHERE col1 IN (SELECT col2 FROM TAB2) to: SELECT * FROM tab WHERE EXISTS (SELECT col2 FROM TAB2 WHERE col1 = col2) We hope to fix this limitation in a future release. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Dear All, I think Daniel's idea is brilliant. Why don't we set up some sort of 'postgreSQL cookbook' where everybody puts in their "tricks of the trade" (for the moment I'm thinking mostly SQL related)? A first pass of the FAQ should generate quite a few of these; like the "SELECT my_index, count(my_index) from my_table group by my_index having count(my_index) > 1" example below or Bruce's new FAQ item re: getting a more efficient IN subquery equivalent. Perhaps we could even have a (moderated?) newsgroup where we could cc: cookbook like entries that appear in the other pgsql newsgroups. Just ideas, for the moment I saddly have no time to make a start on this (paper deadline looming ominously close) but I'm holding this thought! Any comments? Cheers, Stuart. On Tue, 28 Sep 1999, Bruce Momjian wrote: > > > > > as EXISTS, but by the time we tell them that, they have already spent > > > much time trying to figure out why the query is so slow, and I am sure > > > many people don't even know about the EXISTS workaround. > > > > You are right: I spend some time scratching my head, then some time > > searching the mailing lists and I finally made the query with a EXISTS, > > which works great for me :) Thanks > > Can this be a candidate to include in the FAQ? > > > > On the same idea, is there any good document out there with all the SQL > > "recipes" or common practice for things like : "Give me all the rows which > > have this value in this column more than once, etc" > > I do it with: > > select my_index, count(my_index) from my_table group by my_index having > > count(my_index) > 1; > > > New FAQ: > > 4.23) Why are my subqueries using IN so slow? > <snip - an entry which could also be put into a cookbook> Stuart C. G. Rison Department of Biochemistry and Molecular Biology 6th floor, Darwin Building, University College London (UCL) Gower Street, London, WC1E 6BT, United Kingdom Tel. 0207 504 2303, Fax. 0207 380 7033 e-mail: rison@biochem.ucl.ac.uk
Re: [SQL] How about a postgreSQL cookbook? (was [SQL] Subselect performance)
From
Rich Shepard
Date:
On Wed, 29 Sep 1999, Stuart Rison wrote: > I think Daniel's idea is brilliant. Why don't we set up some sort of > 'postgreSQL cookbook' where everybody puts in their "tricks of the trade" > (for the moment I'm thinking mostly SQL related)? I like this idea. Every implementation of SQL has its individual personality, and postgres is no exception. A solutions-oriented list (or cookbook, if you want) will help everyone. It can be organized by categories/types of queries (e.g., single table, multiple table; finding missing values, finding duplicate values) and may be better hosted on the web site as a table or dynamic (but moderated) list. Please add my vote. Rich Dr. Richard B. Shepard, President Applied Ecosystem Services, Inc. (TM) Making environmentally-responsible mining happen.(SM) -------------------------------- 2404 SW 22nd Street | Troutdale, OR97060-1247 | U.S.A.+ 1 503-667-4517 (voice) | + 1 503-667-8863 (fax) | rshepard@appl-ecosys.com
Re: [SQL] How about a postgreSQL cookbook? (was [SQL] Subselect performance)
From
Clayton Cottingham
Date:
Stuart Rison wrote: > Dear All, > > I think Daniel's idea is brilliant. Why don't we set up some sort of > 'postgreSQL cookbook' where everybody puts in their "tricks of the trade" > (for the moment I'm thinking mostly SQL related)? > > A first pass of the FAQ should generate quite a few of these; like the > "SELECT my_index, count(my_index) from my_table group by my_index having > count(my_index) > 1" example below or Bruce's new FAQ item re: getting a > more efficient IN subquery equivalent. > > Perhaps we could even have a (moderated?) newsgroup where we could cc: > cookbook like entries that appear in the other pgsql newsgroups. > > Just ideas, for the moment I saddly have no time to make a start on this > (paper deadline looming ominously close) but I'm holding this thought! > > Any comments? > > Cheers, > > Stuart. > > On Tue, 28 Sep 1999, Bruce Momjian wrote: > > > > > > > > as EXISTS, but by the time we tell them that, they have already spent > > > > much time trying to figure out why the query is so slow, and I am sure > > > > many people don't even know about the EXISTS workaround. > > > > > > You are right: I spend some time scratching my head, then some time > > > searching the mailing lists and I finally made the query with a EXISTS, > > > which works great for me :) Thanks > > > Can this be a candidate to include in the FAQ? > > > > > > On the same idea, is there any good document out there with all the SQL > > > "recipes" or common practice for things like : "Give me all the rows which > > > have this value in this column more than once, etc" > > > I do it with: > > > select my_index, count(my_index) from my_table group by my_index having > > > count(my_index) > 1; > > > > > New FAQ: > > > > 4.23) Why are my subqueries using IN so slow? > > > > <snip - an entry which could also be put into a cookbook> > > Stuart C. G. Rison > Department of Biochemistry and Molecular Biology > 6th floor, Darwin Building, University College London (UCL) > Gower Street, London, WC1E 6BT, United Kingdom > Tel. 0207 504 2303, Fax. 0207 380 7033 > e-mail: rison@biochem.ucl.ac.uk > > ************ this is a great idea! ive got the irc.stampede.org #dbms we could discuss this on or right here!