Thread: Highly obscure and erratic
hi, I cannot get what causes the difference in the execution of these 2 commands 1. SELECT * FROM dbmedia WHERE ID IN ((SELECT id FROM wdmedia WHERE word = 'whatever') ) LIMIT 20; 2. SELECT * FROM dbmedia WHERE ID IN (1234,2345,1242,1256,1245,1567,2222,22,345,234,567,456,35,56) ; where ID is the primary key. First let me tell that i have done all vacuum's ..analyze,full et all. The explain output should make it clear that for the former a "sequential" scan is done and for the later an "index" scan is done.Also the time take indicates that a sequential scan(atleast not index scan) is done on the former. ------------------------------------------------------------------------- explain SELECT * FROM dbmedia WHERE ID IN ((SELECT id FROM wdmedia WHERE word = 'whatever') ) LIMIT 20; NOTICE: QUERY PLAN: Limit (cost=0.00..100544.53 rows=20 width=76) -> Seq Scan on dbmedia (cost=0.00..507161673.46 rows=100883 width=76) ^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ GAWD!! SubPlan -> Materialize (cost=5027.19..5027.19 rows=2575 width=4) -> Index Scan using wdkmedia on wdmedia (cost=0.00..5027.19 rows=2575 width=4) ---------------------------------------------------------------------------- explain SELECT * FROM dbmedia WHERE ID IN (1234,2345,1242,1256,1245,1567,2222,22,345,234,567,456,35,56) ; NOTICE: QUERY PLAN: Index Scan using dbmedia_pkey, dbmedia_pkey, dbmedia_pkey, dbmedia_pkey, dbmedia_pkey, dbmedia_pkey, dbmedia_pkey, dbmedia_pkey,dbmedia_pkey, dbmedia_pkey, dbmedia_pkey, dbmedia_pkey, dbmedia_pkey, dbmedia_pkey on dbmedia (cost=0.00..59.40rows=1 width=76) -------------------------------------------------------------------------- observe the index scan.. OK , for those ppl who might be feeling that the o/p of the nested query in the former case might be a significant portion (and so the db does a sq scan) let me say that the result set of sub-query (SELECT id FROM wdmedia WHERE word = 'whatever') is of length 1,000 while thetable dbmedia is of length 100,000 and if a db selects to do a seq scan due to this then ... i can say no more. postgresql developers/gurus please help. -- ------ Varun Printer not ready. Do you have a pen? ----- End forwarded message ----- -- ------ Varun Printer not ready. Do you have a pen?
On Wed, 19 Jun 2002, Varun Kacholia wrote: > hi, > I cannot get what causes the difference in the execution of these 2 commands > 1. SELECT * FROM dbmedia WHERE ID IN ((SELECT id FROM wdmedia WHERE word = 'whatever') ) LIMIT 20; > 2. SELECT * FROM dbmedia WHERE ID IN (1234,2345,1242,1256,1245,1567,2222,22,345,234,567,456,35,56) ; The former is not very well optimized (see the FAQ for suggestions on using EXISTS instead).
> > hi, > > I cannot get what causes the difference in the execution of these 2 commands > > 1. SELECT * FROM dbmedia WHERE ID IN ((SELECT id FROM wdmedia WHERE word = 'whatever') ) LIMIT 20; > > 2. SELECT * FROM dbmedia WHERE ID IN (1234,2345,1242,1256,1245,1567,2222,22,345,234,567,456,35,56) ; > > The former is not very well optimized (see the FAQ for suggestions on > using EXISTS instead). hi , Had a look at that.. but still it is toooo slow :( when is it intended to be fixed up?? <excerpt below> Currently we join subqueries by sequentlialy scanning the result of the subquery for each row of the outer query.A workaround is to replace IN with EXISTS <snip> We hope to fix this in future release. <end excerpt> guys i am waiting for that release ;-). -- ------ Varun Once I finally figured out all of life's answers.....they changed the questions.
On Wed, 19 Jun 2002, Varun Kacholia wrote: > > > > > hi, > > > I cannot get what causes the difference in the execution of these 2 commands > > > 1. SELECT * FROM dbmedia WHERE ID IN ((SELECT id FROM wdmedia WHERE word = 'whatever') ) LIMIT 20; > > > 2. SELECT * FROM dbmedia WHERE ID IN (1234,2345,1242,1256,1245,1567,2222,22,345,234,567,456,35,56) ; > > > > The former is not very well optimized (see the FAQ for suggestions on > > using EXISTS instead). > hi , > Had a look at that.. > but still it is toooo slow :( > when is it intended to be fixed up?? What does explain show for the exists version?
> What does explain show for the exists version? suryadb=# explain select * from dbmedia where EXISTS (select ID from wdmedia where word='whatever' AND dbmedia.id=id ) LIMIT200; NOTICE: QUERY PLAN: Limit (cost=0.00..1006732.42 rows=200 width=76) -> Seq Scan on dbmedia (cost=0.00..507810931.25 rows=100883 width=76) SubPlan -> Index Scan using wdkmedia on wdmedia (cost=0.00..5033.63 rows=1 width=4) EXPLAIN still seq scan :(( someone please fix this bug :( -- ------ Varun Once I finally figured out all of life's answers.....they changed the questions.
On Wed, 19 Jun 2002, Varun Kacholia wrote: > > > What does explain show for the exists version? > > suryadb=# explain select * from dbmedia where EXISTS (select ID from > wdmedia where word='whatever' AND dbmedia.id=id ) LIMIT 200; > NOTICE: QUERY PLAN: > > Limit (cost=0.00..1006732.42 rows=200 width=76) > -> Seq Scan on dbmedia (cost=0.00..507810931.25 rows=100883 width=76) > SubPlan > -> Index Scan using wdkmedia on wdmedia > (cost=0.00..5033.63 rows=1 width=4) > > EXPLAIN > > still seq scan :(( Yeah... Do either of these run better? select dbmedia.* from dbmedia, wdmedia where wdmedia.word='whatever' and dbmedia.id=wdmedia.id; select * from dbmedia, (select ID from wdmedia where word='whatever') AS w where w.id=dbmedia.id; > someone please fix this bug :( I'd guess it's not that simple.
On Wed, Jun 19, 2002 at 04:33:10AM +0530, Varun Kacholia wrote: > > > What does explain show for the exists version? > > suryadb=# explain select * from dbmedia where EXISTS (select ID from > wdmedia where word='whatever' AND dbmedia.id=id ) LIMIT 200; how does this differ from: select * from dbmedia where dbmedia.id=wdmedia.id and wdmedia.word='whatever' limit 200; > NOTICE: QUERY PLAN: > > Limit (cost=0.00..1006732.42 rows=200 width=76) > -> Seq Scan on dbmedia (cost=0.00..507810931.25 rows=100883 width=76) > SubPlan > -> Index Scan using wdkmedia on wdmedia (cost=0.00..5033.63 rows=1 width=4) > > EXPLAIN > > still seq scan :(( > someone please fix this bug :( I think it's called "pilot error". Your query asked to run the subquery for each row in the outer query, so ofcourse you get a sequential scan. If what you wanted was an index scan then you should rewrite it as a join (as above) and use that. If you can prove that your EXISTS statement is equivalent to the JOIN for all different types of subqueries, perhaps it can be made automatic. -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > There are 10 kinds of people in the world, those that can do binary > arithmetic and those that can't.
> how does this differ from: > > select * > from dbmedia > where dbmedia.id=wdmedia.id > and wdmedia.word='whatever' > limit 200; hey... thanks for that.. that was surely better... but now returning to the main point... how can i execute the following query in the form above? select * from dbmedia where id IN ((select ID from wdmedia where word='word1') INTERSECT (select ID from wdmedia where word='word2') ...so on) thanks again > I think it's called "pilot error". Your query asked to run the subquery for > each row in the outer query, so ofcourse you get a sequential scan. If what > you wanted was an index scan then you should rewrite it as a join (as above) > and use that. > > If you can prove that your EXISTS statement is equivalent to the JOIN for > all different types of subqueries, perhaps it can be made automatic. > -- > Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > > There are 10 kinds of people in the world, those that can do binary > > arithmetic and those that can't. > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly -- ------ Varun Best file compression around: "DEL *.*" = 100% compression
On Wed, Jun 19, 2002 at 02:44:08PM +0530, Varun Kacholia wrote: > but now returning to the main point... > how can i execute the following query in the form above? > > select * from dbmedia where id IN ((select ID from wdmedia where word='word1') > INTERSECT (select ID from wdmedia where word='word2') ...so on) Wow, you seem to have a knack for making queries that are hard to optimise. Do you need some kind of full-text indexing? There are premade modules that do this all for you. select * from dbmedia, wdmedia a, wdmedia b where id = a.id and a.word='word1' and id = b.id and b.word='word2' etc Seems odd but it may work. But as a rule, joins are faster than subqueries and avoid IN, INTERSECT and UNION at all costs. Do you have a book on SQL around? -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > There are 10 kinds of people in the world, those that can do binary > arithmetic and those that can't.
On Wed, 19 Jun 2002, Varun Kacholia wrote: > 1. SELECT * FROM dbmedia WHERE ID IN ((SELECT id FROM wdmedia WHERE > word = 'whatever') ) LIMIT 20; You probably didn't know this, but postgres is notorious for being *very* bad at optimizing IN queries. Try this instead: SELECT * FROM dbmedia d WHERE EXISTS (SELECT 1 FROM wdmedia w WHERE w.id = d.id AND word = 'whatever') LIMIT 20; -- +-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+ | Shaun M. Thomas INN Database Administrator | | Phone: (309) 743-0812 Fax : (309) 743-0830 | | Email: sthomas@townnews.com AIM : trifthen | | Web : www.townnews.com | | | | "Most of our lives are about proving something, either to | | ourselves or to someone else." | | -- Anonymous | +-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+
On Tue, 18 Jun 2002, Stephan Szabo wrote: > select dbmedia.* from dbmedia, wdmedia where wdmedia.word='whatever' and > dbmedia.id=wdmedia.id; I think all of us suggesting EXISTS in the case of this query were complete tards, me included. Looks like a simple join will suffice nicely, and now that I look at it, I wonder why you're the first one to see it. Good job! But that'll definitely help. Varun, if you're listening, use this! -- +-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+ | Shaun M. Thomas INN Database Administrator | | Phone: (309) 743-0812 Fax : (309) 743-0830 | | Email: sthomas@townnews.com AIM : trifthen | | Web : www.townnews.com | | | | "Most of our lives are about proving something, either to | | ourselves or to someone else." | | -- Anonymous | +-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+
On Wed, 19 Jun 2002, Martijn van Oosterhout wrote: > select * from dbmedia, wdmedia a, wdmedia b > where id = a.id and a.word='word1' > and id = b.id and b.word='word2' > etc Ewww, self join. I'd actually suggest he keep the intersect, but use the join format for each. So he'd get this: SELECT * FROM dbmedia d, wdmedia w WHERE w.word = 'word1' AND d.id=w.id INTERSECT SELECT * FROM dbmedia d, wdmedia w WHERE w.word = 'word2' AND d.id=w.id INTERSECT ... I think you get the picture from this. If he didn't need INTERSECT to emulate AND for his word matches, he could just use OR, and do this: SELECT * FROM dbmedia d, wdmedia w WHERE w.word IN ('word1', 'word2', 'etc') AND d.id=w.id; My guess is the OR version is faster, but he wants to restrict the data, not add to it. Unfortunate. ^_^ > and avoid IN, INTERSECT and UNION at all costs. Actually, if INTERSECT and UNION are done properly (which I'm guessing they are) at most, the query has to be run once for each distinct query between the INTERSECT/UNION clauses. On a fast/small query, an upper bound of doubling execution time isn't too bad. Mathematical matrix intersections and unions are pretty fast/easy, so that doesn't add much overhead at all. It's IN you have to avoid. The way postgres does it is just insane, and you'll suffer greatly upon using it. -- +-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+ | Shaun M. Thomas INN Database Administrator | | Phone: (309) 743-0812 Fax : (309) 743-0830 | | Email: sthomas@townnews.com AIM : trifthen | | Web : www.townnews.com | | | | "Most of our lives are about proving something, either to | | ourselves or to someone else." | | -- Anonymous | +-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+
hi, > SELECT * > FROM dbmedia d, wdmedia w > WHERE w.word = 'word1' AND d.id=w.id > INTERSECT > SELECT * > FROM dbmedia d, wdmedia w > WHERE w.word = 'word2' AND d.id=w.id > INTERSECT > ... well this does not help... it results in a seq scan only. suryadb=# explain (select * from dbmedia d,wdmedia w where w.word='whatever' and d.id=w.id) INTERSECT (select * from dbmediad,wdmedia w where w.word='whatever' and d.id=w.id) ; NOTICE: QUERY PLAN: SetOp Intersect (cost=25519.49..25674.00 rows=515 width=92) -> Sort (cost=25519.49..25519.49 rows=5151 width=92) -> Append (cost=5033.63..25201.94 rows=5151 width=92) -> Subquery Scan *SELECT* 1 (cost=5033.63..12600.97 rows=2575 width=92) -> Hash Join (cost=5033.63..12600.97 rows=2575 width=92) -> Seq Scan on dbmedia d (cost=0.00..3499.83 rows=100883 width=76) ^^^^^^^^^^ -> Hash (cost=5027.19..5027.19 rows=2575 width=16) -> Index Scan using wdkmedia on wdmedia w (cost=0.00..5027.19 rows=2575 width=16) -> Subquery Scan *SELECT* 2 (cost=5033.63..12600.97 rows=2575 width=92) -> Hash Join (cost=5033.63..12600.97 rows=2575 width=92) -> Seq Scan on dbmedia d (cost=0.00..3499.83 rows=100883 width=76) ^^^^^^^^^^^^^^^^^^^^^^ -> Hash (cost=5027.19..5027.19 rows=2575 width=16) -> Index Scan using wdkmedia on wdmedia w (cost=0.00..5027.19 rows=2575 width=16) instead the self-join is better...atleast in the case of postgresql :p anyways.. thanks for the reply! > I think you get the picture from this. If he didn't need INTERSECT to > emulate AND for his word matches, he could just use OR, and do this: > > SELECT * > FROM dbmedia d, wdmedia w > WHERE w.word IN ('word1', 'word2', 'etc') AND d.id=w.id; > > My guess is the OR version is faster, but he wants to restrict the data, > not add to it. Unfortunate. ^_^ > > > and avoid IN, INTERSECT and UNION at all costs. > > Actually, if INTERSECT and UNION are done properly (which I'm guessing > they are) at most, the query has to be run once for each distinct query > between the INTERSECT/UNION clauses. On a fast/small query, an upper > bound of doubling execution time isn't too bad. Mathematical matrix > intersections and unions are pretty fast/easy, so that doesn't add much > overhead at all. > > It's IN you have to avoid. The way postgres does it is just insane, and > you'll suffer greatly upon using it. > > -- > +-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+ > | Shaun M. Thomas INN Database Administrator | > | Phone: (309) 743-0812 Fax : (309) 743-0830 | > | Email: sthomas@townnews.com AIM : trifthen | > | Web : www.townnews.com | > | | > | "Most of our lives are about proving something, either to | > | ourselves or to someone else." | > | -- Anonymous | > +-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+ > > -- ------ Varun Why do the people who know the least know it the loudest?
hi , > select * from dbmedia, wdmedia a, wdmedia b > where id = a.id and a.word='word1' > and id = b.id and b.word='word2' > etc well thanks for that.. but i really cannot understand y pgsql resorts to wierd behaviour.. like for this one : ========================================================================== suryadb=# explain SELECT * FROM dbmedia d, wdmedia a1,wdmedia a2 WHERE d.id = a1.id AND a1.word = 'word1' AND d.id=a2.id and a2.word='word2' and d.host like '144.16%' LIMIT 200; NOTICE: QUERY PLAN: Limit (cost=5033.63..10547.45 rows=1 width=108) -> Nested Loop (cost=5033.63..10547.45 rows=1 width=108) -> Hash Join (cost=5033.63..10196.02 rows=83 width=32) -> Index Scan using wdkmedia on wdmedia a1 (cost=0.00..5027.19 rows=2575 width=16) -> Hash (cost=5027.19..5027.19 rows=2575 width=16) -> Index Scan using wdkmedia on wdmedia a2 (cost=0.00..5027.19 rows=2575 width=16) -> Index Scan using indx2 on dbmedia d (cost=0.00..4.21 rows=1 width=76) ============================================================================ clearly an index scan is done but for this... ========================================================================== suryadb=# explain SELECT * FROM dbmedia d, wdmedia a1 WHERE d.id = a1.id AND a1.word = 'word1' and d.host like '144.16%' LIMIT 200; NOTICE: QUERY PLAN: Limit (cost=0.00..8811.42 rows=1 width=92) -> Nested Loop (cost=0.00..8811.42 rows=1 width=92) -> Seq Scan on dbmedia d (cost=0.00..3752.04 rows=1 width=76) ^^^^^^^^^^^^^^^ -> Index Scan using wdkmedia on wdmedia a1 (cost=0.00..5027.19 rows=2575 width=16) ============================================================================ y is a seq scan done here? hmm..perhaps cuz it might be thinking that '144.16%' has a high share? well but if the query is executed other way .. ie first Index scan on wdkmedia then Index scan on dbmedia is will be "inf" times fasters. I want it to first look in wdmedia and then find the results by index scan which are present in dbmedia (which initially i was doing using IN). Is there a way by which i can force it not to use seq scan? Will be grateful for any help > Seems odd but it may work. But as a rule, joins are faster than subqueries > and avoid IN, INTERSECT and UNION at all costs. Do you have a book on SQL > around? > -- -- ------ Varun If Bill Gates had a penny for every time Windows crashed......Oh wait, he does.