Thread: IN vs EXIIST
I find myself writing a lot of queries with this pattern: select distinct key1 from A where id not it (select distinct key1 from A where x='false'); The reason being that key1 is not a primary key (key1, key2 is the primary key). i.e. I have a table like this key1 key2 x ------------------ a 1 t a 2 t a 3 f b 1 t b 2 t b 3 t c 3 t c 4 f So basically I want key1 values for which all the X's are true. I've seen many posts saying that using IN is not optimal and replacing it with EXISTS is much better. I've read the only docs but I can't understand the difference between the two or how to convert. Can someone point me to some other docs or explain to me how to convert? Or is my table schema wrong? Thanks! Jc
On Thu, 19 Sep 2002, Jean-Christian Imbeault wrote: > I find myself writing a lot of queries with this pattern: > > select distinct key1 from A where id not it > (select distinct key1 from A where x='false'); > > The reason being that key1 is not a primary key (key1, key2 is the > primary key). i.e. I have a table like this > > key1 key2 x > ------------------ > a 1 t > a 2 t > a 3 f > b 1 t > b 2 t > b 3 t > c 3 t > c 4 f > > So basically I want key1 values for which all the X's are true. Why areyou using the sub select. If you just want all the key1 where x is true then the following will work SELECT DISTINCT(key1) FROM a WHERE x = TRUE; If you want all rows and don't care about duplicates then remove the distinct. Hope this helps > > I've seen many posts saying that using IN is not optimal and replacing > it with EXISTS is much better. I've read the only docs but I can't > understand the difference between the two or how to convert. > > Can someone point me to some other docs or explain to me how to convert? > Or is my table schema wrong? > > Thanks! > > Jc > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) > -- Darren Ferguson
On Wed, 2002-09-18 at 21:40, Jean-Christian Imbeault wrote: > select distinct key1 from A where id not it > (select distinct key1 from A where x='false'); > > I've seen many posts saying that using IN is not optimal and replacing > it with EXISTS is much better. I've read the only docs but I can't > understand the difference between the two or how to convert. I just did this today for a query and got a 200x speedup when converting from IN to EXISTS. It's dependent on your specific query exactly how to do the conversion, but generally it's just a question of moving a field from the "outside" to the "inside" of the subselect. For me, SELECT * FROM inventory WHERE itemid IN ( SELECT itemid FROM osinfo WHERE ostype = 'linux' ); became SELECT * FROM inventory WHERE EXISTS ( SELECT itemid FROM osinfo WHERE ostype = 'linux' and inventory.itemid = osinfo.itemid); Same results, same database; first query took 8 sec, second took 39 msec. YMMV b.g. > > Can someone point me to some other docs or explain to me how to convert? > Or is my table schema wrong? > > Thanks! > > Jc > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
Bill Gribble wrote: >I just did this today for a query and got a 200x speedup when >converting from IN to EXISTS. It's dependent on your specific query >exactly how to do the conversion, but generally it's just a question of >moving a field from the "outside" to the "inside" of the subselect. If it's not asking too much could you recommend how I could fix this query? It's a bit more complicated than yours and I can't seem to get the syntax right. select distinct invoice_id from invoice_li where received='true' AND shipped='false' AND cancelled='false' AND (invoice_id not in ( select distinct invoice_id from invoice_li where received='false' AND cancelled='false' ) OR ship_now='true' ) Sorry for the formatting ... Jc
Henshall, Stuart - WCP wrote: > > select distinct invoice_id from invoice_li where received='true' > AND shipped='false' AND cancelled='false' > AND > (NOT EXISTS > ( > select * from invoice_li AS sq_inv_li where received='false' > AND cancelled='false' AND invoice_li.invoice_id=sq_inv_li.invoice_id > ) > OR ship_now='true' > ) > > Should work (but is untested). I'll test it and let you know. Tanks! > As a side note there doesn't seem a point to having distinct on the > subquery in its original form either, so this could be removed to reduce > overhead. Really? I though that reducing the number of results in the sub-select would make the query more efficient since the outer query would have less items to check. I.e. it would be faster to check if something is in (1,2,3) than if it is in (1,2,2,2,2,2,2,2,2,2,3). No? Let me check that new optimized query! Jc
Henshall, Stuart - WCP wrote: [deleted] I tried your optimized query but it was muh slower. Here are the results: ##Query using EXISTS $ !1173 time psql TMP -c "select count(distinct invoice_id) from invoice_li where received='true' AND shipped='false' AND cancelled='false' AND (NOT EXISTS ( select * from invoice_li AS sq_inv_li where received='false' AND cancelled='false' AND invoice_li.invoice_id=sq_inv_li.invoice_id ) OR ship_now='true' ) " count ------- 170 (1 row) real 0m8.322s user 0m0.010s sys 0m0.000s ##Query using IN $ !1175 time psql TMP -c "select count(distinct invoice_id) from invoice_li where received='true' AND shipped='false' AND cancelled='false' AND (invoice_id not in ( select distinct invoice_id from invoice_li where received='false' AND cancelled='false' ) OR ship_now='true' ) " count ------- 170 (1 row) real 0m0.234s user 0m0.000s sys 0m0.010s Maybe EXISTS is not always faster than IN ? After a "vacuum analyze" the numbers become: #using EXISTS real 0m3.229s user 0m0.000s sys 0m0.000s #using IN real 0m0.141s user 0m0.000s sys 0m0.000s Jc
Strangely enough doing an EXPLAIN on the two queries shows that using EXISTS would be faster than IN ... even though it isn't .. psql TMP -c "explain select count(distinct invoice_id) from invoice_li where received='true' AND shipped='false' AND cancelled='false' AND (invoice_id not in ( select distinct invoice_id from invoice_li where received='false' AND cancelled='false' ) OR ship_now='true' ) " NOTICE: QUERY PLAN: Aggregate (cost=17642485.70..17642485.70 rows=1 width=4) -> Seq Scan on invoice_li (cost=0.00..17642460.40 rows=10120 width=4) SubPlan -> Materialize (cost=871.61..871.61 rows=1 width=4) -> Unique (cost=871.61..871.61 rows=1 width=4) -> Sort (cost=871.61..871.61 rows=1 width=4) -> Seq Scan on invoice_li (cost=0.00..871.60 rows=1 width=4) EXPLAIN $ psql TMP -c "explain select count(distinct invoice_id) from invoice_li where received='true' AND shipped='false' AND cancelled='false' AND (NOT EXISTS ( select * from invoice_li AS sq_inv_li where received='false' AND cancelled='false' AND invoice_li.invoice_id=sq_inv_li.invoice_id ) OR ship_now='true' ) " NOTICE: QUERY PLAN: Aggregate (cost=4955505.10..4955505.10 rows=1 width=4) -> Seq Scan on invoice_li (cost=0.00..4955479.80 rows=10120 width=4) SubPlan -> Index Scan using invoice_li_pkey on invoice_li sq_inv_li (cost=0.00..244.79 rows=1 width=80) EXPLAIN Jc
I think you've perhaps got a caching issue in those numbers. Note, I haven't been following this thread at all so I don't know if the IN version is the typical timing for the query. However, I would suggest that you try reversing the order you test those to see what effect the caching of data from the EXISTS version has on the IN timing. Or just repeat the each version and use the second run timings. On Thu, 19 Sep 2002, Jean-Christian Imbeault wrote: > Henshall, Stuart - WCP wrote: > > [deleted] > > I tried your optimized query but it was muh slower. Here are the results: > > ##Query using EXISTS > > $ !1173 > time psql TMP -c "select count(distinct invoice_id) from invoice_li > where received='true' > AND shipped='false' AND cancelled='false' > AND > (NOT EXISTS > ( > select * from invoice_li AS sq_inv_li where received='false' > AND cancelled='false' AND invoice_li.invoice_id=sq_inv_li.invoice_id > ) > OR ship_now='true' > ) " > count > ------- > 170 > (1 row) > > > real 0m8.322s > user 0m0.010s > sys 0m0.000s > > ##Query using IN > > $ !1175 > time psql TMP -c "select count(distinct invoice_id) from invoice_li > where received='true' > AND shipped='false' AND cancelled='false' > AND > (invoice_id not in > ( > select distinct invoice_id from invoice_li where received='false' > AND cancelled='false' > ) > OR ship_now='true' > ) " > count > ------- > 170 > (1 row) > > > real 0m0.234s > user 0m0.000s > sys 0m0.010s > > Maybe EXISTS is not always faster than IN ? > > After a "vacuum analyze" the numbers become: > > #using EXISTS > > real 0m3.229s > user 0m0.000s > sys 0m0.000s > > #using IN > > real 0m0.141s > user 0m0.000s > sys 0m0.000s >
Nigel J. Andrews wrote: > > I think you've perhaps got a caching issue in those numbers. I was thinking the same thing too but reversing the order I execute them in doesn't change the numbers at all. I also tried executing the slow query five times in a row (hoping the results would be cahced) but the speed was always the same. maybe there is something in the query using IN that allows it to be cached while the query using the EXISTS cannot be cached? How can I clear the cache so that I can re-run the fast query to see what it's speed is withouth caching? Jc
Henshall, Stuart - WCP wrote: > > It might be interesting to try removing the distinct clause from the > first query and seeing what difference that makes, both real and > supposed. Also maybe try seeing what difference it would make to the > EXISTS sub query to have distinct invoice_id rather than *. Removing or addin a distinct did not change the results but ... Your query with EXISTS is this: psql TMP -c "select count(distinct invoice_id) from invoice_li where received='true' AND shipped='false' AND cancelled='false' AND (NOT EXISTS ( select * from invoice_li AS sq_inv_li where received='false' AND cancelled='false' AND invoice_li.invoice_id=sq_inv_li.invoice_id ) OR ship_now='true' ) " I am no SQL expert so I took it as face value (especially since I don't quite underst EXISTS). But I removed this "AND invoice_li.invoice_id=sq_inv_li.invoice_id " from the subquery and I got the correct count *and* the same speed as the IN query ... Jc
Quoting Jean-Christian Imbeault <jc@mega-bucks.co.jp>: > > The reason being that key1 is not a primary key (key1, key2 is the > primary key). i.e. I have a table like this > > key1 key2 x > ------------------ > a 1 t > a 2 t > a 3 f > b 1 t > b 2 t > b 3 t > c 3 t > c 4 f > > So basically I want key1 values for which all the X's are true. SELECT key1, Min(CASE WHEN x THEN 1 ELSE 0 END) AS isTrue FROM table GROUP BY key1 HAVING isTrue = 1 > Or is my table schema wrong? I generally don't design tables with composite keys. I find it to complicated in many operations. But on occasion I have seen them being used by others very efficiently. Jochem
>Strangely enough doing an EXPLAIN on the two queries shows that using
>EXISTS would be faster than IN ... even though it isn't ..
A sad sidenote: I am stuck here with a similar IN/EXIST problem. One of our expensive queries contains NOT IN and IN as subqueries. As I was advised on this list, I tried to replace IN with EXISTS. When doing so for part of the query (omitting one of the IN subqueries) the IN and EXIST versions are both about the same speed in execution (about 30sec).
EXPLAIN tells me, that the EXIST version should be 15 times faster, which it is not. Caching is also not an issue here.
EXPLAIN also shows, that both queries want to perform a sequential scan on the outermost query part, instead of an index scan (where clause on the primary key). If I replace the innermost query by the results it gives (splitting the request in two requests), than the planner uses the index scan and is in fact much faster!
My next plan is to switch from 7.1.3 to 7.2, but that requires some planning, as the database is permamently used.
Regards
Jan
Henshall, Stuart - WCP wrote: > > AND invoice_li.invoice_id=sq_inv_li.invoice_id > is the bit that tells it to check the id of the sub query and parent are > the same. > If you don't have this bit I believe it makes the subquery kind of > piontless You are right. I had "ordered" my data set in between running the queries. When I added some new data running the EXISTS query without the AND a.id=b.id bit the results were fast *and* wrong ;) I'm still scratching my head as to why EXISTS isn't faster *and* why EXPLAIN says it should be ... Jc
Quoting Jan Weerts <j.weerts@i-views.de>: > > One > of our expensive queries contains NOT IN and IN as subqueries. As I > was advised on this list, I tried to replace IN with EXISTS. When > doing so for part of the query (omitting one of the IN subqueries) > the IN and EXIST versions are both about the same speed in execution > (about 30sec). I am wondering if the NOT might negate any positive effects from the switch to EXIST. > My next plan is to switch from 7.1.3 to 7.2, but that requires some > planning, as the database is permamently used. I would say that is a good idea anyway, but maybe you can post your query. There might be a way to totally rewrite it like the CASE & GROUP BY idea for Jean-Christian (not that that idea is faster by definition, but you never know). Jochem
Jan Weerts wrote: > > A sad sidenote: I am stuck here with a similar IN/EXIST problem. One of > our expensive queries contains NOT IN and IN as subqueries. As I was > advised on this list, I tried to replace IN with EXISTS. When doing so > for part of the query (omitting one of the IN subqueries) the IN and > EXIST versions are both about the same speed in execution (about 30sec). At least you get the same speed. In my case replacing IN with EXISTS makes it about 25X slower! > EXPLAIN tells me, that the EXIST version should be 15 times faster, > which it is not. Caching is also not an issue here. Yup, same as me. BTW how do I clear the cache? > EXPLAIN also shows, that both queries want to perform a sequential scan > on the outermost query part, instead of an index scan (where clause on > the primary key). To test this I added an index on outer query search term and lo-and-behold ... Just like you I get a seq scan on the outer part for both but the IN query does a seq scan on the inner query while the EXISTS uses an index scan. The EXISTS is still just as slow though .. > My next plan is to switch from 7.1.3 to 7.2, but that requires some > planning, as the database is permamently used. I'm on 7.2 so I don't know if that will help ;) Jc
Jochem van Dieten wrote: > > SELECT key1, Min(CASE WHEN x THEN 1 ELSE 0 END) AS isTrue > FROM table > GROUP BY key1 > HAVING isTrue = 1 I tried this but got an error: psql TMP -c "select invoice_id, min(case when received then 1 else 0 end) as ok from invoice_li group by invoice_id having ok = 1" ERROR: Attribute 'ok' not found Jc
Quoting Jean-Christian Imbeault <jc@mega-bucks.co.jp>: > > Jochem van Dieten wrote: > > > > SELECT key1, Min(CASE WHEN x THEN 1 ELSE 0 END) AS isTrue > > FROM table > > GROUP BY key1 > > HAVING isTrue = 1 > > I tried this but got an error: > psql TMP -c "select invoice_id, min(case when received then 1 else 0 > end) as ok from invoice_li group by invoice_id having ok = 1" > ERROR: Attribute 'ok' not found Sorry, not behind the console at the moment. Try either: SELECT invoice_id FROM ( SELECT invoice_id, Min(CASE WHEN received THEN 1 ELSE 0 END) AS ok FROM invoice_li GROUP BY invoice_id ) agg_invoices WHERE ok = 1 or: SELECT invoice_id, Min(CASE WHEN received THEN 1 ELSE 0 END) AS ok FROM invoice_li GROUP BY invoice_id HAVING Min(CASE WHEN received THEN 1 ELSE 0 END) = 1 Make sure you have an index on invoice_id. I am not sure if it is faster because of the case that needs to be done on each row, but I think it is worth a shot. Jochem
On Thu, Sep 19, 2002 at 06:36:56PM +0900, Jean-Christian Imbeault wrote: > > Maybe EXISTS is not always faster than IN ? I haven't been reading this thread very carefully, but that one is _certainly_ true. In testing, we have had a wide variety of results with IN versus EXISTS. It seems to have to do with the relative size of the two sides. I haven't come up with a perfect rule for deciding yet. NOT IN/NOT EXISTS is more often a clear-cut case: NOT EXISTS is _almost_ (but only almost) always faster. A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
Jochem van Dieten wrote: > > Sorry, not behind the console at the moment. Try either: Ok :) I tried both new versions and they work great. But ... they are not faster than the IN version. They run at exactly the same speed. Another thing is that the IN query and yours don't seem to be so dependent on the DB having been vacuum analyzed while the query with the EXISTS does (it goes from 15 secs to 5secs after vacuuming) I guess using IN in this case is not making the query slow. In case it's important the table contains 16800 rows and returns 300 matches. I'll be sticking with my IN query ... Jc
This is not what he wants. He wants all keys that have only true for x. > Why areyou using the sub select. If you just want all the key1 where x is > true then the following will work > > SELECT DISTINCT(key1) FROM a WHERE x = TRUE; > > If you want all rows and don't care about duplicates then remove the > distinct. > > Hope this helps > > > > > I've seen many posts saying that using IN is not optimal and replacing > > it with EXISTS is much better. I've read the only docs but I can't > > understand the difference between the two or how to convert. > > > > Can someone point me to some other docs or explain to me how to convert? > > Or is my table schema wrong? > > > > Thanks! > > > > Jc > >
I realized after he sent his notice this morning. On Thu, 19 Sep 2002, Jean-Luc Lachance wrote: > This is not what he wants. > > He wants all keys that have only true for x. > > > > Why areyou using the sub select. If you just want all the key1 where x is > > true then the following will work > > > > SELECT DISTINCT(key1) FROM a WHERE x = TRUE; > > > > If you want all rows and don't care about duplicates then remove the > > distinct. > > > > Hope this helps > > > > > > > > I've seen many posts saying that using IN is not optimal and replacing > > > it with EXISTS is much better. I've read the only docs but I can't > > > understand the difference between the two or how to convert. > > > > > > Can someone point me to some other docs or explain to me how to convert? > > > Or is my table schema wrong? > > > > > > Thanks! > > > > > > Jc > > > > -- Darren Ferguson
How about: select ( ( select count(distinct invoice_id) from invoice_li WHERE shipped='false' AND cancelled='false') - ( select count(distinct invoice_id) from invoice_li WHERE received='false' AND shipped='false' AND cancelled='false')); Jean-Christian Imbeault wrote: > > Henshall, Stuart - WCP wrote: > > [deleted] > > I tried your optimized query but it was muh slower. Here are the results: > > ##Query using EXISTS > > $ !1173 > time psql TMP -c "select count(distinct invoice_id) from invoice_li > where received='true' > AND shipped='false' AND cancelled='false' > AND > (NOT EXISTS > ( > select * from invoice_li AS sq_inv_li where received='false' > AND cancelled='false' AND invoice_li.invoice_id=sq_inv_li.invoice_id > ) > OR ship_now='true' > ) " > count > ------- > 170 > (1 row) > > real 0m8.322s > user 0m0.010s > sys 0m0.000s > > ##Query using IN > > $ !1175 > time psql TMP -c "select count(distinct invoice_id) from invoice_li > where received='true' > AND shipped='false' AND cancelled='false' > AND > (invoice_id not in > ( > select distinct invoice_id from invoice_li where received='false' > AND cancelled='false' > ) > OR ship_now='true' > ) " > count > ------- > 170 > (1 row) > > real 0m0.234s > user 0m0.000s > sys 0m0.010s > > Maybe EXISTS is not always faster than IN ? > > After a "vacuum analyze" the numbers become: > > #using EXISTS > > real 0m3.229s > user 0m0.000s > sys 0m0.000s > > #using IN > > real 0m0.141s > user 0m0.000s > sys 0m0.000s > > Jc > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
You can also try: SELECT DISTINCT( key1) FROM table EXCEPT SELECT DISTINCT( key1) from table where not x; Jochem van Dieten wrote: > > Quoting Jean-Christian Imbeault <jc@mega-bucks.co.jp>: > > > > The reason being that key1 is not a primary key (key1, key2 is the > > primary key). i.e. I have a table like this > > > > key1 key2 x > > ------------------ > > a 1 t > > a 2 t > > a 3 f > > b 1 t > > b 2 t > > b 3 t > > c 3 t > > c 4 f > > > > So basically I want key1 values for which all the X's are true. > > SELECT key1, Min(CASE WHEN x THEN 1 ELSE 0 END) AS isTrue > FROM table > GROUP BY key1 > HAVING isTrue = 1 > > > Or is my table schema wrong? > > I generally don't design tables with composite keys. I find it to > complicated in many operations. But on occasion I have seen them being > used by others very efficiently. > > Jochem > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster
Caveat, I'm new to Postgres. However extrapolating from my experience with Oracle I think some of you are going about this optimization work the wrong way around. Don't focus on IN vs EXISTS, focus on the query plans they generate. Which of two identical queries is faster is purely a quirk of optimizer (and it's arguably always a bug if they aren't both the same). It's like debating which of two identical cars will be faster without actually checking who is driving. And don't focus on the estimated costs from explain plan. If they were always right then there wouldn't be any disputes like these. Well not quite. So in the cases below as Jean-Christian showed the NOT IN query leads the optimizer to do two sequential scans and a full sort and unique operation. Whereas the EXISTS syntax query leads the optimizer to use the index on one of the tables. Usually doing a full table scan and an extra sort and unique operation will be slower, unless you were going to be reading a lot of the table anyways. So which of these two plans will be fastest should hinge on how much of the truth table pertains to the attribute being checked. Exactly where the breakeven point is depends on the details of the database engine. For Oracle the traditional rule of thumb is a (surprisingly low) 10%. Someone implied that EXISTS might be more influenced by a VACUUM ANALYZE. That wouldn't be surprising if the optimizer is changing strategies if it has enough data to determine that an index scan will read more than the magic breakeven percentage of the full table and switches to a full table scan. In the query plans below I stripped out the distracting cost estimates: Jean-Christian Imbeault <jc@mega-bucks.co.jp> writes: > psql TMP -c "explain select count(distinct invoice_id) from invoice_li where > received='true' AND shipped='false' AND cancelled='false' > AND > (invoice_id not in ... > > NOTICE: QUERY PLAN: > > Aggregate > -> Seq Scan on invoice_li > SubPlan > -> Materialize > -> Unique > -> Sort > -> Seq Scan on invoice_li > $ psql TMP -c "explain select count(distinct invoice_id) from invoice_li where > received='true' > AND shipped='false' AND cancelled='false' > AND > (NOT EXISTS ... > > NOTICE: QUERY PLAN: > > Aggregate > -> Seq Scan on invoice_li > SubPlan > -> Index Scan using invoice_li_pkey on invoice_li sq_inv_li -- greg