Thread: EXISTS vs IN vs OUTER JOINS
Hi Few days ago I read, that EXISTS is better than IN, but only if there are many records (how many?). I was wondering which one is better and when. Did anyone try to compare these queries doing the same work: - select * from some_table t where t.id [not] in (select id from filter); - select * from some_table t where [not] exists (select * from filter where id=t.id); - select * from some_table t left join filter f using (id) where f.id is [not] null; Regards, Tomasz Myrta
Tomasz, > Few days ago I read, that EXISTS is better than IN, but only if there > are many records (how many?). I was wondering which one is better and > when. Did anyone try to compare these queries doing the same work: > > - select * from some_table t > where t.id [not] in (select id from filter); > -select * from some_table t > where [not] exists (select * from filter where id=t.id); The rule I use is: if I expect the sub-select to return more than 12 records 20% or more of the time, use EXISTS. The speed gain for IN on small lists is not as dramatic as the speed loss for EXISTS on large lists. More importantly, the difference between NOT IN and NOT EXISTS can be as much as 20:1 on large sub-selects, as opposed to IN and EXISTS, where I have rarely seen a difference of more than 3:1. As I understand it, this is because NOT EXISTS can use optimized join algorithms to locate matching rows, whereas NOT IN must compare each row against every possible matching value in the subselect. It also makes a difference whether or not the referenced field(s) in the subselect is indexed. EXISTS will often use an index to compare the values in the master query to the sub-query. As far as I know, IN can use an index to retrieve the subquery values, but not to sort or compare them after they have been retreived into memory. > -select * from some_table t > left join filter f using (id) > where f.id is [not] null; This will not get you the same result as the above. It will get you all rows from t+f where a record is present in f and has (or does not have) a NULL value for f.id. While this type of query works in MS Access, it will not work in SQL92/99-commpliant databases. Incidentally, the dramatic differences between IN and EXISTS are not only a "PostgreSQL Thing". The same rules apply to MS SQL Server and SQL Anywhere, for the same reasons. -Josh Berkus
On Thu, Dec 19, 2002 at 09:15:36AM -0800, Josh Berkus wrote: > > -select * from some_table t > > left join filter f using (id) > > where f.id is [not] null; > > This will not get you the same result as the above. It will get you > all rows from t+f where a record is present in f and has (or does not > have) a NULL value for f.id. While this type of query works in MS > Access, it will not work in SQL92/99-commpliant databases. filter_table does not contain null fields. It has only two states: it has row or it hasn't row coresponding to row in some_table. And now, which one is better? Tomasz Myrta
Josh Berkus wrote: > where I have rarely seen a difference of more than 3:1. As I > understand it, this is because NOT EXISTS can use optimized join > algorithms to locate matching rows, whereas NOT IN must compare each > row against every possible matching value in the subselect. > > It also makes a difference whether or not the referenced field(s) in > the subselect is indexed. EXISTS will often use an index to compare > the values in the master query to the sub-query. As far as I know, IN > can use an index to retrieve the subquery values, but not to sort or > compare them after they have been retreived into memory. I wonder if "[NOT] IN (subselect)" could be improved with a hash table in similar fashion to the hash aggregate solution Tom recently implemented? Joe
Tomasz, > > This will not get you the same result as the above. It will get you > > all rows from t+f where a record is present in f and has (or does not > > have) a NULL value for f.id. While this type of query works in MS > > Access, it will not work in SQL92/99-commpliant databases. > > filter_table does not contain null fields. It has only two states: it > has row > or it hasn't row coresponding to row in some_table. > > And now, which one is better? You're not listening. I said that LEFT JOIN won't work. At all. Please re-read the paragraph above, which explains why. -- -Josh Berkus Aglio Database Solutions San Francisco
On Thu, Dec 19, 2002 at 11:02:39AM -0800, Josh Berkus wrote: > > Tomasz, > You're not listening. I said that LEFT JOIN won't work. At all. > > Please re-read the paragraph above, which explains why. I read your mail once again, but I still don't understand what are you talking about. I'll write example - maybe it will help us to understand each other. I have three tables: users, things and access_list create table users( user_id integer primary key, username varchar ); insert into users(1,'Tomasz'); create table things( thing_id int4 primary key, thingname varchar ); insert into things(1,'thing1'); insert into things(2,'thing2'); insert into things(3,'thing3'); insert into things(4,'thing4'); insert into things(5,'thing5'); create table access_list( user_id int4 not null references users, thing_id int4 not null references things ); insert into access_list(1,1); insert into access_list(1,4); SELECT u.username,t.thingname,al.thing_id from users u cross join things t left join access_list al on (s.user_id=al.user_id and t.thing_id=al.thing_id) Result: username thingname thing_id Tomasz thing1 1 Tomasz thing2 Tomasz thing3 Tomasz thing4 4 Tomasz thing5 5 Now if we add "where al.user_id is null" we get: Tomasz thing2 Tomasz thing3 Or if we add "where al.user_id is not null" we get: (the same result we have when using inner join) Tomasz thing1 1 Tomasz thing4 4 Tomasz thing5 5 I know this method will fail if we have not unique pairs in table access_list, but in other case it looks ok. Tomasz Myrta
Joe Conway <mail@joeconway.com> writes: > I wonder if "[NOT] IN (subselect)" could be improved with a hash table in > similar fashion to the hash aggregate solution Tom recently implemented? It's being worked on ;-) http://archives.postgresql.org/pgsql-hackers/2002-11/msg01055.php Assuming I get this done, the conventional wisdom that "EXISTS outperforms IN" will be stood on its head --- unless we add planner code to try to reverse-engineer an IN from an EXISTS, which is something I'm not really eager to expend code and cycles on. regards, tom lane
Tom Lane wrote: > Joe Conway <mail@joeconway.com> writes: > > I wonder if "[NOT] IN (subselect)" could be improved with a hash table in > > similar fashion to the hash aggregate solution Tom recently implemented? > > It's being worked on ;-) > > http://archives.postgresql.org/pgsql-hackers/2002-11/msg01055.php > > Assuming I get this done, the conventional wisdom that "EXISTS > outperforms IN" will be stood on its head --- unless we add planner code > to try to reverse-engineer an IN from an EXISTS, which is something I'm > not really eager to expend code and cycles on. I am looking forward to removing _that_ FAQ item. :-) -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tomasz, > I read your mail once again, but I still don't understand what are > you > talking about. > I'll write example - maybe it will help us to understand each other. Hmmm ... you're right. Sorry for being dense. It shouldn't work, but it does. Tom, Bruce: If I run the query: SELECT t1.* FROM table1 t1 LEFT JOIN table2 t2 ON t1.xid = t2.xid WHERE t2.label IS NULL I will get rows in t1 for which there is no row in t2. This does not seem SQL-spec to me; shouldn't I get only rows from t1 where a row exists in t2 and t2.label IS NULL? -Josh
"Josh Berkus" <josh@agliodbs.com> writes: > If I run the query: > SELECT t1.* > FROM table1 t1 > LEFT JOIN table2 t2 ON t1.xid = t2.xid > WHERE t2.label IS NULL > I will get rows in t1 for which there is no row in t2. This does not > seem SQL-spec to me; shouldn't I get only rows from t1 where a row > exists in t2 and t2.label IS NULL? No; that would be the behavior of an inner join, but you did a left join. The above will give you t1 rows for which there is no match in t2 (plus rows for which there is a match containing null in t2.label). regards, tom lane
On Thu, 19 Dec 2002 15:19:21 -0800, "Josh Berkus" <josh@agliodbs.com> wrote: >SELECT t1.* >FROM table1 t1 > LEFT JOIN table2 t2 ON t1.xid = t2.xid >WHERE t2.label IS NULL Josh, also note that Tomasz did something like SELECT t1.* FROM table1 t1 LEFT JOIN table2 t2 ON t1.xid = t2.xid WHERE t2.xid IS NULL ^^^ This special trick guarantees that you get exactly the rows from t1 not having a matching row in t2, because a NULL xid in t2 would not satisfy the condition t1.xid = t2.xid. Servus Manfred