Thread: EXISTS vs IN vs OUTER JOINS

EXISTS vs IN vs OUTER JOINS

From

Tomasz Myrta

Date:

19 December 2002, 05:09:33

Hi
Few days ago I read, that EXISTS is better than IN, but only if there
are many records (how many?). I was wondering which one is better and
when. Did anyone try to compare these queries doing the same work:

- select * from some_table t
     where t.id [not] in (select id from filter);
- select * from some_table t
     where [not] exists (select * from filter where id=t.id);
- select * from some_table t
    left join filter f using (id)
   where f.id is [not] null;

Regards,
Tomasz Myrta

Re: EXISTS vs IN vs OUTER JOINS

From

"Josh Berkus"

Date:

19 December 2002, 12:16:36

Tomasz,

> Few days ago I read, that EXISTS is better than IN, but only if there
> are many records (how many?). I was wondering which one is better and
> when. Did anyone try to compare these queries doing the same work:
>
> - select * from some_table t
>     where t.id [not] in (select id from filter);
> -select * from some_table t
>     where [not] exists (select * from filter where id=t.id);

The rule I use is: if I expect the sub-select to return more than 12
records 20% or more of the time, use EXISTS.   The speed gain for IN on
small lists is not as dramatic as the speed loss for EXISTS on large
lists.

More importantly, the difference between NOT IN and NOT EXISTS can be
as much as 20:1 on large sub-selects, as opposed to IN and EXISTS,
where I have rarely seen a difference of more than 3:1.  As I
understand it, this is because NOT EXISTS can use optimized join
algorithms to locate matching rows, whereas NOT IN must compare each
row against every possible matching value in the subselect.

It also makes a difference whether or not the referenced field(s) in
the subselect is indexed.   EXISTS will often use an index to compare
the values in the master query to the sub-query.  As far as I know, IN
can use an index to retrieve the subquery values, but not to sort or
compare them after they have been retreived into memory.

> -select * from some_table t
>    left join filter f using (id)
>   where f.id is [not] null;

This will not get you the same result as the above.  It will get you
all rows from t+f where a record is present in f and has (or does not
have) a NULL value for f.id.   While this type of query works in MS
Access, it will not work in SQL92/99-commpliant databases.

Incidentally, the dramatic differences between IN and EXISTS are not
only a "PostgreSQL Thing".   The same rules apply to  MS SQL Server and
SQL Anywhere, for the same reasons.

-Josh Berkus

Re: EXISTS vs IN vs OUTER JOINS

From

jasiek@klaster.net

Date:

19 December 2002, 12:47:46

On Thu, Dec 19, 2002 at 09:15:36AM -0800, Josh Berkus wrote:
> > -select * from some_table t
> >    left join filter f using (id)
> >   where f.id is [not] null;
>
> This will not get you the same result as the above.  It will get you
> all rows from t+f where a record is present in f and has (or does not
> have) a NULL value for f.id.   While this type of query works in MS
> Access, it will not work in SQL92/99-commpliant databases.

filter_table does not contain null fields. It has only two states: it
has row
or it hasn't row coresponding to row in some_table.

And now, which one is better?

Tomasz Myrta

Re: EXISTS vs IN vs OUTER JOINS

From

Joe Conway

Date:

19 December 2002, 12:48:28

Josh Berkus wrote:
> where I have rarely seen a difference of more than 3:1.  As I
> understand it, this is because NOT EXISTS can use optimized join
> algorithms to locate matching rows, whereas NOT IN must compare each
> row against every possible matching value in the subselect.
>
> It also makes a difference whether or not the referenced field(s) in
> the subselect is indexed.   EXISTS will often use an index to compare
> the values in the master query to the sub-query.  As far as I know, IN
> can use an index to retrieve the subquery values, but not to sort or
> compare them after they have been retreived into memory.

I wonder if "[NOT] IN (subselect)" could be improved with a hash table in
similar fashion to the hash aggregate solution Tom recently implemented?

Joe

Re: EXISTS vs IN vs OUTER JOINS

From

Josh Berkus

Date:

19 December 2002, 13:58:40

Tomasz,

> > This will not get you the same result as the above.  It will get you
> > all rows from t+f where a record is present in f and has (or does not
> > have) a NULL value for f.id.   While this type of query works in MS
> > Access, it will not work in SQL92/99-commpliant databases.
>
> filter_table does not contain null fields. It has only two states: it
> has row
> or it hasn't row coresponding to row in some_table.
>
> And now, which one is better?

You're not listening.  I said that LEFT JOIN won't work.  At all.

Please re-read the paragraph above, which explains why.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco

Re: EXISTS vs IN vs OUTER JOINS

From

jasiek@klaster.net

Date:

19 December 2002, 14:45:07

On Thu, Dec 19, 2002 at 11:02:39AM -0800, Josh Berkus wrote:
>
> Tomasz,
> You're not listening.  I said that LEFT JOIN won't work.  At all.
>
> Please re-read the paragraph above, which explains why.
I read your mail once again, but I still don't understand what are you
talking about.
I'll write example - maybe it will help us to understand each other.


I have three tables: users, things and access_list
create table users(
user_id integer primary key,
username varchar
);
insert into users(1,'Tomasz');

create table things(
thing_id int4 primary key,
thingname varchar
);
insert into things(1,'thing1');
insert into things(2,'thing2');
insert into things(3,'thing3');
insert into things(4,'thing4');
insert into things(5,'thing5');

create table access_list(
user_id int4 not null references users,
thing_id int4 not null references things
);
insert into access_list(1,1);
insert into access_list(1,4);

SELECT u.username,t.thingname,al.thing_id
from users u cross join things t
left join access_list al on (s.user_id=al.user_id and
t.thing_id=al.thing_id)

Result:
username       thingname     thing_id
Tomasz           thing1         1
Tomasz           thing2
Tomasz           thing3
Tomasz           thing4         4
Tomasz           thing5         5

Now if we add "where al.user_id is null" we get:
Tomasz           thing2
Tomasz           thing3

Or if we add "where al.user_id is not null" we get:
(the same result we have when using inner join)
Tomasz           thing1         1
Tomasz           thing4         4
Tomasz           thing5         5

I know this method will fail if we have not unique pairs in table
access_list, but in other case it looks ok.
Tomasz Myrta

Re: EXISTS vs IN vs OUTER JOINS

From

Tom Lane

Date:

19 December 2002, 17:47:09

Joe Conway <mail@joeconway.com> writes:
> I wonder if "[NOT] IN (subselect)" could be improved with a hash table in
> similar fashion to the hash aggregate solution Tom recently implemented?

It's being worked on ;-)

http://archives.postgresql.org/pgsql-hackers/2002-11/msg01055.php

Assuming I get this done, the conventional wisdom that "EXISTS
outperforms IN" will be stood on its head --- unless we add planner code
to try to reverse-engineer an IN from an EXISTS, which is something I'm
not really eager to expend code and cycles on.

            regards, tom lane

Re: EXISTS vs IN vs OUTER JOINS

From

Bruce Momjian

Date:

19 December 2002, 17:53:22

Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
> > I wonder if "[NOT] IN (subselect)" could be improved with a hash table in
> > similar fashion to the hash aggregate solution Tom recently implemented?
>
> It's being worked on ;-)
>
> http://archives.postgresql.org/pgsql-hackers/2002-11/msg01055.php
>
> Assuming I get this done, the conventional wisdom that "EXISTS
> outperforms IN" will be stood on its head --- unless we add planner code
> to try to reverse-engineer an IN from an EXISTS, which is something I'm
> not really eager to expend code and cycles on.

I am looking forward to removing _that_ FAQ item.  :-)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: EXISTS vs IN vs OUTER JOINS

From

"Josh Berkus"

Date:

19 December 2002, 18:20:38

Tomasz,

> I read your mail once again, but I still don't understand what are
> you
> talking about.
> I'll write example - maybe it will help us to understand each other.

Hmmm ... you're right.  Sorry for being dense.  It shouldn't work, but
it does.

Tom, Bruce:

If I run the query:

SELECT t1.*
FROM table1 t1
  LEFT JOIN table2 t2 ON t1.xid = t2.xid
WHERE t2.label IS NULL

I will get rows in t1 for which there is no row in t2.  This does not
seem SQL-spec to me; shouldn't I get only rows from t1 where a row
exists in t2 and t2.label IS NULL?

-Josh

Re: EXISTS vs IN vs OUTER JOINS

From

Tom Lane

Date:

19 December 2002, 19:03:31

"Josh Berkus" <josh@agliodbs.com> writes:
> If I run the query:

> SELECT t1.*
> FROM table1 t1
>   LEFT JOIN table2 t2 ON t1.xid = t2.xid
> WHERE t2.label IS NULL

> I will get rows in t1 for which there is no row in t2.  This does not
> seem SQL-spec to me; shouldn't I get only rows from t1 where a row
> exists in t2 and t2.label IS NULL?

No; that would be the behavior of an inner join, but you did a left
join.  The above will give you t1 rows for which there is no match in t2
(plus rows for which there is a match containing null in t2.label).

            regards, tom lane

Re: EXISTS vs IN vs OUTER JOINS

From

Manfred Koizar

Date:

20 December 2002, 03:54:16

On Thu, 19 Dec 2002 15:19:21 -0800, "Josh Berkus" <josh@agliodbs.com>
wrote:
>SELECT t1.*
>FROM table1 t1
>  LEFT JOIN table2 t2 ON t1.xid = t2.xid
>WHERE t2.label IS NULL

Josh, also note that Tomasz did something like

SELECT t1.*
FROM table1 t1
  LEFT JOIN table2 t2 ON t1.xid = t2.xid
WHERE t2.xid IS NULL
         ^^^
This special trick guarantees that you get exactly the rows from t1
not having a matching row in t2, because a NULL xid in t2 would not
satisfy the condition t1.xid = t2.xid.

Servus
 Manfred