Thread: IN vs EXIIST

IN vs EXIIST

From
Jean-Christian Imbeault
Date:
I find myself writing a lot of queries with this pattern:

select distinct key1 from A where id not it
   (select distinct key1 from A where x='false');

The reason being that key1 is not a primary key (key1, key2 is the
primary key). i.e. I have a table like this

key1    key2    x
------------------
a    1    t
a    2    t
a    3    f
b    1    t
b    2    t
b    3    t
c    3    t
c    4    f

So basically I want key1 values for which all the X's are true.

I've seen many posts saying that using IN is not optimal and replacing
it with EXISTS is much better. I've read the only docs but I can't
understand the difference between the two or how to convert.

Can someone point me to some other docs or explain to me how to convert?
Or is my table schema wrong?

Thanks!

Jc


Re: IN vs EXIIST

From
Darren Ferguson
Date:
On Thu, 19 Sep 2002, Jean-Christian Imbeault wrote:

> I find myself writing a lot of queries with this pattern:
>
> select distinct key1 from A where id not it
>    (select distinct key1 from A where x='false');
>
> The reason being that key1 is not a primary key (key1, key2 is the
> primary key). i.e. I have a table like this
>
> key1    key2    x
> ------------------
> a    1    t
> a    2    t
> a    3    f
> b    1    t
> b    2    t
> b    3    t
> c    3    t
> c    4    f
>
> So basically I want key1 values for which all the X's are true.

Why areyou using the sub select. If you just want all the key1 where x is
true then the following will work

SELECT DISTINCT(key1) FROM a WHERE x = TRUE;

If you want all rows and don't care about duplicates then remove the
distinct.


Hope this helps

>
> I've seen many posts saying that using IN is not optimal and replacing
> it with EXISTS is much better. I've read the only docs but I can't
> understand the difference between the two or how to convert.
>
> Can someone point me to some other docs or explain to me how to convert?
> Or is my table schema wrong?
>
> Thanks!
>
> Jc
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>

--
Darren Ferguson


Re: IN vs EXIIST

From
Bill Gribble
Date:
On Wed, 2002-09-18 at 21:40, Jean-Christian Imbeault wrote:
> select distinct key1 from A where id not it
>    (select distinct key1 from A where x='false');
>
> I've seen many posts saying that using IN is not optimal and replacing
> it with EXISTS is much better. I've read the only docs but I can't
> understand the difference between the two or how to convert.

I just did this today for a query and got a 200x speedup when converting
from IN to EXISTS.  It's dependent on your specific query exactly how to
do the conversion, but generally it's just a question of moving a field
from the "outside" to the "inside" of the subselect.  For me,

  SELECT * FROM inventory WHERE itemid IN (
    SELECT itemid FROM osinfo WHERE ostype = 'linux' );

became

  SELECT * FROM inventory WHERE EXISTS (
     SELECT itemid FROM osinfo
     WHERE ostype = 'linux' and inventory.itemid = osinfo.itemid);

Same results, same database; first query took 8 sec, second took 39
msec.

YMMV
b.g.








>
> Can someone point me to some other docs or explain to me how to convert?
> Or is my table schema wrong?
>
> Thanks!
>
> Jc
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)



Re: IN vs EXIIST

From
Jean-Christian Imbeault
Date:
Bill Gribble wrote:

 >I just did this today for a query and got a 200x speedup when
 >converting from IN to EXISTS.  It's dependent on your specific query
 >exactly how to do the conversion, but generally it's just a question
of >moving a field from the "outside" to the "inside" of the subselect.

If it's not asking too much could you recommend how I could fix this
query? It's a bit more complicated than yours and I can't seem to get
the syntax right.

select distinct invoice_id from invoice_li where received='true'
AND shipped='false' AND cancelled='false'
AND
  (invoice_id not in
    (
     select distinct invoice_id from invoice_li where received='false'
AND cancelled='false'
     )
    OR ship_now='true'
    )

Sorry for the formatting ...

Jc


Re: IN vs EXIIST

From
Jean-Christian Imbeault
Date:
Henshall, Stuart - WCP wrote:
 >
 > select distinct invoice_id from invoice_li where received='true'
 > AND shipped='false' AND cancelled='false'
 > AND
 >   (NOT EXISTS
 >     (
 >      select * from invoice_li AS sq_inv_li where received='false'
 > AND cancelled='false' AND invoice_li.invoice_id=sq_inv_li.invoice_id
 >      )
 >     OR ship_now='true'
 >     )
 >
 > Should work (but is untested).

I'll test it and let you know. Tanks!


 > As a side note there doesn't seem a point to having distinct on the
 > subquery in its original form either, so this could be removed to reduce
 > overhead.

Really? I though that reducing the number of results in the sub-select
would make the query more efficient since the outer query would have
less items to check.

I.e. it would be faster to check if something is in (1,2,3) than if it
is in (1,2,2,2,2,2,2,2,2,2,3). No?

Let me check that new optimized query!

Jc


Re: IN vs EXIIST

From
Jean-Christian Imbeault
Date:
Henshall, Stuart - WCP wrote:

[deleted]

I tried your optimized query but it was muh slower. Here are the results:

##Query using EXISTS

$ !1173
time psql TMP -c "select count(distinct invoice_id) from invoice_li
where received='true'
AND shipped='false' AND cancelled='false'
AND
   (NOT EXISTS
     (
      select * from invoice_li AS sq_inv_li where received='false'
AND cancelled='false' AND invoice_li.invoice_id=sq_inv_li.invoice_id
      )
     OR ship_now='true'
     ) "
  count
-------
    170
(1 row)


real    0m8.322s
user    0m0.010s
sys     0m0.000s

##Query using IN

$ !1175
time psql TMP -c "select count(distinct invoice_id) from invoice_li
where received='true'
AND shipped='false' AND cancelled='false'
AND
  (invoice_id not in
    (
     select distinct invoice_id from invoice_li where received='false'
AND cancelled='false'
     )
    OR ship_now='true'
    ) "
  count
-------
    170
(1 row)


real    0m0.234s
user    0m0.000s
sys     0m0.010s

Maybe EXISTS is not always faster than IN ?

After a "vacuum analyze" the numbers become:

#using EXISTS

real    0m3.229s
user    0m0.000s
sys     0m0.000s

#using IN

real    0m0.141s
user    0m0.000s
sys     0m0.000s


Jc


Re: IN vs EXIIST

From
Jean-Christian Imbeault
Date:
Strangely enough doing an EXPLAIN on the two queries shows that using
EXISTS would be faster than IN ... even though it isn't ..

psql TMP -c "explain select count(distinct invoice_id) from invoice_li
where received='true' AND shipped='false' AND cancelled='false'
  AND
   (invoice_id not in
     (
      select distinct invoice_id from invoice_li where received='false'
AND cancelled='false'
      )
     OR ship_now='true'
     ) "
NOTICE:  QUERY PLAN:

Aggregate  (cost=17642485.70..17642485.70 rows=1 width=4)
   ->  Seq Scan on invoice_li  (cost=0.00..17642460.40 rows=10120 width=4)
         SubPlan
           ->  Materialize  (cost=871.61..871.61 rows=1 width=4)
                 ->  Unique  (cost=871.61..871.61 rows=1 width=4)
                       ->  Sort  (cost=871.61..871.61 rows=1 width=4)
                             ->  Seq Scan on invoice_li
(cost=0.00..871.60 rows=1 width=4)

EXPLAIN
$ psql TMP -c "explain select count(distinct invoice_id) from invoice_li
where received='true'
  AND shipped='false' AND cancelled='false'
  AND
    (NOT EXISTS
      (
       select * from invoice_li AS sq_inv_li where received='false'
  AND cancelled='false' AND invoice_li.invoice_id=sq_inv_li.invoice_id
       )
      OR ship_now='true'
      ) "
NOTICE:  QUERY PLAN:

Aggregate  (cost=4955505.10..4955505.10 rows=1 width=4)
   ->  Seq Scan on invoice_li  (cost=0.00..4955479.80 rows=10120 width=4)
         SubPlan
           ->  Index Scan using invoice_li_pkey on invoice_li sq_inv_li
  (cost=0.00..244.79 rows=1 width=80)

EXPLAIN

Jc


Re: IN vs EXIIST

From
"Nigel J. Andrews"
Date:
I think you've perhaps got a caching issue in those numbers.

Note, I haven't been following this thread at all so I don't know if the IN
version is the typical timing for the query. However, I would suggest that you
try reversing the order you test those to see what effect the caching of data
from the EXISTS version has on the IN timing. Or just repeat the each version
and use the second run timings.



On Thu, 19 Sep 2002, Jean-Christian Imbeault wrote:

> Henshall, Stuart - WCP wrote:
>
> [deleted]
>
> I tried your optimized query but it was muh slower. Here are the results:
>
> ##Query using EXISTS
>
> $ !1173
> time psql TMP -c "select count(distinct invoice_id) from invoice_li
> where received='true'
> AND shipped='false' AND cancelled='false'
> AND
>    (NOT EXISTS
>      (
>       select * from invoice_li AS sq_inv_li where received='false'
> AND cancelled='false' AND invoice_li.invoice_id=sq_inv_li.invoice_id
>       )
>      OR ship_now='true'
>      ) "
>   count
> -------
>     170
> (1 row)
>
>
> real    0m8.322s
> user    0m0.010s
> sys     0m0.000s
>
> ##Query using IN
>
> $ !1175
> time psql TMP -c "select count(distinct invoice_id) from invoice_li
> where received='true'
> AND shipped='false' AND cancelled='false'
> AND
>   (invoice_id not in
>     (
>      select distinct invoice_id from invoice_li where received='false'
> AND cancelled='false'
>      )
>     OR ship_now='true'
>     ) "
>   count
> -------
>     170
> (1 row)
>
>
> real    0m0.234s
> user    0m0.000s
> sys     0m0.010s
>
> Maybe EXISTS is not always faster than IN ?
>
> After a "vacuum analyze" the numbers become:
>
> #using EXISTS
>
> real    0m3.229s
> user    0m0.000s
> sys     0m0.000s
>
> #using IN
>
> real    0m0.141s
> user    0m0.000s
> sys     0m0.000s
>


Re: IN vs EXIIST

From
Jean-Christian Imbeault
Date:
Nigel J. Andrews wrote:
>
> I think you've perhaps got a caching issue in those numbers.

I was thinking the same thing too but reversing the order I execute them
in doesn't change the numbers at all.

I also tried executing the slow query five times in a row (hoping the
results would be cahced) but the speed was always the same.

maybe there is something in the query using IN that allows it to be
cached while the query using the EXISTS cannot be cached?

How can I clear the cache so that I can re-run the fast query to see
what it's speed is withouth caching?

Jc


Re: IN vs EXIIST

From
Jean-Christian Imbeault
Date:
Henshall, Stuart - WCP wrote:
>
> It might be interesting to try removing the distinct clause from the
> first query and seeing what difference that makes, both real and
> supposed. Also maybe try seeing what difference it would make to the
> EXISTS sub query to have distinct invoice_id rather than *.

Removing or addin a distinct did not change the results but ...

Your query with EXISTS is this:

psql TMP -c "select count(distinct invoice_id) from invoice_li where
received='true'
AND shipped='false' AND cancelled='false'
AND
   (NOT EXISTS
     (
      select * from invoice_li AS sq_inv_li where received='false'
AND cancelled='false' AND invoice_li.invoice_id=sq_inv_li.invoice_id
      )
     OR ship_now='true'
     ) "

I am no SQL expert so I took it as face value (especially since I don't
quite underst EXISTS).

But I removed this "AND invoice_li.invoice_id=sq_inv_li.invoice_id "
from the subquery and I got the correct count *and* the same speed as
the IN query ...

Jc



Re: IN vs EXIIST

From
Jochem van Dieten
Date:
Quoting Jean-Christian Imbeault <jc@mega-bucks.co.jp>:
>
> The reason being that key1 is not a primary key (key1, key2 is the
> primary key). i.e. I have a table like this
>
> key1    key2    x
> ------------------
> a    1    t
> a    2    t
> a    3    f
> b    1    t
> b    2    t
> b    3    t
> c    3    t
> c    4    f
>
> So basically I want key1 values for which all the X's are true.

SELECT     key1, Min(CASE WHEN x THEN 1 ELSE 0 END) AS isTrue
FROM       table
GROUP BY   key1
HAVING     isTrue = 1


> Or is my table schema wrong?

I generally don't design tables with composite keys. I find it to
complicated in many operations. But on occasion I have seen them being
used by others very efficiently.

Jochem

Re: IN vs EXIIST

From
"Jan Weerts"
Date:

>Strangely enough doing an EXPLAIN on the two queries shows that using
>EXISTS would be faster than IN ... even though it isn't ..

A sad sidenote: I am stuck here with a similar IN/EXIST problem. One of our expensive queries contains NOT IN and IN as subqueries. As I was advised on this list, I tried to replace IN with EXISTS. When doing so for part of the query (omitting one of the IN subqueries) the IN and EXIST versions are both about the same speed in execution (about 30sec).

EXPLAIN tells me, that the EXIST version should be 15 times faster, which it is not. Caching is also not an issue here.

EXPLAIN also shows, that both queries want to perform a sequential scan on the outermost query part, instead of an index scan (where clause on the primary key). If I replace the innermost query by the results it gives (splitting the request in two requests), than the planner uses the index scan and is in fact much faster!

My next plan is to switch from 7.1.3 to 7.2, but that requires some planning, as the database is permamently used.

Regards
  Jan

Re: IN vs EXIIST

From
Jean-Christian Imbeault
Date:
Henshall, Stuart - WCP wrote:
>
> AND invoice_li.invoice_id=sq_inv_li.invoice_id
> is the bit that tells it to check the id of the sub query and parent are
> the same.
> If you don't have this bit I believe it makes the subquery kind of
> piontless

You are right. I had "ordered" my data set in between running the
queries. When I added some new data running the EXISTS query without the
  AND a.id=b.id bit the results were fast *and* wrong ;)

I'm still scratching my head as to why EXISTS isn't faster *and* why
EXPLAIN says it should be ...

Jc


Re: IN vs EXIIST

From
Jochem van Dieten
Date:
Quoting Jan Weerts <j.weerts@i-views.de>:
>
> One
> of our expensive queries contains NOT IN and IN as subqueries. As I
> was advised on this list, I tried to replace IN with EXISTS. When
> doing so for part of the query (omitting one of the IN subqueries)
> the IN and EXIST versions are both about the same speed in execution
> (about 30sec).

I am wondering if the NOT might negate any positive effects from the
switch to EXIST.


> My next plan is to switch from 7.1.3 to 7.2, but that requires some
> planning, as the database is permamently used.

I would say that is a good idea anyway, but maybe you can post your
query. There might be a way to totally rewrite it like the CASE & GROUP
BY idea for Jean-Christian (not that that idea is faster by definition,
but you never know).

Jochem

Re: IN vs EXIIST

From
Jean-Christian Imbeault
Date:
Jan Weerts wrote:
>
> A sad sidenote: I am stuck here with a similar IN/EXIST problem. One of
> our expensive queries contains NOT IN and IN as subqueries. As I was
> advised on this list, I tried to replace IN with EXISTS. When doing so
> for part of the query (omitting one of the IN subqueries) the IN and
> EXIST versions are both about the same speed in execution (about 30sec).

At least you get the same speed. In my case replacing IN with EXISTS
makes it about 25X slower!

> EXPLAIN tells me, that the EXIST version should be 15 times faster,
> which it is not. Caching is also not an issue here.

Yup, same as me. BTW how do I clear the cache?

> EXPLAIN also shows, that both queries want to perform a sequential scan
> on the outermost query part, instead of an index scan (where clause on
> the primary key).

To test this I added an index on outer query search term and
lo-and-behold ...

Just like you I get a seq scan on the outer part for both but the IN
query does a seq scan on the inner query while the EXISTS uses an index
scan. The EXISTS is still just as slow though ..

> My next plan is to switch from 7.1.3 to 7.2, but that requires some
> planning, as the database is permamently used.

I'm on 7.2 so I don't know if that will help ;)

Jc


Re: IN vs EXIIST

From
Jean-Christian Imbeault
Date:
Jochem van Dieten wrote:
 >
 > SELECT     key1, Min(CASE WHEN x THEN 1 ELSE 0 END) AS isTrue
 > FROM       table
 > GROUP BY   key1
 > HAVING     isTrue = 1

I tried this but got an error:

psql TMP -c "select invoice_id, min(case when received then 1 else 0
end) as ok from invoice_li group by invoice_id having ok = 1"

ERROR:  Attribute 'ok' not found

Jc




Re: IN vs EXIIST

From
Jochem van Dieten
Date:
Quoting Jean-Christian Imbeault <jc@mega-bucks.co.jp>:
>
> Jochem van Dieten wrote:
>  >
>  > SELECT     key1, Min(CASE WHEN x THEN 1 ELSE 0 END) AS isTrue
>  > FROM       table
>  > GROUP BY   key1
>  > HAVING     isTrue = 1
>
> I tried this but got an error:
> psql TMP -c "select invoice_id, min(case when received then 1 else 0
> end) as ok from invoice_li group by invoice_id having ok = 1"
> ERROR:  Attribute 'ok' not found

Sorry, not behind the console at the moment. Try either:

SELECT     invoice_id
FROM       (
  SELECT    invoice_id, Min(CASE WHEN received THEN 1 ELSE 0 END) AS ok
  FROM      invoice_li
  GROUP BY  invoice_id
           ) agg_invoices
WHERE      ok = 1

or:

SELECT    invoice_id, Min(CASE WHEN received THEN 1 ELSE 0 END) AS ok
FROM      invoice_li
GROUP BY  invoice_id
HAVING    Min(CASE WHEN received THEN 1 ELSE 0 END) = 1

Make sure you have an index on invoice_id. I am not sure if it is
faster because of the case that needs to be done on each row, but I
think it is worth a shot.

Jochem

Re: IN vs EXIIST

From
Andrew Sullivan
Date:
On Thu, Sep 19, 2002 at 06:36:56PM +0900, Jean-Christian Imbeault wrote:
>
> Maybe EXISTS is not always faster than IN ?

I haven't been reading this thread very carefully, but that one is
_certainly_ true.  In testing, we have had a wide variety of results
with IN versus EXISTS.  It seems to have to do with the relative size
of the two sides.  I haven't come up with a perfect rule for deciding
yet.

NOT IN/NOT EXISTS is more often a clear-cut case: NOT EXISTS is
_almost_ (but only almost) always faster.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: IN vs EXIIST

From
Jean-Christian Imbeault
Date:
Jochem van Dieten wrote:
>
> Sorry, not behind the console at the moment. Try either:

Ok :) I tried both new versions and they work great.

But ... they are not faster than the IN version. They run at exactly the
same speed.

Another thing is that the IN query and yours don't seem to be so
dependent on the DB having been vacuum analyzed while the query with the
EXISTS does (it goes from 15 secs to 5secs after vacuuming)

I guess using IN in this case is not making the query slow.

In case it's important the table contains 16800 rows and returns 300
matches.

I'll be sticking with my IN query ...

Jc


Re: IN vs EXIIST

From
Jean-Luc Lachance
Date:
This is not what he wants.

He wants all keys that have only true for x.


> Why areyou using the sub select. If you just want all the key1 where x is
> true then the following will work
>
> SELECT DISTINCT(key1) FROM a WHERE x = TRUE;
>
> If you want all rows and don't care about duplicates then remove the
> distinct.
>
> Hope this helps
>
> >
> > I've seen many posts saying that using IN is not optimal and replacing
> > it with EXISTS is much better. I've read the only docs but I can't
> > understand the difference between the two or how to convert.
> >
> > Can someone point me to some other docs or explain to me how to convert?
> > Or is my table schema wrong?
> >
> > Thanks!
> >
> > Jc
> >

Re: IN vs EXIIST

From
Darren Ferguson
Date:
I realized after he sent his notice this morning.

On Thu, 19 Sep 2002, Jean-Luc Lachance wrote:

> This is not what he wants.
>
> He wants all keys that have only true for x.
>
>
> > Why areyou using the sub select. If you just want all the key1 where x is
> > true then the following will work
> >
> > SELECT DISTINCT(key1) FROM a WHERE x = TRUE;
> >
> > If you want all rows and don't care about duplicates then remove the
> > distinct.
> >
> > Hope this helps
> >
> > >
> > > I've seen many posts saying that using IN is not optimal and replacing
> > > it with EXISTS is much better. I've read the only docs but I can't
> > > understand the difference between the two or how to convert.
> > >
> > > Can someone point me to some other docs or explain to me how to convert?
> > > Or is my table schema wrong?
> > >
> > > Thanks!
> > >
> > > Jc
> > >
>

--
Darren Ferguson


Re: IN vs EXIIST

From
Jean-Luc Lachance
Date:
How about:
select (
  ( select count(distinct invoice_id)
    from invoice_li
    WHERE shipped='false' AND cancelled='false')
- ( select count(distinct invoice_id)
    from invoice_li
    WHERE received='false' AND shipped='false' AND cancelled='false'));

Jean-Christian Imbeault wrote:
>
> Henshall, Stuart - WCP wrote:
>
> [deleted]
>
> I tried your optimized query but it was muh slower. Here are the results:
>
> ##Query using EXISTS
>
> $ !1173
> time psql TMP -c "select count(distinct invoice_id) from invoice_li
> where received='true'
> AND shipped='false' AND cancelled='false'
> AND
>    (NOT EXISTS
>      (
>       select * from invoice_li AS sq_inv_li where received='false'
> AND cancelled='false' AND invoice_li.invoice_id=sq_inv_li.invoice_id
>       )
>      OR ship_now='true'
>      ) "
>   count
> -------
>     170
> (1 row)
>
> real    0m8.322s
> user    0m0.010s
> sys     0m0.000s
>
> ##Query using IN
>
> $ !1175
> time psql TMP -c "select count(distinct invoice_id) from invoice_li
> where received='true'
> AND shipped='false' AND cancelled='false'
> AND
>   (invoice_id not in
>     (
>      select distinct invoice_id from invoice_li where received='false'
> AND cancelled='false'
>      )
>     OR ship_now='true'
>     ) "
>   count
> -------
>     170
> (1 row)
>
> real    0m0.234s
> user    0m0.000s
> sys     0m0.010s
>
> Maybe EXISTS is not always faster than IN ?
>
> After a "vacuum analyze" the numbers become:
>
> #using EXISTS
>
> real    0m3.229s
> user    0m0.000s
> sys     0m0.000s
>
> #using IN
>
> real    0m0.141s
> user    0m0.000s
> sys     0m0.000s
>
> Jc
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

Re: IN vs EXIIST

From
Jean-Luc Lachance
Date:
You can also try:

SELECT DISTINCT( key1) FROM table EXCEPT SELECT DISTINCT( key1) from
table where not x;



Jochem van Dieten wrote:
>
> Quoting Jean-Christian Imbeault <jc@mega-bucks.co.jp>:
> >
> > The reason being that key1 is not a primary key (key1, key2 is the
> > primary key). i.e. I have a table like this
> >
> > key1  key2    x
> > ------------------
> > a     1       t
> > a     2       t
> > a     3       f
> > b     1       t
> > b     2       t
> > b     3       t
> > c     3       t
> > c     4       f
> >
> > So basically I want key1 values for which all the X's are true.
>
> SELECT     key1, Min(CASE WHEN x THEN 1 ELSE 0 END) AS isTrue
> FROM       table
> GROUP BY   key1
> HAVING     isTrue = 1
>
> > Or is my table schema wrong?
>
> I generally don't design tables with composite keys. I find it to
> complicated in many operations. But on occasion I have seen them being
> used by others very efficiently.
>
> Jochem
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster

Re: IN vs EXIIST

From
Greg Stark
Date:
Caveat, I'm new to Postgres. However extrapolating from my experience with
Oracle I think some of you are going about this optimization work the wrong
way around.

Don't focus on IN vs EXISTS, focus on the query plans they generate. Which of
two identical queries is faster is purely a quirk of optimizer (and it's
arguably always a bug if they aren't both the same). It's like debating which
of two identical cars will be faster without actually checking who is driving.

And don't focus on the estimated costs from explain plan. If they were always
right then there wouldn't be any disputes like these. Well not quite.

So in the cases below as Jean-Christian showed the NOT IN query leads the
optimizer to do two sequential scans and a full sort and unique operation.
Whereas the EXISTS syntax query leads the optimizer to use the index on one of
the tables.

Usually doing a full table scan and an extra sort and unique operation will be
slower, unless you were going to be reading a lot of the table anyways. So
which of these two plans will be fastest should hinge on how much of the truth
table pertains to the attribute being checked.

Exactly where the breakeven point is depends on the details of the database
engine. For Oracle the traditional rule of thumb is a (surprisingly low) 10%.

Someone implied that EXISTS might be more influenced by a VACUUM ANALYZE. That
wouldn't be surprising if the optimizer is changing strategies if it has
enough data to determine that an index scan will read more than the magic
breakeven percentage of the full table and switches to a full table scan.

In the query plans below I stripped out the distracting cost estimates:

Jean-Christian Imbeault <jc@mega-bucks.co.jp> writes:

> psql TMP -c "explain select count(distinct invoice_id) from invoice_li where
> received='true' AND shipped='false' AND cancelled='false'
>   AND
>    (invoice_id not in
...
>
> NOTICE:  QUERY PLAN:
>
> Aggregate
>    ->  Seq Scan on invoice_li
>          SubPlan
>            ->  Materialize
>                  ->  Unique
>                        ->  Sort
>                              ->  Seq Scan on invoice_li

> $ psql TMP -c "explain select count(distinct invoice_id) from invoice_li where
> received='true'
>   AND shipped='false' AND cancelled='false'
>   AND
>     (NOT EXISTS
...
>
> NOTICE:  QUERY PLAN:
>
> Aggregate
>    ->  Seq Scan on invoice_li
>          SubPlan
>            ->  Index Scan using invoice_li_pkey on invoice_li sq_inv_li


--
greg