Thread: in() VS exists()

in() VS exists()

From
Julien
Date:
Hello,

Does anyone has an idea why sometimes:
- select ... where ... in (select ...)
is faster than :
- select ... where ... exists(select ...)
and sometimes it's the opposite ?

I had such a situation, I've pasted the queries on:
http://rafb.net/p/KXNZ6892.html and http://rafb.net/p/jvo5DO38.html

It's running PostgreSQL 8.1 with an effective_cache_size of 30000.

specimens.id is the primary key and there are indexes on
sequences(specimen_id) and specimen_measurements(specimen_id)

Is there a general "rule" to know when to use the in() version and when
to use the exists() version ? Is it true to say that the exists()
version is more scalable (with many rows) than the in() version (from
the little tests I made it seems the case) ?

Thanks,
Julien

-- 
Julien Cigar
Belgian Biodiversity Platform
http://www.biodiversity.be
Université Libre de Bruxelles (ULB)
Campus de la Plaine CP 257
Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
Boulevard du Triomphe, entrée ULB 2
B-1050 Bruxelles
Mail: jcigar@ulb.ac.be
@biobel: http://biobel.biodiversity.be/person/show/471
Tel : 02 650 57 52



Re: in() VS exists()

From
Julien
Date:
If I understood well the query plan, the planner optimize the
IN(SELECT ...) version with a JOIN (line 19-20 of the first paste) :

->  Hash IN Join  (cost=240.95..4011.20 rows=1436 width=4) (actual
time=93.971..201.908 rows=1431 loops=1)        Hash Cond: ("outer".id = "inner".specimen_id)

so I guess that :

select count(sp.id) from specimens sp where sp.id in (select specimen_id
from sequences);

is almost the same as :

select count(sp.id) from specimens sp INNER JOIN (select specimen_id
from sequences GROUP BY specimen_id) as foo on foo.specimen_id = sp.id;

?

Thanks,
Julien

On Thu, 2008-03-13 at 14:46 +0100, Bart Degryse wrote:
> The chapter on indexes in the manual should give you a pretty good
> idea on the why.
> IN and EXISTS are not the only possibilities, you can also use inner
> or outer joins.
> Which solution performs best depends on the data, the database
> version, the available indexes, ...
> 
> >>> Julien <jcigar@ulb.ac.be> 2008-03-13 15:47 >>>
> Hello,
> 
> Does anyone has an idea why sometimes:
> - select ... where ... in (select ...)
> is faster than :
> - select ... where ... exists(select ...)
> and sometimes it's the opposite ?
> 
> I had such a situation, I've pasted the queries on:
> http://rafb.net/p/KXNZ6892.html and http://rafb.net/p/jvo5DO38.html
> 
> It's running PostgreSQL 8.1 with an effective_cache_size of 30000.
> 
> specimens.id is the primary key and there are indexes on
> sequences(specimen_id) and specimen_measurements(specimen_id)
> 
> Is there a general "rule" to know when to use the in() version and
> when
> to use the exists() version ? Is it true to say that the exists()
> version is more scalable (with many rows) than the in() version (from
> the little tests I made it seems the case) ?
> 
> Thanks,
> Julien
> 
> -- 
> Julien Cigar
> Belgian Biodiversity Platform
> http://www.biodiversity.be
> Université Libre de Bruxelles (ULB)
> Campus de la Plaine CP 257
> Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> Boulevard du Triomphe, entrée ULB 2
> B-1050 Bruxelles
> Mail: jcigar@ulb.ac.be
> @biobel: http://biobel.biodiversity.be/person/show/471
> Tel : 02 650 57 52
> 
> 
> -- 
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
> 
-- 
Julien Cigar
Belgian Biodiversity Platform
http://www.biodiversity.be
Université Libre de Bruxelles (ULB)
Campus de la Plaine CP 257
Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
Boulevard du Triomphe, entrée ULB 2
B-1050 Bruxelles
Mail: jcigar@ulb.ac.be
@biobel: http://biobel.biodiversity.be/person/show/471
Tel : 02 650 57 52



Re: in() VS exists()

From
"Bart Degryse"
Date:
I think that just
select count(sp.id) from specimens sp INNER JOIN sequences s on s.specimen_id = sp.id;
should be enough

>>> Julien <jcigar@ulb.ac.be> 2008-03-13 17:10 >>>
If I understood well the query plan, the planner optimize the
IN(SELECT ...) version with a JOIN (line 19-20 of the first paste) :

->  Hash IN Join  (cost=240.95..4011.20 rows=1436 width=4) (actual
time=93.971..201.908 rows=1431 loops=1)
         Hash Cond: ("outer".id = "inner".specimen_id)

so I guess that :

select count(sp.id) from specimens sp where sp.id in (select specimen_id
from sequences);

is almost the same as :

select count(sp.id) from specimens sp INNER JOIN (select specimen_id
from sequences GROUP BY specimen_id) as foo on foo.specimen_id = sp.id;

?

Thanks,
Julien

On Thu, 2008-03-13 at 14:46 +0100, Bart Degryse wrote:
> The chapter on indexes in the manual should give you a pretty good
> idea on the why.
> IN and EXISTS are not the only possibilities, you can also use inner
> or outer joins.
> Which solution performs best depends on the data, the database
> version, the available indexes, ...
>
> >>> Julien <jcigar@ulb.ac.be> 2008-03-13 15:47 >>>
> Hello,
>
> Does anyone has an idea why sometimes:
> - select ... where ... in (select ...)
> is faster than :
> - select ... where ... exists(select ...)
> and sometimes it's the opposite ?
>
> I had such a situation, I've pasted the queries on:
> http://rafb.net/p/KXNZ6892.html and http://rafb.net/p/jvo5DO38.html
>
> It's running PostgreSQL 8.1 with an effective_cache_size of 30000.
>
> specimens.id is the primary key and there are indexes on
> sequences(specimen_id) and specimen_measurements(specimen_id)
>
> Is there a general "rule" to know when to use the in() version and
> when
> to use the exists() version ? Is it true to say that the exists()
> version is more scalable (with many rows) than the in() version (from
> the little tests I made it seems the case) ?
>
> Thanks,
> Julien
>
> --
> Julien Cigar
> Belgian Biodiversity Platform
> http://www.biodiversity.be
> Université Libre de Bruxelles (ULB)
> Campus de la Plaine CP 257
> Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> Boulevard du Triomphe, entrée ULB 2
> B-1050 Bruxelles
> Mail: jcigar@ulb.ac.be
> @biobel: http://biobel.biodiversity.be/person/show/471
> Tel : 02 650 57 52
>
>
> --
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
>
--
Julien Cigar
Belgian Biodiversity Platform
http://www.biodiversity.be
Université Libre de Bruxelles (ULB)
Campus de la Plaine CP 257
Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
Boulevard du Triomphe, entrée ULB 2
B-1050 Bruxelles
Mail: jcigar@ulb.ac.be
@biobel: http://biobel.biodiversity.be/person/show/471
Tel : 02 650 57 52


--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

Re: in() VS exists()

From
Julien
Date:
mmh no because it's a one to many relation (a specimen can have more
than one sequence) :

muridae=> select count(sp.id) from specimens sp INNER JOIN sequences s
on s.specimen_id = sp.id;count 
------- 1536
(1 row)

Time: 81.242 ms
muridae=> select count(sp.id) from specimens sp where sp.id in (select
specimen_id from sequences group by specimen_id);count 
------- 1431
(1 row)

Time: 81.736 ms
muridae=> 

(of course this is a bad example, because I could just do: select
count(specimen_id) from sequences group by specimen_id;, but in my
application I have more fields coming from specimens of course)

Julien

On Thu, 2008-03-13 at 15:12 +0100, Bart Degryse wrote:
> I think that just
> select count(sp.id) from specimens sp INNER JOIN sequences s on
> s.specimen_id = sp.id;
> should be enough
> 
> >>> Julien <jcigar@ulb.ac.be> 2008-03-13 17:10 >>>
> If I understood well the query plan, the planner optimize the
> IN(SELECT ...) version with a JOIN (line 19-20 of the first paste) :
> 
> ->  Hash IN Join  (cost=240.95..4011.20 rows=1436 width=4) (actual
> time=93.971..201.908 rows=1431 loops=1)
>          Hash Cond: ("outer".id = "inner".specimen_id)
> 
> so I guess that :
> 
> select count(sp.id) from specimens sp where sp.id in (select
> specimen_id
> from sequences);
> 
> is almost the same as :
> 
> select count(sp.id) from specimens sp INNER JOIN (select specimen_id
> from sequences GROUP BY specimen_id) as foo on foo.specimen_id =
> sp.id;
> 
> ?
> 
> Thanks,
> Julien
> 
> On Thu, 2008-03-13 at 14:46 +0100, Bart Degryse wrote:
> > The chapter on indexes in the manual should give you a pretty good
> > idea on the why.
> > IN and EXISTS are not the only possibilities, you can also use inner
> > or outer joins.
> > Which solution performs best depends on the data, the database
> > version, the available indexes, ...
> > 
> > >>> Julien <jcigar@ulb.ac.be> 2008-03-13 15:47 >>>
> > Hello,
> > 
> > Does anyone has an idea why sometimes:
> > - select ... where ... in (select ...)
> > is faster than :
> > - select ... where ... exists(select ...)
> > and sometimes it's the opposite ?
> > 
> > I had such a situation, I've pasted the queries on:
> > http://rafb.net/p/KXNZ6892.html and http://rafb.net/p/jvo5DO38.html
> > 
> > It's running PostgreSQL 8.1 with an effective_cache_size of 30000.
> > 
> > specimens.id is the primary key and there are indexes on
> > sequences(specimen_id) and specimen_measurements(specimen_id)
> > 
> > Is there a general "rule" to know when to use the in() version and
> > when
> > to use the exists() version ? Is it true to say that the exists()
> > version is more scalable (with many rows) than the in() version
> (from
> > the little tests I made it seems the case) ?
> > 
> > Thanks,
> > Julien
> > 
> > -- 
> > Julien Cigar
> > Belgian Biodiversity Platform
> > http://www.biodiversity.be
> > Université Libre de Bruxelles (ULB)
> > Campus de la Plaine CP 257
> > Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> > Boulevard du Triomphe, entrée ULB 2
> > B-1050 Bruxelles
> > Mail: jcigar@ulb.ac.be
> > @biobel: http://biobel.biodiversity.be/person/show/471
> > Tel : 02 650 57 52
> > 
> > 
> > -- 
> > Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-sql
> > 
> -- 
> Julien Cigar
> Belgian Biodiversity Platform
> http://www.biodiversity.be
> Université Libre de Bruxelles (ULB)
> Campus de la Plaine CP 257
> Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> Boulevard du Triomphe, entrée ULB 2
> B-1050 Bruxelles
> Mail: jcigar@ulb.ac.be
> @biobel: http://biobel.biodiversity.be/person/show/471
> Tel : 02 650 57 52
> 
> 
> -- 
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
> 
-- 
Julien Cigar
Belgian Biodiversity Platform
http://www.biodiversity.be
Université Libre de Bruxelles (ULB)
Campus de la Plaine CP 257
Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
Boulevard du Triomphe, entrée ULB 2
B-1050 Bruxelles
Mail: jcigar@ulb.ac.be
@biobel: http://biobel.biodiversity.be/person/show/471
Tel : 02 650 57 52



Re: in() VS exists()

From
"Bart Degryse"
Date:
how about
select count(distinct s.specimen_id) from specimens sp INNER JOIN sequences s
on s.specimen_id = sp.id;


>>> Julien <jcigar@ulb.ac.be> 2008-03-13 17:27 >>>
mmh no because it's a one to many relation (a specimen can have more
than one sequence) :

muridae=> select count(sp.id) from specimens sp INNER JOIN sequences s
on s.specimen_id = sp.id;
count
-------
  1536
(1 row)

Time: 81.242 ms
muridae=> select count(sp.id) from specimens sp where sp.id in (select
specimen_id from sequences group by specimen_id);
count
-------
  1431
(1 row)

Time: 81.736 ms
muridae=>

(of course this is a bad example, because I could just do: select
count(specimen_id) from sequences group by specimen_id;, but in my
application I have more fields coming from specimens of course)

Julien

On Thu, 2008-03-13 at 15:12 +0100, Bart Degryse wrote:
> I think that just
> select count(sp.id) from specimens sp INNER JOIN sequences s on
> s.specimen_id = sp.id;
> should be enough
>
> >>> Julien <jcigar@ulb.ac.be> 2008-03-13 17:10 >>>
> If I understood well the query plan, the planner optimize the
> IN(SELECT ...) version with a JOIN (line 19-20 of the first paste) :
>
> ->  Hash IN Join  (cost=240.95..4011.20 rows=1436 width=4) (actual
> time=93.971..201.908 rows=1431 loops=1)
>          Hash Cond: ("outer".id = "inner".specimen_id)
>
> so I guess that :
>
> select count(sp.id) from specimens sp where sp.id in (select
> specimen_id
> from sequences);
>
> is almost the same as :
>
> select count(sp.id) from specimens sp INNER JOIN (select specimen_id
> from sequences GROUP BY specimen_id) as foo on foo.specimen_id =
> sp.id;
>
> ?
>
> Thanks,
> Julien
>
> On Thu, 2008-03-13 at 14:46 +0100, Bart Degryse wrote:
> > The chapter on indexes in the manual should give you a pretty good
> > idea on the why.
> > IN and EXISTS are not the only possibilities, you can also use inner
> > or outer joins.
> > Which solution performs best depends on the data, the database
> > version, the available indexes, ...
> >
> > >>> Julien <jcigar@ulb.ac.be> 2008-03-13 15:47 >>>
> > Hello,
> >
> > Does anyone has an idea why sometimes:
> > - select ... where ... in (select ...)
> > is faster than :
> > - select ... where ... exists(select ...)
> > and sometimes it's the opposite ?
> >
> > I had such a situation, I've pasted the queries on:
> > http://rafb.net/p/KXNZ6892.html and http://rafb.net/p/jvo5DO38.html
> >
> > It's running PostgreSQL 8.1 with an effective_cache_size of 30000.
> >
> > specimens.id is the primary key and there are indexes on
> > sequences(specimen_id) and specimen_measurements(specimen_id)
> >
> > Is there a general "rule" to know when to use the in() version and
> > when
> > to use the exists() version ? Is it true to say that the exists()
> > version is more scalable (with many rows) than the in() version
> (from
> > the little tests I made it seems the case) ?
> >
> > Thanks,
> > Julien
> >
> > --
> > Julien Cigar
> > Belgian Biodiversity Platform
> > http://www.biodiversity.be
> > Université Libre de Bruxelles (ULB)
> > Campus de la Plaine CP 257
> > Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> > Boulevard du Triomphe, entrée ULB 2
> > B-1050 Bruxelles
> > Mail: jcigar@ulb.ac.be
> > @biobel: http://biobel.biodiversity.be/person/show/471
> > Tel : 02 650 57 52
> >
> >
> > --
> > Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-sql
> >
> --
> Julien Cigar
> Belgian Biodiversity Platform
> http://www.biodiversity.be
> Université Libre de Bruxelles (ULB)
> Campus de la Plaine CP 257
> Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> Boulevard du Triomphe, entrée ULB 2
> B-1050 Bruxelles
> Mail: jcigar@ulb.ac.be
> @biobel: http://biobel.biodiversity.be/person/show/471
> Tel : 02 650 57 52
>
>
> --
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
>
--
Julien Cigar
Belgian Biodiversity Platform
http://www.biodiversity.be
Université Libre de Bruxelles (ULB)
Campus de la Plaine CP 257
Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
Boulevard du Triomphe, entrée ULB 2
B-1050 Bruxelles
Mail: jcigar@ulb.ac.be
@biobel: http://biobel.biodiversity.be/person/show/471
Tel : 02 650 57 52


--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

Re: in() VS exists()

From
"Bart Degryse"
Date:
The chapter on indexes in the manual should give you a pretty good idea on the why.
IN and EXISTS are not the only possibilities, you can also use inner or outer joins.
Which solution performs best depends on the data, the database version, the available indexes, ...

>>> Julien <jcigar@ulb.ac.be> 2008-03-13 15:47 >>>
Hello,

Does anyone has an idea why sometimes:
- select ... where ... in (select ...)
is faster than :
- select ... where ... exists(select ...)
and sometimes it's the opposite ?

I had such a situation, I've pasted the queries on:
http://rafb.net/p/KXNZ6892.html and http://rafb.net/p/jvo5DO38.html

It's running PostgreSQL 8.1 with an effective_cache_size of 30000.

specimens.id is the primary key and there are indexes on
sequences(specimen_id) and specimen_measurements(specimen_id)

Is there a general "rule" to know when to use the in() version and when
to use the exists() version ? Is it true to say that the exists()
version is more scalable (with many rows) than the in() version (from
the little tests I made it seems the case) ?

Thanks,
Julien

--
Julien Cigar
Belgian Biodiversity Platform
http://www.biodiversity.be
Université Libre de Bruxelles (ULB)
Campus de la Plaine CP 257
Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
Boulevard du Triomphe, entrée ULB 2
B-1050 Bruxelles
Mail: jcigar@ulb.ac.be
@biobel: http://biobel.biodiversity.be/person/show/471
Tel : 02 650 57 52


--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

Re: in() VS exists()

From
Julien
Date:
>From my experience I tend to avoid SELECT DISTINCT queries because it's
usually very slow with many rows ...

For my specific case the result is the same:

muridae=> select count(distinct s.specimen_id) from specimens sp INNER
JOIN sequences s on s.specimen_id = sp.id;count 
------- 1431
(1 row)

Time: 65.351 ms
muridae=> select count(sp.id) from specimens sp where sp.id in (select
specimen_id from sequences group by specimen_id);count 
------- 1431
(1 row)

Time: 66.371 ms

But to give an example, I have a table with ~1 000 000 rows where the
DISTINCT solution is more than 10 times slower :

muridae=> select count(distinct sp.id) from specimens sp INNER JOIN
specimen_measurements m ON m.specimen_id = sp.id;count 
-------75241
(1 row)

Time: 15970.668 ms

muridae=> select count(sp.id) from specimens sp INNER JOIN (select
specimen_id from specimen_measurements GROUP BY specimen_id) as foo on
foo.specimen_id = sp.id;count 
-------75241
(1 row)

Time: 1165.487 ms

Regards,
Julien

On Thu, 2008-03-13 at 15:28 +0100, Bart Degryse wrote:
> how about
> select count(distinct s.specimen_id) from specimens sp INNER JOIN
> sequences s 
> on s.specimen_id = sp.id;
> 
> 
> >>> Julien <jcigar@ulb.ac.be> 2008-03-13 17:27 >>>
> mmh no because it's a one to many relation (a specimen can have more
> than one sequence) :
> 
> muridae=> select count(sp.id) from specimens sp INNER JOIN sequences s
> on s.specimen_id = sp.id;
> count 
> -------
>   1536
> (1 row)
> 
> Time: 81.242 ms
> muridae=> select count(sp.id) from specimens sp where sp.id in (select
> specimen_id from sequences group by specimen_id);
> count 
> -------
>   1431
> (1 row)
> 
> Time: 81.736 ms
> muridae=> 
> 
> (of course this is a bad example, because I could just do: select
> count(specimen_id) from sequences group by specimen_id;, but in my
> application I have more fields coming from specimens of course)
> 
> Julien
> 
> On Thu, 2008-03-13 at 15:12 +0100, Bart Degryse wrote:
> > I think that just
> > select count(sp.id) from specimens sp INNER JOIN sequences s on
> > s.specimen_id = sp.id;
> > should be enough
> > 
> > >>> Julien <jcigar@ulb.ac.be> 2008-03-13 17:10 >>>
> > If I understood well the query plan, the planner optimize the
> > IN(SELECT ...) version with a JOIN (line 19-20 of the first paste) :
> > 
> > ->  Hash IN Join  (cost=240.95..4011.20 rows=1436 width=4) (actual
> > time=93.971..201.908 rows=1431 loops=1)
> >          Hash Cond: ("outer".id = "inner".specimen_id)
> > 
> > so I guess that :
> > 
> > select count(sp.id) from specimens sp where sp.id in (select
> > specimen_id
> > from sequences);
> > 
> > is almost the same as :
> > 
> > select count(sp.id) from specimens sp INNER JOIN (select specimen_id
> > from sequences GROUP BY specimen_id) as foo on foo.specimen_id =
> > sp.id;
> > 
> > ?
> > 
> > Thanks,
> > Julien
> > 
> > On Thu, 2008-03-13 at 14:46 +0100, Bart Degryse wrote:
> > > The chapter on indexes in the manual should give you a pretty good
> > > idea on the why.
> > > IN and EXISTS are not the only possibilities, you can also use
> inner
> > > or outer joins.
> > > Which solution performs best depends on the data, the database
> > > version, the available indexes, ...
> > > 
> > > >>> Julien <jcigar@ulb.ac.be> 2008-03-13 15:47 >>>
> > > Hello,
> > > 
> > > Does anyone has an idea why sometimes:
> > > - select ... where ... in (select ...)
> > > is faster than :
> > > - select ... where ... exists(select ...)
> > > and sometimes it's the opposite ?
> > > 
> > > I had such a situation, I've pasted the queries on:
> > > http://rafb.net/p/KXNZ6892.html and
> http://rafb.net/p/jvo5DO38.html
> > > 
> > > It's running PostgreSQL 8.1 with an effective_cache_size of 30000.
> > > 
> > > specimens.id is the primary key and there are indexes on
> > > sequences(specimen_id) and specimen_measurements(specimen_id)
> > > 
> > > Is there a general "rule" to know when to use the in() version and
> > > when
> > > to use the exists() version ? Is it true to say that the exists()
> > > version is more scalable (with many rows) than the in() version
> > (from
> > > the little tests I made it seems the case) ?
> > > 
> > > Thanks,
> > > Julien
> > > 
> > > -- 
> > > Julien Cigar
> > > Belgian Biodiversity Platform
> > > http://www.biodiversity.be
> > > Université Libre de Bruxelles (ULB)
> > > Campus de la Plaine CP 257
> > > Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> > > Boulevard du Triomphe, entrée ULB 2
> > > B-1050 Bruxelles
> > > Mail: jcigar@ulb.ac.be
> > > @biobel: http://biobel.biodiversity.be/person/show/471
> > > Tel : 02 650 57 52
> > > 
> > > 
> > > -- 
> > > Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> > > To make changes to your subscription:
> > > http://www.postgresql.org/mailpref/pgsql-sql
> > > 
> > -- 
> > Julien Cigar
> > Belgian Biodiversity Platform
> > http://www.biodiversity.be
> > Université Libre de Bruxelles (ULB)
> > Campus de la Plaine CP 257
> > Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> > Boulevard du Triomphe, entrée ULB 2
> > B-1050 Bruxelles
> > Mail: jcigar@ulb.ac.be
> > @biobel: http://biobel.biodiversity.be/person/show/471
> > Tel : 02 650 57 52
> > 
> > 
> > -- 
> > Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-sql
> > 
> -- 
> Julien Cigar
> Belgian Biodiversity Platform
> http://www.biodiversity.be
> Université Libre de Bruxelles (ULB)
> Campus de la Plaine CP 257
> Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> Boulevard du Triomphe, entrée ULB 2
> B-1050 Bruxelles
> Mail: jcigar@ulb.ac.be
> @biobel: http://biobel.biodiversity.be/person/show/471
> Tel : 02 650 57 52
> 
> 
> -- 
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
> 
-- 
Julien Cigar
Belgian Biodiversity Platform
http://www.biodiversity.be
Université Libre de Bruxelles (ULB)
Campus de la Plaine CP 257
Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
Boulevard du Triomphe, entrée ULB 2
B-1050 Bruxelles
Mail: jcigar@ulb.ac.be
@biobel: http://biobel.biodiversity.be/person/show/471
Tel : 02 650 57 52



Re: in() VS exists()

From
"Bart Degryse"
Date:
So it all comes back to what I wrote in the beginning:
"Which solution performs best depends on the data, the database version, the available indexes, ..."
Tips:
  - be aware that statements can be written in different ways
  - test them on realistic data
  - use explain to tune your statements


>>> Julien <jcigar@ulb.ac.be> 2008-03-13 17:50 >>>
>From my experience I tend to avoid SELECT DISTINCT queries because it's
usually very slow with many rows ...

For my specific case the result is the same:

muridae=> select count(distinct s.specimen_id) from specimens sp INNER
JOIN sequences s on s.specimen_id = sp.id;
count
-------
  1431
(1 row)

Time: 65.351 ms
muridae=> select count(sp.id) from specimens sp where sp.id in (select
specimen_id from sequences group by specimen_id);
count
-------
  1431
(1 row)

Time: 66.371 ms

But to give an example, I have a table with ~1 000 000 rows where the
DISTINCT solution is more than 10 times slower :

muridae=> select count(distinct sp.id) from specimens sp INNER JOIN
specimen_measurements m ON m.specimen_id = sp.id;
count
-------
75241
(1 row)

Time: 15970.668 ms

muridae=> select count(sp.id) from specimens sp INNER JOIN (select
specimen_id from specimen_measurements GROUP BY specimen_id) as foo on
foo.specimen_id = sp.id;
count
-------
75241
(1 row)

Time: 1165.487 ms

Regards,
Julien

On Thu, 2008-03-13 at 15:28 +0100, Bart Degryse wrote:
> how about
> select count(distinct s.specimen_id) from specimens sp INNER JOIN
> sequences s
> on s.specimen_id = sp.id;
>
>
> >>> Julien <jcigar@ulb.ac.be> 2008-03-13 17:27 >>>
> mmh no because it's a one to many relation (a specimen can have more
> than one sequence) :
>
> muridae=> select count(sp.id) from specimens sp INNER JOIN sequences s
> on s.specimen_id = sp.id;
> count
> -------
>   1536
> (1 row)
>
> Time: 81.242 ms
> muridae=> select count(sp.id) from specimens sp where sp.id in (select
> specimen_id from sequences group by specimen_id);
> count
> -------
>   1431
> (1 row)
>
> Time: 81.736 ms
> muridae=>
>
> (of course this is a bad example, because I could just do: select
> count(specimen_id) from sequences group by specimen_id;, but in my
> application I have more fields coming from specimens of course)
>
> Julien
>
> On Thu, 2008-03-13 at 15:12 +0100, Bart Degryse wrote:
> > I think that just
> > select count(sp.id) from specimens sp INNER JOIN sequences s on
> > s.specimen_id = sp.id;
> > should be enough
> >
> > >>> Julien <jcigar@ulb.ac.be> 2008-03-13 17:10 >>>
> > If I understood well the query plan, the planner optimize the
> > IN(SELECT ...) version with a JOIN (line 19-20 of the first paste) :
> >
> > ->  Hash IN Join  (cost=240.95..4011.20 rows=1436 width=4) (actual
> > time=93.971..201.908 rows=1431 loops=1)
> >          Hash Cond: ("outer".id = "inner".specimen_id)
> >
> > so I guess that :
> >
> > select count(sp.id) from specimens sp where sp.id in (select
> > specimen_id
> > from sequences);
> >
> > is almost the same as :
> >
> > select count(sp.id) from specimens sp INNER JOIN (select specimen_id
> > from sequences GROUP BY specimen_id) as foo on foo.specimen_id =
> > sp.id;
> >
> > ?
> >
> > Thanks,
> > Julien
> >
> > On Thu, 2008-03-13 at 14:46 +0100, Bart Degryse wrote:
> > > The chapter on indexes in the manual should give you a pretty good
> > > idea on the why.
> > > IN and EXISTS are not the only possibilities, you can also use
> inner
> > > or outer joins.
> > > Which solution performs best depends on the data, the database
> > > version, the available indexes, ...
> > >
> > > >>> Julien <jcigar@ulb.ac.be> 2008-03-13 15:47 >>>
> > > Hello,
> > >
> > > Does anyone has an idea why sometimes:
> > > - select ... where ... in (select ...)
> > > is faster than :
> > > - select ... where ... exists(select ...)
> > > and sometimes it's the opposite ?
> > >
> > > I had such a situation, I've pasted the queries on:
> > > http://rafb.net/p/KXNZ6892.html and
> http://rafb.net/p/jvo5DO38.html
> > >
> > > It's running PostgreSQL 8.1 with an effective_cache_size of 30000.
> > >
> > > specimens.id is the primary key and there are indexes on
> > > sequences(specimen_id) and specimen_measurements(specimen_id)
> > >
> > > Is there a general "rule" to know when to use the in() version and
> > > when
> > > to use the exists() version ? Is it true to say that the exists()
> > > version is more scalable (with many rows) than the in() version
> > (from
> > > the little tests I made it seems the case) ?
> > >
> > > Thanks,
> > > Julien
> > >
> > > --
> > > Julien Cigar
> > > Belgian Biodiversity Platform
> > > http://www.biodiversity.be
> > > Université Libre de Bruxelles (ULB)
> > > Campus de la Plaine CP 257
> > > Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> > > Boulevard du Triomphe, entrée ULB 2
> > > B-1050 Bruxelles
> > > Mail: jcigar@ulb.ac.be
> > > @biobel: http://biobel.biodiversity.be/person/show/471
> > > Tel : 02 650 57 52
> > >
> > >
> > > --
> > > Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> > > To make changes to your subscription:
> > > http://www.postgresql.org/mailpref/pgsql-sql
> > >
> > --
> > Julien Cigar
> > Belgian Biodiversity Platform
> > http://www.biodiversity.be
> > Université Libre de Bruxelles (ULB)
> > Campus de la Plaine CP 257
> > Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> > Boulevard du Triomphe, entrée ULB 2
> > B-1050 Bruxelles
> > Mail: jcigar@ulb.ac.be
> > @biobel: http://biobel.biodiversity.be/person/show/471
> > Tel : 02 650 57 52
> >
> >
> > --
> > Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-sql
> >
> --
> Julien Cigar
> Belgian Biodiversity Platform
> http://www.biodiversity.be
> Université Libre de Bruxelles (ULB)
> Campus de la Plaine CP 257
> Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> Boulevard du Triomphe, entrée ULB 2
> B-1050 Bruxelles
> Mail: jcigar@ulb.ac.be
> @biobel: http://biobel.biodiversity.be/person/show/471
> Tel : 02 650 57 52
>
>
> --
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
>
--
Julien Cigar
Belgian Biodiversity Platform
http://www.biodiversity.be
Université Libre de Bruxelles (ULB)
Campus de la Plaine CP 257
Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
Boulevard du Triomphe, entrée ULB 2
B-1050 Bruxelles
Mail: jcigar@ulb.ac.be
@biobel: http://biobel.biodiversity.be/person/show/471
Tel : 02 650 57 52