Thread: Highly obscure and erratic

Highly obscure and erratic

From

Varun Kacholia

Date:

18 June 2002, 17:26:24

hi,
 I cannot get what causes the difference in the execution of these 2 commands
1. SELECT * FROM dbmedia WHERE ID IN ((SELECT id FROM wdmedia WHERE word = 'whatever') )  LIMIT 20;
2. SELECT * FROM dbmedia WHERE ID IN (1234,2345,1242,1256,1245,1567,2222,22,345,234,567,456,35,56) ;

  where ID is the primary key.
  First let me tell that i have done all vacuum's ..analyze,full et all.
The explain output should make it clear that for the former a "sequential"
scan is done and for the later an "index" scan is done.Also the time
take indicates that a sequential scan(atleast not index scan) is done
on the former.
-------------------------------------------------------------------------
explain SELECT * FROM dbmedia WHERE ID IN ((SELECT id FROM wdmedia WHERE word = 'whatever') )  LIMIT 20;
NOTICE:  QUERY PLAN:

Limit  (cost=0.00..100544.53 rows=20 width=76)
  ->  Seq Scan on dbmedia  (cost=0.00..507161673.46 rows=100883 width=76)
      ^^^^^^^                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  GAWD!!
      SubPlan
                    ->  Materialize  (cost=5027.19..5027.19 rows=2575 width=4)
                                    ->  Index Scan using wdkmedia on wdmedia  (cost=0.00..5027.19 rows=2575 width=4)


----------------------------------------------------------------------------

 explain SELECT * FROM dbmedia WHERE ID IN (1234,2345,1242,1256,1245,1567,2222,22,345,234,567,456,35,56) ;
 NOTICE:  QUERY PLAN:

 Index Scan using dbmedia_pkey, dbmedia_pkey, dbmedia_pkey, dbmedia_pkey, dbmedia_pkey, dbmedia_pkey, dbmedia_pkey,
dbmedia_pkey,dbmedia_pkey, dbmedia_pkey, dbmedia_pkey, dbmedia_pkey, dbmedia_pkey, dbmedia_pkey on dbmedia
(cost=0.00..59.40rows=1 width=76) 
 --------------------------------------------------------------------------

 observe the index scan..

OK , for those ppl who might be feeling that the o/p of the nested query in the
former case might be a significant portion (and so the db does a sq scan)
let me say that the result set of sub-query  (SELECT id FROM wdmedia WHERE word = 'whatever') is of length 1,000 while
thetable dbmedia is of length 
100,000 and if a db selects to do a seq scan due to this then ... i can
say no more.
 postgresql developers/gurus please help.
--
------
Varun
Printer not ready. Do you have a pen?

----- End forwarded message -----

--
------
Varun
Printer not ready. Do you have a pen?

Re: Highly obscure and erratic

From

Stephan Szabo

Date:

18 June 2002, 18:34:03

On Wed, 19 Jun 2002, Varun Kacholia wrote:

> hi,
>  I cannot get what causes the difference in the execution of these 2 commands
> 1. SELECT * FROM dbmedia WHERE ID IN ((SELECT id FROM wdmedia WHERE word = 'whatever') )  LIMIT 20;
> 2. SELECT * FROM dbmedia WHERE ID IN (1234,2345,1242,1256,1245,1567,2222,22,345,234,567,456,35,56) ;

The former is not very well optimized (see the FAQ for suggestions on
using EXISTS instead).

Re: Highly obscure and erratic

From

Varun Kacholia

Date:

18 June 2002, 19:17:37


> > hi,
> >  I cannot get what causes the difference in the execution of these 2 commands
> > 1. SELECT * FROM dbmedia WHERE ID IN ((SELECT id FROM wdmedia WHERE word = 'whatever') )  LIMIT 20;
> > 2. SELECT * FROM dbmedia WHERE ID IN (1234,2345,1242,1256,1245,1567,2222,22,345,234,567,456,35,56) ;
>
> The former is not very well optimized (see the FAQ for suggestions on
> using EXISTS instead).
 hi ,
  Had a look at that..
  but still it is toooo slow :(
  when is it intended to be fixed up??
  <excerpt below>
   Currently we join subqueries by sequentlialy scanning the result of the
   subquery for each row of the outer query.A workaround is to replace IN
   with EXISTS
   <snip>
   We hope to fix this in future release.
   <end excerpt>

   guys i am waiting for that release ;-).


--
------
Varun
Once I finally figured out all of life's answers.....they changed the questions.

Re: Highly obscure and erratic

From

Stephan Szabo

Date:

18 June 2002, 19:21:20

On Wed, 19 Jun 2002, Varun Kacholia wrote:

>
>
> > > hi,
> > >  I cannot get what causes the difference in the execution of these 2 commands
> > > 1. SELECT * FROM dbmedia WHERE ID IN ((SELECT id FROM wdmedia WHERE word = 'whatever') )  LIMIT 20;
> > > 2. SELECT * FROM dbmedia WHERE ID IN (1234,2345,1242,1256,1245,1567,2222,22,345,234,567,456,35,56) ;
> >
> > The former is not very well optimized (see the FAQ for suggestions on
> > using EXISTS instead).
>  hi ,
>   Had a look at that..
>   but still it is toooo slow :(
>   when is it intended to be fixed up??

What does explain show for the exists version?

Re: Highly obscure and erratic

From

Varun Kacholia

Date:

18 June 2002, 19:22:29

> What does explain show for the exists version?

suryadb=# explain select * from dbmedia where EXISTS (select ID from wdmedia where word='whatever' AND dbmedia.id=id )
LIMIT200; 
NOTICE:  QUERY PLAN:

Limit  (cost=0.00..1006732.42 rows=200 width=76)
  ->  Seq Scan on dbmedia  (cost=0.00..507810931.25 rows=100883 width=76)
          SubPlan
                    ->  Index Scan using wdkmedia on wdmedia  (cost=0.00..5033.63 rows=1 width=4)

                    EXPLAIN

still seq scan :((
someone please fix this bug :(

--
------
Varun
Once I finally figured out all of life's answers.....they changed the questions.

Re: Highly obscure and erratic

From

Stephan Szabo

Date:

18 June 2002, 19:26:11

On Wed, 19 Jun 2002, Varun Kacholia wrote:

>
> > What does explain show for the exists version?
>
> suryadb=# explain select * from dbmedia where EXISTS (select ID from
> wdmedia where word='whatever' AND dbmedia.id=id )  LIMIT 200;
> NOTICE:  QUERY PLAN:
>
> Limit  (cost=0.00..1006732.42 rows=200 width=76)
>   ->  Seq Scan on dbmedia  (cost=0.00..507810931.25 rows=100883 width=76)
>           SubPlan
>                     -> Index Scan using wdkmedia on wdmedia
> (cost=0.00..5033.63 rows=1 width=4)
>
>                     EXPLAIN
>
> still seq scan :((

Yeah...

Do either of these run better?

select dbmedia.* from dbmedia, wdmedia where wdmedia.word='whatever' and
  dbmedia.id=wdmedia.id;

select * from dbmedia, (select ID from wdmedia where word='whatever') AS w
 where w.id=dbmedia.id;


> someone please fix this bug :(
I'd guess it's not that simple.

Re: Highly obscure and erratic

From

Martijn van Oosterhout

Date:

18 June 2002, 21:09:21

On Wed, Jun 19, 2002 at 04:33:10AM +0530, Varun Kacholia wrote:
>
> > What does explain show for the exists version?
>
> suryadb=# explain select * from dbmedia where EXISTS (select ID from
> wdmedia where word='whatever' AND dbmedia.id=id ) LIMIT 200;

how does this differ from:

select *
from dbmedia
where dbmedia.id=wdmedia.id
and wdmedia.word='whatever'
limit 200;

> NOTICE:  QUERY PLAN:
>
> Limit  (cost=0.00..1006732.42 rows=200 width=76)
>   ->  Seq Scan on dbmedia  (cost=0.00..507810931.25 rows=100883 width=76)
>           SubPlan
>                     ->  Index Scan using wdkmedia on wdmedia  (cost=0.00..5033.63 rows=1 width=4)
>
>                     EXPLAIN
>
> still seq scan :((
> someone please fix this bug :(

I think it's called "pilot error". Your query asked to run the subquery for
each row in the outer query, so ofcourse you get a sequential scan. If what
you wanted was an index scan then you should rewrite it as a join (as above)
and use that.

If you can prove that your EXISTS statement is equivalent to the JOIN for
all different types of subqueries, perhaps it can be made automatic.
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.

Re: Highly obscure and erratic

From

Varun Kacholia

Date:

19 June 2002, 05:18:16

> how does this differ from:
>
> select *
> from dbmedia
> where dbmedia.id=wdmedia.id
> and wdmedia.word='whatever'
> limit 200;

hey...
thanks for that..
that was surely better...
but now returning to the main point...
how can i execute the following query in the form above?

select * from dbmedia where id IN ((select ID from wdmedia where word='word1')
INTERSECT (select ID from wdmedia where word='word2') ...so on)


thanks again


> I think it's called "pilot error". Your query asked to run the subquery for
> each row in the outer query, so ofcourse you get a sequential scan. If what
> you wanted was an index scan then you should rewrite it as a join (as above)
> and use that.
>
> If you can prove that your EXISTS statement is equivalent to the JOIN for
> all different types of subqueries, perhaps it can be made automatic.
> --
> Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> > There are 10 kinds of people in the world, those that can do binary
> > arithmetic and those that can't.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly

--
------
Varun
  Best file compression around: "DEL *.*" = 100% compression

Re: Highly obscure and erratic

From

Martijn van Oosterhout

Date:

19 June 2002, 05:29:51

On Wed, Jun 19, 2002 at 02:44:08PM +0530, Varun Kacholia wrote:
> but now returning to the main point...
> how can i execute the following query in the form above?
>
> select * from dbmedia where id IN ((select ID from wdmedia where word='word1')
> INTERSECT (select ID from wdmedia where word='word2') ...so on)

Wow, you seem to have a knack for making queries that are hard to optimise.
Do you need some kind of full-text indexing? There are premade modules that
do this all for you.

select * from dbmedia, wdmedia a, wdmedia b
where id = a.id and a.word='word1'
and   id = b.id and b.word='word2'
etc

Seems odd but it may work. But as a rule, joins are faster than subqueries
and avoid IN, INTERSECT and UNION at all costs. Do you have a book on SQL
around?
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.

Re: Highly obscure and erratic

From

Shaun Thomas

Date:

19 June 2002, 11:01:45

On Wed, 19 Jun 2002, Varun Kacholia wrote:

> 1. SELECT * FROM dbmedia WHERE ID IN ((SELECT id FROM wdmedia WHERE
> word = 'whatever') )  LIMIT 20;

You probably didn't know this, but postgres is notorious for being
*very* bad at optimizing IN queries.  Try this instead:

SELECT * FROM dbmedia d WHERE EXISTS
  (SELECT 1 FROM wdmedia w WHERE w.id = d.id AND word = 'whatever')
 LIMIT 20;

--
+-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+
| Shaun M. Thomas                INN Database Administrator           |
| Phone: (309) 743-0812          Fax  : (309) 743-0830                |
| Email: sthomas@townnews.com    AIM  : trifthen                      |
| Web  : www.townnews.com                                             |
|                                                                     |
|     "Most of our lives are about proving something, either to       |
|      ourselves or to someone else."                                 |
|                                           -- Anonymous              |
+-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+

Re: Highly obscure and erratic

From

Shaun Thomas

Date:

19 June 2002, 11:02:18

On Tue, 18 Jun 2002, Stephan Szabo wrote:

> select dbmedia.* from dbmedia, wdmedia where wdmedia.word='whatever' and
>   dbmedia.id=wdmedia.id;

I think all of us suggesting EXISTS in the case of this query were
complete tards, me included.  Looks like a simple join will suffice
nicely, and now that I look at it, I wonder why you're the first one to
see it.  Good job!

But that'll definitely help.  Varun, if you're listening, use this!

--
+-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+
| Shaun M. Thomas                INN Database Administrator           |
| Phone: (309) 743-0812          Fax  : (309) 743-0830                |
| Email: sthomas@townnews.com    AIM  : trifthen                      |
| Web  : www.townnews.com                                             |
|                                                                     |
|     "Most of our lives are about proving something, either to       |
|      ourselves or to someone else."                                 |
|                                           -- Anonymous              |
+-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+

Re: Highly obscure and erratic

From

Shaun Thomas

Date:

19 June 2002, 11:25:40

On Wed, 19 Jun 2002, Martijn van Oosterhout wrote:

> select * from dbmedia, wdmedia a, wdmedia b
> where id = a.id and a.word='word1'
> and   id = b.id and b.word='word2'
> etc

Ewww, self join.  I'd actually suggest he keep the intersect, but use
the join format for each.  So he'd get this:

SELECT *
  FROM dbmedia d, wdmedia w
 WHERE w.word = 'word1' AND d.id=w.id
INTERSECT
SELECT *
  FROM dbmedia d, wdmedia w
 WHERE w.word = 'word2' AND d.id=w.id
INTERSECT
...

I think you get the picture from this.  If he didn't need INTERSECT to
emulate AND for his word matches, he could just use OR, and do this:

SELECT *
  FROM dbmedia d, wdmedia w
 WHERE w.word IN ('word1', 'word2', 'etc') AND d.id=w.id;

My guess is the OR version is faster, but he wants to restrict the data,
not add to it.  Unfortunate.  ^_^

> and avoid IN, INTERSECT and UNION at all costs.

Actually, if INTERSECT and UNION are done properly (which I'm guessing
they are) at most, the query has to be run once for each distinct query
between the INTERSECT/UNION clauses.  On a fast/small query, an upper
bound of doubling execution time isn't too bad.  Mathematical matrix
intersections and unions are pretty fast/easy, so that doesn't add much
overhead at all.

It's IN you have to avoid.  The way postgres does it is just insane, and
you'll suffer greatly upon using it.

--
+-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+
| Shaun M. Thomas                INN Database Administrator           |
| Phone: (309) 743-0812          Fax  : (309) 743-0830                |
| Email: sthomas@townnews.com    AIM  : trifthen                      |
| Web  : www.townnews.com                                             |
|                                                                     |
|     "Most of our lives are about proving something, either to       |
|      ourselves or to someone else."                                 |
|                                           -- Anonymous              |
+-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+

Re: Highly obscure and erratic

From

Varun Kacholia

Date:

20 June 2002, 19:02:15

hi,

> SELECT *
>   FROM dbmedia d, wdmedia w
>  WHERE w.word = 'word1' AND d.id=w.id
> INTERSECT
> SELECT *
>   FROM dbmedia d, wdmedia w
>  WHERE w.word = 'word2' AND d.id=w.id
> INTERSECT
> ...

 well this does not help...
 it results in a seq scan only.

 suryadb=# explain (select * from dbmedia d,wdmedia w where w.word='whatever' and d.id=w.id) INTERSECT (select * from
dbmediad,wdmedia w where w.word='whatever' and d.id=w.id) ; 
 NOTICE:  QUERY PLAN:

 SetOp Intersect  (cost=25519.49..25674.00 rows=515 width=92)
     ->  Sort  (cost=25519.49..25519.49 rows=5151 width=92)
     ->  Append  (cost=5033.63..25201.94 rows=5151 width=92)
     ->  Subquery Scan *SELECT* 1  (cost=5033.63..12600.97 rows=2575 width=92)
     ->  Hash Join  (cost=5033.63..12600.97 rows=2575 width=92)
     ->  Seq Scan on dbmedia d  (cost=0.00..3499.83 rows=100883 width=76)
         ^^^^^^^^^^
     ->  Hash  (cost=5027.19..5027.19 rows=2575 width=16)
     ->  Index Scan using wdkmedia on wdmedia w  (cost=0.00..5027.19 rows=2575 width=16)
     ->  Subquery Scan *SELECT* 2  (cost=5033.63..12600.97 rows=2575 width=92)
     ->  Hash Join  (cost=5033.63..12600.97 rows=2575 width=92)
     ->  Seq Scan on dbmedia d  (cost=0.00..3499.83 rows=100883 width=76)
         ^^^^^^^^^^^^^^^^^^^^^^
     ->  Hash  (cost=5027.19..5027.19 rows=2575 width=16)
->  Index Scan using wdkmedia on wdmedia w  (cost=0.00..5027.19 rows=2575 width=16)

instead the self-join is better...atleast in the case of postgresql :p
anyways..
thanks for the reply!

> I think you get the picture from this.  If he didn't need INTERSECT to
> emulate AND for his word matches, he could just use OR, and do this:
>
> SELECT *
>   FROM dbmedia d, wdmedia w
>  WHERE w.word IN ('word1', 'word2', 'etc') AND d.id=w.id;
>
> My guess is the OR version is faster, but he wants to restrict the data,
> not add to it.  Unfortunate.  ^_^
>
> > and avoid IN, INTERSECT and UNION at all costs.
>
> Actually, if INTERSECT and UNION are done properly (which I'm guessing
> they are) at most, the query has to be run once for each distinct query
> between the INTERSECT/UNION clauses.  On a fast/small query, an upper
> bound of doubling execution time isn't too bad.  Mathematical matrix
> intersections and unions are pretty fast/easy, so that doesn't add much
> overhead at all.
>
> It's IN you have to avoid.  The way postgres does it is just insane, and
> you'll suffer greatly upon using it.
>
> --
> +-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+
> | Shaun M. Thomas                INN Database Administrator           |
> | Phone: (309) 743-0812          Fax  : (309) 743-0830                |
> | Email: sthomas@townnews.com    AIM  : trifthen                      |
> | Web  : www.townnews.com                                             |
> |                                                                     |
> |     "Most of our lives are about proving something, either to       |
> |      ourselves or to someone else."                                 |
> |                                           -- Anonymous              |
> +-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+
>
>

--
------
Varun
Why do the people who know the least know it the loudest?

Highly obscure and erratic(II)

From

Varun Kacholia

Date:

20 June 2002, 19:26:10

hi ,
> select * from dbmedia, wdmedia a, wdmedia b
> where id = a.id and a.word='word1'
> and   id = b.id and b.word='word2'
> etc

well thanks for that..
but i really cannot understand y pgsql resorts to wierd behaviour..
like for this one :

==========================================================================
suryadb=# explain SELECT * FROM dbmedia d, wdmedia a1,wdmedia a2
          WHERE d.id = a1.id AND a1.word = 'word1' AND
          d.id=a2.id and a2.word='word2'
          and d.host like '144.16%' LIMIT 200;
NOTICE:  QUERY PLAN:
     Limit  (cost=5033.63..10547.45 rows=1 width=108)
     ->  Nested Loop  (cost=5033.63..10547.45 rows=1 width=108)
     ->  Hash Join  (cost=5033.63..10196.02 rows=83 width=32)
     ->  Index Scan using wdkmedia on wdmedia a1  (cost=0.00..5027.19 rows=2575 width=16)
     ->  Hash  (cost=5027.19..5027.19 rows=2575 width=16)
     ->  Index Scan using wdkmedia on wdmedia a2  (cost=0.00..5027.19 rows=2575 width=16)
->  Index Scan using indx2 on dbmedia d  (cost=0.00..4.21 rows=1 width=76)
============================================================================
                      
 clearly an index scan is done but for this...
==========================================================================
 suryadb=# explain SELECT * FROM dbmedia d, wdmedia a1
        WHERE d.id = a1.id AND a1.word = 'word1'
        and d.host like '144.16%' LIMIT 200;
 NOTICE:  QUERY PLAN:

 Limit  (cost=0.00..8811.42 rows=1 width=92)
   ->  Nested Loop  (cost=0.00..8811.42 rows=1 width=92)
           ->  Seq Scan on dbmedia d  (cost=0.00..3752.04 rows=1 width=76)
              ^^^^^^^^^^^^^^^
                   ->  Index Scan using wdkmedia on wdmedia a1  (cost=0.00..5027.19 rows=2575 width=16)
============================================================================

  y is a seq scan done here?
  hmm..perhaps cuz it might be thinking that '144.16%' has a high share?
  well but if the query is executed other way .. ie first Index scan on
  wdkmedia then Index scan on dbmedia is will be "inf" times fasters.
  I want it to first look in wdmedia and  then find the results by index
  scan which are present in dbmedia
  (which initially i was doing using  IN).
  Is there a way by which i can force it not to use seq scan?
  Will be grateful for any help

> Seems odd but it may work. But as a rule, joins are faster than subqueries
> and avoid IN, INTERSECT and UNION at all costs. Do you have a book on SQL
> around?
> --
--
------
Varun
If Bill Gates had a penny for every time Windows crashed......Oh wait, he does.