sub-query optimization - Mailing list pgsql-sql

Hello,

I am hoping someone can help explain why modifying the following query
can effect such a huge change in speed.  The query is:
     select * from articles     where exists      ( select 1 from article_categories        where
article_categories.article_id= articles.id and        article_categories.category_id is null      )
 

The original query was much more complex, but I have trimmed it down to
highlight the problem.  The query above also manifests the problem.  OK,
the above query (with 100,000 records in the articles table) takes 1292
msec (see output below).  If I modify the query slightly:

--------
select 1 from article_categories -->
select 1 from articles, article_categories
---------

the query takes 98 msec.  Now, normally, I would just leave the query at
that and enjoy my newfound speed.  But, adding that extra table to the
inner query only helps when the inner query matches few or no records. 
In my sample dataset, there are no records where category_id is null. 
However, if I modify the inner query such that it matches many records,
adding that extra table to the inner query has the opposite effect.  It
kills the speed.

I am wondering if there is a way to write a query that performs
effeciently, regardless of the number of records the inner query
matches.  Can anyone offer any help?

Thanks!
-Brad

============================================
TABLE STRUCTURE:
---------------
create table articles (       id serial primary key
);
create index articles_publish_time_key on articles (publish_time);

create table categories (id serial primary key
);

create table article_categories ( article_id int not null references articles, category_id int not null references
categories,primary key(article_id, category_id)
 
);
create index article_categories_article_id_key on article_categories
(article_id);
----------------------------------------

EXPLAIN ANALYZE OUTPUT
--------------------------

(FAST QUERY)Result  (cost=0.00..4217.00 rows=100000 width=187) (actual
time=98.00..98.00 rows=0 loops=1)  One-Time Filter: $0  InitPlan    ->  Nested Loop  (cost=0.00..2984.03 rows=1
width=8)(actual
 
time=97.98..97.98 rows=0 loops=1)          ->  Seq Scan on article_categories  (cost=0.00..2981.00
rows=1 width=4) (actual time=97.98..97.98 rows=0 loops=1)                Filter: (category_id IS NULL)          ->
IndexScan using articles_pkey on articles 
 
(cost=0.00..3.01 rows=1 width=4) (never executed)                Index Cond: ("outer".article_id = articles.id)  ->
SeqScan on articles  (cost=0.00..4217.00 rows=100000 width=187)
 
(never executed)Total runtime: 98.24 msec
(10 rows)

------------------------------------------------
SLOW QUERY:
-------------Seq Scan on articles  (cost=0.00..306827.16 rows=50000 width=187)
(actual time=1292.48..1292.48 rows=0 loops=1)  Filter: (subplan)  SubPlan    ->  Index Scan using
article_categories_article_id_keyon
 
article_categories  (cost=0.00..3.03 rows=1 width=0) (actual
time=0.01..0.01 rows=0 loops=100000)          Index Cond: (article_id = $0)          Filter: (category_id IS NULL)Total
runtime:1292.68 msec
 
(7 rows)




pgsql-sql by date:

Previous
From: Mintoo Lall
Date:
Subject: Re: Drop temporary table only if it exists
Next
From: Brad Hilton
Date:
Subject: Re: sub-query optimization