Thread: sub-query optimization

sub-query optimization

From

Brad Hilton

Date:

14 February 2003, 13:21:03

Hello,

I am hoping someone can help explain why modifying the following query
can effect such a huge change in speed.  The query is:
     select * from articles     where exists      ( select 1 from article_categories        where
article_categories.article_id= articles.id and        article_categories.category_id is null      )
 

The original query was much more complex, but I have trimmed it down to
highlight the problem.  The query above also manifests the problem.  OK,
the above query (with 100,000 records in the articles table) takes 1292
msec (see output below).  If I modify the query slightly:

--------
select 1 from article_categories -->
select 1 from articles, article_categories
---------

the query takes 98 msec.  Now, normally, I would just leave the query at
that and enjoy my newfound speed.  But, adding that extra table to the
inner query only helps when the inner query matches few or no records. 
In my sample dataset, there are no records where category_id is null. 
However, if I modify the inner query such that it matches many records,
adding that extra table to the inner query has the opposite effect.  It
kills the speed.

I am wondering if there is a way to write a query that performs
effeciently, regardless of the number of records the inner query
matches.  Can anyone offer any help?

Thanks!
-Brad

============================================
TABLE STRUCTURE:
---------------
create table articles (       id serial primary key
);
create index articles_publish_time_key on articles (publish_time);

create table categories (id serial primary key
);

create table article_categories ( article_id int not null references articles, category_id int not null references
categories,primary key(article_id, category_id)
 
);
create index article_categories_article_id_key on article_categories
(article_id);
----------------------------------------

EXPLAIN ANALYZE OUTPUT
--------------------------

(FAST QUERY)Result  (cost=0.00..4217.00 rows=100000 width=187) (actual
time=98.00..98.00 rows=0 loops=1)  One-Time Filter: $0  InitPlan    ->  Nested Loop  (cost=0.00..2984.03 rows=1
width=8)(actual
 
time=97.98..97.98 rows=0 loops=1)          ->  Seq Scan on article_categories  (cost=0.00..2981.00
rows=1 width=4) (actual time=97.98..97.98 rows=0 loops=1)                Filter: (category_id IS NULL)          ->
IndexScan using articles_pkey on articles 
 
(cost=0.00..3.01 rows=1 width=4) (never executed)                Index Cond: ("outer".article_id = articles.id)  ->
SeqScan on articles  (cost=0.00..4217.00 rows=100000 width=187)
 
(never executed)Total runtime: 98.24 msec
(10 rows)

------------------------------------------------
SLOW QUERY:
-------------Seq Scan on articles  (cost=0.00..306827.16 rows=50000 width=187)
(actual time=1292.48..1292.48 rows=0 loops=1)  Filter: (subplan)  SubPlan    ->  Index Scan using
article_categories_article_id_keyon
 
article_categories  (cost=0.00..3.03 rows=1 width=0) (actual
time=0.01..0.01 rows=0 loops=100000)          Index Cond: (article_id = $0)          Filter: (category_id IS NULL)Total
runtime:1292.68 msec
 
(7 rows)

Re: sub-query optimization

From

Brad Hilton

Date:

14 February 2003, 13:37:34

On Fri, 2003-02-14 at 10:22, Brad Hilton wrote:
> Hello,
> 
> I am hoping someone can help explain why modifying the following query
> can effect such a huge change in speed.  The query is:
> 
>       select * from articles
>       where exists
>        ( select 1 from article_categories
>          where
>          article_categories.article_id = articles.id and
>          article_categories.category_id is null
>        )

To add one more detail:  in simplifying my query for the list, I should
have said:
article_categories.category_id = 0

instead of
article_categories.category_id is NULL

Then, with an index on article_categories (category_id) you get the
following results for the two queries:

Without adding the "articles" table to the inner query: 1329 msec
With the "articles" table in the inner query: 0.28 msec!

That highlights the difference a bit more dramatically.

Any ideas?

Thanks,
-Brad

Re: sub-query optimization

From

"Tomasz Myrta"

Date:

14 February 2003, 14:33:08

> Hello,
> 
> I am hoping someone can help explain why modifying the following 
> query can effect such a huge change in speed.  The query is:
> 
>       select * from articles
>       where exists
>        ( select 1 from article_categories
>          where
>          article_categories.article_id = articles.id and
>          article_categories.category_id is null
>        )
> 

Can you test these two queries?select * from (select article_id from article_categories where category_id is null
groupby article_id) X   join articles using (article_id);
 
 select <fields> from  article_categories  join articles using (article_id) where category_id is null  group by
<fields>
Above queries will need index on article_idI'm not sure if it helps, but they are the only solutions in mymind ;-)
Can you say anything about data statistics in your tables? Howmany rows are with category_id=null?
I looked into query definition once again. Your query doesn't makesense - article_categories have not null
category_id...What do you reallywant to do?
 

Regards,
Tomasz Myrta

Re: sub-query optimization

From

Brad Hilton

Date:

14 February 2003, 14:41:10

On Fri, 2003-02-14 at 11:21, jasiek@serwer.skawsoft.com.pl wrote:

> Can you test these two queries?

Thanks, I'll test them shortly.  I wanted to answer your other
questions, first:

>  Can you say anything about data statistics in your tables? How
>  many rows are with category_id=null?
> 
>  I looked into query definition once again. Your query doesn't make
>  sense - article_categories have not null category_id... What do you really
>  want to do?

Sorry to cause confusion.  My original query and db format were fairly
complex so I didn't want to distract from my problem.  My actual query
looks like:

select * from articles where exists (select 1 from article_categories, categories, category_map  where
article_categories.article_id= articles.id and  categories.restrict_views = FALSE and  article_categories.category_id =
categories.idand  category_map.parent_id = 1 and  category_map.child_id = categories.id and  category_map.child_id =
article_categories.category_idand  articles.post_status = 'publish' )and

post_status = 'publish'

------------------
The problem is that sometimes there are no categories with
"restrict_views = FALSE" and the query takes a *long* time: 23 seconds. 
However, if I simply add the 'articles' table to the inner query it
takes 0.23 msec.

*But*, sometimes there are many categories where "restrict_views =
FALSE", and in such a case adding the 'articles' table to the inner
query actually hurts performance quite a bit.

Does that help at all?

Thanks,
-Brad

Re: sub-query optimization

From

Stephan Szabo

Date:

14 February 2003, 15:04:49

On 14 Feb 2003, Brad Hilton wrote:

> I am hoping someone can help explain why modifying the following query
> can effect such a huge change in speed.  The query is:
>
>       select * from articles
>       where exists
>        ( select 1 from article_categories
>          where
>          article_categories.article_id = articles.id and
>          article_categories.category_id is null
>        )
>
> The original query was much more complex, but I have trimmed it down to
> highlight the problem.  The query above also manifests the problem.  OK,
> the above query (with 100,000 records in the articles table) takes 1292
> msec (see output below).  If I modify the query slightly:
>
> --------
> select 1 from article_categories
>   -->
> select 1 from articles, article_categories
> ---------

After putting the latter in the subselect do you actually have the same
query?  In one case articles is an outer reference for the particular
row.  In the other it's a reference to the copy of articles in the
subselect.  Wouldn't that give the wrong results when you have any matches
(since there'd exist a row from the subselect even if it wasn't the one
matching the outer query)?

Re: sub-query optimization

From

"Tomasz Myrta"

Date:

14 February 2003, 15:42:24

Brad Hilton wrote:
<cut>
> select * from articles where exists
>   (select 1 from article_categories, categories, category_map
>    where
>    article_categories.article_id = articles.id and
>    categories.restrict_views = FALSE and
>    article_categories.category_id = categories.id and
>    category_map.parent_id = 1 and
>    category_map.child_id = categories.id and
>    category_map.child_id = article_categories.category_id and
>    articles.post_status = 'publish'
>   )
>  and
> post_status = 'publish'

According to your table definition I can say, that you don't need subselect 
and exists, because 1 row from article and 1 row from categories have only 1 
hit row in articles_categories (primary key), so you can rewrite your query 
as simple joins:
(Query is only a hint, it probably won't work)

select a.*
from categories_c cross join category_map m   join articles a on (child_id=category_id) join articles_categories ac
using(article_id,category_id)
 
where m.parent_id=1 and not c.restrict_views; and a.post_status='publish' 

You can change join order depending on your table stats.

Regards,
Tomasz Myrta

Re: sub-query optimization

From

Tom Lane

Date:

14 February 2003, 17:08:27

Brad Hilton <bhilton@vpop.net> writes:
> I am hoping someone can help explain why modifying the following query
> can effect such a huge change in speed.  The query is:

>       select * from articles
>       where exists
>        ( select 1 from article_categories
>          where
>          article_categories.article_id = articles.id and
>          article_categories.category_id is null
>        )

> ... If I modify the query slightly:

> --------
> select 1 from article_categories
>   -->
> select 1 from articles, article_categories
> ---------

> the query takes 98 msec.

Yeah, because then the sub-query is a constant (it doesn't depend on the
current outer row at all) and so it is only evaluated once, not once per
outer row.  Unfortunately, that approach probably gives the wrong
answers...
        regards, tom lane

Re: sub-query optimization

From

Brad Hilton

Date:

14 February 2003, 17:25:50

On Fri, 2003-02-14 at 14:08, Tom Lane wrote:
> Brad Hilton <bhilton@vpop.net> writes:
> > ... If I modify the query slightly:
> 
> > --------
> > select 1 from article_categories
> >   -->
> > select 1 from articles, article_categories
> > ---------
> 
> > the query takes 98 msec.
> 
> Yeah, because then the sub-query is a constant (it doesn't depend on the
> current outer row at all) and so it is only evaluated once, not once per
> outer row.  Unfortunately, that approach probably gives the wrong
> answers...

Ah, that makes sense.  But does it surprise you that when I manipulate
the dataset such that the inner query matches 0 records, the total query
takes so much longer?

Unfortunately, after following the suggestions of several kind posters,
the resulting queries are pretty slow compared to my example which used
'exists.'

The fact that the query takes so long in certain dataset conditions is
surprising me.  Watch the following results:


psql> update categories set restrict_views = FALSE;

explain analyze select * from articles where exists (select 1 from article_categories, categories where
article_categories.article_id= articles.id and categories.restrict_views = FALSE and article_categories.category_id =
categories.id) and post_status = 'publish' order by publish_time desc limit 10;Total runtime: 0.69 msec
 

psql> update categories set restrict_views = TRUE;
explain analyze select * from articles where exists (select 1 from article_categories, categories where
article_categories.article_id= articles.id and categories.restrict_views = FALSE and article_categories.category_id =
categories.id) and post_status = 'publish' order by publish_time desc limit 10;
 
Total runtime: 27490.84 msec

Is that a surprising result?  I would think that the second time things
would be faster because there are no matches to the inner query.  In
fact, if I execute the inner query by itself, minus the reference to the
articles table, it executes lightning fast. (0.07 msec)

-Brad

Re: sub-query optimization

From

Brad Hilton

Date:

14 February 2003, 17:46:29

On Fri, 2003-02-14 at 04:59, Tomasz Myrta wrote:
> Brad Hilton wrote:
> <cut>
> > select * from articles where exists
> >   (select 1 from article_categories, categories, category_map
> >    where
> >    article_categories.article_id = articles.id and
> >    categories.restrict_views = FALSE and
> >    article_categories.category_id = categories.id and
> >    category_map.parent_id = 1 and
> >    category_map.child_id = categories.id and
> >    category_map.child_id = article_categories.category_id
> >   )
> >  and
> > post_status = 'publish'
> 
> According to your table definition I can say, that you don't need subselect 
> and exists, because 1 row from article and 1 row from categories have only 1 
> hit row in articles_categories (primary key),

I don't think the article_categories primary key can be used in my query
since I'm also joining against category_map.  Articles can live in
multiple categories.  What my query is attempting is (in english terms):

select all articles that live in non-restricted categories at or below a
top-level category (id=1 in this case).

If I just utilize article_categories primary key, I could end up with
duplicate articles since articles can live in multiple categories.

>  so you can rewrite your query 
> as simple joins:
> (Query is only a hint, it probably won't work)
> 
> select a.*
> from
>   categories_c cross join category_map m  
>   join articles a on (child_id=category_id)
>   join articles_categories ac using (article_id,category_id)
> where
>   m.parent_id=1 and not c.restrict_views;
>   and a.post_status='publish'
>   

In case I'm not understanding your suggestiong perfectly, I tried to
flesh it out a bit more.  Does the following query match your
suggestion?

select a.*
fromcategories c cross join category_map mjoin article_categories ac on (c.id = ac.category_id and m.child_id =
ac.category_id)join articles a on (a.id = ac.article_id)
wherem.parent_id=1 andnot c.restrict_views andm.child_id = c.id anda.post_status='publish'

Unfortunately, this query returns duplicate articles (see explanation
above), and is fairly slow.  Maybe I didn't follow your initial query
properly.

-Brad

Re: sub-query optimization

From

"Tomasz Myrta"

Date:

15 February 2003, 04:00:55

On Fri, Feb 14, 2003 at 02:39:31PM -0800, Brad Hilton wrote:
> If I just utilize article_categories primary key, I could end up with
> duplicate articles since articles can live in multiple categories.
You can use group by to eliminate duplicates.

> In case I'm not understanding your suggestiong perfectly, I tried to
> flesh it out a bit more.  Does the following query match your
> suggestion?
It looks ok now. Probably it needs some cosmetics changes.
>
> select a.*
> from
>  categories c cross join category_map m
>  join article_categories ac on (c.id = ac.category_id and m.child_id =
> ac.category_id)
>  join articles a on (a.id = ac.article_id)
> where
>  m.parent_id=1 and
>  not c.restrict_views and
>  m.child_id = c.id and
>  a.post_status='publish'
>
> Unfortunately, this query returns duplicate articles (see explanation
> above), and is fairly slow.  Maybe I didn't follow your initial query
> properly.
Can you send explain analyze this query? Maybe table
joins should be reordered or they need other indexes they have?

Tomasz

Re: sub-query optimization

From

Brad Hilton

Date:

17 February 2003, 12:18:29

On Fri, 2003-02-14 at 17:52, Tomasz Myrta wrote:
> It looks ok now. Probably it needs some cosmetics changes.
> >
> > select a.*
> > from
> >  categories c cross join category_map m
> >  join article_categories ac on (c.id = ac.category_id and m.child_id =
> > ac.category_id)
> >  join articles a on (a.id = ac.article_id)
> > where
> >  m.parent_id=1 and
> >  not c.restrict_views and
> >  m.child_id = c.id and
> >  a.post_status='publish'
> >
> Can you send explain analyze this query? Maybe table
> joins should be reordered or they need other indexes they have?
> 

Sure, I appreciate your interest and help.  I modified the above query
to do a "select distinct a.*" since I need to fetch all fields and need
them grouped by id.  I'm also including a query which uses derived
tables and avoids the penalty of "distinct a.*."  The second query is
much faster because of this, but it's still slower than I was hoping.  
~3 seconds for a query isn't going to fly. :(  If you can see any
warning signs from the "explain" output, I'd love to hear from you.

Thanks,
-Brad

----------------------------------
explain analyze select distinct a.*
fromcategories c cross join category_map mjoin article_categories ac on (c.id = ac.category_id and m.child_id =
ac.category_id)join articles a on (a.id = ac.article_id)
wherem.parent_id=1 andnot c.restrict_views andm.child_id = c.id anda.post_status='publish'

---------------------------------
Unique  (cost=61058.89..68058.89 rows=20000 width=203) (actual
time=7649.35..8465.96 rows=100000 loops=1)  ->  Sort  (cost=61058.89..61558.89 rows=200000 width=203) (actual
time=7649.35..7898.56 rows=200000 loops=1)        Sort Key: a.id, a.user_id, a.blog_id, a.remote_ip,
a.create_time, a.publish_time, a.update_time, a.title, a.body,
a.long_body, a.excerpt, a.post_status, a.publish_date        ->  Hash Join  (cost=5133.25..17614.25 rows=200000
width=203)
(actual time=590.36..6029.65 rows=200000 loops=1)              Hash Cond: ("outer".article_id = "inner".id)
->  Hash Join  (cost=18.33..6499.33 rows=200000 width=16)
 
(actual time=2.08..801.69 rows=200000 loops=1)                    Hash Cond: ("outer".category_id = "inner".id)
          ->  Seq Scan on article_categories ac 
 
(cost=0.00..2981.00 rows=200000 width=8) (actual time=0.01..293.28
rows=200000 loops=1)                    ->  Hash  (cost=18.01..18.01 rows=131 width=8)
(actual time=1.72..1.72 rows=0 loops=1)                          ->  Hash Join  (cost=6.64..18.01 rows=131
width=8) (actual time=0.84..1.58 rows=131 loops=1)                                Hash Cond: ("outer".child_id =
"inner".id)                                ->  Seq Scan on category_map m 
(cost=0.00..9.07 rows=131 width=4) (actual time=0.02..0.43 rows=131
loops=1)                                      Filter: (parent_id = 1)                                ->  Hash
(cost=6.31..6.31rows=131
 
width=4) (actual time=0.40..0.40 rows=0 loops=1)                                      ->  Seq Scan on categories c 
(cost=0.00..6.31 rows=131 width=4) (actual time=0.02..0.24 rows=131
loops=1)                                            Filter: (NOT
restrict_views)              ->  Hash  (cost=2885.00..2885.00 rows=100000 width=187)
(actual time=588.13..588.13 rows=0 loops=1)                    ->  Seq Scan on articles a  (cost=0.00..2885.00
rows=100000 width=187) (actual time=0.03..429.67 rows=100000 loops=1)                          Filter: (post_status =
'publish'::character
varying)Total runtime: 18544.75 msec

--------------------------------
explain analyze select articles.* from(select article_id from article_categories ac, categories c, category_map cm
whereac.category_id = c.id and ac.category_id = cm.child_id and c.id = cm.child_id and c.restrict_views = FALSE and
cm.parent_id= 1group by article_id) Xjoin articles on (articles.id = X.article_id) where articles.post_status =
'publish'

--------------------------------Merge Join  (cost=26538.07..30754.75 rows=20000 width=191) (actual
time=1736.36..3079.64 rows=100000 loops=1)  Merge Cond: ("outer".id = "inner".article_id)  ->  Index Scan using
articles_pkeyon articles  (cost=0.00..3616.68
 
rows=100000 width=187) (actual time=0.06..674.13 rows=100000 loops=1)        Filter: (post_status =
'publish'::charactervarying)  ->  Sort  (cost=26538.07..26588.07 rows=20000 width=16) (actual
 
time=1736.27..1800.42 rows=100000 loops=1)        Sort Key: x.article_id        ->  Subquery Scan x
(cost=24109.30..25109.30rows=20000
 
width=16) (actual time=1073.54..1594.71 rows=100000 loops=1)              ->  Group  (cost=24109.30..25109.30
rows=20000width=16)
 
(actual time=1073.54..1447.94 rows=100000 loops=1)                    ->  Sort  (cost=24109.30..24609.30 rows=200000
width=16) (actual time=1073.52..1191.72 rows=200000 loops=1)                          Sort Key: ac.article_id
              ->  Hash Join  (cost=18.66..6499.66
 
rows=200000 width=16) (actual time=2.14..777.38 rows=200000 loops=1)                                Hash Cond:
("outer".category_id=
 
"inner".id)                                ->  Seq Scan on article_categories ac 
(cost=0.00..2981.00 rows=200000 width=8) (actual time=0.01..304.06
rows=200000 loops=1)                                ->  Hash  (cost=18.33..18.33 rows=131
width=8) (actual time=1.80..1.80 rows=0 loops=1)                                      ->  Hash Join  (cost=6.97..18.33
rows=131 width=8) (actual time=0.78..1.65 rows=131 loops=1)                                            Hash Cond:
("outer".child_id = "inner".id)                                            ->  Seq Scan on
category_map cm  (cost=0.00..9.07 rows=131 width=4) (actual
time=0.01..0.54 rows=131 loops=1)                                                  Filter: (parent_id =
1)                                            ->  Hash  (cost=6.64..6.64
rows=131 width=4) (actual time=0.39..0.39 rows=0 loops=1)                                                  ->  Seq Scan
on
categories c  (cost=0.00..6.64 rows=131 width=4) (actual time=0.04..0.25
rows=131 loops=1)                                                        Filter:
(restrict_views = false)Total runtime: 3221.05 msec

Re: sub-query optimization

From

Tomasz Myrta

Date:

17 February 2003, 13:12:22

Brad Hilton wrote:

> --------------------------------
>  Merge Join  (cost=26538.07..30754.75 rows=20000 width=191) (actual
> time=1736.36..3079.64 rows=100000 loops=1)

It looks like you didn't vacuum analyze database before explain analyze. 
Can you do this and send explain analyze once again?
Look at the "actual time" showing which operation takes the most time.

Regards,
Tomasz Myrta

Re: sub-query optimization

From

jasiek@serwer.skawsoft.com.pl

Date:

17 February 2003, 13:34:53

On Fri, Feb 14, 2003 at 10:22:19AM -0800, Brad Hilton wrote:
> Hello,
>
> I am hoping someone can help explain why modifying the following query
> can effect such a huge change in speed.  The query is:
>
>       select * from articles
>       where exists
>        ( select 1 from article_categories
>          where
>          article_categories.article_id = articles.id and
>          article_categories.category_id is null
>        )

Can you test these two queries?select * from (select article_id from article_categories where category_id is null
groupby article_id) X   join articles using (article_id);

 select <fields> from  article_categories  join articles using (article_id) where category_id is null  group by
<fields>
Above queries will need index on article_idI'm not sure if it helps, but they are the only solutions in mymind ;-)
Can you say anything about data statistics in your tables? Howmany rows are with category_id=null?
I looked into query definition once again. Your query doesn't makesense - article_categories have not null
category_id...What do you reallywant to do?

Regards,
Tomasz Myrta

Re: sub-query optimization

From

jasiek@serwer.skawsoft.com.pl

Date:

17 February 2003, 13:35:16

On Fri, Feb 14, 2003 at 02:39:31PM -0800, Brad Hilton wrote:
> If I just utilize article_categories primary key, I could end up with
> duplicate articles since articles can live in multiple categories.

> In case I'm not understanding your suggestiong perfectly, I tried to
> flesh it out a bit more.  Does the following query match your
> suggestion?
It looks ok now. Probably it needs some cosmetics changes. 
> 
> select a.*
> from
>  categories c cross join category_map m
>  join article_categories ac on (c.id = ac.category_id and m.child_id =
> ac.category_id)
>  join articles a on (a.id = ac.article_id)
> where
>  m.parent_id=1 and
>  not c.restrict_views and
>  m.child_id = c.id and
>  a.post_status='publish'
> 
> Unfortunately, this query returns duplicate articles (see explanation
> above), and is fairly slow.  Maybe I didn't follow your initial query
> properly.
What about adding "group by a.field1,a.field2..."? It will eliminate
duplicates.
Can you send explain analyze this query? Maybe table
joins should be reordered or they need other indexes they have?


Tomasz