Thread: sub-query optimization
Hello, I am hoping someone can help explain why modifying the following query can effect such a huge change in speed. The query is: select * from articles where exists ( select 1 from article_categories where article_categories.article_id= articles.id and article_categories.category_id is null ) The original query was much more complex, but I have trimmed it down to highlight the problem. The query above also manifests the problem. OK, the above query (with 100,000 records in the articles table) takes 1292 msec (see output below). If I modify the query slightly: -------- select 1 from article_categories --> select 1 from articles, article_categories --------- the query takes 98 msec. Now, normally, I would just leave the query at that and enjoy my newfound speed. But, adding that extra table to the inner query only helps when the inner query matches few or no records. In my sample dataset, there are no records where category_id is null. However, if I modify the inner query such that it matches many records, adding that extra table to the inner query has the opposite effect. It kills the speed. I am wondering if there is a way to write a query that performs effeciently, regardless of the number of records the inner query matches. Can anyone offer any help? Thanks! -Brad ============================================ TABLE STRUCTURE: --------------- create table articles ( id serial primary key ); create index articles_publish_time_key on articles (publish_time); create table categories (id serial primary key ); create table article_categories ( article_id int not null references articles, category_id int not null references categories,primary key(article_id, category_id) ); create index article_categories_article_id_key on article_categories (article_id); ---------------------------------------- EXPLAIN ANALYZE OUTPUT -------------------------- (FAST QUERY)Result (cost=0.00..4217.00 rows=100000 width=187) (actual time=98.00..98.00 rows=0 loops=1) One-Time Filter: $0 InitPlan -> Nested Loop (cost=0.00..2984.03 rows=1 width=8)(actual time=97.98..97.98 rows=0 loops=1) -> Seq Scan on article_categories (cost=0.00..2981.00 rows=1 width=4) (actual time=97.98..97.98 rows=0 loops=1) Filter: (category_id IS NULL) -> IndexScan using articles_pkey on articles (cost=0.00..3.01 rows=1 width=4) (never executed) Index Cond: ("outer".article_id = articles.id) -> SeqScan on articles (cost=0.00..4217.00 rows=100000 width=187) (never executed)Total runtime: 98.24 msec (10 rows) ------------------------------------------------ SLOW QUERY: -------------Seq Scan on articles (cost=0.00..306827.16 rows=50000 width=187) (actual time=1292.48..1292.48 rows=0 loops=1) Filter: (subplan) SubPlan -> Index Scan using article_categories_article_id_keyon article_categories (cost=0.00..3.03 rows=1 width=0) (actual time=0.01..0.01 rows=0 loops=100000) Index Cond: (article_id = $0) Filter: (category_id IS NULL)Total runtime:1292.68 msec (7 rows)
On Fri, 2003-02-14 at 10:22, Brad Hilton wrote: > Hello, > > I am hoping someone can help explain why modifying the following query > can effect such a huge change in speed. The query is: > > select * from articles > where exists > ( select 1 from article_categories > where > article_categories.article_id = articles.id and > article_categories.category_id is null > ) To add one more detail: in simplifying my query for the list, I should have said: article_categories.category_id = 0 instead of article_categories.category_id is NULL Then, with an index on article_categories (category_id) you get the following results for the two queries: Without adding the "articles" table to the inner query: 1329 msec With the "articles" table in the inner query: 0.28 msec! That highlights the difference a bit more dramatically. Any ideas? Thanks, -Brad
> Hello, > > I am hoping someone can help explain why modifying the following > query can effect such a huge change in speed. The query is: > > select * from articles > where exists > ( select 1 from article_categories > where > article_categories.article_id = articles.id and > article_categories.category_id is null > ) > Can you test these two queries?select * from (select article_id from article_categories where category_id is null groupby article_id) X join articles using (article_id); select <fields> from article_categories join articles using (article_id) where category_id is null group by <fields> Above queries will need index on article_idI'm not sure if it helps, but they are the only solutions in mymind ;-) Can you say anything about data statistics in your tables? Howmany rows are with category_id=null? I looked into query definition once again. Your query doesn't makesense - article_categories have not null category_id...What do you reallywant to do? Regards, Tomasz Myrta
On Fri, 2003-02-14 at 11:21, jasiek@serwer.skawsoft.com.pl wrote: > Can you test these two queries? Thanks, I'll test them shortly. I wanted to answer your other questions, first: > Can you say anything about data statistics in your tables? How > many rows are with category_id=null? > > I looked into query definition once again. Your query doesn't make > sense - article_categories have not null category_id... What do you really > want to do? Sorry to cause confusion. My original query and db format were fairly complex so I didn't want to distract from my problem. My actual query looks like: select * from articles where exists (select 1 from article_categories, categories, category_map where article_categories.article_id= articles.id and categories.restrict_views = FALSE and article_categories.category_id = categories.idand category_map.parent_id = 1 and category_map.child_id = categories.id and category_map.child_id = article_categories.category_idand articles.post_status = 'publish' )and post_status = 'publish' ------------------ The problem is that sometimes there are no categories with "restrict_views = FALSE" and the query takes a *long* time: 23 seconds. However, if I simply add the 'articles' table to the inner query it takes 0.23 msec. *But*, sometimes there are many categories where "restrict_views = FALSE", and in such a case adding the 'articles' table to the inner query actually hurts performance quite a bit. Does that help at all? Thanks, -Brad
On 14 Feb 2003, Brad Hilton wrote: > I am hoping someone can help explain why modifying the following query > can effect such a huge change in speed. The query is: > > select * from articles > where exists > ( select 1 from article_categories > where > article_categories.article_id = articles.id and > article_categories.category_id is null > ) > > The original query was much more complex, but I have trimmed it down to > highlight the problem. The query above also manifests the problem. OK, > the above query (with 100,000 records in the articles table) takes 1292 > msec (see output below). If I modify the query slightly: > > -------- > select 1 from article_categories > --> > select 1 from articles, article_categories > --------- After putting the latter in the subselect do you actually have the same query? In one case articles is an outer reference for the particular row. In the other it's a reference to the copy of articles in the subselect. Wouldn't that give the wrong results when you have any matches (since there'd exist a row from the subselect even if it wasn't the one matching the outer query)?
Brad Hilton wrote: <cut> > select * from articles where exists > (select 1 from article_categories, categories, category_map > where > article_categories.article_id = articles.id and > categories.restrict_views = FALSE and > article_categories.category_id = categories.id and > category_map.parent_id = 1 and > category_map.child_id = categories.id and > category_map.child_id = article_categories.category_id and > articles.post_status = 'publish' > ) > and > post_status = 'publish' According to your table definition I can say, that you don't need subselect and exists, because 1 row from article and 1 row from categories have only 1 hit row in articles_categories (primary key), so you can rewrite your query as simple joins: (Query is only a hint, it probably won't work) select a.* from categories_c cross join category_map m join articles a on (child_id=category_id) join articles_categories ac using(article_id,category_id) where m.parent_id=1 and not c.restrict_views; and a.post_status='publish' You can change join order depending on your table stats. Regards, Tomasz Myrta
Brad Hilton <bhilton@vpop.net> writes: > I am hoping someone can help explain why modifying the following query > can effect such a huge change in speed. The query is: > select * from articles > where exists > ( select 1 from article_categories > where > article_categories.article_id = articles.id and > article_categories.category_id is null > ) > ... If I modify the query slightly: > -------- > select 1 from article_categories > --> > select 1 from articles, article_categories > --------- > the query takes 98 msec. Yeah, because then the sub-query is a constant (it doesn't depend on the current outer row at all) and so it is only evaluated once, not once per outer row. Unfortunately, that approach probably gives the wrong answers... regards, tom lane
On Fri, 2003-02-14 at 14:08, Tom Lane wrote: > Brad Hilton <bhilton@vpop.net> writes: > > ... If I modify the query slightly: > > > -------- > > select 1 from article_categories > > --> > > select 1 from articles, article_categories > > --------- > > > the query takes 98 msec. > > Yeah, because then the sub-query is a constant (it doesn't depend on the > current outer row at all) and so it is only evaluated once, not once per > outer row. Unfortunately, that approach probably gives the wrong > answers... Ah, that makes sense. But does it surprise you that when I manipulate the dataset such that the inner query matches 0 records, the total query takes so much longer? Unfortunately, after following the suggestions of several kind posters, the resulting queries are pretty slow compared to my example which used 'exists.' The fact that the query takes so long in certain dataset conditions is surprising me. Watch the following results: psql> update categories set restrict_views = FALSE; explain analyze select * from articles where exists (select 1 from article_categories, categories where article_categories.article_id= articles.id and categories.restrict_views = FALSE and article_categories.category_id = categories.id) and post_status = 'publish' order by publish_time desc limit 10;Total runtime: 0.69 msec psql> update categories set restrict_views = TRUE; explain analyze select * from articles where exists (select 1 from article_categories, categories where article_categories.article_id= articles.id and categories.restrict_views = FALSE and article_categories.category_id = categories.id) and post_status = 'publish' order by publish_time desc limit 10; Total runtime: 27490.84 msec Is that a surprising result? I would think that the second time things would be faster because there are no matches to the inner query. In fact, if I execute the inner query by itself, minus the reference to the articles table, it executes lightning fast. (0.07 msec) -Brad
On Fri, 2003-02-14 at 04:59, Tomasz Myrta wrote: > Brad Hilton wrote: > <cut> > > select * from articles where exists > > (select 1 from article_categories, categories, category_map > > where > > article_categories.article_id = articles.id and > > categories.restrict_views = FALSE and > > article_categories.category_id = categories.id and > > category_map.parent_id = 1 and > > category_map.child_id = categories.id and > > category_map.child_id = article_categories.category_id > > ) > > and > > post_status = 'publish' > > According to your table definition I can say, that you don't need subselect > and exists, because 1 row from article and 1 row from categories have only 1 > hit row in articles_categories (primary key), I don't think the article_categories primary key can be used in my query since I'm also joining against category_map. Articles can live in multiple categories. What my query is attempting is (in english terms): select all articles that live in non-restricted categories at or below a top-level category (id=1 in this case). If I just utilize article_categories primary key, I could end up with duplicate articles since articles can live in multiple categories. > so you can rewrite your query > as simple joins: > (Query is only a hint, it probably won't work) > > select a.* > from > categories_c cross join category_map m > join articles a on (child_id=category_id) > join articles_categories ac using (article_id,category_id) > where > m.parent_id=1 and not c.restrict_views; > and a.post_status='publish' > In case I'm not understanding your suggestiong perfectly, I tried to flesh it out a bit more. Does the following query match your suggestion? select a.* fromcategories c cross join category_map mjoin article_categories ac on (c.id = ac.category_id and m.child_id = ac.category_id)join articles a on (a.id = ac.article_id) wherem.parent_id=1 andnot c.restrict_views andm.child_id = c.id anda.post_status='publish' Unfortunately, this query returns duplicate articles (see explanation above), and is fairly slow. Maybe I didn't follow your initial query properly. -Brad
On Fri, Feb 14, 2003 at 02:39:31PM -0800, Brad Hilton wrote: > If I just utilize article_categories primary key, I could end up with > duplicate articles since articles can live in multiple categories. You can use group by to eliminate duplicates. > In case I'm not understanding your suggestiong perfectly, I tried to > flesh it out a bit more. Does the following query match your > suggestion? It looks ok now. Probably it needs some cosmetics changes. > > select a.* > from > categories c cross join category_map m > join article_categories ac on (c.id = ac.category_id and m.child_id = > ac.category_id) > join articles a on (a.id = ac.article_id) > where > m.parent_id=1 and > not c.restrict_views and > m.child_id = c.id and > a.post_status='publish' > > Unfortunately, this query returns duplicate articles (see explanation > above), and is fairly slow. Maybe I didn't follow your initial query > properly. Can you send explain analyze this query? Maybe table joins should be reordered or they need other indexes they have? Tomasz
On Fri, 2003-02-14 at 17:52, Tomasz Myrta wrote: > It looks ok now. Probably it needs some cosmetics changes. > > > > select a.* > > from > > categories c cross join category_map m > > join article_categories ac on (c.id = ac.category_id and m.child_id = > > ac.category_id) > > join articles a on (a.id = ac.article_id) > > where > > m.parent_id=1 and > > not c.restrict_views and > > m.child_id = c.id and > > a.post_status='publish' > > > Can you send explain analyze this query? Maybe table > joins should be reordered or they need other indexes they have? > Sure, I appreciate your interest and help. I modified the above query to do a "select distinct a.*" since I need to fetch all fields and need them grouped by id. I'm also including a query which uses derived tables and avoids the penalty of "distinct a.*." The second query is much faster because of this, but it's still slower than I was hoping. ~3 seconds for a query isn't going to fly. :( If you can see any warning signs from the "explain" output, I'd love to hear from you. Thanks, -Brad ---------------------------------- explain analyze select distinct a.* fromcategories c cross join category_map mjoin article_categories ac on (c.id = ac.category_id and m.child_id = ac.category_id)join articles a on (a.id = ac.article_id) wherem.parent_id=1 andnot c.restrict_views andm.child_id = c.id anda.post_status='publish' --------------------------------- Unique (cost=61058.89..68058.89 rows=20000 width=203) (actual time=7649.35..8465.96 rows=100000 loops=1) -> Sort (cost=61058.89..61558.89 rows=200000 width=203) (actual time=7649.35..7898.56 rows=200000 loops=1) Sort Key: a.id, a.user_id, a.blog_id, a.remote_ip, a.create_time, a.publish_time, a.update_time, a.title, a.body, a.long_body, a.excerpt, a.post_status, a.publish_date -> Hash Join (cost=5133.25..17614.25 rows=200000 width=203) (actual time=590.36..6029.65 rows=200000 loops=1) Hash Cond: ("outer".article_id = "inner".id) -> Hash Join (cost=18.33..6499.33 rows=200000 width=16) (actual time=2.08..801.69 rows=200000 loops=1) Hash Cond: ("outer".category_id = "inner".id) -> Seq Scan on article_categories ac (cost=0.00..2981.00 rows=200000 width=8) (actual time=0.01..293.28 rows=200000 loops=1) -> Hash (cost=18.01..18.01 rows=131 width=8) (actual time=1.72..1.72 rows=0 loops=1) -> Hash Join (cost=6.64..18.01 rows=131 width=8) (actual time=0.84..1.58 rows=131 loops=1) Hash Cond: ("outer".child_id = "inner".id) -> Seq Scan on category_map m (cost=0.00..9.07 rows=131 width=4) (actual time=0.02..0.43 rows=131 loops=1) Filter: (parent_id = 1) -> Hash (cost=6.31..6.31rows=131 width=4) (actual time=0.40..0.40 rows=0 loops=1) -> Seq Scan on categories c (cost=0.00..6.31 rows=131 width=4) (actual time=0.02..0.24 rows=131 loops=1) Filter: (NOT restrict_views) -> Hash (cost=2885.00..2885.00 rows=100000 width=187) (actual time=588.13..588.13 rows=0 loops=1) -> Seq Scan on articles a (cost=0.00..2885.00 rows=100000 width=187) (actual time=0.03..429.67 rows=100000 loops=1) Filter: (post_status = 'publish'::character varying)Total runtime: 18544.75 msec -------------------------------- explain analyze select articles.* from(select article_id from article_categories ac, categories c, category_map cm whereac.category_id = c.id and ac.category_id = cm.child_id and c.id = cm.child_id and c.restrict_views = FALSE and cm.parent_id= 1group by article_id) Xjoin articles on (articles.id = X.article_id) where articles.post_status = 'publish' --------------------------------Merge Join (cost=26538.07..30754.75 rows=20000 width=191) (actual time=1736.36..3079.64 rows=100000 loops=1) Merge Cond: ("outer".id = "inner".article_id) -> Index Scan using articles_pkeyon articles (cost=0.00..3616.68 rows=100000 width=187) (actual time=0.06..674.13 rows=100000 loops=1) Filter: (post_status = 'publish'::charactervarying) -> Sort (cost=26538.07..26588.07 rows=20000 width=16) (actual time=1736.27..1800.42 rows=100000 loops=1) Sort Key: x.article_id -> Subquery Scan x (cost=24109.30..25109.30rows=20000 width=16) (actual time=1073.54..1594.71 rows=100000 loops=1) -> Group (cost=24109.30..25109.30 rows=20000width=16) (actual time=1073.54..1447.94 rows=100000 loops=1) -> Sort (cost=24109.30..24609.30 rows=200000 width=16) (actual time=1073.52..1191.72 rows=200000 loops=1) Sort Key: ac.article_id -> Hash Join (cost=18.66..6499.66 rows=200000 width=16) (actual time=2.14..777.38 rows=200000 loops=1) Hash Cond: ("outer".category_id= "inner".id) -> Seq Scan on article_categories ac (cost=0.00..2981.00 rows=200000 width=8) (actual time=0.01..304.06 rows=200000 loops=1) -> Hash (cost=18.33..18.33 rows=131 width=8) (actual time=1.80..1.80 rows=0 loops=1) -> Hash Join (cost=6.97..18.33 rows=131 width=8) (actual time=0.78..1.65 rows=131 loops=1) Hash Cond: ("outer".child_id = "inner".id) -> Seq Scan on category_map cm (cost=0.00..9.07 rows=131 width=4) (actual time=0.01..0.54 rows=131 loops=1) Filter: (parent_id = 1) -> Hash (cost=6.64..6.64 rows=131 width=4) (actual time=0.39..0.39 rows=0 loops=1) -> Seq Scan on categories c (cost=0.00..6.64 rows=131 width=4) (actual time=0.04..0.25 rows=131 loops=1) Filter: (restrict_views = false)Total runtime: 3221.05 msec
Brad Hilton wrote: > -------------------------------- > Merge Join (cost=26538.07..30754.75 rows=20000 width=191) (actual > time=1736.36..3079.64 rows=100000 loops=1) It looks like you didn't vacuum analyze database before explain analyze. Can you do this and send explain analyze once again? Look at the "actual time" showing which operation takes the most time. Regards, Tomasz Myrta
On Fri, Feb 14, 2003 at 10:22:19AM -0800, Brad Hilton wrote: > Hello, > > I am hoping someone can help explain why modifying the following query > can effect such a huge change in speed. The query is: > > select * from articles > where exists > ( select 1 from article_categories > where > article_categories.article_id = articles.id and > article_categories.category_id is null > ) Can you test these two queries?select * from (select article_id from article_categories where category_id is null groupby article_id) X join articles using (article_id); select <fields> from article_categories join articles using (article_id) where category_id is null group by <fields> Above queries will need index on article_idI'm not sure if it helps, but they are the only solutions in mymind ;-) Can you say anything about data statistics in your tables? Howmany rows are with category_id=null? I looked into query definition once again. Your query doesn't makesense - article_categories have not null category_id...What do you reallywant to do? Regards, Tomasz Myrta
On Fri, Feb 14, 2003 at 02:39:31PM -0800, Brad Hilton wrote: > If I just utilize article_categories primary key, I could end up with > duplicate articles since articles can live in multiple categories. > In case I'm not understanding your suggestiong perfectly, I tried to > flesh it out a bit more. Does the following query match your > suggestion? It looks ok now. Probably it needs some cosmetics changes. > > select a.* > from > categories c cross join category_map m > join article_categories ac on (c.id = ac.category_id and m.child_id = > ac.category_id) > join articles a on (a.id = ac.article_id) > where > m.parent_id=1 and > not c.restrict_views and > m.child_id = c.id and > a.post_status='publish' > > Unfortunately, this query returns duplicate articles (see explanation > above), and is fairly slow. Maybe I didn't follow your initial query > properly. What about adding "group by a.field1,a.field2..."? It will eliminate duplicates. Can you send explain analyze this query? Maybe table joins should be reordered or they need other indexes they have? Tomasz