Thread: Re-Reason of Slowness of Query
Thanks Chetan, After my Lunch Break, I tried the below steps :
My original query was :
explain analyze select distinct(p.crawled_page_id) from page_content p , clause2 c where p.crawled_page_id != c.source_id
which hangs because it is wrong query to fetch the desired output .
Next Updated Query be Chetan Suttraway :
explain analyze select distinct(p.crawled_page_id) from page_content p
where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=7192.843..7195.923 rows=72 loops=1)
-> Nested Loop Anti Join (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.040..7192.426 rows=74 loops=1)
-> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.009..395.599 rows=428467 loops=1)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..18.18 rows=781 width=4) (actual time=0.014..0.014 rows=1 loops=428467)
Index Cond: (c.source_id = p.crawled_page_id)
Total runtime: 7199.748 ms
(6 rows)
I think it is very much faster but I don't understand the query :
explain select distinct(b) from t1,t2 where t1.b >t2.d union all select distinct(b) from t1,t2 where t1.b <t2.d;
As i transform it into my format as:
explain select distinct(p.crawled_page_id) from page_content p , clause2 c where p.crawled_page_id > c.source_id union all select distinct(p.crawled_page_id) from page_content p,clause2 c where p.crawled_page_id < c.source_id;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Append (cost=0.00..296085951076.34 rows=765318 width=8)
-> Unique (cost=0.00..148042971711.58 rows=382659 width=8)
-> Nested Loop (cost=0.00..136655213119.84 rows=4555103436696 width=8)
-> Index Scan using idx_page_id on page_content p (cost=0.00..174214.02 rows=428817 width=8)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..185898.05 rows=10622488 width=4)
Index Cond: (p.crawled_page_id > c.source_id)
-> Unique (cost=0.00..148042971711.58 rows=382659 width=8)
-> Nested Loop (cost=0.00..136655213119.84 rows=4555103436696 width=8)
-> Index Scan using idx_page_id on page_content p (cost=0.00..174214.02 rows=428817 width=8)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..185898.05 rows=10622488 width=4)
Index Cond: (p.crawled_page_id < c.source_id)
(11 rows)
I don't think this is correct because it produce 11 rows output.
Any further suggestions, Please guide.
Thanks & best Regards,
Adarsh Sharma
My original query was :
explain analyze select distinct(p.crawled_page_id) from page_content p , clause2 c where p.crawled_page_id != c.source_id
which hangs because it is wrong query to fetch the desired output .
Next Updated Query be Chetan Suttraway :
explain analyze select distinct(p.crawled_page_id) from page_content p
where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=7192.843..7195.923 rows=72 loops=1)
-> Nested Loop Anti Join (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.040..7192.426 rows=74 loops=1)
-> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.009..395.599 rows=428467 loops=1)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..18.18 rows=781 width=4) (actual time=0.014..0.014 rows=1 loops=428467)
Index Cond: (c.source_id = p.crawled_page_id)
Total runtime: 7199.748 ms
(6 rows)
I think it is very much faster but I don't understand the query :
explain select distinct(b) from t1,t2 where t1.b >t2.d union all select distinct(b) from t1,t2 where t1.b <t2.d;
As i transform it into my format as:
explain select distinct(p.crawled_page_id) from page_content p , clause2 c where p.crawled_page_id > c.source_id union all select distinct(p.crawled_page_id) from page_content p,clause2 c where p.crawled_page_id < c.source_id;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Append (cost=0.00..296085951076.34 rows=765318 width=8)
-> Unique (cost=0.00..148042971711.58 rows=382659 width=8)
-> Nested Loop (cost=0.00..136655213119.84 rows=4555103436696 width=8)
-> Index Scan using idx_page_id on page_content p (cost=0.00..174214.02 rows=428817 width=8)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..185898.05 rows=10622488 width=4)
Index Cond: (p.crawled_page_id > c.source_id)
-> Unique (cost=0.00..148042971711.58 rows=382659 width=8)
-> Nested Loop (cost=0.00..136655213119.84 rows=4555103436696 width=8)
-> Index Scan using idx_page_id on page_content p (cost=0.00..174214.02 rows=428817 width=8)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..185898.05 rows=10622488 width=4)
Index Cond: (p.crawled_page_id < c.source_id)
(11 rows)
I don't think this is correct because it produce 11 rows output.
Any further suggestions, Please guide.
Thanks & best Regards,
Adarsh Sharma
23.03.11 11:17, Adarsh Sharma написав(ла):
select distinct(b) from t1 where b > (select min(d) from t2) or b < (select max(d) from t2)
?
Can you explain in words, not SQL, what do you expect do retrieve?
Best regards, Vitalii Tymchyshyn
I don't understand it too. What are you trying to get? Is it
I think it is very much faster but I don't understand the query :
explain select distinct(b) from t1,t2 where t1.b >t2.d union all select distinct(b) from t1,t2 where t1.b <t2.d;
select distinct(b) from t1 where b > (select min(d) from t2) or b < (select max(d) from t2)
?
Can you explain in words, not SQL, what do you expect do retrieve?
Best regards, Vitalii Tymchyshyn
I just want to retrieve that id 's from page_content which do not have any entry in clause2 table.
Thanks , Adarsh
Vitalii Tymchyshyn wrote:
Thanks , Adarsh
Vitalii Tymchyshyn wrote:
23.03.11 11:17, Adarsh Sharma написав(ла):I don't understand it too. What are you trying to get? Is it
I think it is very much faster but I don't understand the query :
explain select distinct(b) from t1,t2 where t1.b >t2.d union all select distinct(b) from t1,t2 where t1.b <t2.d;
select distinct(b) from t1 where b > (select min(d) from t2) or b < (select max(d) from t2)
?
Can you explain in words, not SQL, what do you expect do retrieve?
Best regards, Vitalii Tymchyshyn
23.03.11 12:10, Adarsh Sharma написав(ла):
select distinct(p.crawled_page_id) from page_content p
where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);
is correct query.
Best regards, Vitalii Tymchyshyn.
I just want to retrieve that id 's from page_content which do not have any entry in clause2 table.Then
select distinct(p.crawled_page_id) from page_content p
where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);
is correct query.
Best regards, Vitalii Tymchyshyn.
Vitalii Tymchyshyn wrote:
I can't understand how select 1 from clause2 c where c.source_id = p.crawled_page_id works too, i get my output .
What is the significance of 1 here.
Thanks , Adarsh
23.03.11 12:10, Adarsh Sharma написав(ла):I just want to retrieve that id 's from page_content which do not have any entry in clause2 table.Then
select distinct(p.crawled_page_id) from page_content p
where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);
is correct query.
I can't understand how select 1 from clause2 c where c.source_id = p.crawled_page_id works too, i get my output .
What is the significance of 1 here.
Thanks , Adarsh
Best regards, Vitalii Tymchyshyn.
23.03.11 12:19, Adarsh Sharma написав(ла):
Best regards, Vitalii Tymchyshyn
Vitalii Tymchyshyn wrote:No significance. You can put anything there. E.g. "*". Simply arbitrary constant. Exists checks if there were any rows, it does not matter which columns are there or what is in this columns.23.03.11 12:10, Adarsh Sharma написав(ла):I just want to retrieve that id 's from page_content which do not have any entry in clause2 table.Then
select distinct(p.crawled_page_id) from page_content p
where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);
is correct query.
I can't understand how select 1 from clause2 c where c.source_id = p.crawled_page_id works too, i get my output .
What is the significance of 1 here.
Best regards, Vitalii Tymchyshyn
> I just want to retrieve that id 's from page_content which do not have > any entry in clause2 table. In that case the query probably does not work (at least the query you've sent in the first post) as it will return even those IDs that have at least one other row in 'clause2' (not matching the != condition). At least that's how I understand it. So instead of this select distinct(p.crawled_page_id) from page_content p, clause2 c where p.crawled_page_id != c.source_id ; you should probably do this select distinct(p.crawled_page_id) from page_content p left join clause2 c on (p.crawled_page_id = c.source_id) where (c.source_id is null); I guess this will be much more efficient too. regards Tomas
On Wed, Mar 23, 2011 at 3:49 PM, Adarsh Sharma <adarsh.sharma@orkash.com> wrote:
I can't understand how select 1 from clause2 c where c.source_id = p.crawled_page_id works too, i get my output .Vitalii Tymchyshyn wrote:23.03.11 12:10, Adarsh Sharma написав(ла):I just want to retrieve that id 's from page_content which do not have any entry in clause2 table.Then
select distinct(p.crawled_page_id) from page_content p
where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);
is correct query.
What is the significance of 1 here.
Thanks , AdarshBest regards, Vitalii Tymchyshyn.
Its the inverted logic for finding crawled_page_id not matching with source_id.
Actually, the idea was to force index scan on clause2 though.
--
Regards,
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.
On Wed, Mar 23, 2011 at 4:08 PM, <tv@fuzzy.cz> wrote:
This looks like to give expected results. Also note that the where clause "is null" is really required and is not an
optional predicate.
> I just want to retrieve that id 's from page_content which do not haveIn that case the query probably does not work (at least the query you've
> any entry in clause2 table.
sent in the first post) as it will return even those IDs that have at
least one other row in 'clause2' (not matching the != condition). At least
that's how I understand it.
true.
So instead of thisfrom page_content p, clause2 c where p.crawled_page_id != c.source_id ;
select distinct(p.crawled_page_id)
you should probably do thisfrom page_content p left join clause2 c on (p.crawled_page_id =
select distinct(p.crawled_page_id)
c.source_id) where (c.source_id is null);
I guess this will be much more efficient too.
This looks like to give expected results. Also note that the where clause "is null" is really required and is not an
optional predicate.
regards
Tomas
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
--
Regards,
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.
Thank U all, for U'r Nice Support.
Let me Conclude the results, below results are obtained after finding the needed queries :
First Option :
pdc_uima=# explain analyze select distinct(p.crawled_page_id)
pdc_uima-# from page_content p left join clause2 c on (p.crawled_page_id =
pdc_uima(# c.source_id) where (c.source_id is null);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=87927.000..87930.084 rows=72 loops=1)
-> Nested Loop Anti Join (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.191..87926.546 rows=74 loops=1)
-> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.027..528.978 rows=428467 loops=1)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..18.18 rows=781 width=4) (actual time=0.202..0.202 rows=1 loops=428467)
Index Cond: (p.crawled_page_id = c.source_id)
Total runtime: 87933.882 ms :-(
(6 rows)
Second Option :
pdc_uima=# explain analyze select distinct(p.crawled_page_id) from page_content p
pdc_uima-# where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=7047.259..7050.261 rows=72 loops=1)
-> Nested Loop Anti Join (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.039..7046.826 rows=74 loops=1)
-> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.008..388.976 rows=428467 loops=1)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..18.18 rows=781 width=4) (actual time=0.013..0.013 rows=1 loops=428467)
Index Cond: (c.source_id = p.crawled_page_id)
Total runtime: 7054.074 ms :-)
(6 rows)
Thanks & best Regards,
Adarsh Sharma
Chetan Suttraway wrote:
Let me Conclude the results, below results are obtained after finding the needed queries :
First Option :
pdc_uima=# explain analyze select distinct(p.crawled_page_id)
pdc_uima-# from page_content p left join clause2 c on (p.crawled_page_id =
pdc_uima(# c.source_id) where (c.source_id is null);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=87927.000..87930.084 rows=72 loops=1)
-> Nested Loop Anti Join (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.191..87926.546 rows=74 loops=1)
-> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.027..528.978 rows=428467 loops=1)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..18.18 rows=781 width=4) (actual time=0.202..0.202 rows=1 loops=428467)
Index Cond: (p.crawled_page_id = c.source_id)
Total runtime: 87933.882 ms :-(
(6 rows)
Second Option :
pdc_uima=# explain analyze select distinct(p.crawled_page_id) from page_content p
pdc_uima-# where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=7047.259..7050.261 rows=72 loops=1)
-> Nested Loop Anti Join (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.039..7046.826 rows=74 loops=1)
-> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.008..388.976 rows=428467 loops=1)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..18.18 rows=781 width=4) (actual time=0.013..0.013 rows=1 loops=428467)
Index Cond: (c.source_id = p.crawled_page_id)
Total runtime: 7054.074 ms :-)
(6 rows)
Thanks & best Regards,
Adarsh Sharma
Chetan Suttraway wrote:
On Wed, Mar 23, 2011 at 4:08 PM, <tv@fuzzy.cz> wrote:> I just want to retrieve that id 's from page_content which do not haveIn that case the query probably does not work (at least the query you've
> any entry in clause2 table.
sent in the first post) as it will return even those IDs that have at
least one other row in 'clause2' (not matching the != condition). At least
that's how I understand it.true.
So instead of thisfrom page_content p, clause2 c where p.crawled_page_id != c.source_id ;
select distinct(p.crawled_page_id)
you should probably do thisfrom page_content p left join clause2 c on (p.crawled_page_id =
select distinct(p.crawled_page_id)
c.source_id) where (c.source_id is null);
I guess this will be much more efficient too.
This looks like to give expected results. Also note that the where clause "is null" is really required and is not an
optional predicate.
regards
Tomas
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
--
Regards,
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.
23.03.11 13:21, Adarsh Sharma написав(ла):
Actually the plans are equal, so I suppose it depends on what were run first :). Slow query operates with data mostly on disk, while fast one with data in memory.
Best regards, Vitalii Tymchyshyn
Thank U all, for U'r Nice Support.
Let me Conclude the results, below results are obtained after finding the needed queries :
First Option :
pdc_uima=# explain analyze select distinct(p.crawled_page_id)
pdc_uima-# from page_content p left join clause2 c on (p.crawled_page_id =
pdc_uima(# c.source_id) where (c.source_id is null);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=87927.000..87930.084 rows=72 loops=1)
-> Nested Loop Anti Join (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.191..87926.546 rows=74 loops=1)
-> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.027..528.978 rows=428467 loops=1)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..18.18 rows=781 width=4) (actual time=0.202..0.202 rows=1 loops=428467)
Index Cond: (p.crawled_page_id = c.source_id)
Total runtime: 87933.882 ms :-(
(6 rows)
Second Option :
pdc_uima=# explain analyze select distinct(p.crawled_page_id) from page_content p
pdc_uima-# where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=7047.259..7050.261 rows=72 loops=1)
-> Nested Loop Anti Join (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.039..7046.826 rows=74 loops=1)
-> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.008..388.976 rows=428467 loops=1)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..18.18 rows=781 width=4) (actual time=0.013..0.013 rows=1 loops=428467)
Index Cond: (c.source_id = p.crawled_page_id)
Total runtime: 7054.074 ms :-)
(6 rows)
Actually the plans are equal, so I suppose it depends on what were run first :). Slow query operates with data mostly on disk, while fast one with data in memory.
Best regards, Vitalii Tymchyshyn
On Wed, Mar 23, 2011 at 4:51 PM, Vitalii Tymchyshyn <tivv00@gmail.com> wrote:
23.03.11 13:21, Adarsh Sharma написав(ла):Actually the plans are equal, so I suppose it depends on what were run first :). Slow query operates with data mostly on disk, while fast one with data in memory.Thank U all, for U'r Nice Support.
Let me Conclude the results, below results are obtained after finding the needed queries :
First Option :
pdc_uima=# explain analyze select distinct(p.crawled_page_id)
pdc_uima-# from page_content p left join clause2 c on (p.crawled_page_id =
pdc_uima(# c.source_id) where (c.source_id is null);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=87927.000..87930.084 rows=72 loops=1)
-> Nested Loop Anti Join (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.191..87926.546 rows=74 loops=1)
-> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.027..528.978 rows=428467 loops=1)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..18.18 rows=781 width=4) (actual time=0.202..0.202 rows=1 loops=428467)
Index Cond: (p.crawled_page_id = c.source_id)
Total runtime: 87933.882 ms :-(
(6 rows)
Second Option :
pdc_uima=# explain analyze select distinct(p.crawled_page_id) from page_content p
pdc_uima-# where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=7047.259..7050.261 rows=72 loops=1)
-> Nested Loop Anti Join (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.039..7046.826 rows=74 loops=1)
-> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.008..388.976 rows=428467 loops=1)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..18.18 rows=781 width=4) (actual time=0.013..0.013 rows=1 loops=428467)
Index Cond: (c.source_id = p.crawled_page_id)
Total runtime: 7054.074 ms :-)
(6 rows)
yeah. maybe the easiest way, is to start a fresh session and fire the queries.
Best regards, Vitalii Tymchyshyn
--
Regards,
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.
Vitalii Tymchyshyn wrote:
Yes U 'r absolutely right, if I run it again, it display the output as :
pdc_uima=# explain analyze select distinct(p.crawled_page_id) from page_content p left join clause2 c on (p.crawled_page_id = c.source_id) where (c.source_id is null);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=7618.452..7621.427 rows=72 loops=1)
-> Nested Loop Anti Join (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.131..7618.043 rows=74 loops=1)
-> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.020..472.811 rows=428467 loops=1)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..18.18 rows=781 width=4) (actual time=0.015..0.015 rows=1 loops=428467)
Index Cond: (p.crawled_page_id = c.source_id)
Total runtime: 7637.132 ms
(6 rows)
I let U know after a fresh start (session ).
Then the true result comes and if further tuning required can be performed.
Best Regards, Adarsh
23.03.11 13:21, Adarsh Sharma написав(ла):Thank U all, for U'r Nice Support.
Let me Conclude the results, below results are obtained after finding the needed queries :
First Option :
pdc_uima=# explain analyze select distinct(p.crawled_page_id) from page_content p left join clause2 c on (p.crawled_page_id = c.source_id) where (c.source_id is null);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=87927.000..87930.084 rows=72 loops=1)
-> Nested Loop Anti Join (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.191..87926.546 rows=74 loops=1)
-> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.027..528.978 rows=428467 loops=1)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..18.18 rows=781 width=4) (actual time=0.202..0.202 rows=1 loops=428467)
Index Cond: (p.crawled_page_id = c.source_id)
Total runtime: 87933.882 ms :-(
(6 rows)
Second Option :
pdc_uima=# explain analyze select distinct(p.crawled_page_id) from page_content p
pdc_uima-# where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=7047.259..7050.261 rows=72 loops=1)
-> Nested Loop Anti Join (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.039..7046.826 rows=74 loops=1)
-> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.008..388.976 rows=428467 loops=1)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..18.18 rows=781 width=4) (actual time=0.013..0.013 rows=1 loops=428467)
Index Cond: (c.source_id = p.crawled_page_id)
Total runtime: 7054.074 ms :-)
(6 rows)
Actually the plans are equal, so I suppose it depends on what were run first :). Slow query operates with data mostly on disk, while fast one with data in memory.
Yes U 'r absolutely right, if I run it again, it display the output as :
pdc_uima=# explain analyze select distinct(p.crawled_page_id) from page_content p left join clause2 c on (p.crawled_page_id = c.source_id) where (c.source_id is null);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=7618.452..7621.427 rows=72 loops=1)
-> Nested Loop Anti Join (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.131..7618.043 rows=74 loops=1)
-> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.020..472.811 rows=428467 loops=1)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..18.18 rows=781 width=4) (actual time=0.015..0.015 rows=1 loops=428467)
Index Cond: (p.crawled_page_id = c.source_id)
Total runtime: 7637.132 ms
(6 rows)
I let U know after a fresh start (session ).
Then the true result comes and if further tuning required can be performed.
Best Regards, Adarsh
Best regards, Vitalii Tymchyshyn
Actually the plans are equal, so I suppose it depends on what were run first :). Slow query operates with data mostly on disk, while fast one with data in memory.yeah. maybe the easiest way, is to start a fresh session and fire the queries.
After the fresh start , the results obtained are :
pdc_uima=# explain analyze select distinct(p.crawled_page_id)
pdc_uima-# from page_content p left join clause2 c on (p.crawled_page_id =
pdc_uima(# c.source_id) where (c.source_id is null);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=7725.132..7728.341 rows=72 loops=1)
-> Nested Loop Anti Join (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.115..7724.713 rows=74 loops=1)
-> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.021..472.199 rows=428467 loops=1)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..18.18 rows=781 width=4) (actual time=0.015..0.015 rows=1 loops=428467)
Index Cond: (p.crawled_page_id = c.source_id)
Total runtime: 7731.840 ms
(6 rows)
pdc_uima=# explain analyze select distinct(p.crawled_page_id) from page_content p
pdc_uima-# where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=6192.249..6195.368 rows=72 loops=1)
-> Nested Loop Anti Join (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.036..6191.838 rows=74 loops=1)
-> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.008..372.489 rows=428467 loops=1)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..18.18 rows=781 width=4) (actual time=0.012..0.012 rows=1 loops=428467)
Index Cond: (c.source_id = p.crawled_page_id)
Total runtime: 6198.567 ms
(6 rows)
This seems a slight upper hand of the second query .
Would it be possible to tune it further.
My postgresql.conf parameters are as follows : ( Total RAM = 16 GB )
shared_buffers = 4GB
max_connections=700
effective_cache_size = 6GB
work_mem=16MB
maintenance_mem=64MB
I think to change
work_mem=64MB
maintenance_mem=256MB
Does it has some effects now.
Thanks & best Regards,
Adarsh Sharma
Best regards, Vitalii Tymchyshyn
--
Regards,
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.
> >> Actually the plans are equal, so I suppose it depends on what were >> run first :). Slow query operates with data mostly on disk, while >> fast one with data in memory. >> >> yeah. maybe the easiest way, is to start a fresh session and fire the >> queries. > > > After the fresh start , the results obtained are : As Chetan Suttraway already pointed out, the execution plans are exactly the same. And by "excactly" I mean there's no difference in evaluating those two queries. The difference is due to cached data - not just in shared buffers (which will be lost of postgres restart) but also in filesystem cache (which is managed by kernel, not postgres). So the first execution had to load (some of) the data into shared buffers, while the second execution already had a lot of data in shared buffers. That's why the first query run in 7.7sec while the second 6.2sec. >> This seems a slight upper hand of the second query . Again, there's no difference between those two queries, they're exactly the same. It's just a matter of which of them is executed first. > Would it be possible to tune it further. I don't think so. The only possibility I see is to add a flag into page_content table, update it using a trigger (when something is inserted/deleted from clause2). Then you don't need to do the join. > My postgresql.conf parameters are as follows : ( Total RAM = 16 GB ) > > shared_buffers = 4GB > max_connections=700 > effective_cache_size = 6GB > work_mem=16MB > maintenance_mem=64MB > > I think to change > > work_mem=64MB > maintenance_mem=256MB > > Does it has some effects now. Generally a good idea, but we don't know if there are other processes running on the same machine and what kind of system is this (how many users are there, what kind of queries do they run). If there's a lot of users, keep work_mem low. If there's just a few users decrease max_connections and bump up work_mem and consider increasing shared_buffers. Maintenance_work_mem is used for vacuum/create index etc. so it really does not affect regular queries. Some of those values (e.g. work_mem/maintenance_work_mem) are dynamic, so you can set them for the current connection and see how it affects the queries. Just do something like db=# SET work_mem='32MB' db=# EXPLAIN ANALYZE SELECT ... But I don't think this will improve the query we've been talking about. regards Tomas
On 03/23/2011 04:17 AM, Adarsh Sharma wrote: > explain analyze select distinct(p.crawled_page_id) from page_content > p where NOT EXISTS (select 1 from clause2 c where c.source_id = > p.crawled_page_id); You know... I'm surprised nobody has mentioned this, but DISTINCT is very slow unless you have a fairly recent version of Postgres that replaces it with something faster. Try this: EXPLAIN ANALYZE SELECT p.crawled_page_id FROM page_content p WHERE NOT EXISTS ( SELECT 1 FROM clause2 c WHERE c.source_id = p.crawled_page_id ) GROUP BY p.crawled_page_id; Or if you like the cleaner query without a sub-select: EXPLAIN ANALYZE SELECT p.crawled_page_id FROM page_content p LEFT JOIN clause2 c ON (c.source_id = p.crawled_page_id) WHERE c.source_id IS NULL GROUP BY p.crawled_page_id; -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604 312-676-8870 sthomas@peak6.com ______________________________________________ See http://www.peak6.com/email_disclaimer.php for terms and conditions related to this email
> On 03/23/2011 04:17 AM, Adarsh Sharma wrote: > >> explain analyze select distinct(p.crawled_page_id) from page_content >> p where NOT EXISTS (select 1 from clause2 c where c.source_id = >> p.crawled_page_id); > > You know... I'm surprised nobody has mentioned this, but DISTINCT is > very slow unless you have a fairly recent version of Postgres that > replaces it with something faster. Try this: Nobody mentioned that because the explain plan already uses hash aggregate (instead of the old sort) HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=7047.259..7050.261 rows=72 loops=1) which means this is at least 8.4. Plus the 'distinct' step uses less than 1% of total time, so even if you improve it the impact will be minimal. regards Tomas
On 03/23/2011 09:16 AM, tv@fuzzy.cz wrote: > which means this is at least 8.4. Plus the 'distinct' step uses less than > 1% of total time, so even if you improve it the impact will be minimal. Haha. Noted. I guess I'm still on my original crusade against DISTINCT. I was pulling it out of so much old code it's been fused to my DNA. Actually, we're still on 8.2 so... :) -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604 312-676-8870 sthomas@peak6.com ______________________________________________ See http://www.peak6.com/email_disclaimer.php for terms and conditions related to this email
tv@fuzzy.cz wrote:
Yes, U"r absolutely right I am using Version 8.4SS and i am satisfied with the below query results:
pdc_uima=# explain analyze select distinct(p.crawled_page_id) from page_content p
pdc_uima-# where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=5149.308..5152.251 rows=72 loops=1)
-> Nested Loop Anti Join (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.119..5148.954 rows=74 loops=1)
-> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.021..444.487 rows=428467 loops=1)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..18.18 rows=781 width=4) (actual time=0.009..0.009 rows=1 loops=428467)
Index Cond: (c.source_id = p.crawled_page_id)
Total runtime: 5155.874 ms
(6 rows)
I don't think that the above results are optimized further.
Thanks & best Regards,
Adarsh Sharma
On 03/23/2011 04:17 AM, Adarsh Sharma wrote:explain analyze select distinct(p.crawled_page_id) from page_content p where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);You know... I'm surprised nobody has mentioned this, but DISTINCT is very slow unless you have a fairly recent version of Postgres that replaces it with something faster. Try this:Nobody mentioned that because the explain plan already uses hash aggregate (instead of the old sort) HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=7047.259..7050.261 rows=72 loops=1) which means this is at least 8.4. Plus the 'distinct' step uses less than 1% of total time, so even if you improve it the impact will be minimal.
Yes, U"r absolutely right I am using Version 8.4SS and i am satisfied with the below query results:
pdc_uima=# explain analyze select distinct(p.crawled_page_id) from page_content p
pdc_uima-# where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=100278.16..104104.75 rows=382659 width=8) (actual time=5149.308..5152.251 rows=72 loops=1)
-> Nested Loop Anti Join (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.119..5148.954 rows=74 loops=1)
-> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.021..444.487 rows=428467 loops=1)
-> Index Scan using idx_clause2_source_id on clause2 c (cost=0.00..18.18 rows=781 width=4) (actual time=0.009..0.009 rows=1 loops=428467)
Index Cond: (c.source_id = p.crawled_page_id)
Total runtime: 5155.874 ms
(6 rows)
I don't think that the above results are optimized further.
Thanks & best Regards,
Adarsh Sharma
regards Tomas