Thread: Re-Reason of Slowness of Query

Re-Reason of Slowness of Query

From
Adarsh Sharma
Date:
Thanks Chetan, After my Lunch Break, I tried the below steps :

My original query was :
explain analyze select distinct(p.crawled_page_id) from page_content p , clause2  c where p.crawled_page_id != c.source_id

which hangs because it is wrong query to fetch the desired output .

Next Updated Query be Chetan Suttraway :

explain analyze select distinct(p.crawled_page_id) from page_content p
 where NOT EXISTS (select 1 from  clause2 c where c.source_id = p.crawled_page_id);

                                                                    QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual time=7192.843..7195.923 rows=72 loops=1)
   ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.040..7192.426 rows=74 loops=1)
         ->  Seq Scan on page_content p  (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.009..395.599 rows=428467 loops=1)
         ->  Index Scan using idx_clause2_source_id on clause2 c  (cost=0.00..18.18 rows=781 width=4) (actual time=0.014..0.014 rows=1 loops=428467)
               Index Cond: (c.source_id = p.crawled_page_id)
 Total runtime: 7199.748 ms
(6 rows)

I think it is very much faster but I don't understand the query :

explain select distinct(b) from t1,t2 where t1.b >t2.d union all  select distinct(b) from t1,t2 where  t1.b <t2.d;

As i transform it into my format as:

explain select distinct(p.crawled_page_id) from page_content p , clause2 c where p.crawled_page_id > c.source_id union all  select distinct(p.crawled_page_id) from page_content p,clause2 c where p.crawled_page_id < c.source_id;

                                                     QUERY PLAN                                                     
---------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..296085951076.34 rows=765318 width=8)
   ->  Unique  (cost=0.00..148042971711.58 rows=382659 width=8)
         ->  Nested Loop  (cost=0.00..136655213119.84 rows=4555103436696 width=8)
               ->  Index Scan using idx_page_id on page_content p  (cost=0.00..174214.02 rows=428817 width=8)
               ->  Index Scan using idx_clause2_source_id on clause2 c  (cost=0.00..185898.05 rows=10622488 width=4)
                     Index Cond: (p.crawled_page_id > c.source_id)
   ->  Unique  (cost=0.00..148042971711.58 rows=382659 width=8)
         ->  Nested Loop  (cost=0.00..136655213119.84 rows=4555103436696 width=8)
               ->  Index Scan using idx_page_id on page_content p  (cost=0.00..174214.02 rows=428817 width=8)
               ->  Index Scan using idx_clause2_source_id on clause2 c  (cost=0.00..185898.05 rows=10622488 width=4)
                     Index Cond: (p.crawled_page_id < c.source_id)
(11 rows)

I don't think this is correct because it produce 11 rows output.

Any further suggestions, Please guide.

Thanks & best Regards,
Adarsh Sharma

Re: Re-Reason of Slowness of Query

From
Vitalii Tymchyshyn
Date:
23.03.11 11:17, Adarsh Sharma написав(ла):

I think it is very much faster but I don't understand the query :

explain select distinct(b) from t1,t2 where t1.b >t2.d union all  select distinct(b) from t1,t2 where  t1.b <t2.d;

I don't understand it too. What are you trying to get? Is it
select distinct(b) from t1 where  b > (select min(d) from t2) or b  < (select max(d) from t2)
?

Can you explain in words, not SQL, what do you expect do retrieve?

Best regards, Vitalii Tymchyshyn

Re: Re-Reason of Slowness of Query

From
Adarsh Sharma
Date:
I just want to retrieve that id 's from page_content which do not have any entry in clause2 table.

Thanks , Adarsh

Vitalii Tymchyshyn wrote:
23.03.11 11:17, Adarsh Sharma написав(ла):

I think it is very much faster but I don't understand the query :

explain select distinct(b) from t1,t2 where t1.b >t2.d union all  select distinct(b) from t1,t2 where  t1.b <t2.d;

I don't understand it too. What are you trying to get? Is it
select distinct(b) from t1 where  b > (select min(d) from t2) or b  < (select max(d) from t2)
?

Can you explain in words, not SQL, what do you expect do retrieve?

Best regards, Vitalii Tymchyshyn

Re: Re-Reason of Slowness of Query

From
Vitalii Tymchyshyn
Date:
23.03.11 12:10, Adarsh Sharma написав(ла):
I just want to retrieve that id 's from page_content which do not have any entry in clause2 table.

Then
select distinct(p.crawled_page_id) from page_content p
 where NOT EXISTS (select 1 from  clause2 c where c.source_id = p.crawled_page_id);
is correct query.

Best regards, Vitalii Tymchyshyn.

Re: Re-Reason of Slowness of Query

From
Adarsh Sharma
Date:
Vitalii Tymchyshyn wrote:
23.03.11 12:10, Adarsh Sharma написав(ла):
I just want to retrieve that id 's from page_content which do not have any entry in clause2 table.

Then
select distinct(p.crawled_page_id) from page_content p
 where NOT EXISTS (select 1 from  clause2 c where c.source_id = p.crawled_page_id);
is correct query.


I can't understand how select 1 from  clause2 c where c.source_id = p.crawled_page_id works too, i get my output .

What is the significance of 1 here.

Thanks , Adarsh
Best regards, Vitalii Tymchyshyn.

Re: Re-Reason of Slowness of Query

From
Vitalii Tymchyshyn
Date:
23.03.11 12:19, Adarsh Sharma написав(ла):
Vitalii Tymchyshyn wrote:
23.03.11 12:10, Adarsh Sharma написав(ла):
I just want to retrieve that id 's from page_content which do not have any entry in clause2 table.

Then
select distinct(p.crawled_page_id) from page_content p
 where NOT EXISTS (select 1 from  clause2 c where c.source_id = p.crawled_page_id);
is correct query.


I can't understand how select 1 from  clause2 c where c.source_id = p.crawled_page_id works too, i get my output .

What is the significance of 1 here.
No significance. You can put anything there. E.g. "*". Simply arbitrary constant. Exists checks if there were any rows, it does not matter which columns are there or what is in this columns.

Best regards, Vitalii Tymchyshyn

Re: Re-Reason of Slowness of Query

From
tv@fuzzy.cz
Date:
> I just want to retrieve that id 's from page_content which do not have
> any entry in clause2 table.

In that case the query probably does not work (at least the query you've
sent in the first post) as it will return even those IDs that have at
least one other row in 'clause2' (not matching the != condition). At least
that's how I understand it.

So instead of this

select distinct(p.crawled_page_id)
from page_content p, clause2 c where p.crawled_page_id != c.source_id ;

you should probably do this

select distinct(p.crawled_page_id)
from page_content p left join clause2 c on (p.crawled_page_id =
c.source_id) where (c.source_id is null);

I guess this will be much more efficient too.

regards
Tomas


Re: Re-Reason of Slowness of Query

From
Chetan Suttraway
Date:


On Wed, Mar 23, 2011 at 3:49 PM, Adarsh Sharma <adarsh.sharma@orkash.com> wrote:
Vitalii Tymchyshyn wrote:
23.03.11 12:10, Adarsh Sharma написав(ла):
I just want to retrieve that id 's from page_content which do not have any entry in clause2 table.

Then
select distinct(p.crawled_page_id) from page_content p
 where NOT EXISTS (select 1 from  clause2 c where c.source_id = p.crawled_page_id);
is correct query.


I can't understand how select 1 from  clause2 c where c.source_id = p.crawled_page_id works too, i get my output .

What is the significance of 1 here.

Thanks , Adarsh
Best regards, Vitalii Tymchyshyn.


Its the inverted logic for finding crawled_page_id not matching with source_id.
Actually, the idea was to force index scan on clause2 though.


--
Regards,
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.



Re: Re-Reason of Slowness of Query

From
Chetan Suttraway
Date:


On Wed, Mar 23, 2011 at 4:08 PM, <tv@fuzzy.cz> wrote:
> I just want to retrieve that id 's from page_content which do not have
> any entry in clause2 table.

In that case the query probably does not work (at least the query you've
sent in the first post) as it will return even those IDs that have at
least one other row in 'clause2' (not matching the != condition). At least
that's how I understand it.

true.
 
So instead of this

select distinct(p.crawled_page_id)
from page_content p, clause2 c where p.crawled_page_id != c.source_id ;

you should probably do this

select distinct(p.crawled_page_id)
from page_content p left join clause2 c on (p.crawled_page_id =
c.source_id) where (c.source_id is null);

I guess this will be much more efficient too.


This looks like to give expected results. Also note that the where clause "is null" is really required and is not an
optional predicate.

 
regards
Tomas


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



--
Regards,
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.



Re: Re-Reason of Slowness of Query

From
Adarsh Sharma
Date:
Thank U all, for U'r Nice Support.

Let me Conclude the results, below results are obtained after finding the needed queries :

First Option :

pdc_uima=# explain analyze select distinct(p.crawled_page_id)
pdc_uima-# from page_content p left join clause2 c on (p.crawled_page_id =
pdc_uima(# c.source_id) where (c.source_id is null);
                                                                     QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual time=87927.000..87930.084 rows=72 loops=1)
   ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.191..87926.546 rows=74 loops=1)
         ->  Seq Scan on page_content p  (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.027..528.978 rows=428467 loops=1)
         ->  Index Scan using idx_clause2_source_id on clause2 c  (cost=0.00..18.18 rows=781 width=4) (actual time=0.202..0.202 rows=1 loops=428467)
               Index Cond: (p.crawled_page_id = c.source_id)
 Total runtime: 87933.882 ms :-(
(6 rows)

Second Option :

pdc_uima=# explain analyze select distinct(p.crawled_page_id) from page_content p
pdc_uima-#  where NOT EXISTS (select 1 from  clause2 c where c.source_id = p.crawled_page_id);
                                                                     QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual time=7047.259..7050.261 rows=72 loops=1)
   ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.039..7046.826 rows=74 loops=1)
         ->  Seq Scan on page_content p  (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.008..388.976 rows=428467 loops=1)
         ->  Index Scan using idx_clause2_source_id on clause2 c  (cost=0.00..18.18 rows=781 width=4) (actual time=0.013..0.013 rows=1 loops=428467)
               Index Cond: (c.source_id = p.crawled_page_id)
 Total runtime: 7054.074 ms :-)
(6 rows)


Thanks & best Regards,
Adarsh Sharma




Chetan Suttraway wrote:


On Wed, Mar 23, 2011 at 4:08 PM, <tv@fuzzy.cz> wrote:
> I just want to retrieve that id 's from page_content which do not have
> any entry in clause2 table.

In that case the query probably does not work (at least the query you've
sent in the first post) as it will return even those IDs that have at
least one other row in 'clause2' (not matching the != condition). At least
that's how I understand it.

true.
 
So instead of this

select distinct(p.crawled_page_id)
from page_content p, clause2 c where p.crawled_page_id != c.source_id ;

you should probably do this

select distinct(p.crawled_page_id)
from page_content p left join clause2 c on (p.crawled_page_id =
c.source_id) where (c.source_id is null);

I guess this will be much more efficient too.


This looks like to give expected results. Also note that the where clause "is null" is really required and is not an
optional predicate.

 
regards
Tomas


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



--
Regards,
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.




Re: Re-Reason of Slowness of Query

From
Vitalii Tymchyshyn
Date:
23.03.11 13:21, Adarsh Sharma написав(ла):
Thank U all, for U'r Nice Support.

Let me Conclude the results, below results are obtained after finding the needed queries :

First Option :

pdc_uima=# explain analyze select distinct(p.crawled_page_id)
pdc_uima-# from page_content p left join clause2 c on (p.crawled_page_id =
pdc_uima(# c.source_id) where (c.source_id is null);
                                                                     QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual time=87927.000..87930.084 rows=72 loops=1)
   ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.191..87926.546 rows=74 loops=1)
         ->  Seq Scan on page_content p  (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.027..528.978 rows=428467 loops=1)
         ->  Index Scan using idx_clause2_source_id on clause2 c  (cost=0.00..18.18 rows=781 width=4) (actual time=0.202..0.202 rows=1 loops=428467)
               Index Cond: (p.crawled_page_id = c.source_id)
 Total runtime: 87933.882 ms :-(
(6 rows)

Second Option :

pdc_uima=# explain analyze select distinct(p.crawled_page_id) from page_content p
pdc_uima-#  where NOT EXISTS (select 1 from  clause2 c where c.source_id = p.crawled_page_id);
                                                                     QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual time=7047.259..7050.261 rows=72 loops=1)
   ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.039..7046.826 rows=74 loops=1)
         ->  Seq Scan on page_content p  (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.008..388.976 rows=428467 loops=1)
         ->  Index Scan using idx_clause2_source_id on clause2 c  (cost=0.00..18.18 rows=781 width=4) (actual time=0.013..0.013 rows=1 loops=428467)
               Index Cond: (c.source_id = p.crawled_page_id)
 Total runtime: 7054.074 ms :-)
(6 rows)


Actually the plans are equal, so I suppose it depends on what were run first :). Slow query operates with data mostly on disk, while fast one with data in memory.

Best regards, Vitalii Tymchyshyn

Re: Re-Reason of Slowness of Query

From
Chetan Suttraway
Date:


On Wed, Mar 23, 2011 at 4:51 PM, Vitalii Tymchyshyn <tivv00@gmail.com> wrote:
23.03.11 13:21, Adarsh Sharma написав(ла):
Thank U all, for U'r Nice Support.

Let me Conclude the results, below results are obtained after finding the needed queries :

First Option :

pdc_uima=# explain analyze select distinct(p.crawled_page_id)
pdc_uima-# from page_content p left join clause2 c on (p.crawled_page_id =
pdc_uima(# c.source_id) where (c.source_id is null);
                                                                     QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual time=87927.000..87930.084 rows=72 loops=1)
   ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.191..87926.546 rows=74 loops=1)
         ->  Seq Scan on page_content p  (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.027..528.978 rows=428467 loops=1)
         ->  Index Scan using idx_clause2_source_id on clause2 c  (cost=0.00..18.18 rows=781 width=4) (actual time=0.202..0.202 rows=1 loops=428467)
               Index Cond: (p.crawled_page_id = c.source_id)
 Total runtime: 87933.882 ms :-(
(6 rows)

Second Option :

pdc_uima=# explain analyze select distinct(p.crawled_page_id) from page_content p
pdc_uima-#  where NOT EXISTS (select 1 from  clause2 c where c.source_id = p.crawled_page_id);
                                                                     QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual time=7047.259..7050.261 rows=72 loops=1)
   ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.039..7046.826 rows=74 loops=1)
         ->  Seq Scan on page_content p  (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.008..388.976 rows=428467 loops=1)
         ->  Index Scan using idx_clause2_source_id on clause2 c  (cost=0.00..18.18 rows=781 width=4) (actual time=0.013..0.013 rows=1 loops=428467)
               Index Cond: (c.source_id = p.crawled_page_id)
 Total runtime: 7054.074 ms :-)
(6 rows)


Actually the plans are equal, so I suppose it depends on what were run first :). Slow query operates with data mostly on disk, while fast one with data in memory.

yeah. maybe the easiest way, is to start a fresh session and fire the queries.
 
Best regards, Vitalii Tymchyshyn



--
Regards,
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.



Re: Re-Reason of Slowness of Query

From
Adarsh Sharma
Date:
Vitalii Tymchyshyn wrote:
23.03.11 13:21, Adarsh Sharma написав(ла):
Thank U all, for U'r Nice Support.

Let me Conclude the results, below results are obtained after finding the needed queries :

First Option :

pdc_uima=# explain analyze select distinct(p.crawled_page_id) from page_content p left join clause2 c on (p.crawled_page_id = c.source_id) where (c.source_id is null);
                                                                     QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual time=87927.000..87930.084 rows=72 loops=1)
   ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.191..87926.546 rows=74 loops=1)
         ->  Seq Scan on page_content p  (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.027..528.978 rows=428467 loops=1)
         ->  Index Scan using idx_clause2_source_id on clause2 c  (cost=0.00..18.18 rows=781 width=4) (actual time=0.202..0.202 rows=1 loops=428467)
               Index Cond: (p.crawled_page_id = c.source_id)
 Total runtime: 87933.882 ms :-(
(6 rows)

Second Option :

pdc_uima=# explain analyze select distinct(p.crawled_page_id) from page_content p
pdc_uima-#  where NOT EXISTS (select 1 from  clause2 c where c.source_id = p.crawled_page_id);
                                                                     QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual time=7047.259..7050.261 rows=72 loops=1)
   ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.039..7046.826 rows=74 loops=1)
         ->  Seq Scan on page_content p  (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.008..388.976 rows=428467 loops=1)
         ->  Index Scan using idx_clause2_source_id on clause2 c  (cost=0.00..18.18 rows=781 width=4) (actual time=0.013..0.013 rows=1 loops=428467)
               Index Cond: (c.source_id = p.crawled_page_id)
 Total runtime: 7054.074 ms :-)
(6 rows)


Actually the plans are equal, so I suppose it depends on what were run first :). Slow query operates with data mostly on disk, while fast one with data in memory.

Yes U 'r absolutely right, if I run it again, it display the output as :

pdc_uima=# explain analyze select distinct(p.crawled_page_id) from page_content p left join clause2 c on (p.crawled_page_id = c.source_id) where (c.source_id is null);

                                                                    QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual time=7618.452..7621.427 rows=72 loops=1)
   ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.131..7618.043 rows=74 loops=1)
         ->  Seq Scan on page_content p  (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.020..472.811 rows=428467 loops=1)
         ->  Index Scan using idx_clause2_source_id on clause2 c  (cost=0.00..18.18 rows=781 width=4) (actual time=0.015..0.015 rows=1 loops=428467)
               Index Cond: (p.crawled_page_id = c.source_id)
 Total runtime: 7637.132 ms
(6 rows)

I let U know after a fresh start (session ).
Then the true result comes and if further tuning required can be performed.

Best Regards, Adarsh

Best regards, Vitalii Tymchyshyn

Re: Re-Reason of Slowness of Query

From
Adarsh Sharma
Date:

Actually the plans are equal, so I suppose it depends on what were run first :). Slow query operates with data mostly on disk, while fast one with data in memory.

yeah. maybe the easiest way, is to start a fresh session and fire the queries.


After the fresh start , the results obtained are :

pdc_uima=# explain analyze select distinct(p.crawled_page_id)
pdc_uima-#  from page_content p left join clause2 c on (p.crawled_page_id =
pdc_uima(#  c.source_id) where (c.source_id is null);
                                                                     QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual time=7725.132..7728.341 rows=72 loops=1)
   ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.115..7724.713 rows=74 loops=1)
         ->  Seq Scan on page_content p  (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.021..472.199 rows=428467 loops=1)
         ->  Index Scan using idx_clause2_source_id on clause2 c  (cost=0.00..18.18 rows=781 width=4) (actual time=0.015..0.015 rows=1 loops=428467)
               Index Cond: (p.crawled_page_id = c.source_id)
 Total runtime: 7731.840 ms
(6 rows)

pdc_uima=#          explain analyze select distinct(p.crawled_page_id) from page_content p
pdc_uima-#   where NOT EXISTS (select 1 from  clause2 c where c.source_id = p.crawled_page_id);
                                                                     QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual time=6192.249..6195.368 rows=72 loops=1)
   ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.036..6191.838 rows=74 loops=1)
         ->  Seq Scan on page_content p  (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.008..372.489 rows=428467 loops=1)
         ->  Index Scan using idx_clause2_source_id on clause2 c  (cost=0.00..18.18 rows=781 width=4) (actual time=0.012..0.012 rows=1 loops=428467)
               Index Cond: (c.source_id = p.crawled_page_id)
 Total runtime: 6198.567 ms
(6 rows)

This seems a slight upper hand of the second query .

Would it be possible to tune it further.
My postgresql.conf parameters are as follows : ( Total RAM = 16 GB )

shared_buffers = 4GB
max_connections=700
effective_cache_size = 6GB
work_mem=16MB
maintenance_mem=64MB

I think to change

work_mem=64MB
maintenance_mem=256MB

Does it has some effects now.


Thanks & best Regards,
Adarsh Sharma


Best regards, Vitalii Tymchyshyn



--
Regards,
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.




Re: Re-Reason of Slowness of Query

From
tv@fuzzy.cz
Date:
>
>>     Actually the plans are equal, so I suppose it depends on what were
>>     run first :). Slow query operates with data mostly on disk, while
>>     fast one with data in memory.
>>
>> yeah. maybe the easiest way, is to start a fresh session and fire the
>> queries.
>
>
> After the fresh start , the results obtained are :

As Chetan Suttraway already pointed out, the execution plans are exactly
the same. And by "excactly" I mean there's no difference in evaluating
those two queries.

The difference is due to cached data - not just in shared buffers (which
will be lost of postgres restart) but also in filesystem cache (which is
managed by kernel, not postgres).

So the first execution had to load (some of) the data into shared buffers,
while the second execution already had a lot of data in shared buffers.
That's why the first query run in 7.7sec while the second 6.2sec.

>> This seems a slight upper hand of the second query .

Again, there's no difference between those two queries, they're exactly
the same. It's just a matter of which of them is executed first.

> Would it be possible to tune it further.

I don't think so. The only possibility I see is to add a flag into
page_content table, update it using a trigger (when something is
inserted/deleted from clause2). Then you don't need to do the join.

> My postgresql.conf parameters are as follows : ( Total RAM = 16 GB )
>
> shared_buffers = 4GB
> max_connections=700
> effective_cache_size = 6GB
> work_mem=16MB
> maintenance_mem=64MB
>
> I think to change
>
> work_mem=64MB
> maintenance_mem=256MB
>
> Does it has some effects now.

Generally a good idea, but we don't know if there are other processes
running on the same machine and what kind of system is this (how many
users are there, what kind of queries do they run). If there's a lot of
users, keep work_mem low. If there's just a few users decrease
max_connections and bump up work_mem and consider increasing
shared_buffers.

Maintenance_work_mem is used for vacuum/create index etc. so it really
does not affect regular queries.

Some of those values (e.g. work_mem/maintenance_work_mem) are dynamic, so
you can set them for the current connection and see how it affects the
queries.

Just do something like

db=# SET work_mem='32MB'
db=# EXPLAIN ANALYZE SELECT ...

But I don't think this will improve the query we've been talking about.

regards
Tomas


Re: Re-Reason of Slowness of Query

From
Shaun Thomas
Date:
On 03/23/2011 04:17 AM, Adarsh Sharma wrote:

> explain analyze select distinct(p.crawled_page_id) from page_content
> p where NOT EXISTS (select 1 from clause2 c where c.source_id =
> p.crawled_page_id);

You know... I'm surprised nobody has mentioned this, but DISTINCT is
very slow unless you have a fairly recent version of Postgres that
replaces it with something faster. Try this:

EXPLAIN ANALYZE
SELECT p.crawled_page_id
   FROM page_content p
  WHERE NOT EXISTS (
          SELECT 1
            FROM clause2 c
           WHERE c.source_id = p.crawled_page_id
        )
  GROUP BY p.crawled_page_id;

Or if you like the cleaner query without a sub-select:

EXPLAIN ANALYZE
SELECT p.crawled_page_id
   FROM page_content p
   LEFT JOIN clause2 c ON (c.source_id = p.crawled_page_id)
  WHERE c.source_id IS NULL
  GROUP BY p.crawled_page_id;

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@peak6.com

______________________________________________

See  http://www.peak6.com/email_disclaimer.php
for terms and conditions related to this email

Re: Re-Reason of Slowness of Query

From
tv@fuzzy.cz
Date:
> On 03/23/2011 04:17 AM, Adarsh Sharma wrote:
>
>> explain analyze select distinct(p.crawled_page_id) from page_content
>> p where NOT EXISTS (select 1 from clause2 c where c.source_id =
>> p.crawled_page_id);
>
> You know... I'm surprised nobody has mentioned this, but DISTINCT is
> very slow unless you have a fairly recent version of Postgres that
> replaces it with something faster. Try this:

Nobody mentioned that because the explain plan already uses hash aggregate
(instead of the old sort)

 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual
time=7047.259..7050.261 rows=72 loops=1)

which means this is at least 8.4. Plus the 'distinct' step uses less than
1% of total time, so even if you improve it the impact will be minimal.

regards
Tomas


Re: Re-Reason of Slowness of Query

From
Shaun Thomas
Date:
On 03/23/2011 09:16 AM, tv@fuzzy.cz wrote:

> which means this is at least 8.4. Plus the 'distinct' step uses less than
> 1% of total time, so even if you improve it the impact will be minimal.

Haha. Noted. I guess I'm still on my original crusade against DISTINCT.
I was pulling it out of so much old code it's been fused to my DNA.
Actually, we're still on 8.2 so... :)

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@peak6.com

______________________________________________

See  http://www.peak6.com/email_disclaimer.php
for terms and conditions related to this email

Re: Re-Reason of Slowness of Query

From
Adarsh Sharma
Date:
tv@fuzzy.cz wrote:
On 03/23/2011 04:17 AM, Adarsh Sharma wrote:
   
explain analyze select distinct(p.crawled_page_id) from page_content
p where NOT EXISTS (select 1 from clause2 c where c.source_id =
p.crawled_page_id);     
You know... I'm surprised nobody has mentioned this, but DISTINCT is
very slow unless you have a fairly recent version of Postgres that
replaces it with something faster. Try this:   
Nobody mentioned that because the explain plan already uses hash aggregate
(instead of the old sort)
HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual
time=7047.259..7050.261 rows=72 loops=1)

which means this is at least 8.4. Plus the 'distinct' step uses less than
1% of total time, so even if you improve it the impact will be minimal.
 

Yes, U"r absolutely right I am using Version 8.4SS and i am satisfied with the below query results:

pdc_uima=# explain analyze select distinct(p.crawled_page_id) from page_content p
pdc_uima-#  where NOT EXISTS (select 1 from  clause2 c where c.source_id = p.crawled_page_id);
                                                                     QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=100278.16..104104.75 rows=382659 width=8) (actual time=5149.308..5152.251 rows=72 loops=1)
   ->  Nested Loop Anti Join  (cost=0.00..99320.46 rows=383079 width=8) (actual time=0.119..5148.954 rows=74 loops=1)
         ->  Seq Scan on page_content p  (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.021..444.487 rows=428467 loops=1)
         ->  Index Scan using idx_clause2_source_id on clause2 c  (cost=0.00..18.18 rows=781 width=4) (actual time=0.009..0.009 rows=1 loops=428467)
               Index Cond: (c.source_id = p.crawled_page_id)
 Total runtime: 5155.874 ms
(6 rows)

I don't think that the above results are optimized further.


Thanks & best Regards,
Adarsh Sharma
regards
Tomas