Re: Reason of Slowness of query - Mailing list pgsql-performance

From Adarsh Sharma
Subject Re: Reason of Slowness of query
Date
Msg-id 4D89A192.7000407@orkash.com
Whole thread Raw
In response to Re: Reason of Slowness of query  (Chetan Suttraway <chetan.suttraway@enterprisedb.com>)
Responses Re: Reason of Slowness of query  (Vitalii Tymchyshyn <tivv00@gmail.com>)
List pgsql-performance
Thanks Chetan, here is the output of your updated query :


explain  select distinct(p.crawled_page_id) from page_content p where NOT EXISTS (select 1 from  clause2 c where c.source_id = p.crawled_page_id);


                                      QUERY PLAN                                      
---------------------------------------------------------------------------------------
 HashAggregate  (cost=1516749.47..1520576.06 rows=382659 width=8)
   ->  Hash Anti Join  (cost=1294152.41..1515791.80 rows=383071 width=8)
         Hash Cond: (p.crawled_page_id = c.source_id)
         ->  Seq Scan on page_content p  (cost=0.00..87132.17 rows=428817 width=8)
         ->  Hash  (cost=771182.96..771182.96 rows=31876196 width=4)
               ->  Seq Scan on clause2 c  (cost=0.00..771182.96 rows=31876196 width=4)
(6 rows)

And my explain analyze output is :

                                                      QUERY PLAN                                                                
--------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=1516749.47..1520576.06 rows=382659 width=8) (actual time=56666.181..56669.270 rows=72 loops=1)
   ->  Hash Anti Join  (cost=1294152.41..1515791.80 rows=383071 width=8) (actual time=45740.789..56665.816 rows=74 loops=1)
         Hash Cond: (p.crawled_page_id = c.source_id)
         ->  Seq Scan on page_content p  (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.012..715.915 rows=428467 loops=1)
         ->  Hash  (cost=771182.96..771182.96 rows=31876196 width=4) (actual time=45310.524..45310.524 rows=31853083 loops=1)
               ->  Seq Scan on clause2 c  (cost=0.00..771182.96 rows=31876196 width=4) (actual time=0.055..23408.884 rows=31853083 loops=1)
 Total runtime: 56687.660 ms
(7 rows)

But Is there is any option to tune it further and one more thing output rows varies from 6 to 7.


Thanks & best Regards,
Adarsh Sharma






Chetan Suttraway wrote:


On Wed, Mar 23, 2011 at 11:58 AM, Adarsh Sharma <adarsh.sharma@orkash.com> wrote:
Dear all,

I have 2 tables in my database name clause2( 4900 MB) & page_content(1582 MB).

My table definations are as :

page_content :-

CREATE TABLE page_content
(
  content_id integer,
  wkb_geometry geometry,
  link_level integer,
  isprocessable integer,
  isvalid integer,
  isanalyzed integer,
  islocked integer,
  content_language character(10),
  url_id integer,
  publishing_date character(40),
  heading character(150),
  category character(150),
  crawled_page_url character(500),
  keywords character(500),
  dt_stamp timestamp with time zone,
  "content" character varying,
  crawled_page_id bigint,
  id integer
)
WITH (
  OIDS=FALSE
);

Indexes on it :-
CREATE INDEX idx_page_id  ON page_content  USING btree  (crawled_page_id);
CREATE INDEX idx_page_id_content   ON page_content  USING btree  (crawled_page_id, content_language, publishing_date, isprocessable);
CREATE INDEX pgweb_idx  ON page_content   USING gin   (to_tsvector('english'::regconfig, content::text));

clause 2:-
CREATE TABLE clause2
(
  id bigint NOT NULL DEFAULT nextval('clause_id_seq'::regclass),
  source_id integer,
  sentence_id integer,
  clause_id integer,
  tense character varying(30),
  clause text,
  CONSTRAINT pk_clause_demo_id PRIMARY KEY (id)
)WITH ( OIDS=FALSE);

Indexes on it :

CREATE INDEX idx_clause2_march10
  ON clause2
  USING btree
  (id, source_id);

I perform a join query on it as :

 explain analyze select distinct(p.crawled_page_id) from page_content p , clause2  c where p.crawled_page_id != c.source_id ;

What it takes more than 1 hour to complete. As I issue the explain analyze command and cannot able to wait for output but I send my explain output as :
                                             QUERY PLAN                                              
--------------------------------------------------------------------------------------------------------
 Unique  (cost=927576.16..395122387390.13 rows=382659 width=8)
   ->  Nested Loop  (cost=927576.16..360949839832.15 rows=13669019023195 width=8)
         Join Filter: (p.crawled_page_id <> c.source_id)
         ->  Index Scan using idx_page_id on page_content p  (cost=0.00..174214.02 rows=428817 width=8)
         ->  Materialize  (cost=927576.16..1370855.12 rows=31876196 width=4)
               ->  Seq Scan on clause2 c  (cost=0.00..771182.96 rows=31876196 width=4)
(6 rows)


Please guide me how to make the above query run faster as I am not able to do that.


Thanks, Adarsh




Could you try just explaining the below query:
explain  select distinct(p.crawled_page_id) from page_content p where NOT EXISTS (select 1 from  clause2 c where c.source_id = p.crawled_page_id);

The idea here is to avoid directly using NOT operator.



Regards,
Chetan

--
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.




pgsql-performance by date:

Previous
From: Vitalii Tymchyshyn
Date:
Subject: Re: Reason of Slowness of query
Next
From: Chetan Suttraway
Date:
Subject: Re: Reason of Slowness of query