Thread: Reason of Slowness of query

Reason of Slowness of query

From

Adarsh Sharma

Date:

23 March 2011, 03:23:27

Dear all,

I have 2 tables in my database name clause2( 4900 MB) & page_content(1582 MB).

My table definations are as :

page_content :-

CREATE TABLE page_content
(
content_id integer,
wkb_geometry geometry,
link_level integer,
isprocessable integer,
isvalid integer,
isanalyzed integer,
islocked integer,
content_language character(10),
url_id integer,
publishing_date character(40),
heading character(150),
category character(150),
crawled_page_url character(500),
keywords character(500),
dt_stamp timestamp with time zone,
"content" character varying,
crawled_page_id bigint,
id integer
)
WITH (
OIDS=FALSE
);

Indexes on it :-
CREATE INDEX idx_page_id ON page_content USING btree (crawled_page_id);
CREATE INDEX idx_page_id_content ON page_content USING btree (crawled_page_id, content_language, publishing_date, isprocessable);
CREATE INDEX pgweb_idx ON page_content USING gin (to_tsvector('english'::regconfig, content::text));

clause 2:-
CREATE TABLE clause2
(
id bigint NOT NULL DEFAULT nextval('clause_id_seq'::regclass),
source_id integer,
sentence_id integer,
clause_id integer,
tense character varying(30),
clause text,
CONSTRAINT pk_clause_demo_id PRIMARY KEY (id)
)WITH ( OIDS=FALSE);

Indexes on it :

CREATE INDEX idx_clause2_march10
ON clause2
USING btree
(id, source_id);

I perform a join query on it as :

explain analyze select distinct(p.crawled_page_id) from page_content p , clause2 c where p.crawled_page_id != c.source_id ;

What it takes more than 1 hour to complete. As I issue the explain analyze command and cannot able to wait for output but I send my explain output as :
                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------
Unique (cost=927576.16..395122387390.13 rows=382659 width=8)
   -> Nested Loop (cost=927576.16..360949839832.15 rows=13669019023195 width=8)
         Join Filter: (p.crawled_page_id <> c.source_id)
         -> Index Scan using idx_page_id on page_content p (cost=0.00..174214.02 rows=428817 width=8)
         -> Materialize (cost=927576.16..1370855.12 rows=31876196 width=4)
               -> Seq Scan on clause2 c (cost=0.00..771182.96 rows=31876196 width=4)
(6 rows)

Please guide me how to make the above query run faster as I am not able to do that.

Thanks, Adarsh

Re: Reason of Slowness of query

From

Chetan Suttraway

Date:

23 March 2011, 04:17:36

On Wed, Mar 23, 2011 at 11:58 AM, Adarsh Sharma <adarsh.sharma@orkash.com> wrote:

Dear all,

I have 2 tables in my database name clause2( 4900 MB) & page_content(1582 MB).

My table definations are as :

page_content :-

CREATE TABLE page_content
(
content_id integer,
wkb_geometry geometry,
link_level integer,
isprocessable integer,
isvalid integer,
isanalyzed integer,
islocked integer,
content_language character(10),
url_id integer,
publishing_date character(40),
heading character(150),
category character(150),
crawled_page_url character(500),
keywords character(500),
dt_stamp timestamp with time zone,
"content" character varying,
crawled_page_id bigint,
id integer
)
WITH (
OIDS=FALSE
);

Indexes on it :-
CREATE INDEX idx_page_id ON page_content USING btree (crawled_page_id);
CREATE INDEX idx_page_id_content ON page_content USING btree (crawled_page_id, content_language, publishing_date, isprocessable);
CREATE INDEX pgweb_idx ON page_content USING gin (to_tsvector('english'::regconfig, content::text));

clause 2:-
CREATE TABLE clause2
(
id bigint NOT NULL DEFAULT nextval('clause_id_seq'::regclass),
source_id integer,
sentence_id integer,
clause_id integer,
tense character varying(30),
clause text,
CONSTRAINT pk_clause_demo_id PRIMARY KEY (id)
)WITH ( OIDS=FALSE);

Indexes on it :

CREATE INDEX idx_clause2_march10
ON clause2
USING btree
(id, source_id);

I perform a join query on it as :

explain analyze select distinct(p.crawled_page_id) from page_content p , clause2 c where p.crawled_page_id != c.source_id ;

What it takes more than 1 hour to complete. As I issue the explain analyze command and cannot able to wait for output but I send my explain output as :
                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------
Unique (cost=927576.16..395122387390.13 rows=382659 width=8)
   -> Nested Loop (cost=927576.16..360949839832.15 rows=13669019023195 width=8)
         Join Filter: (p.crawled_page_id <> c.source_id)
         -> Index Scan using idx_page_id on page_content p (cost=0.00..174214.02 rows=428817 width=8)
         -> Materialize (cost=927576.16..1370855.12 rows=31876196 width=4)
               -> Seq Scan on clause2 c (cost=0.00..771182.96 rows=31876196 width=4)
(6 rows)

Please guide me how to make the above query run faster as I am not able to do that.

Thanks, Adarsh

Could you try just explaining the below query:
explain select distinct(p.crawled_page_id) from page_content p where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);

The idea here is to avoid directly using NOT operator.

Regards,
Chetan

--
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.

Re: Reason of Slowness of query

From

Samuel Gendler

Date:

23 March 2011, 04:21:20

On Tue, Mar 22, 2011 at 11:28 PM, Adarsh Sharma <adarsh.sharma@orkash.com> wrote:

I perform a join query on it as :

explain analyze select distinct(p.crawled_page_id) from page_content p , clause2 c where p.crawled_page_id != c.source_id ;

What it takes more than 1 hour to complete. As I issue the explain analyze command and cannot able to wait for output but I send my explain output as :

please describe what your query is trying to select, as it is possible that query isn't doing what you think it is. joining 2 tables where id1 != id2 will create a cross multiple of the two tables such that every row from the first table is matched with every single row from the second table that doesn't have a matching id. Then you are looking for distinct values on that potentially enormous set of rows.

db_v2=# select * from table1;

id | value

----+-------

1 | 1

2 | 2

3 | 3

(3 rows)

db_v2=# select * from table2;

id | value

----+-------

1 | 4

2 | 5

3 | 6

(3 rows)

db_v2=# select t1.id, t1.value, t2.id, t2.value from table1 t1, table2 t2 where t1.id != t2.id;

id | value | id | value

----+-------+----+-------

1 | 1 | 2 | 5

1 | 1 | 3 | 6

2 | 2 | 1 | 4

2 | 2 | 3 | 6

3 | 3 | 1 | 4

3 | 3 | 2 | 5

So if you have a couple of million rows in each table, you are selecting distinct over a potentially huge set of data. If you are actually trying to find all ids from one table which have no match at all in the other table, then you need an entirely different query:

db_v2=# insert into table2 (value) values (7);

INSERT 0 1

db_v2=# select * from table2;

id | value

----+-------

1 | 4

2 | 5

3 | 6

4 | 7

db_v2=# select t2.id, t2.value from table2 t2 where not exists (select 1 from table1 t1 where t1.id = t2.id);

id | value

----+-------

4 | 7

Re: Reason of Slowness of query

From

Vitalii Tymchyshyn

Date:

23 March 2011, 04:25:29

23.03.11 08:28, Adarsh Sharma написав(ла):

I perform a join query on it as :

explain analyze select distinct(p.crawled_page_id) from page_content p , clause2 c where p.crawled_page_id != c.source_id ;

Your query is wrong. This query will return every crawled_page_id if clause2 has more then 1 source_id. This is because DB will be able to find clause with source_id different from crawled_page_id. You need to use "not exists" or "not in".

Best regards, Vitalii Tymchyshyn.

Re: Reason of Slowness of query

From

Adarsh Sharma

Date:

23 March 2011, 04:25:42

Thanks Chetan, here is the output of your updated query :

explain select distinct(p.crawled_page_id) from page_content p where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);

                                      QUERY PLAN
---------------------------------------------------------------------------------------
HashAggregate (cost=1516749.47..1520576.06 rows=382659 width=8)
   -> Hash Anti Join (cost=1294152.41..1515791.80 rows=383071 width=8)
         Hash Cond: (p.crawled_page_id = c.source_id)
         -> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8)
         -> Hash (cost=771182.96..771182.96 rows=31876196 width=4)
               -> Seq Scan on clause2 c (cost=0.00..771182.96 rows=31876196 width=4)
(6 rows)

And my explain analyze output is :

                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=1516749.47..1520576.06 rows=382659 width=8) (actual time=56666.181..56669.270 rows=72 loops=1)
   -> Hash Anti Join (cost=1294152.41..1515791.80 rows=383071 width=8) (actual time=45740.789..56665.816 rows=74 loops=1)
         Hash Cond: (p.crawled_page_id = c.source_id)
         -> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.012..715.915 rows=428467 loops=1)
         -> Hash (cost=771182.96..771182.96 rows=31876196 width=4) (actual time=45310.524..45310.524 rows=31853083 loops=1)
               -> Seq Scan on clause2 c (cost=0.00..771182.96 rows=31876196 width=4) (actual time=0.055..23408.884 rows=31853083 loops=1)
Total runtime: 56687.660 ms
(7 rows)

But Is there is any option to tune it further and one more thing output rows varies from 6 to 7.

Thanks & best Regards,
Adarsh Sharma

Chetan Suttraway wrote:

On Wed, Mar 23, 2011 at 11:58 AM, Adarsh Sharma <adarsh.sharma@orkash.com> wrote:
Dear all,

I have 2 tables in my database name clause2( 4900 MB) & page_content(1582 MB).

My table definations are as :

page_content :-

CREATE TABLE page_content
(
content_id integer,
wkb_geometry geometry,
link_level integer,
isprocessable integer,
isvalid integer,
isanalyzed integer,
islocked integer,
content_language character(10),
url_id integer,
publishing_date character(40),
heading character(150),
category character(150),
crawled_page_url character(500),
keywords character(500),
dt_stamp timestamp with time zone,
"content" character varying,
crawled_page_id bigint,
id integer
)
WITH (
OIDS=FALSE
);

Indexes on it :-
CREATE INDEX idx_page_id ON page_content USING btree (crawled_page_id);
CREATE INDEX idx_page_id_content ON page_content USING btree (crawled_page_id, content_language, publishing_date, isprocessable);
CREATE INDEX pgweb_idx ON page_content USING gin (to_tsvector('english'::regconfig, content::text));

clause 2:-
CREATE TABLE clause2
(
id bigint NOT NULL DEFAULT nextval('clause_id_seq'::regclass),
source_id integer,
sentence_id integer,
clause_id integer,
tense character varying(30),
clause text,
CONSTRAINT pk_clause_demo_id PRIMARY KEY (id)
)WITH ( OIDS=FALSE);

Indexes on it :

CREATE INDEX idx_clause2_march10
ON clause2
USING btree
(id, source_id);

I perform a join query on it as :

explain analyze select distinct(p.crawled_page_id) from page_content p , clause2 c where p.crawled_page_id != c.source_id ;

What it takes more than 1 hour to complete. As I issue the explain analyze command and cannot able to wait for output but I send my explain output as :
                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------
Unique (cost=927576.16..395122387390.13 rows=382659 width=8)
   -> Nested Loop (cost=927576.16..360949839832.15 rows=13669019023195 width=8)
         Join Filter: (p.crawled_page_id <> c.source_id)
         -> Index Scan using idx_page_id on page_content p (cost=0.00..174214.02 rows=428817 width=8)
         -> Materialize (cost=927576.16..1370855.12 rows=31876196 width=4)
               -> Seq Scan on clause2 c (cost=0.00..771182.96 rows=31876196 width=4)
(6 rows)

Please guide me how to make the above query run faster as I am not able to do that.

Thanks, Adarsh

Could you try just explaining the below query:
explain select distinct(p.crawled_page_id) from page_content p where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);

The idea here is to avoid directly using NOT operator.

Regards,
Chetan

--
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.

Re: Reason of Slowness of query

From

Chetan Suttraway

Date:

23 March 2011, 04:26:06

On Wed, Mar 23, 2011 at 12:50 PM, Samuel Gendler <sgendler@ideasculptor.com> wrote:

On Tue, Mar 22, 2011 at 11:28 PM, Adarsh Sharma <adarsh.sharma@orkash.com> wrote:

I perform a join query on it as :

explain analyze select distinct(p.crawled_page_id) from page_content p , clause2 c where p.crawled_page_id != c.source_id ;

What it takes more than 1 hour to complete. As I issue the explain analyze command and cannot able to wait for output but I send my explain output as :

please describe what your query is trying to select, as it is possible that query isn't doing what you think it is. joining 2 tables where id1 != id2 will create a cross multiple of the two tables such that every row from the first table is matched with every single row from the second table that doesn't have a matching id. Then you are looking for distinct values on that potentially enormous set of rows.

db_v2=# select * from table1;
id | value
----+-------
  1 | 1
  2 | 2
  3 | 3
(3 rows)

db_v2=# select * from table2;
id | value
----+-------
  1 | 4
  2 | 5
  3 | 6
(3 rows)

db_v2=# select t1.id, t1.value, t2.id, t2.value from table1 t1, table2 t2 where t1.id != t2.id;
id | value | id | value
----+-------+----+-------
  1 | 1 | 2 | 5
  1 | 1 | 3 | 6
  2 | 2 | 1 | 4
  2 | 2 | 3 | 6
  3 | 3 | 1 | 4
  3 | 3 | 2 | 5

So if you have a couple of million rows in each table, you are selecting distinct over a potentially huge set of data. If you are actually trying to find all ids from one table which have no match at all in the other table, then you need an entirely different query:

db_v2=# insert into table2 (value) values (7);
INSERT 0 1

db_v2=# select * from table2;
id | value
----+-------
  1 | 4
  2 | 5
  3 | 6
  4 | 7

db_v2=# select t2.id, t2.value from table2 t2 where not exists (select 1 from table1 t1 where t1.id = t2.id);
id | value
----+-------
  4 | 7

Check this setup:
pg=# create table t1(a int, b int);
CREATE TABLE
pg=# create index t1_b on t1(b);
CREATE INDEX
pg=# create table t2(c int, d int);
CREATE TABLE
pg=# create index t2_cd on t2(c,d);
CREATE INDEX
pg=# explain select distinct(b) from t1,t2 where t1.b !=t2.d;
                                  QUERY PLAN
-------------------------------------------------------------------------------
Unique (cost=0.00..80198.86 rows=200 width=4)
   -> Nested Loop (cost=0.00..68807.10 rows=4556702 width=4)
         Join Filter: (t1.b <> t2.d)
         -> Index Scan using t1_b on t1 (cost=0.00..76.35 rows=2140 width=4)
         -> Materialize (cost=0.00..42.10 rows=2140 width=4)
               -> Seq Scan on t2 (cost=0.00..31.40 rows=2140 width=4)
(6 rows)

pg=# explain select distinct(b) from t1 where NOT EXISTS (select 1 from t2 where t2.d=t1.b);
                               QUERY PLAN
------------------------------------------------------------------------
HashAggregate (cost=193.88..193.89 rows=1 width=4)
   -> Hash Anti Join (cost=58.15..193.88 rows=1 width=4)
         Hash Cond: (t1.b = t2.d)
         -> Seq Scan on t1 (cost=0.00..31.40 rows=2140 width=4)
         -> Hash (cost=31.40..31.40 rows=2140 width=4)
               -> Seq Scan on t2 (cost=0.00..31.40 rows=2140 width=4)
(6 rows)

The cost seems to be on higher side, but maybe on your system with index scan on t2 and t1, the cost might be on lower side.

Another query which forced index scan was :
pg=# explain select distinct(b) from t1,t2 where t1.b >t2.d union all select distinct(b) from t1,t2 where t1.b <t2.d;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
Append (cost=0.00..100496.74 rows=400 width=4)
   -> Unique (cost=0.00..50246.37 rows=200 width=4)
         -> Nested Loop (cost=0.00..46430.04 rows=1526533 width=4)
               -> Index Scan using t1_b on t1 (cost=0.00..76.35 rows=2140 width=4)
               -> Index Scan using t2_d on t2 (cost=0.00..12.75 rows=713 width=4)
                     Index Cond: (public.t1.b > public.t2.d)
   -> Unique (cost=0.00..50246.37 rows=200 width=4)
         -> Nested Loop (cost=0.00..46430.04 rows=1526533 width=4)
               -> Index Scan using t1_b on t1 (cost=0.00..76.35 rows=2140 width=4)
               -> Index Scan using t2_d on t2 (cost=0.00..12.75 rows=713 width=4)
                     Index Cond: (public.t1.b < public.t2.d)
(11 rows)

This looks like to a acceptable.
Please try this above query with your setup and post the explain output.

--
Regards,
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.

Re: Reason of Slowness of query

From

Vitalii Tymchyshyn

Date:

23 March 2011, 05:51:30

23.03.11 09:30, Adarsh Sharma написав(ла):

Thanks Chetan, here is the output of your updated query :

explain select distinct(p.crawled_page_id) from page_content p where NOT EXISTS (select 1 from clause2 c where c.source_id = p.crawled_page_id);

                                      QUERY PLAN
---------------------------------------------------------------------------------------
HashAggregate (cost=1516749.47..1520576.06 rows=382659 width=8)
   -> Hash Anti Join (cost=1294152.41..1515791.80 rows=383071 width=8)
         Hash Cond: (p.crawled_page_id = c.source_id)
         -> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8)
         -> Hash (cost=771182.96..771182.96 rows=31876196 width=4)
               -> Seq Scan on clause2 c (cost=0.00..771182.96 rows=31876196 width=4)
(6 rows)

And my explain analyze output is :

                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=1516749.47..1520576.06 rows=382659 width=8) (actual time=56666.181..56669.270 rows=72 loops=1)
   -> Hash Anti Join (cost=1294152.41..1515791.80 rows=383071 width=8) (actual time=45740.789..56665.816 rows=74 loops=1)
         Hash Cond: (p.crawled_page_id = c.source_id)
         -> Seq Scan on page_content p (cost=0.00..87132.17 rows=428817 width=8) (actual time=0.012..715.915 rows=428467 loops=1)
         -> Hash (cost=771182.96..771182.96 rows=31876196 width=4) (actual time=45310.524..45310.524 rows=31853083 loops=1)
               -> Seq Scan on clause2 c (cost=0.00..771182.96 rows=31876196 width=4) (actual time=0.055..23408.884 rows=31853083 loops=1)
Total runtime: 56687.660 ms
(7 rows)

But Is there is any option to tune it further and one more thing output rows varies from 6 to 7.

You need an index on source_id to prevent seq scan, like the next:
CREATE INDEX idx_clause2_source_id
ON clause2
(source_id);

Best regards, Vitalii Tymchyshyn