Why are queries with subselects so slow? - Mailing list pgsql-general

From Robert Wille
Subject Why are queries with subselects so slow?
Date
Msg-id 008101c2e46b$bda4e850$0864a8c0@zucchini
Whole thread Raw
Responses Re: Why are queries with subselects so slow?  (Martijn van Oosterhout <kleptog@svana.org>)
Re: Why are queries with subselects so slow?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
I am trying to execute an update that has a subselect, and it takes an extraordinarily long time.
By performing the subselect in my application, and issuing the update multiple times, it appears that I can do the entire update about 400 times faster.
 
Following is a detailed description of the problem.
 
I am running Postgres 7.2.3 on Linux, and I have a freshly vacuum analyze'd database. Here are the tables:
 
create table container(
 containerid integer not null,
 parentid integer not null,
 name varchar(40) not null,
 createtime timestamp not null default current_timestamp,
 type smallint not null,
 primary key (containerid),
 foreign key (parentid) references container (containerid) on delete cascade,
 unique (parentid, name)
);
 
create table image(
 imageid integer not null,
 containerid integer not null,
 name varchar(120) not null,
 createtime timestamp not null default current_timestamp,
 state bigint not null default 0,
 articlescope integer not null default 0,
 primary key (imageid),
 foreign key (containerid) references container (containerid) on delete cascade,
 unique (containerid, name)
);
 
create table ancestry(
 containerid integer not null,
 ancestorid integer not null,\
 foreign key (containerid) references container (containerid) on delete cascade,
 foreign key (ancestorid) references container (containerid) on delete cascade,
 unique (containerid, ancestorid),
 unique (ancestorid, containerid)
);
 
The image table has 2043686 rows
The container table has 4016 rows
The ancestry table has 24728 rows
 
I am trying to execute the following update statement:
 
update image set state = ((state & -35184340369664::int8) | 0::int8) where containerid in (select containerid from ancestry where ancestorid = 1228067)
 
This statement will eventually update 766459 of the 2043686 rows.
 
When I explain this statement, I get the following:
 
Seq Scan on image  (cost=0.00..906827616.06 rows=1023222 width=47)
  SubPlan
    ->  Materialize  (cost=443.10..443.10 rows=898 width=4)
          ->  Seq Scan on ancestry  (cost=0.00..443.10 rows=898 width=4)
 
It is predicting that it will take upwards of a billion page fetches to update about one million rows, or nearly a thousand page fetches per row. Obviously, this can be done more efficiently, as my image table is only about 50 megs in size. So in order to understand its execution plan, I asked it to explain smaller parts.
 
If I explain the subselect, I get:
 
explain select containerid from ancestry where ancestorid = 1228067
 
Seq Scan on ancestry  (cost=0.00..443.10 rows=898 width=4)
 
This seems fairly reasonable (the predicted number of rows is very close to the actual number of rows), except that I wonder why it is doing a sequential scan of a table with 25K rows to retrieve less than 4% of them.
 
If I explain the cost of doing an update of the image table for one row in the subselect, I get the following:
 
explain update image set state = ((state & -35184340369664::int8) | 0::int8) where containerid=1228533
 
Index Scan using image_containerid_key on image  (cost=0.00..2809.74 rows=862 width=47)
 
Again, this seems reasonable. Therefore, if I were to do the update for each row in the subselect, it should take about 2.5 million (898 * 2809) page fetches, which is about 400 times fewer page fetches than the original statement. Therefore, I should be able to do this update before I am an old man by executing the subselect, and then calling update 865 times.

Can anybody explain why the database performs so poorly in this case? Is there anything I can do besides executing the subselect and then calling update multiple times?

pgsql-general by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Why PostgreSQL?
Next
From: Bruce Momjian
Date:
Subject: Re: Two questions about stored procedures