Thread: Why are queries with subselects so slow?
I am trying to execute an update that has a subselect, and it takes an extraordinarily long time.
By performing the subselect in my application, and issuing the update multiple times, it appears that I can do the entire update about 400 times faster.
Following is a detailed description of the problem.
I am running Postgres 7.2.3 on Linux, and I have a freshly vacuum analyze'd database. Here are the tables:
create table container(
containerid integer not null,
parentid integer not null,
name varchar(40) not null,
createtime timestamp not null default current_timestamp,
type smallint not null,
primary key (containerid),
foreign key (parentid) references container (containerid) on delete cascade,
unique (parentid, name)
);
create table image(
imageid integer not null,
containerid integer not null,
name varchar(120) not null,
createtime timestamp not null default current_timestamp,
state bigint not null default 0,
articlescope integer not null default 0,
primary key (imageid),
foreign key (containerid) references container (containerid) on delete cascade,
unique (containerid, name)
);
create table ancestry(
containerid integer not null,
ancestorid integer not null,
foreign key (containerid) references container (containerid) on delete cascade,
foreign key (ancestorid) references container (containerid) on delete cascade,
unique (containerid, ancestorid),
unique (ancestorid, containerid)
);
The image table has 2043686 rows
The container table has 4016 rows
The ancestry table has 24728 rows
I am trying to execute the following update statement:
update image set state = ((state & -35184340369664::int8) | 0::int8) where containerid in (select containerid from ancestry where ancestorid = 1228067)
This statement will eventually update 766459 of the 2043686 rows.
When I explain this statement, I get the following:
Seq Scan on image (cost=0.00..906827616.06 rows=1023222 width=47)
SubPlan
-> Materialize (cost=443.10..443.10 rows=898 width=4)
-> Seq Scan on ancestry (cost=0.00..443.10 rows=898 width=4)
It is predicting that it will take upwards of a billion page fetches to update about one million rows, or nearly a thousand page fetches per row. Obviously, this can be done more efficiently, as my image table is only about 50 megs in size. So in order to understand its execution plan, I asked it to explain smaller parts.
If I explain the subselect, I get:
explain select containerid from ancestry where ancestorid = 1228067
Seq Scan on ancestry (cost=0.00..443.10 rows=898 width=4)
This seems fairly reasonable (the predicted number of rows is very close to the actual number of rows), except that I wonder why it is doing a sequential scan of a table with 25K rows to retrieve less than 4% of them.
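As an aside (a hedged suggestion, not from the thread): the unique (ancestorid, containerid) constraint above already creates an index that could serve this query, so the sequential scan is a planner pricing decision rather than a missing index. PostgreSQL's standard enable_seqscan planner setting can be toggled per session to see what the index-scan plan would cost:

```sql
-- A sketch for inspecting the planner's choice (assumes the ancestry
-- schema above). Turning enable_seqscan off for the session discourages
-- sequential scans, so EXPLAIN shows the index-scan alternative and its
-- estimated cost for comparison.
set enable_seqscan = off;
explain select containerid from ancestry where ancestorid = 1228067;
set enable_seqscan = on;   -- restore the default afterwards
```

If the index plan's estimated cost is higher but its real runtime is lower, the cost settings (e.g. random_page_cost) may be mistuned for this machine.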
If I explain the cost of doing an update of the image table for one row in the subselect, I get the following:
explain update image set state = ((state & -35184340369664::int8) | 0::int8) where containerid=1228533
Index Scan using image_containerid_key on image (cost=0.00..2809.74 rows=862 width=47)
Again, this seems reasonable. Therefore, if I were to do the update for each row in the subselect, it should take about 2.5 million (898 * 2809) page fetches, which is about 400 times fewer page fetches than the original statement. Therefore, I should be able to do this update before I am an old man by executing the subselect, and then calling update 865 times.
Can anybody explain why the database performs so poorly in this case? Is there anything I can do besides executing the subselect and then calling update multiple times?
On Thu, Mar 06, 2003 at 10:38:09PM -0700, Robert Wille wrote:
> I am trying to execute an update that has a subselect, and it takes an extraordinarily long time.
> By performing the subselect in my application, and issuing the update multiple times, it appears that I can do the entire update about 400 times faster.

Try using EXISTS rather than IN. It tends to work better.

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Support bacteria! They're the only culture some people have.
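[Editor's sketch of the EXISTS form suggested here, written against the schema above — the thread itself gives no code for it:]

```sql
-- A correlated EXISTS instead of IN. On old PostgreSQL releases such as
-- 7.2, IN (subselect) rescans the materialized subselect result for every
-- candidate row, while EXISTS can probe the (containerid, ancestorid)
-- unique index on ancestry once per row.
update image
   set state = (state & -35184340369664::int8) | 0::int8
 where exists (select 1
                 from ancestry
                where ancestry.containerid = image.containerid
                  and ancestry.ancestorid = 1228067);
```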
Sure,

A subselect can require large temporary result sets to hold the intermediate data, which makes the process slow. Subselects and joins do not show much difference on small data, but as the data grows a subselect becomes very slow. So you can use joins, which make your query much faster.

Regards,
Sreedhar
----- Original Message -----
From: Robert Wille
Sent: Friday, March 07, 2003 11:08 AM
Subject: [GENERAL] Why are queries with subselects so slow?

[snip]
"Robert Wille" <robertw@willeweb.com> writes:
> [ this is awfully slow: ]
> update image set state = ((state & -35184340369664::int8) | 0::int8) where containerid in (select containerid from ancestry where ancestorid = 1228067)

IN is notoriously slow. It'll be better in 7.4, but in the meantime it's best avoided. It appears from your schema that the subselect cannot generate duplicate containerids, so you should be able to do it like this:

update image set state = ((state & -35184340369664::int8) | 0::int8)
from ancestry
where image.containerid = ancestry.containerid
  and ancestry.ancestorid = 1228067;

regards, tom lane