Thread: Slow update with simple query
Hi list!

I am running a query to update the boolean field of a table based on another
table's fields. The query is (names changed for readability):

UPDATE t1
SET booleanfield = (t2.field1 IN ('some', 'other') AND t2.field2 = 'Y')
FROM t2
WHERE t1.uid = t2.uid

t2.uid is the PRIMARY KEY. t2 only has ~1000 rows, so I think it fits fully
in memory. t1 has ~2,000,000 rows. There is an index on t1.uid also.

The explain (sorry, no explain analyze available yet) is:

Hash Join  (cost=112.75..307410.10 rows=2019448 width=357)
  Hash Cond: ("outer".uid = "inner".uid)
  ->  Seq Scan on t1  (cost=0.00..261792.01 rows=2033001 width=340)
  ->  Hash  (cost=110.20..110.20 rows=1020 width=53)
        ->  Seq Scan on t2  (cost=0.00..110.20 rows=1020 width=53)

My query has been running for more than 1.5 hours now, and it is still
running. Nothing else is running on the server. There are two multicolumn
indexes involving this column (both are 3-column indexes). One of them has a
functional column (date_trunc('month', datefield)).

Do you think the problem is with the indexes?

The hardware is not great, but the database is on a RAID1 array, so it's not
bad either. At more than 1.5 hours for ~2,000,000 rows, that is already over
2.5 ms per row, which surprises me for such a simple update.

Thanks for your opinion on this!

--
Arnaud
On Wed, 2006-12-13 at 11:51 +0100, Arnaud Lesauvage wrote:
> [...]
> There is an index on t1.uid also.
> [...]
> Do you think the problem is with the indexes ?

I guess so. Are you sure about the index on t1.uid? What are the column
definitions for t1.uid and t2.uid? Are they the same?

You should be able to get a plan similar to:

Merge Join  (cost=0.00..43.56 rows=1000 width=11)
  Merge Cond: ("outer".uid = "inner".uid)
  ->  Index Scan using t1i on t1  (cost=0.00..38298.39 rows=2000035 width=10)
  ->  Index Scan using t2i on t2  (cost=0.00..26.73 rows=1000 width=5)

What postgres version are you using?

gnari
Ragnar wrote:
>> Do you think the problem is with the indexes ?
>
> I guess so. Are you sure about the index on t1.uid?
> What are the column definitions for t1.uid and t2.uid? Are they the same?

Man, no!!! I just checked and indeed, there is no index on this column. I
probably dropped it recently. Thanks Ragnar!
(t1.uid and t2.uid were the same: character(32) columns.)

> What postgres version are you using?

Oops, forgot that too: 8.1.4.

I am creating the index right now; I'll tell you if this fixes the problem.
Thanks for your help!
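For the record, the fix boils down to something like this (using the
thread's aliased names; the real index name is whatever you choose):

  CREATE INDEX idx_t1_uid ON t1 (uid);   -- the missing join index
  ANALYZE t1;                            -- refresh planner statistics

--
Arnaud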
Hi,

The problem is a combination of badly formed SQL and maybe missing indexes.

Try this:

UPDATE t1
SET booleanfield = foo.bar
FROM (SELECT uid,
             (field IN ('some','other') AND field2 = 'Y') AS bar
      FROM t2) AS foo
WHERE t1.uid = foo.uid;

and index t1.uid, t2.uid, t2.field and t2.field2.
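Roughly like this, with made-up index names (t2.uid should already be
covered by its PRIMARY KEY):

  CREATE INDEX idx_t1_uid    ON t1 (uid);
  CREATE INDEX idx_t2_field  ON t2 (field);
  CREATE INDEX idx_t2_field2 ON t2 (field2);

regards,
Jens Schipkowski

--
** APUS Software GmbH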
Jens Schipkowski wrote:
> The problem is a combination of badly formed SQL and maybe missing indexes.
> Try this:
> UPDATE t1
> SET booleanfield = foo.bar
> FROM (SELECT uid,
>              (field IN ('some','other') AND field2 = 'Y') AS bar
>       FROM t2) AS foo
> WHERE t1.uid = foo.uid;

Hi Jens,
Why is this query better than the other one? Because it runs the
"(field IN ('some','other') AND field2 = 'Y')" part once and then executes
the join with the resulting set?

> and index t1.uid, t2.uid, t2.field and t2.field2.

t2.field can only take 3 or 4 values (I don't remember exactly), and field2
only 2 ('Y' or 'N'), so these fields have a very low cardinality. Won't the
planner choose to do a table scan in such a case?

Thanks for your advice!

--
Arnaud
On Wed, 13 Dec 2006 13:23:41 +0100, Arnaud Lesauvage <thewild@freesurf.fr> wrote:
> Why is this query better than the other one ? Because it runs the
> "(field IN ('some','other') AND field2 = 'Y')" once and then executes
> the join with the resulting set ?

True. The subselect in the FROM clause will be executed once and joined
using the condition in the WHERE clause. So your condition on t2 is not
evaluated for each row in t1 (2 million records) but only for each row in
t2 (1,000 records), and the boolean value is already computed during the
update.
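You can check how the subselect behaves on its own -- a quick sketch (your
costs and timings will differ):

  EXPLAIN ANALYZE
  SELECT uid, (field IN ('some','other') AND field2 = 'Y') AS bar
  FROM t2;

It should show a single seq scan over the ~1000 rows of t2.

regards,
Jens

--
** APUS Software GmbH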
Jens Schipkowski wrote:
> True. The subselect in the FROM clause will be executed once and joined
> using the condition in the WHERE clause. So your condition on t2 is not
> evaluated for each row in t1 (2 million records) but only for each row
> in t2 (1,000 records), and the boolean value is already computed during
> the update.

OK Jens, thanks for clarifying this.
I thought the planner could guess what to do in such cases.

Regards
--
Arnaud
On Wed, 2006-12-13 at 14:38 +0100, Arnaud Lesauvage wrote:
> OK Jens, thanks for clarifying this.
> I thought the planner could guess what to do in such cases.

don't worry, it will.
this is not your problem

gnari
Ragnar wrote:
> don't worry, it will.
> this is not your problem

Indeed, the new query does not perform that well:

Hash Join  (cost=112.75..307504.97 rows=2024869 width=355) (actual time=53.995..246443.811 rows=2020061 loops=1)
  Hash Cond: ("outer".uid = "inner".uid)
  ->  Seq Scan on t1  (cost=0.00..261792.01 rows=2033001 width=338) (actual time=19.342..234304.499 rows=2033001 loops=1)
  ->  Hash  (cost=110.20..110.20 rows=1020 width=53) (actual time=4.853..4.853 rows=1020 loops=1)
        ->  Seq Scan on t2  (cost=0.00..110.20 rows=1020 width=53) (actual time=0.017..2.586 rows=1020 loops=1)
Total runtime: 2777844.892 ms

I removed all unnecessary indexes on t1 before running the query (I left the
index on uid and the multicolumn index containing the updated field).
I believe the multicolumn-functional-index computation is taking some time
here, isn't it?
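If that is plausible, I suppose I could test it with something like this
(index name illustrative; since DDL is transactional, the ROLLBACK restores
the index):

  BEGIN;
  DROP INDEX the_functional_index;   -- hypothetical name
  -- re-run the UPDATE here and compare timings
  ROLLBACK;

Regards
--
Arnaud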
Arnaud,
Have you run "ANALYZE" on the table after creating index?
Also make sure that "#effective_cache_size" is set properly. A higher value makes it more likely to use index scans.
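For example (values illustrative only; in 8.1 effective_cache_size is
expressed in 8 kB disk pages, so size it to roughly the OS cache you expect
to be available):

  ANALYZE t1;                      -- refresh statistics after index changes
  SHOW effective_cache_size;       -- check the current value
  -- in postgresql.conf, e.g. for ~1 GB of expected OS cache:
  -- effective_cache_size = 131072     (131072 * 8 kB = 1 GB)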
Thanks
asif ali
Arnaud Lesauvage <thewild@freesurf.fr> wrote:
> Indeed, the new query does not perform that well:
> [...]
> Total runtime: 2777844.892 ms
> I removed all unnecessary indexes on t1 before running the query (I left
> the index on uid and the multicolumn index containing the updated field).
> I believe the multicolumn-functional-index computation is taking some
> time here, isn't it?
Arnaud Lesauvage <thewild@freesurf.fr> writes:
> Indeed, the new query does not perform that well:
> Hash Join  (cost=112.75..307504.97 rows=2024869 width=355) (actual time=53.995..246443.811 rows=2020061 loops=1)
> ...
> Total runtime: 2777844.892 ms
> I removed all unnecessary indexes on t1 before running the query (I left
> the index on uid and the multicolumn index containing the updated field).
> I believe the multicolumn-functional-index computation is taking some
> time here, isn't it?

Given that the plan itself only takes 246 sec, there's *something*
associated with row insertion that's eating the other 2500+ seconds.
Either index entry computation or constraint checking ...

			regards, tom lane
asif ali wrote:
> Arnaud,
> Have you run ANALYZE on the table after creating the index?

Yes, I have!

> Also make sure that effective_cache_size is set properly; a higher value
> makes the planner more likely to use index scans.

It is set to 50000 (in 8.1 that is counted in 8 kB pages, so roughly
400 MB). I thought this would be enough, and maybe even too much!

Thanks for your advice!

--
Arnaud
But he's using 8.1.4 -- in that version, an explain analyze would list the
time taken to go through triggers, so the fact that we don't see any of
those lines means that it can't be constraint checking. Wouldn't it have to
be the index update overhead?
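(When triggers do fire, those lines appear after the plan, roughly of this
shape -- the name and numbers here are invented for illustration:

  Trigger for constraint some_fk: time=18213.141 calls=2020061

Nothing like that shows up in the output above.)

-- Mark

On Wed, 2006-12-13 at 11:46 -0500, Tom Lane wrote:
> Given that the plan itself only takes 246 sec, there's *something*
> associated with row insertion that's eating the other 2500+ seconds.
> Either index entry computation or constraint checking ...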
Mark Lewis <mark.lewis@mir3.com> writes:
> But he's using 8.1.4 -- in that version, an explain analyze would list
> the time taken to go through triggers, so the fact that we don't see any
> of those lines means that it can't be constraint checking. Wouldn't it
> have to be the index update overhead?

Well, it can't be foreign key checking. Could have been an expensive
function in a CHECK constraint, maybe ...

			regards, tom lane
Tom Lane wrote:
> Given that the plan itself only takes 246 sec, there's *something*
> associated with row insertion that's eating the other 2500+ seconds.
> Either index entry computation or constraint checking ...

There is an insert trigger (but here I am only updating the data), and a
multicolumn functional index. That's all I can think of.
I must be missing something, so here is the full table description. The
field I am updating is incluredansstats. The field I am joining on is
userinternalid.

CREATE TABLE statistiques.log
(
  gid serial NOT NULL,
  userinternalid character(32),
  ip character varying(255),
  browser character varying(255),
  fichier character varying(255),
  querystring text,
  page character varying(255),
  useridentity character varying(100),
  incluredansstats boolean NOT NULL DEFAULT true,
  date character varying,
  heure character varying,
  dateformatee timestamp without time zone,
  sessionid character(32),
  sortindex integer,
  CONSTRAINT log_pkey PRIMARY KEY (gid)
)
WITHOUT OIDS;
ALTER TABLE statistiques.log OWNER TO postgres;

CREATE INDEX idx_page_datemonth_incluredansstats
  ON statistiques.log
  USING btree
  (page, date_trunc('month'::text, dateformatee), incluredansstats);

CREATE INDEX idx_userinternalid
  ON statistiques.log
  USING btree
  (userinternalid);

CREATE INDEX idx_userinternalid_page_datemonth
  ON statistiques.log
  USING btree
  (userinternalid, page, date_trunc('month'::text, dateformatee));

CREATE TRIGGER parse_log_trigger
  BEFORE INSERT
  ON statistiques.log
  FOR EACH ROW
  EXECUTE PROCEDURE statistiques.parse_log_trigger();

This was a "one-shot" query, so I don't really mind it being slow, but if
you want I can still troubleshoot it!
Arnaud Lesauvage <thewild@freesurf.fr> writes:
> I must be missing something, so here is the full table description.

It looks pretty harmless, except for

> CREATE TRIGGER parse_log_trigger
>   BEFORE INSERT
>   ON statistiques.log
>   FOR EACH ROW
>   EXECUTE PROCEDURE statistiques.parse_log_trigger();

It seems the time must be going into this trigger function. What does it do?

			regards, tom lane
Tom Lane wrote:
> It seems the time must be going into this trigger function. What
> does it do?

A lot of things! Indeed, if it runs it will hurt performance very badly
(table lookups, string manipulation, etc.)!
But it should only be triggered on INSERTs, and I am doing an UPDATE!
I can post the function's body if you want.

Regards
--
Arnaud
Arnaud Lesauvage <thewild@freesurf.fr> writes:
> Tom Lane wrote:
>> It seems the time must be going into this trigger function. What
>> does it do?

> A lot of things! Indeed, if it runs it will hurt performance very badly
> (table lookups, string manipulation, etc.)!
> But it should only be triggered on INSERTs, and I am doing an UPDATE!

Doh, right, I obviously still need to ingest more caffeine this morning.

I think the conclusion must be that there was just too much I/O to be done
to update all the rows. Have you done any tuning of shared_buffers and so
forth? I recall having seen cases where update performance went bad as soon
as the upper levels of a large index no longer fit into shared_buffers ...
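(In 8.1 shared_buffers is counted in 8 kB buffers, and changing it means
editing postgresql.conf and restarting; purely as an illustration of the
knob, not a sizing recommendation:

  shared_buffers = 50000        # 50000 * 8 kB = ~400 MB

)

			regards, tom lane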
Tom Lane wrote:
> I think the conclusion must be that there was just too much I/O to be
> done to update all the rows. Have you done any tuning of shared_buffers
> and so forth? I recall having seen cases where update performance went
> bad as soon as the upper levels of a large index no longer fit into
> shared_buffers ...

Yes, that's probably it. I think my RAID1 array's performance is very bad.
I am switching to brand new hardware next week; I am quite confident that
this will solve many problems.

Thanks for helping!

Regards
--
Arnaud
Out of curiosity, how hard would it be to modify the output of EXPLAIN
ANALYZE when doing an insert/update to include the index update times
and/or non-FK constraint checking times and/or the table row update times?
Or any other numbers that might be useful in circumstances like this. I'm
wondering if it's possible to shed some light on the remaining dark shadows
of PG performance troubleshooting.

-- Mark Lewis

On Thu, 2006-12-14 at 11:19 -0500, Tom Lane wrote:
> I think the conclusion must be that there was just too much I/O to be
> done to update all the rows. [...]
Mark Lewis <mark.lewis@mir3.com> writes:
> Out of curiosity, how hard would it be to modify the output of EXPLAIN
> ANALYZE when doing an insert/update to include the index update times
> and/or non-FK constraint checking times and/or the table row update
> times?

I don't think it'd help much --- in an example like this, other tools like
vmstat would be more likely to be useful in investigating the problem.
There's also the problem that EXPLAIN ANALYZE overhead is already too high
--- another dozen gettimeofday() calls per row processed doesn't sound
appetizing.

			regards, tom lane