Thread: UPDATE with subquery too slow
I can't get the following statement to complete with reasonable time. I've had it running for over ten hours without getting anywhere. I suspect (hope) there may be a better way to accomplish what I'm trying to do (set fields containing unique values to null): UPDATE requests SET session = NULL WHERE session IN ( SELECT session FROM requests GROUP BY session HAVING COUNT(*) = 1 ); Output of EXPLAIN: Nested Loop (cost=170350.16..305352.37 rows=33533 width=98) -> HashAggregate (cost=170350.16..170350.16 rows=200 width=8) -> Subquery Scan "IN_subquery" (cost=169728.12..170261.30 rows=35545 width=8) -> HashAggregate (cost=169728.12..169905.85 rows=35545 width=8) Filter: (count(*) = 1) -> Seq Scan on requests (cost=0.00..139207.75 rows=6104075 width=8) -> Index Scan using requests_session_idx on requests (cost=0.00..672.92 rows=168 width=106) Index Cond: (requests."session" = "outer"."session") If I drop the index on requests(session): Hash Join (cost=170350.66..340414.12 rows=33533 width=98) Hash Cond: ("outer"."session" = "inner"."session") -> Seq Scan on requests (cost=0.00..139207.75 rows=6104075 width=106) -> Hash (cost=170350.16..170350.16 rows=200 width=8) -> HashAggregate (cost=170350.16..170350.16 rows=200 width=8) -> Subquery Scan "IN_subquery" (cost=169728.12..170261.30 rows=35545 width=8) -> HashAggregate (cost=169728.12..169905.85 rows=35545 width=8) Filter: (count(*) = 1) -> Seq Scan on requests (cost=0.00..139207.75 rows=6104075 width=8) The subquery itself requires 5-10 min to run on its own, and may return several million rows. Using EXISTS rather than IN (I'm using 7.4-RC2, not sure if IN queries were already improved in this release): UPDATE requests SET session = NULL WHERE NOT EXISTS ( SELECT r.session FROM requests r WHERE r.session = session AND NOT r.id = id ); With and without index: Result (cost=227855.74..415334.22 rows=8075449 width=101) One-Time Filter: (NOT $0) InitPlan -> Seq Scan on requests r (cost=0.00..227855.74 rows=201 width=8) Filter: (("session" = "session") AND (id <> id)) -> Seq Scan on requests (cost=0.00..187478.49 rows=8075449 width=101) I've been running this for more than an hour so far, and no end in sight, either... Any ideas?
Hi, This is not going to answer your question of course but did you already try to do this in 2 steps? You said that the subquery itself doesn't take very long, so perhaps you can create a temporary table based on the subquery,then in the update do a join with the temporary table? This might not be desirable in the end, but it might be useful just to check the performance of it. And - isn't it an option to upgrade to 7.4.1 instead? regards, --Tim THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intendedrecipient. If you received this in error, please contact the sender and delete the e-mail and its attachments fromall computers. -----Original Message----- From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org]On Behalf Of Eric Jain Sent: dinsdag 17 februari 2004 13:38 To: pgsql-performance Subject: [PERFORM] UPDATE with subquery too slow I can't get the following statement to complete with reasonable time. I've had it running for over ten hours without getting anywhere. I suspect (hope) there may be a better way to accomplish what I'm trying to do (set fields containing unique values to null): UPDATE requests SET session = NULL WHERE session IN ( SELECT session FROM requests GROUP BY session HAVING COUNT(*) = 1 ); [...]
Eric Jain wrote: > I can't get the following statement to complete with reasonable time. > I've had it running for over ten hours without getting anywhere. I > suspect (hope) there may be a better way to accomplish what I'm trying > to do (set fields containing unique values to null): [...] > Using EXISTS rather than IN (I'm using 7.4-RC2, not sure if IN queries > were already improved in this release): > > UPDATE requests > SET session = NULL > WHERE NOT EXISTS > ( > SELECT r.session > FROM requests r > WHERE > r.session = session > AND NOT r.id = id > ); I suppose you could try: UPDATE requests SET session = NULL WHERE EXISTS ( SELECT r.session FROM requests r WHERE r.session = session GROUP BY r.session HAVING count(*) = 1 ); but I don't know that you'll get much different results than your version. -- Kevin Brown kevin@sysexperts.com
> I can't get the following statement to complete with reasonable time. Upgraded to 7.4.1, and realized that NOT IN is far more efficient than IN, EXISTS or NOT EXISTS, at least for the amount and distribution of data that I have. Here are some numbers from before and after performing the problematic clean up operation: | Before | After ------------------------+-----------+----------- COUNT(*) | 6'104'075 | 6'104'075 COUNT(session) | 5'945'272 | 3'640'659 COUNT(DISTINCT session) | 2'865'570 | 560'957 The following query completes within less than three hours on a machine with a high load, versa many many hours for any of the alternatives: UPDATE requests SET session = NULL WHERE session NOT IN ( SELECT r.session FROM requests r WHERE r.session IS NOT NULL GROUP BY r.session HAVING COUNT(*) > 1 ); Note that in order to correctly reverse an IN subquery, IS NOT NULL needs to be added. Interestingly, the query planner believes that using EXISTS would be more efficient than NOT IN, and IN only slightly less efficient; I assume the query planner is not able to accurately estimate the number of rows returned by the subquery. EXISTS 351'511 NOT IN 376'577 IN 386'780 LEFT JOIN 18'263'826 NOT EXISTS 7'241'815'330