Thread: UPDATE with subquery too slow

UPDATE with subquery too slow

From
"Eric Jain"
Date:
I can't get the following statement to complete with reasonable time.
I've had it running for over ten hours without getting anywhere. I
suspect (hope) there may be a better way to accomplish what I'm trying
to do (set fields containing unique values to null):

  UPDATE requests
  SET session = NULL
  WHERE session IN
  (
    SELECT session
    FROM requests
    GROUP BY session
    HAVING COUNT(*) = 1
  );

Output of EXPLAIN:

  Nested Loop
  (cost=170350.16..305352.37 rows=33533 width=98)
  ->  HashAggregate
      (cost=170350.16..170350.16 rows=200 width=8)
      ->  Subquery Scan "IN_subquery"
          (cost=169728.12..170261.30 rows=35545 width=8)
          ->  HashAggregate
              (cost=169728.12..169905.85 rows=35545 width=8)
                Filter: (count(*) = 1)
              ->  Seq Scan on requests
                  (cost=0.00..139207.75 rows=6104075 width=8)
   ->  Index Scan using requests_session_idx on requests
       (cost=0.00..672.92 rows=168 width=106)
         Index Cond: (requests."session" = "outer"."session")

If I drop the index on requests(session):

  Hash Join
  (cost=170350.66..340414.12 rows=33533 width=98)
    Hash Cond: ("outer"."session" = "inner"."session")
    ->  Seq Scan on requests
        (cost=0.00..139207.75 rows=6104075 width=106)
    ->  Hash
        (cost=170350.16..170350.16 rows=200 width=8)
         ->  HashAggregate
             (cost=170350.16..170350.16 rows=200 width=8)
               ->  Subquery Scan "IN_subquery"
                   (cost=169728.12..170261.30 rows=35545 width=8)
                     ->  HashAggregate
                         (cost=169728.12..169905.85 rows=35545 width=8)
                           Filter: (count(*) = 1)
                           ->  Seq Scan on requests
                               (cost=0.00..139207.75 rows=6104075
width=8)

The subquery itself requires 5-10 min to run on its own, and may return
several million rows.

Using EXISTS rather than IN (I'm using 7.4-RC2, not sure if IN queries
were already improved in this release):

  UPDATE requests
  SET session = NULL
  WHERE NOT EXISTS
  (
    SELECT r.session
    FROM requests r
    WHERE
      r.session = session
      AND NOT r.id = id
  );

With and without index:

  Result
  (cost=227855.74..415334.22 rows=8075449 width=101)
    One-Time Filter: (NOT $0)
    InitPlan
      ->  Seq Scan on requests r
          (cost=0.00..227855.74 rows=201 width=8)
            Filter: (("session" = "session") AND (id <> id))
    ->  Seq Scan on requests
        (cost=0.00..187478.49 rows=8075449 width=101)

I've been running this for more than an hour so far, and no end in
sight, either... Any ideas?


Re: UPDATE with subquery too slow

From
"Leeuw van der, Tim"
Date:
Hi,

This is not going to answer your question of course but did you already try to do this in 2 steps?

You said that the subquery itself doesn't take very long, so perhaps you can create a temporary table based on the
subquery,then in the update do a join with the temporary table? 

This might not be desirable in the end, but it might be useful just to check the performance of it.

And - isn't it an option to upgrade to 7.4.1 instead?


regards,

--Tim

THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the
intendedrecipient. If you received this in error, please contact the sender and delete the e-mail and its attachments
fromall computers.  



-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org]On Behalf Of Eric Jain
Sent: dinsdag 17 februari 2004 13:38
To: pgsql-performance
Subject: [PERFORM] UPDATE with subquery too slow


I can't get the following statement to complete with reasonable time.
I've had it running for over ten hours without getting anywhere. I
suspect (hope) there may be a better way to accomplish what I'm trying
to do (set fields containing unique values to null):

  UPDATE requests
  SET session = NULL
  WHERE session IN
  (
    SELECT session
    FROM requests
    GROUP BY session
    HAVING COUNT(*) = 1
  );

[...]


Re: UPDATE with subquery too slow

From
Kevin Brown
Date:
Eric Jain wrote:
> I can't get the following statement to complete with reasonable time.
> I've had it running for over ten hours without getting anywhere. I
> suspect (hope) there may be a better way to accomplish what I'm trying
> to do (set fields containing unique values to null):

[...]

> Using EXISTS rather than IN (I'm using 7.4-RC2, not sure if IN queries
> were already improved in this release):
>
>   UPDATE requests
>   SET session = NULL
>   WHERE NOT EXISTS
>   (
>     SELECT r.session
>     FROM requests r
>     WHERE
>       r.session = session
>       AND NOT r.id = id
>   );

I suppose you could try:

  UPDATE requests
  SET session = NULL
  WHERE EXISTS
  (
    SELECT r.session
    FROM requests r
    WHERE
      r.session = session
      GROUP BY r.session
      HAVING count(*) = 1
  );

but I don't know that you'll get much different results than your
version.


--
Kevin Brown                          kevin@sysexperts.com

Re: UPDATE with subquery too slow

From
"Eric Jain"
Date:
> I can't get the following statement to complete with reasonable time.

Upgraded to 7.4.1, and realized that NOT IN is far more efficient than
IN, EXISTS or NOT EXISTS, at least for the amount and distribution of
data that I have. Here are some numbers from before and after performing
the problematic clean up operation:

                        | Before    | After
------------------------+-----------+-----------
COUNT(*)                | 6'104'075 | 6'104'075
COUNT(session)          | 5'945'272 | 3'640'659
COUNT(DISTINCT session) | 2'865'570 |   560'957

The following query completes within less than three hours on a machine
with a high load, versa many many hours for any of the alternatives:

UPDATE requests
SET session = NULL
WHERE session NOT IN
(
  SELECT r.session
  FROM requests r
  WHERE r.session IS NOT NULL
  GROUP BY r.session
  HAVING COUNT(*) > 1
);

Note that in order to correctly reverse an IN subquery, IS NOT NULL
needs to be added.

Interestingly, the query planner believes that using EXISTS would be
more efficient than NOT IN, and IN only slightly less efficient; I
assume the query planner is not able to accurately estimate the number
of rows returned by the subquery.

EXISTS            351'511
NOT IN            376'577
IN                386'780
LEFT JOIN      18'263'826
NOT EXISTS  7'241'815'330