UPDATE with subquery too slow - Mailing list pgsql-performance

From Eric Jain
Subject UPDATE with subquery too slow
Date
Msg-id 00ed01c3f552$ec054e80$c300000a@caliente
Whole thread Raw
Responses Re: UPDATE with subquery too slow
List pgsql-performance
I can't get the following statement to complete with reasonable time.
I've had it running for over ten hours without getting anywhere. I
suspect (hope) there may be a better way to accomplish what I'm trying
to do (set fields containing unique values to null):

  UPDATE requests
  SET session = NULL
  WHERE session IN
  (
    SELECT session
    FROM requests
    GROUP BY session
    HAVING COUNT(*) = 1
  );

Output of EXPLAIN:

  Nested Loop
  (cost=170350.16..305352.37 rows=33533 width=98)
  ->  HashAggregate
      (cost=170350.16..170350.16 rows=200 width=8)
      ->  Subquery Scan "IN_subquery"
          (cost=169728.12..170261.30 rows=35545 width=8)
          ->  HashAggregate
              (cost=169728.12..169905.85 rows=35545 width=8)
                Filter: (count(*) = 1)
              ->  Seq Scan on requests
                  (cost=0.00..139207.75 rows=6104075 width=8)
   ->  Index Scan using requests_session_idx on requests
       (cost=0.00..672.92 rows=168 width=106)
         Index Cond: (requests."session" = "outer"."session")

If I drop the index on requests(session):

  Hash Join
  (cost=170350.66..340414.12 rows=33533 width=98)
    Hash Cond: ("outer"."session" = "inner"."session")
    ->  Seq Scan on requests
        (cost=0.00..139207.75 rows=6104075 width=106)
    ->  Hash
        (cost=170350.16..170350.16 rows=200 width=8)
         ->  HashAggregate
             (cost=170350.16..170350.16 rows=200 width=8)
               ->  Subquery Scan "IN_subquery"
                   (cost=169728.12..170261.30 rows=35545 width=8)
                     ->  HashAggregate
                         (cost=169728.12..169905.85 rows=35545 width=8)
                           Filter: (count(*) = 1)
                           ->  Seq Scan on requests
                               (cost=0.00..139207.75 rows=6104075
width=8)

The subquery itself requires 5-10 min to run on its own, and may return
several million rows.

Using EXISTS rather than IN (I'm using 7.4-RC2, not sure if IN queries
were already improved in this release):

  UPDATE requests
  SET session = NULL
  WHERE NOT EXISTS
  (
    SELECT r.session
    FROM requests r
    WHERE
      r.session = session
      AND NOT r.id = id
  );

With and without index:

  Result
  (cost=227855.74..415334.22 rows=8075449 width=101)
    One-Time Filter: (NOT $0)
    InitPlan
      ->  Seq Scan on requests r
          (cost=0.00..227855.74 rows=201 width=8)
            Filter: (("session" = "session") AND (id <> id))
    ->  Seq Scan on requests
        (cost=0.00..187478.49 rows=8075449 width=101)

I've been running this for more than an hour so far, and no end in
sight, either... Any ideas?


pgsql-performance by date:

Previous
From: Shridhar Daithankar
Date:
Subject: Re: Slow response of PostgreSQL
Next
From: "Leeuw van der, Tim"
Date:
Subject: Re: UPDATE with subquery too slow