Re: UPDATE with subquery too slow - Mailing list pgsql-performance

From Eric Jain
Subject Re: UPDATE with subquery too slow
Date
Msg-id 013401c3f653$14f95970$c300000a@caliente
Whole thread Raw
In response to UPDATE with subquery too slow  ("Eric Jain" <Eric.Jain@isb-sib.ch>)
List pgsql-performance
> I can't get the following statement to complete with reasonable time.

Upgraded to 7.4.1, and realized that NOT IN is far more efficient than
IN, EXISTS or NOT EXISTS, at least for the amount and distribution of
data that I have. Here are some numbers from before and after performing
the problematic clean up operation:

                        | Before    | After
------------------------+-----------+-----------
COUNT(*)                | 6'104'075 | 6'104'075
COUNT(session)          | 5'945'272 | 3'640'659
COUNT(DISTINCT session) | 2'865'570 |   560'957

The following query completes within less than three hours on a machine
with a high load, versa many many hours for any of the alternatives:

UPDATE requests
SET session = NULL
WHERE session NOT IN
(
  SELECT r.session
  FROM requests r
  WHERE r.session IS NOT NULL
  GROUP BY r.session
  HAVING COUNT(*) > 1
);

Note that in order to correctly reverse an IN subquery, IS NOT NULL
needs to be added.

Interestingly, the query planner believes that using EXISTS would be
more efficient than NOT IN, and IN only slightly less efficient; I
assume the query planner is not able to accurately estimate the number
of rows returned by the subquery.

EXISTS            351'511
NOT IN            376'577
IN                386'780
LEFT JOIN      18'263'826
NOT EXISTS  7'241'815'330


pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: Optimizer difference using function index between 7.3 and 7.4
Next
From: Josh Berkus
Date:
Subject: Forcing filter/join order?