> I can't get the following statement to complete with reasonable time.
Upgraded to 7.4.1, and realized that NOT IN is far more efficient than
IN, EXISTS or NOT EXISTS, at least for the amount and distribution of
data that I have. Here are some numbers from before and after performing
the problematic clean up operation:
| Before | After
------------------------+-----------+-----------
COUNT(*) | 6'104'075 | 6'104'075
COUNT(session) | 5'945'272 | 3'640'659
COUNT(DISTINCT session) | 2'865'570 | 560'957
The following query completes within less than three hours on a machine
with a high load, versa many many hours for any of the alternatives:
UPDATE requests
SET session = NULL
WHERE session NOT IN
(
SELECT r.session
FROM requests r
WHERE r.session IS NOT NULL
GROUP BY r.session
HAVING COUNT(*) > 1
);
Note that in order to correctly reverse an IN subquery, IS NOT NULL
needs to be added.
Interestingly, the query planner believes that using EXISTS would be
more efficient than NOT IN, and IN only slightly less efficient; I
assume the query planner is not able to accurately estimate the number
of rows returned by the subquery.
EXISTS 351'511
NOT IN 376'577
IN 386'780
LEFT JOIN 18'263'826
NOT EXISTS 7'241'815'330