not in vs not exists - vastly diferent performance - Mailing list pgsql-sql

From Iain
Subject not in vs not exists - vastly diferent performance
Date
Msg-id 013d01c3ca89$7307d7d0$7201a8c0@mst1x5r347kymb
Whole thread Raw
Responses Re: not in vs not exists - vastly diferent performance
List pgsql-sql
Hi All,
 
I found this interesting and thought I'd offer it up for comment.
 
I have the following delete:
 
delete from tableB where id not in (select id from tableA);
 
There are about 100,000 rows in table A, and about 500,000 in table B. id is indexed on both tables. This is just a development DB, and I wanted to clean it up so I could put in RI constraints. Somewhere along the line, records have been deleted from tableA leaving orphans in tableB.
 
I launched the query yesterday afternoon, and it hadn't returned as of this morning, 15 hours later. Running top showed that CPU utilization was running close to 100%, and the disk was not busy at all. Anyway, I killed it and did some testing:
 
Analyse revealed this plan, and varying random_page_cost between 1 and 4 didn't affect it:
 
 Seq Scan on tableB  (cost=0.00..1003619849.56 rows=251513 width=6)
   Filter: (NOT (subplan))
   SubPlan
     ->  Seq Scan on tableA (cost=0.00..3738.64 rows=100664 width=4)
recoding the delete to use not exists as follows:
 
delete from tableB where not exists (select id from tableA where tableA.id = tableB.id);
 
Gave this plan:
 
Seq Scan on tableB  (cost=0.00..719522.41 rows=236131 width=6)
   Filter: (NOT (subplan))
   SubPlan
     ->  Index Scan using tableB_pk on tableA(cost=0.00..3.01 rows=2 width=4)
           Index Cond: ((id)::integer = ($0)::integer)
This deleted 1200 rows in about 2 seconds. Much better.
 
Anyway, I was a little surprised that "not in" chose to use a seq scan on the tableA in this case. I had imagined, given that statistics were up to date and the size of the table, that the plan would have been similar to that generated by not exists, or perhaps would have used a hash table based on the tableA ids.
 
Something to think about anyway.
Regards
Iain

pgsql-sql by date:

Previous
From: alvaro@audifarma.com.co
Date:
Subject: MD5 encrypt
Next
From: "Chris Travers"
Date:
Subject: Re: Distributed keys