Re: How to speed up delete where not in - Mailing list pgsql-general

From David Rowley
Subject Re: How to speed up delete where not in
Date
Msg-id CAKJS1f_C5dWqpJCefb4T3RTftRGdxtrCKVJaob18wyh+H=KbJQ@mail.gmail.com
Whole thread Raw
In response to How to speed up delete where not in  ("Andrus" <kobruleht2@hot.ee>)
Responses Re: How to speed up delete where not in
How to drop user if objects depend on it
List pgsql-general
On 26 September 2015 at 19:53, Andrus <kobruleht2@hot.ee> wrote:
Hi!


I'm looking for a way to delete records which do not have child rows on big tables where lot of rows needs to be deleted. Both tables have lot of other foreign key references.


Document headers are in omdok table:

   create table omdok ( dokumnr serial primary key, ... );

Document rows are in omrid table

   CREATE TABLE omrid
   (
     id serial NOT NULL,
     reanr serial NOT NULL,
     dokumnr integer NOT NULL,
     CONSTRAINT omrid_pkey PRIMARY KEY (id),
     CONSTRAINT omrid_dokumnr_fkey FOREIGN KEY (dokumnr)
         REFERENCES omdok (dokumnr) MATCH SIMPLE
         ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE INITIALLY IMMEDIATE,
     ....
   );


I tried

   delete from omdok where dokumnr not in      (select dokumnr from omrid)

Query it is running currently 15 hours and is still running.
postgres.exe is using 50% CPU  all the time (this is 2 core CPU).

   explain delete from omdok where dokumnr not in      (select dokumnr from omrid)

returns:

   "Delete  (cost=0.00..21971079433.34 rows=220815 width=6)"
   "  ->  Seq Scan on omdok  (cost=0.00..21971079433.34 rows=220815 width=6)"
   "        Filter: (NOT (SubPlan 1))"
   "        SubPlan 1"
   "          ->  Materialize  (cost=0.00..94756.92 rows=1897261 width=4)"
   "                ->  Seq Scan on omrid  (cost=0.00..77858.61 rows=1897261 width=4)"

- How to delete parents without child rows fast?

if you write the query as with a NOT EXISTS, rather than a NOT IN() it should complete much more quickly.

It's important to know that the semantics of NOT IN are likely not at all what you think:

For example, in the following query, would you expect it to return the row with a.a = 3 ?

select * from (values(1),(2),(3)) a(a) WHERE a NOT IN(SELECT b FROM (VALUES(NULL),(1),(2)) b(b));

The presence of NULL causes this to not behave the way you might think, yet it works exactly the way the SQL standard dictates.

You could think of this as equivalent to writing:

select * from (values(1),(2),(3)) a(a) WHERE a.a <> NULL AND a.a <> 1 AND a.a <> 2;

And since a.a <> NULL is 'unknown', this causes the entire WHERE clause to be false, therefore nothing matches.

For this reason PostgreSQL does not optimise NOT IN() the same way as it optimises NOT EXISTS().

If you write the query as:

delete from omdok where not exists(select 1 from omrid where omdok.dokumnr = omrid.dokumnr);

then you might see it execute in a few seconds. Perhaps you should consider cancelling the current query, perhaps perform a VACUUM on omdoc after cancelling, and then run the NOT EXISTS version instead.

I mocked up your tables locally, and populated them with the same number of records as your row estimates in the EXPLAIN you pasted and I got:

test=# delete from omdok where not exists (select 1 from omrid where omdok.dokumn = omrid.dokumnr);
DELETE 0
Time: 1698.233 ms

Whereas with the NOT IN() I cancelled it after 10 minutes.

Regards

David Rowley 

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

pgsql-general by date:

Previous
From: "Andrus"
Date:
Subject: How to speed up delete where not in
Next
From: Melvin Davidson
Date:
Subject: Re: How to speed up delete where not in