Thread: DELETE FROM takes forever
Hi

I'm trying to do a DELETE FROM on my large table (about 800 million
rows) based on the contents of another, moderately large table (about
110 million rows). The command I'm using is:

DELETE FROM records WHERE id NOT IN (SELECT id FROM unique_records);

This process ran for about two weeks before I decided to stop it -- it
was dragging down the DB server. I can understand long-running
processes, but two weeks seems a bit much even for a big table.

Is this the best way to approach the problem? Is there a better way?

Some background: The server is version 8.3, running nothing but Pg.
The 'records' table has 'id' as its primary key, and one other index
on another column. The table is referenced by just about every other
table in my DB (about 15 other tables) via foreign key constraints,
which I don't want to break (which is why I'm deleting rows rather
than just recreating the table). Most of the dependent tables have ON
DELETE CASCADE. The 'unique_records' table is a temp table I got via
something like:

SELECT DISTINCT (other_column) id INTO unique_records FROM records

Thanks very much!

Josh Leder
Josh <slushie@gmail.com> writes:
> I'm trying to do a DELETE FROM on my large table (about 800 million
> rows) based on the contents of another, moderately large table (about
> 110 million rows). The command I'm using is:
> DELETE FROM records WHERE id NOT IN (SELECT id FROM unique_records);
> This process ran for about two weeks before I decided to stop it -- it
> was dragging down the DB server. I can understand long-running
> processes, but two weeks seems a bit much even for a big table.
> Is this the best way to approach the problem? Is there a better way?
> Some background: The server is version 8.3, running nothing but Pg.
> The 'records' table has 'id' as its primary key, and one other index
> on another column. The table is referenced by just about every other
> table in my DB (about 15 other tables) via foreign key constraints,

Hmm ... do all of those referencing tables have indexes on the
referencing columns? It seems plausible that the time is going into
seqscan searches for referencing rows.

You might try doing EXPLAIN ANALYZE of this same delete for a limited
number of rows (maybe 1000 or so) so that you could see what plan
you're getting and where the time really goes. I think 8.3 had the
ability to break out time spent in triggers, so if the problem is the
FK propagation, EXPLAIN ANALYZE would show it.

Also, the NOT IN is probably going to suck performance-wise no matter
what, for such large numbers of rows. Converting to NOT EXISTS might
help some, though I don't remember right now how smart 8.3 is about
either.

        regards, tom lane
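For reference, a sketch of what those two suggestions might look like
with the table names from this thread. The LIMIT wrapper and the
BEGIN/ROLLBACK are assumptions on my part, not Tom's wording; the
rollback matters because EXPLAIN ANALYZE really executes the statement,
so the sample delete would otherwise stick:

  -- The NOT EXISTS form of the same delete (shown for the syntax only):
  DELETE FROM records r
   WHERE NOT EXISTS (SELECT 1 FROM unique_records u WHERE u.id = r.id);

  -- Time a small sample to see the plan and any per-trigger time,
  -- then throw the work away:
  BEGIN;
  EXPLAIN ANALYZE
  DELETE FROM records
   WHERE id IN (
         SELECT r.id
           FROM records r
          WHERE NOT EXISTS (SELECT 1 FROM unique_records u WHERE u.id = r.id)
          LIMIT 1000
   );
  ROLLBACK;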
Are your IDs (in both tables) indexed? If so, ANALYZE the tables and
indexes. If not, create an index on each ID column -- that may help.

Post an explain plan of the query as well. My guess is that you are
getting a full table scan instead of an index scan, and checking 800
million rows against 110 million rows that way may take a lot of time.

Hope that helps,
Piotr
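Concretely, and assuming the names from the thread (records.id is
already the primary key, so the temp table is the one most likely to be
missing an index), that advice might look like:

  -- Index the temp table's id column and refresh statistics on both tables.
  CREATE INDEX unique_records_id_idx ON unique_records (id);
  ANALYZE unique_records;
  ANALYZE records;

  -- Look at the plan without actually running the delete:
  EXPLAIN DELETE FROM records
   WHERE id NOT IN (SELECT id FROM unique_records);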
Many of the tables do not have indexes on the FK, though a couple of
the biggest ones do. It does seem worth the time to put an index on
each of these tables, considering the few hundred hours I'm already
spending on the DELETE.

I've started the EXPLAIN ANALYZE but it will take a while, no doubt.
In the meantime I'm going to play with the NOT EXISTS angle; it's
something I hadn't considered.
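A sketch of what indexing one of those referencing columns could look
like. The child table and column names here are hypothetical, since the
thread doesn't list them; CONCURRENTLY avoids holding a long write lock
on the child table while the index builds:

  -- Hypothetical example for one of the ~15 referencing tables.
  -- Note: CONCURRENTLY cannot run inside a transaction block.
  CREATE INDEX CONCURRENTLY child_table_record_id_idx
      ON child_table (record_id);
  ANALYZE child_table;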
I'd be really inclined to do this incrementally, to trim out a few
thousand at a time, if at all possible. You wind up firing a whole lot
of foreign key constraint triggers to verify that everything's OK, and
this'll wind up being just huge.

What I might do in such a case is to construct a table that contains
all the key values that ought to get trimmed, likely:

  select id into records_to_delete from records
    where id not in (select id from unique_records);

  create index rtd_id on records_to_delete (id);

Then loop on the following set of queries:

  drop table if exists records_presently_being_deleted;

  select id into temp table records_presently_being_deleted
    from records_to_delete limit 5000;

  delete from records
    where id in (select id from records_presently_being_deleted)
      and id not in (select id from unique_records);

  delete from records_to_delete
    where id in (select id from records_presently_being_deleted);

That'll drop out 5000 records at a time, you'll have no
ultra-long-running transactions, and you'll get regular feedback that
it's doing work for you.
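One way the loop body above could be packaged so that each pass is its
own short transaction -- a sketch only, reusing the records_to_delete
and records_presently_being_deleted names from the previous message;
re-run it (e.g. from a script) until the count at the end reaches zero:

  BEGIN;

  DROP TABLE IF EXISTS records_presently_being_deleted;

  SELECT id INTO TEMP TABLE records_presently_being_deleted
    FROM records_to_delete LIMIT 5000;

  DELETE FROM records
   WHERE id IN (SELECT id FROM records_presently_being_deleted)
     -- belt-and-braces check carried over from the loop above; it is
     -- redundant given how records_to_delete was built, but harmless
     AND id NOT IN (SELECT id FROM unique_records);

  DELETE FROM records_to_delete
   WHERE id IN (SELECT id FROM records_presently_being_deleted);

  COMMIT;

  -- Progress check: when this reaches zero, the cleanup is done.
  SELECT count(*) AS rows_left FROM records_to_delete;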
Even DB2 and Oracle will take hellishly long times to perform large
scale deletes. What I do for a database just under 300 GB in size is
do deletes in groups of 10,000. So your where clause might look
something like:

  WHERE id NOT IN (SELECT id FROM unique_records fetch first 10000 rows only)

DB2 has a clause of "with ur" to specify it's OK to use dirty reads. I
am not sure if Postgres has this, it's been a while. The goal is to
make sure Postgres allows "dirty reads"; it prevents row locking. In
DB2 the query would look like:

  DELETE FROM records
   WHERE id NOT IN (SELECT id FROM unique_records fetch first 10000 rows only)
   with ur

Other tips that might enhance the performance: make sure the
unique_records table is indexed, even if it has a primary key. In some
cases the optimizer may choose an index to satisfy the select clause,
or it may do a table scan. Table scans are more costly than index
scans.

What's going to save you the real time is to break up your delete into
chunks. Every RDBMS logs the transactions, and each delete is a
transaction. See where this is going? Some RDBMSs allow you to turn
off "transactional logging", some don't. DB2 doesn't ((at least not
without more effort than reasonably necessary)), so I write my delete
queries to work on chunks at a time. The most I would recommend is
100,000 records deleted at once. Play with the timing and see what
works for you.

Hope this helps
-Rob
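A PostgreSQL-flavored version of that chunked delete might look like
the sketch below (my assumption, not something from Rob's message).
There is no "with ur" equivalent to worry about, since under
PostgreSQL's MVCC plain readers don't take row locks that block the
delete. Note that the limit has to apply to the set of rows being
deleted rather than go inside the NOT IN subquery, otherwise rows that
should be kept would be removed:

  -- Repeat until it reports DELETE 0.
  DELETE FROM records
   WHERE id IN (
         SELECT id
           FROM records
          WHERE id NOT IN (SELECT id FROM unique_records)
          LIMIT 10000
   );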
On 2011-02-10, Josh <slushie@gmail.com> wrote:
> I'm trying to do a DELETE FROM on my large table (about 800 million
> rows) based on the contents of another, moderately large table (about
> 110 million rows). The command I'm using is:
>
> DELETE FROM records WHERE id NOT IN (SELECT id FROM unique_records);
>
> This process ran for about two weeks before I decided to stop it -- it
> was dragging down the DB server. I can understand long-running
> processes, but two weeks seems a bit much even for a big table.

I find that scripting deletes of smaller numbers of records can help
here; long-running queries do bad things to the efficiency of postgres.

One strategy that could work for your task would be to create a temp
table first:

  create temp table planned_deletions as
    select id from records
    except
    select id from unique_records;

  create index badids on planned_deletions(id);

then repeatedly:

  delete from records
    where id in (select id from planned_deletions order by id limit 10000);

  delete from planned_deletions
    where id in (select id from planned_deletions order by id limit 10000);

until there are none left, possibly pausing a few seconds between each
slug if there is a heavy load on the server (that you were able to run
the query for two weeks suggests that there may not be).

> Is this the best way to approach the problem? Is there a better way?
>
> Some background: The server is version 8.3, running nothing but Pg.
> The 'records' table has 'id' as its primary key, and one other index
> on another column. The table is referenced by just about every other
> table in my DB (about 15 other tables) via foreign key constraints,
> which I don't want to break (which is why I'm not just recreating the
> table rather than deleting rows). Most of the dependent tables have ON
> DELETE CASCADE.

If you can accept the down-time, I would drop the constraints (if you
don't have them on file, do a pg_dump --schema-only and grep it for the
ADD CONSTRAINT commands, use sed or similar to create matching DROP
CONSTRAINT commands, and run them), then rebuild the table, then
reintroduce the constraints. Keep a copy of the grep output above.
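If dropping and recreating the constraints is the route taken, an
alternative to grepping pg_dump output is to pull the definitions
straight from the system catalogs. This is a sketch of my own, not
something from the thread; it uses only the standard pg_constraint
columns:

  -- List every foreign key constraint that references records, with
  -- its full definition, so matching DROP/ADD statements can be saved
  -- before anything is dropped.
  SELECT conrelid::regclass            AS child_table,
         conname                       AS constraint_name,
         pg_get_constraintdef(oid)     AS definition
    FROM pg_constraint
   WHERE confrelid = 'records'::regclass
     AND contype = 'f';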