best way to do bulk delete? - Mailing list pgsql-novice

From pg noob
Subject best way to do bulk delete?
Date
Msg-id CAPNY-2W9a1qn+hAb=x8+NHn6kRaGyUjMg1tY+4zBNTnYaVVepA@mail.gmail.com
Responses Re: best way to do bulk delete?  (Merlin Moncure <mmoncure@gmail.com>)
List pgsql-novice

Greetings,

I have a recurring use case where I occasionally need to delete a large number of rows from multiple tables as quickly as possible --
something like a bulk delete.

A typical example would be deleting a few hundred thousand rows at once from a set of tables each containing 1 to 2 million rows,
but in the worst case it could be as many as 1 million rows from a set of tables each containing 4 to 5 million rows.

These tables, of course, have indexes on them, as well as foreign-key constraints with cascading deletes to other tables.

I can't simply truncate the tables because the rows being deleted are only a subset of the data they contain.

These tables see heavy insert/update/delete activity at the same time, though mostly not on the same set of rows that
is being bulk-deleted (there may be some update activity that touches those rows and hasn't yet quiesced).

What is the best way of going about this?

I've considered a few options.

One option is to open a single transaction, issue delete statements that remove huge numbers of rows from each
table in question (probably with a DELETE ... USING query), and then commit.
My concern with this approach is that it would hold a huge number of row locks while in progress, could take a long
time to complete, and could introduce deadlocks if it competes with other updates that acquire row locks in a different order.
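The single-transaction approach might be sketched roughly like this (the "events" table and the "doomed_ids" staging table holding the keys to remove are hypothetical names, not from my actual schema):

```sql
BEGIN;

-- Delete every row whose key appears in the staging table;
-- cascading foreign keys fan this out to the child tables.
DELETE FROM events e
USING doomed_ids d
WHERE e.id = d.id;

-- ...repeat for the other affected tables...

COMMIT;
```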

Another option would be to delete a single row, or a smaller batch of rows, per transaction, committing and then
repeating in a loop.
This has the advantage that if a transaction hits an error and has to be rolled back, the entire delete operation
doesn't have to be redone, and row locks aren't held in such numbers or for as long.  The drawback is that I believe
this approach would be a lot slower.
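The batched variant could look something like the following, again with hypothetical table names, using a data-modifying CTE (available since PostgreSQL 9.1) so each committed transaction consumes one batch of keys from the staging table:

```sql
BEGIN;
WITH batch AS (
    -- Pull one batch of keys out of the staging table...
    DELETE FROM doomed_ids
    WHERE id IN (SELECT id FROM doomed_ids LIMIT 10000)
    RETURNING id
)
-- ...and delete the matching rows in the same statement.
DELETE FROM events e
USING batch b
WHERE e.id = b.id;
COMMIT;
-- Repeat from the client until the DELETE reports 0 rows.
```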

And finally, I've considered the idea of using COPY to copy the data that needs to be kept into temporary tables,
truncating the original tables, and then copying the data back.  I believe this would be the most efficient way to
do the delete, but the implementation is quite a bit more complicated than the first two options I described,
and it has implications for handling error scenarios or database/system crashes while the operation is in progress.
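As a sketch of what I have in mind (hypothetical names again; note that TRUNCATE takes an ACCESS EXCLUSIVE lock, blocks the concurrent activity I mentioned, and refuses to run while other tables reference this one by foreign key unless CASCADE is used):

```sql
BEGIN;

-- Stash only the surviving rows in a temp table.
CREATE TEMP TABLE events_keep AS
    SELECT e.*
    FROM events e
    WHERE NOT EXISTS (SELECT 1 FROM doomed_ids d WHERE d.id = e.id);

-- TRUNCATE is transactional, so a crash here rolls everything back.
TRUNCATE events;

INSERT INTO events SELECT * FROM events_keep;

COMMIT;
```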

Is there a better way that I haven't thought of?  How would you handle this?

Thank you in advance for any ideas you might have.
