Re: best way to do bulk delete? - Mailing list pgsql-novice

From Merlin Moncure
Subject Re: best way to do bulk delete?
Msg-id CAHyXU0wism0HkYj9A9=9AKD=07igQ8MZ3vuvfcZ6s3L8B_fing@mail.gmail.com
In response to best way to do bulk delete?  (pg noob <pgnube@gmail.com>)
Responses Re: best way to do bulk delete?  (pg noob <pgnube@gmail.com>)
List pgsql-novice
On Fri, Oct 28, 2011 at 9:30 AM, pg noob <pgnube@gmail.com> wrote:
>
> Greetings,
>
> I have an occasional recurring use case where I need to delete a large
> number of rows from multiple tables as quickly as possible,
> something like a bulk-delete.
>
> A typical example would be deleting a few hundred thousand rows at once from
> a set of tables each containing 1 to 2 million rows,
> but in a worst case scenario it could be as large as 1 million rows from a
> set of tables each containing 4 to 5 million rows.
>
> These tables of course have indexes on them as well as foreign key
> constraints and cascade deletes to other tables.
>
> I can't simply truncate the tables because the rows being deleted are
> subsets of the total amount of data in the tables.
>
> These tables have heavy insert/update/delete activity going on at the same
> time but mostly not on the same set of rows that
> is being bulk-deleted (though there may be some update activity going on
> which accesses those rows and hasn't yet quiesced).
>
> What is the best way of going about this?
>
> I've considered a few options.
>
> One option is to open a single transaction, issue delete statements that
> delete huge numbers of rows from each
> table in question (probably with a DELETE USING SQL query) and then commit
> the transaction.
> My concern with this approach is that it will hold a huge number of row
> locks while in progress and may take a long
> time to complete and could introduce deadlocks if it competes with other
> updates that have acquired row locks out of order.
>
> Another option would be to delete one row per transaction or a smaller set
> of rows as part of a transaction and then
> commit the transaction, repeating in a loop.
> This has the advantage that if the transaction hits an error and has to be
> rolled back it doesn't have to redo the entire
> delete operation again, and it doesn't hold as many row locks for as long a
> time.  The drawback is that I believe this approach
> would be a lot slower.
>
> And finally, I've considered the idea of using COPY to copy the data that
> needs to be kept to temporary tables,
> truncating the original tables and then copying the data back.  I believe
> this would be the most efficient way to
> do the delete but the implementation is quite a bit more complicated than
> the first two options I described,
> and has implications for how to deal with error scenarios or database/system
> crashes while the operation is in progress.
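
For what it's worth, the batch-at-a-time variant needs very little
machinery: delete a keyed chunk per statement, let each statement
commit on its own, and repeat until nothing is affected, so no single
transaction holds many row locks for long.  A sketch (table, key, and
predicate here are invented names):

DELETE FROM big_table
WHERE id IN (
  SELECT id FROM big_table
  WHERE is_expired
  LIMIT 10000
);
-- repeat until the command reports DELETE 0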

as for the last idea, you don't need to COPY -- just insert/select, like this:

CREATE TABLE keeprows (LIKE old_table INCLUDING INDEXES INCLUDING CONSTRAINTS);
INSERT INTO keeprows SELECT * FROM old_table WHERE ...;
DROP TABLE old_table;
ALTER TABLE keeprows RENAME TO old_table;
<reset RI rules>
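
The <reset RI rules> step is the fiddly part: dropping the old table
took every foreign key that referenced it down with it, so each
referencing table needs its constraint re-added by hand -- something
like this (constraint, table, and column names here are invented):

ALTER TABLE child_table
  ADD CONSTRAINT child_table_old_table_id_fkey
  FOREIGN KEY (old_table_id) REFERENCES old_table (id)
  ON DELETE CASCADE;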

This procedure is not side-effect free and has some caveats: for
example, if you have code that depends on the table's composite type,
that type is dropped along with the table and has to be reconstructed.
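
For instance, a function declared to return the table's rowtype will
block the DROP with a dependency error (or vanish with it under
DROP ... CASCADE) and must be re-created afterwards.  A contrived
example, all names invented:

CREATE FUNCTION get_row(int) RETURNS old_table AS $$
  SELECT * FROM old_table WHERE id = $1;
$$ LANGUAGE sql;
-- DROP TABLE old_table now fails until this function is dropped too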

Perhaps a better way to go, if you can structure your code around it,
is to simply partition your table by when the data gets loaded and
dropped -- then your deletion becomes a plain 'DROP TABLE' with no
extra processing.
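
Using plain table inheritance, that looks roughly like this (all
names invented):

CREATE TABLE events (batch_id int NOT NULL, payload text);
CREATE TABLE events_batch_42 (CHECK (batch_id = 42)) INHERITS (events);
-- bulk loads go straight into the child table; SELECTs against
-- events see all children; and the eventual bulk delete is simply:
DROP TABLE events_batch_42;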

merlin
