Re: Deleting Rows From Large Tables - Mailing list pgsql-performance

From: Greg Spiegelberg
Subject: Re: Deleting Rows From Large Tables
Date:
Msg-id: CAEtnbpUyEs10PuBng+h3jEyhHo+ff4a70mP7mSf-4prhSmAHDg@mail.gmail.com
In response to: Deleting Rows From Large Tables (Rob Emery <re-pgsql@codeweavers.net>)
List: pgsql-performance
Rob,

I'm going to make half of the list cringe at this suggestion, though I have used it successfully.

If you can guarantee that the table will not be vacuumed during this cleanup, and that the rows you want to delete will not be updated, I would suggest using the ctid column to facilitate the delete.  Using the simple transaction below, I have witnessed a DELETE move much more quickly than one using a PK or any other indexed column.

BEGIN;
SELECT ctid INTO TEMP TABLE ctids_to_be_deleted FROM my_big_table WHERE <delete criteria>;
DELETE FROM my_big_table bt USING ctids_to_be_deleted dels WHERE bt.ctid = dels.ctid;
COMMIT;
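
If a single pass is still too long for the box, the same idea works in rounds. The sketch below is illustrative only: the created_at criterion and the batch size are assumptions, not from this thread. ctid is the tuple's physical address, which is exactly why a vacuum or an update between the SELECT and the DELETE would invalidate it.

BEGIN;
-- Collect a bounded batch of physical tuple addresses; looking rows
-- up by ctid skips the index entirely.
CREATE TEMP TABLE ctids_to_be_deleted AS
    SELECT ctid FROM my_big_table
    WHERE created_at < '2007-01-01'   -- assumed delete criteria
    LIMIT 100000;                     -- assumed batch size
DELETE FROM my_big_table bt USING ctids_to_be_deleted dels
    WHERE bt.ctid = dels.ctid;
DROP TABLE ctids_to_be_deleted;
COMMIT;

Repeat until the SELECT finds no more rows; the vacuum/update caveat applies to every batch.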

HTH.
-Greg


On Fri, May 17, 2013 at 5:26 AM, Rob Emery <re-pgsql@codeweavers.net> wrote:
Hi All,

We've got 3 quite large tables that, due to an unexpected surge in
usage (!), have grown to about 10GB each, with 72, 32 and 31 million
rows in them. I've been tasked with cleaning out about half of them; the
problem I've got is that even deleting the first 1,000,000 rows seems
to take an unreasonable amount of time. Unfortunately this is on quite
an old server (a Dell 2950 with a RAID-10 over 6 disks) running Postgres
8.4, which also serves other things like our logging systems.

If I run a sustained (more than about 5 minutes) delete it'll have a
detrimental effect on the other services. I'm trying to batch up the
deletes into small chunks of approximately 1 month of data; even this
seems to take too long. I originally reduced it to a single day's data
and had the same problem. I can keep decreasing the size of the window
I'm deleting, but I feel I must be doing something either fundamentally
wrong or over-complicating this enormously. I've switched over to
retrieving a list of IDs to delete, storing them in temporary tables,
and deleting based on the primary keys of each table with something
similar to this:

BEGIN TRANSACTION;

CREATE TEMPORARY TABLE table_a_ids_to_delete (id INT);
CREATE TEMPORARY TABLE table_b_ids_to_delete (id INT);

INSERT INTO table_a_ids_to_delete
    SELECT id FROM table_a
    WHERE purchased = '-infinity' AND created_at < '2007-01-01T00:00:00';

INSERT INTO table_b_ids_to_delete
    SELECT table_b_id FROM table_a_table_b_xref
    INNER JOIN table_a_ids_to_delete
        ON (table_a_ids_to_delete.id = table_a_table_b_xref.table_a_id);

DELETE FROM table_a_table_b_xref USING table_a_ids_to_delete
    WHERE table_a_table_b_xref.table_a_id = table_a_ids_to_delete.id;

DELETE FROM table_b USING table_b_ids_to_delete
    WHERE table_b.id = table_b_ids_to_delete.id;

DELETE FROM table_a USING table_a_ids_to_delete
    WHERE table_a.id = table_a_ids_to_delete.id;

COMMIT;
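
One detail that can matter with this pattern (a minimal sketch, reusing the same temp-table names as above): temp tables carry no statistics until they are analyzed, and autovacuum never touches them, so the planner may badly misestimate the three joins above. Analyzing them right after the INSERTs is cheap:

-- Give the planner row counts for the temp tables before the DELETEs;
-- without this it falls back on default estimates.
ANALYZE table_a_ids_to_delete;
ANALYZE table_b_ids_to_delete;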

There are indexes on table_a on the queried columns, table_b's primary
key is its id, and table_a_table_b_xref has an index on (table_a_id,
table_b_id). There are FKs defined on the xref table, which is why I'm
deleting from it first.
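
A quick way to see where the time actually goes (a sketch, reusing the statements above): EXPLAIN ANALYZE on one of the deletes reports the time spent in each foreign-key trigger as "Trigger for constraint ...: time=... calls=..." lines. If those dominate, the usual suspect is a referencing column with no usable index; a btree on (table_a_id, table_b_id) does not help FK lookups by table_b_id alone.

BEGIN;
EXPLAIN ANALYZE
DELETE FROM table_b USING table_b_ids_to_delete
    WHERE table_b.id = table_b_ids_to_delete.id;
-- EXPLAIN ANALYZE actually executes the delete, so undo the test run.
ROLLBACK;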

Does anyone have any ideas as to what I can do to make the deletes any
faster? I'm running out of ideas!

Thanks in advance,

--
Rob Emery


