Thread: Deleting Rows From Large Tables

Deleting Rows From Large Tables

From
Rob Emery
Date:
Hi All,

We've got 3 quite large tables that, due to an unexpected surge in
usage (!), have grown to about 10GB each, with 72, 32 and 31 million
rows in them. I've been tasked with cleaning out about half of them; the
problem I've got is that even deleting the first 1,000,000 rows seems
to take an unreasonable amount of time. Unfortunately this is on quite
an old server (a Dell 2950 with a RAID-10 over 6 disks) running Postgres
8.4, which also serves other things like our logging systems.

If I run a sustained delete (more than about 5 minutes) it has a
detrimental effect on the other services. I'm trying to batch up the
deletes into small chunks of approximately 1 month of data; even this
seems to take too long. I originally reduced this down to a single
day's data and had the same problem. I can keep decreasing the size of
the window I'm deleting, but I feel I must be doing something either
fundamentally wrong or over-complicating this enormously. I've
switched over to retrieving a list of IDs to delete, storing them in
temporary tables and deleting based on the primary keys of each of the
tables, with something similar to this:

BEGIN TRANSACTION;

CREATE TEMPORARY TABLE table_a_ids_to_delete (id INT);
CREATE TEMPORARY TABLE table_b_ids_to_delete (id INT);

INSERT INTO table_a_ids_to_delete
    SELECT id FROM table_a
    WHERE purchased = '-infinity' AND created_at < '2007-01-01T00:00:00';

INSERT INTO table_b_ids_to_delete
    SELECT table_b_id FROM table_a_table_b_xref
    INNER JOIN table_a_ids_to_delete
        ON table_a_ids_to_delete.id = table_a_table_b_xref.table_a_id;

DELETE FROM table_a_table_b_xref USING table_a_ids_to_delete
    WHERE table_a_table_b_xref.table_a_id = table_a_ids_to_delete.id;

DELETE FROM table_b USING table_b_ids_to_delete
    WHERE table_b.id = table_b_ids_to_delete.id;

DELETE FROM table_a USING table_a_ids_to_delete
    WHERE table_a.id = table_a_ids_to_delete.id;

COMMIT;

There are indexes on table_a on the queried columns, table_b's primary
key is its id, and table_a_table_b_xref has an index on (table_a_id,
table_b_id). There are foreign keys defined on the xref table, which is
why I'm deleting from it first.

Does anyone have any ideas as to what I can do to make the deletes any
faster? I'm running out of ideas!

Thanks in advance,

--
Rob Emery


Re: Deleting Rows From Large Tables

From
Sékine Coulibaly
Date:
Oh, sorry, overlooked that part.
Maybe refreshing the stats with VACUUM FULL?


2013/5/17 Robert Emery <robertemery@codeweavers.net>
Hi Sékine,

Unfortunately I'm not trying to empty the table completely, just
delete about 10-15% of the data in it.

Thanks,

On 17 May 2013 14:11, Sékine Coulibaly <scoulibaly@gmail.com> wrote:
> Rob,
>
> Did you try TRUNCATE?
> http://www.postgresql.org/docs/8.4/static/sql-truncate.html
>
> This is supposed to be quicker since it does not scan the table.
>
> Regards

Re: Deleting Rows From Large Tables

From
Vitalii Tymchyshyn
Date:

Analyze your temp tables after filling and before using!
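
In Rob's transaction that would go something like this (a sketch; temp
tables are never touched by autovacuum, so without an explicit ANALYZE
the planner falls back to default guesses about their size):

INSERT INTO table_a_ids_to_delete
    SELECT id FROM table_a
    WHERE purchased = '-infinity' AND created_at < '2007-01-01T00:00:00';

-- gather row counts and statistics so the join against the xref table
-- and the deletes are planned against the list's real size
ANALYZE table_a_ids_to_delete;

-- ... and likewise ANALYZE table_b_ids_to_delete after it is filled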

On 17 May 2013 17:27, Sékine Coulibaly <scoulibaly@gmail.com> wrote:
> Oh, sorry, overlooked that part.
> Maybe refreshing the stats with VACUUM FULL?



Re: Deleting Rows From Large Tables

From
Jeff Janes
Date:
On Fri, May 17, 2013 at 4:26 AM, Rob Emery <re-pgsql@codeweavers.net> wrote:
> Hi All,
>
> We've got 3 quite large tables that, due to an unexpected surge in
> usage (!), have grown to about 10GB each, with 72, 32 and 31 million
> rows in them. I've been tasked with cleaning out about half of them; the
> problem I've got is that even deleting the first 1,000,000 rows seems
> to take an unreasonable amount of time. Unfortunately this is on quite
> an old server (a Dell 2950 with a RAID-10 over 6 disks) running Postgres
> 8.4, which also serves other things like our logging systems.

How many cores do you have?  I think the Dell 2950 could have anywhere from 1 to 8.

Pick a smaller number of rows to delete, and run it with "explain analyze" to see what is going on.  I would say to use "explain (analyze, buffers)" with track_io_timing on, but those don't exist back in 8.4.
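
For example, something along these lines (EXPLAIN ANALYZE actually
executes the statement, so wrap it in a transaction and roll it back):

BEGIN;
EXPLAIN ANALYZE
    DELETE FROM table_a
    WHERE purchased = '-infinity' AND created_at < '2007-01-01T00:00:00';
ROLLBACK;  -- discard the test delete but keep the timing output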

Perhaps this would be a good excuse to upgrade!

> If I run a sustained delete (more than about 5 minutes) it has a
> detrimental effect on the other services.


Do you know why?  Can you identify the affected queries from those other services and run explain analyze on them?


 
> I'm trying to batch up the
> deletes into small chunks of approximately 1 month of data; even this
> seems to take too long. I originally reduced this down to a single
> day's data and had the same problem. I can keep decreasing the size of
> the window I'm deleting, but I feel I must be doing something either
> fundamentally wrong or over-complicating this enormously.


If your server is sized only to do its typical workload, then any substantial extra workload is going to cause problems.  Trying to delete one day's worth of data in a few seconds still seems very likely excessive.  Why not jump all the way down to 5 minutes of data, or limit it to a certain number of rows from table_a, say 100 per batch?  If you start large and work your way down, you will often be working in the dark, because you won't have the patience to let the large batches run to completion while they slow down the whole system.  If you start at the bottom and work up, you will always know where you are, since the previous batch ran to completion and you have the timings from it.

How fast do you need to clean this up?  If it took months to get into the situation, can't you take a few weeks to get out of it?

 
> I've
> switched over to retrieving a list of IDs to delete, storing them in
> temporary tables and deleting based on the primary keys of each of the
> tables, with something similar to this:
>
> BEGIN TRANSACTION;
>
> CREATE TEMPORARY TABLE table_a_ids_to_delete (id INT);
> CREATE TEMPORARY TABLE table_b_ids_to_delete (id INT);
>
> INSERT INTO table_a_ids_to_delete
>     SELECT id FROM table_a
>     WHERE purchased = '-infinity' AND created_at < '2007-01-01T00:00:00';

I'd probably add a "LIMIT 100" in there.  Then you can set created_at to the final time point desired, rather than trying to increment it each time and deciding how much to increment.
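
That is, something like this (a sketch; just re-run the batch until the
INSERT picks up zero rows):

INSERT INTO table_a_ids_to_delete
    SELECT id FROM table_a
    WHERE purchased = '-infinity'
      AND created_at < '2007-01-01T00:00:00'  -- the final cutoff, never incremented
    LIMIT 100;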
 

> INSERT INTO table_b_ids_to_delete
>     SELECT table_b_id FROM table_a_table_b_xref
>     INNER JOIN table_a_ids_to_delete
>         ON table_a_ids_to_delete.id = table_a_table_b_xref.table_a_id;

Do these two queries slow down other operations?  Or is it just the deletes?
 

> DELETE FROM table_a_table_b_xref USING table_a_ids_to_delete
>     WHERE table_a_table_b_xref.table_a_id = table_a_ids_to_delete.id;
>
> DELETE FROM table_b USING table_b_ids_to_delete
>     WHERE table_b.id = table_b_ids_to_delete.id;
>
> DELETE FROM table_a USING table_a_ids_to_delete
>     WHERE table_a.id = table_a_ids_to_delete.id;
>
> COMMIT;

How much time do the 3 deletes take relative to each other and to the inserts?

Cheers,

Jeff

Re: Deleting Rows From Large Tables

From
Greg Spiegelberg
Date:
Rob,

I'm going to make half of the list cringe at this suggestion, though I have used it successfully.

If you can guarantee that the table will not be vacuumed during this cleanup and that the rows you want deleted will not be updated, I would suggest using the ctid column to facilitate the delete.  Using the simple transaction below, I have witnessed a DELETE move much more quickly than one using a PK or any other column with an index.

BEGIN;
SELECT ctid INTO TEMP TABLE ctids_to_be_deleted
    FROM my_big_table
    WHERE <delete criteria>;  -- placeholder for the real predicate
DELETE FROM my_big_table bt USING ctids_to_be_deleted dels
    WHERE bt.ctid = dels.ctid;
COMMIT;

HTH.
-Greg



Re: Deleting Rows From Large Tables

From
Greg Smith
Date:
On 5/17/13 7:26 AM, Rob Emery wrote:
> I can keep decreasing the size of
> the window I'm deleting, but I feel I must be doing something either
> fundamentally wrong or over-complicating this enormously.

I've had jobs like this where we ended up making the batch size cover
only 4 hours at a time.  Once you've looked at the EXPLAIN plans for the
row selection criteria and they're reasonable, dropping the period
that's deleted per pass is really the only thing you can do.  Do some
DELETEs, then pause to let the disk cache clear; repeat.
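
One pass could be as simple as this, reusing Rob's id-list approach (a
sketch; the 60 seconds is arbitrary, tune the pause to whatever the
other services will tolerate):

DELETE FROM table_a USING table_a_ids_to_delete
    WHERE table_a.id = table_a_ids_to_delete.id;

SELECT pg_sleep(60);  -- let the disk cache and background writer settle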

The other useful thing to do here is get very aggressive about settings
for shared_buffers, checkpoint_segments, and checkpoint_timeout.  I'll
normally push for settings like 8GB/256/15 minutes when doing this sort
of thing.  The usual situation with a checkpoint every 5 minutes may not
be feasible when you've got this type of work going on in the background.
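
In postgresql.conf terms that's roughly the following (illustrative
values from the numbers above, not a tested recommendation for this
particular box):

shared_buffers = 8GB          # requires a server restart to change
checkpoint_segments = 256     # allows several GB of WAL between checkpoints
checkpoint_timeout = 15min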

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com