Deleting millions of rows

From: Brian Cox
I'm using 8.3.5. Table ts_defects has 48M rows. Through psql: delete
from ts_defects;
Result: out of memory/Can't allocate size: 32
I then did 10 or so deletes to get rid of the rows. Afterwards, inserts
into or queries on this
table performed significantly slower. I tried a vacuum analyze, but this
didn't help. To fix this,
I dumped and restored the database.

1) why can't postgres delete all rows in a table if it has millions of rows?
2) is there any other way to restore performance other than restoring
the database?

Thanks,
Brian


Re: Deleting millions of rows

From: Robert Haas
On Mon, Feb 2, 2009 at 1:17 PM, Brian Cox <brian.cox@ca.com> wrote:
> I'm using 8.3.5. Table ts_defects has 48M rows. Through psql: delete from
> ts_defects;
> Result: out of memory/Can't allocate size: 32
> I then did 10 or so deletes to get rid of the rows. Afterwards, inserts into
> or queries on this
> table performed significantly slower. I tried a vacuum analyze, but this
> didn't help. To fix this,
> I dumped and restored the database.
>
> 1) why can't postgres delete all rows in a table if it has millions of rows?
> 2) is there any other way to restore performance other than restoring the
> database?

Does the table have triggers on it?  Does it have indexes? What is the
result of pg_relation_size() on that table?

How much memory do you have in your machine? What is work_mem set to?

Did you try VACUUM FULL instead of just plain VACUUM to recover
performance?  You might also need to REINDEX.

Or you could TRUNCATE the table.
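
Roughly, something like this in psql (untested sketch; adjust to your schema):

\d ts_defects                           -- shows the indexes and triggers
SELECT pg_relation_size('ts_defects');  -- table size on disk, in bytes
SHOW work_mem;

-- recover space and index bloat after a big DELETE (takes exclusive locks)
VACUUM FULL VERBOSE ts_defects;
REINDEX TABLE ts_defects;

-- or, if every row really has to go, skip the DELETE entirely
TRUNCATE TABLE ts_defects;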

...Robert

Re: Deleting millions of rows

From: Jerry Champlin
Brian:

One approach we use for large tables is to partition and then drop
partitions as the data becomes obsolete.  This way you never have the
problem.  Our general rule is to never delete data from a table because it
is too slow.  We have found this to be the preferred approach regardless of
database platform.
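
On 8.3 that means inheritance-based partitioning; a minimal sketch (column
and child table names invented for the example):

-- one child per month, with a CHECK constraint on the partition key
CREATE TABLE ts_defects_2009_01 (
    CHECK (occurred_at >= DATE '2009-01-01'
       AND occurred_at <  DATE '2009-02-01')
) INHERITS (ts_defects);

-- inserts are routed to the children by a trigger, rule, or the application,
-- and constraint_exclusion = on lets queries skip irrelevant children.
-- Retiring a month of data is then just a catalog operation:
DROP TABLE ts_defects_2009_01;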

-Jerry

Jerry Champlin|Absolute Performance Inc.|Mobile:  303-588-2547


-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Brian Cox
Sent: Monday, February 02, 2009 11:18 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Deleting millions of rows

I'm using 8.3.5. Table ts_defects has 48M rows. Through psql: delete
from ts_defects;
Result: out of memory/Can't allocate size: 32
I then did 10 or so deletes to get rid of the rows. Afterwards, inserts
into or queries on this
table performed significantly slower. I tried a vacuum analyze, but this
didn't help. To fix this,
I dumped and restored the database.

1) why can't postgres delete all rows in a table if it has millions of rows?
2) is there any other way to restore performance other than restoring
the database?

Thanks,
Brian


Re: Deleting millions of rows

From: Brian Cox
Robert Haas [robertmhaas@gmail.com] wrote:
Thanks for your response.

> Does the table have triggers on it?  Does it have indexes? What is the
> result of pg_relation_size() on that table?
No triggers; 3 indexes
cemdb=> select pg_relation_size('ts_defects');
  pg_relation_size
------------------
        9464971264
(1 row)

cemdb=>
cemdb=> select pg_relation_size('ts_defects_DateIndex');
  pg_relation_size
------------------
        1299931136
(1 row)

cemdb=> select pg_relation_size('ts_defects_DefectIndex');
  pg_relation_size
------------------
        1217224704
(1 row)

cemdb=> select pg_relation_size('ts_defects_EventIndex');
  pg_relation_size
------------------
        1216528384
(1 row)

>
> How much memory do you have in your machine? What is work_mem set to?
32G; work_mem=64M

> Did you try VACUUM FULL instead of just plain VACUUM to recover
> performance?  You might also need to REINDEX.
> Or you could TRUNCATE the table.
I didn't try FULL or REINDEX. In this case, TRUNCATE is the best option
as I was just trying to reset the state of the table for another test.
But this brings up another question: will autovacuum do the right thing
to preserve performance on this table when many rows are deleted?

Thanks,
Brian

Re: Deleting millions of rows

From: Robert Haas
On Mon, Feb 2, 2009 at 3:01 PM, Brian Cox <brian.cox@ca.com> wrote:
>> How much memory do you have in your machine? What is work_mem set to?
>
> 32G; work_mem=64M

Hmm.  Well then I'm not sure why you're running out of memory; that
seems like a bug. Taking a long time, I understand.  Crashing, not so
much.

>> Did you try VACUUM FULL instead of just plain VACUUM to recover
>> performance?  You might also need to REINDEX.
>> Or you could TRUNCATE the table.
>
> I didn't try FULL or REINDEX. In this case, TRUNCATE is the best option as I
> was just trying to reset the state of the table for another test. But this
> brings up another question: will autovacuum do the right thing to preserve
> performance on this table when many rows are deleted?

I don't think so.  I think you need to VACUUM FULL and REINDEX when
you do a big DELETE.  But if you TRUNCATE then you should be OK - no
further cleanup required in that case.

...Robert

Re: Deleting millions of rows

From: David Wilson
On Mon, Feb 2, 2009 at 1:17 PM, Brian Cox <brian.cox@ca.com> wrote:
> I'm using 8.3.5. Table ts_defects has 48M rows. Through psql: delete from
> ts_defects;
> Result: out of memory/Can't allocate size: 32

Is this table the target of any foreign keys?

--
- David T. Wilson
david.t.wilson@gmail.com

Re: Deleting millions of rows

From: Dave Dutcher
> -----Original Message-----
> From: Brian Cox
> Subject: [PERFORM] Deleting millions of rows
>
> I'm using 8.3.5. Table ts_defects has 48M rows. Through psql:
> delete from ts_defects;
> Result: out of memory/Can't allocate size: 32 I then did 10
> or so deletes to get rid of the rows. Afterwards, inserts
> into or queries on this table performed significantly slower.
> I tried a vacuum analyze, but this didn't help. To fix this,
> I dumped and restored the database.
>
> 1) why can't postgres delete all rows in a table if it has
> millions of rows?
> 2) is there any other way to restore performance other than
> restoring the database?
>
> Thanks,
> Brian

If you are deleting an entire table, then the TRUNCATE command is the way to
go.  TRUNCATE is very fast and leaves no dead rows behind.  The problem with
a normal delete is that the rows are not actually removed from the file.
Once the table is VACUUMED the dead space is marked as available to be
reused, but plain VACUUM doesn't remove any space either.  A VACUUM FULL or
CLUSTER will actually remove dead space, but they can take a while to run.
(I've heard CLUSTER is supposed to be faster than VACUUM FULL)  Another way
is to create a new table with the same definition as the old table, select
the rows you want to keep into the new table, drop the old table, and then
rename the new table to have the old table's name.
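
A sketch of that last approach (the WHERE clause is made up, and any foreign
keys pointing at the old table would have to be dropped and re-pointed):

BEGIN;
CREATE TABLE ts_defects_new (LIKE ts_defects INCLUDING DEFAULTS);
INSERT INTO ts_defects_new
  SELECT * FROM ts_defects
  WHERE occurred_at >= now() - interval '30 days';
DROP TABLE ts_defects;
ALTER TABLE ts_defects_new RENAME TO ts_defects;
COMMIT;
-- then recreate indexes, constraints, and grants, and ANALYZE the table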


Dave


Re: Deleting millions of rows

From: Brian Cox
David Wilson [david.t.wilson@gmail.com] wrote:

> Is this table the target of any foreign keys?
There are 2 "on delete cascade" FKs that reference this table.

Brian


Re: Deleting millions of rows

From: David Wilson
On Mon, Feb 2, 2009 at 3:37 PM, Brian Cox <brian.cox@ca.com> wrote:
> David Wilson [david.t.wilson@gmail.com] wrote:
>
>> Is this table the target of any foreign keys?
>
> There are 2 "on delete cascade" FKs that reference this table.

I believe that's the source of your memory issues. I think TRUNCATE
may handle this more effectively; alternatively, you can handle the
cascading yourself in these cases. (And, as Dave Dutcher mentioned,
TRUNCATE is definitely the way to go for full-table wipes.)


--
- David T. Wilson
david.t.wilson@gmail.com

Re: Deleting millions of rows

From: Scott Marlowe
On Mon, Feb 2, 2009 at 11:17 AM, Brian Cox <brian.cox@ca.com> wrote:
> I'm using 8.3.5. Table ts_defects has 48M rows. Through psql: delete from
> ts_defects;
> Result: out of memory/Can't allocate size: 32
> I then did 10 or so deletes to get rid of the rows. Afterwards, inserts into
> or queries on this
> table performed significantly slower. I tried a vacuum analyze, but this
> didn't help.

There are two different problems happening here.  One is the failed
delete; the other is the normal bloating caused when a lot of rows are
deleted.

When deleting every row in a table you're much better off just
truncating it.  But foreign keys can get in the way, so you might need
truncate cascade.  If you're not sure you really want to do it, you can
wrap your truncate in a begin;commit; pair and see how the database
looks after the truncate.
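
Something like this (sketch; note that CASCADE also empties the tables whose
foreign keys point at ts_defects):

BEGIN;
TRUNCATE TABLE ts_defects CASCADE;
-- look around here, then
COMMIT;     -- or ROLLBACK; if you don't like what you see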

If you choose to use a delete, then foreign keys can slow things down
quite a bit, and if you've got bad stats it's possible for the planner
to choose a plan that runs out of memory.  Was this db recently
analyzed?

If the delete is what you need for some reason, a regular vacuum won't
fix your problem, because it only makes dead tuples available again,
it doesn't remove them.  A cluster command OR vacuum full followed by
reindex are the two best ways to get the space recovered.

> To fix this,
> I dumped and restored the database.

That works too.  Since the table was empty, you could have dropped and
recreated it, but if you had foreign keys you'd have to recreate them
too.

> 1) why can't postgres delete all rows in a table if it has millions of rows?

It works fine for me.  Often into the billions.  Your test case seems
out of the ordinary.

Can you post all the non-default values in your postgresql.conf /
alter database set ... settings?

Re: Deleting millions of rows

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Feb 2, 2009 at 3:01 PM, Brian Cox <brian.cox@ca.com> wrote:
>>> How much memory do you have in your machine? What is work_mem set to?
>>
>> 32G; work_mem=64M

> Hmm.  Well then I'm not sure why you're running out of memory,

It's the pending trigger list.  He's got two trigger events per row,
which at 40 bytes apiece would approach 4GB of memory.  Apparently
it's a 32-bit build of Postgres, so he's running out of process address
space.
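
(Rough arithmetic: 48M rows x 2 events x 40 bytes is about 3.8GB, while a
32-bit process typically has only around 3GB of usable address space.)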

There's a TODO item to spill that list to disk when it gets too large,
but the reason nobody's done it yet is that actually executing that many
FK check trigger events would take longer than you want to wait anyway.

TRUNCATE is the best solution if you want to get rid of the whole table
contents.  If you're deleting very many but not all rows, people tend
to drop the FK constraints and re-establish them afterwards.  Retail
checking is just too slow.
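
In SQL terms that pattern looks roughly like the following (every table,
column, and constraint name here is invented for illustration):

ALTER TABLE ts_defect_events DROP CONSTRAINT ts_defect_events_defect_fk;

-- do the child and parent deletes as two set-oriented statements
DELETE FROM ts_defect_events USING ts_defects d
  WHERE ts_defect_events.defect_id = d.id
    AND d.occurred_at < DATE '2008-01-01';
DELETE FROM ts_defects WHERE occurred_at < DATE '2008-01-01';

-- re-adding the constraint revalidates it with one big join, not row by row
ALTER TABLE ts_defect_events
  ADD CONSTRAINT ts_defect_events_defect_fk
  FOREIGN KEY (defect_id) REFERENCES ts_defects (id) ON DELETE CASCADE;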

            regards, tom lane

Re: Deleting millions of rows

From: Brian Cox
Tom Lane [tgl@sss.pgh.pa.us] wrote:
> It's the pending trigger list.  He's got two trigger events per row,
> which at 40 bytes apiece would approach 4GB of memory.  Apparently
> it's a 32-bit build of Postgres, so he's running out of process address
> space.
Yes, this is a 32 bit Postgres running on a 32 bit Linux. I assume that
the 2 triggers are due to the 2 "on delete cascade" FKs. Thanks for
explaining this bit of a mystery.

> TRUNCATE is the best solution if you want to get rid of the whole table
> contents.  If you're deleting very many but not all rows, people tend
> to drop the FK constraints and re-establish them afterwards.  Retail
> checking is just too slow.
Thanks also to you (and several others) for reminding me of TRUNCATE.
This will definitely work for what I was trying to do: reset this table
for more testing.

In production, the table on which I ran DELETE FROM grows constantly
with old data removed in bunches periodically (say up to a few 100,000s
of rows [out of several millions] in a bunch). I'm assuming that
auto-vacuum/analyze will allow Postgres to maintain reasonable
performance for INSERTs and SELECTs on it; do you think that this is a
reasonable assumption?

Thanks,
Brian


Re: Deleting millions of rows

From: Scott Marlowe
On Mon, Feb 2, 2009 at 3:01 PM, Brian Cox <brian.cox@ca.com> wrote:

> In production, the table on which I ran DELETE FROM grows constantly with
> old data removed in bunches periodically (say up to a few 100,000s of rows
> [out of several millions] in a bunch). I'm assuming that auto-vacuum/analyze
> will allow Postgres to maintain reasonable performance for INSERTs and
> SELECTs on it; do you think that this is a reasonable assumption?

Yes, as long as you're deleting a small enough percentage that it
doesn't get bloated (100k of millions is a good ratio) AND autovacuum
is running AND you have enough FSM entries to track the dead tuples,
you're gold.
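
You can sanity-check the last two with something like:

SHOW autovacuum;       -- should be on
SHOW max_fsm_pages;    -- 8.3 still uses the fixed-size free space map
-- the tail of a database-wide VACUUM VERBOSE warns you if max_fsm_pages
-- is too small to track all the dead space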

Re: Deleting millions of rows

From: Robert Haas
> It's the pending trigger list.  He's got two trigger events per row,
> which at 40 bytes apiece would approach 4GB of memory.  Apparently
> it's a 32-bit build of Postgres, so he's running out of process address
> space.
>
> There's a TODO item to spill that list to disk when it gets too large,
> but the reason nobody's done it yet is that actually executing that many
> FK check trigger events would take longer than you want to wait anyway.

Have you ever given any thought to whether it would be possible to
implement referential integrity constraints with statement-level
triggers instead of row-level triggers?  IOW, instead of planning this
and executing it N times:

DELETE FROM ONLY <fktable> WHERE $1 = fkatt1 [AND ...]

...we could join the original query against fktable with join clauses
on the correct pairs of attributes and then execute it once.
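
i.e., hand-waving over the details, and assuming the set of just-deleted PK
rows were available under some name like deleted_pk_rows:

DELETE FROM ONLY fktable
USING deleted_pk_rows d
WHERE fktable.fkatt1 = d.att1;    -- [AND ...] for multi-column keys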

Is this insanely difficult to implement?

...Robert

Re: Deleting millions of rows

From: Alvaro Herrera
Robert Haas wrote:

> Have you ever given any thought to whether it would be possible to
> implement referential integrity constraints with statement-level
> triggers instead of row-level triggers?

Well, one reason we haven't discussed this is because our per-statement
triggers are too primitive yet -- we don't have access to the list of
acted-upon tuples.  As soon as we have that we can start discussing this
optimization.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Deleting millions of rows

From: Tom Lane
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Robert Haas wrote:
>> Have you ever given any thought to whether it would be possible to
>> implement referential integrity constraints with statement-level
>> triggers instead of row-level triggers?

> Well, one reason we haven't discussed this is because our per-statement
> triggers are too primitive yet -- we don't have access to the list of
> acted-upon tuples.  As soon as we have that we can start discussing this
> optimization.

I think the point is that at some number of tuples it's better to forget
about per-row tests at all, and instead perform the same whole-table
join that would be used to validate the FK from scratch.  The mechanism
we lack is not one to pass the row list to a statement trigger, but one
to smoothly segue from growing a list of per-row entries to dropping
that list and queueing one instance of a statement trigger instead.
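
(For reference, the from-scratch validation is essentially one big outer
join, along these lines -- table and column names are illustrative only:

SELECT fk.fkatt1
  FROM ONLY fktable fk
  LEFT JOIN ONLY pktable pk ON pk.att1 = fk.fkatt1
 WHERE pk.att1 IS NULL AND fk.fkatt1 IS NOT NULL;

any rows it returns are violations.)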

            regards, tom lane

Re: Deleting millions of rows

From: Andrew Lazarus
Hello All,


TL> If you're deleting very many but not all rows, people tend
TL> to drop the FK constraints and re-establish them afterwards.

I find

BEGIN;
CREATE TEMP TABLE remnant AS
  SELECT * FROM bigtable WHERE (very_restrictive_condition);
TRUNCATE TABLE bigtable;
INSERT INTO bigtable SELECT * FROM remnant;
COMMIT;
ANALYSE bigtable;

works well because there is no possibility of my forgetting FKs.


--
Sincerely,
 Andrew Lazarus        mailto:andrew@pillette.com

Re: Deleting millions of rows

From: Robert Haas
On Tue, Feb 3, 2009 at 4:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Robert Haas wrote:
>>> Have you ever given any thought to whether it would be possible to
>>> implement referential integrity constraints with statement-level
>>> triggers instead of row-level triggers?
>
>> Well, one reason we haven't discussed this is because our per-statement
>> triggers are too primitive yet -- we don't have access to the list of
>> acted-upon tuples.  As soon as we have that we can start discussing this
>> optimization.
>
> I think the point is that at some number of tuples it's better to forget
> about per-row tests at all, and instead perform the same whole-table
> join that would be used to validate the FK from scratch.  The mechanism
> we lack is not one to pass the row list to a statement trigger, but one
> to smoothly segue from growing a list of per-row entries to dropping
> that list and queueing one instance of a statement trigger instead.

That's good if you're deleting most or all of the parent table, but
what if you're deleting 100,000 values from a 10,000,000 row table?
In that case maybe I'm better off inserting all of the deleted keys
into a side table and doing a merge or hash join between the side
table and the child table...
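
Something like (names and the WHERE clause invented for the example):

CREATE TEMP TABLE doomed_keys AS
  SELECT id FROM parent WHERE expired;
ANALYZE doomed_keys;   -- so the planner can pick a sane join strategy
DELETE FROM child  USING doomed_keys k WHERE child.parent_id = k.id;
DELETE FROM parent USING doomed_keys k WHERE parent.id = k.id;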

...Robert

Re: Deleting millions of rows

From: Gregory Stark
Robert Haas <robertmhaas@gmail.com> writes:

> That's good if you're deleting most or all of the parent table, but
> what if you're deleting 100,000 values from a 10,000,000 row table?
> In that case maybe I'm better off inserting all of the deleted keys
> into a side table and doing a merge or hash join between the side
> table and the child table...

It would be neat if we could feed the queued trigger tests into a plan node
like a Materialize and use the planner to determine which type of plan to
generate.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!

Re: Deleting millions of rows

From: Robert Haas
On Wed, Feb 4, 2009 at 7:35 AM, Gregory Stark <stark@enterprisedb.com> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>
>> That's good if you're deleting most or all of the parent table, but
>> what if you're deleting 100,000 values from a 10,000,000 row table?
>> In that case maybe I'm better off inserting all of the deleted keys
>> into a side table and doing a merge or hash join between the side
>> table and the child table...
>
> It would be neat if we could feed the queued trigger tests into a plan node
> like a Materialize and use the planner to determine which type of plan to
> generate.

Yes, definitely.  If it could be built as a general facility it would
be good for a lot of other things too.  Imagine that from within a
statement-level trigger you had magical tables called OLD_TUPLES and
NEW_TUPLES, analogous to OLD and NEW, but the whole set of them.  I
can't tell you how many problems I could solve with this type of
facility...

What I think makes it a little extra-difficult is that even if you had
this, you still can't express what you want to plan as a single query.
You can either join the foreign key relation against OLD_TUPLES and
delete everything that matches, or you can join the foreign key
relation against the remaining table contents and throw away
everything that doesn't match.
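
With those hypothetical tables, the two shapes would be roughly (ignoring
NULL handling; names illustrative):

-- delete whatever matches the deleted parents:
DELETE FROM fktable USING old_tuples o WHERE fktable.fkatt1 = o.att1;

-- or keep only what still matches the surviving parents:
DELETE FROM fktable
WHERE NOT EXISTS (SELECT 1 FROM pktable p WHERE p.att1 = fktable.fkatt1);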

...Robert