Thread: Massive delete from a live production DB

Massive delete from a live production DB

From: Phoenix Kiula
Hi

Been reading some old threads (pre-9.x versions) and it seems the
consensus is to avoid doing massive deletes from a table, as they'll
create so much unrecoverable space/gaps that a VACUUM FULL would be
needed. Etc.

Instead, we might as well do a dump/restore. Faster, cleaner.

This is all well and good, but what about a situation where the
database is in production and cannot be brought down for this
operation, or even for a CLUSTER?

Any ideas on what I could do without losing all the live updates? I
need to get rid of about 11% of 150 million rows, with each row being
roughly 1 to 5 KB in size...

Thanks! Version is 9.0.4.

Re: Massive delete from a live production DB

From: Eric Ndengang
On 12.05.2011 16:38, Phoenix Kiula wrote:
> On Thu, May 12, 2011 at 10:33 PM, Eric Ndengang
> <eric.ndengang_foyet@affinitas.de>  wrote:
>> On 12.05.2011 16:23, Phoenix Kiula wrote:
>>> Hi
>>>
>>> Been reading some old threads (pre-9.x versions) and it seems the
>>> consensus is to avoid doing massive deletes from a table, as they'll
>>> create so much unrecoverable space/gaps that a VACUUM FULL would be
>>> needed. Etc.
>>>
>>> Instead, we might as well do a dump/restore. Faster, cleaner.
>>>
>>> This is all well and good, but what about a situation where the
>>> database is in production and cannot be brought down for this
>>> operation, or even for a CLUSTER?
>>>
>>> Any ideas on what I could do without losing all the live updates? I
>>> need to get rid of about 11% of 150 million rows, with each row being
>>> roughly 1 to 5 KB in size...
>>>
>>> Thanks! Version is 9.0.4.
>>>
>> Hey,
>> try to use pg_reorg -->  http://reorg.projects.postgresql.org
>> but the table must have a primary key.
>> regards
>>
>
>
> Thanks Eric.
>
> I do have a primary key.
>
> I am on version 9.0.4. Will pg_reorg work with this version too? The
> example on that website mentions 8.3.
>
> Also, is it a fast process that does not consume too many resources?
> This DB is behind a very high-traffic website, so I cannot have a
> CLUSTER alternative like pg_reorg making my DB very slow while it runs.
>
> How does one install the patch easily on CentOS (Linux) 64 bit?
>
> Thanks!
Hi,

/* I am on version 9.0.4. Will pg_reorg work with this version too? The
example on that website mentions 8.3. */

I have used pg_reorg on version 8.4.8 and, according to the
documentation, it will also work with version 9.0.

/* How does one install the patch easily on CentOS (Linux) 64 bit? */

You can easily install it as a contrib module. Just read the
installation guide or the man page.

/* Also, is it a fast process that does not consume too many resources?
This DB is behind a very high-traffic website, so I cannot have a
CLUSTER alternative like pg_reorg making my DB very slow while it runs. */

Yes, it's a fast process that is neither time- nor resource-intensive.
The reorganization of a table with about 60 million rows can take less
than 8 minutes without much CPU cost.

cheers

--
Eric Ndengang
Database Administrator

Affinitas GmbH  |  Kohlfurter Straße 41/43  |  10999 Berlin  |  Germany
email: eric.ndengang_foyet@affinitas.de  | tel: +49.(0)30. 991 949 5 0  |  www.edarling.de

Managing Directors: Lukas Brosseder, David Khalil, Kai Rieke, Christian Vollmann
Registered at the Amtsgericht Berlin, HRB 115958



Re: Massive delete from a live production DB

From: Bill Moran
In response to Phoenix Kiula <phoenix.kiula@gmail.com>:

> Hi
>
> Been reading some old threads (pre-9.x versions) and it seems the
> consensus is to avoid doing massive deletes from a table, as they'll
> create so much unrecoverable space/gaps that a VACUUM FULL would be
> needed. Etc.
>
> Instead, we might as well do a dump/restore. Faster, cleaner.
>
> This is all well and good, but what about a situation where the
> database is in production and cannot be brought down for this
> operation, or even for a CLUSTER?
>
> Any ideas on what I could do without losing all the live updates? I
> need to get rid of about 11% of 150 million rows, with each row being
> roughly 1 to 5 KB in size...

Have you considered the following process:

1) SELECT the rows you want to keep into a new table (time-consuming)
2) Start outage
3) Pull over any new rows that might have been added between 1 & 2
4) Drop the old table
5) Rename the new table to the old name
6) Any other steps required to make the new table exactly like
   the old one (e.g. foreign keys, serials, etc.)
7) End outage window

Because steps 3-6 are very fast, your outage window is very short.
Not a perfect, zero-downtime solution, but possibly helpful.
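
In SQL, steps 1 and 3-6 might look roughly like the sketch below. This
is only an illustration -- the table name "mytable", the boolean column
"is_expired" and the ever-increasing primary key "id" are made up, and
your indexes, foreign keys and grants will differ:

-- Step 1, while the system is live (the slow part):
CREATE TABLE mytable_new AS
    SELECT * FROM mytable WHERE NOT is_expired;

-- Steps 3-6, inside the outage window:
BEGIN;
-- pull over rows added since step 1 (assumes "id" only ever grows)
INSERT INTO mytable_new
    SELECT * FROM mytable
    WHERE id > (SELECT max(id) FROM mytable_new);
DROP TABLE mytable;
ALTER TABLE mytable_new RENAME TO mytable;
ALTER TABLE mytable ADD PRIMARY KEY (id);  -- plus FKs, indexes, etc.
COMMIT;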

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

Re: Massive delete from a live production DB

From: Marti Raudsepp
On Thu, May 12, 2011 at 17:23, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> Been reading some old threads (pre-9.x versions) and it seems the
> consensus is to avoid doing massive deletes from a table, as they'll
> create so much unrecoverable space/gaps that a VACUUM FULL would be
> needed. Etc.

Just running DELETE with normal autovacuum won't *shrink* the physical
table, but the freed-up space will be made available for future
inserts/updates. No problem there. Fragmentation of newly inserted
records is still a potential issue.
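
If a single huge DELETE is a concern (one long transaction, one big
burst of dead tuples), a common variation is to delete in batches and
let autovacuum catch up in between. A rough sketch -- "mytable",
"is_expired" and the primary key "id" are made-up names:

-- repeat until it reports DELETE 0
DELETE FROM mytable
WHERE id IN (
    SELECT id FROM mytable
    WHERE is_expired
    LIMIT 50000
);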

It's true that on pre-8.4 PostgreSQL versions you could run into "dead
space" that couldn't be reused if you had a badly tuned FSM. I presume
this is why VACUUM FULL was recommended -- but this advice no longer
applies to 8.4 or 9.0.

> Instead, we might as well do a dump/restore. Faster, cleaner.
>
> Any ideas on what I could do without losing all the live updates? I
> need to get rid of about 11% of 150 million rows, with each row being
> roughly 1 to 5 KB in size...

For deleting 11%, a dump and restore of 150 million records and
hundreds of gigabytes doesn't seem worth it. If it were closer to 50%,
then I'd consider it.

Regards,
Marti

Re: Massive delete from a live production DB

From: Phoenix Kiula
[snip]


> You can easily install it as a contrib module. Just read the
> installation guide or the man page.



Thanks Eric. How though?

The instructions here --
http://reorg.projects.postgresql.org/pg_reorg.html  -- are woefully
incomplete.

I have a standard PG install on a WHM/cPanel-type server. I know the
path to pgsql. I can download pg_reorg-1.1.5.tar.gz into this folder
and untar it. Then what? A "make" and "make install" do not work --
the usual "./configure" step is not available.

Sorry, I need more detailed steps.

I googled and found this:
http://www.postgresql.org/docs/9.0/static/contrib.html

But the recommended steps:

   gmake
   gmake install

...don't work either. Here's what I see:



[mydomain] src > cd pg_reorg-1.1.5

[mydomain] pg_reorg-1.1.5 > gmake
Makefile:13: ../../src/Makefile.global: No such file or directory
gmake: *** No rule to make target `../../src/Makefile.global'.  Stop.

[mydomain] pg_reorg-1.1.5 >



What am I missing?

PS. If pg_reorg is such a useful contribution, why can't it be
included with PG? Seems like a very useful tool anyway! No?

Thanks.

Re: Massive delete from a live production DB

From: Scott Marlowe
On Thu, May 12, 2011 at 8:23 AM, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> Hi
>
> Been reading some old threads (pre-9.x versions) and it seems the
> consensus is to avoid doing massive deletes from a table, as they'll
> create so much unrecoverable space/gaps that a VACUUM FULL would be
> needed. Etc.

> Any ideas on what I could do without losing all the live updates? I
> need to get rid of about 11% of 150 million rows, with each row being
> roughly 1 to 5 KB in size...

11% is no big deal, as the space will get re-used for future updates
and inserts.
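
If you want to convince yourself, watch the table's on-disk size over
time -- after the delete it should stay roughly flat, since new rows go
into the freed space ("mytable" is a placeholder name):

SELECT pg_size_pretty(pg_relation_size('mytable'));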

Re: Massive delete from a live production DB

From: Tomas Vondra
On 12.5.2011 17:37, Phoenix Kiula wrote:
> [mydomain] src > cd pg_reorg-1.1.5
>
> [mydomain] pg_reorg-1.1.5 > gmake
> Makefile:13: ../../src/Makefile.global: No such file or directory
> gmake: *** No rule to make target `../../src/Makefile.global'.  Stop.
>
> [mydomain] pg_reorg-1.1.5 >
>
>
> What am I missing?

Do you have the sources or just a binary package? To compile such a
contrib package you need a properly configured source tree.

If you do have the sources available, you need to run configure
(because that's what produces src/Makefile.global), and then build the
contrib module again.

If you don't have the sources, you can download the distribution from
postgresql.org. But maybe there's a source package for your distro.

> PS. If pg_reorg is such a useful contribution, why can't it be
> included with PG? Seems like a very useful tool anyway! No?

There are a lot of packages that might be included in the default
install, but that would add a lot of maintenance responsibility.

regards
Tomas

Re: Massive delete from a live production DB

From: Tomas Vondra
On 12.5.2011 16:23, Phoenix Kiula wrote:
> Hi
>
> Been reading some old threads (pre-9.x versions) and it seems the
> consensus is to avoid doing massive deletes from a table, as they'll
> create so much unrecoverable space/gaps that a VACUUM FULL would be
> needed. Etc.
>
> Instead, we might as well do a dump/restore. Faster, cleaner.
>
> This is all well and good, but what about a situation where the
> database is in production and cannot be brought down for this
> operation, or even for a CLUSTER?
>
> Any ideas on what I could do without losing all the live updates? I
> need to get rid of about 11% of 150 million rows, with each row being
> roughly 1 to 5 KB in size...
>
> Thanks! Version is 9.0.4.

One possible recipe in such cases is partitioning. If you can divide
the data so that a delete is equivalent to dropping a partition, then
you don't need to worry about vacuum etc.
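
A minimal sketch of what that looks like on 9.0 (inheritance-based
partitioning; the table "hits" and the monthly ranges are made-up
placeholders):

-- parent table; the children hold the actual rows
CREATE TABLE hits (
    id         bigserial,
    created_at timestamptz NOT NULL,
    payload    text
);

-- one child per month; the CHECK constraint lets constraint_exclusion
-- skip partitions a query cannot touch
CREATE TABLE hits_2011_04 (
    CHECK (created_at >= '2011-04-01' AND created_at < '2011-05-01')
) INHERITS (hits);

CREATE TABLE hits_2011_05 (
    CHECK (created_at >= '2011-05-01' AND created_at < '2011-06-01')
) INHERITS (hits);

-- "deleting" a month is then instant and leaves no dead tuples behind
DROP TABLE hits_2011_04;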

But partitioning has its own problems - you can't reference the
partitioned table using foreign keys, the query plans are often not as
efficient as with a non-partitioned table, etc.

regards
Tomas