Thread: Controlling Database Growth

Controlling Database Growth

From: Mark Drago

Hello,

I'm using PostgreSQL to log web traffic leaving our network.  This
results in the database growing to a fairly large size.  This machine
will be left unattended and basically unmaintained for long stretches of
time, so I need a way to limit the disk space that Postgres uses.  So far
I have come up with two basic ideas (rough sketches of both are below):

1. Routinely run 'du' on the directory containing the PostgreSQL data
(in my case /var/postgresql/data), and when it reaches a certain size,
remove a whole bunch of the old data from the database and run
'vacuum full; reindex database db_name; analyze;'.

The problem with this is that the vacuum could take nearly an hour to
run in some cases, and there will be data that needs to be logged during
that hour.  Also, the vacuum process could use disk space beyond what the
database is currently using, and that disk space may not be available.

2. Use pgstattuple() to determine how much space is being used at any
given time and delete a bunch of old rows from the database when it is
approaching a limit.

The nice thing about this is that 'vacuum full;' does not have to be
executed in order to see the space get reclaimed.  The downside is that
running pgstattuple() is much more expensive than running 'du', so the
disk space checks can't happen as often, and they cannot be run at all
during the day.
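
To make idea 1 concrete, the cleanup pass (once the 'du' check trips) would
look something like the following; the 'weblog' table and 'logged_at' column
are only placeholders for my real schema:

  -- delete the oldest data, then reclaim the space on disk
  DELETE FROM weblog WHERE logged_at < now() - interval '30 days';
  VACUUM FULL;               -- rewrites the tables, returning space to the OS
  REINDEX DATABASE db_name;  -- rebuilds the indexes bloated by the mass delete
  ANALYZE;                   -- refresh planner statistics afterwards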
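
For idea 2, the check would be roughly the following query, run per table
(pgstattuple() comes from contrib, and 'weblog' is again a placeholder); the
full scan of the table it performs is what makes it expensive:

  SELECT table_len,                                  -- total bytes on disk
         tuple_len,                                  -- bytes of live row data
         dead_tuple_len + free_space AS reclaimable  -- space reusable after a plain vacuum
  FROM pgstattuple('weblog');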

I am curious to know if anyone has any other ideas as to how I can limit
the disk space that PostgreSQL uses to, say, 5GB.  I have not looked into
pg_autovacuum yet, but from what I have read about it, it does not seem
to be the answer to this problem.  Has anyone else had to do such a
thing before?  Does anyone have any ideas on how to do this better?

Thanks,
Mark Drago

Re: Controlling Database Growth

From: Bill Moran
In response to Mark Drago <markdrago@mail.com>:
>
> I'm using PostgreSQL to log web traffic leaving our network.  This
> results in the database growing to a fairly large size.
[snip]
> Has anyone else had to do such a thing before?  Does anyone have any
> ideas on how to do this better?

I don't think either of those are good ideas, because they both rely on
disk limits to trigger drastic changes in database size, which will then
require drastic maintenance operations (vacuum full, reindex) to clean
up.

Personally, I think you'd be a lot better off estimating how much new
data comes in each day, and scheduling a daily delete of old data combined
with regular vacuum (either via cron or using autovacuum) and an occasional
reindex.
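
As a rough sketch (table name, column name, and retention window are all
placeholders to adapt), the nightly job kicked off from cron would just be:

  DELETE FROM weblog WHERE logged_at < now() - interval '30 days';
  VACUUM ANALYZE weblog;  -- plain vacuum: freed space gets reused by new rows,
                          -- so the table levels off instead of growing forever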

Add to that some system monitoring via snmp traps -- maybe even graphing
with mrtg -- to keep an eye on things in case you need to adjust the
frequency or amount of stuff that's done, and you should see the database
stabilize at a manageable size.

--
Bill Moran
Collaborative Fusion Inc.

Re: Controlling Database Growth

From: Ron Johnson

On 01/25/07 15:43, Bill Moran wrote:
> In response to Mark Drago <markdrago@mail.com>:
[snip]
> I don't think either of those are good ideas, because they both
> rely on disk limits to trigger drastic changes in database size,
> which will then require drastic maintenance operations (vacuum
> full, reindex) to clean up.
>
> Personally, I think you'd be a lot better off estimating how much
> new data comes in each day, and scheduling a daily delete of old
> data combined with regular vacuum (either via cron or using
> autovacuum) and an occasional reindex.
>
> Add to that some system monitoring via snmp traps -- maybe even
> graphing with mrtg -- to keep an eye on things in case you need
> to adjust the frequency or amount of stuff that's done, and you
> should see the database stabilize at a manageable size.

Agree totally with this.

You could even partition the table (by month, probably), either with a
UNION ALL view or with PostgreSQL's built-in partitioning.  Dropping the
oldest month would then be a rapid process.
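
With the inheritance-based partitioning that PostgreSQL ships with, the
shape of it is roughly this (table, column, and date ranges are only
illustrative):

  -- one child table per month, hanging off the main log table
  CREATE TABLE weblog_2007_01 (
      CHECK (logged_at >= DATE '2007-01-01' AND logged_at < DATE '2007-02-01')
  ) INHERITS (weblog);

  -- inserts get routed to the right child by a rule or trigger; expiring
  -- the oldest month is then just a quick catalog operation:
  DROP TABLE weblog_2007_01;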