Thread: Vacuum daemon (pgvacuumd?)
(for background, see the conversation "Postgresql backend to perform vacuum automatically")

In idea phase 1, brainstorm:

Create a table for the defaults in template1.
Create a table in each database for state information.

Should have a maximum duty cycle for vacuum vs. non-vacuum on a per-table basis. If a vacuum takes 3 minutes, and the duty cycle is no more than 10%, the next vacuum cannot take place for another 30 minutes. Is this a table or a database setting? I am thinking table. Anyone have good arguments for database?

Must have a trigger point based on the number of total tuples vs. the number of dirty tuples. Unfortunately, some tuples are more important than others, but I don't know how to really detect that. We should be able to keep track of the number of dirty tuples in a table. Is it known how many tuples are in a table at any point? (If so, on a side note, can we use this for a count()?) How about dirty tuples?

Is the number of deleted tuples sufficient to decide priority on vacuum? My thinking is that the tables with the most deleted tuples are the tables which need vacuum most. Should it be the ratio of deleted tuples vs. total tuples, or just the count of deleted tuples? I am thinking ratio, but maybe it needs to be tunable.

Here is the program flow:

(1) Startup (do this for each database).
(2) Get all the information from a vacuumd table. If the table does not exist, perform a vacuum on all tables, and initialize the table to the current state.
(3) Check which tables can be vacuumed based on their duty cycle and the current time.
(4) If the tables eligible to be vacuumed have deleted tuples which exceed acceptable limits, vacuum them.
(5) Wait a predefined time, then loop to (2).

This is my basic idea; what do you all think?

I plan to work on this in the next couple of weeks. Any suggestions, notes, concerns, or features would be welcome.
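The duty-cycle rule described above can be sketched as a small scheduling function. This is a minimal sketch, not part of the proposal: it assumes the wait is measured from the end of the previous vacuum and computed as duration / duty_cycle, which reproduces the 3-minute / 10% / 30-minute example; the function name is invented here.

```python
def vacuum_wait_minutes(vacuum_duration, duty_cycle):
    """Minutes to wait after a vacuum finishes before the next may start.

    Reading the proposal literally: a vacuum of duration d at a maximum
    duty cycle c is followed by a pause of d / c.  (Strictly, d busy out
    of d + d/c total elapsed time is a shade under c, so this errs on the
    conservative side of the "no more than" limit.)
    """
    if not 0.0 < duty_cycle <= 1.0:
        raise ValueError("duty cycle must be in (0, 1]")
    return vacuum_duration / duty_cycle

# The example from the proposal: a 3-minute vacuum at a 10% maximum duty
# cycle means the next vacuum cannot take place for another 30 minutes.
print(vacuum_wait_minutes(3.0, 0.10))  # 30.0
```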
> Is the number of deleted tuples sufficient to decide priority on vacuum? My
> thinking is that the tables with the most deleted tuples is the table which
> need most vacuum. Should ratio of deleted tuples vs total tuples or just count
> of deleted tuples. I am thinking ratio, but maybe it need be tunable.

Deleted or updated -- both expire tuples. Also, the old tuples can't be vacuumed until no other transaction is viewing them as active.

> (4) If the tables eligible to be vacuumed have deleted tuples which exceed
> acceptable limits, vacuum them.

Seems you will measure in percentages, right?

-- 
  Bruce Momjian                     |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,  |  830 Blythe Avenue
  +  Christ can be your backup.     |  Drexel Hill, Pennsylvania 19026
> Should have a maximum duty cycle for vacuum vs non-vacuum on a per table basis.
> If a vacuum takes 3 minutes, and a duty cycle is no more than 10%, the next
> vacuum can not take place for another 30 minutes. Is this a table or database
> setting? I am thinking table. Anyone have good arguments for database?

I'd vote for database (or even system) settings personally, as tables which don't get updated simply won't have vacuum run on them; those that do will. Vacuum anywhere degrades performance, as it is additional disk work. On top of that, with a per-table duty cycle you need additional checks to prevent vacuum from running on all or several tables at the same time. A duty cycle per database (a single vacuum tracked per DB) limits you to a single instance of vacuum.

I'm a little concerned about the duty cycle. Why limit it? If a table's access speed could be increased enough to outweigh the cost of the vacuum, it should always be done. Perhaps a generic cost threshold such as 500 + (15% of tuples updated/deleted) would work -- that is, a percentage of dead tuples, plus a base to keep it from constantly firing on nearly empty tables. Do that table, then pick the next worst off (if more than one requires vacuum).

Perhaps frequency of selects weighs in here too. 15% dead in a table receiving 99% selects is worse than 100% dead in a table receiving 99% updates, as the former will have more long-term effect if done now; the table with updates is probably constantly putting up requests anyway.

I'd suggest making the base and percentage-dead-tuple numbers GUCable rather than stored in a system table. It's probably not something we want people playing with easily -- especially when they can still run vacuum manually.
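One plausible reading of the "base plus percentage dead" trigger suggested above can be sketched as follows. This is an illustrative sketch only: the constants 500 and 15% come from the post, but the function name, signature, and the exact combination (count over a base AND fraction over a threshold) are one interpretation among several.

```python
def needs_vacuum(n_dead, n_live, base=500, dead_fraction=0.15):
    """Trigger rule sketched in the post: vacuum only when the dead-tuple
    count clears a fixed base (so nearly empty tables don't fire
    constantly) AND dead tuples make up a sufficient fraction of the
    table."""
    total = n_dead + n_live
    if total == 0:
        return False
    return n_dead > base and (n_dead / total) >= dead_fraction

# A tiny table never qualifies, however dirty it is proportionally:
print(needs_vacuum(n_dead=90, n_live=10))          # False (under the base)
# A large table with 20% dead tuples does:
print(needs_vacuum(n_dead=20_000, n_live=80_000))  # True
```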
Finally, are the stats you're collecting based on completed transactions, or do they include ones that were rolled back as well? 100 updates rolled back are just as evil as 100 that completed -- speed-wise, anyway.
----- Original Message -----
From: "mlw" <markw@mohawksoft.com>
To: "PostgreSQL-development" <pgsql-hackers@postgresql.org>
Sent: Wednesday, March 06, 2002 1:21 PM
Subject: [HACKERS] Vacuum daemon (pgvacuumd ?)

> Must have a trigger point of number of total tuples vs number of dirty tuples.
> Unfortunately some tuples are more important than others, but that I don't know
> how to really detect that. We should be able to keep track of the number of
> dirty tuples in a table.

These parameters are certainly correct in a lot of cases, but why not use a stored procedure to decide when to start a vacuum? The system table can maintain raw data related to vacuum: last vacuum timestamp, previous vacuum duration, table priority, ...

Then a parameter, say from 0 to 9 (0 for no vacuum), can be used as a vacuum profile. The stored procedure would translate this profile into thresholds adapted to its algorithm, which can use the per-table statistics that already exist. Obviously a standard procedure can be installed, but this gives the DBA the possibility to adapt the criteria to his DB with no modification to the code.
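The profile idea above (a stored procedure translating a 0-9 setting into concrete thresholds) might look like the following. It is shown in Python rather than PL/pgSQL purely for brevity; the function name and the particular threshold curve are invented here for illustration -- the point of the suggestion is that a DBA-replaceable procedure owns this mapping, not the daemon.

```python
def profile_thresholds(profile):
    """Map a per-table vacuum profile (0-9) to trigger thresholds.

    0 disables vacuum for the table entirely; higher profiles vacuum
    more eagerly.  The exact numbers here are arbitrary placeholders.
    """
    if not 0 <= profile <= 9:
        raise ValueError("profile must be between 0 and 9")
    if profile == 0:
        return None  # never vacuum this table
    return {
        # Aggressive profiles tolerate fewer dead tuples before firing.
        "base_dead_tuples": 1000 * (10 - profile),
        "dead_fraction": 0.05 * (10 - profile),
    }

print(profile_thresholds(0))                        # None
print(profile_thresholds(9)["base_dead_tuples"])    # 1000
```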
I'm thinking that unless the vacuum daemon needs backend info (stats?), it could be a totally separate entity (perhaps reading from the same config file). However, that would make it slightly harder to start up automatically.

I'm not fond of the duty cycle idea. My guess is that in too many cases, if you delay the vacuum it tends to take longer; then you delay even more, then it takes even longer... It should be related to the number of updates and deletes on a table or database. Maybe you don't need the duty cycle: just check stats every X minutes and, if there are enough invalid rows, do a vacuum.

Issue: you could end up doing vacuum continuously -- would this impact performance drastically? If there's vacuuming to be done, is it better to do it later than now? My assumption is that lazy vacuum no longer has such a severe impact, so it might be better to just do it ASAP. So actually a simple vacuum daemon may be good enough.

Is there a danger of high file fragmentation with frequent lazy vacuums?

Regards,
Link.

At 09:21 PM 05-03-2002 -0500, mlw wrote:
>(for background, see conversation: "Postgresql backend to perform vacuum
>automatically" )
> [...]
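The "simple vacuum daemon" suggested above -- no duty cycle, just poll the stats every X minutes and vacuum what's over the threshold -- reduces to a loop like the sketch below. Everything here is a placeholder: a real daemon would read pg_stat-style dead-tuple counters and issue VACUUM over a database connection, while this version takes both as injected callables so the decision logic stands alone.

```python
def run_vacuumd_once(get_stats, vacuum, min_dead=1000):
    """One polling cycle of a minimal vacuum daemon.

    get_stats() -> {table_name: n_dead} supplies per-table dead-tuple
    counts; vacuum(table) performs the actual work.  Every table at or
    over the threshold is vacuumed; the list of vacuumed tables is
    returned.
    """
    vacuumed = []
    for table, n_dead in get_stats().items():
        if n_dead >= min_dead:
            vacuum(table)
            vacuumed.append(table)
    return vacuumed

# Simulated stats in place of a real pg_stat query:
fake_stats = {"orders": 5000, "customers": 12, "log": 999}
done = run_vacuumd_once(lambda: fake_stats, lambda t: None)
print(done)  # ['orders']
```

The daemon proper would simply wrap this in `while True: run_vacuumd_once(...); time.sleep(interval)` per database, which is the whole of step (5) in the original program flow.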