Thread: Auto Vacuum

Auto Vacuum

From
Russell Smith
Date:
Hi All,

I am doing serious thinking about the implementation of Auto Vacuum as part of the backend, Not using libpq, but
classinginternal functions directly.
 
It appears to me that calling internal functions directly is a better implementation than using the external library to
dothe job.
 

I know I might be stepping on Matthew's toes, but I don't really want to.  I am a complete newbie to the postgresql
code,however I am trying.
 
Vacuum appears to be one of the bigger saw points with administrator having to configure it via scheduled tasks.

I have outlined things I have thought about below.  I've surely missed a lot, and am open to feedback.  Others may like
thecurrent tuning used
 
by the external autovacuum, however to have stats that are collected at vacuum, and used for later vacuum would mean we
don'tneed the
 
stats collector running.

The major autovacuum issues 

1. Transaction Wraparound
2. Vacuum of relations
3. Tracking of when to do vacuums
4. Where to store information needed by auto vacuum

I would suggest the first step is to replicate the wraparound and relation vacuuming from the current code first. But I
couldbe wrong.
 
Currently there are a lot of tuning options in the external autovacuum, and the best way to run vacuum internally will
needto be thought out.
 

1. Transaction Wraparound

It appears from the code, that the best time to do a transaction wraparound db wide vacuum is when the
frozenXID has wrapped, TransactionIdPrecedes(myXID, dbform->datfrozenxid).  And is probably the most
simple vacuum to implement.


2. Vacuuming of relations

Currently, the entire heap must be vacuumed at one time.  I would possible be desireable to have only part of the
relationvacuumed at
 
a time.  If you can find out which parts of the relation have the most slack space.  There is a todo item regarding
trackingrecent deletions
 
so they can be resused.  Some form of this would be helpful to work out what to vacuum.  Performance issues for this
typeof activity 
 
may be a concern.  But I have no experience to be able to make comment on them.  So I welcome yours.

3. Tracking of when to vacuum

Current autovacuum relies the stats collector to be running.  I would like to only use the stats if they are
available,
and have an option to be able to vacuum accurately without having to have stats running.

By adding certain histograms, on tuples, filesize and slack space we can guage the time between vacuums

number of tuples will show the inserts as opposed to updates.
file size will show that the file is growing and by how much between vacuums.
slack space will show the delete/updated records.

A new guc and relation option would need to be implemented to give a target slack space in a file.
this will help to reduce growth in relations if vacuum happens to not run frequently enough.  This information
can also inform autovacuum that it should be vacuuming more frequently.  The number would be a percentage,
eg 10% of the total file size is allowed to be unused.  Also alter table would allow users to set levels of slackness
for each relation.  If the table gets too much more than the target slack space, a "partial" vacuum full could be run
to reduce the size of the table by moving tuples at the end of the table to slack space near the beginning and 
shortening the table length.  It would require a full table lock, but you may be able to space it out, to only do a
page
at a time.

/* target percentage of slack space */
vacuum_default_target_slack_space = 10

ALTER TABLE SET TARGET SLACK SPACE = 10;

4. Where to store information required by auto vacuum.

Auto vacuum needs somewhere to stop the information it needs about current and future vacuuming.  I am unsure of where
to put this.  It appears as if it will have a number of fields.  I feel like pg_class is the best place to put the
information,but 
 
again I have no idea.

That's the best I can do for now.  I can clarify things further if required.

Regards

Russell.


Re: Auto Vacuum

From
Bruce Momjian
Date:
I have added an auto-vacuum TODO item:

* Auto-vacuum       o Move into the backend code       o Scan the buffer cache to find free space or use background
writer      o Use free-space map information to guide refilling
 


---------------------------------------------------------------------------

Russell Smith wrote:
> Hi All,
> 
> I am doing serious thinking about the implementation of Auto Vacuum as part of the backend, Not using libpq, but
classinginternal functions directly.
 
> It appears to me that calling internal functions directly is a better implementation than using the external library
todo the job.
 
> 
> I know I might be stepping on Matthew's toes, but I don't really want to.  I am a complete newbie to the postgresql
code,however I am trying.
 
> Vacuum appears to be one of the bigger saw points with administrator having to configure it via scheduled tasks.
> 
> I have outlined things I have thought about below.  I've surely missed a lot, and am open to feedback.  Others may
likethe current tuning used
 
> by the external autovacuum, however to have stats that are collected at vacuum, and used for later vacuum would mean
wedon't need the
 
> stats collector running.
> 
> The major autovacuum issues 
> 
> 1. Transaction Wraparound
> 2. Vacuum of relations
> 3. Tracking of when to do vacuums
> 4. Where to store information needed by auto vacuum
> 
> I would suggest the first step is to replicate the wraparound and relation vacuuming from the current code first. But
Icould be wrong.
 
> Currently there are a lot of tuning options in the external autovacuum, and the best way to run vacuum internally
willneed to be thought out.
 
> 
> 1. Transaction Wraparound
> 
> It appears from the code, that the best time to do a transaction wraparound db wide vacuum is when the
> frozenXID has wrapped, TransactionIdPrecedes(myXID, dbform->datfrozenxid).  And is probably the most
> simple vacuum to implement.
> 
> 
> 2. Vacuuming of relations
> 
> Currently, the entire heap must be vacuumed at one time.  I would possible be desireable to have only part of the
relationvacuumed at
 
> a time.  If you can find out which parts of the relation have the most slack space.  There is a todo item regarding
trackingrecent deletions
 
> so they can be resused.  Some form of this would be helpful to work out what to vacuum.  Performance issues for this
typeof activity 
 
> may be a concern.  But I have no experience to be able to make comment on them.  So I welcome yours.
> 
> 3. Tracking of when to vacuum
> 
> Current autovacuum relies the stats collector to be running.  I would like to only use the stats if they are
available,
> and have an option to be able to vacuum accurately without having to have stats running.
> 
> By adding certain histograms, on tuples, filesize and slack space we can guage the time between vacuums
> 
> number of tuples will show the inserts as opposed to updates.
> file size will show that the file is growing and by how much between vacuums.
> slack space will show the delete/updated records.
> 
> A new guc and relation option would need to be implemented to give a target slack space in a file.
> this will help to reduce growth in relations if vacuum happens to not run frequently enough.  This information
> can also inform autovacuum that it should be vacuuming more frequently.  The number would be a percentage,
> eg 10% of the total file size is allowed to be unused.  Also alter table would allow users to set levels of
slackness
> for each relation.  If the table gets too much more than the target slack space, a "partial" vacuum full could be
run
> to reduce the size of the table by moving tuples at the end of the table to slack space near the beginning and 
> shortening the table length.  It would require a full table lock, but you may be able to space it out, to only do a
page
> at a time.
> 
> /* target percentage of slack space */
> vacuum_default_target_slack_space = 10
> 
> ALTER TABLE SET TARGET SLACK SPACE = 10;
> 
> 4. Where to store information required by auto vacuum.
> 
> Auto vacuum needs somewhere to stop the information it needs about current and future vacuuming.  I am unsure of
where
> to put this.  It appears as if it will have a number of fields.  I feel like pg_class is the best place to put the
information,but 
 
> again I have no idea.
> 
> That's the best I can do for now.  I can clarify things further if required.
> 
> Regards
> 
> Russell.
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Auto Vacuum

From
Gavin Sherry
Date:
Matthew T. O'Connor looked at this fairly closely leading up to 8.0
feature freeze. There was a long discussion earlier this year with respect
to libpq vs. using backend functions directly to vacuum multiple
databases.

http://archives.postgresql.org/pgsql-hackers/2004-03/msg00931.php

This should be mandatory reading before proceeding. Also, you want to talk
to Matthew and review his patch (submitted early July, I think).

Thanks,

Gavin

On Sun, 28 Nov 2004, Bruce Momjian wrote:

>
> I have added an auto-vacuum TODO item:
>
> * Auto-vacuum
>         o Move into the backend code
>         o Scan the buffer cache to find free space or use background writer
>         o Use free-space map information to guide refilling
>
>
> ---------------------------------------------------------------------------
>
> Russell Smith wrote:
> > Hi All,
> >
> > I am doing serious thinking about the implementation of Auto Vacuum as part of the backend, Not using libpq, but
classinginternal functions directly.
 
> > It appears to me that calling internal functions directly is a better implementation than using the external
libraryto do the job.
 
> >
> > I know I might be stepping on Matthew's toes, but I don't really want to.  I am a complete newbie to the postgresql
code,however I am trying.
 
> > Vacuum appears to be one of the bigger saw points with administrator having to configure it via scheduled tasks.
> >
> > I have outlined things I have thought about below.  I've surely missed a lot, and am open to feedback.  Others may
likethe current tuning used
 
> > by the external autovacuum, however to have stats that are collected at vacuum, and used for later vacuum would
meanwe don't need the
 
> > stats collector running.
> >
> > The major autovacuum issues
> >
> > 1. Transaction Wraparound
> > 2. Vacuum of relations
> > 3. Tracking of when to do vacuums
> > 4. Where to store information needed by auto vacuum
> >
> > I would suggest the first step is to replicate the wraparound and relation vacuuming from the current code first.
ButI could be wrong.
 
> > Currently there are a lot of tuning options in the external autovacuum, and the best way to run vacuum internally
willneed to be thought out.
 
> >
> > 1. Transaction Wraparound
> >
> > It appears from the code, that the best time to do a transaction wraparound db wide vacuum is when the
> > frozenXID has wrapped, TransactionIdPrecedes(myXID, dbform->datfrozenxid).  And is probably the most
> > simple vacuum to implement.
> >
> >
> > 2. Vacuuming of relations
> >
> > Currently, the entire heap must be vacuumed at one time.  I would possible be desireable to have only part of the
relationvacuumed at
 
> > a time.  If you can find out which parts of the relation have the most slack space.  There is a todo item regarding
trackingrecent deletions
 
> > so they can be resused.  Some form of this would be helpful to work out what to vacuum.  Performance issues for
thistype of activity
 
> > may be a concern.  But I have no experience to be able to make comment on them.  So I welcome yours.
> >
> > 3. Tracking of when to vacuum
> >
> > Current autovacuum relies the stats collector to be running.  I would like to only use the stats if they are
available,
> > and have an option to be able to vacuum accurately without having to have stats running.
> >
> > By adding certain histograms, on tuples, filesize and slack space we can guage the time between vacuums
> >
> > number of tuples will show the inserts as opposed to updates.
> > file size will show that the file is growing and by how much between vacuums.
> > slack space will show the delete/updated records.
> >
> > A new guc and relation option would need to be implemented to give a target slack space in a file.
> > this will help to reduce growth in relations if vacuum happens to not run frequently enough.  This information
> > can also inform autovacuum that it should be vacuuming more frequently.  The number would be a percentage,
> > eg 10% of the total file size is allowed to be unused.  Also alter table would allow users to set levels of
slackness
> > for each relation.  If the table gets too much more than the target slack space, a "partial" vacuum full could be
run
> > to reduce the size of the table by moving tuples at the end of the table to slack space near the beginning and
> > shortening the table length.  It would require a full table lock, but you may be able to space it out, to only do a
page
> > at a time.
> >
> > /* target percentage of slack space */
> > vacuum_default_target_slack_space = 10
> >
> > ALTER TABLE SET TARGET SLACK SPACE = 10;
> >
> > 4. Where to store information required by auto vacuum.
> >
> > Auto vacuum needs somewhere to stop the information it needs about current and future vacuuming.  I am unsure of
where
> > to put this.  It appears as if it will have a number of fields.  I feel like pg_class is the best place to put the
information,but
 
> > again I have no idea.
> >
> > That's the best I can do for now.  I can clarify things further if required.
> >
> > Regards
> >
> > Russell.
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
> >
>
> --
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>


Re: Auto Vacuum

From
"Matthew T. O'Connor"
Date:
Hello Russell,

Russell Smith wrote:

>I am doing serious thinking about the implementation of Auto Vacuum as part of the backend, Not using libpq, but
classinginternal functions directly.
 
>It appears to me that calling internal functions directly is a better implementation than using the external library
todo the job.
 
>
>  
>
We are planning to move it into the backend (no longer an external libpq 
based contrib module)  I tried to do this for 8.0, but it didn't make 
the cut, so I expect this work will be done for 8.1.

>I know I might be stepping on Matthew's toes, but I don't really want to.  I am a complete newbie to the postgresql
code,however I am trying.
 
>Vacuum appears to be one of the bigger saw points with administrator having to configure it via scheduled tasks.
>
>  
>
Agreed, that is one of the reasons I took on Autovacuum, I think it is 
something a lot of admins would like PG to do for itself.

>The major autovacuum issues 
>
>1. Transaction Wraparound
>2. Vacuum of relations
>3. Tracking of when to do vacuums
>4. Where to store information needed by auto vacuum
>
>1. Transaction Wraparound
>
>  
>
This is handled by the current autovacuum using the process outlined in:
http://www.postgresql.org/docs/7.4/static/maintenance.html

>2. Vacuuming of relations
>
>Currently, the entire heap must be vacuumed at one time.  I would possible be desireable to have only part of the
relationvacuumed at
 
>a time.  If you can find out which parts of the relation have the most slack space.  There is a todo item regarding
trackingrecent deletions
 
>so they can be resused.  Some form of this would be helpful to work out what to vacuum.  Performance issues for this
typeof activity 
 
>may be a concern.  But I have no experience to be able to make comment on them.  So I welcome yours.
>
>  
>
This is not really an autovacuum related topic, if at some point someone 
adds the ability to VACUUM to do partials then autovacuum will make use 
of it.  BTW, this has been suggested several times so please search the 
archives for details.

>3. Tracking of when to vacuum
>
>Current autovacuum relies the stats collector to be running.  I would like to only use the stats if they are
available,
>and have an option to be able to vacuum accurately without having to have stats running.
>  
>
I think it is universally agreed upon that using data from the FSM is a 
better solution since it would not require you to have the stats system 
running and actually gives you a very accurate picture of what table 
have slack space to recover (assuming that the FSM is large enough).  
This is a topic that I need help on from some more enlightened core hackers.

>4. Where to store information required by auto vacuum.
>  
>
The backend integration patch that I submitted a few months ago added a 
new pg_autovacuum table to the system catalogues.  This table stored 
data that pg_autovacuum needed to persist across backend restarts, and 
also allowed the user to set per table settings for thresholds etc.  I 
never heard anyone complain about this design, so from the silence I 
assume this is an acceptable way of maintaining pg_autovacuum related data.

Matthew


Re: Auto Vacuum

From
"Matthew T. O'Connor"
Date:
Bruce Momjian wrote:

>I have added an auto-vacuum TODO item:
>
>* Auto-vacuum
>        o Move into the backend code
>        o Scan the buffer cache to find free space or use background writer
>        o Use free-space map information to guide refilling
>

I'm not sure what you mean exactly by "Scan the buffer cache to find 
free space or use background writer", the other two are definitely high 
priority todo items (at least as far as autovacuum in concerned).

Matthew



Re: Auto Vacuum

From
Bruce Momjian
Date:
Matthew T. O'Connor wrote:
> Bruce Momjian wrote:
> 
> >I have added an auto-vacuum TODO item:
> >
> >* Auto-vacuum
> >        o Move into the backend code
> >        o Scan the buffer cache to find free space or use background writer
> >        o Use free-space map information to guide refilling
> >
> 
> I'm not sure what you mean exactly by "Scan the buffer cache to find 
> free space or use background writer", the other two are definitely high 
> priority todo items (at least as far as autovacuum in concerned).
> 

I am thinking we could look for expired tuples when while they are in
the buffer cache or before they are written to disk so we don't have to
a sequential scan to find them.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Auto Vacuum

From
"Matthew T. O'Connor"
Date:
Bruce Momjian wrote:

>Matthew T. O'Connor wrote:
>  
>
>>Bruce Momjian wrote:
>>    
>>
>>>I have added an auto-vacuum TODO item:
>>>
>>>* Auto-vacuum
>>>       o Move into the backend code
>>>       o Scan the buffer cache to find free space or use background writer
>>>       o Use free-space map information to guide refilling
>>>      
>>>
>>I'm not sure what you mean exactly by "Scan the buffer cache to find 
>>free space or use background writer", the other two are definitely high 
>>priority todo items (at least as far as autovacuum in concerned).
>>    
>>
>
>I am thinking we could look for expired tuples when while they are in
>the buffer cache or before they are written to disk so we don't have to
>a sequential scan to find them.
>  
>
Is that related to autovacuum? Or is that a potential feature inside the 
actual vacuum command?


Re: Auto Vacuum

From
Bruce Momjian
Date:
Matthew T. O'Connor wrote:
> Bruce Momjian wrote:
> 
> >Matthew T. O'Connor wrote:
> >  
> >
> >>Bruce Momjian wrote:
> >>    
> >>
> >>>I have added an auto-vacuum TODO item:
> >>>
> >>>* Auto-vacuum
> >>>       o Move into the backend code
> >>>       o Scan the buffer cache to find free space or use background writer
> >>>       o Use free-space map information to guide refilling
> >>>      
> >>>
> >>I'm not sure what you mean exactly by "Scan the buffer cache to find 
> >>free space or use background writer", the other two are definitely high 
> >>priority todo items (at least as far as autovacuum in concerned).
> >>    
> >>
> >
> >I am thinking we could look for expired tuples when while they are in
> >the buffer cache or before they are written to disk so we don't have to
> >a sequential scan to find them.
> >  
> >
>
> Is that related to autovacuum? Or is that a potential feature inside the 
> actual vacuum command?

That could be part of auto-vacuum.  Vacuum itself would still sequential
scan, I think.  The idea is to easily grab expire tuples when they are
most cheaply found.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Auto Vacuum

From
Christopher Browne
Date:
Oops! pgman@candle.pha.pa.us (Bruce Momjian) was seen spray-painting on a wall:
> That could be part of auto-vacuum.  Vacuum itself would still
> sequential scan, I think.  The idea is to easily grab expire tuples
> when they are most cheaply found.

The nifty handling of this would be to introduce "VACUUM CACHE", which
would simply walk through the shared memory cache to look for expiries
there.  

That could have a most interesting interaction with ARC...

On the "unfortunate" side, marking tuples as dead would, I believe
draw in some index pages.  (Right?)  

Those pages drawn in would remain at the "cheapest" end of the cache;
an ARC 'win.'  And it should be the case that this ultimately shrinks
cache usage, as dead tuples get thrown out.

Running VACUUM CACHE periodically on a system that is "killing" tuples
at a pretty steady clip ought to clear out many of those tuples
without needing to browse the tables.  

This ought to be particularly helpful with large tables that have
small "contentious" portions that generate dead tuples.
-- 
let name="cbbrowne" and tld="gmail.com" in name ^ "@" ^ tld;;
http://www.ntlug.org/~cbbrowne/rdbms.html
"Heuristics (from the  French heure, "hour") limit the  amount of time
spent executing something.  [When using heuristics] it shouldn't take
longer than an hour to do something."