Thread: autovacuum next steps

autovacuum next steps

From
Alvaro Herrera
Date:
After staring at my previous notes for autovac scheduling, it has become
clear that the basics of it are not really going to work as specified.
So here is a more realistic plan:

First, we introduce an autovacuum_max_workers parameter, to limit the
total amount of workers that can be running at any time.  Use this
number to create extra PGPROC entries, etc, similar to the way we handle
the prepared xacts stuff.  The default should be low, say 3 or 4.
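
Hand-waving sketch of the shared memory sizing (invented names, not the
actual backend symbols), just to show that the worker slots would be fixed
at server start, like the prepared-xact slots:

#include <stdio.h>

/*
 * Hand-waving sketch only; these are invented variable names, not the
 * actual backend symbols.  The point is just that the worker slots are
 * accounted for up front, at server start.
 */
int
main(void)
{
    int max_connections           = 100;
    int autovacuum_max_workers    = 3;   /* the proposed GUC */
    int max_prepared_transactions = 5;

    int total_proc_slots = max_connections
                         + autovacuum_max_workers
                         + max_prepared_transactions;

    printf("PGPROC-like slots to allocate at startup: %d\n", total_proc_slots);
    return 0;
}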

The launcher sends a worker into a database just like it does currently.
This worker determines what tables need vacuuming per the pg_autovacuum
settings and pgstat data.  If it's more than one table, it puts the
number of tables in shared memory and sends a signal to the launcher.

The launcher then starts
min(autovacuum_max_workers - currently running workers, tables to vacuum - 1)
more workers to process that database.  Maybe we could have a
max-workers parameter per-database in pg_database to use as a limit here
as well.
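
In rough illustrative C (none of these are real backend functions), the
launcher's arithmetic would be:

#include <stdio.h>

/*
 * Illustrative sketch of the launcher's decision.  One worker is already
 * in the database, hence "tables_to_vacuum - 1".
 */
static int
extra_workers_to_start(int max_workers, int running_workers,
                       int tables_to_vacuum)
{
    int by_capacity = max_workers - running_workers;
    int by_work     = tables_to_vacuum - 1;
    int n = (by_capacity < by_work) ? by_capacity : by_work;

    return (n > 0) ? n : 0;     /* never negative */
}

int
main(void)
{
    /* e.g. max_workers = 4, one worker already running, 6 tables to do */
    printf("start %d more workers\n", extra_workers_to_start(4, 1, 6));
    return 0;
}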

Each worker, including the initial one, starts vacuuming tables
according to pgstat data.  They recheck the pgstat data after finishing
each table, so that a table vacuumed by another worker is not processed
twice (maybe problematic: a table with high update rate may be vacuumed
more than once.  Maybe this is a feature not a bug).


Once autovacuum_naptime has passed, if the workers have not finished
yet, the launcher wants to vacuum another database.  At this point, the
launcher wants some of the workers processing the first database to exit
early as soon as they finish one table, so that they can help vacuuming
the other database.  It can do this by setting a flag in shmem that the
workers can check when finished with a table; if the flag is set, they
exit instead of continuing with another table.  The launcher then starts
a worker in the second database.  The launcher does this until the
number of workers is even among both databases.  This can be done till
having one worker per database; so at most autovacuum_max_workers
databases can be under automatic vacuuming at any time, one worker each.
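
The worker side of that is just a check between tables, roughly like this
sketch; the flag is a plain global here only so the example stands alone
(in reality it would live in shared memory and be set by the launcher):

#include <stdbool.h>
#include <stdio.h>

/* stand-in for the shared-memory flag the launcher would set */
static volatile bool rebalance_requested = false;

/* pretend work list: three table OIDs that still need vacuuming */
static int pending[] = { 16384, 16390, 16402 };
static int npending  = 3;

static bool
next_table(int *oid)
{
    if (npending == 0)
        return false;
    *oid = pending[--npending];
    return true;
}

static void
vacuum_one_table(int oid)
{
    printf("vacuuming table %d\n", oid);
    /* ... actual vacuum work would happen here ... */
}

int
main(void)
{
    int oid;

    while (next_table(&oid))
    {
        vacuum_one_table(oid);

        /* recheck only between tables, never in the middle of one */
        if (rebalance_requested)
        {
            printf("launcher asked us to exit; quitting early\n");
            break;
        }
    }
    return 0;
}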

When there are autovacuum_max_workers databases under vacuum, the
launcher doesn't have anything else to do until some worker exits on its
own.

When there is a single worker processing a database, it does not recheck
pgstat data after each table.  This is to prevent a high-update-rate
table from starving the vacuuming of other databases.


How does this sound?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: autovacuum next steps

From
"Matthew T. O'Connor"
Date:
Alvaro Herrera wrote:
> After staring at my previous notes for autovac scheduling, it has become
> clear that this basics of it is not really going to work as specified.
> So here is a more realistic plan:

[Snip Detailed Description]

> How does this sound?

On first blush, I'm not sure I like this as it doesn't directly attack 
the table starvation problem, and I think it could be a net loss of speed.

VACUUM is I/O bound; as such, just sending multiple vacuum commands at a 
DB isn't going to make things faster, since you will now have multiple 
processes reading from multiple tables at the same time.  I think in 
general this is a bad thing (unless we someday account for I/O made 
available from multiple tablespaces).  In general, the only time it's a 
good idea to have multiple vacuums running at the same time is when a 
big table is starving a small hot table and causing bloat.

I think we can extend the current autovacuum stats to add one more 
column that specifies "is hot" or something to that effect.  Then when 
the AV launcher sends a worker to a DB, it will first look for tables 
marked as hot and work on them.  While a worker is working on hot tables, 
the launcher need not send any additional workers to this database.  If 
the launcher notices that a worker is working on regular tables, it can 
send another worker which will look for hot tables to work on; if that 
worker doesn't find any hot tables that need work, then it exits, leaving 
the original worker to continue plodding along.

Thoughts?




Re: autovacuum next steps

From
Alvaro Herrera
Date:
Matthew T. O'Connor wrote:
> Alvaro Herrera wrote:
> >After staring at my previous notes for autovac scheduling, it has become
> >clear that this basics of it is not really going to work as specified.
> >So here is a more realistic plan:
> 
> [Snip Detailed Description]
> 
> >How does this sound?
> 
> On first blush, I'm not sure I like this as it doesn't directly attack 
> the table starvation problem, and I think it could be a net loss of speed.
> 
> VACUUM is I/O bound, as such, just sending multiple vacuum commands at a 
> DB isn't going to make things faster, you are now going to have multiple 
> processes reading from multiple tables at the same time.  I think in 
> general this is a bad thing (unless we someday account for I/O made 
> available from multiple tablespaces).

Yeah, I understand that.  However, I think that can be remedied by using
a reasonable autovacuum_vacuum_cost_delay setting, so that each worker
uses less than the total I/O available.  The main point of the proposal
is to allow multiple workers on a DB while also allowing multiple
databases to be processed in parallel.

> I think we can extend the current autovacuum stats to add one more 
> column that specifies "is hot" or something to that effect.  Then when 
> the AV launcher sends a worker to a DB, it will first look for tables 
> marked as hot and work on them.  While working on hot tables, the 
> launcher need not send any additional workers to this database, if the 
> launcher notices that a worker is working on regular tables, it can send 
> another worker which will look for hot tables to working, if the worker 
> doesn't find any hot tables that need work, then it exits leaving the 
> original working to continue plodding along.

How would you define what's a "hot" table?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: autovacuum next steps

From
Gregory Stark
Date:
In an ideal world I think you want precisely one vacuum process running per
tablespace on the assumption that each tablespace represents a distinct
physical device.

The cases where we currently find ourselves wanting more are where small
tables are due for vacuuming more frequently than the time it takes for a
large table to receive a single full pass.

If we could have autovacuum interrupt a vacuum in mid-sweep, perform a cycle
of vacuums on smaller tables, then resume, that problem would go away. That
sounds too difficult though, but perhaps we could do something nearly as good.

One option that I've heard before is to have vacuum after a single iteration
(ie, after it fills maintenance_work_mem and does the index cleanup and the
second heap pass), remember where it was and pick up from that point next
time.
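
Roughly, the only persistent state needed would be the block to resume
from; a toy sketch with invented names:

#include <stdio.h>

/*
 * Toy sketch of the "remember where we were" idea: after one
 * fill-and-flush cycle of maintenance_work_mem we note the heap block we
 * stopped at, and the next vacuum of the table starts from there instead
 * of block 0.  Nothing here corresponds to real backend code.
 */
typedef struct
{
    unsigned int next_block;    /* first heap block not yet scanned */
} VacuumResumePoint;

static unsigned int
run_one_iteration(unsigned int start_block, unsigned int nblocks,
                  unsigned int blocks_per_iteration)
{
    unsigned int stop = start_block + blocks_per_iteration;

    if (stop > nblocks)
        stop = nblocks;
    printf("scanned blocks %u..%u, cleaned indexes, swept heap\n",
           start_block, stop - 1);
    return stop;
}

int
main(void)
{
    VacuumResumePoint rp = { 0 };
    unsigned int nblocks = 1000000;     /* a big table */

    /* one pass now ... */
    rp.next_block = run_one_iteration(rp.next_block, nblocks, 250000);
    /* ... vacuum the small hot tables in the meantime ... */
    /* ... and a later pass resumes where the previous one stopped */
    rp.next_block = run_one_iteration(rp.next_block, nblocks, 250000);
    return 0;
}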

If instead autovacuum could tell vacuum exactly how long to run for (or
calculated how many pages that represented based on cost_delay) then it could
calculate when it will next need to schedule another table in the same
tablespace and try to arrange for the vacuum of the large table to be done by
then.

Once there are no smaller, more frequently vacuumed tables due to be
scheduled, it would start vacuuming the large table again, and it would resume
from where the first pass left off.

This only works if the large tables really don't need to be vacuumed so often
that autovacuum can't keep up. Our current situation is that there is a size
at which this happens. But arranging to have only one vacuum process per
tablespace will only make that less likely to happen rather than more.

I think the changes to vacuum itself are pretty small to get it to remember
where it left off last time and start from mid-table. I'm not sure how easy it
would be to get autovacuum to juggle all these variables though.

Of course users may not create separate tablespaces for physical devices, or
they may set cost_delay so high you really do need more vacuum processes, etc.
So you probably still need a num_vacuum_daemons but the recommended setting
would be the same as the number of physical devices and autovacuum could try
to divide them equally between tablespaces which would amount to the same
thing.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: autovacuum next steps

From
"Matthew T. O'Connor"
Date:
Alvaro Herrera wrote:
> Matthew T. O'Connor wrote:
>> On first blush, I'm not sure I like this as it doesn't directly attack 
>> the table starvation problem, and I think it could be a net loss of speed.
>>
>> VACUUM is I/O bound, as such, just sending multiple vacuum commands at a 
>> DB isn't going to make things faster, you are now going to have multiple 
>> processes reading from multiple tables at the same time.  I think in 
>> general this is a bad thing (unless we someday account for I/O made 
>> available from multiple tablespaces).
> 
> Yeah, I understand that.  However, I think that can be remedied by using
> a reasonable autovacuum_vacuum_cost_delay setting, so that each worker
> uses less than the total I/O available.  The main point of the proposal
> is to allow multiple workers on a DB while also allowing multiple
> databases to be processed in parallel.

So you are telling people to choose an autovacuum_delay so high that 
they need to run multiple autovacuums at once to keep up?  I'm probably 
being too dramatic, but it seems inconsistent.

>> I think we can extend the current autovacuum stats to add one more 
>> column that specifies "is hot" or something to that effect.  Then when 
>> the AV launcher sends a worker to a DB, it will first look for tables 
>> marked as hot and work on them.  While working on hot tables, the 
>> launcher need not send any additional workers to this database, if the 
>> launcher notices that a worker is working on regular tables, it can send 
>> another worker which will look for hot tables to working, if the worker 
>> doesn't find any hot tables that need work, then it exits leaving the 
>> original working to continue plodding along.
> 
> How would you define what's a "hot" table?

I wasn't clear: I would have the admin specify it, and we can store it 
as an additional column in the pg_autovacuum_settings table.  Or perhaps 
a table could be considered hot if it is below some size threshold and 
autovacuum sees that it needs to be vacuumed every time it checks it, 
say 10 times in a row, or something like that.


Re: autovacuum next steps

From
Chris Browne
Date:
alvherre@commandprompt.com (Alvaro Herrera) writes:
> When there is a single worker processing a database, it does not recheck
> pgstat data after each table.  This is to prevent a high-update-rate
> table from starving the vacuuming of other databases.

This case is important; I don't think that having multiple workers
fully alleviates the problem condition.

Pointedly, you need to have a way of picking up tables often enough to
avoid the XID rollover problem.  That may simply require that on some
periodic basis, a query is run to queue up tables that are getting
close to having an "XID problem."
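
Since XIDs are 32 bits and wrap around, "getting close" has to be judged
with modular arithmetic; a toy illustration (not backend code, and the
names and threshold are made up):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t Xid;

/* age of a table's oldest unfrozen XID, computed modulo 2^32 */
static uint32_t
xid_age(Xid current, Xid frozen)
{
    return (uint32_t) (current - frozen);
}

int
main(void)
{
    Xid      current_xid  = 3000000123u;
    Xid      relfrozenxid = 1000000123u;
    uint32_t limit        = 1500000000u;   /* "getting close" threshold */

    if (xid_age(current_xid, relfrozenxid) > limit)
        printf("table is due for a wraparound-avoidance vacuum\n");
    return 0;
}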
-- 
(reverse (concatenate 'string "ofni.secnanifxunil" "@" "enworbbc"))
http://linuxfinances.info/info/finances.html
Rules of  the Evil Overlord #189. "I  will never tell the  hero "Yes I
was the one who  did it, but you'll never be able  to prove it to that
incompetent  old fool."  Chances  are, that  incompetent  old fool  is
standing behind the curtain."  <http://www.eviloverlord.com/>


Re: autovacuum next steps

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Each worker, including the initial one, starts vacuuming tables
> according to pgstat data.  They recheck the pgstat data after finishing
> each table, so that a table vacuumed by another worker is not processed
> twice (maybe problematic: a table with high update rate may be vacuumed
> more than once.  Maybe this is a feature not a bug).

How are you going to make that work without race conditions?  ISTM
practically guaranteed that all the workers will try to vacuum the same
table.

> Once autovacuum_naptime has passed, if the workers have not finished
> yet, the launcher wants to vacuum another database.

This seems a rather strange design, as it will encourage concentrations
of workers in a single database.  Wouldn't it be better to spread them
out among multiple databases by default?
        regards, tom lane


Re: autovacuum next steps

From
Ron Mayer
Date:
Alvaro Herrera wrote:
> 
> Once autovacuum_naptime... autovacuum_max_workers...
> How does this sound?

The knobs exposed on autovacuum feel kinda tangential to
what I think I'd really want to control.

IMHO "vacuum_mbytes_per_second" would be quite a bit more
intuitive than cost_delay, naptime, etc.


ISTM I can relatively easily estimate and/or spec out how
much "extra" I/O bandwidth I have per device for vacuum;
and would pretty much want vacuum to be constantly
running on whichever table needs it the most, so
long as it can stay under that bandwidth limit.

Could vacuum have a tunable that says "X MBytes/second"
(perhaps per device) and have it measure how much I/O
it's actually doing and try to stay under that limit?

For more fine-grained control a cron job could go
around setting different MBytes/second limits during
peak times vs idle times.
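
A crude throttle could just track bytes read and sleep whenever it gets
ahead of its budget; a toy sketch only (a real implementation would
measure actual elapsed time rather than just the sleeps):

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    const double budget_mb_per_sec = 4.0;   /* the proposed knob */
    const double chunk_mb          = 0.25;  /* per read burst    */
    double       spent_seconds     = 0.0;   /* time accounted so far */
    double       read_mb           = 0.0;
    int          i;

    for (i = 0; i < 20; i++)
    {
        /* ... read one chunk of the table here ... */
        read_mb += chunk_mb;

        /* how long should this much I/O have taken at the budget rate? */
        double allowed_seconds = read_mb / budget_mb_per_sec;

        if (allowed_seconds > spent_seconds)
        {
            double sleep_s = allowed_seconds - spent_seconds;

            usleep((useconds_t) (sleep_s * 1e6));
            spent_seconds += sleep_s;
        }
    }
    printf("read %.2f MB in roughly %.1f s\n", read_mb, spent_seconds);
    return 0;
}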


If people are concerned about CPU-intensive vacuums
instead of I/O-intensive ones (does anyone experience
that?), another tunable "vacuum_percent_of_cpu" would
be more straightforward than cost_delay, cost_page_hit,
etc.  But I'd be a bit surprised if CPU-intensive
vacuums are common.


Re: autovacuum next steps

From
Csaba Nagy
Date:
> One option that I've heard before is to have vacuum after a single iteration
> (ie, after it fills maintenance_work_mem and does the index cleanup and the
> second heap pass), remember where it was and pick up from that point next
> time.

From my experience this is not acceptable... I have tables for which the
index cleanup takes hours, so no matter how low I set
maintenance_work_mem (in fact I set it high enough so there's only one
iteration), it will take too much time and the queue tables will get overly
bloated (that isn't actually happening, because they now get special
"cluster" treatment).

Cheers,
Csaba.




Re: autovacuum next steps

From
Galy Lee
Date:
Gregory Stark wrote:
> If we could have autovacuum interrupt a vacuum in mid-sweep, perform a cycle
> of vacuums on smaller tables, then resume, that problem would go away. That
> sounds too difficult though, but perhaps we could do something nearly as good.

I think making vacuum have this interrupt-and-resume capability would be 
quite useful for large tables.

It would give autovacuum more flexibility to create a good scheduling 
scheme. Sometimes it takes a whole day to vacuum a large table (a 
hundreds-of-GB table may qualify); setting cost_delay can make it last 
even several days. If the system has a maintenance window, the vacuum 
task for the large table can be split to fit into that window using the 
interrupt-and-resume feature.

> One option that I've heard before is to have vacuum after a single iteration
> (ie, after it fills maintenance_work_mem and does the index cleanup and the
> second heap pass), remember where it was and pick up from that point next
> time.
> 

Even a single iteration may take a long time, so breaking only at the 
boundary of an iteration is not very useful. I think it is not so 
difficult to get vacuum to remember where it left off and start from 
that point next time. The following is the basic idea.

A typical vacuum process mainly has the following phases:

  Phase 1. scan heap
  Phase 2. scan and sweep indexes
  Phase 3. sweep heap
  Phase 4. update FSM
  Phase 5. truncate CLOG

When vacuum is interrupted, we can just save the collected information 
to disk, and restore it later when vacuum restarts. When the vacuum 
process is interrupted, we can remember the dead tuple list and the 
block number it has scanned up to in phase 1, the indexes it has cleaned 
up in phase 2, and the tuples it has swept in phase 3. Before exiting 
from vacuum, we can also merge the free space information into the FSM.
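
A very rough sketch of the kind of state that would have to be spilled to
disk (illustrative names only, not a proposed on-disk format):

#include <stdio.h>

typedef struct
{
    unsigned int last_heap_block_scanned;   /* progress of phase 1 */
    int          indexes_cleaned;           /* progress of phase 2 */
    unsigned int last_heap_block_swept;     /* progress of phase 3 */
    long         num_dead_tuples;           /* length of the saved TID list */
    /* the dead-tuple TID array itself would follow on disk */
} VacuumCheckpoint;

int
main(void)
{
    VacuumCheckpoint cp = { 123456, 2, 98765, 400000 };
    FILE *f = fopen("vacuum_state.tmp", "wb");

    if (f == NULL)
        return 1;
    fwrite(&cp, sizeof(cp), 1, f);   /* save on interrupt ... */
    fclose(f);
    /* ... and read it back with fread() when vacuum restarts */
    return 0;
}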

We are working on this feature now.  I will propose it later for 
discussion.

Best Regards
Galy Lee
-- 
NTT Open Source Software Center


Re: autovacuum next steps

From
"Jim C. Nasby"
Date:
I'm wondering if we can do one better...

Since what we really care about is I/O responsiveness for the rest of
the system, could we just time how long I/O calls take to complete? I
know that gettimeofday can have a non-trivial overhead, but do we care
that much about it in the case of autovac?

On Fri, Feb 16, 2007 at 05:37:26PM -0800, Ron Mayer wrote:
> Alvaro Herrera wrote:
> > 
> > Once autovacuum_naptime... autovacuum_max_workers...
> > How does this sound?
> 
> The knobs exposed on autovacuum feel kinda tangential to
> what I think I'd really want to control.
> 
> IMHO "vacuum_mbytes_per_second" would be quite a bit more
> intuitive than cost_delay, naptime, etc.
> 
> 
> ISTM I can relatively easily estimate and/or spec out how
> much "extra" I/O bandwidth I have per device for vacuum;
> and would pretty much want vacuum to be constantly
> running on whichever table that needs it the most so
> long as it can stay under that bandwith limit.
> 
> Could vacuum have a tunable that says "X MBytes/second"
> (perhaps per device) and have it measure how much I/O
> it's actually doing and try to stay under that limit?
> 
> For more fine-grained control a cron job could go
> around setting different MBytes/second limits during
> peak times vs idle times.
> 
> 
> If people are concerned about CPU intensive vacuums
> instead of I/O intensive ones (does anyone experience
> that? - another tuneable "vacuum_percent_of_cpu" would
> be more straightforward than delay_cost, cost_page_hit,
> etc.   But I'd be a bit surprised if cpu intensive
> vacuums are common.
> 

-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: autovacuum next steps, take 2

From
Alvaro Herrera
Date:
Ok, scratch that :-)  Another round of braindumping below.


Launcher starts one worker in each database.  This worker is not going
to do vacuum work, just report how much vacuum effort is needed in the
database.  "Vacuum effort" is measured as the total number of pages in
need of vacuum, being the sum of relpages of all tables and indexes
needing vacuum.  (Note: we weight heap pages the same as index pages.
Is this OK?)

Create a plan for vacuuming all those databases within the constraints
of max_workers.  Databases needing the most work are vacuumed first.
One worker per database.  Thus max_workers databases are being vacuumed
in parallel at this time.  When one database is finished, the launcher
starts a worker in the next database in the list.

When the plan is complete (i.e. the list is empty) we can do the whole
thing again, excluding the databases that are still being vacuumed.

Perhaps we should wait autovacuum_naptime seconds between finishing one
vacuum round in all databases and starting the next.  How do we measure
this: do we start sleeping when the last worker finishes, or when the
list is empty?
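
The planning step itself is simple; a sketch with made-up names and
numbers:

#include <stdio.h>
#include <stdlib.h>

/*
 * Sort databases by reported "vacuum effort" (total pages needing
 * vacuum) and hand out the first max_workers of them; the rest get a
 * worker as earlier ones finish.  Everything here is invented for
 * illustration.
 */
typedef struct
{
    const char *datname;
    long        effort_pages;
} DbEffort;

static int
by_effort_desc(const void *a, const void *b)
{
    long ea = ((const DbEffort *) a)->effort_pages;
    long eb = ((const DbEffort *) b)->effort_pages;

    return (eb > ea) - (eb < ea);
}

int
main(void)
{
    DbEffort dbs[] = {
        { "sales",  500000 },
        { "stats",   12000 },
        { "webapp", 230000 },
    };
    int ndbs = 3, max_workers = 2, i;

    qsort(dbs, ndbs, sizeof(DbEffort), by_effort_desc);

    for (i = 0; i < ndbs && i < max_workers; i++)
        printf("start worker in %s (%ld pages)\n",
               dbs[i].datname, dbs[i].effort_pages);
    return 0;
}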

Perhaps we should reserve a worker for vacuuming hot tables.  The launcher
then uses max_workers-1 workers for the above plan, and the spare worker
continuously connects to one database, vacuums its hot tables, and goes
away; the launcher then starts it again to connect to the next database.
Definitional problem: how do we decide what's a hot table?  One idea (the
simplest) is to let the DBA define it.

Thus, at most two workers are on any database: one of them is working on
normal tables, the other on hot tables.

(This idea can be complemented by having another GUC var,
autovacuum_hot_workers, which allows the DBA to have more than one
worker on hot tables (just for the case where there are too many hot
tables).  This may be overkill.)


Ron Mayer expressed the thought that we're needlessly complicating the
UI for vacuum_delay, naptime, etc.  He proposes that instead of having
cost_delay etc, we have a mbytes_per_second parameter of some sort.
This strikes me as a good idea, but I think we could do that after this
proposal is implemented.  So this "take 2" could be implemented, and
then we could switch the cost_delay stuff to using a MB/s kind of
measurement somehow (he says waving his hands wildly).

Greg Stark and Matthew O'Connor say that we're misdirected in having
more than one worker per tablespace.  I say we're not :-)  If we
consider Ron Mayer's idea of measuring MB/s, but we do it per
tablespace, then we would inflict the correct amount of vacuum pain to
each tablespace, sleeping as appropriate.  I think this would require
workers of different databases to communicate what tablespaces they are
using, so that all of them can utilize the correct amount of bandwidth.


I'd like to know if this responds to the mentioned people's objections.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: autovacuum next steps, take 2

From
"Matthew T. O'Connor"
Date:
Alvaro Herrera wrote:
> Ok, scratch that :-)  Another round of braindumping below.

I still think this is a solution in search of a problem.  The main problem 
we have right now is that hot tables can be starved from vacuum.  Most 
of this proposal doesn't touch that.  I would like to see that problem 
solved first, then we can talk about adding multiple workers per 
database or per tablespace etc...

> (This idea can be complemented by having another GUC var,
> autovacuum_hot_workers, which allows the DBA to have more than one
> worker on hot tables (just for the case where there are too many hot
> tables).  This may be overkill.)

I think this is more along the lines of what we need first.

> Ron Mayer expressed the thought that we're complicating needlessly the
> UI for vacuum_delay, naptime, etc.  He proposes that instead of having
> cost_delay etc, we have a mbytes_per_second parameter of some sort.
> This strikes me a good idea, but I think we could make that after this
> proposal is implemented.  So this "take 2" could be implemented, and
> then we could switch the cost_delay stuff to using a MB/s kind of
> measurement somehow (he says waving his hands wildly).

Agree this is probably a good idea in the long run, but I agree this is 
lower on the priority list and should come next.

> Greg Stark and Matthew O'Connor say that we're misdirected in having
> more than one worker per tablespace.  I say we're not :-)  If we
> consider Ron Mayer's idea of measuring MB/s, but we do it per
> tablespace, then we would inflict the correct amount of vacuum pain to
> each tablespace, sleeping as appropriate.  I think this would require
> workers of different databases to communicate what tablespaces they are
> using, so that all of them can utilize the correct amount of bandwidth.

I agree that in the long run it might be better to have multiple workers 
with MB/s throttle and tablespace aware, but we don't have any of that 
infrastructure right now.  I think the piece of low-hanging fruit that 
your launcher concept can solve is the hot table starvation.

My Proposal:  If we require admins to identify hot tables, then: 
1) Launcher fires off a worker1 into database X.
2) worker1 deals with "hot" tables first, then regular tables.
3) Launcher continues to launch workers to DB X every autovac naptime. 
4) worker2 (or 3 or 4, etc.) checks whether it is alone in DB X; if so, it 
acts as worker1 did above.  If worker1 is still working in DB X, then 
worker2 looks for hot tables that are being starved because worker1 got 
busy.  If worker2 finds no hot tables that need work, then worker2 exits.

This seems a very simple solution (given your launcher work) that can 
solve the starvation problem.


Thoughts?


Re: autovacuum next steps, take 2

From
Gregory Stark
Date:
"Alvaro Herrera" <alvherre@commandprompt.com> writes:

> Greg Stark and Matthew O'Connor say that we're misdirected in having
> more than one worker per tablespace.  I say we're not :-)  

I did say that. But your comment about using a high cost_delay was fairly
convincing too. It would be a simpler design and I think you're right. As long
as you raise both cost_delay and cost_limit by enough, you should get pretty much
the same sequential I/O rate and not step on each other's toes too much.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: autovacuum next steps, take 2

From
"Joshua D. Drake"
Date:
> Ron Mayer expressed the thought that we're complicating needlessly the
> UI for vacuum_delay, naptime, etc.  He proposes that instead of having
> cost_delay etc, we have a mbytes_per_second parameter of some sort.
> This strikes me a good idea, but I think we could make that after this
> proposal is implemented.  So this "take 2" could be implemented, and
> then we could switch the cost_delay stuff to using a MB/s kind of
> measurement somehow (he says waving his hands wildly).

vacuum should be a process with the least amount of voodoo. If we can
just have vacuum_delay and vacuum_threshold, where threshold allows an
arbitrary setting of how much bandwidth we will allot to the process,
then that is a beyond wonderful thing.

It is easy to determine how much IO you have, and what you can spare.

Joshua D. Drake




> 
> Greg Stark and Matthew O'Connor say that we're misdirected in having
> more than one worker per tablespace.  I say we're not :-)  If we
> consider Ron Mayer's idea of measuring MB/s, but we do it per
> tablespace, then we would inflict the correct amount of vacuum pain to
> each tablespace, sleeping as appropriate.  I think this would require
> workers of different databases to communicate what tablespaces they are
> using, so that all of them can utilize the correct amount of bandwidth.
> 
> 
> I'd like to know if this responds to the mentioned people's objections.
> 


-- 
     === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
            http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/



Re: autovacuum next steps, take 2

From
"Zeugswetter Andreas ADI SD"
Date:
> vacuum should be a process with the least amount of voodoo.
> If we can just have vacuum_delay and vacuum_threshold, where
> threshold allows an arbitrary setting of how much bandwidth
> we will allot to the process, then that is a beyond wonderful thing.
>
> It is easy to determine how much IO you have, and what you can spare.

The tricky part is what metric to use. Imho "IO per second" would be
good.
In a typical DB scenario that is the IO bottleneck, not the Mb/s.

Andreas


Re: autovacuum next steps, take 2

From
"Jim C. Nasby"
Date:
On Wed, Feb 21, 2007 at 05:40:53PM -0500, Matthew T. O'Connor wrote:
> My Proposal:  If we require admins to identify hot tables tables, then: 
> 1) Launcher fires-off a worker1 into database X.
> 2) worker1 deals with "hot" tables first, then regular tables.
> 3) Launcher continues to launch workers to DB X every autovac naptime. 
> 4) worker2 (or 3 or 4 etc...) sees it is alone in DB X, if so it acts as 
> worker1 did above.  If worker1 is still working in DB X then worker2 
> looks for hot tables that are being starved because worker1 got busy. 
> If worker2 finds no hot tables that need work, then worker2 exits.

Rather than requiring people to manually identify hot tables, what if we
just prioritize based on table size? So if a second autovac process hits
a specific database, it would find the smallest table in need of
vacuuming that it should be able to complete before the next naptime and
vacuum that. It could even continue picking tables until it can't find
one that it could finish within the naptime. Granted, it would have to
make some assumptions about how many pages it would dirty.

ISTM that's a lot easier than forcing admins to mark specific tables.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: autovacuum next steps, take 2

From
"Matthew T. O'Connor"
Date:
Jim C. Nasby wrote:
> On Wed, Feb 21, 2007 at 05:40:53PM -0500, Matthew T. O'Connor wrote:
>   
>> My Proposal:  If we require admins to identify hot tables tables, then: 
>> 1) Launcher fires-off a worker1 into database X.
>> 2) worker1 deals with "hot" tables first, then regular tables.
>> 3) Launcher continues to launch workers to DB X every autovac naptime. 
>> 4) worker2 (or 3 or 4 etc...) sees it is alone in DB X, if so it acts as 
>> worker1 did above.  If worker1 is still working in DB X then worker2 
>> looks for hot tables that are being starved because worker1 got busy. 
>> If worker2 finds no hot tables that need work, then worker2 exits.
>>     
>
> Rather than required people to manually identify hot tables, what if we
> just prioritize based on table size? So if a second autovac process hits
> a specific database, it would find the smallest table in need of
> vacuuming that it should be able to complete before the next naptime and
> vacuum that. It could even continue picking tables until it can't find
> one that it could finish within the naptime. Granted, it would have to
> make some assumptions about how many pages it would dirty.
>
> ISTM that's a lot easier than forcing admins to mark specific tables.

So the heuristic would be:
* Launcher fires off workers into a database at a given interval 
(perhaps configurable?)
* Each worker works on tables in size order. 
* If a worker ever catches up to an older worker, then the younger 
worker exits.

This sounds simple and workable to me, perhaps we can later modify this 
to include some max_workers variable so that a worker would only exit if 
it catches an older worker and there are max_workers currently active.


Thoughts?


Re: autovacuum next steps, take 2

From
"Jim C. Nasby"
Date:
On Thu, Feb 22, 2007 at 09:32:57AM -0500, Matthew T. O'Connor wrote:
> Jim C. Nasby wrote:
> >On Wed, Feb 21, 2007 at 05:40:53PM -0500, Matthew T. O'Connor wrote:
> >  
> >>My Proposal:  If we require admins to identify hot tables tables, then: 
> >>1) Launcher fires-off a worker1 into database X.
> >>2) worker1 deals with "hot" tables first, then regular tables.
> >>3) Launcher continues to launch workers to DB X every autovac naptime. 
> >>4) worker2 (or 3 or 4 etc...) sees it is alone in DB X, if so it acts as 
> >>worker1 did above.  If worker1 is still working in DB X then worker2 
> >>looks for hot tables that are being starved because worker1 got busy. 
> >>If worker2 finds no hot tables that need work, then worker2 exits.
> >>    
> >
> >Rather than required people to manually identify hot tables, what if we
> >just prioritize based on table size? So if a second autovac process hits
> >a specific database, it would find the smallest table in need of
> >vacuuming that it should be able to complete before the next naptime and
> >vacuum that. It could even continue picking tables until it can't find
> >one that it could finish within the naptime. Granted, it would have to
> >make some assumptions about how many pages it would dirty.
> >
> >ISTM that's a lot easier than forcing admins to mark specific tables.
> 
> So the heuristic would be:
> * Launcher fires off workers into a database at a given interval 
> (perhaps configurable?)
> * Each worker works on tables in size order. 
> * If a worker ever catches up to an older worker, then the younger 
> worker exits.
> 
> This sounds simple and workable to me, perhaps we can later modify this 
> to include some max_workers variable so that a worker would only exit if 
> it catches an older worker and there are max_workers currently active.

That would likely result in a number of workers running in one database,
unless you limited how many workers per database. And if you did that,
you wouldn't be addressing the frequently updated table problem.

A second vacuum in a database *must* exit after a fairly short time so
that we can go back in and vacuum the important tables again (well or
the 2nd vacuum has to periodically re-evaluate what tables need to be
vacuumed).
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: autovacuum next steps, take 2

From
"Jim C. Nasby"
Date:
On Thu, Feb 22, 2007 at 09:35:45AM +0100, Zeugswetter Andreas ADI SD wrote:
> 
> > vacuum should be a process with the least amount of voodoo. 
> > If we can just have vacuum_delay and vacuum_threshold, where 
> > threshold allows an arbitrary setting of how much bandwidth 
> > we will allot to the process, then that is a beyond wonderful thing.
> > 
> > It is easy to determine how much IO you have, and what you can spare.
> 
> The tricky part is what metric to use. Imho "IO per second" would be
> good.
> In a typical DB scenario that is the IO bottleneck, not the Mb/s.

Well, right now they're one in the same... but yeah, IO/sec probably
does make more sense.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: autovacuum next steps, take 2

From
"Zeugswetter Andreas ADI SD"
Date:
> > > vacuum should be a process with the least amount of voodoo.
> > > If we can just have vacuum_delay and vacuum_threshold, where
> > > threshold allows an arbitrary setting of how much bandwidth we
will
> > > allot to the process, then that is a beyond wonderful thing.
> > >
> > > It is easy to determine how much IO you have, and what
> you can spare.
> >
> > The tricky part is what metric to use. Imho "IO per second"
> would be
> > good.
> > In a typical DB scenario that is the IO bottleneck, not the Mb/s.
>
> Well, right now they're one in the same... but yeah, IO/sec
> probably does make more sense.

Hopefully not :-) Else you have no readahead. And that is imho the
problem.
You need to anticipate how many physical IO's your logical IO's cause.
And this is near impossible unless we group IO's in pg itself.

Andreas


Re: autovacuum next steps, take 2

From
"Matthew T. O'Connor"
Date:
Jim C. Nasby wrote:
> On Thu, Feb 22, 2007 at 09:32:57AM -0500, Matthew T. O'Connor wrote:
>   
>> So the heuristic would be:
>> * Launcher fires off workers into a database at a given interval 
>> (perhaps configurable?)
>> * Each worker works on tables in size order. 
>> * If a worker ever catches up to an older worker, then the younger 
>> worker exits.
>>
>> This sounds simple and workable to me, perhaps we can later modify this 
>> to include some max_workers variable so that a worker would only exit if 
>> it catches an older worker and there are max_workers currently active.
>>     
>
> That would likely result in a number of workers running in one database,
> unless you limited how many workers per database. And if you did that,
> you wouldn't be addressing the frequently update table problem.
>
> A second vacuum in a database *must* exit after a fairly short time so
> that we can go back in and vacuum the important tables again (well or
> the 2nd vacuum has to periodically re-evaluate what tables need to be
> vacuumed).
>   

I'm not sure this is a great idea, but I don't see how this would result 
in large numbers of workers working in one database.   If workers work 
on tables in size order, and exit as soon as they catch up to an older 
worker, I don't see the problem.  Newer workers are going to catch up to 
older workers pretty quickly, since small tables will vacuum fairly quickly.



Re: autovacuum next steps, take 2

From
"Jim C. Nasby"
Date:
On Thu, Feb 22, 2007 at 10:32:44PM -0500, Matthew T. O'Connor wrote:
> Jim C. Nasby wrote:
> >On Thu, Feb 22, 2007 at 09:32:57AM -0500, Matthew T. O'Connor wrote:
> >  
> >>So the heuristic would be:
> >>* Launcher fires off workers into a database at a given interval 
> >>(perhaps configurable?)
> >>* Each worker works on tables in size order. 
> >>* If a worker ever catches up to an older worker, then the younger 
> >>worker exits.
> >>
> >>This sounds simple and workable to me, perhaps we can later modify this 
> >>to include some max_workers variable so that a worker would only exit if 
> >>it catches an older worker and there are max_workers currently active.
> >>    
> >
> >That would likely result in a number of workers running in one database,
> >unless you limited how many workers per database. And if you did that,
> >you wouldn't be addressing the frequently update table problem.
> >
> >A second vacuum in a database *must* exit after a fairly short time so
> >that we can go back in and vacuum the important tables again (well or
> >the 2nd vacuum has to periodically re-evaluate what tables need to be
> >vacuumed).
> >  
> 
> I'm not sure this is a great idea, but I don't see how this would result 
> in large numbers of workers working in one database.   If workers work 
> on tables in size order, and exit as soon as they catch up to an older 
> worker, I don't see the problem.  Newer works are going to catch-up to 
> older workers pretty quickly since small tables will vacuum fairly quickly.

The reason that won't necessarily happen is because you can get large
tables popping up as needing vacuuming at any time.

Round 1:
Fire up autovac worker; starts working and soon hits 100G table

Round 2:
Another worker starts. Since round 1, a 98G table now needs vacuuming,
which this worker soon hits.

Round 3:
89G table now needs vacuuming. Worker 3 starts up and soon hits it.

So now we have 3 workers, all hammering away in the same database, and
likely causing a lot of random IO.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: autovacuum next steps, take 2

From
Alvaro Herrera
Date:
Jim C. Nasby wrote:
> On Thu, Feb 22, 2007 at 10:32:44PM -0500, Matthew T. O'Connor wrote:

> > I'm not sure this is a great idea, but I don't see how this would result 
> > in large numbers of workers working in one database.   If workers work 
> > on tables in size order, and exit as soon as they catch up to an older 
> > worker, I don't see the problem.  Newer works are going to catch-up to 
> > older workers pretty quickly since small tables will vacuum fairly quickly.
> 
> The reason that won't necessarily happen is because you can get large
> tables popping up as needing vacuuming at any time.

Right.

We know that a table that needs frequent vacuum necessarily has to be
small -- so maybe have the second worker exit when it catches up with
the first, or when the next table is above 1 GB, whichever happens
first.  That way, only the first worker can be processing the huge
tables.  The problem with this is that if one of your hot tables grows
a bit larger than 1 GB, you suddenly have a change in autovacuuming
behavior, for no really good reason.

And while your second worker is processing the tables in the hundreds-MB
range, your high-update 2 MB tables are neglected :-(

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: autovacuum next steps, take 2

From
"Jim C. Nasby"
Date:
On Fri, Feb 23, 2007 at 01:22:17PM -0300, Alvaro Herrera wrote:
> Jim C. Nasby wrote:
> > On Thu, Feb 22, 2007 at 10:32:44PM -0500, Matthew T. O'Connor wrote:
> 
> > > I'm not sure this is a great idea, but I don't see how this would result 
> > > in large numbers of workers working in one database.   If workers work 
> > > on tables in size order, and exit as soon as they catch up to an older 
> > > worker, I don't see the problem.  Newer works are going to catch-up to 
> > > older workers pretty quickly since small tables will vacuum fairly quickly.
> > 
> > The reason that won't necessarily happen is because you can get large
> > tables popping up as needing vacuuming at any time.
> 
> Right.
> 
> We know that a table that needs frequent vacuum necessarily has to be
> small -- so maybe have the second worker exit when it catches up with
> the first, or when the next table is above 1 GB, whichever happens
> first.  That way, only the first worker can be processing the huge
> tables.  The problem with this is that if one of your hot tables grows
> a bit larger than 1 GB, you suddenly have a change in autovacuuming
> behavior, for no really good reason.
> 
> And while your second worker is processing the tables in the hundreds-MB
> range, your high-update 2 MB tables are neglected :-(

That's why I'm thinking it would be best to keep the maximum size of
stuff for the second worker small. It probably also makes sense to tie
it to time and not size, since the key factor is that you want it to hit
the high-update tables every X number of seconds.

If we wanted to get fancy, we could factor in how far over the vacuum
threshold a table is, so even if the table is on the larger side, if
it's way over the threshold the second vacuum will hit it.

You know, maybe the best way to handle this is to force both vacuums to
exit after a certain amount of time, probably with a longer time limit
for the first vacuum in a database. That would mean that after
processing a large table for 10 minutes, the first vacuum would
exit/re-evaluate what work needs to be done. That would mean
medium-sized tables wouldn't get completely starved.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: autovacuum next steps, take 2

From
Alvaro Herrera
Date:
Jim C. Nasby wrote:

> That's why I'm thinking it would be best to keep the maximum size of
> stuff for the second worker small. It probably also makes sense to tie
> it to time and not size, since the key factor is that you want it to hit
> the high-update tables every X number of seconds.
> 
> If we wanted to get fancy, we could factor in how far over the vacuum
> threshold a table is, so even if the table is on the larger size, if
> it's way over the threshold the second vacuum will hit it.

Ok, I think we may be actually getting somewhere.

I propose to have two different algorithms for choosing the tables to
work on.  The worker would behave differently, depending on whether
there is one or more workers on the database already or not.

The first algorithm is the plain threshold equation stuff we use today.
If a worker connects and determines that no other worker is in the
database, it uses the "plain worker" mode.  A worker in this mode would
examine pgstats, determine what tables to vacuum/analyze, sort them by
size (smaller to larger), and go about its work.  This kind of worker 
can take a long time to vacuum the whole database -- we don't impose any
time limit or table size limit to what it can do.

The second mode is the "hot table worker" mode, enabled when the worker
detects that there's already a worker in the database.  In this mode,
the worker is limited to those tables that can be vacuumed in less than
autovacuum_naptime, so large tables are not considered.  Because of
this, it'll generally not compete with the first mode above -- the
tables in plain worker were sorted by size, so the small tables were
among the first vacuumed by the plain worker.  The estimated time to
vacuum may be calculated according to autovacuum_vacuum_delay settings,
assuming that all pages constitute cache misses.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: autovacuum next steps, take 2

From
"Matthew T. O'Connor"
Date:
Alvaro Herrera wrote:
> Jim C. Nasby wrote:
> 
>> That's why I'm thinking it would be best to keep the maximum size of
>> stuff for the second worker small. It probably also makes sense to tie
>> it to time and not size, since the key factor is that you want it to hit
>> the high-update tables every X number of seconds.
>>
>> If we wanted to get fancy, we could factor in how far over the vacuum
>> threshold a table is, so even if the table is on the larger size, if
>> it's way over the threshold the second vacuum will hit it.
> 
> Ok, I think we may be actually getting somewhere.

Me too.

> I propose to have two different algorithms for choosing the tables to
> work on.  The worker would behave differently, depending on whether
> there is one or more workers on the database already or not.
> 
> The first algorithm is the plain threshold equation stuff we use today.
> If a worker connects and determines that no other worker is in the
> database, it uses the "plain worker" mode.  A worker in this mode would
> examine pgstats, determine what tables to vacuum/analyze, sort them by
> size (smaller to larger), and goes about its work.  This kind of worker
> can take a long time to vacuum the whole database -- we don't impose any
> time limit or table size limit to what it can do.

Right, I like this.

> The second mode is the "hot table worker" mode, enabled when the worker
> detects that there's already a worker in the database.  In this mode,
> the worker is limited to those tables that can be vacuumed in less than
> autovacuum_naptime, so large tables are not considered.  Because of
> this, it'll generally not compete with the first mode above -- the
> tables in plain worker were sorted by size, so the small tables were
> among the first vacuumed by the plain worker.  The estimated time to
> vacuum may be calculated according to autovacuum_vacuum_delay settings,
> assuming that all pages constitute cache misses.

How can you determine what tables can be vacuumed within 
autovacuum_naptime?  I agree that large tables should be excluded, but I 
don't know how we can do that calculation based on autovacuum_naptime.

So at:
t=0*autovacuum_naptime: worker1 gets started on DBX
t=1*autovacuum_naptime: worker2 gets started on DBX
    worker2 determines all tables that need to be vacuumed,
    worker2 excludes tables that are too big from its to-do list,
    worker2 gets started working,
    worker2 exits when it either:
        a) Finishes its entire to-do list.
        b) Catches up to worker1

I think the questions are 1) What is the exact math you are planning on 
using to determine which tables are too big?  2) Do we want worker2 to 
exit when it catches worker1, or does the fact that we have excluded 
tables that are "too big" mean that we don't have to worry about this?



Re: autovacuum next steps, take 2

From
Alvaro Herrera
Date:
Matthew T. O'Connor wrote:
> Alvaro Herrera wrote:

> >The second mode is the "hot table worker" mode, enabled when the worker
> >detects that there's already a worker in the database.  In this mode,
> >the worker is limited to those tables that can be vacuumed in less than
> >autovacuum_naptime, so large tables are not considered.  Because of
> >this, it'll generally not compete with the first mode above -- the
> >tables in plain worker were sorted by size, so the small tables were
> >among the first vacuumed by the plain worker.  The estimated time to
> >vacuum may be calculated according to autovacuum_vacuum_delay settings,
> >assuming that all pages constitute cache misses.
> 
> How can you determine what tables can be vacuumed within 
> autovacuum_naptime?

My assumption is that
pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to vacuum

This is of course not the reality, because the delay is not how long
it takes to fetch the pages.  But it lets us have a value with which we
can do something.  With the default values, vacuum_cost_delay=10,
vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables
of under 600 pages, 4800 kB (should we include indexes here in the
relpages count?  My guess is no).
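
In toy form, the cutoff computation is just the following (same
assumptions as above, i.e. every page is a cache miss and each page's
cost_page_miss points each cost us cost_delay milliseconds, which is not
how the cost accounting really works):

#include <stdio.h>

int
main(void)
{
    int  vacuum_cost_delay     = 10;        /* ms */
    int  vacuum_cost_page_miss = 10;
    int  autovacuum_naptime_ms = 60 * 1000;

    /* largest relpages a "hot table" worker would consider */
    long max_pages = autovacuum_naptime_ms /
                     (vacuum_cost_delay * vacuum_cost_page_miss);

    printf("consider tables of up to %ld pages (%ld kB)\n",
           max_pages, max_pages * 8);       /* 8 kB pages */
    return 0;
}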

A table over 600 pages does not sound like a good candidate for hot, so
this seems more or less reasonable to me.  On the other hand, maybe we
shouldn't tie this to the vacuum cost delay stuff.

> So at:
> t=0*autovacuume_naptime: worker1 gets started on DBX
> t=1*autovacuume_naptime: worker2 gets started on DBX
>     worker2 determines all tables that need to be vacuumed,
>     worker2 excludes tables that are too big from it's to-do list,
>     worker2 gets started working,
>     worker2 exits when it either:
>         a) Finishes it's entire to-do-list.
>         b) Catches up to worker1
> 
> I think the questions are 1) What is the exact math you are planning on 
> using to determine which tables are too big?  2) Do we want worker2 to 
> exit when it catches worker1 or does the fact that we have excluded 
> tables that re "too big" mean that we don't have to worry about this?

Right, I think the fact that we excluded big tables means that this
won't be a problem most of the time, but we'll need some sort of
protection anyway.  I think this is easy to achieve -- store the table
each worker is currently processing in shared memory, and have all
workers check all other workers.  If a plain worker finds that another
worker is processing the table already, it skips that table and
continues with the next one.  A hot table worker instead exits right
away (caught up).
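
In sketch form (invented names, and ignoring locking around the shared
array):

#include <stdbool.h>
#include <stdio.h>

/*
 * The array stands in for a shared-memory table of (worker slot,
 * current relation OID) entries; 0 means idle.
 */
#define MAX_WORKERS 4

static unsigned int current_table[MAX_WORKERS];

static bool
someone_else_is_vacuuming(int my_slot, unsigned int table_oid)
{
    int i;

    for (i = 0; i < MAX_WORKERS; i++)
        if (i != my_slot && current_table[i] == table_oid)
            return true;
    return false;
}

int
main(void)
{
    bool         am_hot_table_worker = true;
    unsigned int next_oid = 16384;
    int          my_slot = 1;

    current_table[0] = 16384;   /* the plain worker got there first */

    if (someone_else_is_vacuuming(my_slot, next_oid))
    {
        if (am_hot_table_worker)
            printf("caught up with the plain worker; exiting\n");
        else
            printf("table already being vacuumed; skipping to the next one\n");
    }
    return 0;
}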

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: autovacuum next steps, take 2

From
"Matthew T. O'Connor"
Date:
Alvaro Herrera wrote:
> Matthew T. O'Connor wrote:
>> How can you determine what tables can be vacuumed within 
>> autovacuum_naptime?
> 
> My assumption is that
> pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to vacuum
> 
> This is of course not the reality, because the delay is not how long
> it takes to fetch the pages.  But it lets us have a value with which we
> can do something.  With the default values, vacuum_cost_delay=10,
> vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables
> of under 600 pages, 4800 kB (should we include indexes here in the
> relpages count?  My guess is no).

I'm not sure how pg_class.relpages is maintained, but what happens to a 
bloated table?  For example, take a 100-row table that is constantly 
updated and hasn't been vacuumed in a while (say the admin disabled 
autovacuum for a while), so that the small 100-row table now has 1000 
pages in it, most of which are just bloat.  Will we miss this table?  
Perhaps basing this on reltuples would be better?

> A table over 600 pages does not sound like a good candidate for hot, so
> this seems more or less reasonable to me.  On the other hand, maybe we
> shouldn't tie this to the vacuum cost delay stuff.

I'm not sure it's a good idea to tie this to the vacuum cost delay 
settings either, so let me ask you this: how is this better than just 
allowing the admin to set a new GUC variable like 
autovacuum_hot_table_size_threshold (or something shorter), which we can 
assign a decent default of, say, 8MB?

Thoughts?


Re: autovacuum next steps, take 2

From
Alvaro Herrera
Date:
Matthew T. O'Connor wrote:
> Alvaro Herrera wrote:
> >Matthew T. O'Connor wrote:
> >>How can you determine what tables can be vacuumed within 
> >>autovacuum_naptime?
> >
> >My assumption is that
> >pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to 
> >vacuum
> >
> >This is of course not the reality, because the delay is not how long
> >it takes to fetch the pages.  But it lets us have a value with which we
> >can do something.  With the default values, vacuum_cost_delay=10,
> >vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables
> >of under 600 pages, 4800 kB (should we include indexes here in the
> >relpages count?  My guess is no).
> 
> I'm not sure how pg_class.relpages is maintained but what happens to a 
> bloated table?  For example, a 100 row table that is constantly updated 
> and hasn't been vacuumed in a while (say the admin disabled autovacuum 
> for a while), now that small 100 row table has 1000 pages in it most of 
> which are just bloat, will we miss this table?  Perhaps basing this on 
> reltuples would be better?

Well, this would only happen the first time, until the plain worker
processed the table; next time it would be picked up by the hot table
worker.  But yeah, we can build a better estimate using the same trick
the planner uses: estimate the tuple density as reltuples/relpages, and
multiply it by the actual number of blocks on disk.
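
For what it's worth, the arithmetic of that trick is just the following
(toy numbers):

#include <stdio.h>

int
main(void)
{
    double reltuples      = 100.0;    /* as of the last vacuum/analyze */
    double relpages       = 5.0;
    double current_blocks = 1000.0;   /* the table's size on disk now  */

    double density    = reltuples / relpages;    /* tuples per page     */
    double est_tuples = density * current_blocks;

    printf("estimated tuples now: %.0f\n", est_tuples);
    return 0;
}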

> >A table over 600 pages does not sound like a good candidate for hot, so
> >this seems more or less reasonable to me.  On the other hand, maybe we
> >shouldn't tie this to the vacuum cost delay stuff.
> 
> I'm not sure it's a good idea to tie this to the vacuum cost delay 
> settings either, so let me as you this, how is this better than just 
> allowing the admin to set a new GUC variable like 
> autovacuum_hot_table_size_threshold  (or something shorter) which we can 
> assign a decent default of say 8MB.

Yeah, maybe that's better -- it's certainly simpler.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: autovacuum next steps, take 2

From
"Matthew T. O'Connor"
Date:
Alvaro Herrera wrote:
> Matthew T. O'Connor wrote:
>> I'm not sure how pg_class.relpages is maintained but what happens to a 
>> bloated table?  For example, a 100 row table that is constantly updated 
>> and hasn't been vacuumed in a while (say the admin disabled autovacuum 
>> for a while), now that small 100 row table has 1000 pages in it most of 
>> which are just bloat, will we miss this table?  Perhaps basing this on 
>> reltuples would be better?
> 
> Well, this would only happen the first time, until the plain worker
> processed the table; next time it would be picked up by the hot table
> worker.  But yeah, we can build a better estimate using the same trick
> the planner uses: estimate tuple density as reltuples/relpages times the
> actual number of blocks on disk.

Well, even skipping it the first time isn't good; anything that further 
delays a hot table from getting vacuumed is bad.  Also, I'm not sure it 
would just be the first time, since plain VACUUM isn't going to reclaim 
most of the space, just mark it as reusable.  This is moot, however, if we 
use a good metric; I have no idea if what you suggest above would be 
good enough.

>>> A table over 600 pages does not sound like a good candidate for hot, so
>>> this seems more or less reasonable to me.  On the other hand, maybe we
>>> shouldn't tie this to the vacuum cost delay stuff.
>> I'm not sure it's a good idea to tie this to the vacuum cost delay 
>> settings either, so let me as you this, how is this better than just 
>> allowing the admin to set a new GUC variable like 
>> autovacuum_hot_table_size_threshold  (or something shorter) which we can 
>> assign a decent default of say 8MB.
> 
> Yeah, maybe that's better -- it's certainly simpler.

Simple is better, at least until proven otherwise.


Re: autovacuum next steps, take 2

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Matthew T. O'Connor wrote:
>> I'm not sure it's a good idea to tie this to the vacuum cost delay 
>> settings either, so let me as you this, how is this better than just 
>> allowing the admin to set a new GUC variable like 
>> autovacuum_hot_table_size_threshold  (or something shorter) which we can 
>> assign a decent default of say 8MB.

> Yeah, maybe that's better -- it's certainly simpler.

I'm not liking any of these very much, as they seem critically dependent
on impossible-to-tune parameters.  I think it'd be better to design this
around having the first worker explicitly expose its state (list of
tables to process, in order) and having subsequent workers key off that
info.  The shared memory state could include the OID of the table each
worker is currently working on, and we could keep the to-do list in some
simple flat file for instance (since we don't care about crash safety).

I'm not certain exactly what "key off" needs to mean; perhaps each
worker should make its own to-do list and then discard items that are
either in-progress or recently done by another worker when it gets to
them.

I think an absolute minimum requirement for a sane design is that no two
workers ever try to vacuum the same table concurrently, and I don't see
where that behavior will emerge from your proposal; whereas it's fairly
easy to make it happen if non-first workers pay attention to what other
workers are doing.

BTW, it's probably necessary to treat shared catalogs specially ...
        regards, tom lane


Re: autovacuum next steps, take 2

From
"Matthew T. O'Connor"
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Matthew T. O'Connor wrote:
>>> I'm not sure it's a good idea to tie this to the vacuum cost delay 
>>> settings either, so let me as you this, how is this better than just 
>>> allowing the admin to set a new GUC variable like 
>>> autovacuum_hot_table_size_threshold  (or something shorter) which we can 
>>> assign a decent default of say 8MB.
> 
>> Yeah, maybe that's better -- it's certainly simpler.
> 
> I'm not liking any of these very much, as they seem critically dependent
> on impossible-to-tune parameters.  I think it'd be better to design this
> around having the first worker explicitly expose its state (list of
> tables to process, in order) and having subsequent workers key off that
> info.  The shared memory state could include the OID of the table each
> worker is currently working on, and we could keep the to-do list in some
> simple flat file for instance (since we don't care about crash safety).

So far we are only talking about one parameter, the 
hot_table_size_threshold, which I agree would be a guess by an admin, 
but if we went in this direction, I would also advocate adding a column 
to the pg_autovacuum table that allows an admin to explicitly define a 
table as hot or not.

Also, I think each worker should be mostly independent, the only caveat 
being that (assuming each worker works in size order) if we catch up to 
an older worker (get to the table they are currently working on), we 
exit.  Personally I think this is all we need, but others felt the 
additional threshold was needed.  What do you think?  Or what do you 
think might be better?

> I'm not certain exactly what "key off" needs to mean; perhaps each
> worker should make its own to-do list and then discard items that are
> either in-progress or recently done by another worker when it gets to
> them.

My initial design didn't have any threshold at all, but others felt this 
could result in too many workers working concurrently in the same DB.

> I think an absolute minimum requirement for a sane design is that no two
> workers ever try to vacuum the same table concurrently, and I don't see
> where that behavior will emerge from your proposal; whereas it's fairly
> easy to make it happen if non-first workers pay attention to what other
> workers are doing.

Maybe we never made that clear; I was always working on the assumption 
that two workers would never try to work on the same table at the same time.

> BTW, it's probably necessary to treat shared catalogs specially ...

Certainly.


Re: autovacuum next steps, take 2

From
"Jim C. Nasby"
Date:
On Mon, Feb 26, 2007 at 09:22:42PM -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Matthew T. O'Connor wrote:
> >> I'm not sure it's a good idea to tie this to the vacuum cost delay 
> >> settings either, so let me as you this, how is this better than just 
> >> allowing the admin to set a new GUC variable like 
> >> autovacuum_hot_table_size_threshold  (or something shorter) which we can 
> >> assign a decent default of say 8MB.
> 
> > Yeah, maybe that's better -- it's certainly simpler.
> 
> I'm not liking any of these very much, as they seem critically dependent
> on impossible-to-tune parameters.  I think it'd be better to design this
> around having the first worker explicitly expose its state (list of
> tables to process, in order) and having subsequent workers key off that
> info.  The shared memory state could include the OID of the table each
> worker is currently working on, and we could keep the to-do list in some
> simple flat file for instance (since we don't care about crash safety).
> 
> I'm not certain exactly what "key off" needs to mean; perhaps each
> worker should make its own to-do list and then discard items that are
> either in-progress or recently done by another worker when it gets to
> them.
The real problem is trying to set that up in such a fashion that keeps
hot tables frequently vacuumed; that means that the 2nd daemon in a
database either needs to avoid large tables that will take 'a
significant' length of time to vacuum, or you need to allow any number
of daemons in each database (which sounds like a good way to thrash the
machine).

> I think an absolute minimum requirement for a sane design is that no two
> workers ever try to vacuum the same table concurrently, and I don't see
> where that behavior will emerge from your proposal; whereas it's fairly
> easy to make it happen if non-first workers pay attention to what other
> workers are doing.

Isn't there a special lock acquired on a relation by vacuum? Can't we
just check for that? Seems much simpler than building out the ability
for daemons to see what each other is doing (and that still wouldn't
take manual vacuums into account).
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: autovacuum next steps, take 2

From
"Jim C. Nasby"
Date:
On Mon, Feb 26, 2007 at 08:11:44PM -0300, Alvaro Herrera wrote:
> Matthew T. O'Connor wrote:
> > Alvaro Herrera wrote:
> 
> > >The second mode is the "hot table worker" mode, enabled when the worker
> > >detects that there's already a worker in the database.  In this mode,
> > >the worker is limited to those tables that can be vacuumed in less than
> > >autovacuum_naptime, so large tables are not considered.  Because of
> > >this, it'll generally not compete with the first mode above -- the
> > >tables in plain worker were sorted by size, so the small tables were
> > >among the first vacuumed by the plain worker.  The estimated time to
> > >vacuum may be calculated according to autovacuum_vacuum_delay settings,
> > >assuming that all pages constitute cache misses.
> > 
> > How can you determine what tables can be vacuumed within 
> > autovacuum_naptime?
> 
> My assumption is that
> pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to vacuum

Need to take vacuum_cost_limit into account.
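
Roughly, and ignoring the actual I/O time (treating every page as a 
miss), the estimate would then look something like this sketch; the 
parameter names are just the relevant GUCs with their defaults:

    def estimated_vacuum_seconds(relpages,
                                 vacuum_cost_page_miss=10,
                                 vacuum_cost_limit=200,
                                 vacuum_cost_delay_ms=10):
        # The cost model sleeps for vacuum_cost_delay once the accumulated
        # cost reaches vacuum_cost_limit, so the limit determines how many
        # pages get processed between sleeps.
        pages_per_sleep = vacuum_cost_limit / float(vacuum_cost_page_miss)
        sleeps = relpages / pages_per_sleep
        return sleeps * vacuum_cost_delay_ms / 1000.0

    # "hot" candidate: estimated_vacuum_seconds(relpages) <= autovacuum_naptime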

The advantage to keying this to autovac_naptime is that it means we
don't need another GUC, but after I suggested that before I realized
that's probably not the best idea. For example, I've seen clusters that
are running dozens-hundreds of databases; in that environment you really
need to turn naptime way down (to like a second). In that case you
wouldn't want to key to naptime.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: autovacuum next steps, take 2

From
"Jim C. Nasby"
Date:
On Mon, Feb 26, 2007 at 06:23:22PM -0500, Matthew T. O'Connor wrote:
> Alvaro Herrera wrote:
> >Matthew T. O'Connor wrote:
> >>How can you determine what tables can be vacuumed within 
> >>autovacuum_naptime?
> >
> >My assumption is that
> >pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to 
> >vacuum
> >
> >This is of course not the reality, because the delay is not how long
> >it takes to fetch the pages.  But it lets us have a value with which we
> >can do something.  With the default values, vacuum_cost_delay=10,
> >vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables
> >of under 600 pages, 4800 kB (should we include indexes here in the
> >relpages count?  My guess is no).
> 
> I'm not sure how pg_class.relpages is maintained but what happens to a 
> bloated table?  For example, a 100 row table that is constantly updated 
> and hasn't been vacuumed in a while (say the admin disabled autovacuum 
> for a while), now that small 100 row table has 1000 pages in it most of 
> which are just bloat, will we miss this table?  Perhaps basing this on 
> reltuples would be better?

The entire point of this is to ensure that the second daemon will only
vacuum tables that it can finish very quickly. If you let a table bloat
so it's too big, then you just can't vacuum it very frequently without
risking all your other hot tables bloating because they're no longer
getting vacuumed.

The reality is that you can actually vacuum a pretty good-sized table in
60 seconds with typical cost-delay settings (ie: defaults except
cost_delay set to 10). That means you can do 9 pages ~100 times a
second, or 54k pages a minute. Even with a vacuum_cost_delay of 20,
that's still 27k pages per minute.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: autovacuum next steps, take 2

From
Alvaro Herrera
Date:
Jim C. Nasby wrote:

> The advantage to keying this to autovac_naptime is that it means we
> don't need another GUC, but after I suggested that before I realized
> that's probably not the best idea. For example, I've seen clusters that
> are running dozens-hundreds of databases; in that environment you really
> need to turn naptime way down (to like a second). In that case you
> wouldn't want to key to naptime.

Actually, I've been thinking that it would be a good idea to change the
semantics of autovacuum_naptime so that it means the average time to
start a worker in any given database.  That way, the time between
autovac runs is not dependent on the number of databases you have.
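
In sketch form, the launcher would spread its launches over the naptime 
(get_databases and launch_worker are just stand-ins here):

    import time

    def launcher_loop(get_databases, launch_worker, autovacuum_naptime=60.0):
        while True:
            databases = get_databases()   # e.g. ordered by time of last autovac run
            if not databases:
                time.sleep(autovacuum_naptime)
                continue
            # Spread the launches so that each database still gets a worker
            # about once per autovacuum_naptime, no matter how many
            # databases there are.
            interval = autovacuum_naptime / len(databases)
            for db in databases:
                launch_worker(db)
                time.sleep(interval)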

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: autovacuum next steps, take 2

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> On Mon, Feb 26, 2007 at 09:22:42PM -0500, Tom Lane wrote:
>> I'm not liking any of these very much, as they seem critically dependent
>> on impossible-to-tune parameters.  I think it'd be better to design this
>> around having the first worker explicitly expose its state (list of
>> tables to process, in order) and having subsequent workers key off that
>> info.

> The real problem is trying to set that up in such a fashion that keeps
> hot tables frequently vacuumed;

Certainly, but it's not clear where that behavior emerges from Alvaro's
or Matthew's proposals, either.

Are we assuming that no single worker instance will vacuum a given table
more than once?  (That's not a necessary assumption, certainly, but
without it there are so many degrees of freedom that I'm not sure how
it should act.)  Given that assumption, the maximum vacuuming rate for
any table is once per autovacuum_naptime, and most of the magic lies in
the launcher's algorithm for deciding which databases to launch workers
into.

I'm inclined to propose an even simpler algorithm in which every worker
acts alike; its behavior is
1. On startup, generate a to-do list of tables to process, sorted in
priority order.
2. For each table in the list, if the table is still around and has not
been vacuumed by someone else since you started (including the case of
a vacuum-in-progress), then vacuum it.

Detecting "already vacuumed since you started" is a bit tricky; you
can't really rely on the stats collector since its info isn't very
up-to-date.  That's why I was thinking of exposing the to-do lists
explicitly; comparing those with an advertised current-table would
allow accurate determination of what had just gotten done.
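
To make that concrete, here's a toy sketch of what I have in mind (the 
lists and indexes stand in for whatever shared-memory or flat-file 
representation we settle on):

    class WorkerState:
        def __init__(self, todo):
            self.todo = list(todo)   # this worker's to-do list, in priority order
            self.current = -1        # index of the table currently being vacuumed

    def already_handled(table, other_workers):
        # A table at or before another worker's current position has been
        # vacuumed (or is being vacuumed) since that worker built its list.
        for w in other_workers:
            if table in w.todo and w.todo.index(table) <= w.current:
                return True
        return False

    def run_worker(me, other_workers, vacuum_table):
        for i, table in enumerate(me.todo):
            if already_handled(table, other_workers):
                continue
            me.current = i           # advertise before starting
            vacuum_table(table)
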
        regards, tom lane


Re: autovacuum next steps, take 2

From
Tom Lane
Date:
[ oh, I forgot to respond to this: ]

"Jim C. Nasby" <jim@nasby.net> writes:
> Isn't there a special lock acquired on a relation by vacuum? Can't we
> just check for that?

I think you're thinking that ConditionalLockRelation solves the problem,
but it does not, because it will fail if someone has taken a (quasi)
exclusive lock unrelated to vacuuming.  You don't want an application
that frequently takes short-term ExclusiveLocks on a table to thereby
cause autovacuum to frequently skip that table.
        regards, tom lane


Re: autovacuum next steps, take 2

From
Alvaro Herrera
Date:
Tom Lane wrote:

> I think an absolute minimum requirement for a sane design is that no two
> workers ever try to vacuum the same table concurrently, and I don't see
> where that behavior will emerge from your proposal; whereas it's fairly
> easy to make it happen if non-first workers pay attention to what other
> workers are doing.

FWIW, I've always considered this to be a very important and obvious
issue, though I think I've neglected to mention it (or mentioned it too
few times).  But I think this is pretty easy to do: just have each worker
advertise the current table it's working on in shared memory, and add a
recheck loop to the table-pick algorithm (with appropriate grabs of the
autovacuum lwlock), to make sure no one else starts to vacuum the same
table you're going to process at the same time.
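
As a sketch, with a threading.Lock and a dict standing in for the LWLock
and the shared memory area:

    import threading

    autovacuum_lock = threading.Lock()   # stand-in for the autovacuum LWLock
    current_table_of = {}                # worker id -> table OID being vacuumed

    def try_claim(worker_id, table_oid):
        # Recheck and advertise atomically: if another worker already
        # claimed this table, give up and let the caller pick another one.
        with autovacuum_lock:
            if table_oid in current_table_of.values():
                return False
            current_table_of[worker_id] = table_oid
            return True

    def release(worker_id):
        with autovacuum_lock:
            current_table_of.pop(worker_id, None)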

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: autovacuum next steps, take 2

From
"Matthew T. O'Connor"
Date:
Jim C. Nasby wrote:
> On Mon, Feb 26, 2007 at 06:23:22PM -0500, Matthew T. O'Connor wrote:
>> I'm not sure how pg_class.relpages is maintained but what happens to a 
>> bloated table?  For example, a 100 row table that is constantly updated 
>> and hasn't been vacuumed in a while (say the admin disabled autovacuum 
>> for a while), now that small 100 row table has 1000 pages in it most of 
>> which are just bloat, will we miss this table?  Perhaps basing this on 
>> reltuples would be better?
> 
> The entire point of this is to ensure that the second daemon will only
> vacuum tables that it can finish very quickly. If you let a table bloat
> so it's too big, then you just can't vacuum it very frequently without
> risking all your other hot tables bloating because they're no longer
> getting vacuumed.
> 
> The reality is that you can actually vacuum a pretty good-sized table in
> 60 seconds with typical cost-delay settings (ie: defaults except
> cost_delay set to 10). That means you can do 9 pages ~100 times a
> second, or 54k pages a minute. Even with a vacuum_cost_delay of 20,
> that's still 27k pages per minute.

At the risk of sounding like a broken record, I still think the size 
limit threshold is unnecessary.  Since all workers will be working on 
tables in size order, younger workers will typically catch older workers 
fairly quickly, since the tables will be either small or recently 
vacuumed and not in need of work.  And since younger workers exit when 
they catch up to an older worker, there is some inherent stability in the 
number of workers.
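
Stated as a tiny sketch, the rule each worker applies before picking its
next table is just:

    def next_action(my_remaining_tables, tables_in_use_by_older_workers):
        # my_remaining_tables is sorted smallest to largest and contains
        # only tables that still need vacuuming.
        if not my_remaining_tables:
            return "exit"                      # nothing left to do
        table = my_remaining_tables[0]
        if table in tables_in_use_by_older_workers:
            return "exit"                      # caught up to an older worker
        return ("vacuum", table)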

Here is a worst case example: A DB with 6 tables all of which are highly 
active and will need to be vacuumed constantly.  While this is totally 
hypothetical, it is how I envision things working (without the threshold).

table1:        10 rows
table2:       100 rows
table3:     1,000 rows
table4:    10,000 rows
table5:   100,000 rows
table6: 1,000,000 rows

time=0*naptime: No workers in the DB

time=1*naptime: worker1 starts on table1

time=2*naptime: worker1 has finished table1,table2 and table3, it's now 
working on table4, worker2 starts on table1.

time=3*naptime: worker1 is on table5, worker2 is working on table4, 
worker3 starts on table1.

time=4*naptime: worker1 is still on table5, worker2 has caught up to 
worker1 and exits, worker3 also catches up to worker1 since tables 2-4 
didn't require vacuuming at this time so it exits, worker4 starts on table1

time=5*naptime: worker1 is working on table6, worker4 is up to table4, 
worker5 starts on table1

time=6*naptime: worker1 is working on table6, worker4 catches up to 
worker1 and exits, worker5 finds no additional work to be done and 
exits, worker6 starts at table1.

time=7*naptime: worker1 still working on table6, worker6 is up to 
table4, worker7 starts at table1.

time=8*naptime: worker1 still working on table6, worker6 still working 
on table4, worker7 working on table3, worker8 starting on table1.

time=9*naptime: worker1 still working on table6, worker6 working on 
table5, worker7 catches worker 6 and exits, worker8 finds nothing more 
to do and exits, worker9 starts on table1

time=10*naptime: worker1 still working on table6, worker9 working on 
table4, worker10 starts on table1.



Re: autovacuum next steps, take 2

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> I think an absolute minimum requirement for a sane design is that no two
>> workers ever try to vacuum the same table concurrently,

> FWIW, I've always considered this to be a very important and obvious
> issue, and I think I've neglected mentioning it (maybe I did too few
> times).  But I think this is pretty easy to do, just have each worker
> advertise the current table it's working on in shared memory, and add a
> recheck loop on the table-pick algorithm (with appropriate grabs of the
> autovacuum lwlock), to make sure no one starts to vacuum the same table
> you're going to process, at the same time.

Well, any of these proposals need that at the bottom level, to prevent
race conditions.  But I'd prefer a design that wasn't positively
encouraging multiple workers to try to pick the same table concurrently.
Not only is that wasteful, but it makes it harder to predict what is the
real behavior that emerges after race conditions are backed off from.

BTW, to what extent might this whole problem be simplified if we adopt
chunk-at-a-time vacuuming (compare current discussion with Galy Lee)?
If the unit of work has a reasonable upper bound regardless of table
size, maybe the problem of big tables starving small ones goes away.
        regards, tom lane


Re: autovacuum next steps, take 2

From
"Matthew T. O'Connor"
Date:
Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
>> The real problem is trying to set that up in such a fashion that keeps
>> hot tables frequently vacuumed;
> 
> Are we assuming that no single worker instance will vacuum a given table
> more than once?  (That's not a necessary assumption, certainly, but
> without it there are so many degrees of freedom that I'm not sure how
> it should act.)  Given that assumption, the maximum vacuuming rate for
> any table is once per autovacuum_naptime, and most of the magic lies in
> the launcher's algorithm for deciding which databases to launch workers
> into.

Yes, I have been working under the assumption that a worker goes through 
the list of tables once and exits, and yes, the maximum vacuuming rate 
for any table would be once per autovacuum_naptime.  We can lower the 
default if necessary; as far as I'm concerned it is (or should be) fairly 
cheap to fire off a worker and have it find that there isn't anything 
to do and exit.

> I'm inclined to propose an even simpler algorithm in which every worker
> acts alike; its behavior is
> 1. On startup, generate a to-do list of tables to process, sorted in
> priority order.
> 2. For each table in the list, if the table is still around and has not
> been vacuumed by someone else since you started (including the case of
> a vacuum-in-progress), then vacuum it.

That is what I'm proposing, except for one difference: when you catch up 
to an older worker, exit.  This has the benefit of reducing the number of 
workers concurrently working on big tables, which I think is a good thing.

> Detecting "already vacuumed since you started" is a bit tricky; you
> can't really rely on the stats collector since its info isn't very
> up-to-date.  That's why I was thinking of exposing the to-do lists
> explicitly; comparing those with an advertised current-table would
> allow accurate determination of what had just gotten done.

Sounds good, but I have very little insight into how we would implement 
"already vacuumed since you started" or "have I caught up to another 
worker".


Re: autovacuum next steps, take 2

From
"Matthew T. O'Connor"
Date:
Tom Lane wrote:
> BTW, to what extent might this whole problem be simplified if we adopt
> chunk-at-a-time vacuuming (compare current discussion with Galy Lee)?
> If the unit of work has a reasonable upper bound regardless of table
> size, maybe the problem of big tables starving small ones goes away.

So if we adopted chunk-at-a-time then perhaps each worker processes the 
list of tables in OID order (or some unique and stable order) and does 
one chunk per table that needs vacuuming.  This way an equal amount of 
bandwidth is given to all tables.
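
Very roughly, with vacuum_one_chunk being whatever bounded unit of work
we end up with:

    def worker_pass(tables_needing_vacuum, vacuum_one_chunk):
        # One bounded chunk of work per table per pass, in a stable order,
        # so every table that needs attention gets some bandwidth.
        for table in sorted(tables_needing_vacuum):   # e.g. OID order
            vacuum_one_chunk(table)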

That does sound simpler.  Is chunk-at-a-time a realistic option for 8.3?


Matt



Re: autovacuum next steps, take 2

From
Tom Lane
Date:
"Matthew T. O'Connor" <matthew@zeut.net> writes:
> Tom Lane wrote:
>> I'm inclined to propose an even simpler algorithm in which every worker
>> acts alike;

> That is what I'm proposing except for one difference, when you catch up 
> to an older worker, exit.

No, that's a bad idea, because it means that any large table starves
even-larger tables.

(Note: in all this I assume we're all using "size" as a shorthand for
some sort of priority metric that considers number of dirty tuples not
only size.  We don't want every worker insisting on passing over every
small read-only table every time, for instance.)
        regards, tom lane


Re: autovacuum next steps, take 2

From
Tom Lane
Date:
"Matthew T. O'Connor" <matthew@zeut.net> writes:
> That does sounds simpler. Is chunk-at-a-time a realistic option for 8.3?

It seems fairly trivial to me to have a scheme where you do one
fill-workmem-and-scan-indexes cycle per invocation, and store the
next-heap-page-to-scan in some handy place (new pg_class column updated
along with relpages/reltuples, likely).  Galy is off in left field with
some far more complex ideas :-( but I don't see that there's all that
much needed to support this behavior ... especially if we don't expose
it to the SQL level but only support it for autovac's use.  Then we're
not making any big commitment to support the behavior forever.
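
In rough pseudo-code, one invocation would look about like this (toy
sketch; scan_page, clean_indexes and clean_heap are stand-ins, and the
saved position would live in the new pg_class column rather than a dict):

    next_page_to_scan = {}    # table -> heap page at which to resume

    def vacuum_one_chunk(table, nblocks, scan_page, clean_indexes, clean_heap,
                         max_dead_tids=1000000):
        page = next_page_to_scan.get(table, 0)
        dead_tids = []
        # Fill the dead-TID array (bounded by maintenance_work_mem in
        # reality), then do exactly one index-cleaning pass and stop.
        while page < nblocks and len(dead_tids) < max_dead_tids:
            dead_tids.extend(scan_page(table, page))
            page += 1
        clean_indexes(table, dead_tids)
        clean_heap(table, dead_tids)
        next_page_to_scan[table] = page if page < nblocks else 0
        return page >= nblocks    # True once the whole heap has been covered
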
        regards, tom lane


Re: autovacuum next steps, take 2

From
"Matthew T. O'Connor"
Date:
Tom Lane wrote:
> "Matthew T. O'Connor" <matthew@zeut.net> writes:
>> Tom Lane wrote:
>>> I'm inclined to propose an even simpler algorithm in which every worker
>>> acts alike;
> 
>> That is what I'm proposing except for one difference, when you catch up 
>> to an older worker, exit.
> 
> No, that's a bad idea, because it means that any large table starves
> even-larger tables.

True, but the assumption I'm making is that there is a finite amount of 
bandwidth available, and more concurrent activity will have a net 
negative effect on the time it takes to vacuum all tables.  I'm willing to 
pay that price to prevent small hot tables from getting starved, but 
less willing to pay the same price for large tables, where the percentage 
of bloat will be much smaller.

> (Note: in all this I assume we're all using "size" as a shorthand for
> some sort of priority metric that considers number of dirty tuples not
> only size.  We don't want every worker insisting on passing over every
> small read-only table every time, for instance.)

I was using size to mean reltuples.  The whole concept of sorting by 
size was designed to ensure that smaller (more susceptible to bloat) 
tables got priority.  It might be useful for workers to sort their to-do 
lists by some other metric, but I don't have a clear vision of what that 
might be.



Re: autovacuum next steps, take 2

From
"Matthew T. O'Connor"
Date:
Tom Lane wrote:
> "Matthew T. O'Connor" <matthew@zeut.net> writes:
>> That does sounds simpler. Is chunk-at-a-time a realistic option for 8.3?
> 
> It seems fairly trivial to me to have a scheme where you do one
> fill-workmem-and-scan-indexes cycle per invocation, and store the
> next-heap-page-to-scan in some handy place (new pg_class column updated
> along with relpages/reltuples, likely).  Galy is off in left field with
> some far more complex ideas :-( but I don't see that there's all that
> much needed to support this behavior ... especially if we don't expose
> it to the SQL level but only support it for autovac's use.  Then we're
> not making any big commitment to support the behavior forever.

Well, if we can make it happen soon, it might be the best thing for 
autovacuum.


Re: autovacuum next steps, take 2

From
"Jim C. Nasby"
Date:
On Mon, Feb 26, 2007 at 10:48:49PM -0500, Tom Lane wrote:
> "Matthew T. O'Connor" <matthew@zeut.net> writes:
> > That does sounds simpler. Is chunk-at-a-time a realistic option for 8.3?
> 
> It seems fairly trivial to me to have a scheme where you do one
> fill-workmem-and-scan-indexes cycle per invocation, and store the
> next-heap-page-to-scan in some handy place (new pg_class column updated
> along with relpages/reltuples, likely).  Galy is off in left field with
> some far more complex ideas :-( but I don't see that there's all that
> much needed to support this behavior ... especially if we don't expose
> it to the SQL level but only support it for autovac's use.  Then we're
> not making any big commitment to support the behavior forever.

The problem I see there is that the case we're trying to fix is tables
that need to be vacuumed every few minutes. As I posted elsewhere, it's
reasonable to assume a vacuum rate of ~1000 pages/second for a small
table that's going to be in memory (assuming that vacuum dirties every
page). That means that you can only dirty about 60k pages per cycle,
which seems way too small to be practical unless we come up with a way to
avoid scanning the indexes on every cycle.

The proposal to save enough state to be able to resume a vacuum at
pretty much any point in its cycle might work; we'd have to benchmark
it.  With the default maintenance_work_mem of 128M it would mean writing
out 64M of state every minute on average, which is likely to take
several seconds to fsync (though, maybe we wouldn't need to fsync it...)
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: autovacuum next steps, take 2

From
"Jim C. Nasby"
Date:
On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote:
> Jim C. Nasby wrote:
> 
> > The advantage to keying this to autovac_naptime is that it means we
> > don't need another GUC, but after I suggested that before I realized
> > that's probably not the best idea. For example, I've seen clusters that
> > are running dozens-hundreds of databases; in that environment you really
> > need to turn naptime way down (to like a second). In that case you
> > wouldn't want to key to naptime.
> 
> Actually, I've been thinking that it would be a good idea to change the
> semantics of autovacuum_naptime so that it means the average time to
> start a worker in any given database.  That way, the time between
> autovac runs is not dependent on the number of databases you have.

Hrm... how would that work?

BTW, another thought is to only sleep if you've scanned through every
database and found nothing to do.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: autovacuum next steps, take 2

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> The proposal to save enough state to be able to resume a vacuum at
> pretty much any point in it's cycle might work; we'd have to benchmark
> it.  With the default maintenance_work_mem of 128M it would mean writing
> out 64M of state every minute on average, which is likely to take
> several seconds to fsync (though, maybe we wouldn't need to fsync it...)

Which is exactly why we needn't bother benchmarking it.  Even if it
weren't complex and unsafe, it will be a net loss when you consider the
fact that it adds I/O instead of removing it.
        regards, tom lane


Re: autovacuum next steps, take 2

From
"Jim C. Nasby"
Date:
On Mon, Feb 26, 2007 at 10:18:36PM -0500, Matthew T. O'Connor wrote:
> Jim C. Nasby wrote:
> >On Mon, Feb 26, 2007 at 06:23:22PM -0500, Matthew T. O'Connor wrote:
> >>I'm not sure how pg_class.relpages is maintained but what happens to a 
> >>bloated table?  For example, a 100 row table that is constantly updated 
> >>and hasn't been vacuumed in a while (say the admin disabled autovacuum 
> >>for a while), now that small 100 row table has 1000 pages in it most of 
> >>which are just bloat, will we miss this table?  Perhaps basing this on 
> >>reltuples would be better?
> >
> >The entire point of this is to ensure that the second daemon will only
> >vacuum tables that it can finish very quickly. If you let a table bloat
> >so it's too big, then you just can't vacuum it very frequently without
> >risking all your other hot tables bloating because they're no longer
> >getting vacuumed.
> >
> >The reality is that you can actually vacuum a pretty good-sized table in
> >60 seconds with typical cost-delay settings (ie: defaults except
> >cost_delay set to 10). That means you can do 9 pages ~100 times a
> >second, or 54k pages a minute. Even with a vacuum_cost_delay of 20,
> >that's still 27k pages per minute.
> 
> At the risk of sounding like a broken record, I still think the size 
> limit threshold is unnecessary.  Since all workers will be working in on 
> tables in size order, younger workers will typically catch older workers 
> fairly quickly since the tables will be either small, or recently 
> vacuumed and not need work.  And since younger workers exit when they 
> catch-up to an older worker, there is some inherent stability in the 
> number of workers.
> 
> Here is a worst case example: A DB with 6 tables all of which are highly 
> active and will need to be vacuumed constantly.  While this is totally 
> hypothetical, it is how I envision things working (without the threshold).

I fail to see how a simple 6 table case is 'worst case'. It's common to
see hundreds of tables, and I've run across more than one database with
thousands of tables (think partitioning). In cases like those it's
certainly possible, perhaps even likely that you would get many daemons
running in the database at one time just from different tables suddenly
needing vacuuming and appearing at a higher point in the list than other
tables. With 100 ~1G tables getting updates it certainly wouldn't be
hard to end up with 10 of those being vacuumed all at the same time.

I do like the idea since it should be easier to tune, but I think we
still need some limit on it. Perhaps as a first-pass we could just have
a hard limit and log a message and/or set a flag any time we hit it.
That would hopefully allow us to get information about how big a problem
it really is. We could go one step further and say that the last daemon
that can start in a database will only vacuum tables that can be done
quickly; that's essentially what we've been talking about, except the
limit we've been discussing would be hard-coded at 2.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: autovacuum next steps, take 2

From
"Jim C. Nasby"
Date:
On Tue, Feb 27, 2007 at 12:37:42AM -0500, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > The proposal to save enough state to be able to resume a vacuum at
> > pretty much any point in it's cycle might work; we'd have to benchmark
> > it.  With the default maintenance_work_mem of 128M it would mean writing
> > out 64M of state every minute on average, which is likely to take
> > several seconds to fsync (though, maybe we wouldn't need to fsync it...)
> 
> Which is exactly why we needn't bother benchmarking it.  Even if it
> weren't complex and unsafe, it will be a net loss when you consider the
> fact that it adds I/O instead of removing it.

Well, it depends on how often you're doing that. Adding extra IO at the
end of 4 hours of vacuuming isn't going to make any real difference, but
once a minute...

Looks like partial vacuum won't help this problem. :(
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: autovacuum next steps, take 2

From
"Matthew T. O'Connor"
Date:
Jim C. Nasby wrote:
> On Mon, Feb 26, 2007 at 10:18:36PM -0500, Matthew T. O'Connor wrote:
>> Jim C. Nasby wrote:
>> Here is a worst case example: A DB with 6 tables all of which are highly 
>> active and will need to be vacuumed constantly.  While this is totally 
>> hypothetical, it is how I envision things working (without the threshold).
> 
> I fail to see how a simple 6 table case is 'worst case'. It's common to
> see hundreds of tables, and I've run across more than one database with
> thousands of tables (think partitioning). In cases like those it's
> certainly possible, perhaps even likely that you would get many daemons
> running in the database at one time just from different tables suddenly
> needing vacuuming and appearing at a higher point in the list than other
> tables. With 100 ~1G tables getting updates it certainly wouldn't be
> hard to end up with 10 of those being vacuumed all at the same time.

Yes, 6 tables is small; the worst-case part of the example was that all 
the tables would need to be vacuumed constantly.  Most databases only 
have a few hot tables.  Most tables only need to be vacuumed every once 
in a while.

> I do like the idea since it should be easier to tune, but I think we
> still need some limit on it. Perhaps as a first-pass we could just have
> a hard limit and log a message and/or set a flag any time we hit it.
> That would hopefully allow us to get information about how big a problem
> it really is. We could go one step further and say that the last daemon
> that can start in a database will only vacuum tables that can be done
> quickly; that's essentially what we've been talking about, except the
> limit we've been discussing would be hard-coded at 2.

I'm confused, what limit would be set at 2?  The number of concurrent 
workers?  I've never said that.


Re: autovacuum next steps, take 2

From
"Matthew T. O'Connor"
Date:
Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
>> The proposal to save enough state to be able to resume a vacuum at
>> pretty much any point in it's cycle might work; we'd have to benchmark
>> it.  With the default maintenance_work_mem of 128M it would mean writing
>> out 64M of state every minute on average, which is likely to take
>> several seconds to fsync (though, maybe we wouldn't need to fsync it...)
> 
> Which is exactly why we needn't bother benchmarking it.  Even if it
> weren't complex and unsafe, it will be a net loss when you consider the
> fact that it adds I/O instead of removing it.

I'm not sure what you are saying here: are you now saying that partial 
vacuum won't work for autovac?  Or are you saying that saving state as 
Jim is describing above won't work?


Re: autovacuum next steps, take 2

From
Tom Lane
Date:
"Matthew T. O'Connor" <matthew@zeut.net> writes:
> I'm not sure what you are saying here, are you now saying that partial 
> vacuum won't work for autovac?  Or are you saying that saving state as 
> Jim is describing above won't work?

I'm saying that I don't like the idea of trying to "stop on a dime" by
saving the current contents of vacuum's dead-TID array to disk with the
idea that we can trust those values 100% later.  Saving the array is
expensive both in runtime and code complexity, and I don't believe we
can trust it later --- at least not without even more expensive-and-
complex measures, such as WAL-logging every such save :-(

I'm for stopping only after completing an index-cleaning pass, at the
point where we empty the dead-TID array anyway.  If you really have to
have "stop on a dime", just kill -INT the process, accepting that you
will have to redo your heap scan since the last restart point.
        regards, tom lane


Re: autovacuum next steps, take 2

From
"Matthew T. O'Connor"
Date:
Tom Lane wrote:
> "Matthew T. O'Connor" <matthew@zeut.net> writes:
>> I'm not sure what you are saying here, are you now saying that partial 
>> vacuum won't work for autovac?  Or are you saying that saving state as 
>> Jim is describing above won't work?
> 
> I'm saying that I don't like the idea of trying to "stop on a dime" by
> saving the current contents of vacuum's dead-TID array to disk with the
> idea that we can trust those values 100% later.  Saving the array is
> expensive both in runtime and code complexity, and I don't believe we
> can trust it later --- at least not without even more expensive-and-
> complex measures, such as WAL-logging every such save :-(
> 
> I'm for stopping only after completing an index-cleaning pass, at the
> point where we empty the dead-TID array anyway.  If you really have to
> have "stop on a dime", just kill -INT the process, accepting that you
> will have to redo your heap scan since the last restart point.

OK, so if I understand correctly, a vacuum of a table with 10 indexes on 
it can be interrupted 10 times, once after each index-cleaning pass? 
That might have some value, especially for breaking up the work required 
to vacuum a large table.  Or am I still not getting it?


Re: autovacuum next steps, take 2

From
"Jim C. Nasby"
Date:
On Tue, Feb 27, 2007 at 01:26:00AM -0500, Matthew T. O'Connor wrote:
> Tom Lane wrote:
> >"Matthew T. O'Connor" <matthew@zeut.net> writes:
> >>I'm not sure what you are saying here, are you now saying that partial 
> >>vacuum won't work for autovac?  Or are you saying that saving state as 
> >>Jim is describing above won't work?
> >
> >I'm saying that I don't like the idea of trying to "stop on a dime" by
> >saving the current contents of vacuum's dead-TID array to disk with the
> >idea that we can trust those values 100% later.  Saving the array is
> >expensive both in runtime and code complexity, and I don't believe we
> >can trust it later --- at least not without even more expensive-and-
> >complex measures, such as WAL-logging every such save :-(
> >
> >I'm for stopping only after completing an index-cleaning pass, at the
> >point where we empty the dead-TID array anyway.  If you really have to
> >have "stop on a dime", just kill -INT the process, accepting that you
> >will have to redo your heap scan since the last restart point.
> 
> OK, so if I understand correct, a vacuum of a table with 10 indexes on 
> it can be interrupted 10 times, once after each index-cleaning pass? 
> That might have some value, especially breaking up the work required to 
> vacuum a large table. Or am I still not getting it?

It'd stop after scanning the heap, scanning all the indexes, and then
cleaning the heap. After that's done it no longer needs any of the
dead-TID info; anytime before that it does need that info, and Tom's
objection is that trying to store that info is a bad idea.

The problem with this is that typically it takes a long time to go
through a complete vacuum cycle; minutes at least, and preferably
longer. Decreasing that cycle time will greatly increase the amount of
IO required for vacuuming a table with any indexes, because every time
you cycle through you have to read the entire index. That's why I don't
see it as being useful at all for getting autovac to work on hot tables
- if you actually got that cycle time low enough you'd kill the system
with all the extra index scanning.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: autovacuum next steps, take 2

From
"Jim C. Nasby"
Date:
On Tue, Feb 27, 2007 at 12:54:28AM -0500, Matthew T. O'Connor wrote:
> Jim C. Nasby wrote:
> >On Mon, Feb 26, 2007 at 10:18:36PM -0500, Matthew T. O'Connor wrote:
> >>Jim C. Nasby wrote:
> >>Here is a worst case example: A DB with 6 tables all of which are highly 
> >>active and will need to be vacuumed constantly.  While this is totally 
> >>hypothetical, it is how I envision things working (without the threshold).
> >
> >I fail to see how a simple 6 table case is 'worst case'. It's common to
> >see hundreds of tables, and I've run across more than one database with
> >thousands of tables (think partitioning). In cases like those it's
> >certainly possible, perhaps even likely that you would get many daemons
> >running in the database at one time just from different tables suddenly
> >needing vacuuming and appearing at a higher point in the list than other
> >tables. With 100 ~1G tables getting updates it certainly wouldn't be
> >hard to end up with 10 of those being vacuumed all at the same time.
> 
> Yes 6 tables is small, the worst-case part of the example was that all 
> the tables would need to be vacuumed constantly.  Most databases only 
> have a few hot tables.  Most tables only need to vacuumed every once in 
> a while.
It's not the hot tables that are the issue; it's how many large tables
(hot or not) can come up for vacuuming in order. For example, if
A-Z are all large tables (ie: a few GB), with A being the largest and Z
the smallest, think about what happens here:

Round 1: A needs vacuuming. Daemon gets to it and starts working.
Round 2: B now needs vacuuming. It's slightly smaller than A, so daemon
2 gets to it.
Round 3: C now needs vacuuming. Daemon 3.
...
Round 26: Z now needs vacuuming. Daemon 26 picks it up.

You now have 26 daemons running in the database.

Now, we can argue about how likely that scenario is, but I don't think
it's relevant. What matters is that it *is* possible, and as long as
that's the case you'd have to have some kind of limit. (While this
simple 26 table example is definitely worst-case, if you've got hundreds
of tables that are all multiple GB in size I think it wouldn't be hard
at all for you to end up with a dozen or more daemons all hammering
away).

> >I do like the idea since it should be easier to tune, but I think we
> >still need some limit on it. Perhaps as a first-pass we could just have
> >a hard limit and log a message and/or set a flag any time we hit it.
> >That would hopefully allow us to get information about how big a problem
> >it really is. We could go one step further and say that the last daemon
> >that can start in a database will only vacuum tables that can be done
> >quickly; that's essentially what we've been talking about, except the
> >limit we've been discussing would be hard-coded at 2.
> 
> I'm confused, what limit would be set at 2?  The number of concurrent 
> workers?  I've never said that.

The point I was making is that the proposal about limiting the 2nd
daemon to only processing tables it can do in a short period of time is
akin to setting a limit of only 2 daemons in a database at a time.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: autovacuum next steps, take 2

From
"Jim C. Nasby"
Date:
On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote:
> Jim C. Nasby wrote:
> 
> > The advantage to keying this to autovac_naptime is that it means we
> > don't need another GUC, but after I suggested that before I realized
> > that's probably not the best idea. For example, I've seen clusters that
> > are running dozens-hundreds of databases; in that environment you really
> > need to turn naptime way down (to like a second). In that case you
> > wouldn't want to key to naptime.
> 
> Actually, I've been thinking that it would be a good idea to change the
> semantics of autovacuum_naptime so that it means the average time to
> start a worker in any given database.  That way, the time between
> autovac runs is not dependent on the number of databases you have.

BTW, another issue that I don't think we can ignore: we actually need to
do this on a per-tablespace level, or at least have the ability to
disable or somehow limit it. While it's not common, there are users that
run a hundred or more databases in a single cluster; it would be ugly if
we suddenly had 100 vacuums trying to run on the same set of drives
concurrently.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: autovacuum next steps, take 2

From
"Matthew T. O'Connor"
Date:
Jim C. Nasby wrote:
> On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote:
>> Jim C. Nasby wrote:
>>
>>> The advantage to keying this to autovac_naptime is that it means we
>>> don't need another GUC, but after I suggested that before I realized
>>> that's probably not the best idea. For example, I've seen clusters that
>>> are running dozens-hundreds of databases; in that environment you really
>>> need to turn naptime way down (to like a second). In that case you
>>> wouldn't want to key to naptime.
>> Actually, I've been thinking that it would be a good idea to change the
>> semantics of autovacuum_naptime so that it means the average time to
>> start a worker in any given database.  That way, the time between
>> autovac runs is not dependent on the number of databases you have.
> 
> BTW, another issue that I don't think we can ignore: we actually need to
> do this on a per-tablespace level, or at least have the ability to
> disable or somehow limit it. While it's not common, there are users that
> run a hundred or more databases in a single cluster; it would be ugly if
> we suddenly had 100 vacuums trying to run on the same set of drives
> concurrently.

I think we all agree that autovacuum needs to become tablespace-aware at 
some point, but I think that is further down the line; we're having 
enough trouble figuring things out without that additional complication.


Re: autovacuum next steps, take 2

From
"Jim C. Nasby"
Date:
On Tue, Feb 27, 2007 at 12:12:22PM -0500, Matthew T. O'Connor wrote:
> Jim C. Nasby wrote:
> >On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote:
> >>Jim C. Nasby wrote:
> >>
> >>>The advantage to keying this to autovac_naptime is that it means we
> >>>don't need another GUC, but after I suggested that before I realized
> >>>that's probably not the best idea. For example, I've seen clusters that
> >>>are running dozens-hundreds of databases; in that environment you really
> >>>need to turn naptime way down (to like a second). In that case you
> >>>wouldn't want to key to naptime.
> >>Actually, I've been thinking that it would be a good idea to change the
> >>semantics of autovacuum_naptime so that it means the average time to
> >>start a worker in any given database.  That way, the time between
> >>autovac runs is not dependent on the number of databases you have.
> >
> >BTW, another issue that I don't think we can ignore: we actually need to
> >do this on a per-tablespace level, or at least have the ability to
> >disable or somehow limit it. While it's not common, there are users that
> >run a hundred or more databases in a single cluster; it would be ugly if
> >we suddenly had 100 vacuums trying to run on the same set of drives
> >concurrently.
> 
> I think we all agree that autovacuum needs to become tablespace aware at 
> some point, but I think that is further down the line, we're having 
> enough trouble figuring things out without that additional complication.

Sure, we just need a way to disable the multiple autovac daemon stuff
then.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: autovacuum next steps, take 2

From
Casey Duncan
Date:
On Feb 26, 2007, at 12:49 PM, Alvaro Herrera wrote:

> Jim C. Nasby wrote:
>
>> That's why I'm thinking it would be best to keep the maximum size of
>> stuff for the second worker small. It probably also makes sense to  
>> tie
>> it to time and not size, since the key factor is that you want it  
>> to hit
>> the high-update tables every X number of seconds.
>>
>> If we wanted to get fancy, we could factor in how far over the vacuum
>> threshold a table is, so even if the table is on the larger size, if
>> it's way over the threshold the second vacuum will hit it.
>
> Ok, I think we may be actually getting somewhere.
>
> I propose to have two different algorithms for choosing the tables to
> work on.  The worker would behave differently, depending on whether
> there is one or more workers on the database already or not.
>
> The first algorithm is the plain threshold equation stuff we use  
> today.
> If a worker connects and determines that no other worker is in the
> database, it uses the "plain worker" mode.  A worker in this mode  
> would
> examine pgstats, determine what tables to vacuum/analyze, sort them by
> size (smaller to larger), and goes about its work.  This kind of  
> worker
> can take a long time to vacuum the whole database -- we don't  
> impose any
> time limit or table size limit to what it can do.
>
> The second mode is the "hot table worker" mode, enabled when the  
> worker
> detects that there's already a worker in the database.  In this mode,
> the worker is limited to those tables that can be vacuumed in less  
> than
> autovacuum_naptime, so large tables are not considered.  Because of
> this, it'll generally not compete with the first mode above -- the
> tables in plain worker were sorted by size, so the small tables were
> among the first vacuumed by the plain worker.  The estimated time to
> vacuum may be calculated according to autovacuum_vacuum_delay  
> settings,
> assuming that all pages constitute cache misses.

Perhaps this has already been proposed, but maybe some combination of  
the following inputs could be used to determine which table most  
needs vacuuming:

- The proportion of tuples in a table that are dead (updated rows  
since last vacuum/estimated row count). This would favor "hot" tables  
naturally regardless of size.

- The time since the last vacuum, so that larger tables are  
eventually vacuumed even if hot tables totally dominate

Of course tables that did not pass the minimum parameters specified  
in postgresql.conf would not even get considered.

I'm being intentionally vague here on the exact algorithm, since you  
all have though about this more than I have. One thing I like about  
the above is that it is independent of table size, and doesn't  
require anyone to determine which tables are hot manually.
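
For example, very roughly (the weights and the normalization here are
invented for illustration, not proposed values):

    def vacuum_priority(dead_tuples, est_live_tuples, seconds_since_last_vacuum,
                        age_scale=86400.0):
        # Dead-tuple fraction favors "hot" tables regardless of size; the
        # age term lets large, slowly-churning tables float up eventually.
        dead_fraction = dead_tuples / float(max(est_live_tuples, 1))
        age_bonus = seconds_since_last_vacuum / age_scale
        return dead_fraction + age_bonus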

-Casey


Re: autovacuum next steps, take 2

From
Galy Lee
Date:
Tom Lane wrote:
> Saving the array is
> expensive both in runtime and code complexity, and I don't believe we
> can trust it later --- at least not without even more expensive-and-
> complex measures, such as WAL-logging every such save :-(
I don't understand well the things you are worrying about.
If we find that we cannot trust the saved file, or the file has been
corrupted, then we can drop it and scan from the beginning of the heap.
If something like CLUSTER or PITR has changed the physical layout of the
heap, then we can simply drop the files.  Why do we need WAL for it?
I don't see any point in it.

Also, I don't think it is expensive.  If it is combined with the
maintenance window to stop once in a whole day, writing 256MB/2 = 128MB
out cannot be called expensive.  Of course, this feature isn't for
autovacuum to use every few minutes; autovacuum can use it after it has
adopted the maintenance window.