Thread: autovacuum next steps, take 3

autovacuum next steps, take 3

From
Alvaro Herrera
Date:
Here is a low-level, very detailed description of the implementation of
the autovacuum ideas we have so far.

launcher's dealing with databases
---------------------------------

We'll add a new member "nexttime" to the autovac_dbase struct, which
will be the time_t of the next time a worker needs to process that DB.
Initially, those times will be 0 for all databases.  The launcher will
keep that list in memory, and on each iteration it will fetch the entry
that has the earliest time, and sleep until that time.  When it awakens,
it will start a worker on that database and set the nexttime to
now+naptime.

The list will be a Dllist so that it's easy to keep it sorted by
increasing time and picking the head of the list each time, and then
putting that node as a new tail.

Every so often, the launcher will call autovac_get_database_list and
compare that list with the list it has in memory.  If a new database
appears in the list, the launcher will assign it a nexttime between the
current instant and the time of the head of the Dllist, and then put it
at the new head.  The new database will thus become the next database to
be processed.

When a node with nexttime=0 is found, the amount of time to sleep will
be naptime/num_elements (but no less than one second), so that initially
databases will be distributed roughly evenly across the naptime
interval.

When a nexttime in the past is detected, the launcher will start a
worker either right away or as soon as possible (read below).
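
As a rough illustration of the scheduling logic described above, here is
a standalone sketch; the plain array, the stub launch_worker(), and the
NAPTIME constant stand in for the backend's Dllist, the postmaster
signal, and autovacuum_naptime:

#include <stdio.h>
#include <time.h>
#include <unistd.h>

typedef unsigned int Oid;

typedef struct avdb
{
    Oid     dboid;      /* database to process */
    time_t  nexttime;   /* when a worker should next process it; 0 = not yet scheduled */
} avdb;

#define NAPTIME 60      /* stands in for autovacuum_naptime, in seconds */

static void
launch_worker(Oid dboid)
{
    /* in the real launcher this would ask the postmaster to fork a worker */
    printf("worker requested for database %u\n", dboid);
}

static void
launcher_loop(avdb *dbs, int ndbs)
{
    for (;;)
    {
        int     i,
                best = 0;
        time_t  now = time(NULL);

        /* pick the entry with the earliest nexttime (head of the list) */
        for (i = 1; i < ndbs; i++)
            if (dbs[i].nexttime < dbs[best].nexttime)
                best = i;

        if (dbs[best].nexttime == 0)
        {
            /* never scheduled: spread databases evenly over the naptime */
            int     sleeptime = NAPTIME / ndbs;

            if (sleeptime < 1)
                sleeptime = 1;
            sleep(sleeptime);
        }
        else if (dbs[best].nexttime > now)
            sleep((unsigned int) (dbs[best].nexttime - now));
        /* else: nexttime is in the past, start a worker right away */

        launch_worker(dbs[best].dboid);
        dbs[best].nexttime = time(NULL) + NAPTIME;  /* node becomes the new tail */
    }
}

int
main(void)
{
    avdb    dbs[] = {{16384, 0}, {16385, 0}, {16386, 0}};

    launcher_loop(dbs, 3);      /* runs forever, like the real launcher */
    return 0;
}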


launcher and worker interactions
--------------------------------

The launcher PID will be in shared memory, so that workers can signal
it.  We will also keep worker information in shared memory as an array
of WorkerInfo structs:

typedef struct
{
    Oid     wi_dboid;
    Oid     wi_tableoid;
    int     wi_workerpid;
    bool    wi_finished;
} WorkerInfo;

We will use SIGUSR1 to communicate between workers and launcher.  When
the launcher wants to start a worker, it sets the "dboid" field and
signals the postmaster, and then goes back to sleep.  When a worker has
started up and is about to start vacuuming, it will store its PID in
workerpid, and then send a SIGUSR1 to the launcher.  If the schedule
says that there's no need to run a new worker, the launcher will go back
to sleeping.
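
In outline, the handshake described above might look like the following
sketch; request_worker_fork() stands in for SendPostmasterSignal(), and
the WorkerInfo lives in ordinary static memory rather than shared
memory:

#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

typedef unsigned int Oid;

typedef struct
{
    Oid     wi_dboid;       /* database the worker should process */
    Oid     wi_tableoid;    /* table currently being vacuumed */
    int     wi_workerpid;   /* PID of the worker, once it has started */
    bool    wi_finished;    /* set when the worker is done */
} WorkerInfo;

/* these live in shared memory in the real code */
static WorkerInfo startingWorkerData;
static WorkerInfo *startingWorker = &startingWorkerData;
static pid_t launcher_pid;      /* set at launcher startup */

/* stand-in for SendPostmasterSignal(): ask the postmaster to fork a worker */
static void
request_worker_fork(void)
{
}

/* launcher side: request a worker for a database, then go back to sleep */
static void
launcher_start_worker(Oid dboid)
{
    startingWorker->wi_dboid = dboid;
    startingWorker->wi_workerpid = 0;       /* not started yet */
    request_worker_fork();
}

/* worker side: called once startup is done, just before vacuuming begins */
static void
worker_announce_startup(void)
{
    startingWorker->wi_workerpid = (int) getpid();
    kill(launcher_pid, SIGUSR1);            /* wake the launcher */
}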

We cannot call SendPostmasterSignal a second time just after calling it;
the second call would be lost.  So it is important that the launcher
not try to start a worker while another one is still starting.  Thus, if
the launcher wakes up for any reason and detects that there is a
WorkerInfo entry with a valid dboid but a workerpid of zero, it will go back
to sleep.  Since the starting worker will send a signal as soon as it
finishes starting up, the launcher will wake up, detect this condition
and then it can start a second worker.

Also, the launcher cannot start new workers when there are
autovacuum_max_workers already running.  So if there are that many when
it wakes up, it cannot do anything else but go back to sleep again.
When one of those workers finishes, it will wake the launcher by setting
the finished flag on its WorkerInfo, and sending SIGUSR1 to the
launcher.  The launcher then wakes up, resets the WorkerInfo struct, and
can start another worker if needed.

There is an additional problem if, for some reason, a worker starts and
is not able to finish its task correctly.  It will not be able to set
its finished flag, so the launcher will believe that it's still starting
up.  To prevent this problem, we check the PGPROCs of worker processes,
and clean them up if we find they are not actually running (or the PIDs
correspond to processes that are not autovacuum workers).  We only do
this when all WorkerInfo structures are in use, which is frequently
enough that this problem doesn't cause any starvation, but seldom enough
that it's not a performance hit.
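
A sketch of the decision the launcher makes when it wakes up, under the
rules above; worker_process_alive() stands in for the PGPROC check, and
autovacuum_max_workers is reduced to a constant:

#include <stdbool.h>
#include <string.h>

typedef unsigned int Oid;

typedef struct
{
    Oid     wi_dboid;
    Oid     wi_tableoid;
    int     wi_workerpid;
    bool    wi_finished;
} WorkerInfo;

#define MAX_WORKERS 3           /* stands in for autovacuum_max_workers */

static WorkerInfo workers[MAX_WORKERS];     /* shared memory array in the real code */

/* stand-in: is the given PID a live autovacuum worker?  (real code checks PGPROCs) */
static bool
worker_process_alive(int pid)
{
    return pid != 0;
}

/*
 * Decide whether the launcher may start another worker right now.
 * Returns true if it can; false means "go back to sleep".
 */
static bool
launcher_can_start_worker(void)
{
    int     nrunning = 0;
    int     i;

    for (i = 0; i < MAX_WORKERS; i++)
    {
        WorkerInfo *wi = &workers[i];

        if (wi->wi_finished)
        {
            /* worker signalled completion: reclaim its slot */
            memset(wi, 0, sizeof(WorkerInfo));
            continue;
        }

        if (wi->wi_dboid != 0 && wi->wi_workerpid == 0)
            return false;       /* a worker is still starting up; wait for its signal */

        if (wi->wi_workerpid != 0)
            nrunning++;
    }

    if (nrunning < MAX_WORKERS)
        return true;

    /* all slots in use: clean up any worker that died without finishing */
    for (i = 0; i < MAX_WORKERS; i++)
    {
        if (!worker_process_alive(workers[i].wi_workerpid))
        {
            memset(&workers[i], 0, sizeof(WorkerInfo));
            return true;
        }
    }

    return false;
}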


worker to-do list
-----------------

When each worker starts, it determines which tables to process in the
usual fashion: get pg_autovacuum and pgstat data and compute the
equations.

The worker then takes a "snapshot" of what's currently going on in the
database, by storing worker PIDs, the corresponding table OID that's
being currently worked, and the to-do list for each worker.

It removes from its to-do list the tables being processed.  Finally, it
writes the list to disk.

The table list will be written to a file in
PGDATA/vacuum/<database-oid>/todo.<worker-pid>
The file will consist of table OIDs, in the order in which they are
going to be vacuumed.
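
A minimal sketch of writing that file, assuming the worker's current
working directory is PGDATA (as it is for a backend process); error
handling is reduced to a return code:

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

typedef unsigned int Oid;

/*
 * Write the worker's to-do list to PGDATA/vacuum/<dboid>/todo.<pid>,
 * one table OID per line, in the order they will be vacuumed.
 */
static int
write_todo_file(Oid dboid, const Oid *tables, int ntables)
{
    char    path[128];
    FILE   *fp;
    int     i;

    mkdir("vacuum", 0700);                  /* ignore error if it already exists */
    snprintf(path, sizeof(path), "vacuum/%u", dboid);
    mkdir(path, 0700);

    snprintf(path, sizeof(path), "vacuum/%u/todo.%d", dboid, (int) getpid());
    fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    for (i = 0; i < ntables; i++)
        fprintf(fp, "%u\n", tables[i]);

    fclose(fp);
    return 0;
}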

At this point, vacuuming can begin.

Before processing each table, it scans the WorkerInfos to see if there's
a new worker, in which case it reads its to-do list to memory.

Then it again fetches the tables being processed by other workers in the
same database, and for each other worker, removes from its own in-memory
to-do all those tables mentioned in the other lists that appear earlier
than the current table being processed (inclusive).  Then it picks the
next non-removed table in the list.  All of this must be done with the
Autovacuum LWLock grabbed in exclusive mode, so that no other worker can
pick the same table (no I/O takes place here, because the whole lists
were saved in memory at the start.)
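
A rough sketch of that selection step; the Autovacuum LWLock
acquire/release calls are reduced to stand-ins, and the other workers'
to-do lists are assumed to have been copied into local memory already,
as described above:

#include <stdbool.h>

typedef unsigned int Oid;

#define MAX_TODO 1024

typedef struct
{
    Oid     tables[MAX_TODO];   /* table OIDs, in vacuum order */
    bool    skipped[MAX_TODO];  /* true if claimed by someone else (or done) */
    int     ntables;
    Oid     current;            /* table being processed right now; in the real
                                 * code this comes from the worker's WorkerInfo
                                 * in shared memory, 0 if none */
} TodoList;

/* stand-ins for acquiring/releasing the Autovacuum LWLock in exclusive mode */
static void autovac_lock(void)   {}
static void autovac_unlock(void) {}

/*
 * Pick the next table for "mine", given copies of the other workers'
 * to-do lists and what each of them is currently processing.
 * Returns the chosen table OID, or 0 if nothing is left.
 */
static Oid
pick_next_table(TodoList *mine, TodoList *others, int nothers)
{
    Oid     result = 0;
    int     i, j, k;

    autovac_lock();

    for (i = 0; i < nothers; i++)
    {
        TodoList   *other = &others[i];

        if (other->current == 0)
            continue;           /* that worker is not processing anything yet */

        /* skip everything in "other" up to and including its current table */
        for (j = 0; j < other->ntables; j++)
        {
            Oid     claimed = other->tables[j];

            for (k = 0; k < mine->ntables; k++)
                if (mine->tables[k] == claimed)
                    mine->skipped[k] = true;

            if (claimed == other->current)
                break;
        }
    }

    /* pick the earliest remaining table of ours */
    for (k = 0; k < mine->ntables; k++)
    {
        if (!mine->skipped[k])
        {
            mine->skipped[k] = true;        /* we are taking it */
            mine->current = result = mine->tables[k];
            break;
        }
    }

    autovac_unlock();
    return result;
}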


other things to consider
------------------------

This proposal doesn't deal with the hot tables stuff at all, but that is
very easy to bolt on later: just change the first phase, where the
initial to-do list is determined, to exclude "cold" tables.  That way,
the vacuuming will be fast.  Determining what is a cold table is still
an exercise to the reader ...

It may be interesting to avoid vacuuming at all when there's a
long-running transaction in progress.  That way we avoid wasting I/O for
nothing, for example when there's a pg_dump running.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: autovacuum next steps, take 3

From
"Matthew T. O'Connor"
Date:
My initial reaction is that this looks good to me, but I still have a
few comments below.

Alvaro Herrera wrote:
> Here is a low-level, very detailed description of the implementation of
> the autovacuum ideas we have so far.
> 
> launcher's dealing with databases
> ---------------------------------

[ Snip ]

> launcher and worker interactions

[Snip]

> worker to-do list
> -----------------
> When each worker starts, it determines which tables to process in the
> usual fashion: get pg_autovacuum and pgstat data and compute the
> equations.
> 
> The worker then takes a "snapshot" of what's currently going on in the
> database, by storing worker PIDs, the corresponding table OID that's
> being currently worked, and the to-do list for each worker.

Does a new worker really care about the PID of other workers or what 
table they are currently working on?

> It removes from its to-do list the tables being processed.  Finally, it
> writes the list to disk.

Just to be clear: the new worker removes from its to-do list all the
tables mentioned in the to-do lists of all the other workers?

> The table list will be written to a file in
> PGDATA/vacuum/<database-oid>/todo.<worker-pid>
> The file will consist of table OIDs, in the order in which they are
> going to be vacuumed.
> 
> At this point, vacuuming can begin.

This all sounds good to me so far.

> Before processing each table, it scans the WorkerInfos to see if there's
> a new worker, in which case it reads its to-do list to memory.

It's not clear to me why a worker cares that there is a new worker, 
since the new worker is going to ignore all the tables that are already 
claimed by all worker todo lists.

> Then it again fetches the tables being processed by other workers in the
> same database, and for each other worker, removes from its own in-memory
> to-do all those tables mentioned in the other lists that appear earlier
> than the current table being processed (inclusive).  Then it picks the
> next non-removed table in the list.  All of this must be done with the
> Autovacuum LWLock grabbed in exclusive mode, so that no other worker can
> pick the same table (no I/O takes place here, because the whole lists
> were saved in memory at the start.)

Again, it's not clear to me what this is gaining us.  It seems to me
that once a worker starts up and writes out its to-do list, it should
just work through it; I don't see the value in workers constantly
updating their to-do lists.  Maybe I'm just missing something; can you
enlighten me?

> other things to consider
> ------------------------
> 
> This proposal doesn't deal with the hot tables stuff at all, but that is
> very easy to bolt on later: just change the first phase, where the
> initial to-do list is determined, to exclude "cold" tables.  That way,
> the vacuuming will be fast.  Determining what is a cold table is still
> an exercise to the reader ...

I think we can make this algorithm naturally favor small / hot tables
with one small change: have workers remove tables they have just
vacuumed from their to-do lists and re-write their to-do lists to disk.
Assuming the to-do lists are ordered by ascending size, smaller tables
will become available for inspection by newer workers sooner rather
than later.



Re: autovacuum next steps, take 3

From
Tom Lane
Date:
"Matthew T. O'Connor" <matthew@zeut.net> writes:
> Does a new worker really care about the PID of other workers or what 
> table they are currently working on?

As written, it needs the PIDs so it can read in the other workers' todo
lists (which are in files named by PID).

> It's not clear to me why a worker cares that there is a new worker, 
> since the new worker is going to ignore all the tables that are already 
> claimed by all worker todo lists.

That seems wrong to me, since it means that new workers will ignore
tables that are scheduled for processing by an existing worker, no
matter how far in the future that schedule extends.  As an example,
suppose you have half a dozen large tables in need of vacuuming.
The first worker in will queue them all up, and subsequent workers
will do nothing useful, at least not till the first worker is done
with the first table.  Having the first worker update its todo
list file after each table allows the earlier tables to be exposed
for reconsideration, but that's expensive and it does nothing for
later tables.

I suggest that maybe we don't need exposed TODO lists at all.  Rather
the workers could have internal TODO lists that are priority-sorted
in some way, and expose only their current table OID in shared memory.
Then the algorithm for processing each table in your list is
     1. Grab the AutovacSchedule LWLock exclusively.
     2. Check to see if another worker is currently processing
        that table; if so drop LWLock and go to next list entry.
     3. Recompute whether table needs vacuuming; if not,
        drop LWLock and go to next entry.  (This test covers the
        case where someone vacuumed the table since you made your
        list.)
     4. Put table OID into shared memory, drop LWLock, then
        vacuum table.
     5. Clear current-table OID from shared memory, then
        repeat for next list entry.

This creates a behavior of "whoever gets to it first" rather than
allowing workers to claim tables that they actually won't be able
to service any time soon.
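
A compact sketch of that loop; the AutovacSchedule LWLock calls, the
stats recheck, and the vacuum call are reduced to stand-in functions,
and the per-worker "current table" slots are illustrative:

#include <stdbool.h>

typedef unsigned int Oid;

#define MAX_WORKERS 3

/* per-worker "current table" slots; shared memory in the real code */
static Oid  current_table[MAX_WORKERS];

/* stand-ins for acquiring/releasing the AutovacSchedule LWLock */
static void sched_lock(void)   {}
static void sched_unlock(void) {}

/* stand-ins for the stats recheck and for running VACUUM */
static bool still_needs_vacuum(Oid relid) { (void) relid; return true; }
static void vacuum_table(Oid relid)       { (void) relid; }

static void
process_todo_list(int my_slot, const Oid *todo, int ntodo)
{
    int     i, w;

    for (i = 0; i < ntodo; i++)
    {
        Oid     relid = todo[i];
        bool    taken = false;

        /* 1. grab the AutovacSchedule lock exclusively */
        sched_lock();

        /* 2. skip if another worker is already on this table */
        for (w = 0; w < MAX_WORKERS; w++)
            if (w != my_slot && current_table[w] == relid)
                taken = true;

        /* 3. recompute whether it still needs vacuuming */
        if (taken || !still_needs_vacuum(relid))
        {
            sched_unlock();
            continue;           /* go to next list entry */
        }

        /* 4. advertise the table, drop the lock, vacuum it */
        current_table[my_slot] = relid;
        sched_unlock();
        vacuum_table(relid);

        /* 5. clear the current-table slot and move on */
        current_table[my_slot] = 0;
    }
}
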
        regards, tom lane


Re: autovacuum next steps, take 3

From
"Matthew T. O'Connor"
Date:
Tom Lane wrote:
> "Matthew T. O'Connor" <matthew@zeut.net> writes:
>> It's not clear to me why a worker cares that there is a new worker, 
>> since the new worker is going to ignore all the tables that are already 
>> claimed by all worker todo lists.
> 
> That seems wrong to me, since it means that new workers will ignore
> tables that are scheduled for processing by an existing worker, no
> matter how far in the future that schedule extends.  As an example,
> suppose you have half a dozen large tables in need of vacuuming.
> The first worker in will queue them all up, and subsequent workers
> will do nothing useful, at least not till the first worker is done
> with the first table.  Having the first worker update its todo
> list file after each table allows the earlier tables to be exposed
> for reconsideration, but that's expensive and it does nothing for
> later tables.

Well, the big problem we have is not that large tables are being
starved, so this doesn't bother me too much; besides, there is only so
much I/O, so one worker working sequentially through the big tables
seems OK to me.

> I suggest that maybe we don't need exposed TODO lists at all.  Rather
> the workers could have internal TODO lists that are priority-sorted
> in some way, and expose only their current table OID in shared memory.
> Then the algorithm for processing each table in your list is
> 
>     1. Grab the AutovacSchedule LWLock exclusively.
>     2. Check to see if another worker is currently processing
>        that table; if so drop LWLock and go to next list entry.
>     3. Recompute whether table needs vacuuming; if not,
>        drop LWLock and go to next entry.  (This test covers the
>        case where someone vacuumed the table since you made your
>        list.)
>     4. Put table OID into shared memory, drop LWLock, then
>        vacuum table.
>     5. Clear current-table OID from shared memory, then
>        repeat for next list entry.
> 
> This creates a behavior of "whoever gets to it first" rather than
> allowing workers to claim tables that they actually won't be able
> to service any time soon.

Right, but you could wind up with as many workers working concurrently
as you have tables in a database, which doesn't seem like a good idea
either.  One thing I like about the todo list setup Alvaro had is that 
new workers will be assigned fewer tables to work on and hence exit 
sooner.  We are going to fire off a new worker every autovac_naptime so 
availability of new workers isn't going to be a problem.



Re: autovacuum next steps, take 3

From
Galy Lee
Date:
Alvaro Herrera wrote:
>worker to-do list
>-----------------
>It removes from its to-do list the tables being processed.  Finally, it
>writes the list to disk.

I am worried about the worker to-do list in your proposal.  I don't
think the worker is the right place to maintain a vacuum task list;
instead, it would be better to maintain a unified vacuum task queue in
autovacuum shared memory.

Here are the basic ideas:

* Why is such a task queue needed?

- The launcher could schedule all vacuum tasks through such a queue.  It
provides a facility for smarter task scheduling in future autovacuum
improvements.

- Also, such a task list could easily be exposed through a system view.
This could be implemented easily in 8.3 on top of the task queue.

* VACUUM task queue

VACUUM tasks for the whole cluster are maintained in a unified,
cluster-wide queue in autovacuum shared memory:
 global shared TaskInfo tasks[];

It can be viewed as:

SELECT * FROM pg_autovacuum_tasks;
 dbid  | relid | group | worker
-------+-------+-------+--------
 20000 | 20001 |     0 |   1001
 20000 | 20002 |     0 |
 30000 | 30001 |     0 |   1002

VACUUM tasks belonging to the same database might be divided into
several groups, and one worker might be assigned to process one specific
task group.
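
For illustration, one possible shape of such a fixed-size queue in
shared memory; all names and fields here are hypothetical, not from any
actual patch:

typedef unsigned int Oid;

/* one entry in the cluster-wide vacuum task queue */
typedef struct TaskInfo
{
    Oid     ti_dboid;       /* database the table belongs to */
    Oid     ti_reloid;      /* table to vacuum */
    int     ti_group;       /* task group; one worker handles one group */
    int     ti_workerpid;   /* worker assigned to it, 0 if unassigned */
} TaskInfo;

typedef struct TaskQueue
{
    int      tq_size;       /* allocated entries, fixed at postmaster start */
    int      tq_used;       /* entries currently filled */
    TaskInfo tq_tasks[1];   /* really tq_size entries, allocated in shared memory */
} TaskQueue;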

The task queue might be filled by dedicated task-gathering-worker or it
might be filled by *external task gatherer*.

This allows an external program to implement a more sophisticated vacuum
scheme.  Based on previous discussion, it appears difficult to implement
an all-purpose algorithm that satisfies the requirements of every
application, so it is better to allow users to develop their own vacuum
strategies.  A *user-defined external program* might fill the task queue
and schedule tasks with its own strategy; the launcher would then be
responsible only for coordinating workers.  This pluggable-vacuum-strategy
approach seems like a good solution.

* status of worker

It is also convenient to allow the user to monitor the status of the
vacuum workers through a system view.  A snapshot of the workers could
be viewed as:

SELECT * FROM pg_autovacuum_workers;
 pid  |  dbid | relid | group
------+-------+-------+-------
 1001 | 20000 | 20001 |     0
 1002 | 30000 | 30001 |     0


Best Regards
Galy Lee
lee.galy _at_ oss.ntt.co.jp
NTT Open Source Software Center



Re: autovacuum next steps, take 3

From
Tom Lane
Date:
Galy Lee <lee.galy@oss.ntt.co.jp> writes:
> I am worried about the worker to-do list in your proposal.  I don't
> think the worker is the right place to maintain a vacuum task list;
> instead, it would be better to maintain a unified vacuum task queue in
> autovacuum shared memory.

Shared memory is fixed-size.
        regards, tom lane


Re: autovacuum next steps, take 3

From
Alvaro Herrera
Date:
Galy Lee wrote:
> 
> Alvaro Herrera wrote:
> >worker to-do list
> >-----------------
> >It removes from its to-do list the tables being processed.  Finally, it
> >writes the list to disk.
> 
> I am worried about the worker to-do list in your proposal.  I don't
> think the worker is the right place to maintain a vacuum task list;
> instead, it would be better to maintain a unified vacuum task queue in
> autovacuum shared memory.

Galy,

Thanks for your comments.

I like the idea of having a global task queue, but sadly it doesn't work
for a simple reason: the launcher does not have enough information to
build it.  This is because we need access to catalogs in the database;
pg_class and pg_autovacuum in the current code, and the catalogs related
to the maintenance window feature when we implement it in the (hopefully
near) future.

Another point to be made, though of less importance, is that we cannot
keep such a task list in shared memory, because we aren't able to grow
that memory after postmaster start.  It is of lesser importance, because
we could keep the task list in plain files on disk; this is merely a
SMOP.  The functions to expose the task list to SQL queries would just
need to read those files.  It would be slower than shared memory,
certainly, but I don't think it's a showstopper (given the amount of
work VACUUM takes, anyway).

Not having access to the catalogs is a much more serious problem for the
scheduling.  One could think about dumping catalogs to plain files that
are readable to the launcher, but this is not very workable: how do you
dump pg_class and have it up to date all the time?  You'd have to be
writing that file pretty frequently, which doesn't sound a very good
idea.

Another idea I had was having a third kind of autovacuum process, namely a
"schedule builder", which would connect to the database, read catalogs,
compute needed vacuuming, write to disk, and exit.  This seems similar
to your task-gathering worker.  The launcher could then dispatch regular
workers as appropriate.  Furthermore, the launcher could create a global
schedule, based on the combination of the schedules for all databases.
I dismissed this idea because a schedule gets out of date very quickly
as tables continue to be used by regular operation.  A worker starting
at t0 may find that a task list built at t0-5 min  is not very relevant.
So it needs to build a new task list anyway, which then begs the
question of why not just let the worker itself build its task list?
Also, combining schedules is complicated, and you soon find yourself
asking the DBA to give each database a priority, which is annoying.

So the idea I am currently playing with is to have workers determine the
task list at start, by looking at both the catalogs and considering the
task lists of other workers.  I think this is the natural evolution of
the other ideas -- the worker is just smarter to start with, and the
whole thing is a lot simpler.


> The task queue might be filled by dedicated task-gathering-worker or it
> might be filled by *external task gatherer*.

The idea of an external task gatherer is an interesting one which I
think would make sense to implement in the future.  I think it is not
very difficult to implement once the proposal we're currently discussing
is done, because it just means we have to modify the part where each
worker decides what needs to be done, and at what times the launcher
decides to start a worker on each database.  The rest of the stuff I'm
working on is just infrastructure to make it happen.

So I think your basic idea here is still workable, just not right now.
Let's discuss it again as soon as I'm done with the current stuff.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: autovacuum next steps, take 3

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> worker to-do list
> -----------------
> 
> When each worker starts, it determines which tables to process in the
> usual fashion: get pg_autovacuum and pgstat data and compute the
> equations.
> 
> The worker then takes a "snapshot" of what's currently going on in the
> database, by storing worker PIDs, the corresponding table OID that's
> being currently worked, and the to-do list for each worker.
> 
> It removes from its to-do list the tables being processed.  Finally, it
> writes the list to disk.
> 
> The table list will be written to a file in
> PGDATA/vacuum/<database-oid>/todo.<worker-pid>
> The file will consist of table OIDs, in the order in which they are
> going to be vacuumed.
> 
> At this point, vacuuming can begin.
> 
> Before processing each table, it scans the WorkerInfos to see if there's
> a new worker, in which case it reads its to-do list to memory.
> 
> Then it again fetches the tables being processed by other workers in the
> same database, and for each other worker, removes from its own in-memory
> to-do all those tables mentioned in the other lists that appear earlier
> than the current table being processed (inclusive).  Then it picks the
> next non-removed table in the list.  All of this must be done with the
> Autovacuum LWLock grabbed in exclusive mode, so that no other worker can
> pick the same table (no I/O takes place here, because the whole lists
> were saved in memory at the start.)

Sorry, I confused matters here by not clarifying on-disk to-do lists
versus in-memory ones.  When we write the to-do list to a file, that is
the to-do list that other workers will see.  It will not change; when I
say "remove a table from the to-do list", it will be removed from the
to-do list in memory, but the file will not get rewritten.

Note that a worker will not remove from its list a table that's in the
to-do list of another worker but not yet processed.  It will only remove
those tables that are currently being processed (i.e. they appear in the
shared memory entry for that worker), and any tables that appear _before
that one_ on that particular worker's file.

So this behaves very much like what Tom describes in an email downthread,
not like what Matthew is thinking.  In fact I'm thinking that the above
is needlessly complex, and that Tom's proposal is simpler and achieves
pretty much the same effect, so I'll have a look at evolving from that
instead.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: autovacuum next steps, take 3

From
Alvaro Herrera
Date:
Tom Lane wrote:

> I suggest that maybe we don't need exposed TODO lists at all.  Rather
> the workers could have internal TODO lists that are priority-sorted
> in some way, and expose only their current table OID in shared memory.
> Then the algorithm for processing each table in your list is
> 
>     1. Grab the AutovacSchedule LWLock exclusively.
>     2. Check to see if another worker is currently processing
>        that table; if so drop LWLock and go to next list entry.
>     3. Recompute whether table needs vacuuming; if not,
>        drop LWLock and go to next entry.  (This test covers the
>        case where someone vacuumed the table since you made your
>        list.)
>     4. Put table OID into shared memory, drop LWLock, then
>        vacuum table.
>     5. Clear current-table OID from shared memory, then
>        repeat for next list entry.
> 
> This creates a behavior of "whoever gets to it first" rather than
> allowing workers to claim tables that they actually won't be able
> to service any time soon.

The point I'm not very sure about is that this proposal means we need to
do I/O with the AutovacSchedule LWLock grabbed, to obtain up-to-date
stats.  Also, if the table was finished being vacuumed just before this
algorithm runs, and pgstats hasn't had the chance to write the updated
stats yet, we may run an unneeded vacuum.

In my proposal, all I/O was done before grabbing the lock.  We may have
to drop the lock and read the file of a worker that just started, but
that should be rare.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: autovacuum next steps, take 3

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> 1. Grab the AutovacSchedule LWLock exclusively.
>> 2. Check to see if another worker is currently processing
>> that table; if so drop LWLock and go to next list entry.
>> 3. Recompute whether table needs vacuuming; if not,
>> drop LWLock and go to next entry.  (This test covers the
>> case where someone vacuumed the table since you made your
>> list.)
>> 4. Put table OID into shared memory, drop LWLock, then
>> vacuum table.
>> 5. Clear current-table OID from shared memory, then
>> repeat for next list entry.

> The point I'm not very sure about is that this proposal means we need to
> do I/O with the AutovacSchedule LWLock grabbed, to obtain up-to-date
> stats.

True.  You could probably drop the lock while rechecking stats, at the
cost of having to recheck for collision (repeat step 2) afterwards.
Or recheck stats before you start, but if collisions are likely then
that's a waste of time.  But on the third hand, does it matter?
Rechecking the stats should be much cheaper than a vacuum operation,
so I'm not seeing that there's going to be a problem.  It's not like
there are going to be hundreds of workers contending for that lock...
        regards, tom lane


Re: autovacuum next steps, take 3

From
Galy Lee
Date:
Hi, Alvaro

Alvaro Herrera wrote:
> keep such a task list in shared memory, because we aren't able to grow
> that memory after postmaster start. 

We can use fixed-size shared memory to maintain such a queue.  The
maximum number of tasks is the number of tables in the cluster, so the
size of the queue can be the same as max_fsm_relations, which is usually
larger than the number of tables and indexes in the cluster.  This is
sufficient to contain most of the vacuum tasks.

Even if the queue overflows, the task gatherer scans the whole cluster
every autovacuum_naptime, so it is quick enough to pick the dropped
tasks up again.  We don't need to write anything to an external file, so
there is no problem with using fixed-size shared memory to maintain a
global queue.

> Other idea I had was having a third kind of autovacuum process, namely a
> "schedule builder"

If we have such a global queue, a task-gathering worker can connect to
every database every naptime to gather tasks in a timely way.

The task-gathering worker won't build the schedule; the launcher or an
external program is responsible for that.  How to dispatch tasks to
workers is just a scheduling problem: a good dispatching algorithm needs
to ensure that each worker can finish its tasks on time, which might
resolve the headache of hot tables.  But this is a further issue to be
discussed after 8.3.

Best Regards

Galy Lee
lee.galy _at_ oss.ntt.co.jp
NTT Open Source Software Center



Re: autovacuum next steps, take 3

From
Tom Lane
Date:
Galy Lee <lee.galy@oss.ntt.co.jp> writes:
> We can use fixed-size shared memory to maintain such a queue.  The
> maximum number of tasks is the number of tables in the cluster, so the
> size of the queue can be the same as max_fsm_relations, which is usually
> larger than the number of tables and indexes in the cluster.

The trouble with that analogy is that the system can still operate
reasonably sanely when max_fsm_relations is exceeded (at least, the
excess relations behave no worse than they did before we had FSM).
If there are relations that autovacuum ignores indefinitely because they
don't fit in a fixed-size work queue, that will be a big step backward
from prior behavior.

In any case, I still haven't seen a good case made why a global work
queue will provide better behavior than each worker keeping a local
queue.  The need for small "hot" tables to be visited more often than
big tables suggests to me that a global queue will actually be
counterproductive, because you'll have to contort the algorithm in
some hard-to-understand way to get it to do that.
        regards, tom lane


Re: autovacuum next steps, take 3

From
ITAGAKI Takahiro
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> In any case, I still haven't seen a good case made why a global work
> queue will provide better behavior than each worker keeping a local
> queue.  The need for small "hot" tables to be visited more often than
> big tables suggests to me that a global queue will actually be
> counterproductive, because you'll have to contort the algorithm in
> some hard-to-understand way to get it to do that.

If we have external vacuum schedulers, they need to see and touch the
contents of the work queue; that's why he suggested the shared work
queue.  I think the present autovacuum strategy is not enough for some
heavily-used cases and needs more sophisticated schedulers, even if the
optimization for hot tables is added.  Also, the best vacuum strategies
are highly dependent on the system, so I don't think we can supply one
monolithic strategy that fits all purposes.

That was a proposal for infrastructure to allow interaction between
autovacuum and user-land vacuum schedulers.  Of course, we can supply a
simple scheduler for not-so-high-load systems, but I need a kind of
autovacuum that can be controlled from an external program that knows
the user application well.

We could use a completely separate autovacuum daemon, like
contrib/pg_autovacuum in 8.0, but I think it is good for us to share
some of the code between a built-in scheduler and external ones.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: autovacuum next steps, take 3

From
Tom Lane
Date:
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In any case, I still haven't seen a good case made why a global work
>> queue will provide better behavior than each worker keeping a local
>> queue.

> If we have external vacuum schedulers, they need to see and touch the
> contents of the work queue.

Who said anything about external schedulers?  I remind you that this is
AUTOvacuum.  If you want to implement manual scheduling you can still
use plain 'ol vacuum commands.
        regards, tom lane


Re: autovacuum next steps, take 3

From
ITAGAKI Takahiro
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Who said anything about external schedulers?  I remind you that this is
> AUTOvacuum.  If you want to implement manual scheduling you can still
> use plain 'ol vacuum commands.

I think we can split autovacuum into two (or more?) functions:
task gatherers and task workers. We don't have to bother with
the monolithic style of current autovacuum.


Galy said:
> The task queue might be filled by dedicated task-gathering-worker or it
> might be filled by *external task gatherer*.

Alvaro said:
> The idea of an external task gatherer is an interesting one which I
> think would make sense to implement in the future.  I think it is not
> very difficult to implement once the proposal we're currently discussing
> is done

I said:
> We could use a completely separate autovacuum daemon, like
> contrib/pg_autovacuum in 8.0, but I think it is good for us to share
> some of the code between a built-in scheduler and external ones.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: autovacuum next steps, take 3

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Tom Lane wrote:
> >> 1. Grab the AutovacSchedule LWLock exclusively.
> >> 2. Check to see if another worker is currently processing
> >> that table; if so drop LWLock and go to next list entry.
> >> 3. Recompute whether table needs vacuuming; if not,
> >> drop LWLock and go to next entry.  (This test covers the
> >> case where someone vacuumed the table since you made your
> >> list.)
> >> 4. Put table OID into shared memory, drop LWLock, then
> >> vacuum table.
> >> 5. Clear current-table OID from shared memory, then
> >> repeat for next list entry.
> 
> > The point I'm not very sure about is that this proposal means we need to
> > do I/O with the AutovacSchedule LWLock grabbed, to obtain up-to-date
> > stats.
> 
> True.  You could probably drop the lock while rechecking stats, at the
> cost of having to recheck for collision (repeat step 2) afterwards.
> Or recheck stats before you start, but if collisions are likely then
> that's a waste of time.  But on the third hand, does it matter?
> Rechecking the stats should be much cheaper than a vacuum operation,
> so I'm not seeing that there's going to be a problem.  It's not like
> there are going to be hundreds of workers contending for that lock...

It turns out that it does matter, because not only do we need to read pgstats,
but we also need to fetch the pg_autovacuum and pg_class rows again for
the table.  So we must release the AutovacuumSchedule lock before trying
to open pg_class etc.

Unless we are prepared to "cache" (keep a private copy of) the contents
of said tuples between the first check (i.e. when building the initial
table list) and the recheck?  This is possible as well, but it gives me
an uneasy feeling.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.