Thread: autovacuum next steps, take 3
Here is a low-level, very detailed description of the implementation of
the autovacuum ideas we have so far.

launcher's dealing with databases
---------------------------------

We'll add a new member "nexttime" to the autovac_dbase struct, which will
be the time_t of the next time a worker needs to process that DB.
Initially, those times will be 0 for all databases.

The launcher will keep that list in memory, and on each iteration it will
fetch the entry that has the earliest time, and sleep until that time.
When it awakens, it will start a worker on that database and set the
nexttime to now+naptime.  The list will be a Dllist so that it's easy to
keep it sorted by increasing time, picking the head of the list each
time, and then putting that node back as the new tail.

Every so often, the launcher will call autovac_get_database_list and
compare that list with the list it has in memory.  If a new database is
in the list, it will assign it a nexttime between the current instant and
the time of the head of the Dllist, and then put it as the new head.  The
new database will thus be the next database to be processed.

When a node with nexttime=0 is found, the amount of time to sleep will be
determined as Min(naptime/num_elements, 1), so that initially databases
will be distributed roughly evenly in the naptime interval.  When a
nexttime in the past is detected, the launcher will start a worker either
right away or as soon as possible (read below).

launcher and worker interactions
--------------------------------

The launcher PID will be in shared memory, so that workers can signal it.
We will also keep worker information in shared memory as an array of
WorkerInfo structs:

typedef struct
{
	Oid		wi_dboid;
	Oid		wi_tableoid;
	int		wi_workerpid;
	bool	wi_finished;
} WorkerInfo;

We will use SIGUSR1 to communicate between workers and launcher.

When the launcher wants to start a worker, it sets the "dboid" field and
signals the postmaster, then goes back to sleep.  When a worker has
started up and is about to start vacuuming, it will store its PID in
workerpid, and then send a SIGUSR1 to the launcher.  If the schedule says
that there's no need to run a new worker, the launcher will go back to
sleeping.

We cannot call SendPostmasterSignal a second time just after calling it;
the second call would be lost.  So it is important that the launcher does
not try to start a worker while another worker is still starting.  Thus,
if the launcher wakes up for any reason and detects that there is a
WorkerInfo entry with a valid dboid but a workerpid of zero, it will go
back to sleep.  Since the starting worker will send a signal as soon as
it finishes starting up, the launcher will wake up, detect this
condition, and then it can start a second worker.

Also, the launcher cannot start new workers when there are
autovacuum_max_workers already running.  So if there are that many when
it wakes up, it cannot do anything else but go back to sleep again.  When
one of those workers finishes, it will wake the launcher by setting the
finished flag on its WorkerInfo and sending SIGUSR1 to the launcher.  The
launcher then wakes up, resets the WorkerInfo struct, and can start
another worker if needed.
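For concreteness, the launcher-side checks just described might look
roughly like the toy sketch below, before asking the postmaster for a
new worker.  The WorkerInfo fields follow the struct above; everything
else (the array, the helpers, the constants, the standalone main) is an
assumption for illustration, not actual PostgreSQL code.

/*
 * Toy sketch of the launcher-side checks described above.  The real code
 * would live in shared memory and use SendPostmasterSignal/SIGUSR1; the
 * stubs here just print what would happen.
 */
#include <stdio.h>
#include <stdbool.h>

typedef unsigned int Oid;
#define InvalidOid 0

typedef struct
{
    Oid  wi_dboid;      /* database the worker is (to be) working on */
    Oid  wi_tableoid;   /* table currently being vacuumed */
    int  wi_workerpid;  /* set by the worker once it has started up */
    bool wi_finished;   /* set by the worker when it exits */
} WorkerInfo;

#define AUTOVACUUM_MAX_WORKERS 3

static WorkerInfo workers[AUTOVACUUM_MAX_WORKERS];  /* stand-in for shmem */

/*
 * Can we request a new worker right now?  Not if one is still starting up
 * (valid dboid, no PID yet), because a second SendPostmasterSignal would
 * be lost; not if all slots are busy either.
 */
static bool
can_request_worker(void)
{
    int busy = 0;

    for (int i = 0; i < AUTOVACUUM_MAX_WORKERS; i++)
    {
        if (workers[i].wi_dboid != InvalidOid && workers[i].wi_workerpid == 0)
            return false;               /* a worker is still starting up */
        if (workers[i].wi_dboid != InvalidOid && !workers[i].wi_finished)
            busy++;
    }
    return busy < AUTOVACUUM_MAX_WORKERS;
}

static void
request_worker(Oid dboid)
{
    for (int i = 0; i < AUTOVACUUM_MAX_WORKERS; i++)
    {
        if (workers[i].wi_dboid == InvalidOid)
        {
            workers[i].wi_dboid = dboid;
            /* the real launcher would signal the postmaster here and
             * sleep until the worker's SIGUSR1 arrives */
            printf("requested worker for DB %u in slot %d\n", dboid, i);
            return;
        }
    }
}

int
main(void)
{
    if (can_request_worker())
        request_worker(16384);          /* made-up database OID */

    /* a second request is refused until the worker reports its PID */
    printf("can request another now? %s\n",
           can_request_worker() ? "yes" : "no");

    workers[0].wi_workerpid = 12345;    /* worker stored its PID, sent SIGUSR1 */
    printf("after startup signal: %s\n",
           can_request_worker() ? "yes" : "no");
    return 0;
}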
There is an additional problem if, for some reason, a worker starts and
is not able to finish its task correctly.  It will not be able to set its
finished flag, so the launcher will believe that it's still starting up.
To prevent this problem, we check the PGPROCs of worker processes, and
clean them up if we find they are not actually running (or the PIDs
correspond to processes that are not autovacuum workers).  We only do it
when all WorkerInfo structures are in use: frequently enough that this
problem doesn't cause any starvation, but seldom enough that it's not a
performance hit.

worker to-do list
-----------------

When each worker starts, it determines which tables to process in the
usual fashion: get pg_autovacuum and pgstat data and compute the
equations.

The worker then takes a "snapshot" of what's currently going on in the
database, by storing worker PIDs, the corresponding table OID that's
being currently worked, and the to-do list for each worker.

It removes from its to-do list the tables being processed.  Finally, it
writes the list to disk.

The table list will be written to a file in
PGDATA/vacuum/<database-oid>/todo.<worker-pid>
The file will consist of table OIDs, in the order in which they are
going to be vacuumed.

At this point, vacuuming can begin.

Before processing each table, it scans the WorkerInfos to see if there's
a new worker, in which case it reads its to-do list to memory.

Then it again fetches the tables being processed by other workers in the
same database, and for each other worker, removes from its own in-memory
to-do all those tables mentioned in the other lists that appear earlier
than the current table being processed (inclusive).  Then it picks the
next non-removed table in the list.  All of this must be done with the
Autovacuum LWLock grabbed in exclusive mode, so that no other worker can
pick the same table (no I/O takes place here, because the whole lists
were saved in memory at the start.)

other things to consider
------------------------

This proposal doesn't deal with the hot tables stuff at all, but that is
very easy to bolt on later: just change the first phase, where the
initial to-do list is determined, to exclude "cold" tables.  That way,
the vacuuming will be fast.  Determining what is a cold table is still
an exercise for the reader ...

It may be interesting to avoid vacuuming at all when there's a
long-running transaction in progress.  That way we avoid wasting I/O for
nothing, for example when there's a pg_dump running.

-- 
Alvaro Herrera                         http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
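As a concrete illustration of the file layout in the "worker to-do list"
section above, a worker's list could be written out roughly as below, one
table OID per line in vacuuming order.  The PGDATA/vacuum/<dboid>/todo.<pid>
naming follows the proposal; the plain-stdio code, helper name, and
example OIDs are assumptions for illustration, not actual PostgreSQL
code.

/* Minimal sketch of writing a worker's to-do file. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

typedef unsigned int Oid;

static int
write_todo_file(const char *pgdata, Oid dboid, pid_t workerpid,
                const Oid *tables, int ntables)
{
    char  path[1024];
    FILE *f;

    snprintf(path, sizeof(path), "%s/vacuum", pgdata);
    mkdir(path, 0700);                          /* ignore "already exists" */
    snprintf(path, sizeof(path), "%s/vacuum/%u", pgdata, dboid);
    mkdir(path, 0700);

    snprintf(path, sizeof(path), "%s/vacuum/%u/todo.%d",
             pgdata, dboid, (int) workerpid);
    if ((f = fopen(path, "w")) == NULL)
        return -1;

    /* tables[] already excludes anything other workers are processing */
    for (int i = 0; i < ntables; i++)
        fprintf(f, "%u\n", tables[i]);

    fclose(f);
    return 0;
}

int
main(void)
{
    Oid todo[] = {16390, 16401, 16412};         /* made-up table OIDs */

    return write_todo_file(".", 16384, getpid(), todo, 3) == 0 ? 0 : 1;
}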
My initial reaction is that this looks good to me, but still a few
comments below.

Alvaro Herrera wrote:
> Here is a low-level, very detailed description of the implementation of
> the autovacuum ideas we have so far.
>
> launcher's dealing with databases
> ---------------------------------

[Snip]

> launcher and worker interactions

[Snip]

> worker to-do list
> -----------------
>
> When each worker starts, it determines which tables to process in the
> usual fashion: get pg_autovacuum and pgstat data and compute the
> equations.
>
> The worker then takes a "snapshot" of what's currently going on in the
> database, by storing worker PIDs, the corresponding table OID that's
> being currently worked, and the to-do list for each worker.

Does a new worker really care about the PID of other workers or what
table they are currently working on?

> It removes from its to-do list the tables being processed.  Finally, it
> writes the list to disk.

Just to be clear, the new worker removes from its to-do list all the
tables mentioned in the to-do lists of all the other workers?

> The table list will be written to a file in
> PGDATA/vacuum/<database-oid>/todo.<worker-pid>
> The file will consist of table OIDs, in the order in which they are
> going to be vacuumed.
>
> At this point, vacuuming can begin.

This all sounds good to me so far.

> Before processing each table, it scans the WorkerInfos to see if there's
> a new worker, in which case it reads its to-do list to memory.

It's not clear to me why a worker cares that there is a new worker, since
the new worker is going to ignore all the tables that are already claimed
by all worker to-do lists.

> Then it again fetches the tables being processed by other workers in the
> same database, and for each other worker, removes from its own in-memory
> to-do all those tables mentioned in the other lists that appear earlier
> than the current table being processed (inclusive).  Then it picks the
> next non-removed table in the list.  All of this must be done with the
> Autovacuum LWLock grabbed in exclusive mode, so that no other worker can
> pick the same table (no I/O takes place here, because the whole lists
> were saved in memory at the start.)

Again, it's not clear to me what this is gaining us.  It seems to me that
when a worker starts up and writes out its to-do list, it should just do
it; I don't see the value in workers constantly updating their to-do
lists.  Maybe I'm just missing something; can you enlighten me?

> other things to consider
> ------------------------
>
> This proposal doesn't deal with the hot tables stuff at all, but that is
> very easy to bolt on later: just change the first phase, where the
> initial to-do list is determined, to exclude "cold" tables.  That way,
> the vacuuming will be fast.  Determining what is a cold table is still
> an exercise for the reader ...

I think we can make this algorithm naturally favor small / hot tables
with one small change: have workers remove tables that they just vacuumed
from their to-do lists and re-write their to-do lists to disk.  Assuming
the to-do lists are ordered by ascending size, smaller tables will be
made available for inspection by newer workers sooner rather than later.
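If it helps to visualize that change, here is a toy sketch of a worker
dropping each table from its list as soon as it finishes it and
rewriting the file, so the shrinking list (ordered by ascending size)
becomes visible to newer workers.  The stubs and names are assumptions
for illustration only.

/* Sketch only: the stubs just print what the real code would do. */
#include <stdio.h>

typedef unsigned int Oid;

static void
vacuum_one_table(Oid relid)
{
    printf("VACUUM table %u\n", relid);
}

static void
rewrite_todo_file(const Oid *todo, int n)
{
    printf("rewrite todo file with %d remaining table(s)\n", n);
}

int
main(void)
{
    /* assumed to be ordered by ascending table size */
    Oid todo[] = {16390, 16401, 16412};
    int ntodo = 3;

    while (ntodo > 0)
    {
        vacuum_one_table(todo[0]);

        /* drop the head, then expose the remainder to newer workers */
        for (int i = 1; i < ntodo; i++)
            todo[i - 1] = todo[i];
        ntodo--;

        rewrite_todo_file(todo, ntodo);
    }
    return 0;
}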
"Matthew T. O'Connor" <matthew@zeut.net> writes: > Does a new worker really care about the PID of other workers or what > table they are currently working on? As written, it needs the PIDs so it can read in the other workers' todo lists (which are in files named by PID). > It's not clear to me why a worker cares that there is a new worker, > since the new worker is going to ignore all the tables that are already > claimed by all worker todo lists. That seems wrong to me, since it means that new workers will ignore tables that are scheduled for processing by an existing worker, no matter how far in the future that schedule extends. As an example, suppose you have half a dozen large tables in need of vacuuming. The first worker in will queue them all up, and subsequent workers will do nothing useful, at least not till the first worker is done with the first table. Having the first worker update its todo list file after each table allows the earlier tables to be exposed for reconsideration, but that's expensive and it does nothing for later tables. I suggest that maybe we don't need exposed TODO lists at all. Rather the workers could have internal TODO lists that are priority-sorted in some way, and expose only their current table OID in shared memory. Then the algorithm for processing each table in your list is 1. Grab the AutovacSchedule LWLock exclusively.2. Check to see if another worker is currently processing that table; ifso drop LWLock and go to next list entry.3. Recompute whether table needs vacuuming; if not, drop LWLock and go to nextentry. (This test covers the case where someone vacuumed the table since you made your list.)4. Put table OID intoshared memory, drop LWLock, then vacuum table.5. Clear current-table OID from shared memory, then repeat for nextlist entry. This creates a behavior of "whoever gets to it first" rather than allowing workers to claim tables that they actually won't be able to service any time soon. regards, tom lane
Tom Lane wrote:
> "Matthew T. O'Connor" <matthew@zeut.net> writes:
>> It's not clear to me why a worker cares that there is a new worker,
>> since the new worker is going to ignore all the tables that are already
>> claimed by all worker todo lists.
>
> That seems wrong to me, since it means that new workers will ignore
> tables that are scheduled for processing by an existing worker, no
> matter how far in the future that schedule extends.  As an example,
> suppose you have half a dozen large tables in need of vacuuming.
> The first worker in will queue them all up, and subsequent workers
> will do nothing useful, at least not till the first worker is done
> with the first table.  Having the first worker update its todo
> list file after each table allows the earlier tables to be exposed
> for reconsideration, but that's expensive and it does nothing for
> later tables.

Well, the big problem that we have is not that large tables are being
starved, so this doesn't bother me too much; plus there is only so much
I/O, so one worker working sequentially through the big tables seems OK
to me.

> I suggest that maybe we don't need exposed TODO lists at all.  Rather
> the workers could have internal TODO lists that are priority-sorted
> in some way, and expose only their current table OID in shared memory.
> Then the algorithm for processing each table in your list is
>
> 1. Grab the AutovacSchedule LWLock exclusively.
> 2. Check to see if another worker is currently processing
>    that table; if so drop LWLock and go to next list entry.
> 3. Recompute whether table needs vacuuming; if not,
>    drop LWLock and go to next entry.  (This test covers the
>    case where someone vacuumed the table since you made your
>    list.)
> 4. Put table OID into shared memory, drop LWLock, then
>    vacuum table.
> 5. Clear current-table OID from shared memory, then
>    repeat for next list entry.
>
> This creates a behavior of "whoever gets to it first" rather than
> allowing workers to claim tables that they actually won't be able
> to service any time soon.

Right, but you could wind up with as many workers working concurrently
as you have tables in a database, which doesn't seem like a good idea
either.  One thing I like about the todo list setup Alvaro had is that
new workers will be assigned fewer tables to work on and hence exit
sooner.  We are going to fire off a new worker every autovac_naptime, so
availability of new workers isn't going to be a problem.
Alvaro Herrera wrote:
> worker to-do list
> -----------------
> It removes from its to-do list the tables being processed.  Finally, it
> writes the list to disk.

I am worried about the worker to-do list in your proposal.  I think the
worker isn't suitable to maintain any vacuum task list; instead it is
better to maintain a unified vacuum task queue in autovacuum shared
memory.

Here are the basic ideas:

* Why is such a task queue needed?

  - The launcher might schedule all vacuum tasks using such a queue.  It
    provides a facility to schedule tasks smartly for further autovacuum
    improvements.

  - Also, such a task list can be viewed easily from a system view.  This
    can be implemented easily in 8.3 on top of the task queue.

* VACUUM task queue

  VACUUM tasks of the cluster are maintained in a unified cluster-wide
  queue in the shared memory of autovacuum:

      global shared TaskInfo tasks[];

  It can be viewed as:

      SELECT * FROM pg_autovacuum_tasks;
       dbid  | relid | group | worker
      -------+-------+-------+--------
       20000 | 20001 |     0 |   1001
       20000 | 20002 |     0 |
       30000 | 30001 |     0 |   1002

  VACUUM tasks belonging to the same database might be divided into
  several groups.  One worker might be assigned to process one specific
  task group.

  The task queue might be filled by a dedicated task-gathering worker, or
  it might be filled by an *external task gatherer*.  This allows an
  external program to implement a more sophisticated vacuum scheme.

  Based on previous discussion, it appears that it is difficult to
  implement an all-purpose algorithm to satisfy the requirements of all
  applications.  It is better to allow users to develop their own vacuum
  strategies.  A *user-defined external program* might fill the task
  queue and schedule tasks by its own strategy; the launcher would be
  responsible only for coordinating workers.  This pluggable
  vacuum-strategy approach seems like a good solution.

* Status of workers

  It is also convenient to allow the user to monitor the status of the
  vacuum workers through a system view.  The snapshot of the workers can
  be viewed as:

      SELECT * FROM pg_autovacuum_workers;
       pid  | dbid  | relid | group
      ------+-------+-------+-------
       1001 | 20000 | 20001 |     0
       1002 | 30000 | 30001 |     0

Best Regards
Galy Lee
lee.galy _at_ oss.ntt.co.jp
NTT Open Source Software Center
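To visualize the proposed queue, here is a toy sketch of a TaskInfo
array whose contents the pg_autovacuum_tasks view above would expose.
The field names mirror the view columns; the array size, helper, and
example values are assumptions for illustration, not actual PostgreSQL
code.

/* Sketch of the proposed cluster-wide task queue. */
#include <stdio.h>

typedef unsigned int Oid;

typedef struct TaskInfo
{
    Oid dbid;       /* database containing the table */
    Oid relid;      /* table to vacuum */
    int group;      /* task group; one worker serves one group */
    int workerpid;  /* PID of the worker processing it, or 0 if unassigned */
} TaskInfo;

#define MAX_TASKS 128               /* fixed at postmaster start */

static TaskInfo tasks[MAX_TASKS];   /* stand-in for the shared-memory array */
static int      ntasks = 0;

/* A task gatherer (built-in or external) would append entries like this. */
static void
add_task(Oid dbid, Oid relid, int group)
{
    if (ntasks < MAX_TASKS)
        tasks[ntasks++] = (TaskInfo) {dbid, relid, group, 0};
    /* overflow handling omitted in this sketch */
}

int
main(void)
{
    add_task(20000, 20001, 0);
    add_task(20000, 20002, 0);
    add_task(30000, 30001, 0);
    tasks[0].workerpid = 1001;      /* worker 1001 claims the first task */

    for (int i = 0; i < ntasks; i++)
        printf("db %u rel %u group %d worker %d\n",
               tasks[i].dbid, tasks[i].relid, tasks[i].group,
               tasks[i].workerpid);
    return 0;
}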
Galy Lee <lee.galy@oss.ntt.co.jp> writes:
> I am worried about the worker to-do list in your proposal.  I think the
> worker isn't suitable to maintain any vacuum task list; instead it is
> better to maintain a unified vacuum task queue in autovacuum shared
> memory.

Shared memory is fixed-size.

			regards, tom lane
Galy Lee wrote:
> Alvaro Herrera wrote:
> > worker to-do list
> > -----------------
> > It removes from its to-do list the tables being processed.  Finally,
> > it writes the list to disk.
>
> I am worried about the worker to-do list in your proposal.  I think the
> worker isn't suitable to maintain any vacuum task list; instead it is
> better to maintain a unified vacuum task queue in autovacuum shared
> memory.

Galy,

Thanks for your comments.  I like the idea of having a global task
queue, but sadly it doesn't work for a simple reason: the launcher does
not have enough information to build it.  This is because we need access
to catalogs in the database: pg_class and pg_autovacuum in the current
code, and the catalogs related to the maintenance window feature when we
implement it in the (hopefully near) future.

Another point to be made, though of less importance, is that we cannot
keep such a task list in shared memory, because we aren't able to grow
that memory after postmaster start.  It is of lesser importance because
we could keep the task list in plain files on disk; this is merely a
SMOP.  The functions to expose the task list to SQL queries would just
need to read those files.  It would be slower than shared memory,
certainly, but I don't think it's a showstopper (given the amount of
work VACUUM takes, anyway).

Not having access to the catalogs is a much more serious problem for the
scheduling.  One could think about dumping catalogs to plain files that
are readable to the launcher, but this is not very workable: how do you
dump pg_class and have it up to date all the time?  You'd have to be
writing that file pretty frequently, which doesn't sound like a very
good idea.

Another idea I had was having a third kind of autovacuum process, namely
a "schedule builder", which would connect to the database, read the
catalogs, compute the needed vacuuming, write it to disk, and exit.
This seems similar to your task-gathering worker.  The launcher could
then dispatch regular workers as appropriate.  Furthermore, the launcher
could create a global schedule, based on the combination of the
schedules for all databases.

I dismissed this idea because a schedule gets out of date very quickly
as tables continue to be used by regular operation.  A worker starting
at t0 may find that a task list built at t0 - 5 min is not very
relevant.  So it needs to build a new task list anyway, which then begs
the question of why not just let the worker itself build its task list.
Also, combining schedules is complicated, and you start thinking of
asking the DBA to give each database a priority, which is annoying.

So the idea I am currently playing with is to have workers determine the
task list at start, by looking at both the catalogs and the task lists
of other workers.  I think this is the natural evolution of the other
ideas -- the worker is just smarter to start with, and the whole thing
is a lot simpler.

> The task queue might be filled by a dedicated task-gathering worker, or
> it might be filled by an *external task gatherer*.

The idea of an external task gatherer is an interesting one which I
think would make sense to implement in the future.  I think it is not
very difficult to implement once the proposal we're currently discussing
is done, because it just means we have to modify the part where each
worker decides what needs to be done, and at what times the launcher
decides to start a worker on each database.  The rest of the stuff I'm
working on is just infrastructure to make it happen.
So I think your basic idea here is still workable, just not right now.
Let's discuss it again as soon as I'm done with the current stuff.

-- 
Alvaro Herrera                         http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote:
> worker to-do list
> -----------------
>
> When each worker starts, it determines which tables to process in the
> usual fashion: get pg_autovacuum and pgstat data and compute the
> equations.
>
> The worker then takes a "snapshot" of what's currently going on in the
> database, by storing worker PIDs, the corresponding table OID that's
> being currently worked, and the to-do list for each worker.
>
> It removes from its to-do list the tables being processed.  Finally, it
> writes the list to disk.
>
> The table list will be written to a file in
> PGDATA/vacuum/<database-oid>/todo.<worker-pid>
> The file will consist of table OIDs, in the order in which they are
> going to be vacuumed.
>
> At this point, vacuuming can begin.
>
> Before processing each table, it scans the WorkerInfos to see if there's
> a new worker, in which case it reads its to-do list to memory.
>
> Then it again fetches the tables being processed by other workers in the
> same database, and for each other worker, removes from its own in-memory
> to-do all those tables mentioned in the other lists that appear earlier
> than the current table being processed (inclusive).  Then it picks the
> next non-removed table in the list.  All of this must be done with the
> Autovacuum LWLock grabbed in exclusive mode, so that no other worker can
> pick the same table (no I/O takes place here, because the whole lists
> were saved in memory at the start.)

Sorry, I confused matters here by not clarifying on-disk to-do lists
versus in-memory ones.  When we write the to-do list to a file, that's
the to-do list that other workers will see.  It will not change; when I
say "remove a table from the to-do list", it will be removed from the
to-do list in memory, but the file will not get rewritten.

Note that a worker will not remove from its list a table that's in the
to-do list of another worker but not yet processed.  It will only remove
those tables that are currently being processed (i.e. they appear in the
shared memory entry for that worker), and any tables that appear _before
that one_ in that particular worker's file.

So this behaves very much like what Tom describes in an email
downthread, not like what Matthew is thinking.  In fact I'm thinking
that the above is needlessly complex, and that Tom's proposal is simpler
and achieves pretty much the same effect, so I'll have a look at
evolving from that instead.

-- 
Alvaro Herrera                         http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
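A toy sketch of the pruning rule as clarified in the message above: a
table is dropped from a worker's in-memory list only if it appears in
another worker's saved file at or before the table that worker is
currently processing.  All names and values here are assumptions for
illustration, not actual PostgreSQL code.

/* Sketch of the "earlier than the current table (inclusive)" rule. */
#include <stdio.h>
#include <stdbool.h>

typedef unsigned int Oid;

/*
 * other_file[] is the snapshot of another worker's on-disk to-do list
 * (which never gets rewritten); other_current is the table that worker
 * is processing right now, taken from shared memory.
 */
static bool
should_drop(Oid relid, const Oid *other_file, int n, Oid other_current)
{
    for (int i = 0; i < n; i++)
    {
        if (other_file[i] == relid)
            return true;                /* appears at or before current */
        if (other_file[i] == other_current)
            break;                      /* past that worker's position */
    }
    return false;
}

int
main(void)
{
    Oid mine[]  = {16390, 16401, 16412, 16423};
    Oid other[] = {16401, 16412, 16430};    /* other worker's saved list */
    Oid other_current = 16412;              /* it is working on this one */

    for (int i = 0; i < 4; i++)
        if (!should_drop(mine[i], other, 3, other_current))
            printf("keep table %u\n", mine[i]);   /* keeps 16390 and 16423 */
    return 0;
}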
Tom Lane wrote:
> I suggest that maybe we don't need exposed TODO lists at all.  Rather
> the workers could have internal TODO lists that are priority-sorted
> in some way, and expose only their current table OID in shared memory.
> Then the algorithm for processing each table in your list is
>
> 1. Grab the AutovacSchedule LWLock exclusively.
> 2. Check to see if another worker is currently processing
>    that table; if so drop LWLock and go to next list entry.
> 3. Recompute whether table needs vacuuming; if not,
>    drop LWLock and go to next entry.  (This test covers the
>    case where someone vacuumed the table since you made your
>    list.)
> 4. Put table OID into shared memory, drop LWLock, then
>    vacuum table.
> 5. Clear current-table OID from shared memory, then
>    repeat for next list entry.
>
> This creates a behavior of "whoever gets to it first" rather than
> allowing workers to claim tables that they actually won't be able
> to service any time soon.

The point I'm not very sure about is that this proposal means we need to
do I/O with the AutovacSchedule LWLock grabbed, to obtain up-to-date
stats.  Also, if the table was finished being vacuumed just before this
algorithm runs, and pgstats hasn't had the chance to write the updated
stats yet, we may run an unneeded vacuum.

In my proposal, all I/O was done before grabbing the lock.  We may have
to drop the lock and read the file of a worker that just started, but
that should be rare.

-- 
Alvaro Herrera                         http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> 1. Grab the AutovacSchedule LWLock exclusively.
>> 2. Check to see if another worker is currently processing
>>    that table; if so drop LWLock and go to next list entry.
>> 3. Recompute whether table needs vacuuming; if not,
>>    drop LWLock and go to next entry.  (This test covers the
>>    case where someone vacuumed the table since you made your
>>    list.)
>> 4. Put table OID into shared memory, drop LWLock, then
>>    vacuum table.
>> 5. Clear current-table OID from shared memory, then
>>    repeat for next list entry.

> The point I'm not very sure about is that this proposal means we need to
> do I/O with the AutovacSchedule LWLock grabbed, to obtain up-to-date
> stats.

True.  You could probably drop the lock while rechecking stats, at the
cost of having to recheck for collision (repeat step 2) afterwards.  Or
recheck stats before you start, but if collisions are likely then that's
a waste of time.  But on the third hand, does it matter?  Rechecking the
stats should be much cheaper than a vacuum operation, so I'm not seeing
that there's going to be a problem.  It's not like there are going to be
hundreds of workers contending for that lock...

			regards, tom lane
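For illustration, the variation suggested here (drop the lock for the
recheck, then repeat the collision test before claiming the table) might
look like the toy sketch below.  The lock stubs and helper names are
assumptions, not actual PostgreSQL APIs.

/* Sketch of rechecking stats without holding the lock. */
#include <stdio.h>
#include <stdbool.h>

typedef unsigned int Oid;

static void lock_schedule(void)   { /* LWLockAcquire, exclusive */ }
static void unlock_schedule(void) { /* LWLockRelease */ }

static bool claimed_by_other(Oid relid)   { (void) relid; return false; }
static bool still_needs_vacuum(Oid relid) { (void) relid; return true; }
static void claim_table(Oid relid)        { printf("claim %u\n", relid); }

static bool
try_claim(Oid relid)
{
    lock_schedule();
    if (claimed_by_other(relid))            /* step 2 */
    {
        unlock_schedule();
        return false;
    }
    unlock_schedule();

    /* recheck stats with the lock released; may do pgstat/catalog I/O */
    if (!still_needs_vacuum(relid))         /* step 3 */
        return false;

    lock_schedule();
    if (claimed_by_other(relid))            /* repeat step 2 after the I/O */
    {
        unlock_schedule();
        return false;
    }
    claim_table(relid);                     /* step 4: current-table OID */
    unlock_schedule();
    return true;
}

int
main(void)
{
    if (try_claim(16390))
        printf("VACUUM table %u\n", 16390u);
    return 0;
}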
Hi, Alvaro

Alvaro Herrera wrote:
> keep such a task list in shared memory, because we aren't able to grow
> that memory after postmaster start.

We can use the fixed-size shared memory to maintain such a queue.  The
maximum number of tasks is the number of all tables, so the size of the
queue can be the same as max_fsm_relations, which is usually larger than
the number of tables and indexes in the cluster.  This is sufficient to
contain most of the vacuum tasks.

Even if the queue overflows, since the task gatherer scans the whole
cluster every autovacuum_naptime, it is quick enough to pick those tasks
up again.  We don't need to write anything to an external file.  So
there is no problem with using fixed-size shared memory to maintain a
global queue.

> Another idea I had was having a third kind of autovacuum process,
> namely a "schedule builder"

If we have such a global queue, the task-gathering worker can connect to
every database every naptime to gather tasks in time.  The
task-gathering worker won't build the schedule; the launcher or an
external program is responsible for that activity.

How to dispatch tasks to workers is just a scheduling problem; a good
dispatching algorithm needs to ensure each worker can finish its tasks
on time, and this might resolve the headache of the hot table problem.
But this is a further issue to be discussed after 8.3.

Best Regards
Galy Lee
lee.galy _at_ oss.ntt.co.jp
NTT Open Source Software Center
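As a rough illustration of the sizing argument, the queue's
shared-memory footprint would simply scale with max_fsm_relations.  The
struct and the example value below are assumptions for illustration, not
actual PostgreSQL code.

/* Sketch of the sizing argument: one slot per max_fsm_relations. */
#include <stdio.h>

typedef unsigned int Oid;

typedef struct TaskInfo
{
    Oid dbid;
    Oid relid;
    int group;
    int workerpid;
} TaskInfo;

int
main(void)
{
    /* example value; the real queue would size itself from the
     * max_fsm_relations GUC at postmaster start */
    int    max_fsm_relations = 1000;
    size_t bytes = (size_t) max_fsm_relations * sizeof(TaskInfo);

    printf("task queue: %d slots, %zu bytes of shared memory\n",
           max_fsm_relations, bytes);
    return 0;
}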
Galy Lee <lee.galy@oss.ntt.co.jp> writes:
> We can use the fixed-size shared memory to maintain such a queue.  The
> maximum number of tasks is the number of all tables, so the size of the
> queue can be the same as max_fsm_relations, which is usually larger
> than the number of tables and indexes in the cluster.

The trouble with that analogy is that the system can still operate
reasonably sanely when max_fsm_relations is exceeded (at least, the
excess relations behave no worse than they did before we had FSM).  If
there are relations that autovacuum ignores indefinitely because they
don't fit in a fixed-size work queue, that will be a big step backward
from prior behavior.

In any case, I still haven't seen a good case made why a global work
queue will provide better behavior than each worker keeping a local
queue.  The need for small "hot" tables to be visited more often than
big tables suggests to me that a global queue will actually be
counterproductive, because you'll have to contort the algorithm in some
hard-to-understand way to get it to do that.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In any case, I still haven't seen a good case made why a global work
> queue will provide better behavior than each worker keeping a local
> queue.  The need for small "hot" tables to be visited more often than
> big tables suggests to me that a global queue will actually be
> counterproductive, because you'll have to contort the algorithm in
> some hard-to-understand way to get it to do that.

If we have some external vacuum schedulers, we need to see and touch the
content of the work queue.  That's why he suggested the shared work
queue.

I think the present autovacuum strategy is not enough for some
heavily-used cases and needs more sophisticated schedulers, even if the
optimization for hot tables is added.  Also, the best vacuum strategies
depend heavily on the system, so I don't think we can supply one
monolithic strategy that fits all purposes.  This was a proposal for the
infrastructure for interaction between autovacuum and user-land vacuum
schedulers.

Of course, we can supply a simple scheduler for not-so-high-load
systems, but I need a kind of autovacuum that can be controlled from an
external program that knows the user application well.  We could use a
completely separate autovacuum daemon like contrib/pg_autovacuum of 8.0,
but I think it is good for us to share some of the code between a
built-in scheduler and external ones.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In any case, I still haven't seen a good case made why a global work
>> queue will provide better behavior than each worker keeping a local
>> queue.

> If we have some external vacuum schedulers, we need to see and touch
> the content of the work queue.

Who said anything about external schedulers?  I remind you that this is
AUTOvacuum.  If you want to implement manual scheduling you can still
use plain 'ol vacuum commands.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Who said anything about external schedulers?  I remind you that this is
> AUTOvacuum.  If you want to implement manual scheduling you can still
> use plain 'ol vacuum commands.

I think we can split autovacuum into two (or more?) roles: task
gatherers and task workers.  We don't have to stick with the monolithic
style of the current autovacuum.

Galy said:
> The task queue might be filled by a dedicated task-gathering worker, or
> it might be filled by an *external task gatherer*.

Alvaro said:
> The idea of an external task gatherer is an interesting one which I
> think would make sense to implement in the future.  I think it is not
> very difficult to implement once the proposal we're currently
> discussing is done

I said:
> We could use a completely separate autovacuum daemon like
> contrib/pg_autovacuum of 8.0, but I think it is good for us to share
> some of the code between a built-in scheduler and external ones.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Tom Lane wrote:
> >> 1. Grab the AutovacSchedule LWLock exclusively.
> >> 2. Check to see if another worker is currently processing
> >>    that table; if so drop LWLock and go to next list entry.
> >> 3. Recompute whether table needs vacuuming; if not,
> >>    drop LWLock and go to next entry.  (This test covers the
> >>    case where someone vacuumed the table since you made your
> >>    list.)
> >> 4. Put table OID into shared memory, drop LWLock, then
> >>    vacuum table.
> >> 5. Clear current-table OID from shared memory, then
> >>    repeat for next list entry.
>
> > The point I'm not very sure about is that this proposal means we need
> > to do I/O with the AutovacSchedule LWLock grabbed, to obtain
> > up-to-date stats.
>
> True.  You could probably drop the lock while rechecking stats, at the
> cost of having to recheck for collision (repeat step 2) afterwards.  Or
> recheck stats before you start, but if collisions are likely then
> that's a waste of time.  But on the third hand, does it matter?
> Rechecking the stats should be much cheaper than a vacuum operation,
> so I'm not seeing that there's going to be a problem.  It's not like
> there are going to be hundreds of workers contending for that lock...

Turns out that it does matter, because not only do we need to read
pgstats, but we also need to fetch the pg_autovacuum and pg_class rows
again for the table.  So we must release the AutovacuumSchedule lock
before trying to open pg_class etc.

Unless we are prepared to "cache" (keep a private copy of) the contents
of said tuples between the first check (i.e. when building the initial
table list) and the recheck?  This is possible as well, but it gives me
an uneasy feeling.

-- 
Alvaro Herrera                         http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
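For illustration only, the "keep a private copy" option raised in the
message above might look roughly like this: the worker caches the
per-table settings it read while building its list, so the recheck needs
only fresh pgstat numbers and no catalog access under the lock.  The
field names, helper, and numbers are assumptions, not actual PostgreSQL
code.

/* Sketch of rechecking from cached per-table settings. */
#include <stdio.h>
#include <stdbool.h>

typedef unsigned int Oid;

typedef struct av_cached_entry
{
    Oid   relid;
    float vac_scale_factor;     /* from pg_autovacuum / defaults */
    int   vac_base_threshold;
    float reltuples;            /* from pg_class at list-building time */
} av_cached_entry;

/* stand-in for asking the stats collector for the current dead-tuple count */
static float
pgstat_dead_tuples(Oid relid)
{
    (void) relid;
    return 5000.0f;             /* made-up number for the example */
}

static bool
recheck_needs_vacuum(const av_cached_entry *e)
{
    float threshold = e->vac_base_threshold +
                      e->vac_scale_factor * e->reltuples;

    /* no catalog access here: only the cached copy plus fresh pgstat data */
    return pgstat_dead_tuples(e->relid) > threshold;
}

int
main(void)
{
    av_cached_entry e = {16390, 0.2f, 500, 10000.0f};

    printf("table %u %s vacuuming\n", e.relid,
           recheck_needs_vacuum(&e) ? "still needs" : "no longer needs");
    return 0;
}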