autovacuum next steps, take 3 - Mailing list pgsql-hackers
From:       Alvaro Herrera
Subject:    autovacuum next steps, take 3
Date:
Msg-id:     20070309223842.GF10875@alvh.no-ip.org
Responses:  Re: autovacuum next steps, take 3
            Re: autovacuum next steps, take 3
List:       pgsql-hackers
Here is a low-level, very detailed description of the implementation of the autovacuum ideas we have so far.

launcher's dealing with databases
---------------------------------

We'll add a new member "nexttime" to the autovac_dbase struct, which will be the time_t of the next time a worker needs to process that DB. Initially, those times will be 0 for all databases.

The launcher will keep that list in memory, and on each iteration it will fetch the entry that has the earliest time and sleep until that time. When it awakens, it will start a worker on that database and set its nexttime to now + naptime. The list will be a Dllist, so that it's easy to keep it sorted by increasing time, pick the head of the list each time, and then put that node back as the new tail.

Every so often, the launcher will call autovac_get_database_list and compare that list with the list it has in memory. If a new database shows up, the launcher will assign it a nexttime between the current instant and the time of the head of the Dllist, and then put it at the head. The new database will thus be the next one to be processed.

When a node with nexttime = 0 is found, the amount of time to sleep will be determined as Min(naptime/num_elements, 1), so that initially databases will be distributed roughly evenly across the naptime interval. When a nexttime in the past is detected, the launcher will start a worker either right away or as soon as possible (read below).

launcher and worker interactions
--------------------------------

The launcher PID will be in shared memory, so that workers can signal it. We will also keep worker information in shared memory, as an array of WorkerInfo structs:

typedef struct
{
	Oid		wi_dboid;
	Oid		wi_tableoid;
	int		wi_workerpid;
	bool	wi_finished;
} WorkerInfo;

We will use SIGUSR1 to communicate between workers and launcher. When the launcher wants to start a worker, it sets the "dboid" field, signals the postmaster, and goes back to sleep. When a worker has started up and is about to start vacuuming, it will store its PID in workerpid and then send SIGUSR1 to the launcher. If the schedule says there's no need to run a new worker at that point, the launcher simply goes back to sleep.

We cannot call SendPostmasterSignal a second time right after calling it; the second call would be lost. So it is important that the launcher does not try to start a worker while another worker is still starting up. If the launcher wakes up for any reason and detects a WorkerInfo entry with a valid dboid but a workerpid of zero, it will go back to sleep. Since the starting worker will send a signal as soon as it finishes starting up, the launcher will wake up, detect that the startup has completed, and can then start a second worker.

Also, the launcher cannot start new workers when there are autovacuum_max_workers already running. So if there are that many when it wakes up, it cannot do anything but go back to sleep again. When one of those workers finishes, it will wake the launcher by setting the finished flag in its WorkerInfo and sending SIGUSR1 to the launcher. The launcher then wakes up, resets the WorkerInfo struct, and can start another worker if needed.
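For concreteness, here is a rough C sketch of the launcher-side rules just described: never signal the postmaster while a previous worker is still starting up, and never exceed autovacuum_max_workers. It is only an illustration, not code from a patch; the array name AutoVacuumWorkers, the function name, the use of an "AutovacuumLock" LWLock to protect the array, and the PMSIGNAL_START_AUTOVAC_WORKER reason are all names I'm assuming here.

/*
 * Illustration only.  WorkerInfo and autovacuum_max_workers are from the
 * proposal above; AutoVacuumWorkers, AutovacuumLock and
 * PMSIGNAL_START_AUTOVAC_WORKER are assumed names.
 */
static WorkerInfo *AutoVacuumWorkers;	/* array in shared memory */

static bool
launch_worker_if_possible(Oid dboid)
{
	WorkerInfo *freeslot = NULL;
	int			i;

	LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);

	for (i = 0; i < autovacuum_max_workers; i++)
	{
		WorkerInfo *wi = &AutoVacuumWorkers[i];

		if (OidIsValid(wi->wi_dboid) && wi->wi_workerpid == 0)
		{
			/*
			 * A worker is still starting up; a second SendPostmasterSignal
			 * now could be lost, so go back to sleep and wait for that
			 * worker's SIGUSR1 instead.
			 */
			LWLockRelease(AutovacuumLock);
			return false;
		}

		if (!OidIsValid(wi->wi_dboid) && freeslot == NULL)
			freeslot = wi;
	}

	if (freeslot == NULL)
	{
		/* autovacuum_max_workers already running; wait for a wi_finished */
		LWLockRelease(AutovacuumLock);
		return false;
	}

	/* claim the slot, then ask the postmaster to fork the worker */
	freeslot->wi_dboid = dboid;
	freeslot->wi_workerpid = 0;
	freeslot->wi_finished = false;
	LWLockRelease(AutovacuumLock);

	SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER);
	return true;
}

The worker-side half of the handshake is then just filling in wi_workerpid and doing kill(launcher_pid, SIGUSR1) once startup is complete, which is what allows the launcher to issue the next SendPostmasterSignal safely.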
There is an additional problem if, for some reason, a worker starts but is not able to finish its task correctly: it will never set its finished flag, so the launcher will believe it is still starting up. To prevent this, we check the PGPROCs of worker processes and clean up any entries whose processes are not actually running (or whose PIDs correspond to processes that are not autovacuum workers). We only do this when all WorkerInfo structures are in use: frequently enough that the problem cannot cause starvation, but seldom enough that it's not a performance hit.

worker to-do list
-----------------

When each worker starts, it determines which tables to process in the usual fashion: get the pg_autovacuum and pgstat data and compute the equations.

The worker then takes a "snapshot" of what's currently going on in the database, by storing each worker's PID, the table OID it is currently working on, and its to-do list. It removes from its own to-do list the tables already being processed. Finally, it writes the list to disk.

The table list will be written to a file in

    PGDATA/vacuum/<database-oid>/todo.<worker-pid>

The file will consist of table OIDs, in the order in which they are going to be vacuumed.

At this point, vacuuming can begin. Before processing each table, the worker scans the WorkerInfos to see whether a new worker has appeared, in which case it reads that worker's to-do list into memory. Then it again fetches the tables being processed by other workers in the same database and, for each such worker, removes from its own in-memory to-do list all tables mentioned in the other lists that appear earlier than (and including) the table that worker is currently processing. Then it picks the next remaining table in its list. All of this must be done with the Autovacuum LWLock held in exclusive mode, so that no other worker can pick the same table (no I/O takes place here, because the whole lists were read into memory at the start).
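To make the table-picking rule concrete, here is a rough sketch of how a worker could choose its next table. Again, this is only an illustration under assumed names: OtherWorkerTodo (the in-memory copy of another worker's snapshot) and pick_next_table are mine, and I'm spelling the Autovacuum LWLock as AutovacuumLock; the inclusive removal rule and the exclusive-lock requirement are from the description above.

/*
 * Illustration only.  "my_todo" is this worker's in-memory to-do list (an
 * Oid list, in vacuum order); "others" are the snapshots of the other
 * workers in the same database.
 */
typedef struct
{
	Oid		ow_current;		/* table that worker is vacuuming right now */
	int		ow_ntables;
	Oid	   *ow_tables;		/* its to-do list, in vacuum order */
} OtherWorkerTodo;

static Oid
pick_next_table(List **my_todo, OtherWorkerTodo *others, int nothers)
{
	Oid		result = InvalidOid;
	int		i,
			j;

	/* exclusive lock, so that no two workers can pick the same table */
	LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);

	for (i = 0; i < nothers; i++)
	{
		OtherWorkerTodo *ow = &others[i];

		/*
		 * Drop every table that appears in the other worker's list at or
		 * before the table it is currently processing ("inclusive").
		 */
		for (j = 0; j < ow->ow_ntables; j++)
		{
			*my_todo = list_delete_oid(*my_todo, ow->ow_tables[j]);
			if (ow->ow_tables[j] == ow->ow_current)
				break;
		}
	}

	/* the first surviving entry is the next table to vacuum */
	if (*my_todo != NIL)
		result = linitial_oid(*my_todo);

	LWLockRelease(AutovacuumLock);
	return result;
}

Since every worker applies the same removal rule while holding the same lock, no two workers can end up picking the same table, which is the invariant this scheme relies on.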
other things to consider
------------------------

This proposal doesn't deal with the hot tables stuff at all, but that is very easy to bolt on later: just change the first phase, where the initial to-do list is determined, to exclude "cold" tables. That way, the vacuuming will be fast. Determining what is a cold table is still an exercise for the reader ...

It may be interesting to avoid vacuuming at all when there's a long-running transaction in progress. That way we avoid wasting I/O for nothing, for example when there's a pg_dump running.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support