Thread: autovacuum next steps
After staring at my previous notes for autovac scheduling, it has become clear that the basics of it are not really going to work as specified. So here is a more realistic plan:

First, we introduce an autovacuum_max_workers parameter, to limit the total number of workers that can be running at any time. Use this number to create extra PGPROC entries, etc, similar to the way we handle the prepared xacts stuff. The default should be low, say 3 or 4.

The launcher sends a worker into a database just like it does currently. This worker determines what tables need vacuuming per the pg_autovacuum settings and pgstat data. If it's more than one table, it puts the number of tables in shared memory and sends a signal to the launcher. The launcher then starts min(autovacuum_max_workers - currently running workers, tables to vacuum - 1) more workers to process that database. Maybe we could have a per-database max-workers parameter in pg_database to use as a limit here as well.

Each worker, including the initial one, starts vacuuming tables according to pgstat data. They recheck the pgstat data after finishing each table, so that a table vacuumed by another worker is not processed twice (maybe problematic: a table with a high update rate may be vacuumed more than once. Maybe this is a feature, not a bug).

Once autovacuum_naptime has passed, if the workers have not finished yet, the launcher wants to vacuum another database. At this point, the launcher wants some of the workers processing the first database to exit early as soon as they finish one table, so that they can help vacuum the other database. It can do this by setting a flag in shmem that the workers can check when finished with a table; if the flag is set, they exit instead of continuing with another table. The launcher then starts a worker in the second database. The launcher does this until the number of workers is even across both databases.

This can continue until there is one worker per database; so at most autovacuum_max_workers databases can be under automatic vacuuming at any time, one worker each. When there are autovacuum_max_workers databases under vacuum, the launcher doesn't have anything else to do until some worker exits on its own.

When there is a single worker processing a database, it does not recheck pgstat data after each table. This is to prevent a high-update-rate table from starving the vacuuming of other databases.

How does this sound?

-- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
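As a rough illustration of the launch arithmetic described above (the function and variable names are invented; this is only a sketch, not PostgreSQL code):

/* Hypothetical sketch of the "how many extra workers" calculation
 * described above; names are made up for illustration. */
#include <stdio.h>

static int
extra_workers_to_start(int autovacuum_max_workers,
                       int workers_currently_running,
                       int tables_needing_vacuum)
{
    int free_slots = autovacuum_max_workers - workers_currently_running;
    int wanted = tables_needing_vacuum - 1;   /* the first worker is already there */

    if (free_slots < 0)
        free_slots = 0;
    if (wanted < 0)
        wanted = 0;

    return (free_slots < wanted) ? free_slots : wanted;
}

int
main(void)
{
    /* 4 max workers, 1 already running, 6 tables to vacuum -> start 3 more */
    printf("%d\n", extra_workers_to_start(4, 1, 6));
    return 0;
}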
Alvaro Herrera wrote:
> After staring at my previous notes for autovac scheduling, it has become clear that the basics of it are not really going to work as specified. So here is a more realistic plan:

[Snip Detailed Description]

> How does this sound?

On first blush, I'm not sure I like this, as it doesn't directly attack the table starvation problem, and I think it could be a net loss of speed.

VACUUM is I/O bound; as such, just sending multiple vacuum commands at a DB isn't going to make things faster, you are now going to have multiple processes reading from multiple tables at the same time. I think in general this is a bad thing (unless we someday account for I/O made available from multiple tablespaces). In general the only time it's a good idea to have multiple vacuums running at the same time is when a big table is starving a small hot table and causing bloat.

I think we can extend the current autovacuum stats to add one more column that specifies "is hot" or something to that effect. Then when the AV launcher sends a worker to a DB, it will first look for tables marked as hot and work on them. While it is working on hot tables, the launcher need not send any additional workers to this database. If the launcher notices that a worker is working on regular tables, it can send another worker which will look for hot tables to work on; if that worker doesn't find any hot tables that need work, it exits, leaving the original worker to continue plodding along.

Thoughts?
Matthew T. O'Connor wrote:
> Alvaro Herrera wrote:
> > After staring at my previous notes for autovac scheduling, it has become clear that the basics of it are not really going to work as specified. So here is a more realistic plan:
>
> [Snip Detailed Description]
>
> > How does this sound?
>
> On first blush, I'm not sure I like this, as it doesn't directly attack the table starvation problem, and I think it could be a net loss of speed.
>
> VACUUM is I/O bound; as such, just sending multiple vacuum commands at a DB isn't going to make things faster, you are now going to have multiple processes reading from multiple tables at the same time. I think in general this is a bad thing (unless we someday account for I/O made available from multiple tablespaces).

Yeah, I understand that. However, I think that can be remedied by using a reasonable autovacuum_vacuum_cost_delay setting, so that each worker uses less than the total I/O available. The main point of the proposal is to allow multiple workers on a DB while also allowing multiple databases to be processed in parallel.

> I think we can extend the current autovacuum stats to add one more column that specifies "is hot" or something to that effect. Then when the AV launcher sends a worker to a DB, it will first look for tables marked as hot and work on them. While it is working on hot tables, the launcher need not send any additional workers to this database. If the launcher notices that a worker is working on regular tables, it can send another worker which will look for hot tables to work on; if that worker doesn't find any hot tables that need work, it exits, leaving the original worker to continue plodding along.

How would you define what's a "hot" table?

-- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
In an ideal world I think you want precisely one vacuum process running per tablespace on the assumption that each tablespace represents a distinct physical device. The cases where we currently find ourselves wanting more are where small tables are due for vacuuming more frequently than the time it takes for a large table to receive a single full pass.

If we could have autovacuum interrupt a vacuum in mid-sweep, perform a cycle of vacuums on smaller tables, then resume, that problem would go away. That sounds too difficult though, but perhaps we could do something nearly as good.

One option that I've heard before is to have vacuum after a single iteration (ie, after it fills maintenance_work_mem and does the index cleanup and the second heap pass), remember where it was and pick up from that point next time.

If instead autovacuum could tell vacuum exactly how long to run for (or calculated how many pages that represented based on cost_delay) then it could calculate when it will next need to schedule another table in the same tablespace and try to arrange for the vacuum of the large table to be done by then. Once there are no smaller more frequently vacuumed small tables due to be scheduled it would start vacuum for the large table again and it would resume from where the first one left off.

This only works if the large tables really don't need to be vacuumed so often that autovacuum can't keep up. Our current situation is that there is a size at which this happens. But arranging to have only one vacuum process per tablespace will only make that less likely to happen rather than more.

I think the changes to vacuum itself are pretty small to get it to remember where it left off last time and start from mid-table. I'm not sure how easy it would be to get autovacuum to juggle all these variables though.

Of course users may not create separate tablespaces for physical devices, or they may set cost_delay so high you really do need more vacuum processes, etc. So you probably still need a num_vacuum_daemons but the recommended setting would be the same as the number of physical devices and autovacuum could try to divide them equally between tablespaces which would amount to the same thing.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Alvaro Herrera wrote:
> Matthew T. O'Connor wrote:
>> On first blush, I'm not sure I like this, as it doesn't directly attack the table starvation problem, and I think it could be a net loss of speed.
>>
>> VACUUM is I/O bound; as such, just sending multiple vacuum commands at a DB isn't going to make things faster, you are now going to have multiple processes reading from multiple tables at the same time. I think in general this is a bad thing (unless we someday account for I/O made available from multiple tablespaces).
>
> Yeah, I understand that. However, I think that can be remedied by using a reasonable autovacuum_vacuum_cost_delay setting, so that each worker uses less than the total I/O available. The main point of the proposal is to allow multiple workers on a DB while also allowing multiple databases to be processed in parallel.

So you are telling people to choose an autovacuum_delay so high that they need to run multiple autovacuums at once to keep up? I'm probably being too dramatic, but it seems inconsistent.

>> I think we can extend the current autovacuum stats to add one more column that specifies "is hot" or something to that effect. Then when the AV launcher sends a worker to a DB, it will first look for tables marked as hot and work on them. While it is working on hot tables, the launcher need not send any additional workers to this database. If the launcher notices that a worker is working on regular tables, it can send another worker which will look for hot tables to work on; if that worker doesn't find any hot tables that need work, it exits, leaving the original worker to continue plodding along.
>
> How would you define what's a "hot" table?

I wasn't clear: I would have the admin specify it, and we can store it as an additional column in the pg_autovacuum_settings table. Or perhaps if the table is below some size threshold and autovacuum sees that it needs to be vacuumed every time it checks it, 10 times in a row or something like that.
alvherre@commandprompt.com (Alvaro Herrera) writes: > When there is a single worker processing a database, it does not recheck > pgstat data after each table. This is to prevent a high-update-rate > table from starving the vacuuming of other databases. This case is important; I don't think that having multiple workers fully alleviates the problem condition. Pointedly, you need to have a way of picking up tables often enough to avoid the XID rollover problem. That may simply require that on some periodic basis, a query is run to queue up tables that are getting close to having an "XID problem." -- (reverse (concatenate 'string "ofni.secnanifxunil" "@" "enworbbc")) http://linuxfinances.info/info/finances.html Rules of the Evil Overlord #189. "I will never tell the hero "Yes I was the one who did it, but you'll never be able to prove it to that incompetent old fool." Chances are, that incompetent old fool is standing behind the curtain." <http://www.eviloverlord.com/>
Alvaro Herrera <alvherre@commandprompt.com> writes: > Each worker, including the initial one, starts vacuuming tables > according to pgstat data. They recheck the pgstat data after finishing > each table, so that a table vacuumed by another worker is not processed > twice (maybe problematic: a table with high update rate may be vacuumed > more than once. Maybe this is a feature not a bug). How are you going to make that work without race conditions? ISTM practically guaranteed that all the workers will try to vacuum the same table. > Once autovacuum_naptime has passed, if the workers have not finished > yet, the launcher wants to vacuum another database. This seems a rather strange design, as it will encourage concentrations of workers in a single database. Wouldn't it be better to spread them out among multiple databases by default? regards, tom lane
Alvaro Herrera wrote:
> > Once autovacuum_naptime... autovacuum_max_workers...
> How does this sound?

The knobs exposed on autovacuum feel kinda tangential to what I think I'd really want to control. IMHO "vacuum_mbytes_per_second" would be quite a bit more intuitive than cost_delay, naptime, etc.

ISTM I can relatively easily estimate and/or spec out how much "extra" I/O bandwidth I have per device for vacuum; and would pretty much want vacuum to be constantly running on whichever table needs it the most, so long as it can stay under that bandwidth limit.

Could vacuum have a tunable that says "X MBytes/second" (perhaps per device) and have it measure how much I/O it's actually doing and try to stay under that limit? For more fine-grained control a cron job could go around setting different MBytes/second limits during peak times vs idle times.

If people are concerned about CPU intensive vacuums instead of I/O intensive ones (does anyone experience that?), another tuneable "vacuum_percent_of_cpu" would be more straightforward than cost_delay, cost_page_hit, etc. But I'd be a bit surprised if CPU intensive vacuums are common.
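As a rough sketch of what such a throttle could look like (the names, the policy, and the fixed 4 MB/s budget are all invented; this is not how the existing cost-based delay works):

/* Rough sketch of a "vacuum_mbytes_per_second" throttle along the lines
 * suggested above.  Everything here is illustrative only. */
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

static double vacuum_mbytes_per_second = 4.0;   /* admin-settable budget */
static double bytes_since_sleep = 0.0;
static struct timespec last_sleep;

static void
throttle_io(size_t bytes_just_read)
{
    struct timespec now;
    double elapsed, budget;

    bytes_since_sleep += (double) bytes_just_read;

    clock_gettime(CLOCK_MONOTONIC, &now);
    elapsed = (now.tv_sec - last_sleep.tv_sec)
            + (now.tv_nsec - last_sleep.tv_nsec) / 1e9;

    /* Bytes we were allowed to read in that interval at the target rate. */
    budget = vacuum_mbytes_per_second * 1024.0 * 1024.0 * elapsed;

    if (bytes_since_sleep > budget)
    {
        /* Sleep long enough to bring the average back under the budget. */
        double excess_sec = (bytes_since_sleep - budget)
                          / (vacuum_mbytes_per_second * 1024.0 * 1024.0);
        struct timespec nap;

        nap.tv_sec = (time_t) excess_sec;
        nap.tv_nsec = (long) ((excess_sec - nap.tv_sec) * 1e9);
        nanosleep(&nap, NULL);

        clock_gettime(CLOCK_MONOTONIC, &last_sleep);
        bytes_since_sleep = 0.0;
    }
}

int
main(void)
{
    clock_gettime(CLOCK_MONOTONIC, &last_sleep);
    for (int i = 0; i < 1000; i++)
        throttle_io(8192);          /* pretend we just read one 8 kB page */
    return 0;
}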
> One option that I've heard before is to have vacuum after a single iteration (ie, after it fills maintenance_work_mem and does the index cleanup and the second heap pass), remember where it was and pick up from that point next time.

From my experience this is not acceptable... I have tables for which the index cleanup takes hours, so no matter how low I would set the maintenance_work_mem (in fact I set it high enough so there's only one iteration), it will take too much time and the queue tables get overly bloated (not happening either, as they now get special "cluster" treatment).

Cheers, Csaba.
Gregory Stark wrote:
> If we could have autovacuum interrupt a vacuum in mid-sweep, perform a cycle of vacuums on smaller tables, then resume, that problem would go away. That sounds too difficult though, but perhaps we could do something nearly as good.

I think giving vacuum this interrupt-and-resume capability would be quite useful for large tables. It can give autovacuum more flexibility to create a good scheduling scheme. Sometimes it takes a whole day to vacuum a large table (a table of hundreds of GB may qualify); setting cost_delay can make it last even several days. If the system has a maintenance window, the vacuum task for a large table can be split to fit into that window using the interrupt-and-resume feature.

> One option that I've heard before is to have vacuum after a single iteration (ie, after it fills maintenance_work_mem and does the index cleanup and the second heap pass), remember where it was and pick up from that point next time.

Even a single iteration may take a long time, so it is not very useful to break only at iteration boundaries. I think it is not so difficult to get vacuum to remember where it left off and start from that point next time. The following is a basic idea.

A typical vacuum process mainly has the following phases:

Phase 1. scan heap
Phase 2. scan and sweep index
Phase 3. sweep heap
Phase 4. update FSM
Phase 5. truncate CLOG

Wherever vacuum is interrupted, we can save the collected information to disk, and restore it later when vacuum restarts. When the vacuum process is interrupted, we can remember the dead tuple list and the block number it has scanned in phase 1; the indexes it has cleaned up in phase 2; the tuples it has swept in phase 3. Before exiting from vacuum, we can also merge the free space information into the FSM.

We are working on this feature now. I will propose it later to discuss with you.

Best Regards Galy Lee -- NTT Open Source Software Center
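A sketch of the kind of state an interruptible vacuum might persist, following the phases listed above (the structure name, fields and layout are purely illustrative, not an actual or proposed on-disk format):

/* Illustrative sketch only; not a real on-disk format. */
#include <stdint.h>
#include <stdio.h>

typedef struct VacuumResumeState
{
    uint32_t    relation_oid;       /* table being vacuumed */
    uint8_t     phase;              /* 1 = scan heap .. 5 = truncate CLOG */
    uint32_t    last_heap_block;    /* progress within phase 1 */
    uint32_t    indexes_cleaned;    /* progress within phase 2 */
    uint32_t    last_swept_block;   /* progress within phase 3 */
    uint32_t    num_dead_tuples;    /* length of the array that would follow */
    /* an array of dead tuple pointers (block, offset) would follow on disk */
} VacuumResumeState;

int
main(void)
{
    VacuumResumeState st = {0};

    st.phase = 1;                   /* interrupted while scanning the heap */
    st.last_heap_block = 123456;
    printf("interrupted in phase %u at block %lu\n",
           (unsigned) st.phase, (unsigned long) st.last_heap_block);
    return 0;
}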
I'm wondering if we can do one better... Since what we really care about is I/O responsiveness for the rest of the system, could we just time how long I/O calls take to complete? I know that gettimeofday can have a non-trivial overhead, but do we care that much about it in the case of autovac?

On Fri, Feb 16, 2007 at 05:37:26PM -0800, Ron Mayer wrote:
> Alvaro Herrera wrote:
> > Once autovacuum_naptime... autovacuum_max_workers...
> > How does this sound?
>
> The knobs exposed on autovacuum feel kinda tangential to what I think I'd really want to control.
>
> IMHO "vacuum_mbytes_per_second" would be quite a bit more intuitive than cost_delay, naptime, etc.
>
> ISTM I can relatively easily estimate and/or spec out how much "extra" I/O bandwidth I have per device for vacuum; and would pretty much want vacuum to be constantly running on whichever table needs it the most, so long as it can stay under that bandwidth limit.
>
> Could vacuum have a tunable that says "X MBytes/second" (perhaps per device) and have it measure how much I/O it's actually doing and try to stay under that limit?
>
> For more fine-grained control a cron job could go around setting different MBytes/second limits during peak times vs idle times.
>
> If people are concerned about CPU intensive vacuums instead of I/O intensive ones (does anyone experience that?), another tuneable "vacuum_percent_of_cpu" would be more straightforward than cost_delay, cost_page_hit, etc. But I'd be a bit surprised if CPU intensive vacuums are common.

-- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
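A sketch of the timing idea (the 5 ms threshold, the names, and the use of fd 0 as a stand-in file are all invented for illustration; this is not autovacuum code):

/* Time the actual read call and treat slow reads as a sign the device
 * is busy.  Illustrative only. */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double
timed_pread_ms(int fd, void *buf, size_t len, off_t offset)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    (void) pread(fd, buf, len, offset);     /* ignore errors in this sketch */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) * 1000.0
         + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int
main(void)
{
    char page[8192];
    double ms = timed_pread_ms(0, page, sizeof(page), 0);

    if (ms > 5.0)
        printf("reads are slow (%.3f ms): back off before the next page\n", ms);
    else
        printf("reads are fast (%.3f ms): keep going\n", ms);
    return 0;
}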
Ok, scratch that :-) Another round of braindumping below.

Launcher starts one worker in each database. This worker is not going to do vacuum work, just report how much vacuum effort is needed in the database. "Vacuum effort" is measured as the total number of pages in need of vacuum, being the sum of relpages of all tables and indexes needing vacuum. (Note: we weight heap pages the same as index pages. Is this OK?)

Create a plan for vacuuming all those databases within the constraints of max_workers. Databases needing the most work are vacuumed first. One worker per database. Thus max_workers databases are being vacuumed in parallel at this time. When one database is finished, the launcher starts a worker in the next database in the list.

When the plan is complete (i.e. the list is empty) we can do the whole thing again, excluding the databases that are still being vacuumed. Perhaps we should wait autovacuum_naptime seconds between finishing one vacuum round in all databases and starting the next. How do we measure this: do we start sleeping when the last worker finishes, or when the list is empty?

Perhaps we should reserve a worker for vacuuming hot tables. The launcher would then use max_workers - 1 workers for the above plan, and the spare worker would continuously connect to one database, vacuum hot tables, go away, and be started again by the launcher to connect to the next database. Definitional problem: how to decide what's a hot table? One idea (the simplest) is to let the DBA define it. Thus, at most two workers are on any database: one of them working on normal tables, the other on hot tables.

(This idea can be complemented by having another GUC var, autovacuum_hot_workers, which allows the DBA to have more than one worker on hot tables (just for the case where there are too many hot tables). This may be overkill.)

Ron Mayer expressed the thought that we're needlessly complicating the UI for vacuum_delay, naptime, etc. He proposes that instead of having cost_delay etc, we have a mbytes_per_second parameter of some sort. This strikes me as a good idea, but I think we could do that after this proposal is implemented. So this "take 2" could be implemented, and then we could switch the cost_delay stuff to using a MB/s kind of measurement somehow (he says waving his hands wildly).

Greg Stark and Matthew O'Connor say that we're misdirected in having more than one worker per tablespace. I say we're not :-) If we consider Ron Mayer's idea of measuring MB/s, but we do it per tablespace, then we would inflict the correct amount of vacuum pain on each tablespace, sleeping as appropriate. I think this would require workers of different databases to communicate what tablespaces they are using, so that all of them can utilize the correct amount of bandwidth.

I'd like to know if this responds to the mentioned people's objections.

-- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
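A toy sketch of the launcher plan described above -- sort databases by pending effort and keep at most max_workers of them busy (the types, names and effort numbers are made up; a real launcher would refill slots as workers finish):

/* Illustrative only. */
#include <stdio.h>
#include <stdlib.h>

typedef struct DbEffort
{
    const char *dbname;
    long        pages_needing_vacuum;   /* heap + index pages, as proposed */
    int         worker_running;
} DbEffort;

static int
by_effort_desc(const void *a, const void *b)
{
    const DbEffort *da = a;
    const DbEffort *db = b;

    if (da->pages_needing_vacuum == db->pages_needing_vacuum)
        return 0;
    return (da->pages_needing_vacuum < db->pages_needing_vacuum) ? 1 : -1;
}

int
main(void)
{
    DbEffort dbs[] = {
        {"sales", 900000, 0}, {"hr", 1200, 0}, {"logs", 4500000, 0},
    };
    int ndbs = 3, max_workers = 2, running = 0;

    qsort(dbs, ndbs, sizeof(DbEffort), by_effort_desc);

    /* Dispatch one worker per database, busiest first, up to max_workers. */
    for (int i = 0; i < ndbs && running < max_workers; i++)
    {
        dbs[i].worker_running = 1;
        running++;
        printf("launch worker in %s (%ld pages pending)\n",
               dbs[i].dbname, dbs[i].pages_needing_vacuum);
    }
    return 0;
}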
Alvaro Herrera wrote:
> Ok, scratch that :-) Another round of braindumping below.

I still think this is a solution in search of a problem. The main problem we have right now is that hot tables can be starved from vacuum. Most of this proposal doesn't touch that. I would like to see that problem solved first, then we can talk about adding multiple workers per database or per tablespace etc...

> (This idea can be complemented by having another GUC var, autovacuum_hot_workers, which allows the DBA to have more than one worker on hot tables (just for the case where there are too many hot tables). This may be overkill.)

I think this is more along the lines of what we need first.

> Ron Mayer expressed the thought that we're needlessly complicating the UI for vacuum_delay, naptime, etc. He proposes that instead of having cost_delay etc, we have a mbytes_per_second parameter of some sort. This strikes me as a good idea, but I think we could do that after this proposal is implemented. So this "take 2" could be implemented, and then we could switch the cost_delay stuff to using a MB/s kind of measurement somehow (he says waving his hands wildly).

Agree this is probably a good idea in the long run, but I agree this is lower on the priority list and should come next.

> Greg Stark and Matthew O'Connor say that we're misdirected in having more than one worker per tablespace. I say we're not :-) If we consider Ron Mayer's idea of measuring MB/s, but we do it per tablespace, then we would inflict the correct amount of vacuum pain on each tablespace, sleeping as appropriate. I think this would require workers of different databases to communicate what tablespaces they are using, so that all of them can utilize the correct amount of bandwidth.

I agree that in the long run it might be better to have multiple workers with an MB/s throttle and tablespace awareness, but we don't have any of that infrastructure right now. I think the piece of low-hanging fruit that your launcher concept can solve is the hot table starvation.

My proposal: if we require admins to identify hot tables, then:
1) Launcher fires off worker1 into database X.
2) worker1 deals with "hot" tables first, then regular tables.
3) Launcher continues to launch workers into DB X every autovacuum naptime.
4) worker2 (or 3 or 4 etc...) checks whether it is alone in DB X; if so, it acts as worker1 did above. If worker1 is still working in DB X, then worker2 looks for hot tables that are being starved because worker1 got busy. If worker2 finds no hot tables that need work, then worker2 exits.

This seems a very simple solution (given your launcher work) that can solve the starvation problem. Thoughts?
"Alvaro Herrera" <alvherre@commandprompt.com> writes: > Greg Stark and Matthew O'Connor say that we're misdirected in having > more than one worker per tablespace. I say we're not :-) I did say that. But your comment about using a high cost_delay was fairly convincing too. It would be a simpler design and I think you're right. As long as raise both cost_delay and cost_limit by enough you should get pretty much the same sequential i/o rate and not step on each others toes too much. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
> Ron Mayer expressed the thought that we're needlessly complicating the UI for vacuum_delay, naptime, etc. He proposes that instead of having cost_delay etc, we have a mbytes_per_second parameter of some sort. This strikes me as a good idea, but I think we could do that after this proposal is implemented. So this "take 2" could be implemented, and then we could switch the cost_delay stuff to using a MB/s kind of measurement somehow (he says waving his hands wildly).

vacuum should be a process with the least amount of voodoo. If we can just have vacuum_delay and vacuum_threshold, where threshold allows an arbitrary setting of how much bandwidth we will allot to the process, then that is a beyond wonderful thing. It is easy to determine how much IO you have, and what you can spare.

Joshua D. Drake

> Greg Stark and Matthew O'Connor say that we're misdirected in having more than one worker per tablespace. I say we're not :-) If we consider Ron Mayer's idea of measuring MB/s, but we do it per tablespace, then we would inflict the correct amount of vacuum pain on each tablespace, sleeping as appropriate. I think this would require workers of different databases to communicate what tablespaces they are using, so that all of them can utilize the correct amount of bandwidth.
>
> I'd like to know if this responds to the mentioned people's objections.

-- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
> vacuum should be a process with the least amount of voodoo. > If we can just have vacuum_delay and vacuum_threshold, where > threshold allows an arbitrary setting of how much bandwidth > we will allot to the process, then that is a beyond wonderful thing. > > It is easy to determine how much IO you have, and what you can spare. The tricky part is what metric to use. Imho "IO per second" would be good. In a typical DB scenario that is the IO bottleneck, not the Mb/s. Andreas
On Wed, Feb 21, 2007 at 05:40:53PM -0500, Matthew T. O'Connor wrote:
> My proposal: if we require admins to identify hot tables, then:
> 1) Launcher fires off worker1 into database X.
> 2) worker1 deals with "hot" tables first, then regular tables.
> 3) Launcher continues to launch workers into DB X every autovacuum naptime.
> 4) worker2 (or 3 or 4 etc...) checks whether it is alone in DB X; if so, it acts as worker1 did above. If worker1 is still working in DB X, then worker2 looks for hot tables that are being starved because worker1 got busy. If worker2 finds no hot tables that need work, then worker2 exits.

Rather than requiring people to manually identify hot tables, what if we just prioritize based on table size? So if a second autovac process hits a specific database, it would find the smallest table in need of vacuuming that it should be able to complete before the next naptime and vacuum that. It could even continue picking tables until it can't find one that it could finish within the naptime. Granted, it would have to make some assumptions about how many pages it would dirty.

ISTM that's a lot easier than forcing admins to mark specific tables.

-- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jim C. Nasby wrote: > On Wed, Feb 21, 2007 at 05:40:53PM -0500, Matthew T. O'Connor wrote: > >> My Proposal: If we require admins to identify hot tables tables, then: >> 1) Launcher fires-off a worker1 into database X. >> 2) worker1 deals with "hot" tables first, then regular tables. >> 3) Launcher continues to launch workers to DB X every autovac naptime. >> 4) worker2 (or 3 or 4 etc...) sees it is alone in DB X, if so it acts as >> worker1 did above. If worker1 is still working in DB X then worker2 >> looks for hot tables that are being starved because worker1 got busy. >> If worker2 finds no hot tables that need work, then worker2 exits. >> > > Rather than required people to manually identify hot tables, what if we > just prioritize based on table size? So if a second autovac process hits > a specific database, it would find the smallest table in need of > vacuuming that it should be able to complete before the next naptime and > vacuum that. It could even continue picking tables until it can't find > one that it could finish within the naptime. Granted, it would have to > make some assumptions about how many pages it would dirty. > > ISTM that's a lot easier than forcing admins to mark specific tables. So the heuristic would be: * Launcher fires off workers into a database at a given interval (perhaps configurable?) * Each worker works on tables in size order. * If a worker ever catches up to an older worker, then the younger worker exits. This sounds simple and workable to me, perhaps we can later modify this to include some max_workers variable so that a worker would only exit if it catches an older worker and there are max_workers currently active. Thoughts?
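A sketch of that heuristic -- work in size order, exit on catching an older worker (the table list and the shared "current table" bookkeeping are stand-ins invented for illustration):

/* Illustrative only; not autovacuum code. */
#include <stdio.h>
#include <string.h>

#define MAX_TABLES 4

typedef struct { const char *name; long relpages; } TableToDo;

/* Hypothetical shared state: the table the older worker is on right now. */
static const char *older_worker_current_table = "orders";

static void
vacuum_table(const TableToDo *t)
{
    printf("vacuuming %s (%ld pages)\n", t->name, t->relpages);
}

int
main(void)
{
    /* To-do list already sorted smallest to largest. */
    TableToDo todo[MAX_TABLES] = {
        {"hot_queue", 40}, {"sessions", 300}, {"orders", 90000}, {"audit", 400000},
    };

    for (int i = 0; i < MAX_TABLES; i++)
    {
        if (older_worker_current_table != NULL &&
            strcmp(todo[i].name, older_worker_current_table) == 0)
        {
            printf("caught up with older worker at %s; exiting\n", todo[i].name);
            break;
        }
        vacuum_table(&todo[i]);
    }
    return 0;
}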
On Thu, Feb 22, 2007 at 09:32:57AM -0500, Matthew T. O'Connor wrote: > Jim C. Nasby wrote: > >On Wed, Feb 21, 2007 at 05:40:53PM -0500, Matthew T. O'Connor wrote: > > > >>My Proposal: If we require admins to identify hot tables tables, then: > >>1) Launcher fires-off a worker1 into database X. > >>2) worker1 deals with "hot" tables first, then regular tables. > >>3) Launcher continues to launch workers to DB X every autovac naptime. > >>4) worker2 (or 3 or 4 etc...) sees it is alone in DB X, if so it acts as > >>worker1 did above. If worker1 is still working in DB X then worker2 > >>looks for hot tables that are being starved because worker1 got busy. > >>If worker2 finds no hot tables that need work, then worker2 exits. > >> > > > >Rather than required people to manually identify hot tables, what if we > >just prioritize based on table size? So if a second autovac process hits > >a specific database, it would find the smallest table in need of > >vacuuming that it should be able to complete before the next naptime and > >vacuum that. It could even continue picking tables until it can't find > >one that it could finish within the naptime. Granted, it would have to > >make some assumptions about how many pages it would dirty. > > > >ISTM that's a lot easier than forcing admins to mark specific tables. > > So the heuristic would be: > * Launcher fires off workers into a database at a given interval > (perhaps configurable?) > * Each worker works on tables in size order. > * If a worker ever catches up to an older worker, then the younger > worker exits. > > This sounds simple and workable to me, perhaps we can later modify this > to include some max_workers variable so that a worker would only exit if > it catches an older worker and there are max_workers currently active. That would likely result in a number of workers running in one database, unless you limited how many workers per database. And if you did that, you wouldn't be addressing the frequently update table problem. A second vacuum in a database *must* exit after a fairly short time so that we can go back in and vacuum the important tables again (well or the 2nd vacuum has to periodically re-evaluate what tables need to be vacuumed). -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Thu, Feb 22, 2007 at 09:35:45AM +0100, Zeugswetter Andreas ADI SD wrote:
> > vacuum should be a process with the least amount of voodoo. If we can just have vacuum_delay and vacuum_threshold, where threshold allows an arbitrary setting of how much bandwidth we will allot to the process, then that is a beyond wonderful thing.
> >
> > It is easy to determine how much IO you have, and what you can spare.
>
> The tricky part is what metric to use. Imho "IO per second" would be good. In a typical DB scenario that is the IO bottleneck, not the Mb/s.

Well, right now they're one and the same... but yeah, IO/sec probably does make more sense.

-- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
> > > vacuum should be a process with the least amount of voodoo. If we can just have vacuum_delay and vacuum_threshold, where threshold allows an arbitrary setting of how much bandwidth we will allot to the process, then that is a beyond wonderful thing.
> > >
> > > It is easy to determine how much IO you have, and what you can spare.
> >
> > The tricky part is what metric to use. Imho "IO per second" would be good. In a typical DB scenario that is the IO bottleneck, not the Mb/s.
>
> Well, right now they're one and the same... but yeah, IO/sec probably does make more sense.

Hopefully not :-) Else you have no readahead. And that is imho the problem. You need to anticipate how many physical IO's your logical IO's cause. And this is near impossible unless we group IO's in pg itself.

Andreas
Jim C. Nasby wrote:
> On Thu, Feb 22, 2007 at 09:32:57AM -0500, Matthew T. O'Connor wrote:
>> So the heuristic would be:
>> * Launcher fires off workers into a database at a given interval (perhaps configurable?)
>> * Each worker works on tables in size order.
>> * If a worker ever catches up to an older worker, then the younger worker exits.
>>
>> This sounds simple and workable to me, perhaps we can later modify this to include some max_workers variable so that a worker would only exit if it catches an older worker and there are max_workers currently active.
>
> That would likely result in a number of workers running in one database, unless you limited how many workers per database. And if you did that, you wouldn't be addressing the frequently update table problem.
>
> A second vacuum in a database *must* exit after a fairly short time so that we can go back in and vacuum the important tables again (well, or the 2nd vacuum has to periodically re-evaluate what tables need to be vacuumed).

I'm not sure this is a great idea, but I don't see how this would result in large numbers of workers working in one database. If workers work on tables in size order, and exit as soon as they catch up to an older worker, I don't see the problem. Newer workers are going to catch up to older workers pretty quickly since small tables will vacuum fairly quickly.
On Thu, Feb 22, 2007 at 10:32:44PM -0500, Matthew T. O'Connor wrote:
> Jim C. Nasby wrote:
> >On Thu, Feb 22, 2007 at 09:32:57AM -0500, Matthew T. O'Connor wrote:
> >>So the heuristic would be:
> >>* Launcher fires off workers into a database at a given interval (perhaps configurable?)
> >>* Each worker works on tables in size order.
> >>* If a worker ever catches up to an older worker, then the younger worker exits.
> >>
> >>This sounds simple and workable to me, perhaps we can later modify this to include some max_workers variable so that a worker would only exit if it catches an older worker and there are max_workers currently active.
> >
> >That would likely result in a number of workers running in one database, unless you limited how many workers per database. And if you did that, you wouldn't be addressing the frequently update table problem.
> >
> >A second vacuum in a database *must* exit after a fairly short time so that we can go back in and vacuum the important tables again (well, or the 2nd vacuum has to periodically re-evaluate what tables need to be vacuumed).
>
> I'm not sure this is a great idea, but I don't see how this would result in large numbers of workers working in one database. If workers work on tables in size order, and exit as soon as they catch up to an older worker, I don't see the problem. Newer workers are going to catch up to older workers pretty quickly since small tables will vacuum fairly quickly.

The reason that won't necessarily happen is because you can get large tables popping up as needing vacuuming at any time.

Round 1: Fire up autovac worker; starts working and soon hits 100G table
Round 2: Another worker starts. Since round 1, a 98G table now needs vacuuming, which this worker soon hits.
Round 3: 89G table now needs vacuuming. Worker 3 starts up and soon hits it.

So now we have 3 workers, all hammering away in the same database, and likely causing a lot of random IO.

-- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jim C. Nasby wrote: > On Thu, Feb 22, 2007 at 10:32:44PM -0500, Matthew T. O'Connor wrote: > > I'm not sure this is a great idea, but I don't see how this would result > > in large numbers of workers working in one database. If workers work > > on tables in size order, and exit as soon as they catch up to an older > > worker, I don't see the problem. Newer works are going to catch-up to > > older workers pretty quickly since small tables will vacuum fairly quickly. > > The reason that won't necessarily happen is because you can get large > tables popping up as needing vacuuming at any time. Right. We know that a table that needs frequent vacuum necessarily has to be small -- so maybe have the second worker exit when it catches up with the first, or when the next table is above 1 GB, whichever happens first. That way, only the first worker can be processing the huge tables. The problem with this is that if one of your hot tables grows a bit larger than 1 GB, you suddenly have a change in autovacuuming behavior, for no really good reason. And while your second worker is processing the tables in the hundreds-MB range, your high-update 2 MB tables are neglected :-( -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Fri, Feb 23, 2007 at 01:22:17PM -0300, Alvaro Herrera wrote: > Jim C. Nasby wrote: > > On Thu, Feb 22, 2007 at 10:32:44PM -0500, Matthew T. O'Connor wrote: > > > > I'm not sure this is a great idea, but I don't see how this would result > > > in large numbers of workers working in one database. If workers work > > > on tables in size order, and exit as soon as they catch up to an older > > > worker, I don't see the problem. Newer works are going to catch-up to > > > older workers pretty quickly since small tables will vacuum fairly quickly. > > > > The reason that won't necessarily happen is because you can get large > > tables popping up as needing vacuuming at any time. > > Right. > > We know that a table that needs frequent vacuum necessarily has to be > small -- so maybe have the second worker exit when it catches up with > the first, or when the next table is above 1 GB, whichever happens > first. That way, only the first worker can be processing the huge > tables. The problem with this is that if one of your hot tables grows > a bit larger than 1 GB, you suddenly have a change in autovacuuming > behavior, for no really good reason. > > And while your second worker is processing the tables in the hundreds-MB > range, your high-update 2 MB tables are neglected :-( That's why I'm thinking it would be best to keep the maximum size of stuff for the second worker small. It probably also makes sense to tie it to time and not size, since the key factor is that you want it to hit the high-update tables every X number of seconds. If we wanted to get fancy, we could factor in how far over the vacuum threshold a table is, so even if the table is on the larger size, if it's way over the threshold the second vacuum will hit it. You know, maybe the best way to handle this is to force both vacuums to exit after a certain amount of time, probably with a longer time limit for the first vacuum in a database. That would mean that after processing a large table for 10 minutes, the first vacuum would exit/re-evaluate what work needs to be done. That would mean medium-sized tables wouldn't get completely starved. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jim C. Nasby wrote:
> That's why I'm thinking it would be best to keep the maximum size of stuff for the second worker small. It probably also makes sense to tie it to time and not size, since the key factor is that you want it to hit the high-update tables every X number of seconds.
>
> If we wanted to get fancy, we could factor in how far over the vacuum threshold a table is, so even if the table is on the larger size, if it's way over the threshold the second vacuum will hit it.

Ok, I think we may be actually getting somewhere.

I propose to have two different algorithms for choosing the tables to work on. The worker would behave differently, depending on whether there is one or more workers on the database already or not.

The first algorithm is the plain threshold equation stuff we use today. If a worker connects and determines that no other worker is in the database, it uses the "plain worker" mode. A worker in this mode would examine pgstats, determine what tables to vacuum/analyze, sort them by size (smaller to larger), and go about its work. This kind of worker can take a long time to vacuum the whole database -- we don't impose any time limit or table size limit to what it can do.

The second mode is the "hot table worker" mode, enabled when the worker detects that there's already a worker in the database. In this mode, the worker is limited to those tables that can be vacuumed in less than autovacuum_naptime, so large tables are not considered. Because of this, it'll generally not compete with the first mode above -- the tables in plain worker were sorted by size, so the small tables were among the first vacuumed by the plain worker. The estimated time to vacuum may be calculated according to autovacuum_vacuum_delay settings, assuming that all pages constitute cache misses.

-- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
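A sketch of the two modes as described (names and the page cutoff are illustrative; how such a cutoff might be derived is discussed in the follow-ups):

/* Illustrative only. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { const char *name; long relpages; bool needs_vacuum; } Rel;

static bool
table_qualifies(const Rel *rel, bool another_worker_active, long hot_page_limit)
{
    if (!rel->needs_vacuum)
        return false;
    if (!another_worker_active)
        return true;                            /* "plain worker": no size limit */
    return rel->relpages <= hot_page_limit;     /* "hot table worker" */
}

int
main(void)
{
    Rel rels[] = {
        {"hot_queue", 40, true}, {"orders", 90000, true}, {"audit", 400000, false},
    };
    bool another_worker_active = true;  /* pretend a plain worker is already busy */
    long hot_page_limit = 600;          /* assumed cutoff; see the follow-up messages */

    for (int i = 0; i < 3; i++)
        if (table_qualifies(&rels[i], another_worker_active, hot_page_limit))
            printf("would vacuum %s\n", rels[i].name);
    return 0;
}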
Alvaro Herrera wrote:
> Jim C. Nasby wrote:
>> That's why I'm thinking it would be best to keep the maximum size of stuff for the second worker small. It probably also makes sense to tie it to time and not size, since the key factor is that you want it to hit the high-update tables every X number of seconds.
>>
>> If we wanted to get fancy, we could factor in how far over the vacuum threshold a table is, so even if the table is on the larger size, if it's way over the threshold the second vacuum will hit it.
>
> Ok, I think we may be actually getting somewhere.

Me too.

> I propose to have two different algorithms for choosing the tables to work on. The worker would behave differently, depending on whether there is one or more workers on the database already or not.
>
> The first algorithm is the plain threshold equation stuff we use today. If a worker connects and determines that no other worker is in the database, it uses the "plain worker" mode. A worker in this mode would examine pgstats, determine what tables to vacuum/analyze, sort them by size (smaller to larger), and go about its work. This kind of worker can take a long time to vacuum the whole database -- we don't impose any time limit or table size limit to what it can do.

Right, I like this.

> The second mode is the "hot table worker" mode, enabled when the worker detects that there's already a worker in the database. In this mode, the worker is limited to those tables that can be vacuumed in less than autovacuum_naptime, so large tables are not considered. Because of this, it'll generally not compete with the first mode above -- the tables in plain worker were sorted by size, so the small tables were among the first vacuumed by the plain worker. The estimated time to vacuum may be calculated according to autovacuum_vacuum_delay settings, assuming that all pages constitute cache misses.

How can you determine what tables can be vacuumed within autovacuum_naptime? I agree that large tables should be excluded, but I don't know how we can do that calculation based on autovacuum_naptime.

So at:
t=0*autovacuum_naptime: worker1 gets started on DB X
t=1*autovacuum_naptime: worker2 gets started on DB X
worker2 determines all tables that need to be vacuumed,
worker2 excludes tables that are too big from its to-do list,
worker2 gets started working,
worker2 exits when it either:
a) Finishes its entire to-do list.
b) Catches up to worker1

I think the questions are 1) What is the exact math you are planning on using to determine which tables are too big? 2) Do we want worker2 to exit when it catches worker1, or does the fact that we have excluded tables that are "too big" mean that we don't have to worry about this?
Matthew T. O'Connor wrote: > Alvaro Herrera wrote: > >The second mode is the "hot table worker" mode, enabled when the worker > >detects that there's already a worker in the database. In this mode, > >the worker is limited to those tables that can be vacuumed in less than > >autovacuum_naptime, so large tables are not considered. Because of > >this, it'll generally not compete with the first mode above -- the > >tables in plain worker were sorted by size, so the small tables were > >among the first vacuumed by the plain worker. The estimated time to > >vacuum may be calculated according to autovacuum_vacuum_delay settings, > >assuming that all pages constitute cache misses. > > How can you determine what tables can be vacuumed within > autovacuum_naptime? My assumption is that pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to vacuum This is of course not the reality, because the delay is not how long it takes to fetch the pages. But it lets us have a value with which we can do something. With the default values, vacuum_cost_delay=10, vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables of under 600 pages, 4800 kB (should we include indexes here in the relpages count? My guess is no). A table over 600 pages does not sound like a good candidate for hot, so this seems more or less reasonable to me. On the other hand, maybe we shouldn't tie this to the vacuum cost delay stuff. > So at: > t=0*autovacuume_naptime: worker1 gets started on DBX > t=1*autovacuume_naptime: worker2 gets started on DBX > worker2 determines all tables that need to be vacuumed, > worker2 excludes tables that are too big from it's to-do list, > worker2 gets started working, > worker2 exits when it either: > a) Finishes it's entire to-do-list. > b) Catches up to worker1 > > I think the questions are 1) What is the exact math you are planning on > using to determine which tables are too big? 2) Do we want worker2 to > exit when it catches worker1 or does the fact that we have excluded > tables that re "too big" mean that we don't have to worry about this? Right, I think the fact that we excluded big tables means that this won't be a problem most of the time, but we'll need some sort of protection anyway. I think this is easy to achieve -- store the table each worker is currently processing in shared memory, and have all workers check all other workers. If a plain worker finds that another worker is processing the table already, it skips that table and continues with the next one. A hot table worker instead exits right away (caught up). -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
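The arithmetic above, spelled out (this is only the crude model from the message itself: every page a cache miss, the delay charged per page):

/* Reproduces the 600-page / 4800 kB figure quoted above. */
#include <stdio.h>

int
main(void)
{
    double relpages = 600;
    double vacuum_cost_page_miss = 10;
    double vacuum_cost_delay_ms = 10;
    double autovacuum_naptime_s = 60;

    double est_ms = relpages * vacuum_cost_page_miss * vacuum_cost_delay_ms;
    printf("estimated time: %.0f s (naptime %.0f s)\n",
           est_ms / 1000.0, autovacuum_naptime_s);

    /* Inverting it gives the page cutoff for the "hot table" worker. */
    double max_pages = (autovacuum_naptime_s * 1000.0)
                     / (vacuum_cost_page_miss * vacuum_cost_delay_ms);
    printf("page cutoff: %.0f pages (~%.0f kB at 8 kB/page)\n",
           max_pages, max_pages * 8);
    return 0;
}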
Alvaro Herrera wrote:
> Matthew T. O'Connor wrote:
>> How can you determine what tables can be vacuumed within autovacuum_naptime?
>
> My assumption is that
> pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to vacuum
>
> This is of course not the reality, because the delay is not how long it takes to fetch the pages. But it lets us have a value with which we can do something. With the default values, vacuum_cost_delay=10, vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables of under 600 pages, 4800 kB (should we include indexes here in the relpages count? My guess is no).

I'm not sure how pg_class.relpages is maintained, but what happens to a bloated table? For example, a 100 row table that is constantly updated and hasn't been vacuumed in a while (say the admin disabled autovacuum for a while): now that small 100 row table has 1000 pages in it, most of which are just bloat. Will we miss this table? Perhaps basing this on reltuples would be better?

> A table over 600 pages does not sound like a good candidate for hot, so this seems more or less reasonable to me. On the other hand, maybe we shouldn't tie this to the vacuum cost delay stuff.

I'm not sure it's a good idea to tie this to the vacuum cost delay settings either, so let me ask you this: how is this better than just allowing the admin to set a new GUC variable like autovacuum_hot_table_size_threshold (or something shorter) which we can assign a decent default of say 8MB?

Thoughts?
Matthew T. O'Connor wrote:
> Alvaro Herrera wrote:
> >Matthew T. O'Connor wrote:
> >>How can you determine what tables can be vacuumed within autovacuum_naptime?
> >
> >My assumption is that
> >pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to vacuum
> >
> >This is of course not the reality, because the delay is not how long it takes to fetch the pages. But it lets us have a value with which we can do something. With the default values, vacuum_cost_delay=10, vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables of under 600 pages, 4800 kB (should we include indexes here in the relpages count? My guess is no).
>
> I'm not sure how pg_class.relpages is maintained, but what happens to a bloated table? For example, a 100 row table that is constantly updated and hasn't been vacuumed in a while (say the admin disabled autovacuum for a while): now that small 100 row table has 1000 pages in it, most of which are just bloat. Will we miss this table? Perhaps basing this on reltuples would be better?

Well, this would only happen the first time, until the plain worker processed the table; next time it would be picked up by the hot table worker. But yeah, we can build a better estimate using the same trick the planner uses: estimate tuple density as reltuples/relpages, and multiply by the actual number of blocks on disk.

> >A table over 600 pages does not sound like a good candidate for hot, so this seems more or less reasonable to me. On the other hand, maybe we shouldn't tie this to the vacuum cost delay stuff.
>
> I'm not sure it's a good idea to tie this to the vacuum cost delay settings either, so let me ask you this: how is this better than just allowing the admin to set a new GUC variable like autovacuum_hot_table_size_threshold (or something shorter) which we can assign a decent default of say 8MB?

Yeah, maybe that's better -- it's certainly simpler.

-- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
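The planner-style estimate, written out (the numbers are made up for illustration):

/* Density from the last VACUUM/ANALYZE, scaled by the current file size. */
#include <stdio.h>

int
main(void)
{
    double reltuples  = 100000;  /* tuples counted at the last VACUUM/ANALYZE */
    double relpages   = 500;     /* pages counted at the same time */
    double cur_blocks = 800;     /* what the relation actually occupies now */

    double est_tuples = (reltuples / relpages) * cur_blocks;
    printf("estimated tuples now: %.0f\n", est_tuples);
    return 0;
}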
Alvaro Herrera wrote:
> Matthew T. O'Connor wrote:
>> I'm not sure how pg_class.relpages is maintained, but what happens to a bloated table? For example, a 100 row table that is constantly updated and hasn't been vacuumed in a while (say the admin disabled autovacuum for a while): now that small 100 row table has 1000 pages in it, most of which are just bloat. Will we miss this table? Perhaps basing this on reltuples would be better?
>
> Well, this would only happen the first time, until the plain worker processed the table; next time it would be picked up by the hot table worker. But yeah, we can build a better estimate using the same trick the planner uses: estimate tuple density as reltuples/relpages, and multiply by the actual number of blocks on disk.

Well, even skipping it the first time isn't good; anything that further delays a hot table from getting vacuumed is bad. Also, I'm not sure it would just be the first time, since plain VACUUM isn't going to reclaim most of the space, just mark it as reusable. This is moot, however, if we use a good metric; I have no idea if what you suggest above would be good enough.

>>> A table over 600 pages does not sound like a good candidate for hot, so this seems more or less reasonable to me. On the other hand, maybe we shouldn't tie this to the vacuum cost delay stuff.
>> I'm not sure it's a good idea to tie this to the vacuum cost delay settings either, so let me ask you this: how is this better than just allowing the admin to set a new GUC variable like autovacuum_hot_table_size_threshold (or something shorter) which we can assign a decent default of say 8MB?
>
> Yeah, maybe that's better -- it's certainly simpler.

Simple is better, at least until proven otherwise.
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Matthew T. O'Connor wrote:
>> I'm not sure it's a good idea to tie this to the vacuum cost delay settings either, so let me ask you this: how is this better than just allowing the admin to set a new GUC variable like autovacuum_hot_table_size_threshold (or something shorter) which we can assign a decent default of say 8MB?

> Yeah, maybe that's better -- it's certainly simpler.

I'm not liking any of these very much, as they seem critically dependent on impossible-to-tune parameters. I think it'd be better to design this around having the first worker explicitly expose its state (list of tables to process, in order) and having subsequent workers key off that info. The shared memory state could include the OID of the table each worker is currently working on, and we could keep the to-do list in some simple flat file for instance (since we don't care about crash safety).

I'm not certain exactly what "key off" needs to mean; perhaps each worker should make its own to-do list and then discard items that are either in-progress or recently done by another worker when it gets to them.

I think an absolute minimum requirement for a sane design is that no two workers ever try to vacuum the same table concurrently, and I don't see where that behavior will emerge from your proposal; whereas it's fairly easy to make it happen if non-first workers pay attention to what other workers are doing.

BTW, it's probably necessary to treat shared catalogs specially ...

regards, tom lane
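A sketch of the shared-memory bookkeeping described here -- each worker advertises the OID of the table it is vacuuming so others can skip it or exit (the structures, the fixed-size array and the sample OIDs are illustrative only):

/* Illustrative only; real shared memory would need locking, etc. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_WORKERS 8

typedef struct WorkerSlot
{
    bool        in_use;
    uint32_t    db_oid;
    uint32_t    current_table_oid;  /* 0 = between tables */
} WorkerSlot;

/* In reality this array would live in shared memory. */
static WorkerSlot worker_slots[MAX_WORKERS];

static bool
table_claimed_by_other(int my_slot, uint32_t db_oid, uint32_t table_oid)
{
    for (int i = 0; i < MAX_WORKERS; i++)
    {
        if (i == my_slot || !worker_slots[i].in_use)
            continue;
        if (worker_slots[i].db_oid == db_oid &&
            worker_slots[i].current_table_oid == table_oid)
            return true;
    }
    return false;
}

int
main(void)
{
    worker_slots[0] = (WorkerSlot){true, 16384, 25001};    /* older worker */
    worker_slots[1] = (WorkerSlot){true, 16384, 0};        /* me, slot 1 */

    if (table_claimed_by_other(1, 16384, 25001))
        printf("skip it (plain worker) or exit (hot-table worker)\n");
    return 0;
}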
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Matthew T. O'Connor wrote:
>>> I'm not sure it's a good idea to tie this to the vacuum cost delay settings either, so let me ask you this: how is this better than just allowing the admin to set a new GUC variable like autovacuum_hot_table_size_threshold (or something shorter) which we can assign a decent default of say 8MB?
>
>> Yeah, maybe that's better -- it's certainly simpler.
>
> I'm not liking any of these very much, as they seem critically dependent on impossible-to-tune parameters. I think it'd be better to design this around having the first worker explicitly expose its state (list of tables to process, in order) and having subsequent workers key off that info. The shared memory state could include the OID of the table each worker is currently working on, and we could keep the to-do list in some simple flat file for instance (since we don't care about crash safety).

So far we are only talking about one parameter, the hot_table_size_threshold, which I agree would be a guess by an admin; but if we went in this direction, I would also advocate adding a column to the pg_autovacuum table that allows an admin to explicitly define a table as hot or not.

Also, I think each worker should be mostly independent, the only caveat being that (assuming each worker works in size order) if we catch up to an older worker (get to the table they are currently working on) we exit. Personally I think this is all we need, but others felt the additional threshold was needed. What do you think? Or what do you think might be better?

> I'm not certain exactly what "key off" needs to mean; perhaps each worker should make its own to-do list and then discard items that are either in-progress or recently done by another worker when it gets to them.

My initial design didn't have any threshold at all, but others felt this would/could result in too many workers working concurrently in the same DB.

> I think an absolute minimum requirement for a sane design is that no two workers ever try to vacuum the same table concurrently, and I don't see where that behavior will emerge from your proposal; whereas it's fairly easy to make it happen if non-first workers pay attention to what other workers are doing.

Maybe we never made that clear; I was always working on the assumption that two workers would never try to work on the same table at the same time.

> BTW, it's probably necessary to treat shared catalogs specially ...

Certainly.
On Mon, Feb 26, 2007 at 09:22:42PM -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Matthew T. O'Connor wrote:
> >> I'm not sure it's a good idea to tie this to the vacuum cost delay settings either, so let me ask you this: how is this better than just allowing the admin to set a new GUC variable like autovacuum_hot_table_size_threshold (or something shorter) which we can assign a decent default of say 8MB?
>
> > Yeah, maybe that's better -- it's certainly simpler.
>
> I'm not liking any of these very much, as they seem critically dependent on impossible-to-tune parameters. I think it'd be better to design this around having the first worker explicitly expose its state (list of tables to process, in order) and having subsequent workers key off that info. The shared memory state could include the OID of the table each worker is currently working on, and we could keep the to-do list in some simple flat file for instance (since we don't care about crash safety).
>
> I'm not certain exactly what "key off" needs to mean; perhaps each worker should make its own to-do list and then discard items that are either in-progress or recently done by another worker when it gets to them.

The real problem is trying to set that up in such a fashion that keeps hot tables frequently vacuumed; that means that the 2nd daemon in a database either needs to avoid large tables that will take 'a significant' length of time to vacuum, or you need to allow any number of daemons in each database (which sounds like a good way to thrash the machine).

> I think an absolute minimum requirement for a sane design is that no two workers ever try to vacuum the same table concurrently, and I don't see where that behavior will emerge from your proposal; whereas it's fairly easy to make it happen if non-first workers pay attention to what other workers are doing.

Isn't there a special lock acquired on a relation by vacuum? Can't we just check for that? Seems much simpler than building out the ability for daemons to see what each other is doing (and that still wouldn't take manual vacuums into account).

-- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Mon, Feb 26, 2007 at 08:11:44PM -0300, Alvaro Herrera wrote:

> Matthew T. O'Connor wrote: > > Alvaro Herrera wrote: > > > >The second mode is the "hot table worker" mode, enabled when the worker > > >detects that there's already a worker in the database. In this mode, > > >the worker is limited to those tables that can be vacuumed in less than > > >autovacuum_naptime, so large tables are not considered. Because of > > >this, it'll generally not compete with the first mode above -- the > > >tables in plain worker were sorted by size, so the small tables were > > >among the first vacuumed by the plain worker. The estimated time to > > >vacuum may be calculated according to autovacuum_vacuum_delay settings, > > >assuming that all pages constitute cache misses. > > > > How can you determine what tables can be vacuumed within > > autovacuum_naptime? > > My assumption is that > pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to vacuum

Need to take vacuum_cost_limit into account. The advantage to keying this to autovac_naptime is that it means we don't need another GUC, but after I suggested that before I realized that's probably not the best idea. For example, I've seen clusters that are running dozens-hundreds of databases; in that environment you really need to turn naptime way down (to like a second). In that case you wouldn't want to key to naptime.

-- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
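Folding vacuum_cost_limit into that estimate gives something like the sketch below. This is standalone C, not code from the thread; the fixed per-page cost in main() (miss plus dirty) and the GUC values are assumptions, and real timings also depend on actual I/O, which the model deliberately ignores.

/*
 * Rough "can this table be vacuumed within autovacuum_naptime?" estimate.
 * seconds ~= relpages * cost_per_page / cost_limit * (cost_delay_ms / 1000)
 */
#include <stdio.h>

static double
estimated_vacuum_seconds(double relpages, double cost_per_page,
                         double cost_limit, double cost_delay_ms)
{
    double sleep_cycles = relpages * cost_per_page / cost_limit;
    return sleep_cycles * cost_delay_ms / 1000.0;
}

int
main(void)
{
    double naptime = 60.0;       /* autovacuum_naptime, seconds */
    double cost_limit = 200.0;   /* vacuum_cost_limit */
    double cost_delay = 10.0;    /* vacuum_cost_delay in ms */
    double per_page = 30.0;      /* assume miss (10) + dirty (20) per page */

    /* largest table a "hot table" worker would consider */
    double max_pages = naptime / (cost_delay / 1000.0) * cost_limit / per_page;
    printf("page cutoff: %.0f pages (~%.0f MB)\n",
           max_pages, max_pages * 8.0 / 1024.0);

    printf("10000-page table: ~%.1f s of cost-delay sleeping\n",
           estimated_vacuum_seconds(10000, per_page, cost_limit, cost_delay));
    return 0;
}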
On Mon, Feb 26, 2007 at 06:23:22PM -0500, Matthew T. O'Connor wrote: > Alvaro Herrera wrote: > >Matthew T. O'Connor wrote: > >>How can you determine what tables can be vacuumed within > >>autovacuum_naptime? > > > >My assumption is that > >pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to > >vacuum > > > >This is of course not the reality, because the delay is not how long > >it takes to fetch the pages. But it lets us have a value with which we > >can do something. With the default values, vacuum_cost_delay=10, > >vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables > >of under 600 pages, 4800 kB (should we include indexes here in the > >relpages count? My guess is no). > > I'm not sure how pg_class.relpages is maintained but what happens to a > bloated table? For example, a 100 row table that is constantly updated > and hasn't been vacuumed in a while (say the admin disabled autovacuum > for a while), now that small 100 row table has 1000 pages in it most of > which are just bloat, will we miss this table? Perhaps basing this on > reltuples would be better? The entire point of this is to ensure that the second daemon will only vacuum tables that it can finish very quickly. If you let a table bloat so it's too big, then you just can't vacuum it very frequently without risking all your other hot tables bloating because they're no longer getting vacuumed. The reality is that you can actually vacuum a pretty good-sized table in 60 seconds with typical cost-delay settings (ie: defaults except cost_delay set to 10). That means you can do 9 pages ~100 times a second, or 54k pages a minute. Even with a vacuum_cost_delay of 20, that's still 27k pages per minute. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jim C. Nasby wrote: > The advantage to keying this to autovac_naptime is that it means we > don't need another GUC, but after I suggested that before I realized > that's probably not the best idea. For example, I've seen clusters that > are running dozens-hundreds of databases; in that environment you really > need to turn naptime way down (to like a second). In that case you > wouldn't want to key to naptime. Actually, I've been thinking that it would be a good idea to change the semantics of autovacuum_naptime so that it means the average time to start a worker in any given database. That way, the time between autovac runs is not dependent on the number of databases you have. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
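As a rough illustration of what that change in semantics would mean (purely a sketch; the numbers are made up), the launcher would space worker launches at naptime divided by the number of databases, so each database still gets visited about once per naptime no matter how many databases the cluster has:

#include <stdio.h>

int
main(void)
{
    double naptime = 60.0;   /* autovacuum_naptime, seconds */
    int    ndatabases = 120; /* databases in the cluster */

    /* time the launcher sleeps between starting successive workers */
    double launch_interval = naptime / ndatabases;
    printf("launch a worker every %.2f s; each database still gets one "
           "visit per %.0f s\n", launch_interval, naptime);
    return 0;
}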
"Jim C. Nasby" <jim@nasby.net> writes: > On Mon, Feb 26, 2007 at 09:22:42PM -0500, Tom Lane wrote: >> I'm not liking any of these very much, as they seem critically dependent >> on impossible-to-tune parameters. I think it'd be better to design this >> around having the first worker explicitly expose its state (list of >> tables to process, in order) and having subsequent workers key off that >> info. > The real problem is trying to set that up in such a fashion that keeps > hot tables frequently vacuumed; Certainly, but it's not clear where that behavior emerges from Alvaro's or Matthew's proposals, either. Are we assuming that no single worker instance will vacuum a given table more than once? (That's not a necessary assumption, certainly, but without it there are so many degrees of freedom that I'm not sure how it should act.) Given that assumption, the maximum vacuuming rate for any table is once per autovacuum_naptime, and most of the magic lies in the launcher's algorithm for deciding which databases to launch workers into. I'm inclined to propose an even simpler algorithm in which every worker acts alike; its behavior is 1. On startup, generate a to-do list of tables to process, sorted in priority order. 2. For each table in the list, if the table is still around and has not been vacuumed by someone else since you started (including the case of a vacuum-in-progress), then vacuum it. Detecting "already vacuumed since you started" is a bit tricky; you can't really rely on the stats collector since its info isn't very up-to-date. That's why I was thinking of exposing the to-do lists explicitly; comparing those with an advertised current-table would allow accurate determination of what had just gotten done. regards, tom lane
[ oh, I forgot to respond to this: ] "Jim C. Nasby" <jim@nasby.net> writes: > Isn't there a special lock acquired on a relation by vacuum? Can't we > just check for that? I think you're thinking that ConditionalLockRelation solves the problem, but it does not, because it will fail if someone has taken a (quasi) exclusive lock unrelated to vacuuming. You don't want an application that frequently takes short-term ExclusiveLocks on a table to thereby cause autovacuum to frequently skip that table. regards, tom lane
Tom Lane wrote:

> I think an absolute minimum requirement for a sane design is that no two > workers ever try to vacuum the same table concurrently, and I don't see > where that behavior will emerge from your proposal; whereas it's fairly > easy to make it happen if non-first workers pay attention to what other > workers are doing.

FWIW, I've always considered this to be a very important and obvious issue, and I think I've neglected mentioning it (or maybe I mentioned it too few times). But I think this is pretty easy to do: just have each worker advertise the current table it's working on in shared memory, and add a recheck loop to the table-pick algorithm (with appropriate grabs of the autovacuum lwlock), to make sure no one else starts to vacuum the table you're about to process at the same time.

-- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
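That handshake might look roughly like this (a standalone sketch in which a pthread mutex stands in for the autovacuum LWLock; none of the names come from the backend). The essential property is that the check for other workers' claims and the writing of your own claim happen under the same lock:

#include <pthread.h>
#include <stdio.h>

typedef unsigned int Oid;
#define MAX_WORKERS 8

static pthread_mutex_t autovac_lock = PTHREAD_MUTEX_INITIALIZER;
static Oid current_table[MAX_WORKERS];   /* 0 = slot idle; "shared memory" */

/* Returns 1 and records the claim if 'relid' was free, 0 otherwise. */
static int
try_claim_table(int my_slot, Oid relid)
{
    int ok = 1;

    pthread_mutex_lock(&autovac_lock);
    for (int i = 0; i < MAX_WORKERS; i++)
        if (i != my_slot && current_table[i] == relid)
        {
            ok = 0;                 /* someone else is on it; pick another */
            break;
        }
    if (ok)
        current_table[my_slot] = relid;   /* advertise before releasing */
    pthread_mutex_unlock(&autovac_lock);
    return ok;
}

int
main(void)
{
    current_table[0] = 5001;                 /* worker 0 is busy on 5001 */
    printf("claim 5001: %d\n", try_claim_table(1, 5001));   /* refused */
    printf("claim 5002: %d\n", try_claim_table(1, 5002));   /* granted */
    return 0;
}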
Jim C. Nasby wrote:

> On Mon, Feb 26, 2007 at 06:23:22PM -0500, Matthew T. O'Connor wrote: >> I'm not sure how pg_class.relpages is maintained but what happens to a >> bloated table? For example, a 100 row table that is constantly updated >> and hasn't been vacuumed in a while (say the admin disabled autovacuum >> for a while), now that small 100 row table has 1000 pages in it most of >> which are just bloat, will we miss this table? Perhaps basing this on >> reltuples would be better? > > The entire point of this is to ensure that the second daemon will only > vacuum tables that it can finish very quickly. If you let a table bloat > so it's too big, then you just can't vacuum it very frequently without > risking all your other hot tables bloating because they're no longer > getting vacuumed. > > The reality is that you can actually vacuum a pretty good-sized table in > 60 seconds with typical cost-delay settings (ie: defaults except > cost_delay set to 10). That means you can do 9 pages ~100 times a > second, or 54k pages a minute. Even with a vacuum_cost_delay of 20, > that's still 27k pages per minute.

At the risk of sounding like a broken record, I still think the size limit threshold is unnecessary. Since all workers will be working on tables in size order, younger workers will typically catch older workers fairly quickly, since the tables will be either small, or recently vacuumed and not in need of work. And since younger workers exit when they catch up to an older worker, there is some inherent stability in the number of workers.

Here is a worst case example: a DB with 6 tables, all of which are highly active and will need to be vacuumed constantly. While this is totally hypothetical, it is how I envision things working (without the threshold).

table1: 10 rows
table2: 100 rows
table3: 1,000 rows
table4: 10,000 rows
table5: 100,000 rows
table6: 1,000,000 rows

time=0*naptime: No workers in the DB.
time=1*naptime: worker1 starts on table1.
time=2*naptime: worker1 has finished table1, table2 and table3; it's now working on table4. worker2 starts on table1.
time=3*naptime: worker1 is on table5, worker2 is working on table4, worker3 starts on table1.
time=4*naptime: worker1 is still on table5. worker2 has caught up to worker1 and exits. worker3 also catches up to worker1, since tables 2-4 didn't require vacuum at this time, so it exits. worker4 starts on table1.
time=5*naptime: worker1 is working on table6, worker4 is up to table4, worker5 starts on table1.
time=6*naptime: worker1 is working on table6. worker4 catches up to worker1 and exits. worker5 finds no additional work to be done and exits. worker6 starts at table1.
time=7*naptime: worker1 still working on table6, worker6 is up to table4, worker7 starts at table1.
time=8*naptime: worker1 still working on table6, worker6 still working on table4, worker7 working on table3, worker8 starting on table1.
time=9*naptime: worker1 still working on table6, worker6 working on table5. worker7 catches worker6 and exits, worker8 finds nothing more to do and exits, worker9 starts on table1.
time=10*naptime: worker1 still working on table6, worker9 working on table4, worker10 starts on table1.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> I think an absolute minimum requirement for a sane design is that no two >> workers ever try to vacuum the same table concurrently, > FWIW, I've always considered this to be a very important and obvious > issue, and I think I've neglected mentioning it (maybe I did too few > times). But I think this is pretty easy to do, just have each worker > advertise the current table it's working on in shared memory, and add a > recheck loop on the table-pick algorithm (with appropriate grabs of the > autovacuum lwlock), to make sure no one starts to vacuum the same table > you're going to process, at the same time. Well, any of these proposals need that at the bottom level, to prevent race conditions. But I'd prefer a design that wasn't positively encouraging multiple workers to try to pick the same table concurrently. Not only is that wasteful, but it makes it harder to predict what is the real behavior that emerges after race conditions are backed off from. BTW, to what extent might this whole problem be simplified if we adopt chunk-at-a-time vacuuming (compare current discussion with Galy Lee)? If the unit of work has a reasonable upper bound regardless of table size, maybe the problem of big tables starving small ones goes away. regards, tom lane
Tom Lane wrote: > "Jim C. Nasby" <jim@nasby.net> writes: >> The real problem is trying to set that up in such a fashion that keeps >> hot tables frequently vacuumed; > > Are we assuming that no single worker instance will vacuum a given table > more than once? (That's not a necessary assumption, certainly, but > without it there are so many degrees of freedom that I'm not sure how > it should act.) Given that assumption, the maximum vacuuming rate for > any table is once per autovacuum_naptime, and most of the magic lies in > the launcher's algorithm for deciding which databases to launch workers > into. Yes, I have been working under the assumption that a worker goes through the list of tables once and exits, and yes the maximum vacuuming rate for any table would be once per autovacuum_naptime. We can lower the default if necessary, as far as I'm concerned it's (or should be) fairly cheap to fire off a worker and have it find that there isn't anything todo and exit. > I'm inclined to propose an even simpler algorithm in which every worker > acts alike; its behavior is > 1. On startup, generate a to-do list of tables to process, sorted in > priority order. > 2. For each table in the list, if the table is still around and has not > been vacuumed by someone else since you started (including the case of > a vacuum-in-progress), then vacuum it. That is what I'm proposing except for one difference, when you catch up to an older worker, exit. This has the benefit reducing the number of workers concurrently working on big tables, which I think is a good thing. > Detecting "already vacuumed since you started" is a bit tricky; you > can't really rely on the stats collector since its info isn't very > up-to-date. That's why I was thinking of exposing the to-do lists > explicitly; comparing those with an advertised current-table would > allow accurate determination of what had just gotten done. Sounds good, but I have very little insight into how we would implement "already vacuumed since you started" or "have I caught up to another worker".
Tom Lane wrote:

> BTW, to what extent might this whole problem be simplified if we adopt > chunk-at-a-time vacuuming (compare current discussion with Galy Lee)? > If the unit of work has a reasonable upper bound regardless of table > size, maybe the problem of big tables starving small ones goes away.

So if we adopted chunk-at-a-time then perhaps each worker processes the list of tables in OID order (or some unique and stable order) and does one chunk per table that needs vacuuming. This way an equal amount of bandwidth is given to all tables. That does sound simpler. Is chunk-at-a-time a realistic option for 8.3?

Matt
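The round-robin idea could be as simple as the loop below (a toy, not a sketch of the real vacuum code; the chunk size and page counts are made up): each pass over the table list does at most one chunk of work per table, so no single large table monopolizes the worker.

#include <stdio.h>

typedef struct Rel
{
    unsigned int oid;
    int pages_left;      /* pages still needing a vacuum pass */
} Rel;

int
main(void)
{
    Rel  rels[] = { {1001, 5}, {1002, 22}, {1003, 3} };  /* OID order */
    int  nrels = sizeof(rels) / sizeof(rels[0]);
    int  chunk = 10;     /* pages per invocation, bounded by memory */
    int  work_left = 1;

    while (work_left)
    {
        work_left = 0;
        for (int i = 0; i < nrels; i++)
        {
            if (rels[i].pages_left <= 0)
                continue;
            int done = rels[i].pages_left < chunk ? rels[i].pages_left : chunk;
            rels[i].pages_left -= done;
            printf("table %u: vacuumed %d pages, %d left\n",
                   rels[i].oid, done, rels[i].pages_left);
            if (rels[i].pages_left > 0)
                work_left = 1;
        }
    }
    return 0;
}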
"Matthew T. O'Connor" <matthew@zeut.net> writes: > Tom Lane wrote: >> I'm inclined to propose an even simpler algorithm in which every worker >> acts alike; > That is what I'm proposing except for one difference, when you catch up > to an older worker, exit. No, that's a bad idea, because it means that any large table starves even-larger tables. (Note: in all this I assume we're all using "size" as a shorthand for some sort of priority metric that considers number of dirty tuples not only size. We don't want every worker insisting on passing over every small read-only table every time, for instance.) regards, tom lane
"Matthew T. O'Connor" <matthew@zeut.net> writes: > That does sounds simpler. Is chunk-at-a-time a realistic option for 8.3? It seems fairly trivial to me to have a scheme where you do one fill-workmem-and-scan-indexes cycle per invocation, and store the next-heap-page-to-scan in some handy place (new pg_class column updated along with relpages/reltuples, likely). Galy is off in left field with some far more complex ideas :-( but I don't see that there's all that much needed to support this behavior ... especially if we don't expose it to the SQL level but only support it for autovac's use. Then we're not making any big commitment to support the behavior forever. regards, tom lane
Tom Lane wrote: > "Matthew T. O'Connor" <matthew@zeut.net> writes: >> Tom Lane wrote: >>> I'm inclined to propose an even simpler algorithm in which every worker >>> acts alike; > >> That is what I'm proposing except for one difference, when you catch up >> to an older worker, exit. > > No, that's a bad idea, because it means that any large table starves > even-larger tables. True, but the assumption I'm making is that there is a finite amount of bandwidth available and more concurrent activity will have a net negative effect the time it takes to vacuum all tables. I'm willing to pay that price to prevent small hot tables from getting starved, but less willing to pay the same price for large tables where the percentage of bloat will be much smaller. > (Note: in all this I assume we're all using "size" as a shorthand for > some sort of priority metric that considers number of dirty tuples not > only size. We don't want every worker insisting on passing over every > small read-only table every time, for instance.) I was using size to mean reltuples. The whole concept of sorting by size was designed to ensure that smaller (more susceptible to bloat) tables got priority. It might be useful for workers to sort their to-do lists by some other metric, but I don't have a clear vision of what that might be.
Tom Lane wrote: > "Matthew T. O'Connor" <matthew@zeut.net> writes: >> That does sounds simpler. Is chunk-at-a-time a realistic option for 8.3? > > It seems fairly trivial to me to have a scheme where you do one > fill-workmem-and-scan-indexes cycle per invocation, and store the > next-heap-page-to-scan in some handy place (new pg_class column updated > along with relpages/reltuples, likely). Galy is off in left field with > some far more complex ideas :-( but I don't see that there's all that > much needed to support this behavior ... especially if we don't expose > it to the SQL level but only support it for autovac's use. Then we're > not making any big commitment to support the behavior forever. Well, if we can make it happen soon, it might be the best thing for autovacuum.
On Mon, Feb 26, 2007 at 10:48:49PM -0500, Tom Lane wrote:

> "Matthew T. O'Connor" <matthew@zeut.net> writes: > > That does sound simpler. Is chunk-at-a-time a realistic option for 8.3? > > It seems fairly trivial to me to have a scheme where you do one > fill-workmem-and-scan-indexes cycle per invocation, and store the > next-heap-page-to-scan in some handy place (new pg_class column updated > along with relpages/reltuples, likely). Galy is off in left field with > some far more complex ideas :-( but I don't see that there's all that > much needed to support this behavior ... especially if we don't expose > it to the SQL level but only support it for autovac's use. Then we're > not making any big commitment to support the behavior forever.

The problem I see there is that the case we're trying to fix is tables that need to be vacuumed every few minutes. As I posted elsewhere, it's reasonable to assume a vacuum rate of ~1000 pages/second for a small table that's going to be in memory (assuming that vacuum dirties every page). That means that you can only dirty about 60k pages per cycle, which seems way too small to be practical unless we come up with a way to avoid scanning the indexes on every cycle.

The proposal to save enough state to be able to resume a vacuum at pretty much any point in its cycle might work; we'd have to benchmark it. With the default maintenance_work_mem of 128M it would mean writing out 64M of state every minute on average, which is likely to take several seconds to fsync (though, maybe we wouldn't need to fsync it...)

-- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
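Jim's figures can be reconstructed roughly as follows. This is a back-of-the-envelope sketch, not from the thread: it assumes the settings Jim mentioned earlier (defaults except vacuum_cost_delay set to 10 ms) together with the stock defaults vacuum_cost_limit = 200, vacuum_cost_page_hit = 1, and vacuum_cost_page_dirty = 20; the exact per-page cost mix is an assumption.

#include <stdio.h>

int
main(void)
{
    double cost_limit = 200, delay_ms = 10;
    double per_page = 1 + 20;                 /* cached page hit + dirty */
    double pages_per_sec = (1000.0 / delay_ms) * cost_limit / per_page;

    printf("~%.0f pages/s, ~%.0f pages per 60 s naptime cycle\n",
           pages_per_sec, pages_per_sec * 60);

    /* the "64M of state every minute" figure is just half of a 128 MB
       maintenance_work_mem, i.e. the average fill of the dead-TID array */
    printf("half of maintenance_work_mem = %d MB\n", 128 / 2);
    return 0;
}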
On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote: > Jim C. Nasby wrote: > > > The advantage to keying this to autovac_naptime is that it means we > > don't need another GUC, but after I suggested that before I realized > > that's probably not the best idea. For example, I've seen clusters that > > are running dozens-hundreds of databases; in that environment you really > > need to turn naptime way down (to like a second). In that case you > > wouldn't want to key to naptime. > > Actually, I've been thinking that it would be a good idea to change the > semantics of autovacuum_naptime so that it means the average time to > start a worker in any given database. That way, the time between > autovac runs is not dependent on the number of databases you have. Hrm... how would that work? BTW, another thought is to only sleep if you've scanned through every database and found nothing to do. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
"Jim C. Nasby" <jim@nasby.net> writes: > The proposal to save enough state to be able to resume a vacuum at > pretty much any point in it's cycle might work; we'd have to benchmark > it. With the default maintenance_work_mem of 128M it would mean writing > out 64M of state every minute on average, which is likely to take > several seconds to fsync (though, maybe we wouldn't need to fsync it...) Which is exactly why we needn't bother benchmarking it. Even if it weren't complex and unsafe, it will be a net loss when you consider the fact that it adds I/O instead of removing it. regards, tom lane
On Mon, Feb 26, 2007 at 10:18:36PM -0500, Matthew T. O'Connor wrote: > Jim C. Nasby wrote: > >On Mon, Feb 26, 2007 at 06:23:22PM -0500, Matthew T. O'Connor wrote: > >>I'm not sure how pg_class.relpages is maintained but what happens to a > >>bloated table? For example, a 100 row table that is constantly updated > >>and hasn't been vacuumed in a while (say the admin disabled autovacuum > >>for a while), now that small 100 row table has 1000 pages in it most of > >>which are just bloat, will we miss this table? Perhaps basing this on > >>reltuples would be better? > > > >The entire point of this is to ensure that the second daemon will only > >vacuum tables that it can finish very quickly. If you let a table bloat > >so it's too big, then you just can't vacuum it very frequently without > >risking all your other hot tables bloating because they're no longer > >getting vacuumed. > > > >The reality is that you can actually vacuum a pretty good-sized table in > >60 seconds with typical cost-delay settings (ie: defaults except > >cost_delay set to 10). That means you can do 9 pages ~100 times a > >second, or 54k pages a minute. Even with a vacuum_cost_delay of 20, > >that's still 27k pages per minute. > > At the risk of sounding like a broken record, I still think the size > limit threshold is unnecessary. Since all workers will be working in on > tables in size order, younger workers will typically catch older workers > fairly quickly since the tables will be either small, or recently > vacuumed and not need work. And since younger workers exit when they > catch-up to an older worker, there is some inherent stability in the > number of workers. > > Here is a worst case example: A DB with 6 tables all of which are highly > active and will need to be vacuumed constantly. While this is totally > hypothetical, it is how I envision things working (without the threshold). I fail to see how a simple 6 table case is 'worst case'. It's common to see hundreds of tables, and I've run across more than one database with thousands of tables (think partitioning). In cases like those it's certainly possible, perhaps even likely that you would get many daemons running in the database at one time just from different tables suddenly needing vacuuming and appearing at a higher point in the list than other tables. With 100 ~1G tables getting updates it certainly wouldn't be hard to end up with 10 of those being vacuumed all at the same time. I do like the idea since it should be easier to tune, but I think we still need some limit on it. Perhaps as a first-pass we could just have a hard limit and log a message and/or set a flag any time we hit it. That would hopefully allow us to get information about how big a problem it really is. We could go one step further and say that the last daemon that can start in a database will only vacuum tables that can be done quickly; that's essentially what we've been talking about, except the limit we've been discussing would be hard-coded at 2. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Tue, Feb 27, 2007 at 12:37:42AM -0500, Tom Lane wrote: > "Jim C. Nasby" <jim@nasby.net> writes: > > The proposal to save enough state to be able to resume a vacuum at > > pretty much any point in it's cycle might work; we'd have to benchmark > > it. With the default maintenance_work_mem of 128M it would mean writing > > out 64M of state every minute on average, which is likely to take > > several seconds to fsync (though, maybe we wouldn't need to fsync it...) > > Which is exactly why we needn't bother benchmarking it. Even if it > weren't complex and unsafe, it will be a net loss when you consider the > fact that it adds I/O instead of removing it. Well, it depends on how often you're doing that. Adding extra IO at the end of 4 hours of vacuuming isn't going to make any real difference, but once a minute... Looks like partial vacuum won't help this problem. :( -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jim C. Nasby wrote:

> On Mon, Feb 26, 2007 at 10:18:36PM -0500, Matthew T. O'Connor wrote: >> Jim C. Nasby wrote: >> Here is a worst case example: A DB with 6 tables all of which are highly >> active and will need to be vacuumed constantly. While this is totally >> hypothetical, it is how I envision things working (without the threshold). > > I fail to see how a simple 6 table case is 'worst case'. It's common to > see hundreds of tables, and I've run across more than one database with > thousands of tables (think partitioning). In cases like those it's > certainly possible, perhaps even likely that you would get many daemons > running in the database at one time just from different tables suddenly > needing vacuuming and appearing at a higher point in the list than other > tables. With 100 ~1G tables getting updates it certainly wouldn't be > hard to end up with 10 of those being vacuumed all at the same time.

Yes, 6 tables is small; the worst-case part of the example was that all the tables would need to be vacuumed constantly. Most databases only have a few hot tables. Most tables only need to be vacuumed every once in a while.

> I do like the idea since it should be easier to tune, but I think we > still need some limit on it. Perhaps as a first-pass we could just have > a hard limit and log a message and/or set a flag any time we hit it. > That would hopefully allow us to get information about how big a problem > it really is. We could go one step further and say that the last daemon > that can start in a database will only vacuum tables that can be done > quickly; that's essentially what we've been talking about, except the > limit we've been discussing would be hard-coded at 2.

I'm confused, what limit would be set at 2? The number of concurrent workers? I've never said that.
Tom Lane wrote: > "Jim C. Nasby" <jim@nasby.net> writes: >> The proposal to save enough state to be able to resume a vacuum at >> pretty much any point in it's cycle might work; we'd have to benchmark >> it. With the default maintenance_work_mem of 128M it would mean writing >> out 64M of state every minute on average, which is likely to take >> several seconds to fsync (though, maybe we wouldn't need to fsync it...) > > Which is exactly why we needn't bother benchmarking it. Even if it > weren't complex and unsafe, it will be a net loss when you consider the > fact that it adds I/O instead of removing it. I'm not sure what you are saying here, are you now saying that partial vacuum won't work for autovac? Or are you saying that saving state as Jim is describing above won't work?
"Matthew T. O'Connor" <matthew@zeut.net> writes: > I'm not sure what you are saying here, are you now saying that partial > vacuum won't work for autovac? Or are you saying that saving state as > Jim is describing above won't work? I'm saying that I don't like the idea of trying to "stop on a dime" by saving the current contents of vacuum's dead-TID array to disk with the idea that we can trust those values 100% later. Saving the array is expensive both in runtime and code complexity, and I don't believe we can trust it later --- at least not without even more expensive-and- complex measures, such as WAL-logging every such save :-( I'm for stopping only after completing an index-cleaning pass, at the point where we empty the dead-TID array anyway. If you really have to have "stop on a dime", just kill -INT the process, accepting that you will have to redo your heap scan since the last restart point. regards, tom lane
Tom Lane wrote: > "Matthew T. O'Connor" <matthew@zeut.net> writes: >> I'm not sure what you are saying here, are you now saying that partial >> vacuum won't work for autovac? Or are you saying that saving state as >> Jim is describing above won't work? > > I'm saying that I don't like the idea of trying to "stop on a dime" by > saving the current contents of vacuum's dead-TID array to disk with the > idea that we can trust those values 100% later. Saving the array is > expensive both in runtime and code complexity, and I don't believe we > can trust it later --- at least not without even more expensive-and- > complex measures, such as WAL-logging every such save :-( > > I'm for stopping only after completing an index-cleaning pass, at the > point where we empty the dead-TID array anyway. If you really have to > have "stop on a dime", just kill -INT the process, accepting that you > will have to redo your heap scan since the last restart point. OK, so if I understand correct, a vacuum of a table with 10 indexes on it can be interrupted 10 times, once after each index-cleaning pass? That might have some value, especially breaking up the work required to vacuum a large table. Or am I still not getting it?
On Tue, Feb 27, 2007 at 01:26:00AM -0500, Matthew T. O'Connor wrote: > Tom Lane wrote: > >"Matthew T. O'Connor" <matthew@zeut.net> writes: > >>I'm not sure what you are saying here, are you now saying that partial > >>vacuum won't work for autovac? Or are you saying that saving state as > >>Jim is describing above won't work? > > > >I'm saying that I don't like the idea of trying to "stop on a dime" by > >saving the current contents of vacuum's dead-TID array to disk with the > >idea that we can trust those values 100% later. Saving the array is > >expensive both in runtime and code complexity, and I don't believe we > >can trust it later --- at least not without even more expensive-and- > >complex measures, such as WAL-logging every such save :-( > > > >I'm for stopping only after completing an index-cleaning pass, at the > >point where we empty the dead-TID array anyway. If you really have to > >have "stop on a dime", just kill -INT the process, accepting that you > >will have to redo your heap scan since the last restart point. > > OK, so if I understand correct, a vacuum of a table with 10 indexes on > it can be interrupted 10 times, once after each index-cleaning pass? > That might have some value, especially breaking up the work required to > vacuum a large table. Or am I still not getting it? It'd stop after scanning the heap, scanning all the indexes, and then cleaning the heap. After that's done it no longer needs any of the dead-TID info; anytime before that it does need that info, and Tom's objection is that trying to store that info is a bad idea. The problem with this is that typically it takes a long time to go through a complete vacuum cycle; minutes at least, and preferably longer. Decreasing that cycle time will greatly increase the amount of IO required for vacuuming a table with any indexes, because every time you cycle through you have to read the entire index. That's why I don't see it as being useful at all for getting autovac to work on hot tables - if you actually got that cycle time low enough you'd kill the system with all the extra index scanning. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
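The cycle Jim describes, and the only stopping point Tom is comfortable with, can be outlined as below (a standalone toy, not the real lazy-vacuum code; the buffer size and page counts are made up). Note how every cycle pays one full pass over each index, which is why driving the cycle time down to a minute or less multiplies the index I/O.

#include <stdio.h>

#define DEAD_TID_BUFFER 1000    /* stand-in for maintenance_work_mem */

typedef struct RelState
{
    int total_pages;
    int next_page_to_scan;      /* would be persisted, e.g. in pg_class */
} RelState;

/* Returns 1 if the table is fully vacuumed, 0 if another cycle is needed. */
static int
vacuum_one_cycle(RelState *rel, int dead_tids_per_page, int nindexes)
{
    int dead = 0;

    /* phase 1: scan heap pages until the dead-TID buffer fills up */
    while (rel->next_page_to_scan < rel->total_pages &&
           dead + dead_tids_per_page <= DEAD_TID_BUFFER)
    {
        dead += dead_tids_per_page;
        rel->next_page_to_scan++;
    }

    /* phase 2: every index gets scanned once per cycle -- the cost that
     * makes very short cycles unattractive */
    for (int i = 0; i < nindexes; i++)
        printf("  scanning index %d to remove %d TIDs\n", i + 1, dead);

    /* phase 3: second heap pass to reclaim the dead line pointers */
    printf("  heap cleanup pass; stopped at page %d of %d\n",
           rel->next_page_to_scan, rel->total_pages);

    return rel->next_page_to_scan >= rel->total_pages;
}

int
main(void)
{
    RelState rel = { .total_pages = 60, .next_page_to_scan = 0 };
    int cycle = 0;

    while (!vacuum_one_cycle(&rel, 50, 3))
        printf("cycle %d done; a safe point to stop and resume later\n", ++cycle);
    return 0;
}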
On Tue, Feb 27, 2007 at 12:54:28AM -0500, Matthew T. O'Connor wrote:

> Jim C. Nasby wrote: > >On Mon, Feb 26, 2007 at 10:18:36PM -0500, Matthew T. O'Connor wrote: > >>Jim C. Nasby wrote: > >>Here is a worst case example: A DB with 6 tables all of which are highly > >>active and will need to be vacuumed constantly. While this is totally > >>hypothetical, it is how I envision things working (without the threshold). > > > >I fail to see how a simple 6 table case is 'worst case'. It's common to > >see hundreds of tables, and I've run across more than one database with > >thousands of tables (think partitioning). In cases like those it's > >certainly possible, perhaps even likely that you would get many daemons > >running in the database at one time just from different tables suddenly > >needing vacuuming and appearing at a higher point in the list than other > >tables. With 100 ~1G tables getting updates it certainly wouldn't be > >hard to end up with 10 of those being vacuumed all at the same time. > > Yes 6 tables is small, the worst-case part of the example was that all > the tables would need to be vacuumed constantly. Most databases only > have a few hot tables. Most tables only need to be vacuumed every once in > a while.

It's not the hot tables that are the issue; it's how many large tables (hot or not) can come up for vacuuming in order. For example, if A-Z are all large tables (ie: a few GB), with A being the largest and Z the smallest, think about what happens here:

Round 1: A needs vacuuming. Daemon gets to it and starts working.
Round 2: B now needs vacuuming. It's slightly smaller than A, so daemon 2 gets to it.
Round 3: C now needs vacuuming. Daemon 3.
...
Round 26: Z now needs vacuuming. Daemon 26 picks it up.

You now have 26 daemons running in the database. Now, we can argue about how likely that scenario is, but I don't think it's relevant. What matters is that it *is* possible, and as long as that's the case you'd have to have some kind of limit. (While this simple 26 table example is definitely worst-case, if you've got hundreds of tables that are all multiple GB in size I think it wouldn't be hard at all for you to end up with a dozen or more daemons all hammering away).

> >I do like the idea since it should be easier to tune, but I think we > >still need some limit on it. Perhaps as a first-pass we could just have > >a hard limit and log a message and/or set a flag any time we hit it. > >That would hopefully allow us to get information about how big a problem > >it really is. We could go one step further and say that the last daemon > >that can start in a database will only vacuum tables that can be done > >quickly; that's essentially what we've been talking about, except the > >limit we've been discussing would be hard-coded at 2. > > I'm confused, what limit would be set at 2? The number of concurrent > workers? I've never said that.

The point I was making is that the proposal about limiting the 2nd daemon to only processing tables it can do in a short period of time is akin to setting a limit of only 2 daemons in a database at a time.

-- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote: > Jim C. Nasby wrote: > > > The advantage to keying this to autovac_naptime is that it means we > > don't need another GUC, but after I suggested that before I realized > > that's probably not the best idea. For example, I've seen clusters that > > are running dozens-hundreds of databases; in that environment you really > > need to turn naptime way down (to like a second). In that case you > > wouldn't want to key to naptime. > > Actually, I've been thinking that it would be a good idea to change the > semantics of autovacuum_naptime so that it means the average time to > start a worker in any given database. That way, the time between > autovac runs is not dependent on the number of databases you have. BTW, another issue that I don't think we can ignore: we actually need to do this on a per-tablespace level, or at least have the ability to disable or somehow limit it. While it's not common, there are users that run a hundred or more databases in a single cluster; it would be ugly if we suddenly had 100 vacuums trying to run on the same set of drives concurrently. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jim C. Nasby wrote: > On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote: >> Jim C. Nasby wrote: >> >>> The advantage to keying this to autovac_naptime is that it means we >>> don't need another GUC, but after I suggested that before I realized >>> that's probably not the best idea. For example, I've seen clusters that >>> are running dozens-hundreds of databases; in that environment you really >>> need to turn naptime way down (to like a second). In that case you >>> wouldn't want to key to naptime. >> Actually, I've been thinking that it would be a good idea to change the >> semantics of autovacuum_naptime so that it means the average time to >> start a worker in any given database. That way, the time between >> autovac runs is not dependent on the number of databases you have. > > BTW, another issue that I don't think we can ignore: we actually need to > do this on a per-tablespace level, or at least have the ability to > disable or somehow limit it. While it's not common, there are users that > run a hundred or more databases in a single cluster; it would be ugly if > we suddenly had 100 vacuums trying to run on the same set of drives > concurrently. I think we all agree that autovacuum needs to become tablespace aware at some point, but I think that is further down the line, we're having enough trouble figuring things out without that additional complication.
On Tue, Feb 27, 2007 at 12:12:22PM -0500, Matthew T. O'Connor wrote: > Jim C. Nasby wrote: > >On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote: > >>Jim C. Nasby wrote: > >> > >>>The advantage to keying this to autovac_naptime is that it means we > >>>don't need another GUC, but after I suggested that before I realized > >>>that's probably not the best idea. For example, I've seen clusters that > >>>are running dozens-hundreds of databases; in that environment you really > >>>need to turn naptime way down (to like a second). In that case you > >>>wouldn't want to key to naptime. > >>Actually, I've been thinking that it would be a good idea to change the > >>semantics of autovacuum_naptime so that it means the average time to > >>start a worker in any given database. That way, the time between > >>autovac runs is not dependent on the number of databases you have. > > > >BTW, another issue that I don't think we can ignore: we actually need to > >do this on a per-tablespace level, or at least have the ability to > >disable or somehow limit it. While it's not common, there are users that > >run a hundred or more databases in a single cluster; it would be ugly if > >we suddenly had 100 vacuums trying to run on the same set of drives > >concurrently. > > I think we all agree that autovacuum needs to become tablespace aware at > some point, but I think that is further down the line, we're having > enough trouble figuring things out without that additional complication. Sure, we just need a way to disable the multiple autovac daemon stuff then. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Feb 26, 2007, at 12:49 PM, Alvaro Herrera wrote:

> Jim C. Nasby wrote: > >> That's why I'm thinking it would be best to keep the maximum size of >> stuff for the second worker small. It probably also makes sense to >> tie >> it to time and not size, since the key factor is that you want it >> to hit >> the high-update tables every X number of seconds. >> >> If we wanted to get fancy, we could factor in how far over the vacuum >> threshold a table is, so even if the table is on the larger size, if >> it's way over the threshold the second vacuum will hit it. > > Ok, I think we may be actually getting somewhere. > > I propose to have two different algorithms for choosing the tables to > work on. The worker would behave differently, depending on whether > there is one or more workers on the database already or not. > > The first algorithm is the plain threshold equation stuff we use > today. > If a worker connects and determines that no other worker is in the > database, it uses the "plain worker" mode. A worker in this mode > would > examine pgstats, determine what tables to vacuum/analyze, sort them by > size (smaller to larger), and goes about its work. This kind of > worker > can take a long time to vacuum the whole database -- we don't > impose any > time limit or table size limit to what it can do. > > The second mode is the "hot table worker" mode, enabled when the > worker > detects that there's already a worker in the database. In this mode, > the worker is limited to those tables that can be vacuumed in less > than > autovacuum_naptime, so large tables are not considered. Because of > this, it'll generally not compete with the first mode above -- the > tables in plain worker were sorted by size, so the small tables were > among the first vacuumed by the plain worker. The estimated time to > vacuum may be calculated according to autovacuum_vacuum_delay > settings, > assuming that all pages constitute cache misses.

Perhaps this has already been proposed, but maybe some combination of the following inputs could be used to determine which table most needs vacuuming:

- The proportion of tuples in a table that are dead (updated rows since last vacuum/estimated row count). This would favor "hot" tables naturally regardless of size.

- The time since the last vacuum, so that larger tables are eventually vacuumed even if hot tables totally dominate.

Of course tables that did not pass the minimum parameters specified in postgresql.conf would not even get considered. I'm being intentionally vague here on the exact algorithm, since you all have thought about this more than I have. One thing I like about the above is that it is independent of table size, and doesn't require anyone to determine which tables are hot manually.

-Casey
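A hedged sketch of the kind of priority score Casey is gesturing at (this exact formula is not from the thread, and the weights are invented): the dead-tuple fraction favors hot tables regardless of size, and the age term makes sure large, rarely-vacuumed tables eventually float to the top.

#include <stdio.h>

static double
vacuum_priority(double dead_tuples, double live_tuples,
                double hours_since_vacuum)
{
    double dead_fraction = dead_tuples / (live_tuples + dead_tuples + 1.0);
    double age_term = hours_since_vacuum / 24.0;   /* grows with neglect */

    return dead_fraction + 0.1 * age_term;         /* weights are guesses */
}

int
main(void)
{
    /* small hot table vs. big table that has waited two days */
    printf("hot 100-row table:  %.3f\n", vacuum_priority(400, 100, 0.5));
    printf("big 10M-row table:  %.3f\n", vacuum_priority(50000, 1e7, 48.0));
    return 0;
}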
Tom Lane wrote:

> Saving the array is > expensive both in runtime and code complexity, and I don't believe we > can trust it later --- at least not without even more expensive-and- > complex measures, such as WAL-logging every such save :-(

I don't quite understand the things you are worrying about. If we find that we cannot trust the saved file, or the file has been corrupted, then we can drop it and scan from the beginning of the heap again. If something like CLUSTER or PITR has changed the physical layout of the heap, then we can simply drop the files. Why do we need WAL for it? I don't see any point in it.

Also, I don't think it is expensive. If it is combined with a maintenance window so that vacuum stops only once in a whole day, writing out 256MB/2 = 128MB of state cannot be called expensive. Of course, this feature isn't for autovacuum to use every few minutes; autovacuum can use it after it has adopted a maintenance window.