Thread: autovacuum prioritization
autovacuum, as it exists today, has some reasonable - though imperfect - heuristics for deciding which tables need to be vacuumed or analyzed. However, it has nothing to help it decide which tables need to be vacuumed or analyzed first. As a result, when it's configured aggressively enough, and has sufficient hardware resources, to do all the things that need to get done, the system holds together reasonably well. When it's not configured sufficiently aggressively, or when hardware resources are lacking, the situation deteriorates. Some deterioration is inevitable in that kind of scenario, because if you can't do all of the things that need to be done, something will have to get postponed, and that can cause problems. However, autovacuum makes it worse than it needs to be by deciding which things to do first more or less at random. When autovacuum chooses to vacuum a table that is a little bit bloated in preference to one that is severely bloated, or one that is bloated in preference to one that is about to cause a wraparound shutdown, users are understandably upset, and some of them post on this mailing list about it.

I am not going to try to review in detail the history of previous threads looking for action in this area, but there are quite a few. In my view, previous efforts in this area have been too simplistic. For example, it's been proposed that a table that is perceived to be in any kind of wraparound danger ought to get top priority, but I find that implausible. A mostly-quiescent table that is one XID past autovacuum_freeze_max_age is not likely to be a more urgent problem than a table that is absorbing 20,000 row updates per second. There is a long way between the default value of autovacuum_freeze_max_age and a wraparound shutdown, but if you don't vacuum a hotly-updated table promptly, you will get irreversible bloat and be forced into an application shutdown to run VACUUM FULL.

It's also been proposed that we provide some way to let the user set priorities, which was rightly criticized as foisting on the user what the system ought to be figuring out for itself. Perhaps a user-controllable prioritization system is a good idea, but only as a way of overriding some built-in system in cases where that goes wrong. A few people have proposed scoring systems, which I think is closer to the right idea, because our basic goal is to start vacuuming any given table soon enough that we finish vacuuming it before some catastrophe strikes. The more imminent the catastrophe, the more urgent it is to start vacuuming right away. Also, and I think this is very important, the longer vacuum is going to take, the more urgent it is to start vacuuming right away. If table A will cause wraparound in 2 hours and take 2 hours to vacuum, and table B will cause wraparound in 1 hour and take 10 minutes to vacuum, table A is more urgent even though the catastrophe is further out.

So at a high level, I think that what we ought to do is, first, for each table, estimate the time at which we think something bad will occur. There are several bad events: too much bloat, XID wraparound, MXID wraparound. We need to estimate the time at which we think each of those things will occur, and then take the earliest of those estimates. That's the time by which we need to have finished vacuuming the table. Then, make an estimate of how long it will take to complete a vacuum of the table, subtract that from the time at which we need to be done, and that's the time by which we need to start. The table with the earliest need-to-start time is the highest-priority table.
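In rough pseudocode, the calculation I have in mind looks something like this (a Python sketch; the field names are invented, and all of the real difficulty hides inside producing the four estimates):

    def need_to_start(table):
        # table: dict with three projected catastrophe times (datetimes)
        # and an estimated vacuum duration (timedelta), all of which we
        # would have to learn to estimate from statistics we don't
        # currently track.
        need_to_finish = min(table["bloat_deadline"],
                             table["xid_wraparound_deadline"],
                             table["mxid_wraparound_deadline"])
        # Work backward: start early enough to be done by the deadline.
        return need_to_finish - table["estimated_vacuum_duration"]

    def pick_next(tables):
        # Most urgent table = earliest need-to-start time.
        return min(tables, key=need_to_start)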
There are a number of problems here. One is that we actually need to be able to estimate all the things that I just described, which will probably require tracking statistics that we don't capture today, such as the rate at which the system is consuming XIDs, and the rate at which a table is bloating, rather than just the current state of things. Unfortunately, proposing that the statistics system should store more per-table information is a well-known way to get your patch forcefully rejected. Maybe if we were storing and updating the statistics data in a better way it wouldn't be an issue - so perhaps the shared-memory stats collector stuff will resolve this issue somehow. Or maybe it's not an issue anyway, since the big problem with the statistics files is that they have to be constantly rewritten. If we took snapshots of the values in even a relatively dumb way, they'd be kinda big, but they'd also be write-once. Maybe that would keep the expense reasonable.

A second problem is that, if the earliest need-to-start time is in the past, then we definitely are in trouble and had better get to work at once, but if it's in the future, that doesn't necessarily mean we're safe. If there are three tables with a need-to-finish time that is 12 hours in the future and each of them will take 11 hours to vacuum, then every need-to-start time computed according to the algorithm above is in the future, but in fact we're in a lot of trouble. If the estimates are accurate, we need 3 autovacuum workers to be available to start within 1 hour, or we're doomed. The very last thing we want to do is wait another hour before doing anything. It's not impossible to factor this into the calculation of need-to-start times, assuming we know how many workers we have. For instance, if we've got tables whose need-to-finish times are 30, 50, and 70 minutes in the future, we can see that if each one takes 20 minutes or less to vacuum, then the need-to-start times can just be computed by subtraction. But if the tables with 50- or 70-minute deadlines are going to take more than 20 minutes to vacuum, then we've got to back up the need-to-start times so that we finish each table in time to start on the next one. I haven't looked into what algorithms exist for this kind of scheduling problem, but I feel like a literature search, or pulling out my college textbooks, would probably turn up some useful ideas.

As soon as you see that you can't decide when you need to start on a particular table without knowing what's going on with all the other tables on the system, a third problem becomes apparent: you can't figure anything out with confidence by looking at a single database, but must rather gather information from all databases and decide to which databases the workers should connect and what they ought to do when they get there. And then, as tasks finish and system activity progresses, you need to continuously update your notion of what needs to be done next and move workers around to accommodate it. This gets quite complicated, but that doesn't mean that it's unimportant. There's a pretty well-known "thundering herd" type effect when every table in the system crosses autovacuum_freeze_max_age around the same time, and suddenly we go from not much vacuuming to a whole ton of vacuuming all at once. A system like this could give us enough information to spread that out in an intelligent way: we could see the storm coming and start a single autovacuum worker working on the problem well in advance, and then ramp up to multiple workers only if it looks like that's not going to be enough to get the job done. I'm not sure I want to deal with all that complexity on day one, but I think it's important to do something about it eventually.

In the meantime, I think a sensible place to start would be to figure out some system that makes sensible estimates of how soon we need to address bloat, XID wraparound, and MXID wraparound for each table, and some system that estimates how long each one will take to vacuum. Then, even if the workers aren't talking to each other, each individual worker can make an effort to deal with the most urgent tasks first. I think that estimating how long it will take to vacuum a table shouldn't be too bad: examining the visibility map and the index sizes and thinking about the autovacuum cost limits should give us at least some notion of what's likely to happen.

Also, I think estimating the time until XID age becomes critical isn't too bad either. First, fix a threshold (perhaps autovacuum_freeze_max_age, maybe with a higher-than-current value, or maybe some other threshold entirely) that represents the target below which we always want to remain. Second, know something about how fast the system is consuming XIDs. Then just divide. I thought for a while that it would be too hard to understand the XID consumption rate, because it might be very uneven, but I think we can mitigate that problem somewhat by averaging over relatively long time intervals. For instance, we might measure the number of XIDs consumed in each 24 hour period, keep 7 days of data, and then take the average of those values, or maybe better, the highest or second-highest. That's a very small amount of data to store, and in combination with the relfrozenxid for each table, it's all we need. We could also give the user a way to override our estimate of the XID consumption rate, in case they have very brief, very severe spikes. All of this can also be done for MXID age.
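As a sketch (applied to XIDs here; MXIDs would work the same way; the names and arithmetic are invented, and the bucketing is the part we'd actually have to build):

    def xid_consumption_rate(daily_xids):
        # daily_xids: XIDs consumed in each of the last 7 days, one
        # bucket per 24-hour period. Taking the second-highest bucket
        # is conservative without letting a single freak spike
        # dominate; assumes at least 2 buckets exist.
        return sorted(daily_xids)[-2] / 86400.0  # XIDs per second

    def seconds_until_xid_trouble(current_xid, relfrozenxid, threshold,
                                  daily_xids):
        # XID arithmetic simplified to plain integers for the sketch.
        headroom = threshold - (current_xid - relfrozenxid)
        return max(headroom, 0) / xid_consumption_rate(daily_xids)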
It's estimating the time at which table bloat will exceed some threshold that seems most difficult, because that seems to require measuring trends on a per-table basis, as opposed to the XID consumption rate, which is global.

I know that this email is kind of a giant wall of text, so my thanks if you've read this far, and even more if you feel inspired to write back with your own thoughts.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jan 20, 2022 at 11:24 AM Robert Haas <robertmhaas@gmail.com> wrote:
> In my view, previous efforts in this area have been too simplistic. For example, it's been proposed that a table that is perceived to be in any kind of wraparound danger ought to get top priority, but I find that implausible.

I agree that it doesn't follow that table A should be more of a priority than table B, either because it has a greater age, or because its age happens to exceed some actually-arbitrary threshold. But I will point out that my ongoing work on freezing does make something along these lines much more plausible. As I said over on that thread, there is now a kind of "natural variation" among tables, in terms of relfrozenxid, as a result of tracking the actual oldest XID, and using that (plus the emphasis on advancing relfrozenxid wherever possible). And so we'll have a much better idea of what's going on with each table -- it's typically a precise XID value from the table, from the recent past.

As of today, on HEAD, the picture is rather fuzzy. If a table has a really old relminmxid, then which is more likely:

1. there are lots of remaining MultiXactId references in the table, or

2. it has been a while since that table was last aggressively vacuumed, and it actually has exactly zero MultiXactId references?

I would guess 2 myself, but right now I could never be too sure. But, in a world where we consistently advance relfrozenxid and relminmxid, then *not* advancing them (or advancing either by relatively little in one particular table) becomes a strong signal, in a way that it just isn't currently. This is a negative signal, not a positive signal. And as you yourself go on to say, that's what any new heuristics for this stuff ought to be exclusively concerned with -- what not to allow to happen, ever. There is a great deal of diversity among healthy databases; they're hard to make generalizations about that work. But unhealthy (really very unhealthy) states are *far* easier to recognize and understand, without really needing to understand the workload itself at all.

Since we now have the failsafe, the scheduling algorithm can afford to not give too much special attention to table age until we're maybe over the 1 billion age mark -- or even 1.5 billion+. But once the scheduling stuff starts to give table age special attention, it should probably become the dominant consideration, by far, completely drowning out any signals about bloat. It's kinda never really supposed to get that high, so when we do end up there it is reasonable to fully freak out. Unlike the bloat criteria, the wraparound safety criteria don't seem to have much recognizable space between not worrying at all, and freaking out.

> A second problem is that, if the earliest need-to-start time is in the past, then we definitely are in trouble and had better get to work at once, but if it's in the future, that doesn't necessarily mean we're safe.

There is a related problem that you didn't mention: autovacuum_max_workers controls how many autovacuum workers can run at once, but there is no particular concern for whether or not running that many workers actually makes sense, in any given scenario. As a general rule, the system should probably be *capable* of running a large number of autovacuums at the same time, but never actually do that (because it just doesn't ever prove necessary). Better to have the option and never use it than need it and not have it.
> In the meantime, I think a sensible place to start would be to figure out some system that makes sensible estimates of how soon we need to address bloat, XID wraparound, and MXID wraparound for each table, and some system that estimates how long each one will take to vacuum.

I think that it's going to be hard to model how long index vacuuming will take accurately. And harder still to model which indexes will adversely impact the user in some way if we delay vacuuming some more.

Might be more useful to start off by addressing how to spread out the burden of vacuuming over time. The needs of queries matter, but controlling costs matters too. One of the most effective techniques is to manually VACUUM when the system is naturally idle, like at night time. If that could be quasi-automated, or if the criteria used by autovacuum scheduling gave just a little weight to how busy the system is right now, then we would have more slack when the system becomes very busy.

--
Peter Geoghegan
On Thu, Jan 20, 2022 at 6:54 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I agree that it doesn't follow that table A should be more of a priority than table B, either because it has a greater age, or because its age happens to exceed some actually-arbitrary threshold. But I will point out that my ongoing work on freezing does make something along these lines much more plausible. As I said over on that thread, there is now a kind of "natural variation" among tables, in terms of relfrozenxid, as a result of tracking the actual oldest XID, and using that (plus the emphasis on advancing relfrozenxid wherever possible). And so we'll have a much better idea of what's going on with each table -- it's typically a precise XID value from the table, from the recent past.

I agree.

> Since we now have the failsafe, the scheduling algorithm can afford to not give too much special attention to table age until we're maybe over the 1 billion age mark -- or even 1.5 billion+. But once the scheduling stuff starts to give table age special attention, it should probably become the dominant consideration, by far, completely drowning out any signals about bloat. It's kinda never really supposed to get that high, so when we do end up there it is reasonable to fully freak out. Unlike the bloat criteria, the wraparound safety criteria don't seem to have much recognizable space between not worrying at all, and freaking out.

I do not agree with all of this. First, on general principle, I think sharp edges are bad. If a table had priority 0 for autovacuum 10 minutes ago, it can't now have priority one million bazillion. If you're saying that the priority of wraparound needs to, in the limit, become higher than any bloat-based priority, that is reasonable. Bloat never causes a hard stop in the way that wraparound does, even if the practical effects are not much different. However, if you're saying that the priority should shoot up to the maximum all at once, I don't agree with that at all.

Second, I think it is good and appropriate to leave a lot of slop in the mechanism. As you point out later, we don't really know whether any of our estimates for how long things will take are accurate, and therefore we don't know whether the time we've budgeted will be sufficient. We need to leave lots of slop so that even if we turn out to be quite wrong, we don't hit a wall.

Also, it's worth keeping in mind that waiting longer to freak out is not necessarily an advantage. It may well be that the only way the problem will ever get resolved is by human intervention - going in and fixing whatever dumb thing somebody did - e.g. resolving the pending prepared transaction. In that sense, we might be best off freaking out after a relatively small number of transactions, because that might get some human being's attention. In a very real sense, if old prepared transactions shut down the system after 100 million transactions, users would probably be better off on average, because the problems would get fixed before so much damage is done. I'm not seriously proposing that as a design, but I think it's a mistake to think that pushing off the day of reckoning is necessarily better.

All that being said, I do agree that trying to keep the table age below 300 million is too conservative.
I think we need to be conservative now because we don't take the time that the table will take to vacuum into account, and I think if we start thinking about it as a target to finish vacuuming rather than a target to start vacuuming, it can go significantly higher. But I would be disinclined to go to, say, 1.5 billion. If the user hasn't taken any action when we hit the 1 billion transaction mark, or really probably a lot sooner, they're unlikely to wake up any time soon. I don't think there are many systems out there where vacuum ages >1b are the result of the system trying frantically to keep up and not having enough juice. There are probably some, but most such cases are the result of misconfiguration, user error, software failure, etc.

> There is a related problem that you didn't mention: autovacuum_max_workers controls how many autovacuum workers can run at once, but there is no particular concern for whether or not running that many workers actually makes sense, in any given scenario. As a general rule, the system should probably be *capable* of running a large number of autovacuums at the same time, but never actually do that (because it just doesn't ever prove necessary). Better to have the option and never use it than need it and not have it.

I agree. And related to that, the more workers we have, the slower each one goes, which I think is often counterintuitive for people, and also often counterproductive. I'm sure there are cases where table A is really big and needs to be vacuumed but not terribly urgently, and table B is really small but needs to be vacuumed right now, and I/O bandwidth is really tight. In that case, slowing down the vacuum on table A so that the vacuum on table B can do its thing is the right call. But what I think is more common is that we get more workers because the first one is not getting the job done. And if they all get slower then we're still not getting the job done, but at greater expense.

> > In the meantime, I think a sensible place to start would be to figure out some system that makes sensible estimates of how soon we need to address bloat, XID wraparound, and MXID wraparound for each table, and some system that estimates how long each one will take to vacuum.
>
> I think that it's going to be hard to model how long index vacuuming will take accurately. And harder still to model which indexes will adversely impact the user in some way if we delay vacuuming some more.

Those are fair concerns. I assumed that if we knew the number of pages in the index, which we do, it wouldn't be too hard to make an estimate like this ... but you know more about this than I do, so tell me why you think that won't work. It's perhaps worth noting that even a somewhat poor estimate could be a big improvement over what we have now.

> Might be more useful to start off by addressing how to spread out the burden of vacuuming over time. The needs of queries matter, but controlling costs matters too.
>
> One of the most effective techniques is to manually VACUUM when the system is naturally idle, like at night time. If that could be quasi-automated, or if the criteria used by autovacuum scheduling gave just a little weight to how busy the system is right now, then we would have more slack when the system becomes very busy.

I have thought about this approach but I'm not very hopeful about it as a development direction. One problem is that we don't necessarily know when the quiet times are, and another is that there might not even be any quiet times. Still, neither of those problems by itself would discourage me from attempting something in this area. The thing that does discourage me is: if you have a quiet period, you can take advantage of that to do vacuuming without any code changes at all. You can just crontab a vacuum that runs with a reduced setting for vacuum_freeze_table_age and vacuum_freeze_min_age during your nightly quiet period and call it good.
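For instance, something along these lines works today (a sketch only: the database name and schedule are placeholders, and setting both GUCs to zero is the most aggressive possible choice):

    # crontab entry: eager freezing vacuum during a nightly quiet period.
    # PGOPTIONS applies the settings to the session vacuumdb opens.
    0 3 * * * PGOPTIONS='-c vacuum_freeze_table_age=0 -c vacuum_freeze_min_age=0' vacuumdb --quiet mydb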
The problem that I'm principally concerned about here is the case where somebody had a system that was basically OK and then at some point, bad things started to happen. At some point they realize they're in trouble and try to get back on track. Very often, autovacuum is actually the enemy in that situation: it insists on consuming resources to vacuum the wrong stuff. Whatever we can do to avoid such disastrous situations is all to the good, but since we can't realistically expect to avoid them entirely, we need to improve the behavior in the cases where they do happen.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jan 20, 2022 at 4:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Since we now have the failsafe, the scheduling algorithm can afford to not give too much special attention to table age until we're maybe over the 1 billion age mark -- or even 1.5 billion+. But once the scheduling stuff starts to give table age special attention, it should probably become the dominant consideration, by far, completely drowning out any signals about bloat. It's kinda never really supposed to get that high, so when we do end up there it is reasonable to fully freak out. Unlike the bloat criteria, the wraparound safety criteria don't seem to have much recognizable space between not worrying at all, and freaking out.
>
> I do not agree with all of this. First, on general principle, I think sharp edges are bad. If a table had priority 0 for autovacuum 10 minutes ago, it can't now have priority one million bazillion. If you're saying that the priority of wraparound needs to, in the limit, become higher than any bloat-based priority, that is reasonable.

I'm definitely saying considerations about wraparound need to swamp everything else out at the limit. But I'm also making the point that (at least with the ongoing relfrozenxid/freezing work) the system does remarkably well at avoiding all aggressive anti-wraparound VACUUMs in most individual tables, with most workloads. And so having an aggressive anti-wraparound VACUUM at all now becomes a pretty strong signal.

As we discussed on the other thread recently, you're still only going to get anti-wraparound VACUUMs in a minority of tables with the patches in place -- for tables that won't ever get an autovacuum for any other reason. And so having an anti-wraparound VACUUM probably just signals that we have such a table, which is totally inconsequential. But what about when there is an anti-wraparound VACUUM (or a need for one) on a table whose age is already (say) 2x the value of autovacuum_freeze_max_age? That really is an incredibly strong signal that something is very much amiss.

Since the relfrozenxid/freezing patch series actually makes each VACUUM able to advance relfrozenxid in a way that's really robust when the system is not under great pressure, the failure of that strategy becomes a really strong signal. So it's not that table age signals something that we can generalize about too much, without context. The context is important. The relationship between table age and autovacuum_freeze_max_age with the new strategy from my patch series becomes an important negative signal, about something that we reasonably expected to be quite stable not actually being stable. (Sorry to keep going on about my work, but it really seems relevant.)

> Also, it's worth keeping in mind that waiting longer to freak out is not necessarily an advantage. It may well be that the only way the problem will ever get resolved is by human intervention - going in and fixing whatever dumb thing somebody did - e.g. resolving the pending prepared transaction.

In that case we ought to try to alert the user earlier.

> Those are fair concerns. I assumed that if we knew the number of pages in the index, which we do, it wouldn't be too hard to make an estimate like this ... but you know more about this than I do, so tell me why you think that won't work. It's perhaps worth noting that even a somewhat poor estimate could be a big improvement over what we have now.
I can construct a plausible, totally realistic counter-example that breaks a heuristic like that, unless it focuses on extremes only, like no index growth at all since the last VACUUM (which didn't leave behind any deleted pages). I think that such a model can work well, but only if it's designed to matter less and less as our uncertainty grows. It seems as if the uncertainty grows very sharply, once you begin to generalize past the extremes. We have to be totally prepared for the model to be wrong, except perhaps as a way of prioritizing things when there is real urgency, and we don't have a choice about choosing. All models are wrong, some are useful.

> The problem that I'm principally concerned about here is the case where somebody had a system that was basically OK and then at some point, bad things started to happen.

It seems necessary to distinguish between the case where things really were okay for a time, and the case where they merely appeared to be okay to somebody whose understanding of the system isn't impossibly deep and sophisticated. You'd have to be an all-knowing oracle to be able to tell the difference, because the system itself has no sophisticated notion of how far it is into debt. There are things that we can do to address this gap directly (that's what I have been doing myself), but that can only go so far.

ISTM that the higher the amount of debt that the system is actually in, the greater the uncertainty about the total amount of debt. In other words, the advantage of paying down debt isn't limited to the obvious stuff; there is also the advantage of gaining confidence about how far into debt the system really is. The longer it's been since the last real VACUUM, the more your model of debt/bloat is likely to have diverged from reality. And that's why I bring costs into it. Vacuuming at night because you know that the cost will be relatively low, even if the benefits might not be quite as high as you'd usually expect, makes sense on its own terms, and also has the advantage of making the overall picture clearer to the system/your model.

> At some point they realize they're in trouble and try to get back on track. Very often, autovacuum is actually the enemy in that situation: it insists on consuming resources to vacuum the wrong stuff.

To some degree this is because the statistics that autovacuum has access to are flat out wrong, even though we could do better. For example, the issue that I highlighted a while back about ANALYZE's dead tuples accounting. Or the issue that I pointed out on this thread already, about relfrozenxid being a very bad indicator of what's actually going on with XIDs in the table (at least without my relfrozenxid patches in place).

Another idea centered on costs: with my freezing/relfrozenxid patch series, strict append-only tables like pgbench_history will only ever need to have VACUUM process each heap page once. That's good, but it could be even better if we didn't have to rely on the autovacuum scheduling and autovacuum_vacuum_insert_scale_factor to drive everything. This is technically a special case, but it's a rather important one -- it's both very common and not that hard to do a lot better on. We ought to be aiming to only dirty each page exactly once, by *dynamically* deciding to VACUUM much more often than the current model supposes makes sense. I think that this would require a two-way dialog between autovacuum.c and vacuumlazy.c. At a high level, vacuumlazy.c would report back "turns out that that table looks very much like an append-only table". That feedback would cause the autovacuum.c scheduling to eagerly launch another autovacuum worker, ignoring the usual criteria -- just wait (say) another 60 seconds, and then launch a new autovacuum worker on the same table if it became larger by some smallish fixed amount (stop caring about percentage table growth). Constant mini-vacuums against such a table make sense, since costs are almost exactly proportional to the number of heap pages appended since the last VACUUM.
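In sketch form, the autovacuum.c side of that dialog might look roughly like this (Python-ish, with placeholder names; the 60 seconds and the fixed page-growth trigger are the knobs I mean):

    import time
    from dataclasses import dataclass

    RECHECK_SECONDS = 60     # placeholder cadence
    GROWTH_PAGES = 1024      # fixed growth trigger, not a percentage

    @dataclass
    class Table:
        relpages: int
        looks_append_only: bool  # feedback reported by the last VACUUM

    def launch_vacuum(table: Table) -> None:
        # stand-in for starting an autovacuum worker on the table
        print(f"vacuuming table with {table.relpages} pages")

    def mini_vacuum_loop(table: Table) -> None:
        baseline = table.relpages
        while table.looks_append_only:
            time.sleep(RECHECK_SECONDS)
            if table.relpages - baseline >= GROWTH_PAGES:
                launch_vacuum(table)  # cheap: mostly the appended pages
                baseline = table.relpages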
--
Peter Geoghegan
On Fri, Jan 21, 2022 at 12:54 AM Robert Haas <robertmhaas@gmail.com> wrote:
> So at a high level, I think that what we ought to do is, first, for each table, estimate the time at which we think something bad will occur. There are several bad events: too much bloat, XID wraparound, MXID wraparound. We need to estimate the time at which we think each of those things will occur, and then take the earliest of those estimates. That's the time by which we need to have finished vacuuming the table. Then, make an estimate of how long it will take to complete a vacuum of the table, subtract that from the time at which we need to be done, and that's the time by which we need to start. The earliest need-to-start time is the highest priority table.

I think we need some more parameters to compare bloat vs. wraparound. In one of your examples in the 2nd paragraph, we can say that the need-to-start of table A is earlier than table B's, so that case is simple. But when it comes to wraparound vs. bloat, we need to add some weighting to compute how much bloat is considered as bad as wraparound. I think the amount of bloat can not be an absolute number; it should be relative, w.r.t. the total database size or so. I don't think it can be computed w.r.t. the table size, because if the table is, e.g., just 1 GB in size and it is 5 times bloated, then it is not as bad as another 1 TB table which is just 2 times bloated.
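To show what I mean, something like this (just a sketch of the idea; all names are made up):

    GB = 1024 ** 3

    def bloat_score(table_size_bytes, bloat_factor, database_size_bytes):
        # Score by absolute dead space relative to the whole database,
        # rather than by the per-table bloat factor alone.
        # table_size_bytes is the un-bloated (logical) size here.
        dead_bytes = table_size_bytes * (bloat_factor - 1)
        return dead_bytes / database_size_bytes

    db_size = 2048 * GB
    print(bloat_score(1 * GB, 5, db_size))     # ~0.002: 4 GB of dead space
    print(bloat_score(1024 * GB, 2, db_size))  # 0.5: far worse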
> A second problem is that, if the earliest need-to-start time is in the past, then we definitely are in trouble and had better get to work at once, but if it's in the future, that doesn't necessarily mean we're safe. If there are three tables with a need-to-finish time that is 12 hours in the future and each of them will take 11 hours to vacuum, then every need-to-start time computed according to the algorithm above is in the future, but in fact we're in a lot of trouble. If the estimates are accurate, we need 3 autovacuum workers to be available to start within 1 hour, or we're doomed. The very last thing we want to do is wait another hour before doing anything. It's not impossible to factor this into the calculation of need-to-start times, assuming we know how many workers we have. For instance, if we've got tables whose need-to-finish times are 30, 50, and 70 minutes in the future, we can see that if each one takes 20 minutes or less to vacuum, then the need-to-start times can just be computed by subtraction. But if the tables with 50- or 70-minute deadlines are going to take more than 20 minutes to vacuum, then we've got to back up the need-to-start times so that we finish each table in time to start on the next one. I haven't looked into what algorithms exist for this kind of scheduling problem, but I feel like a literature search, or pulling out my college textbooks, would probably turn up some useful ideas.

I think we should be thinking of dynamically adjusting priority as well. It is possible that when autovacuum started, we prioritized the tables based on some statistics and estimation, but the vacuuming process can take a long time, and during that time some priorities might change. So if, at the start of autovacuum, we push all the tables into some priority queue and simply vacuum in that order, then we might go wrong somewhere. I think we need to make different priority queues based on different factors, for example 1 queue for wraparound risk and another for bloat risk.

Even though there would be multiple queues, we would have the need_to_start time with each item, so that we know exactly from which queue to pick the next item; and whenever we pick an item, we can dynamically recheck the priority of the item at the head of each queue, always assuming that each queue is arranged in order of need_to_start time. So now what if an item further back in the queue becomes more important than the item at the queue head, based on some statistics? I don't think it is wise to recompute the need_to_start time for all the items before picking any new item. But I think we need to have multiple queues based on different factors (not only just wraparound and bloat) to reduce the risk of items at the back of the queue becoming higher priority than items at the front. I mean, this can not completely be avoided, but it can be reduced by creating multiple work queues based on more factors which can dynamically change.

> I know that this email is kind of a giant wall of text, so my thanks if you've read this far, and even more if you feel inspired to write back with your own thoughts.

Yeah, it is a long email, but quite interesting.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 24, 2022 at 11:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I think we need some more parameters to compare bloat vs. wraparound. In one of your examples in the 2nd paragraph, we can say that the need-to-start of table A is earlier than table B's, so that case is simple. But when it comes to wraparound vs. bloat, we need to add some weighting to compute how much bloat is considered as bad as wraparound. I think the amount of bloat can not be an absolute number; it should be relative, w.r.t. the total database size or so. I don't think it can be computed w.r.t. the table size, because if the table is, e.g., just 1 GB in size and it is 5 times bloated, then it is not as bad as another 1 TB table which is just 2 times bloated.

Thanks for writing back. I don't think that I believe the last part of this argument, because it seems to suppose that the big problem with bloat is that it might use up disk space, whereas in my experience the big problem with bloat is that it slows down access to your data. Yet the dead space in some other table will not have much impact on the speed of access to the current table. In fact, if most accesses to the table are index scans, even dead space in the current table may not have much effect, but sequential scans are bound to notice. It's true that, on a cluster-wide basis, every dead page is one more page that can potentially take up space in cache, so in that sense the performance consequences are global to the whole cluster. However, that effect is more indirect and takes a long time to become a big problem. The direct effect of having to read more pages to execute the same query plan causes problems a lot sooner.

But your broader point that we need to consider how much bloat represents a problem is a really good one. In the past, one rule that I've thought about is: if we're vacuuming a table and we're not going to finish before it needs to be vacuumed again, then we should vacuum faster (i.e. in effect, increase the cost limit on the fly). That might still not result in good behavior, but it would at least result in behavior that is less bad. However, it doesn't really answer the question of how we decide when to start the very first VACUUM. I don't really know the answer to that question. The current heuristics result in estimates of acceptable bloat that are too high in some cases and too low in others. I've seen tables that got bloated vastly beyond what autovacuum is configured to tolerate before they caused any real difficulty, and I know there are other cases where users start to suffer long before those thresholds are reached. At the moment, the best idea I have is to use something like the current algorithm, but treat it as a deadline (keep bloat below this amount) rather than an initiation criterion (start when you reach this amount). But I think that idea is a bit weak; maybe there's something better out there.
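In sketch form, the vacuum-faster rule might look like this (invented names; the two rate estimates are, as usual, the hard part):

    def adjust_cost_limit(cost_limit, pages_remaining, pages_per_second,
                          seconds_until_next_vacuum_due):
        # If, at the current pace, this vacuum won't finish before the
        # table is due to be vacuumed again, raise the cost limit
        # proportionally so that it can.
        seconds_to_finish = pages_remaining / pages_per_second
        if seconds_to_finish > seconds_until_next_vacuum_due:
            return int(cost_limit * seconds_to_finish
                       / seconds_until_next_vacuum_due)
        return cost_limit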
> I think we should be thinking of dynamically adjusting priority as well. It is possible that when autovacuum started, we prioritized the tables based on some statistics and estimation, but the vacuuming process can take a long time, and during that time some priorities might change. So if, at the start of autovacuum, we push all the tables into some priority queue and simply vacuum in that order, then we might go wrong somewhere.

Yep. I think we should reassess what to do next after each table, possibly making some exception for really small tables - e.g. if we last recomputed priorities less than 1 minute ago, don't do it again.

> I think we need to make different priority queues based on different factors, for example 1 queue for wraparound risk and another for bloat risk.

I don't see why we want multiple queues. We have to answer the question "what should we do next?" which requires us, in some way, to funnel everything into a single prioritization.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Jan 25, 2022 at 11:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
> But your broader point that we need to consider how much bloat represents a problem is a really good one. In the past, one rule that I've thought about is: if we're vacuuming a table and we're not going to finish before it needs to be vacuumed again, then we should vacuum faster (i.e. in effect, increase the cost limit on the fly).

That seems reasonable, but I doubt that that's a huge issue in practice, now that the default cost limits are more sensible.

> That might still not result in good behavior, but it would at least result in behavior that is less bad. However, it doesn't really answer the question of how we decide when to start the very first VACUUM. I don't really know the answer to that question. The current heuristics result in estimates of acceptable bloat that are too high in some cases and too low in others. I've seen tables that got bloated vastly beyond what autovacuum is configured to tolerate before they caused any real difficulty, and I know there are other cases where users start to suffer long before those thresholds are reached.

ISTM that the easiest thing that could be done to improve this is to give some consideration to page-level characteristics. For example, a page that has 5 dead heap-only tuples is vastly different to a similar page that has 5 LP_DEAD items instead -- and yet our current approach makes no distinction. Chances are very high that if the only dead tuples are heap-only tuples, then things are going just fine on that page -- opportunistic pruning is actually keeping up.

Page-level stability over time seems to be the thing that matters most -- we must make sure that the same "logical rows" that were inserted around the same time remain on the same block for as long as possible, without mixing in other unrelated tuples needlessly. In other words, preserve natural locality.

This is related to the direction of things, and the certain knowledge that VACUUM alone can deal with line pointer bloat. The current state of individual pages hints at the direction of things even without tracking how things change directly. But tracking the change over time in ANALYZE seems better still: if successive ANALYZE operations notice a consistent pattern where pages that had a non-zero number of LP_DEAD items last time now have a significantly higher number, then it's a good idea to err in the direction of more aggressive vacuuming. *Growing* concentrations of LP_DEAD items signal chaos. I think that placing a particular emphasis on pages with non-zero LP_DEAD items as a qualitatively distinct category of page might well make sense -- relatively few blocks with a growing number of LP_DEAD items seems like it should be enough to make autovacuum run aggressively.
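The shape of the signal I have in mind is roughly this (a sketch only; the samples would have to come from something ANALYZE-like, and the threshold is invented):

    def growing_lp_dead_blocks(prev_sample, curr_sample):
        # Each sample maps block number -> count of LP_DEAD items, for
        # sampled blocks with a non-zero count. Blocks whose counts grow
        # between samples are the ones where pruning isn't keeping up.
        return sum(1 for blk, n in curr_sample.items()
                   if n > prev_sample.get(blk, 0) > 0)

    def should_vacuum_aggressively(prev_sample, curr_sample, threshold=16):
        # Relatively few such blocks should be enough to trigger.
        return growing_lp_dead_blocks(prev_sample, curr_sample) >= threshold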
As I pointed out not long ago, ANALYZE does a terrible job of accurately counting dead tuples/LP_DEAD items when they aren't uniformly distributed in the table -- which is often a hugely important factor, with a table that is append-mostly with updates and deletes. That's why I suggested bringing the visibility map into it. In general I think that the statistics that drive autovacuum are currently often quite wrong, even on their own simplistic, quantitative terms.

> I don't see why we want multiple queues. We have to answer the question "what should we do next?" which requires us, in some way, to funnel everything into a single prioritization.

Even busy production DBs should usually only be vacuuming one large table at a time. Also might make sense to strategically align the work with the beginning of a new checkpoint.

--
Peter Geoghegan
On Tue, Jan 25, 2022 at 2:30 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jan 24, 2022 at 11:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I think we need to make different priority queues based on different factors, for example 1 queue for wraparound risk and another for bloat risk.
>
> I don't see why we want multiple queues. We have to answer the question "what should we do next?" which requires us, in some way, to funnel everything into a single prioritization.

I was thinking along the same lines as Dilip: If the anti-wraparound risk is really far in the future, there might not be much eligible freezing work to do. Dead tuples can be removed as soon as visibility rules allow it. With a separate bloat queue, there might always be some work to do. Maybe "bloat queue" is too specific, because insert-only tables can use more vacuuming for the VM even if they have not reached the configured threshold.

So a worker would check the wraparound queue, and if nothing's there, grab something from the other queue. Maybe low-priority work would have a low cost limit.

Probably the best way to schedule, at least at first, is whatever is least complex. I'm not yet sure what that is...

--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Jan 25, 2022 at 3:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> For example, a page that has 5 dead heap-only tuples is vastly different to a similar page that has 5 LP_DEAD items instead -- and yet our current approach makes no distinction. Chances are very high that if the only dead tuples are heap-only tuples, then things are going just fine on that page -- opportunistic pruning is actually keeping up.

Hmm, I think that's a really good insight. Perhaps we ought to forget about counting updates and deletes and instead count dead line pointers. Or maybe we can estimate the number of dead line pointers by knowing how many updates and deletes there were, as long as we can distinguish HOT updates from non-HOT updates, which I think we can.

> if successive ANALYZE operations notice a consistent pattern where pages that had a non-zero number of LP_DEAD items last time now have a significantly higher number, then it's a good idea to err in the direction of more aggressive vacuuming. *Growing* concentrations of LP_DEAD items signal chaos. I think that placing a particular emphasis on pages with non-zero LP_DEAD items as a qualitatively distinct category of page might well make sense -- relatively few blocks with a growing number of LP_DEAD items seems like it should be enough to make autovacuum run aggressively.

I think measuring the change over time here might be fraught with peril. If vacuum makes a single pass over the indexes, it can retire as many dead line pointers as we have, or as will fit in memory, and the effort doesn't really depend too much on exactly how many dead line pointers we're trying to find. (I hear that it does depend more than you'd think ... but I still don't think that should be the dominant consideration here.) So to be efficient, we want to do that pass over the indexes when we have a suitably large batch of dead line pointers. I don't think it really depends on how long it took the batch to get to that size. I don't want to vacuum a terabyte of indexes with a much-smaller-than-normal batch of dead TIDs just because the number of dead TIDs seems to be increasing quickly at the moment: it's hard to imagine that the results will be worth the resources I'll have to expend to get there. On the other hand I also don't think I want to postpone vacuuming the indexes because the number is really big but not growing that fast.

I feel like my threshold for the number of dead TIDs that ought to trigger a vacuum grows as the table gets bigger, capped by how much memory I've got. But I don't feel like the rate at which it's changing necessarily matters. Like if I create a million dead line pointers really quickly, wait a month, and then create another million dead line pointers, I feel like I want the system to respond just as aggressively as if the month-long delay were omitted. Maybe my feelings are wrong here. I'm just saying that, to me, it doesn't feel like the rate of change is all that relevant.
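Something like this captures that feeling (a sketch: the 2% is pulled out of the air, and 6 bytes reflects the size of one entry in vacuum's dead-TID array today):

    def dead_tid_vacuum_threshold(table_pages, tuples_per_page,
                                  autovacuum_work_mem_bytes,
                                  fraction=0.02):
        # The trigger threshold grows with table size, but is capped by
        # how many dead TIDs fit in memory for a single index pass.
        size_based = int(table_pages * tuples_per_page * fraction)
        memory_cap = autovacuum_work_mem_bytes // 6  # per ItemPointerData
        return min(size_based, memory_cap)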
> Even busy production DBs should usually only be vacuuming one large table at a time. Also might make sense to strategically align the work with the beginning of a new checkpoint.

I'm not sure that either of those statements is correct. But on the other hand, I am also not sure that either of those statements is incorrect.

--
Robert Haas
EDB: http://www.enterprisedb.com

On Tue, Jan 25, 2022 at 3:34 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> I was thinking along the same lines as Dilip: If the anti-wraparound risk is really far in the future, there might not be much eligible freezing work to do. Dead tuples can be removed as soon as visibility rules allow it. With a separate bloat queue, there might always be some work to do.

Isn't the same thing true of bloat, though? If the XID threshold hasn't advanced that much, then there may be nothing that's worth doing about XID wraparound in the short term. If there aren't many dead tuples in any of the tables, then there may be nothing that's worth doing about bloat. Then we should just do nothing. On the other hand, we may have a relatively urgent problem in one of those areas but not the other. Then we should work on that one. Or we may have problems in both areas, and then we need to somehow decide which one is more urgent -- that's the situation in which I feel like we need to unify the prioritization or ordering in some way.

It is an interesting point that we could have low-priority work with a low cost limit and high-priority work with a higher cost limit, or, as I think Peter suggested, just use a single process for low-priority stuff but allow multiple processes when there's high-priority stuff.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Jan 26, 2022 at 10:55 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 25, 2022 at 3:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > For example, a page that has 5 dead heap-only tuples is vastly different to a similar page that has 5 LP_DEAD items instead -- and yet our current approach makes no distinction. Chances are very high that if the only dead tuples are heap-only tuples, then things are going just fine on that page -- opportunistic pruning is actually keeping up.
>
> Hmm, I think that's a really good insight. Perhaps we ought to forget about counting updates and deletes and instead count dead line pointers. Or maybe we can estimate the number of dead line pointers by knowing how many updates and deletes there were, as long as we can distinguish HOT updates from non-HOT updates, which I think we can.

All that we have to go on is a bunch of observations in any case, though -- the map is not the territory. And so it seems to me that the sensible thing to do is just embrace that we won't ever really exactly know what's going on in a given database, at any given time. Fortunately, we don't really have to know. We should be able to get away with only having roughly the right idea, by focussing on the few things that we are sure of -- things like the difference between LP_DEAD items and dead heap-only tuples, which are invariant to workload characteristics.

I recently said (on the ANALYZE related thread) that we should be thinking probabilistically here [1]. Our understanding of the amount of bloat could very usefully be framed that way. Maybe the model we use is a probability density function (maybe not formally, not sure). A PDF has an exact expectation, which for us might be the most probable number of dead tuples in total in a given table right now (let's just assume it's still dead tuples, ignoring the problems with that metric for now). This is a useful basis for making better decisions by weighing competing considerations -- which might themselves be another PDF.

Example: For a given table that is approaching the point where the model says "time to VACUUM", we may very well spend hours, days, or even weeks approaching the crossover point. The exact expectation isn't truly special here -- there is actually zero practical reason to have special reverence for that precise point (with a good model, within certain reasonable bounds). If our model says that there is only a noise-level difference between doing a VACUUM on a given table today, tomorrow, or next week, why not take advantage? For example, why not do the VACUUM when the system appears to not be busy at all (typically in the dead of night), just because it'll definitely be both cheaper in absolute terms (FPIs can be avoided by spreading things out over multiple checkpoints), and less disruptive?

There are many opportunities like that, I believe. It's hard for me to suppress the urge to blurt out 17 more ideas like that. What are the chances that you won't have at least a few real winners among all of the ideas that everybody will come up with, in the end?

> > if successive ANALYZE operations notice a consistent pattern where pages that had a non-zero number of LP_DEAD items last time now have a significantly higher number, then it's a good idea to err in the direction of more aggressive vacuuming. *Growing* concentrations of LP_DEAD items signal chaos.
> > I think that placing a particular emphasis on pages with non-zero LP_DEAD items as a qualitatively distinct category of page might well make sense -- relatively few blocks with a growing number of LP_DEAD items seems like it should be enough to make autovacuum run aggressively.
>
> I think measuring the change over time here might be fraught with peril.

I'd say that that depends on how you define the problem we're trying to solve. If you define the problem as coming up with a significantly improved statistical model that determines (say) how many dead tuples there are in the table right now, given a set of observations made by ANALYZE in the past, then yes, it's fraught with peril. But why would you define it that way? It seems far easier to improve things by putting model error and *actual* exposure to real known issues (e.g. line pointer bloat) front and center.

It doesn't necessarily matter if we're *usually* wrong with a good model -- and a bad model may consistently get the correct answer. And so the model that is the most accurate quantitatively is probably *not* the best available model, all things considered. Most of the time we shouldn't VACUUM right this second, and so a model that consists of "return false" is very frequently correct. But that doesn't mean it's a good model. You get the idea.

> If vacuum makes a single pass over the indexes, it can retire as many dead line pointers as we have, or as will fit in memory, and the effort doesn't really depend too much on exactly how many dead line pointers we're trying to find.

Line pointer bloat is something that displays hysteresis; once it happens (past some significant threshold) then there is no reversing the damage. This makes the behavior very non-linear. In other words, it makes it incredibly hard to model mathematically [2] -- once you cross a certain hard-to-define threshold, it's total chaos, even in a closed, well-specified system (i.e. a highly constrained workload), because you have all these feedback loops.

On top of all that, even with a perfect model we're still forced to make a go/no-go decision for the entire table, moment to moment. So even a mythical perfect model runs into the problem that it is simultaneously much too early and much too late at the level of the table. Which is even more reason to just focus on not going totally off the rails, in any particular direction. Note that this includes going off the rails by vacuuming in a way that's unsustainably aggressive -- sometimes you have to cut your losses at that level as well.

There is usually some bigger picture to consider when things do go wrong -- there is usually some much worse fate that must be avoided. Like with VACUUM's failsafe. Sure, controlling index bloat is extremely important. But it's also much less important than keeping the system online and responsive. That's another level up. (The level up *after that* is "at least we didn't lose data", or maybe something about limiting the amount of downtime, not going out of business, whatever.)

> I feel like my threshold for the number of dead TIDs that ought to trigger a vacuum grows as the table gets bigger, capped by how much memory I've got.
I thought of another PDF-related idea when I read this, without even trying: we could account for the discontinuity from multiple index scans in a single VACUUM operation (instead of just one) by erring in the direction of doing the VACUUM sooner rather than later, when the model says that doing so will make very little difference in terms of extra costs incurred (extra costs from vacuuming sooner rather than later, conservatively assuming that our concern about TIDs not fitting in memory is basically unfounded).

> But I don't feel like the rate at which it's changing necessarily matters. Like if I create a million dead line pointers really quickly, wait a month, and then create another million dead line pointers, I feel like I want the system to respond just as aggressively as if the month-long delay were omitted.
>
> Maybe my feelings are wrong here. I'm just saying that, to me, it doesn't feel like the rate of change is all that relevant.

It's not that they're wrong, exactly -- I wouldn't say that. It's more like this: you as a Postgres user actually care about a great many things, not just one thing. Some of these things might be somewhat in tension, from time to time. And so it seems wise to find a way to live with any tension that may crop up -- by acknowledging the tension, we get the chance to honor the preferences of the user to the greatest extent possible.

[1] https://postgr.es/m/CAH2-WzmvXXEKtEph7U360umZ5pN3d18RBfu=nyPg9neBLDUWdw@mail.gmail.com
[2] https://en.wikipedia.org/wiki/Hysteretic_model

--
Peter Geoghegan
On Thu, 20 Jan 2022 at 14:31, Robert Haas <robertmhaas@gmail.com> wrote:
>
> In my view, previous efforts in this area have been too simplistic.
>

One thing I've been wanting to do something about: I think autovacuum needs to be a little cleverer about when *not* to vacuum a table, because it won't do any good.

I've seen a lot of cases where autovacuum kicks off a vacuum of a table even though the globalxmin hasn't really advanced significantly over the oldest frozen xid. When it's a large table this really hurts, because it could be hours or days before it finishes, and at that point there's quite a bit of bloat. This isn't a common occurrence; it happens when the system is broken in some way. Either there's an idle-in-transaction session or something else keeping the global xmin held back. What it does though is make things *much* worse and *much* harder for a non-expert to hit on the right remediation. It's easy enough to tell them to look for these idle-in-transaction sessions or set timeouts. It's much harder to determine whether it's a good idea for them to go and kill the vacuum that's been running for days. And it's not a great thing for people to be getting in the habit of doing either.

I want to be able to stop telling people to kill vacuums kicked off by autovacuum. I feel like it's a bad thing for someone to ever have to do, and I know some fraction of the time I'm telling them to do it, it'll have been a terrible thing to have done (but we'll never know which times those were). Determining whether a running vacuum is actually doing any good is pretty hard, and on older versions probably impossible.

I was thinking of just putting a check in before kicking off a vacuum, and if the globalxmin is already a significant fraction of the distance back to the relfrozenxid, then log a warning instead. Basically it means "we can't keep the bloat below the threshold due to the idle transactions et al, not because there's insufficient i/o bandwidth".
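Roughly this kind of check (a sketch; XID comparisons reduced to plain integers, and the 90% cutoff is arbitrary):

    def vacuum_worth_starting(global_xmin, relfrozenxid, next_xid,
                              futility_fraction=0.90):
        # Compare how far back globalxmin is with how far back
        # relfrozenxid is. If globalxmin has regressed most of the way
        # to relfrozenxid, vacuuming can't remove much and can't
        # advance relfrozenxid much, so warn instead of vacuuming.
        xmin_age = next_xid - global_xmin
        frozen_age = next_xid - relfrozenxid
        return xmin_age < futility_fraction * frozen_age

    # if not vacuum_worth_starting(...): log a warning, skip the vacuum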
On Wed, Jan 26, 2022 at 3:46 PM Greg Stark <stark@mit.edu> wrote:
> One thing I've been wanting to do something about is I think
> autovacuum needs to be a little cleverer about when *not* to vacuum
> a table because it won't do any good.

There was a thread about this exact thing not too long ago:

https://postgr.es/m/CAH2-Wzmx6+PrfpmmFw8JZbxD+kkwhQWPOhE5RUBy6S4_Jwty=Q@mail.gmail.com

If everything goes according to plan, then Postgres 15 will have my
work on freezing and dynamically advancing relfrozenxid, meaning that
you'll be able to see (in autovacuum log output and in VACUUM VERBOSE
output) how much relfrozenxid has been advanced by, if at all. You'll
also directly see how far behind the VACUUM operation's OldestXmin
that is (and how far behind the OldestXmin is at the end of the
VACUUM operation).

It seems as if this offers you exactly what you need. You'll be able
to notice the inherent futility of an anti-wraparound VACUUM that
runs against a table whose relfrozenxid is already exactly equal to
the VACUUM's OldestXmin (say because of a leaked replication slot --
anything that makes vacuuming fundamentally unable to advance
relfrozenxid, really).

--
Peter Geoghegan
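For what it's worth, the leaked-replication-slot case Peter mentions
is already easy to check for from SQL; a query along these lines
(illustrative, not from the thread) shows any slot pinning an xmin
horizon:

    -- Slots that advertise an xmin (or catalog_xmin) prevent the
    -- global horizon from advancing, which can make vacuuming futile.
    SELECT slot_name, slot_type, active,
           age(xmin) AS xmin_age,
           age(catalog_xmin) AS catalog_xmin_age
      FROM pg_replication_slots
     WHERE xmin IS NOT NULL OR catalog_xmin IS NOT NULL;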
On Wed, 26 Jan 2022 at 18:46, Greg Stark <stark@mit.edu> wrote:
>
> On Thu, 20 Jan 2022 at 14:31, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > In my view, previous efforts in this area have been too simplistic.
>
> One thing I've been wanting to do something about is I think
> autovacuum needs to be a little cleverer about when *not* to vacuum
> a table because it won't do any good.
>
> I've seen a lot of cases where autovacuum kicks off a vacuum of a
> table even though the globalxmin hasn't really advanced
> significantly over the oldest frozen xid. When it's a large table
> this really hurts because it could be hours or days before it
> finishes and at that point there's quite a bit of bloat.

Another case I would like to see autovacuum get clever about is when
there is a wide disparity in the size of tables. If you have a few
large tables and a few small tables, there could be enough bandwidth
for everyone, but you can still get in trouble if the workers are all
tied up vacuuming the large tables.

This is a case where autovacuum scheduling can create a problem where
there shouldn't be one. It often happens when you have a set of large
tables that were all loaded with data around the same time, alongside
busy, well-designed small tables receiving lots of updates. The small
tables can happily be getting vacuumed every 15-30 min, finishing
promptly, and maintaining a nice steady state -- until one day all
the large tables suddenly hit the freeze threshold, all your workers
are suddenly busy vacuuming huge tables that take hours or days to
vacuum, and your small tables bloat by orders of magnitude.

I was thinking of dividing the eligible tables up into ntiles based
on size and then making sure one worker was responsible for each
ntile. I'm not sure that would actually be quite right though.

--
greg
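Greg's ntile idea is easy to sketch in SQL, if only to illustrate the
bucketing; the choice of four tiers is arbitrary, and a real
implementation would presumably live in the autovacuum scheduler
rather than in a query:

    -- Bucket plain tables into four size tiers; a scheduler could
    -- then dedicate one worker per tier, so the largest tables can
    -- never starve the smallest ones of attention.
    SELECT oid::regclass AS table_name,
           pg_total_relation_size(oid) AS total_bytes,
           ntile(4) OVER (ORDER BY pg_total_relation_size(oid))
               AS size_tier
      FROM pg_class
     WHERE relkind = 'r'
     ORDER BY size_tier, total_bytes;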
On Wed, Jan 26, 2022 at 6:56 PM Greg Stark <stark@mit.edu> wrote:
>
> On Wed, 26 Jan 2022 at 18:46, Greg Stark <stark@mit.edu> wrote:
> >
> > On Thu, 20 Jan 2022 at 14:31, Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > In my view, previous efforts in this area have been too
> > > simplistic.
> >
> > One thing I've been wanting to do something about is I think
> > autovacuum needs to be a little cleverer about when *not* to
> > vacuum a table because it won't do any good.
> >
> > I've seen a lot of cases where autovacuum kicks off a vacuum of a
> > table even though the globalxmin hasn't really advanced
> > significantly over the oldest frozen xid. When it's a large table
> > this really hurts because it could be hours or days before it
> > finishes and at that point there's quite a bit of bloat.
>
> Another case I would like to see autovacuum get clever about is when
> there is a wide disparity in the size of tables. If you have a few
> large tables and a few small tables there could be enough bandwidth
> for everyone but you can get in trouble if the workers are all tied
> up vacuuming the large tables.
>
> This is a case where autovacuum scheduling can create a problem
> where there shouldn't be one. It often happens when you have a set
> of large tables that were all loaded with data around the same time
> and you have your busy tables that are well designed small tables
> receiving lots of updates. They can happily be getting vacuumed
> every 15-30min and finishing promptly maintaining a nice steady
> state until one day all the large tables suddenly hit the freeze
> threshold and suddenly all your workers are busy vacuuming huge
> tables that take hours or days to vacuum and your small tables bloat
> by orders of magnitude.
>
> I was thinking of dividing the eligible tables up into ntiles based
> on size and then making sure one worker was responsible for each
> ntile. I'm not sure that would actually be quite right though.

I've been working off and on on some external vacuum-scheduling tools
for the past year or so, and one thing that seems to be an issue is a
lack of observability into the various cost delay/limit mechanisms --
like how much a given vacuum contributes towards the limit, or how
much it was delayed during a given run. One theory was that if we are
seeing a lot of slowdown due to cost limiting, we should more heavily
weight smaller tables in our priority list of tables to vacuum,
versus larger tables, which we expect to exacerbate the situation.

I've also thought it'd be nice for users to have an easy way to
guesstimate what percentage of a table is frozen (analogous to live
vs. dead tuples in pg_stat_all_tables), but this seems difficult to
maintain accurately. We had a similar thing with tracking the clock
time of vacuums; just keeping the duration of the last vacuum ended
up being insufficient for some cases, so we ended up tracking it
historically... we haven't quite yet designed a pg_stat_vacuums a la
pg_stat_statements, but it has crossed our minds.

Robert Treat
https://xzilla.net
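One way to make the "weight smaller tables more heavily" theory
concrete, using only statistics that exist today (the scoring formula
below is a guess for illustration, not something from the tools
Robert describes):

    -- Rank vacuum candidates by dead tuples per byte, so that under
    -- a tight I/O budget the cheap, high-payoff tables sort first.
    SELECT s.relid::regclass AS table_name,
           s.n_dead_tup,
           pg_total_relation_size(s.relid) AS total_bytes,
           s.n_dead_tup::float8
               / greatest(pg_total_relation_size(s.relid), 1)
               AS priority_score
      FROM pg_stat_all_tables s
     ORDER BY priority_score DESC;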
On Wed, Jan 26, 2022 at 6:46 PM Greg Stark <stark@mit.edu> wrote:
> One thing I've been wanting to do something about is I think
> autovacuum needs to be a little cleverer about when *not* to vacuum
> a table because it won't do any good.

I agree.

> I was thinking of just putting a check in before kicking off a
> vacuum and if the globalxmin is a significant fraction of the
> distance to the relfrozenxid then instead log a warning. Basically
> it means "we can't keep the bloat below the threshold due to the
> idle transactions et al, not because there's insufficient i/o
> bandwidth".

Unfortunately, XID distances don't tell us much, because the dead
tuples need not be uniformly distributed across the XID space. In
fact, it seems highly likely that they will be very non-uniformly
distributed, with a few transactions having created a lot of dead
tuples and most having created none. Therefore, it's pretty plausible
that even a vacuum that only permits relfrozenxid to advance by one
could solve every problem we have -- say, if nearly all of the dead
tuples were created by transactions already older than the held-back
globalxmin, and are thus removable anyway.

If we knew something about the distribution of dead XIDs in the
table, then we could make an intelligent judgement about whether
vacuuming would be useful. But otherwise I feel like we're just
guessing, so instead of really fixing the problem we'll just be
making it happen in a set of cases that's even harder to grasp.

--
Robert Haas
EDB: http://www.enterprisedb.com