Thread: vacuum_cost_page_miss default value and modern hardware
vacuum_cost_page_miss has a default value of 10, while vacuum_cost_page_dirty has a default value of 20. This has been the case since cost-based delays were introduced by commit f425b605f4e back in 2004. The obvious implication is that dirtying a page is on average only twice as expensive as a single shared_buffers miss (one that doesn't dirty the page). While that might have made sense back in 2004, when magnetic disks were the norm, it seems questionable now.

The trend these days is that the total number of dirty pages is the limiting factor for OLTP workloads. This is a broad trend among all disk-based RDBMSs with an ARIES-style design, and is more or less a result of long-term trends in main memory size scaling and flash storage. It's rare for OLTP workloads to be truly I/O bound, and yet the backpressure from page cleaning/checkpointing becomes a bottleneck. In short, writes are far more expensive than reads -- the *relative* cost of writes has increased significantly (our use of the OS FS cache makes this even worse).

I suspect that this trend will become even more important for Postgres in the coming years, but that's not what I want to talk about right now. I just want to talk about vacuum_cost_page_miss on this thread.

Simply decreasing vacuum_cost_page_dirty seems like a low risk way of making the VACUUM costing more useful within autovacuum workers. Halving vacuum_cost_page_dirty to 5 would be a good start, though I think that a value as low as 2 would be better. That would make it only 2x vacuum_cost_page_hit's default (i.e. 2x the cost of processing a page that is in shared_buffers but did not need to be dirtied), which seems sensible to me when considered in the context in which the value is actually applied (and not some abstract theoretical context).
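To make the cost arithmetic concrete, here is a simplified Python model of the cost-based delay accounting (loosely modeled on vacuum_delay_point() in the server -- an illustrative sketch, not the actual C implementation). It assumes the documented defaults of vacuum_cost_limit = 200 and autovacuum_vacuum_cost_delay = 2ms:

```python
# A simplified model of PostgreSQL's cost-based vacuum delay accounting.
# VACUUM accumulates cost per page processed and sleeps for cost_delay
# each time the accumulated cost reaches cost_limit.

COST_LIMIT = 200      # vacuum_cost_limit default
COST_DELAY_MS = 2.0   # autovacuum_vacuum_cost_delay default since v12
BLCKSZ = 8192         # default page size in bytes

def pages_between_sleeps(page_cost):
    """Pages of one kind processed per cost_delay sleep."""
    return COST_LIMIT / page_cost

def throughput_ceiling_mb_s(page_cost):
    """Upper bound on VACUUM throughput if every page has this cost."""
    bytes_per_sec = pages_between_sleeps(page_cost) * BLCKSZ / (COST_DELAY_MS / 1000.0)
    return bytes_per_sec / (1024 * 1024)

# Defaults: vacuum_cost_page_hit=1, vacuum_cost_page_miss=10,
# vacuum_cost_page_dirty=20
for name, cost in [("hit", 1), ("miss", 10), ("dirty", 20)]:
    print(f"{name:5s} cost={cost:2d}: {pages_between_sleeps(cost):5.0f} "
          f"pages/sleep, ~{throughput_ceiling_mb_s(cost):.0f} MB/s ceiling")
```

The model shows why the relative costs matter so much: with the defaults, an all-miss scan is throttled to a fifth of the rate of an all-hit scan, and an all-dirty scan to a tenth.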
There are a few reasons why this seems like a good idea now:

* Throttling/delaying VACUUM is only useful as a way of smoothing the impact on production queries, which is an important goal, but currently we don't discriminate very well against the cost that we really should keep under control (dirtying new pages within VACUUM).

This is due to the aforementioned trends, the use of a strategy ring buffer by VACUUM, the fact that indexes are usually vacuumed in sequential physical order these days, and many other things that were not a factor in 2004.

* There is a real downside to throttling VACUUM unnecessarily, and the effects are *non-linear*. On a large table, the oldest xmin cutoff may become very old by the time we're only (say) half way through the initial table scan in lazy_scan_heap(). There may be relatively little work to do because most of the dead tuples won't be before the oldest xmin cutoff by that time (VACUUM just cannot keep up).

Excessive throttling for simple page misses may actually *increase* the amount of I/O that VACUUM has to do over time -- we should try to get quickly to the pages that actually need to be vacuumed, which are probably already dirty anyway (and thus are probably going to add little to the cost delay limit in practice). Everything is connected to everything else.

* vacuum_cost_page_miss is very much not like random_page_cost, and the similar names confuse the issue -- this is not an optimization problem. Thinking about VACUUM as unrelated to the workload itself is obviously wrong. Changing the default is also an opportunity to clear that up.

Even if I am wrong to suggest that a miss within VACUUM should only be thought of as 2x as expensive as a hit in some *general* sense, I am concerned about *specific* consequences. There is no question about picking the best access path here -- we're still going to have to access the same blocks in the same way sooner or later.
In general I think that we should move in the direction of more frequent, cheaper VACUUM operations [1], though we've already done a lot of work in that direction (e.g. the freeze map work).

* Some impact from VACUUM on query performance may in fact be a good thing.

Deferring the cost of vacuuming can only make sense if we'll eventually be able to catch up because we're experiencing a surge in demand, which seems kind of optimistic -- it seems more likely that the GC debt will just grow and grow. Why should the DBA not expect to experience some kind of impact, which could be viewed as a natural kind of backpressure? The most important thing is consistent performance.

* Other recent work such as the vacuum_cleanup_index_scale_factor patch has increased the relative cost of index vacuuming in some important cases: we don't have a visibility/freeze map for indexes, but index vacuuming that doesn't dirty any pages and has a TID kill list that's concentrated at the end of the heap/table is pretty cheap (the TID binary search is cache efficient/cheap). This change will help these workloads by better reflecting the way in which index vacuuming can be cheap for append-only tables with a small amount of garbage for recently inserted tuples that also got updated/deleted.

* Lowering vacuum_cost_page_miss's default (as opposed to changing something else) is a simple and less disruptive way of achieving these goals.

This approach seems unlikely to break existing VACUUM-related custom settings from current versions that get reused on upgrade. I expect little impact on small installations.

[1] https://postgr.es/m/CAD21AoD0SkE11fMw4jD4RENAwBMcw1wasVnwpJVw3tVqPOQgAw@mail.gmail.com

--
Peter Geoghegan
On Mon, Dec 28, 2020 at 5:17 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> Simply decreasing vacuum_cost_page_dirty seems like a low risk way of making the VACUUM costing more useful within autovacuum workers. Halving vacuum_cost_page_dirty to 5 would be a good start, though I think that a value as low as 2 would be better. That would make it only 2x vacuum_cost_page_hit's default (i.e. 2x the cost of processing a page that is in shared_buffers but did not need to be dirtied), which seems sensible to me when considered in the context in which the value is actually applied (and not some abstract theoretical context).

Perhaps you meant to decrease vacuum_cost_page_miss instead of vacuum_cost_page_dirty?

> There are a few reasons why this seems like a good idea now:
>
> * Throttling/delaying VACUUM is only useful as a way of smoothing the impact on production queries, which is an important goal, but currently we don't discriminate against the cost that we really should keep under control (dirtying new pages within VACUUM) very well.
>
> This is due to the aforementioned trends, the use of a strategy ring buffer by VACUUM, the fact that indexes are usually vacuumed in sequential physical order these days, and many other things that were not a factor in 2004.
>
> * There is a real downside to throttling VACUUM unnecessarily, and the effects are *non-linear*. On a large table, the oldest xmin cutoff may become very old by the time we're only (say) half way through the initial table scan in lazy_scan_heap(). There may be relatively little work to do because most of the dead tuples won't be before the oldest xmin cutoff by that time (VACUUM just cannot keep up).
> Excessive throttling for simple page misses may actually *increase* the amount of I/O that VACUUM has to do over time -- we should try to get to the pages that actually need to be vacuumed quickly, which are probably already dirty anyway (and thus are probably going to add little to the cost delay limit in practice). Everything is connected to everything else.
>
> * vacuum_cost_page_miss is very much not like random_page_cost, and the similar names confuse the issue -- this is not an optimization problem. Thinking about VACUUM as unrelated to the workload itself is obviously wrong. Changing the default is also an opportunity to clear that up.
>
> Even if I am wrong to suggest that a miss within VACUUM should only be thought of as 2x as expensive as a hit in some *general* sense, I am concerned about *specific* consequences. There is no question about picking the best access path here -- we're still going to have to access the same blocks in the same way sooner or later. In general I think that we should move in the direction of more frequent, cheaper VACUUM operations [1], though we've already done a lot of work in that direction (e.g. freeze map work).

I agree with that direction.

> * Some impact from VACUUM on query performance may in fact be a good thing.
>
> Deferring the cost of vacuuming can only make sense if we'll eventually be able to catch up because we're experiencing a surge in demand, which seems kind of optimistic -- it seems more likely that the GC debt will just grow and grow. Why should the DBA not expect to experience some kind of impact, which could be viewed as a natural kind of backpressure? The most important thing is consistent performance.
> * Other recent work such as the vacuum_cleanup_index_scale_factor patch has increased the relative cost of index vacuuming in some important cases: we don't have a visibility/freeze map for indexes, but index vacuuming that doesn't dirty any pages and has a TID kill list that's concentrated at the end of the heap/table is pretty cheap (the TID binary search is cache efficient/cheap). This change will help these workloads by better reflecting the way in which index vacuuming can be cheap for append-only tables with a small amount of garbage for recently inserted tuples that also got updated/deleted.
>
> * Lowering vacuum_cost_page_miss's default (as opposed to changing something else) is a simple and less disruptive way of achieving these goals.
>
> This approach seems unlikely to break existing VACUUM-related custom settings from current versions that get reused on upgrade. I expect little impact on small installations.

I recall the discussion about decreasing the default value of autovacuum_cost_delay from 20ms to 2ms for PostgreSQL 12. I re-read that discussion, but there was no discussion of changing the hit/miss/dirty values.

Whereas the change we made to autovacuum_cost_delay affects every installation, lowering vacuum_cost_page_miss would have a different impact depending on the workload, database size, and so on. For example, the user could see a larger I/O spike when the database doesn't fit in the server's RAM and we're vacuuming cold tables/indexes -- for example, during an anti-wraparound vacuum.

Lowering vacuum_cost_page_miss basically makes sense to me. But I'm a bit concerned that cheaper hardware with a small amount of RAM would be more likely to be affected by this change. Since the database doesn't fit in the server's RAM, pages are unlikely to be in either shared buffers or the OS page cache.
Since PostgreSQL's default values seem conservative to me (which is okay), I think there might be an argument that this change could lead to trouble in the kind of cheaper environment that PostgreSQL's default values are taking care of.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
On Wed, Jan 6, 2021 at 5:39 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Perhaps you meant to decrease vacuum_cost_page_miss instead of vacuum_cost_page_dirty?

You're right. Evidently I didn't write this email very carefully. Sorry about that.

To say it again: I think that a miss (without dirtying the page) should be cheaper than dirtying a page. This thread began because I wanted to discuss the relative cost of different kinds of I/O operations within VACUUM, without necessarily discussing the absolute costs/time delays.

--
Peter Geoghegan
On Wed, Jan 6, 2021 at 5:39 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> [...]
>
> Lowering vacuum_cost_page_miss basically makes sense to me. But I'm a bit concerned that cheaper hardware with a small amount of RAM would be more likely to be affected by this change. Since the database doesn't fit in the server's RAM, pages are unlikely to be in either shared buffers or the OS page cache.
> Since PostgreSQL's default values seem conservative to me (which is okay), I think there might be an argument that this change could lead to trouble in the kind of cheaper environment that PostgreSQL's default values are taking care of.

More concretely, we could perhaps lower vacuum_cost_page_miss to 5. It has had 10 as its default since 2004 (just like vacuum_cost_page_dirty, whose default has also not been changed since the start). These defaults were decided at a time when nbtree VACUUM could do lots of random I/O, there was no visibility map, etc. So this refresh is not just about hardware.

A useful consequence of halving vacuum_cost_page_miss's default setting is that we still get a "built-in delay" when an I/O miss is naturally slow (which is really hard to predict). On the other hand, dirtying lots of pages within VACUUM is dangerous -- VACUUM cannot be allowed to create "I/O debt" very quickly, which is quite possible due to the lack of any "natural backpressure".

--
Peter Geoghegan
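The "built-in delay" point can be illustrated with some assumed storage latencies (the latency figures below are hypothetical, not from the thread): when a miss is genuinely slow, the real read latency dwarfs the amortized throttle sleep regardless of the GUC, so lowering the miss cost mostly affects the case where reads are cheap anyway.

```python
# Sketch: effective time VACUUM spends per missed page, modeled as the
# real read latency plus the amortized share of the cost-based sleep.
# Assumes vacuum_cost_limit = 200 and autovacuum_vacuum_cost_delay =
# 2ms (the documented defaults); latencies are illustrative only.

def per_miss_ms(read_latency_ms, cost_miss, cost_limit=200, cost_delay_ms=2.0):
    throttle_ms = cost_delay_ms * cost_miss / cost_limit  # amortized sleep per miss
    return read_latency_ms + throttle_ms

# A fast sequential read served from the OS cache vs. a slow random read:
for latency in (0.05, 5.0):
    old = per_miss_ms(latency, cost_miss=10)
    new = per_miss_ms(latency, cost_miss=5)
    print(f"{latency}ms read: {old:.3f}ms/page (miss=10) vs {new:.3f}ms/page (miss=5)")
```

With a slow read the storage itself dominates either way, which is the sense in which slow I/O "naturally provides" throttling.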
On Wed, Jan 6, 2021 at 7:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> More concretely, we could perhaps lower vacuum_cost_page_miss to 5. It has had 10 as its default since 2004 (just like vacuum_cost_page_dirty, whose default has also not been changed since the start). These defaults were decided at a time when nbtree VACUUM could do lots of random I/O, there was no visibility map, etc. So this refresh is not just about hardware.

Attached patch lowers vacuum_cost_page_miss to 3. I think that this change in the default is both likely to be helpful in medium to large installations, and unlikely to cause harm in small installations. If I/O for reads made by VACUUM is naturally very slow (even in the common case where it's entirely sequential), then that will naturally provide additional throttling.

I will add this to the final CF for Postgres 14.

Thanks

--
Peter Geoghegan
On Thu, Jan 14, 2021 at 1:24 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> Attached patch lowers vacuum_cost_page_miss to 3. I think that this change in the default is both likely to be helpful in medium to large installations, and unlikely to cause harm in small installations. If I/O for reads made by VACUUM is naturally very slow (even in the common case where it's entirely sequential), then that will naturally provide additional throttling.

+1 for this in principle.

Do you have any actual metrics behind specifically choosing the value 3? Or is that off a gut feeling?

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
On Thu, Jan 14, 2021 at 12:29 PM Magnus Hagander <magnus@hagander.net> wrote:
> +1 for this in principle.

I'm not opposed to this change, and I agree that the relative expense of dirtying a page is higher than what the current defaults suggest. So I also think Peter is going in the right general direction, though like you I am not sure about the specifics.

In practice, most users get hosed by not vacuuming aggressively enough, rather than by vacuuming too aggressively. For instance, suppose you have a table and workload such that the table needs to be vacuumed once per hour to maintain good performance. As you make the table bigger and bigger, you will eventually reach a size where the configured cost limits aren't high enough to permit this to happen. The system has no option to disregard the configured limit, even for an emergency autovacuum. Eventually the user is forced into an outage, either by the table becoming so bloated that VACUUM FULL is required, or by running out of XIDs.

It seems bad that we ship a set of default settings that are guaranteed to hose any database with a reasonable number of updates once the database size exceeds some limit. The fact that we decreased autovacuum_cost_delay by 10x increased the limit by 10x, which is good, but the problem remains. I don't know exactly how to do better, and any proposal in that area would be much more complicated than what Peter is proposing here, but it's something to think about.

There are also users I've seen get hosed by vacuuming too aggressively. I have seen this happen in two ways. One is too much dirty data. The other is too much read I/O, pushing hot data out of the cache, leading to a storm of random I/O later when the foreground workload needs to get that stuff back, basically killing the system, sometimes for hours, while it tries to get back the stuff it lost.
That might seem like an argument against further raising the possible I/O rate, which would be the effect of the change Peter is proposing, but that's not really my position. I think the bigger problem with all this is that it's too hard to configure; almost nobody can work out what a given set of configuration parameters actually means in MB/s or GB/hour. In the past I've proposed that maybe we should redesign this whole system to work in those kinds of units, which people actually understand, but I don't know if that's the right idea.

Still another approach would be to try to reduce the degree to which the cache gets trashed, or to make it have a less harmful effect on future performance by reading things back in more efficiently. I don't really know.

TL;DR: This change is fine with me, but this whole system has much deeper issues.

--
Robert Haas
EDB: http://www.enterprisedb.com
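The MB/s framing can be made concrete with a small back-of-the-envelope helper. This is a hypothetical calculator based only on the documented accounting (VACUUM sleeps for cost_delay once the accumulated cost reaches cost_limit), not anything proposed in the thread, and it ignores the fact that real throughput is also bounded by the storage itself:

```python
# Hypothetical calculator turning the vacuum cost GUCs into an
# approximate throughput ceiling for a given mix of page outcomes.

BLCKSZ = 8192  # default page size in bytes

def vacuum_mb_per_s(hit_frac, miss_frac, dirty_frac,
                    cost_hit=1, cost_miss=10, cost_dirty=20,
                    cost_limit=200, cost_delay_ms=2.0):
    # Average cost charged per page for this mix of outcomes.
    avg_cost = hit_frac * cost_hit + miss_frac * cost_miss + dirty_frac * cost_dirty
    # One cost_delay sleep per cost_limit of accumulated cost.
    pages_per_sleep = cost_limit / avg_cost
    return pages_per_sleep * BLCKSZ / (cost_delay_ms / 1000.0) / (1024 * 1024)

# A cold anti-wraparound scan (mostly misses, little dirtying), before
# and after lowering vacuum_cost_page_miss from 10 to 3:
before = vacuum_mb_per_s(0.1, 0.8, 0.1)
after = vacuum_mb_per_s(0.1, 0.8, 0.1, cost_miss=3)
print(f"before: {before:.0f} MB/s ceiling, after: {after:.0f} MB/s ceiling")
```

With the page mix fractions chosen here (an assumption for illustration), the change roughly doubles the throttling ceiling for a read-heavy scan while leaving a dirty-heavy scan's ceiling essentially unchanged.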
On Thu, Jan 14, 2021 at 9:29 AM Magnus Hagander <magnus@hagander.net> wrote:
> Do you have any actual metrics behind specifically choosing the value 3? Or is that off a gut feeling?

I have no metrics, exactly, but I'm sure that the trend I mentioned about page cleaning/dirtying being the bottleneck more and more these days is true. This trend seems very apparent to everyone, so I am sure that I basically have the right idea here. If anything, I'm a little concerned that the default should actually be lowered to 2.

With that said, I don't actually accept what seems to be the original premise of these GUCs, so I am not interested in using that to justify changing the vacuum_cost_page_miss default. The premise seems to be: VACUUM's behavior is determined by treating it as an optimization problem, so all you as the DBA need to do is characterize the cost of each kind of elementary operation using the GUCs -- the dynamic algorithm will do the rest. What algorithm might that be, though? This is not the optimizer, and there is no scope to come up with a cheaper plan for VACUUM. Why not throttle longer running queries instead, or as well?

More on the first principles of the costing stuff in a bit, when I respond to Robert...

--
Peter Geoghegan
On Thu, Jan 14, 2021 at 10:42 AM Robert Haas <robertmhaas@gmail.com> wrote:
> There are also users I've seen get hosed by vacuuming too aggressively. I have seen this happen in two ways. One is too much dirty data. The other is too much read I/O, pushing hot data out of the cache, leading to a storm of random I/O later when the foreground workload needs to get that stuff back, basically killing the system, sometimes for hours, while it tries to get back the stuff it lost.

> TL;DR: This change is fine with me, but this whole system has much deeper issues.

It seems like there is a good chance that the customers of yours who complained about the read I/O (which was not accompanied by dirtying) were really bothered by all of their indexes being read by VACUUM. The freeze map is probably quite effective as far as that goes, but the burden of index vacuuming is what they tend to notice. This is perhaps made worse by the sudden, infrequent nature of the index vacuuming against a big append-only or append-mostly table.

I imagine that the problem here is that we're doing index vacuuming when we shouldn't be -- these customers basically had it right. Their intuition that this is unnecessary is in fact the right one. How can it be okay to vacuum an index when the table only has 10 dead tuples (just to freeze some pages at the end of the table)? That's ridiculous. And it has nothing to do with these settings. (Even if I'm wrong to suggest that that was what it was, I think that the details and nuance of what actually happened are likely to be important.)

We should be avoiding index vacuuming in many more cases. If there are only a tiny number of garbage index tuples, then we really shouldn't bother (once again, please feel free to weigh in on Masahiko's patch over on the "New IndexAM API controlling index vacuum strategies" thread -- that's very interesting work IMV). Bottom-up index deletion creates far more opportunities for this kind of stuff to naturally occur.
It will now do ~99.9% of garbage tuple cleanup in indexes that naturally use it all the time. We can expect that intuitions held by DBAs with experience of other RDBMSs will start to have more predictive power when they think about Postgres and VACUUM, which seems like a very good thing (and something that we can and should continue to build on). Roughly speaking, we ought to look for more and more ways to make the physical representation of the data closer to the logical contents of the database (that's what these DBAs start with, that's where the intuitions seem to come from, which actually makes perfect sense).

Now back to the main topic, the GUC's default value. I believe that your experiences here (the experiences in both directions) are representative -- I think I've heard of all that myself. Like you, I think that insufficient vacuuming is much more common than excessive vacuuming. You do still have some cases where an excessive amount of I/O from VACUUM (without any dirtying) is the issue (or at least *seems* to be the issue, per my aside). I think that I have a high level theory that is consistent with what you say and may suggest a better direction for us, but it's tricky. I'll try to resolve the apparent contradictions in my own arguments as I go (I'm a little burnt out at the moment, so please indulge me).

I think that The Real Problem is *not* that it's too hard to tune this stuff as a practical matter, exactly. The entire premise of these costing parameters is that the DBA can and should make a trade-off between query response time/workload throughput and vacuuming, as if these two things were separate constituents that are simply unrelated. That sounds completely wrong to me. It sounds so wrong that I can't go down that mental path for more than 5 seconds without giving up on it. Are we really expected to believe that in general VACUUM probably has all the time in the world, and so should proceed at a leisurely pace?
It's almost as if the original designer imagined that the IT department should be made to wait on the result of one of those "VACUUM bigtable;" reports that they seem to be so keen on (other queries are running that deliver real business value, after all). I'm only half-joking here -- anybody reading this should now take a moment to actively consider just how little sense any of this makes. It's so completely and implausibly wrong that it seems likely to actually be slightly right, if only by mistake.

There seems to be one important way in which the cost parameter design is accidentally useful: the page dirtying stuff probably works reasonably well. It really does make sense to throttle VACUUM in response to dirtying pages, optimistically assuming that VACUUM will eventually catch up. That part makes sense precisely because it treats VACUUM as a thing that is directly tied to the workload (an accidental happy exception to the bogus general rule for the costing stuff).

Of course, this optimism does not come from any belief that these situations will always work themselves out naturally (that's definitely not something we should *truly* assume). There is a deeper way in which it works as a heuristic, though: it covers all the possibilities at once (kind of), without truly predicting what will happen with the workload. We need to be adaptive whenever we think about GC/vacuuming, I believe -- worrying about the general or common case (for any value of common) is often a waste of time IME.

I'll now go through two different scenarios in which this same assumption works in our favor. I'll start with the simple case here:

It's possible that VACUUM is throttled consistently by dirtying pages, slowing it down considerably. The situation may naturally resolve itself in an uncomplicated way (e.g.
the bursty application calms down following a spike), in which case the original foundational design assumption I described works out in the obvious way. Easy.

Now for the more complicated case: VACUUM is throttled consistently by dirtying pages (if anything), but the database load does not and will not go down. We *still* end up doing approximately the right thing here (again, this is assuming you reduce vacuum_cost_page_miss to something like 3). The fact that we're dirtying new pages (when throttled) suggests that the garbage in these pages really is quite old (it's "floating garbage" in GC theory terms). So we don't want to slow everything down by dirtying ever-more pages. At the same time we want to get through already-dirty pages as quickly as possible, and also quickly move past pages not in shared buffers and not dirtied (those are relatively cheap on modern hardware). Opportunistic heap pruning eventually runs enough that VACUUM won't have to dirty most heap pages in a big table (in its first pass).

So dirty pages are debt that VACUUM can easily create, whereas buffer misses are paid directly by VACUUM. It is its own backpressure, for the most part. Making the costing stuff highly sensitive to dirtying pages (but not sensitive to much else) works out because it either avoids making a bad situation worse, or has no real additional downside when the system is completely overwhelmed (i.e. bottlenecked on cleaning dirty pages).

If it's just impossible for buffer cleaning to really keep up, then things will naturally slow down. This might even be seen as a natural thing -- workloads must live within their means. We should do our best to make it less likely that things slow down, by improving various important subsystems, whatever they might be (e.g. the io_uring stuff from Andres). But ultimately there is no way to reliably avoid having the system get overwhelmed by dirty pages.
The fact that such an outcome is always possible is no reason to double down on dubious top-down interventions from VACUUM. You cannot get out of debt by taking out another loan to pay off your existing creditors. -- Peter Geoghegan
On Thu, Jan 14, 2021 at 9:24 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jan 6, 2021 at 7:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > More concretely, we could perhaps lower vacuum_cost_page_miss to 5. It
> > has had 10 as its default since 2004 (just like vacuum_cost_page_dirty,
> > whose default has also not been changed since the start). These defaults
> > were decided at a time when nbtree VACUUM could do lots of random I/O,
> > there was no visibility map, etc. So this refresh is not just about
> > hardware.
>
> Attached patch lowers vacuum_cost_page_miss to 3. I think that this
> change in the default is both likely to be helpful in medium to large
> installations, and unlikely to cause harm in small installations. If
> I/O for reads made by VACUUM is naturally very slow (even in the
> common case where it's entirely sequential), then that will naturally
> provide additional throttling.

+1 for this change. Lowering to 2 also looks good to me.

Regards,
-- Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
On Thu, Jan 14, 2021 at 8:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> +1 for this change. Lowering to 2 also looks good to me.

I'm going to go ahead with committing my patch to lower the default next week. If anybody has any objections to that plan, please speak up.

It doesn't really need to be said again, but just to be clear: I share the concerns held by Magnus and Robert. It's certainly true that this change alone is no fix for the general problem I described. At the same time, I believe that the patch makes incremental progress in the right direction, without much risk.

-- Peter Geoghegan
On Thu, Jan 21, 2021 at 5:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I'm going to go ahead with committing my patch to lower the default
> next week. If anybody has any objections to that plan, please speak
> up.

Just pushed a commit that reduced the default for vacuum_cost_page_miss to 2.

One more thing on this: I noticed that the docs for these settings say "There are many situations where it is not important that maintenance commands like VACUUM and ANALYZE finish quickly". That now seems a little dubious to me -- I wonder if we should refresh it. There are not that many problems that can be solved by making VACUUM slower. This is due to the visibility map, and to a lesser degree certain key improvements in index AMs. The text that I quoted was written in 2005, a time when the delay stuff was still very new, and even the earliest visibility map design was still a few years away.

Thanks
-- Peter Geoghegan
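[Editor's note: for reference, the cost-delay settings discussed in this thread, with their stock values after this commit; this is an illustrative postgresql.conf fragment and the per-line comments are editorial.]

```ini
vacuum_cost_page_hit = 1            # page found in shared_buffers
vacuum_cost_page_miss = 2           # was 10 from 2004 until this commit
vacuum_cost_page_dirty = 20         # unchanged; dirtying remains the dominant cost
vacuum_cost_limit = 200             # nap once accumulated cost reaches this
autovacuum_vacuum_cost_delay = 2ms  # length of each nap in autovacuum workers
```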
On Thu, Jan 14, 2021 at 8:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
> So dirty pages are debt that VACUUM can easily create, whereas buffer
> misses are paid directly by VACUUM. It is its own backpressure, for
> the most part. Making the costing stuff highly sensitive to dirtying
> pages (but not sensitive to much else) works out because it either
> avoids making a bad situation worse, or has no real additional
> downside when the system is completely overwhelmed (i.e. bottlenecked
> on cleaning dirty pages).

This isn't really true. The cost of a buffer miss is not limited to the cost of reading the replacement buffer, a cost which is paid by VACUUM. It is also very often the cost of rereading the evicted buffer, which VACUUM does nothing about. Customers get hosed by VACUUM reading a lot of rarely-used data overnight and evicting all of the actually-hot data from cache. Then in the morning, when the workload picks up, the system starts trying to pull the stuff they actually need into memory one block at a time. Such a customer can go from a 99% hit rate on Monday morning to, say, 50% on Tuesday morning, which results in a fifty-fold increase in I/O, all of which is random I/O. The system basically grinds to a halt for hours.

It is fair to argue that perhaps such customers should invest in more and better hardware. In some cases, a customer who can fit 1% of their database in cache is relying on a 99% cache hit ratio, which is precarious at best. But they can sometimes get away with it until a large batch job like VACUUM gets involved.

-- Robert Haas
EDB: http://www.enterprisedb.com
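[Editor's note: the fifty-fold figure above follows from the fact that physical I/O scales with the miss rate (1 - hit rate), not the hit rate itself. A back-of-the-envelope check, with the read count made up purely for illustration:]

```python
# Physical reads are proportional to the miss rate: dropping from a 99%
# to a 50% buffer hit rate turns 1% misses into 50% misses -- a 50x jump.
reads = 1_000_000          # logical buffer reads in a morning (illustrative)
io_monday = reads // 100   # 99% hit rate -> 1% miss  ->  10,000 physical reads
io_tuesday = reads // 2    # 50% hit rate -> 50% miss -> 500,000 physical reads
print(io_tuesday / io_monday)  # 50.0
```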
On Thu, Jan 28, 2021 at 9:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 14, 2021 at 8:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > So dirty pages are debt that VACUUM can easily create, whereas buffer
> > misses are paid directly by VACUUM. It is its own backpressure, for
> > the most part. Making the costing stuff highly sensitive to dirtying
> > pages (but not sensitive to much else) works out because it either
> > avoids making a bad situation worse, or has no real additional
> > downside when the system is completely overwhelmed (i.e. bottlenecked
> > on cleaning dirty pages).
>
> This isn't really true. The cost of a buffer miss is not limited to
> the cost of reading the replacement buffer, a cost which is paid by
> VACUUM. It is also very often the cost of rereading the evicted
> buffer, which VACUUM does nothing about. Customers get hosed by VACUUM
> reading a lot of rarely-used data overnight and evicting all of the
> actually-hot data from cache.

Well, I did say "for the most part". In any case there is not much reason to think that throttling VACUUM on shared_buffers page misses can make very much difference in this scenario.

> It is fair to argue that perhaps such customers should invest in more
> and better hardware. In some cases, a customer who can fit 1% of their
> database in cache is relying on a 99% cache hit ratio, which is
> precarious at best. But, they can sometimes get away with it until a
> large batch job like VACUUM gets involved.

Actually, my first observation here is that VACUUM probably shouldn't do this at all. In other words, I agree with what I suspect your customer's intuition was, in a rather straightforward way: VACUUM really shouldn't be reading several large indexes in full when they have barely been modified in months or years -- that's the real problem. It ought to be possible to make big improvements in that area without changing the fundamental invariants. I am once again referring to the pending work on VACUUM from Masahiko.

-- Peter Geoghegan