Thread: decoupling table and index vacuum
Hi,

We are used to thinking about table vacuum and index vacuum as parts of a single, indivisible operation. You vacuum the table -- among other things by performing HOT pruning and remembering dead TIDs -- and then you vacuum the indexes -- removing the remembered TIDs from the index -- and then you vacuum the table some more, setting those dead TIDs unused -- and then you're done. And along the way you do some other things too like considering truncation that aren't relevant to the point I want to make here. Now, the problem with this is that every index has its own needs, which are separate from the needs of the table, as I think Peter Geoghegan and Masahiko Sawada were also discussing recently.

Opportunistic index cleanup strategies like kill_prior_tuple and bottom-up deletion may work much better for some indexes than others, meaning that you could have some indexes that badly need to be vacuumed because they are full of garbage, and other indexes on the same table where the opportunistic cleanup has worked perfectly and there is no need for vacuuming at all. Separately, the table may or may not need to get some dead pointers set back to unused to avoid table bloat.

But, as things stand today, strategy options to deal with such situations are limited. Leaving aside what the code actually does right now, let's talk about what options we have in theory with the technology as it exists now. They basically all boil down to stopping early and then redoing the work later. We must always start with a pass over the heap to collect dead TIDs, because otherwise there's nothing else to do. Now we can choose to stop, but then the next VACUUM will have to collect all those TIDs again. It may get to skip more all-visible pages than the current vacuum did, but the pages that still have dead TIDs will all have to be visited again. If we don't stop, then we can choose to vacuum all of the indexes or just some of them, and then afterwards stop. But if we do this, the next VACUUM will have to scan all indexes again for the same TIDs. Here, we don't even have the visibility map to allow skipping known-clean pages, so it's *really* a lot of work we have to redo. Thus what we normally do is press on to the third step, where we mark dead line pointers unused after scanning every index in its entirety, and now they're gone and we don't have to worry about them again. Barring emergency escape valves, as things stand today, the frequency of table vacuuming is the same as the frequency of index vacuuming, even though the *required* frequency of vacuuming is not the same, and also varies from index to index.

Now, the reason for this is that when we discover dead TIDs, we only record them in memory, not on disk. So, as soon as VACUUM ends, we lose all knowledge of what those TIDs were and must rediscover them. Suppose we didn't do this, and instead had a "dead TID" fork for each table.

Suppose further that this worked like a conveyor belt, similar to WAL, where every dead TID we store into the fork is assigned an identifying 64-bit number that is never reused. Then, suppose that for each index, we store the number of the oldest entry that might still need to be vacuumed from the index. Every time you perform what we now call the first heap pass of a VACUUM, you add the new TIDs you find to the dead TID fork. Every time you vacuum an index, the TIDs that need to be removed are those between the oldest-entry pointer for that index and the current end of the TID fork.
You remove all of those and then advance your oldest-entry pointer accordingly. If that's too many TIDs to fit in maintenance_work_mem, you can just read as many as will fit and advance your oldest-entry pointer less far. Every time you perform what we now call the second heap pass of a VACUUM, you find all the TIDs that precede every index's oldest-entry pointer and set them unused. You then throw away the associated storage at the OS level. This requires a scheme where relations can be efficiently truncated from the beginning rather than only at the end, which is why I said "a conveyor belt" and "similar to WAL". Details deliberately vague since I am just brainstorming here.

This scheme adds a lot of complexity, which is a concern, but it seems to me that it might have several benefits. One is concurrency. You could have one process gathering dead TIDs and adding them to the dead-TID fork while another process is vacuuming previously-gathered TIDs from some index. In fact, every index could be getting vacuumed at the same time, and different indexes could be removing different TID ranges. At the same time, you could have another process setting dead TIDs that all indexes have previously removed to unused. Furthermore, all of these operations can start in any order, and any of them can be repeated any number of times during a single run of any particular other one, or indeed, without that particular one ever being run at all. Both heap phases can also now be done in smaller chunks, if desired. You can gather TIDs from a portion of the table and remember where you left off, and come back and pick up from that point later, if you wish. You can likewise pick a subset of dead-TIDs-retired-from-all-indexes to set unused, and do just that many, and then at a later time come back and do some more.

Also, you can now make mostly-independent decisions about how to perform each of these operations, too. It's not completely independent: if you need to set some dead TIDs in the table to unused, you may have to force index vacuuming that isn't needed for bloat control. However, you only need to force it for indexes that haven't been vacuumed recently enough for some other reason, rather than every index. If you have a target of reclaiming 30,000 TIDs, you can just pick the indexes where there are fewer than 30,000 dead TIDs behind their oldest-entry pointers and force vacuuming only of those. By the time that's done, there will be at least 30,000 dead line pointers you can mark unused, and maybe more, minus whatever reclamation someone else did concurrently.

But is this worthwhile? I think it depends a lot on what you think the comparative required frequencies are for the various operations. If index A needs to be vacuumed every 40 minutes and index B needs to be vacuumed every 55 minutes and the table that owns both of them needs to be vacuumed every 70 minutes, I am not sure there is a whole lot here. I think you will be pretty much equally well off if you just do what we do today every 40 minutes and call it good. Also, you will not benefit very much if the driving force is reclaiming dead line pointers in the table itself. If that has to happen frequently, then the indexes have to be scanned frequently, and this whole thing is a lot of work for not much. But, maybe that's not the case. Suppose index A needs to be vacuumed every hour to avoid bloat, index B needs to be vacuumed every 4 hours to avoid bloat, and the table needs dead line pointers reclaimed every 5.5 hours.
Well, now you can gain a lot. You can vacuum index A frequently while vacuuming index B only as often as it needs, and you can reclaim dead line pointers on their own schedule based on whatever index vacuuming was already done for bloat avoidance. Without this scheme, there's just no way to give everybody what they need without some of the participants being "dragged along for the ride" and forced into work that they don't actually need done simply because "that's how it works."

One thing I don't know is whether the kind of scenario that I describe above is common, i.e. is the main reason we need to vacuum to control index bloat, where this kind of approach seems likely to help, or is it to reclaim dead line pointers in the heap, where it's not? I'd be interested in hearing from people who have some experience in this area, or at least better intuition than I do.

I'm interested in this idea partly because I think it would be much more likely to help in a hypothetical world where we had global indexes. Imagine a partitioned table where each partition has a local index and then there is also a global index which indexes tuples from every partition. Waving away the difficulty of making such a thing work, there's a vacuuming problem here, which has been discussed before. In short, if you tried to handle this in the naive way, you'd end up having to vacuum the global index every time you vacuumed any partition. That sucks. Suppose that there are 1000 partitions, each partition is 1GB, and each local index is 50MB. All things being equal, the global index should end up being about as big as all of the local indexes put together, which in this case would be 50GB. Clearly, we do not want to vacuum each partition by scanning the 1GB partition + the 50MB local index + the 50GB global index. That's insane. With the above system, since everything's decoupled, you can vacuum the partition tables individually as often as required, and similarly for their local indexes, but put off vacuuming the global index until you've vacuumed a bunch, maybe all, of the partitions, so that the work of cleaning up the global index cleans up dead TIDs from many or all partitions instead of just one at a time.

Now, the fly in the ointment here is that this supposes that we don't get forced into vacuuming the global index too quickly because of dead line pointer accumulation. But, I think if that does happen, with careful scheduling, we might not really be worse off than we would have been without partitioning. If we scan the table for just one partition and, say, exhaust maintenance_work_mem, we have some expensive index vacuuming to do immediately, but that would've also happened in pretty much the same way with an unpartitioned table. If we don't fill maintenance_work_mem but we do notice that the table for this partition is full of dead line pointers that we need to reclaim, we can still choose to scan some other partitions and collect some more dead TIDs before cleaning the global index. That could delay actually getting those line pointers reclaimed, but an unpartitioned table would have suffered from at least as much delay, because it wouldn't even consider the possibility of stopping before scanning every available table page, and we could choose to stop after dealing with only some partitions but not all. It's probably tricky to get the autovacuum algorithm right here, but there seems to be some room for optimism.

Even if global indexes never happened, though, I think this could have other benefits.
For example, the wraparound failsafe mechanism recently added by Masahiko Sawada and Peter Geoghegan bypasses index vacuuming when wraparound danger is imminent. The only problem is that making that decision means throwing away the accumulated list of dead TIDs, which then need to be rediscovered whenever we get around to vacuuming the indexes. But that's avoidable, if they're stored on disk rather than in RAM.

One rather serious objection to this whole line of attack is that we'd ideally like VACUUM to reclaim disk space without using any more, in case running out of disk space is the motivation for running VACUUM in the first place. A related objection is that if it's sometimes agreeable to do everything all at once as we currently do, the I/O overhead could be avoided. I think we'd probably have to retain a code path that buffers the dead TIDs in memory to account, at least, for the low-on-disk-space case, and maybe that can also be used to avoid I/O in some other cases, too. I haven't thought through all the details here. It seems to me that the actual I/O avoidance is probably not all that much - each dead TID is much smaller than the deleted tuple that gave rise to it, and typically you don't delete all the tuples at once - but it might be material in some cases, and it's definitely material if you don't have enough disk space left for it to complete without error.

Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com
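To make the conveyor-belt bookkeeping described in the mail above concrete, here is a minimal C sketch. It is purely illustrative -- the type and field names are hypothetical and nothing here comes from a patch: each dead TID appended to the fork gets a 64-bit position that is never reused, each index remembers the oldest position it might still need to remove, and anything older than the minimum of those per-index positions can be set LP_UNUSED in the heap.

#include <stdint.h>
#include <stdio.h>

/* Illustration only -- hypothetical names, not PostgreSQL code. */

typedef struct DeadTid
{
    uint32_t    block;      /* heap block number */
    uint16_t    offset;     /* line pointer offset within that block */
} DeadTid;

#define BELT_SIZE 1024

typedef struct DeadTidBelt
{
    DeadTid     entries[BELT_SIZE];    /* the "conveyor belt" of dead TIDs */
    uint64_t    next_pos;              /* 64-bit position of the next entry; never reused */
} DeadTidBelt;

/* One per index: oldest belt position that might still need vacuuming. */
typedef struct IndexVacState
{
    const char *index_name;
    uint64_t    oldest_unvacuumed;
} IndexVacState;

/* First heap pass: append a newly discovered dead TID to the belt. */
static uint64_t
belt_append(DeadTidBelt *belt, DeadTid tid)
{
    belt->entries[belt->next_pos % BELT_SIZE] = tid;
    return belt->next_pos++;
}

/*
 * Second heap pass: every entry before the minimum of the per-index
 * pointers has already been removed from every index, so those dead
 * line pointers can be set LP_UNUSED and the storage truncated away
 * from the front.
 */
static uint64_t
belt_reclaim_horizon(const IndexVacState *idx, int nindexes)
{
    uint64_t    horizon = UINT64_MAX;

    for (int i = 0; i < nindexes; i++)
        if (idx[i].oldest_unvacuumed < horizon)
            horizon = idx[i].oldest_unvacuumed;
    return horizon;
}

int
main(void)
{
    DeadTidBelt     belt = {.next_pos = 0};
    IndexVacState   idx[] = {{"index_a", 0}, {"index_b", 0}};

    for (uint32_t blk = 0; blk < 5; blk++)
        belt_append(&belt, (DeadTid) {blk, 1});

    idx[0].oldest_unvacuumed = 5;   /* index_a vacuumed through position 5 */
    idx[1].oldest_unvacuumed = 2;   /* index_b lagging at position 2 */

    printf("dead TIDs before belt position %llu can be set LP_UNUSED\n",
           (unsigned long long) belt_reclaim_horizon(idx, 2));
    return 0;
}

The real thing would of course live on disk and be truncatable from the front, as the mail describes; the sketch only shows the pointer arithmetic that decouples index vacuuming from heap vacuuming.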
Hi, On 2021-04-21 11:21:31 -0400, Robert Haas wrote: > Opportunistic index cleanup strategies like > kill_prior_tuple and bottom-up deletion may work much better for some > indexes than others, meaning that you could have some indexes that > badly need to be vacuumed because they are full of garbage, and other > indexes on the same table where the opportunistic cleanup has worked > perfectly and there is no need for vacuuming at all. Partial indexes are another case that can lead to individual indexes being nearly free of bloat while others are severely bloated. > This requires a scheme where relations can be efficiently truncated > from the beginning rather than only at the end, which is why I said "a > conveyor belt" and "similar to WAL". Details deliberately vague since > I am just brainstorming here. I'm not sure that's the only way to deal with this. While some form of generic "conveyor belt" infrastructure would be a useful building block, and it'd be sensible to use it here if it existed, it seems feasible to store dead TIDs in a different way here. You could e.g. have per-heap-vacuum files with a header containing LSNs that indicate the age of the contents. > This scheme adds a lot of complexity, which is a concern, but it seems > to me that it might have several benefits. One is concurrency. You > could have one process gathering dead TIDs and adding them to the > dead-TID fork while another process is vacuuming previously-gathered > TIDs from some index. I think it might even open the door to using multiple processes gathering dead TIDs for the same relation. > In fact, every index could be getting vacuumed at the same time, and > different indexes could be removing different TID ranges. We kind of have this feature right now, due to parallel vacuum... > It's not completely independent: if you need to set some dead TIDs in > the table to unused, you may have to force index vacuuming that isn't > needed for bloat control. However, you only need to force it for > indexes that haven't been vacuumed recently enough for some other > reason, rather than every index. Hm - how would we know how recently that TID has been marked dead? We don't even have xids for dead ItemIds... Maybe you're intending to answer that in your next paragraph, but it's not obvious to me that'd be sufficient... > If you have a target of reclaiming 30,000 TIDs, you can just pick the > indexes where there are fewer than 30,000 dead TIDs behind their > oldest-entry pointers and force vacuuming only of those. By the time > that's done, there will be at least 30,000 dead line pointers you can > mark unused, and maybe more, minus whatever reclamation someone else > did concurrently. One thing that you didn't mention so far is that this'd allow us to add dead TIDs to the "dead tid" file outside of vacuum too. In some workloads most of the dead tuple removal happens as part of on-access HOT pruning. While some indexes are likely to see that via the killtuples logic, others may not. Being able to have more aggressive index vacuum for the one or two bloated indexes, without needing to rescan the heap, seems like it'd be a significant improvement. > Suppose index A needs to be vacuumed every hour to avoid bloat, index > B needs to be vacuumed every 4 hours to avoid bloat, and the table > needs dead line pointers reclaimed every 5.5 hours. Well, now you can > gain a lot. 
> You can vacuum index A frequently while vacuuming index B > only as often as it needs, and you can reclaim dead line pointers on > their own schedule based on whatever index vacuuming was already done > for bloat avoidance. Without this scheme, there's just no way to give > everybody what they need without some of the participants being > "dragged along for the ride" and forced into work that they don't > actually need done simply because "that's how it works." Have you thought about how we would do the scheduling of vacuums for the different indexes? We don't really have useful stats for the number of dead index entries to be expected in an index. It'd not be hard to track how many entries are removed in an index via killtuples, but e.g. estimating how many dead entries there are in a partial index seems quite hard (at least without introducing significant overhead). > One thing I don't know is whether the kind of scenario that I describe > above is common, i.e. is the main reason we need to vacuum to control > index bloat, where this kind of approach seems likely to help, or is > it to reclaim dead line pointers in the heap, where it's not? I'd be > interested in hearing from people who have some experience in this > area, or at least better intuition than I do. I think doing something like this has a fair bit of potential. Being able to perform freezing independently of index scans, without needing to scan the table again to re-discover dead line item pointers seems like it'd be a win. More aggressive/targeted index vacuum in cases where most tuples are removed via HOT pruning seems like a win. Not having to restart from scratch after a cancelled autovacuum would be a win. Additional parallelization seems like a win... > One rather serious objection to this whole line of attack is that we'd > ideally like VACUUM to reclaim disk space without using any more, in > case the motivation for running VACUUM in the first place. I suspect we'd need a global limit of space used for this data. If above that limit we'd switch to immediately performing the work required to remove some of that space. > A related objection is that if it's sometimes agreable to do > everything all at once as we currently do, the I/O overhead could be > avoided. I think we'd probably have to retain a code path that buffers > the dead TIDs in memory to account, at least, for the > low-on-disk-space case, and maybe that can also be used to avoid I/O > in some other cases, too. We'd likely want to do some batching of insertions into the "dead tid" map - which'd probably end up looking similar to a purely in-memory path anyway. Greetings, Andres Freund
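For what it's worth, the per-heap-vacuum-file alternative mentioned above could look roughly like this. A hypothetical sketch only (the field names are invented); the point is just that an LSN range in the header is enough to tell whether an index that was last vacuumed up to some LSN still needs the file's contents:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of a per-heap-vacuum dead-TID file header. */
typedef struct DeadTidFileHeader
{
    uint64_t    oldest_lsn;     /* insert LSN when collection of these TIDs began */
    uint64_t    newest_lsn;     /* insert LSN when collection finished */
    uint32_t    num_tids;       /* number of dead TIDs stored after the header */
} DeadTidFileHeader;

/*
 * An index whose last vacuum processed everything up to
 * 'index_vacuumed_lsn' has already removed all TIDs in a file whose
 * contents are entirely older than that LSN, so the file can be
 * skipped the next time that index is vacuumed.
 */
static inline bool
file_already_handled(const DeadTidFileHeader *hdr, uint64_t index_vacuumed_lsn)
{
    return hdr->newest_lsn <= index_vacuumed_lsn;
}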
On Wed, Apr 21, 2021 at 8:21 AM Robert Haas <robertmhaas@gmail.com> wrote: > We are used to thinking about table vacuum and index vacuum as parts > of a single, indivisible operation. You vacuum the table -- among > other things by performing HOT pruning and remembering dead TIDs -- > and then you vacuum the indexes -- removing the remembered TIDs from > the index -- and then you vacuum the table some more, setting those > dead TIDs unused -- and then you're done. And along the way you do > some other things too like considering truncation that aren't relevant > to the point I want to make here. Now, the problem with this is that > every index has its own needs, which are separate from the needs of > the tables, as I think Peter Geoghegan and Masahiko Sawada were also > discussing recently. I'm very happy to see that you've taken an interest in this work! I believe it's an important area. It's too important to be left to only two contributors. I welcome your participation as an equal partner in the broader project to fix problems with VACUUM. Masahiko and I have had plenty of ideas about where this could go next -- way too many ideas, in fact. Maybe that kind of partnership sounds unnecessary or at least seems premature, but it turns out that this area is extremely broad and far reaching, if you really think it through -- you end up having to negotiate rather a lot all at once. Apart from anything else, I simply don't have the authority to commit a bunch of stuff that implicitly makes Postgres do things a certain way in a huge number of different subsystems. (Whether or not I'd be right in each case is beside the point.) My most ambitious goal is finding a way to remove the need to freeze or to set hint bits. I think that we can do this by inventing a new kind of VACUUM just for aborted transactions, which doesn't do index vacuuming. You'd need something like an ARIES-style dirty page table to make this cheap -- so it's a little like UNDO, but not very much. The basic idea is that eagerly cleaning up aborted transactions in an autovacuum worker allows you to broadly assume that most blocks contain definitely-committed heap tuples, or else LP_DEAD stubs (which of course don't contain any XIDs). You'd still have something like conventional VACUUM, which wouldn't change that much. Freezing is largely implicit, but maybe you freeze tuples the old way if and only if a backend dirties a "known-all-committed" block -- that can still be expensive. The visibility map still has an all-visible bit, but now it also has an all-committed bit (or maybe it's a separate data structure). The combination of all-visible and all-committed is precisely the same as frozen, so you don't need a separate VM bit for that anymore. Notice that this design doesn't change much about our basic approach to transaction management. It just further decouples things. Conventional VACUUMs are now only about garbage collection, and so can be further optimized with that goal in mind. It's much easier to do clever scheduling if VACUUM really only has to do garbage collection. > Opportunistic index cleanup strategies like > kill_prior_tuple and bottom-up deletion may work much better for some > indexes than others, meaning that you could have some indexes that > badly need to be vacuumed because they are full of garbage, and other > indexes on the same table where the opportunistic cleanup has worked > perfectly and there is no need for vacuuming at all. 
I know I say this all the time these days, but it seems worth repeating now: it is a qualitative difference, not a quantitative difference. Bottom-up index deletion will frequently stop most indexes on a table from growing by even one single block, while indexes that cannot use the optimization (indexes that are logically modified by UPDATE statements) might be hugely bloated. If this is the case during one VACUUM operation, it's probably going to work like that with all future VACUUM operations. It's abundantly clear that the current quantitative approach just cannot be pushed much further. > But, as things stand today, strategy options to deal with such > situations are limited. Leaving aside what the code actually does > right now, let's talk about what options we have in theory with the > technology as it exists now. They basically all boil down to stopping > early and then redoing the work later. We must always start with a > pass over the heap to collect dead TIDs, because otherwise there's > nothing else to do. Now we can choose to stop, but then the next > VACUUM will have to collect all those TIDs again. It may get to skip > more all-visible pages than the current vacuum did, but the pages that > still have dead TIDs will all have to be visited again. If we don't > stop, then we can choose to vacuum all of the indexes or just some of > them, and then afterwards stop. But if we do this, the next VACUUM > will have to scan all indexes again for the same TIDs. Here, we don't > even have the visibility map to allow skipping known-clean pages, so > it's *really* a lot of work we have to redo. Thus what we normally do > is press on to the third step, where we mark dead line pointers unused > after scanning every index in its entirety, and now they're gone and > we don't have to worry about them again. Barring emergency escape > valves, as things stand today, the frequency of table vacuuming is the > same as the frequency of index vacuuming, even though the *required* > frequency of vacuuming is not the same, and also varies from index to > index. I'm 100% in agreement about all of this. > Now, the reason for this is that when we discover dead TIDs, we only > record them in memory, not on disk. So, as soon as VACUUM ends, we > lose all knowledge of whether those TIDs were and must rediscover > them. Suppose we didn't do this, and instead had a "dead TID" fork for > each table. I had a similar idea myself recently -- clearly remembering the TIDs that you haven't vacuumed to save work later on makes a lot of sense. I didn't get very far with it, even in my own head, but I like the direction you're taking it. Having it work a little like a queue makes a lot of sense. > Suppose further that this worked like a conveyor belt, > similar to WAL, where every dead TID we store into the fork is > assigned an identifying 64-bit number that is never reused. Then, > suppose that for each index, we store the number of the oldest entry > that might still need to be vacuumed from the index. Every time you > perform what we now call the first heap pass of a VACUUM, you add the > new TIDs you find to the dead TID fork. Maybe we can combine this known-dead-tid structure with the visibility map. Index-only scans might be able to reason about blocks that are mostly all-visible, but still have stub LP_DEAD line pointers that this other structure knows about -- so you can get index-only scans without requiring a full round of traditional vacuuming. 
Maybe there is some opportunity like that, but not sure how to fit it in to everything else right now. > Every time you vacuum an > index, the TIDs that need to be removed are those between the > oldest-entry pointer for that index and the current end of the TID > fork. You remove all of those and then advance your oldest-entry > pointer accordingly. If that's too many TIDs to fit in > maintenance_work_mem, you can just read as many as will fit and > advance your oldest-entry pointer less far. Every time you perform > what we now call the second heap pass of a VACUUM, you find all the > TIDs that precede every index's oldest-entry pointer and set them > unused. You then throw away the associated storage at the OS level. > This requires a scheme where relations can be efficiently truncated > from the beginning rather than only at the end, which is why I said "a > conveyor belt" and "similar to WAL". Details deliberately vague since > I am just brainstorming here. This amounts to adding yet more decoupling -- which seems great to me. Anything that gives us the option but not the obligation to perform work either more lazily or more eagerly (whichever makes sense) seems highly desirable to me. Especially if we can delay our decision until the last possible point, when we can have relatively high confidence that we know what we're doing. And especially if holding it open as an option is pretty cheap (that's the point of remembering dead TIDs). > Furthermore, all of these operations can start in any order, and any > of them can be repeated any number of times during a single run of any > particular other one, or indeed, without that particular one ever > being run at all. Both heap phases can also now be done in smaller > chunks, if desired. > But is this worthwhile? I think it depends a lot on what you think the > comparative required frequencies are for the various operations. There is a risk that you'll never think that any optimization is worth it because each optimization seems marginal in isolation. Sometimes a diversity of strategies is the real strategy. Let's say you have a bunch of options that you can pick and choose from, with fallbacks and with ways of course correcting even halfway through the VACUUM. It's possible that that will work wonderfully well for a given complex user workload, but if you subtract away *any one* of the strategies suddenly things get much worse in some obvious high-level way. It's entirely possible for a single table to have different needs in different parts of the table. Certainly works that way with indexes -- that much I can say for sure. > If index A needs to be vacuumed every 40 minutes and index B needs to be > vacuumed every 55 minutes and the table that owns both of them needs > to be vacuumed every 70 minutes, I am not sure there is a whole lot > here. I think you will be pretty much equally well off if you just do > what we do today every 40 minutes and call it good. That's probably all true, but I think that an excellent heuristic is to work hard to avoid really terrible outcomes, rather than trying hard to get good outcomes. The extremes don't just matter -- they may even be the only thing that matters. If index A needs to be vacuumed about as frequently as index B anyway, then the user happens to naturally be in a position where the current simplistic scheduling works for them. Which is fine, as far as it goes, but that's not really where we have problems. If you consider the "qualitative, not quantitative" perspective, things change. 
It's now pretty unlikely that all of the indexes on the same table will have approximately the same needs -- except when there is very little to do with indexes at all, which is pretty much not an interesting case anyway. Because workloads generally don't logically modify all indexed columns within each UPDATE. They just don't tend to look like that in practice. > Also, you will not > benefit very much if the driving force is reclaiming dead line > pointers in the table itself. If that has to happen frequently, then > the indexes have to be scanned frequently, and this whole thing is a > lot of work for not much. But, maybe that's not the case. Suppose > index A needs to be vacuumed every hour to avoid bloat, index B needs > to be vacuumed every 4 hours to avoid bloat, and the table needs dead > line pointers reclaimed every 5.5 hours. Well, now you can gain a lot. > You can vacuum index A frequently while vacuuming index B only as > often as it needs, and you can reclaim dead line pointers on their own > schedule based on whatever index vacuuming was already done for bloat > avoidance. Without this scheme, there's just no way to give everybody > what they need without some of the participants being "dragged along > for the ride" and forced into work that they don't actually need done > simply because "that's how it works." Right. And, the differences between index A and index B will tend to be pretty consistent and often much larger than this. Many indexes would never have to be vacuumed, even with non-HOT UPDATEs due to bottom-up index deletion -- because they literally won't even have one single page split for hours, while maybe one index gets 3x larger in the same timeframe. Eventually you'll need to vacuum the indexes all the same (not just the bloated index), but that's only required to enable safely performing heap vacuuming. It's not so bad if the non-bloated indexes won't be dirtied and if it's not so frequent (dirtying pages is the real cost to keep under control here). > One thing I don't know is whether the kind of scenario that I describe > above is common, i.e. is the main reason we need to vacuum to control > index bloat, where this kind of approach seems likely to help, or is > it to reclaim dead line pointers in the heap, where it's not? I'd be > interested in hearing from people who have some experience in this > area, or at least better intuition than I do. The paradox here is:

1. Workload characteristics are important and must be exploited to get optimal performance.

2. Workloads are too complicated and unpredictable to ever truly understand.

Roughly speaking, the solution that I think has the most promise is to make it okay for your heuristics to be wrong. You do this by keeping the costs simple, fixed and low, and by doing things that have multiple benefits (not just one). This is why it's important to give VACUUM a bunch of strategies that it can choose from and switch back and forth from, with minimal commitment -- you let VACUUM figure out what to do about the workload through trial and error. It has to try and fail on occasion -- you must be willing to pay the cost of negative feedback (though the cost must be carefully managed). This approach is perhaps sufficient to cover all of the possible extremes with all workloads. I think that the extremes are where our problems all are, or close to it. The cost shouldn't be terribly noticeable because you have the flexibility to change your mind at the first sign of an issue. 
So you never pay an extreme cost (you pay a pretty low fixed cost incrementally, at worst), but you do sometimes (and maybe even often) get an extreme benefit -- the benefit of avoiding current pathological performance issues. We know that the cost of bloat is very non-linear in a bunch of ways that can be pretty well understood, so that seems like the right thing to focus on -- this is perhaps the only thing that we can expect to understand with a relatively high degree of confidence. We can live with a lot of uncertainty about what's going on with the workload by managing it continually, ramping up and down, etc. > Clearly, > we do not want to vacuum each partition by scanning the 1GB partition > + the 50MB local index + the 50GB global index. That's insane. With > the above system, since everything's decoupled, you can vacuum the > partition tables individually as often as required, and similarly for > their local indexes, but put off vacuuming the global index until > you've vacuumed a bunch, maybe all, of the partitions, so that the > work of cleaning up the global index cleans up dead TIDs from many or > all partitions instead of just one at a time. I can't think of any other way of sensibly implementing global indexes. > Now, the fly in the ointment here is that this supposes that we don't > get forced into vacuuming the global index too quickly because of dead > line pointer accumulation. But, I think if that does happen, with > careful scheduling, we might not really be worse off than we would > have been without partitioning. If we scan the table for just one > partition and, say, exhaust maintenance_work_mem, we have some > expensive index vacuuming to do immediately, but that would've also > happened in pretty much the same way with an unpartitioned table. But you can at least drop the partitions with a global index. It shouldn't be too hard to make that work without breaking things. > It's probably tricky to get the > autovacuum algorithm right here, but there seems to be some room for > optimism. I think that it's basically okay if global indexes suck when you do lots of UPDATEs -- this is a limitation that users can probably live with. What's not okay is if they suck when you do relatively few UPDATEs. It's especially not okay to have to scan the global index to delete one index tuple that points to one LP_DEAD item. Since you tend to get a tiny number of LP_DEAD items even when the DBA bends over backwards to make all UPDATEs use HOT. Getting that to happen 99%+ of the time is so much easier than getting it to happen 100% of the time. There can be enormous asymmetry with this stuff. Long term, I see VACUUM evolving into something that can only be configured in an advisory way. It's too hard to tune this stuff because what we really want here is to structure many things as an optimization problem, and to have a holistic view that considers how the table changes over time -- over multiple VACUUM operations. We can safely be very lazy if we have some basic sense of proportion about what the risk is. For example, maybe we limit the number of newly dirtied pages during VACUUM by being lazy about pruning pages that don't happen to be dirty when encountered within VACUUM. We still have some sense of how much work we've put off, so as to never get in over our head with debt. We might also have a sense of how many dirty pages in total there are in the system currently -- maybe if the DB is not busy right now we can be extra aggressive. In short, we apply holistic thinking. 
> Even if global indexes never happened, though, I think this could have > other benefits. For example, the wraparound failsafe mechanism > recently added by Masahiko Sawada and Peter Geoghegan bypasses index > vacuuming when wraparound danger is imminent. The only problem is that > making that decision means throwing away the accumulated list of dead > TIDs, which then need to be rediscovered whenever we get around to > vacuuming the indexes. But that's avoidable, if they're stored on disk > rather than in RAM. Yeah, that's not ideal. > One rather serious objection to this whole line of attack is that we'd > ideally like VACUUM to reclaim disk space without using any more, in > case the motivation for running VACUUM in the first place. A related > objection is that if it's sometimes agreable to do everything all at > once as we currently do, the I/O overhead could be avoided. Of course it's possible that what we currently do will be optimal. But it's pretty much a question of mostly-independent things all going the same way. So I expect that it will be rare. > I think > we'd probably have to retain a code path that buffers the dead TIDs in > memory to account, at least, for the low-on-disk-space case, and maybe > that can also be used to avoid I/O in some other cases, too. I haven't > thought through all the details here. It seems to me that the actual > I/O avoidance is probably not all that much - each dead TID is much > smaller than the deleted tuple that gave rise to it, and typically you > don't delete all the tuples at once - but it might be material in some > cases, and it's definitely material if you don't have enough disk > space left for it to complete without error. All true. -- Peter Geoghegan
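A trivial sketch of the visibility-map idea from earlier in this message (the bit names are hypothetical, for illustration only): if a separate all-committed bit existed alongside all-visible, "frozen" would not need its own bit, because it is implied by the conjunction of the two.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout -- illustration only, not the real VM format. */
#define VM_ALL_VISIBLE   0x01
#define VM_ALL_COMMITTED 0x02

/*
 * Under the scheme sketched above, a page whose tuples are all visible
 * and all known committed needs no separate "all frozen" bit: the
 * combination already implies it.
 */
static inline bool
page_effectively_frozen(uint8_t vm_bits)
{
    return (vm_bits & (VM_ALL_VISIBLE | VM_ALL_COMMITTED)) ==
           (VM_ALL_VISIBLE | VM_ALL_COMMITTED);
}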
On Wed, Apr 21, 2021 at 8:51 PM Robert Haas <robertmhaas@gmail.com> wrote: > Now, the reason for this is that when we discover dead TIDs, we only > record them in memory, not on disk. So, as soon as VACUUM ends, we > lose all knowledge of whether those TIDs were and must rediscover > them. Suppose we didn't do this, and instead had a "dead TID" fork for > each table. Interesting idea. > However, you only need to force it for indexes that haven't been vacuumed recently enough for > some other reason, rather than every index. If you have a target of > reclaiming 30,000 TIDs, you can just pick the indexes where there are > fewer than 30,000 dead TIDs behind their oldest-entry pointers and > force vacuuming only of those. How do we decide this target? I mean, at a given point, how do we decide what the limit of dead TIDs is at which we want to trigger the index vacuuming? > One rather serious objection to this whole line of attack is that we'd > ideally like VACUUM to reclaim disk space without using any more, in > case the motivation for running VACUUM in the first place. A related > objection is that if it's sometimes agreable to do everything all at > once as we currently do, the I/O overhead could be avoided. I think > we'd probably have to retain a code path that buffers the dead TIDs in > memory to account, at least, for the low-on-disk-space case, and maybe > that can also be used to avoid I/O in some other cases, too. I haven't > thought through all the details here. It seems to me that the actual > I/O avoidance is probably not all that much - each dead TID is much > smaller than the deleted tuple that gave rise to it, and typically you > don't delete all the tuples at once - but it might be material in some > cases, and it's definitely material if you don't have enough disk > space left for it to complete without error. Is it a good idea to always perform I/O after collecting the dead TIDs, or should there be an option the user can configure so that it aggressively vacuums all the indexes and this I/O overhead is avoided completely? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Apr 22, 2021 at 8:51 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Apr 21, 2021 at 8:21 AM Robert Haas <robertmhaas@gmail.com> wrote: > > We are used to thinking about table vacuum and index vacuum as parts > > of a single, indivisible operation. You vacuum the table -- among > > other things by performing HOT pruning and remembering dead TIDs -- > > and then you vacuum the indexes -- removing the remembered TIDs from > > the index -- and then you vacuum the table some more, setting those > > dead TIDs unused -- and then you're done. And along the way you do > > some other things too like considering truncation that aren't relevant > > to the point I want to make here. Now, the problem with this is that > > every index has its own needs, which are separate from the needs of > > the tables, as I think Peter Geoghegan and Masahiko Sawada were also > > discussing recently. > > I'm very happy to see that you've taken an interest in this work! I > believe it's an important area. It's too important to be left to only > two contributors. I welcome your participation as an equal partner in > the broader project to fix problems with VACUUM. +many > > > Now, the reason for this is that when we discover dead TIDs, we only > > record them in memory, not on disk. So, as soon as VACUUM ends, we > > lose all knowledge of whether those TIDs were and must rediscover > > them. Suppose we didn't do this, and instead had a "dead TID" fork for > > each table. > > I had a similar idea myself recently -- clearly remembering the TIDs > that you haven't vacuumed to save work later on makes a lot of sense. > I didn't get very far with it, even in my own head, but I like the > direction you're taking it. Having it work a little like a queue makes > a lot of sense. Agreed. (I now remembered I gave a talk about a similar idea at PGCon a couple years ago). Another good point of this "dead TID fork" design is that IIUC we don't necessarily need to make it crash-safe. We would not need WAL logging for remembering dead TIDs. If the server crashes, we can simply throw it away and assume we haven't done the first heap pass yet. > > > Suppose further that this worked like a conveyor belt, > > similar to WAL, where every dead TID we store into the fork is > > assigned an identifying 64-bit number that is never reused. Then, > > suppose that for each index, we store the number of the oldest entry > > that might still need to be vacuumed from the index. Every time you > > perform what we now call the first heap pass of a VACUUM, you add the > > new TIDs you find to the dead TID fork. > > Maybe we can combine this known-dead-tid structure with the visibility > map. Index-only scans might be able to reason about blocks that are > mostly all-visible, but still have stub LP_DEAD line pointers that > this other structure knows about -- so you can get index-only scans > without requiring a full round of traditional vacuuming. Maybe there > is some opportunity like that, but not sure how to fit it in to > everything else right now. Interesting idea. > > > Every time you vacuum an > > index, the TIDs that need to be removed are those between the > > oldest-entry pointer for that index and the current end of the TID > > fork. You remove all of those and then advance your oldest-entry > > pointer accordingly. If that's too many TIDs to fit in > > maintenance_work_mem, you can just read as many as will fit and > > advance your oldest-entry pointer less far. 
> > Every time you perform > > what we now call the second heap pass of a VACUUM, you find all the > > TIDs that precede every index's oldest-entry pointer and set them > > unused. You then throw away the associated storage at the OS level. > > This requires a scheme where relations can be efficiently truncated > > from the beginning rather than only at the end, which is why I said "a > > conveyor belt" and "similar to WAL". Details deliberately vague since > > I am just brainstorming here. The dead TID fork also needs to be efficiently searchable. If the heap scan runs twice, the dead TIDs collected by each heap pass could overlap. But we would not be able to merge them if we did index vacuuming on one of the indexes between those two heap scans. The second heap scan would need to record only TIDs that were not collected by the first one. > > > Clearly, > > we do not want to vacuum each partition by scanning the 1GB partition > > + the 50MB local index + the 50GB global index. That's insane. With > > the above system, since everything's decoupled, you can vacuum the > > partition tables individually as often as required, and similarly for > > their local indexes, but put off vacuuming the global index until > > you've vacuumed a bunch, maybe all, of the partitions, so that the > > work of cleaning up the global index cleans up dead TIDs from many or > > all partitions instead of just one at a time. > > I can't think of any other way of sensibly implementing global indexes. Given that we could load all dead TIDs from many or all partitions, having the dead TIDs in memory in an efficient format would also become important. > > It's probably tricky to get the > > autovacuum algorithm right here, but there seems to be some room for > > optimism. > > I think that it's basically okay if global indexes suck when you do > lots of UPDATEs -- this is a limitation that users can probably live > with. What's not okay is if they suck when you do relatively few > UPDATEs. It's especially not okay to have to scan the global index to > delete one index tuple that points to one LP_DEAD item. Right. Given that index vacuuming is decoupled, I think per-index garbage statistics become important, and they preferably need to be fetchable without accessing the indexes. It would not be hard to estimate how many index tuples might be deleted by looking at the dead TID fork, but that doesn't necessarily match the actual number. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
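A tiny sketch of the lookup that the merge problem above implies, assuming (purely hypothetically) that each run of collected TIDs is kept sorted by (block, offset): the second heap pass would only append TIDs that a binary search of the earlier run fails to find. None of this is from a patch; it only illustrates the kind of search the on-disk format would have to support cheaply.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch -- not from any patch. */
typedef struct DeadTid
{
    uint32_t    block;
    uint16_t    offset;
} DeadTid;

static int
deadtid_cmp(const void *a, const void *b)
{
    const DeadTid *x = a;
    const DeadTid *y = b;

    if (x->block != y->block)
        return x->block < y->block ? -1 : 1;
    if (x->offset != y->offset)
        return x->offset < y->offset ? -1 : 1;
    return 0;
}

/*
 * During the second heap scan, record 'tid' only if the sorted run
 * collected by the first scan does not already contain it.
 */
static bool
tid_already_recorded(const DeadTid *first_run, size_t n, DeadTid tid)
{
    return bsearch(&tid, first_run, n, sizeof(DeadTid), deadtid_cmp) != NULL;
}

int
main(void)
{
    DeadTid     first_run[] = {{10, 1}, {10, 3}, {42, 7}};  /* kept sorted */
    DeadTid     candidate = {10, 3};

    printf("already recorded: %d\n",
           tid_already_recorded(first_run, 3, candidate));
    return 0;
}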
On Wed, Apr 21, 2021 at 5:38 PM Andres Freund <andres@anarazel.de> wrote: > I'm not sure that's the only way to deal with this. While some form of > generic "conveyor belt" infrastructure would be a useful building block, > and it'd be sensible to use it here if it existed, it seems feasible to > dead tids in a different way here. You could e.g. have per-heap-vacuum > files with a header containing LSNs that indicate the age of the > contents. That's true, but I have some reservations about being overly reliant on the filesystem to provide structure here. There are good reasons to be worried about bloating the number of files in the data directory. Hmm, but maybe we could mitigate that. First, we could skip this for small relations. If you can vacuum the table and all of its indexes using the naive algorithm in <10 seconds, you probably shouldn't do anything fancy. That would *greatly* reduce the number of additional files generated. Second, we could forget about treating them as separate relation forks and make them some other kind of thing entirely, in a separate directory, especially if we adopted Sawada-san's proposal to skip WAL logging. I don't know if that proposal is actually a good idea, because it effectively adds a performance penalty when you crash or fail over, and that sort of thing can be an unpleasant surprise. But it's something to think about. > > It's not completely independent: if you need to set some dead TIDs in > > the table to unused, you may have to force index vacuuming that isn't > > needed for bloat control. However, you only need to force it for > > indexes that haven't been vacuumed recently enough for some other > > reason, rather than every index. > > Hm - how would we know how recently that TID has been marked dead? We > don't even have xids for dead ItemIds... Maybe you're intending to > answer that in your next paragraph, but it's not obvious to me that'd be > sufficient... You wouldn't know anything about when things were added in terms of wall clock time, but the idea was that TIDs get added in order and stay in that order. So you know which ones were added first. Imagine a conceptually infinite array of TIDs: (17,5) (332,6) (5, 1) (2153,92) .... Each index keeps a pointer into this array. Initially it points to the start of the array, here (17,5). If an index vacuum starts after (17,5) and (332,6) have been added to the array but before (5,1) is added, then upon completion it updates its pointer to point to (5,1). If every index is pointing to (5,1) or some later element, then you know that (17,5) and (332,6) can be set LP_UNUSED. If not, and you want to get to a state where you CAN set (17,5) and (332,6) to LP_UNUSED, you just need to force index vac on indexes that are pointing to something prior to (5,1) -- and keep forcing it until those pointers reach (5,1) or later. > One thing that you didn't mention so far is that this'd allow us to add > dead TIDs to the "dead tid" file outside of vacuum too. In some > workloads most of the dead tuple removal happens as part of on-access > HOT pruning. While some indexes are likely to see that via the > killtuples logic, others may not. Being able to have more aggressive > index vacuum for the one or two bloated index, without needing to rescan > the heap, seems like it'd be a significant improvement. Oh, that's a very interesting idea. 
It does impose some additional requirements on any such system, though, because it means you have to be able to efficiently add single TIDs. For example, you mention a per-heap-VACUUM file above, but you can't get away with creating a new file per HOT prune no matter how you arrange things at the FS level. Actually, though, I think the big problem here is deduplication. A full-blown VACUUM can perhaps read all the already-known-to-be-dead TIDs into some kind of data structure and avoid re-adding them, but that's impractical for a HOT prune. > Have you thought about how we would do the scheduling of vacuums for the > different indexes? We don't really have useful stats for the number of > dead index entries to be expected in an index. It'd not be hard to track > how many entries are removed in an index via killtuples, but > e.g. estimating how many dead entries there are in a partial index seems > quite hard (at least without introducing significant overhead). No, I don't have any good ideas about that, really. Partial indexes seem like a hard problem, and so do GIN indexes or other kinds of things where you may have multiple index entries per heap tuple. We might have to accept some known-to-be-wrong approximations in such cases. > > One rather serious objection to this whole line of attack is that we'd > > ideally like VACUUM to reclaim disk space without using any more, in > > case the motivation for running VACUUM in the first place. > > I suspect we'd need a global limit of space used for this data. If above > that limit we'd switch to immediately performing the work required to > remove some of that space. I think that's entirely the wrong approach. On the one hand, it doesn't prevent you from running out of disk space during emergency maintenance, because the disk overall can be full even though you're below your quota of space for this particular purpose. On the other hand, it does subject you to random breakage when your database gets big enough that the critical information can't be stored within the configured quota. I think we'd end up with pathological cases very much like what used to happen with the fixed-size free space map. What happened there was that your database got big enough that you couldn't track all the free space any more and it just started bloating out the wazoo. What would happen here is that you'd silently lose the well-optimized version of VACUUM when your database gets too big. That does not seem like something anybody wants. -- Robert Haas EDB: http://www.enterprisedb.com
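Continuing the conveyor-belt array illustration above ((17,5) (332,6) (5,1) ...): the "force only the lagging indexes" rule from the first mail boils down to one comparison per index. A hypothetical C sketch, assuming belt positions simply count entries from the start of the conveyor belt and all index names are made up:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch: which indexes must be vacuumed before the first
 * 'target' dead TIDs on the conveyor belt can be set LP_UNUSED? */
typedef struct IndexVacState
{
    const char *name;
    uint64_t    oldest_unvacuumed;  /* belt position the index has reached */
} IndexVacState;

int
main(void)
{
    /* Target: reclaim the oldest 30000 dead TIDs, i.e. positions < 30000. */
    const uint64_t  target = 30000;
    IndexVacState   idx[] = {
        {"idx_pk", 120000},     /* already well past the target: skip */
        {"idx_partial", 4000},  /* lagging: must be force-vacuumed */
        {"idx_gin", 29000},     /* lagging: must be force-vacuumed */
    };

    for (size_t i = 0; i < sizeof(idx) / sizeof(idx[0]); i++)
    {
        if (idx[i].oldest_unvacuumed < target)
            printf("force vacuum of %s (only %llu dead TIDs behind its pointer)\n",
                   idx[i].name,
                   (unsigned long long) idx[i].oldest_unvacuumed);
        else
            printf("skip %s\n", idx[i].name);
    }
    return 0;
}

Once every pointer has reached position 30000 or later, at least that many dead line pointers can be set unused, as the first mail says.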
On Wed, Apr 21, 2021 at 7:51 PM Peter Geoghegan <pg@bowt.ie> wrote: > I'm very happy to see that you've taken an interest in this work! I > believe it's an important area. It's too important to be left to only > two contributors. I welcome your participation as an equal partner in > the broader project to fix problems with VACUUM. Err, thanks. I agree this needs broad discussion and participation. > My most ambitious goal is finding a way to remove the need to freeze > or to set hint bits. I think that we can do this by inventing a new > kind of VACUUM just for aborted transactions, which doesn't do index > vacuuming. You'd need something like an ARIES-style dirty page table > to make this cheap -- so it's a little like UNDO, but not very much. I don't see how that works. An aborted transaction can have made index entries, and those index entries can have already been moved by page splits, and there can be arbitrarily many of them, so that you can't keep track of them all in RAM. Also, you can crash after making the index entries and writing them to the disk and before the abort happens. Anyway, this is probably a topic for a separate thread. > I know I say this all the time these days, but it seems worth > repeating now: it is a qualitative difference, not a quantitative > difference. For the record, I find your quantitative vs. qualitative distinction to be mostly unhelpful in understanding what's actually going on here. I've backed into it by reading the explanatory statements you've made at various times (including here, in the part I didn't quote). But that phrase in and of itself means very little to me. Other people's mileage may vary, of course; I'm just telling you how I feel about it. > Right. And, the differences between index A and index B will tend to > be pretty consistent and often much larger than this. > > Many indexes would never have to be vacuumed, even with non-HOT > UPDATES due to bottom-up index deletion -- because they literally > won't even have one single page split for hours, while maybe one index > gets 3x larger in the same timeframe. Eventually you'll need to vacuum > the indexes all the same (not just the bloated index), but that's only > required to enable safely performing heap vacuuming. It's not so bad > if the non-bloated indexes won't be dirtied and if it's not so > frequent (dirtying pages is the real cost to keep under control here). Interesting. > The cost shouldn't be terribly noticeable because you have the > flexibility to change your mind at the first sign of an issue. So you > never pay an extreme cost (you pay a pretty low fixed cost > incrementally, at worst), but you do sometimes (and maybe even often) > get an extreme benefit -- the benefit of avoiding current pathological > performance issues. We know that the cost of bloat is very non-linear > in a bunch of ways that can be pretty well understood, so that seems > like the right thing to focus on -- this is perhaps the only thing > that we can expect to understand with a relatively high degree of > confidence. We can live with a lot of uncertainty about what's going > on with the workload by managing it continually, ramping up and down, > etc. I generally agree. You want to design a system in a way that's going to do a good job avoiding pathological cases. The current system is kind of naive about that. It does things that work well in middle-of-the-road cases, but often does stupid things in extreme cases. 
There are numerous examples of that; one is the "useless vacuuming" problem about which I've blogged in http://rhaas.blogspot.com/2020/02/useless-vacuuming.html where the system keeps on vacuuming because relfrozenxid is old but doesn't actually succeed in advancing it, so that it's just spinning to no purpose. Another thing is when it picks the "wrong" thing to do first, focusing on a less urgent problem rather than a more urgent one. This can go either way: we might spend a lot of energy cleaning up bloat when a wraparound shutdown is imminent, but we also might spend a lot of energy dealing with a wraparound issue that's not yet urgent while some table bloats out of control. I think it's important not to let the present discussion get overbroad; we won't be able to solve everything at once, and trying to do too many things at the same time will likely result in instability. > > Clearly, > > we do not want to vacuum each partition by scanning the 1GB partition > > + the 50MB local index + the 50GB global index. That's insane. With > > the above system, since everything's decoupled, you can vacuum the > > partition tables individually as often as required, and similarly for > > their local indexes, but put off vacuuming the global index until > > you've vacuumed a bunch, maybe all, of the partitions, so that the > > work of cleaning up the global index cleans up dead TIDs from many or > > all partitions instead of just one at a time. > > I can't think of any other way of sensibly implementing global indexes. Awesome. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Apr 22, 2021 at 9:15 AM Robert Haas <robertmhaas@gmail.com> wrote: > > Have you thought about how we would do the scheduling of vacuums for the > > different indexes? We don't really have useful stats for the number of > > dead index entries to be expected in an index. It'd not be hard to track > > how many entries are removed in an index via killtuples, but > > e.g. estimating how many dead entries there are in a partial index seems > > quite hard (at least without introducing significant overhead). > > No, I don't have any good ideas about that, really. Partial indexes > seem like a hard problem, and so do GIN indexes or other kinds of > things where you may have multiple index entries per heap tuple. We > might have to accept some known-to-be-wrong approximations in such > cases. I think that you're both missing very important subtleties here. Apparently the "quantitative vs qualitative" distinction I like to make hasn't cleared it up. You both seem to be assuming that everything would be fine if you could somehow inexpensively know the total number of undeleted dead tuples in each index at all times. But I don't think that that's true at all. I don't mean that it might not be true. What I mean is that it's usually a meaningless number *on its own*, at least if you assume that every index is either an nbtree index (or an index that uses some other index AM that has the same index deletion capabilities). My mental models for index bloat usually involve imagining an idealized version of a real world bloated index -- I compare the empirical reality against an imagined idealized version. I then try to find optimizations that make the reality approximate the idealized version. Say a version of the same index in a traditional 2PL database without MVCC, or in real world Postgres with VACUUM that magically runs infinitely fast. Bottom-up index deletion usually leaves a huge number of undeleted-though-dead index tuples untouched for hours, even when it works perfectly. 10% - 30% of the index tuples might be undeleted-though-dead at any given point in time (traditional B-Tree space utilization math generally ensures that there is about that much free space on each leaf page if we imagine no version churn/bloat -- we *naturally* have a lot of free space to work with). These are "Schrodinger's dead index tuples". You could count them mechanistically, but then you'd be counting index tuples that are "already dead and deleted" in an important theoretical sense, despite the fact that they are not yet literally deleted. That's why bottom-up index deletion frequently avoids 100% of all unnecessary page splits. The asymmetry that was there all along was just crazy. I merely had the realization that it was there and could be exploited -- I didn't create or invent the natural asymmetry. A similar issue exists within vacuumlazy.c (though that might matter a lot less). Notice that it counts a recently dead heap tuple in its new_dead_tuples accounting, even though the workload can probably delete that tuple in a just-in-time fashion opportunistically. Might we be better off recognizing that such a heap tuple is already morally dead and gone, even if that isn't literally true? (That's a harder argument to make, and I'm not making it right now -- it's just an example.) -- Peter Geoghegan
On Thu, Apr 22, 2021 at 7:51 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > How do we decide this target, I mean at a given point how do we decide > that what is the limit of dead TID's at which we want to trigger the > index vacuuming? It's tricky. Essentially, it's a cost-benefit analysis. On the "cost" side, the expense associated with an index vacuum is basically the number of pages that we're going to visit, and the number of those that we're going to dirty. We can know the former with certainty but can only estimate the latter. On the "benefit" side, setting dead TIDs unused helps us in two ways. First, it lets us mark heap pages all-visible, which makes index-only scans work better and reduces the cost of future vacuuming. These benefits are mitigated by future DML unsetting those bits again; there's no point in marking a page all-visible if it's about to be modified again. Second, it avoids line pointer bloat. Dead line pointers still take up space on the page, and potentially force the line pointer array to be extended, which can eventually cause tuples that would have fit into the page to spill into a different page, possibly a newly-allocated one that forces a table extension. It's hard to compare the costs to the benefits because we don't know the frequency of access. Suppose it costs $1000 to vacuum relevant indexes and set dead line pointers unused. And suppose that if you do it, you thereafter will save $1 every time someone does an index-only scan. If there will be >1000 index-only scans in a meaningful time frame, it's a good trade-off, but if not, it's a bad one. It's difficult to predict the future, and we have limited information even about the past. My intuition is that two things that we want to consider are the total number of dead line pointers in the heap, and the number of pages across which they are spread. It is also my intuition that the latter is the more important number, possibly to the extent that we could ignore the former number completely. But exactly what the thresholds should be is very unclear to me. > Is it a good idea to always perform an I/O after collecting the dead > TID's or there should be an option where the user can configure so > that it aggressively vacuum all the indexes and this I/O overhead can > be avoided completely. It's my view that there should definitely be such an option. As I also mentioned on another thread recently, suppose we pick words for each phase of vacuum. For the sake of argument, suppose we refer to the first heap phase as PRUNE, the index phase as SANITIZE, and the second heap phase as RECYCLE. Then you can imagine syntax like this:
VACUUM (PRUNE) my_table;
VACUUM (SANITIZE) my_table; -- all indexes
VACUUM my_index; -- must be sanitize only
VACUUM (PRUNE, SANITIZE, RECYCLE) my_table; -- do everything
Now in the last case it is clearly possible for the system to do everything in memory since all phases are being performed, but depending on what we decide, maybe it will choose to use the dead-TID fork in some cases for some reason or other. If so, we might have explicit syntax to override that behavior, e.g.
VACUUM (PRUNE, SANITIZE, RECYCLE, TID_STORE 0) my_table;
which might be able to be abbreviated, depending on how we set the defaults, to just:
VACUUM (TID_STORE 0) my_table;
This is all just hypothetical syntax and probably needs a good deal of polish and bike-shedding.
But it would be really nice to standardize on some set of terms like prune/sanitize/recycle or whatever we pick, because then we could document what they mean, use them in autovacuum log messages, use them internally for function names, use them for VACUUM option names when we get to that point, etc. and the whole thing would be a good deal more comprehensible than at present. -- Robert Haas EDB: http://www.enterprisedb.com
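To make the cost-benefit framing above slightly more concrete, here is a minimal standalone C sketch. The struct names, the helper, and especially the numeric weights are assumptions made purely for illustration; nothing here corresponds to existing PostgreSQL code or to a proposed heuristic:

#include <stdbool.h>

typedef struct IndexVacuumEstimate
{
    double pages_to_visit;   /* known up front: the size of the index */
    double pages_to_dirty;   /* only a guess; depends on prior opportunistic cleanup */
} IndexVacuumEstimate;

typedef struct HeapGarbageEstimate
{
    double dead_line_pointers;        /* total dead TIDs accumulated so far */
    double pages_with_dead_pointers;  /* how widely they are spread */
} HeapGarbageEstimate;

static bool
index_vacuum_looks_worthwhile(const IndexVacuumEstimate *ix,
                              const HeapGarbageEstimate *heap)
{
    /* Dirtying a page is assumed to cost several times a clean visit. */
    double cost = ix->pages_to_visit + 10.0 * ix->pages_to_dirty;

    /* Weight the number of affected heap pages far more heavily than the raw
     * TID count, following the intuition above that spread matters most. */
    double benefit = 5.0 * heap->pages_with_dead_pointers
                   + 0.1 * heap->dead_line_pointers;

    return benefit > cost;
}

The weights (10.0, 5.0, 0.1) are arbitrary; the open question in the mail above is precisely what those thresholds ought to be.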
Hi, On 2021-04-22 11:30:21 -0700, Peter Geoghegan wrote: > I think that you're both missing very important subtleties here. > Apparently the "quantitative vs qualitative" distinction I like to > make hasn't cleared it up. I'm honestly getting a bit annoyed about this stuff. Yes it's a cool improvement, but no, it doesn't mean that there aren't still relevant issues in important cases. It doesn't help that you repeatedly imply that people that don't see it your way need to have their view "cleared up". "Bottom up index deletion" is practically *irrelevant* for a significant set of workloads. > You both seem to be assuming that everything would be fine if you > could somehow inexpensively know the total number of undeleted dead > tuples in each index at all times. I don't think we'd need an exact number. Just a reasonable approximation so we know whether it's worth spending time vacuuming some index. > But I don't think that that's true at all. I don't mean that it might > not be true. What I mean is that it's usually a meaningless number *on > its own*, at least if you assume that every index is either an nbtree > index (or an index that uses some other index AM that has the same > index deletion capabilities). You also have to assume that you have roughly evenly distributed index insertions and deletions. But workloads that insert into some parts of a value range and delete from another range are common. I even would say that *precisely* because "Bottom up index deletion" can be very efficient in some workloads it is useful to have per-index stats determining whether an index should be vacuumed or not. > My mental models for index bloat usually involve imagining an > idealized version of a real world bloated index -- I compare the > empirical reality against an imagined idealized version. I then try to > find optimizations that make the reality approximate the idealized > version. Say a version of the same index in a traditional 2PL database > without MVCC, or in real world Postgres with VACUUM that magically > runs infinitely fast. > > Bottom-up index deletion usually leaves a huge number of > undeleted-though-dead index tuples untouched for hours, even when it > works perfectly. 10% - 30% of the index tuples might be > undeleted-though-dead at any given point in time (traditional B-Tree > space utilization math generally ensures that there is about that much > free space on each leaf page if we imagine no version churn/bloat -- > we *naturally* have a lot of free space to work with). These are > "Schrodinger's dead index tuples". You could count them > mechanistically, but then you'd be counting index tuples that are > "already dead and deleted" in an important theoretical sense, despite > the fact that they are not yet literally deleted. That's why bottom-up > index deletion frequently avoids 100% of all unnecessary page splits. > The asymmetry that was there all along was just crazy. I merely had > the realization that it was there and could be exploited -- I didn't > create or invent the natural asymmetry. Except that heap bloat not index bloat might be the more pressing concern. Or that there will be no meaningful amount of bottom-up deletions. Or ... Greetings, Andres Freund
On Thu, Apr 22, 2021 at 10:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > The dead TID fork needs to also be efficiently searched. If the heap > scan runs twice, the collected dead TIDs on each heap pass could be > overlapped. But we would not be able to merge them if we did index > vacuuming on one of indexes at between those two heap scans. The > second time heap scan would need to record only TIDs that are not > collected by the first time heap scan. I agree that there's a problem here. It seems to me that it's probably possible to have a dead TID fork that implements "throw away the oldest stuff" efficiently, and it's probably also possible to have a TID fork that can be searched efficiently. However, I am not sure that it's possible to have a dead TID fork that does both of those things efficiently. Maybe you have an idea. My intuition is that if we have to pick one, it's MUCH more important to be able to throw away the oldest stuff efficiently. I think we can work around the lack of efficient lookup, but I don't see a way to work around the lack of an efficient operation to discard the oldest stuff. > Right. Given decoupling index vacuuming, I think the index’s garbage > statistics are important which preferably need to be fetchable without > accessing indexes. It would be not hard to estimate how many index > tuples might be able to be deleted by looking at the dead TID fork but > it doesn’t necessarily match the actual number. Right, and to appeal (I think) to Peter's quantitative vs. qualitative principle, it could be way off. Like, we could have a billion dead TIDs and in one index the number of index entries that need to be cleaned out could be 1 billion and in another index it could be zero (0). We know how much data we will need to scan because we can fstat() the index, but there seems to be no easy way to estimate how many of those pages we'll need to dirty, because we don't know how successful previous opportunistic cleanup has been. It is not impossible, as Peter has pointed out a few times now, that it has worked perfectly and there will be no modifications required, but it is also possible that it has done nothing. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2021-04-22 14:47:14 -0400, Robert Haas wrote: > On Thu, Apr 22, 2021 at 10:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Right. Given decoupling index vacuuming, I think the index’s garbage > > statistics are important which preferably need to be fetchable without > > accessing indexes. It would be not hard to estimate how many index > > tuples might be able to be deleted by looking at the dead TID fork but > > it doesn’t necessarily match the actual number. > > Right, and to appeal (I think) to Peter's quantitative vs. qualitative > principle, it could be way off. Like, we could have a billion dead > TIDs and in one index the number of index entries that need to be > cleaned out could be 1 billion and in another index it could be zero > (0). We know how much data we will need to scan because we can fstat() > the index, but there seems to be no easy way to estimate how many of > those pages we'll need to dirty, because we don't know how successful > previous opportunistic cleanup has been. That aspect seems reasonably easy to fix: We can start to report the number of opportunistically deleted index entries to pgstat. At least nbtree already performs the actual deletion in bulk and we already have (currently unused) space in the pgstat entries for it, so I don't think it'd meaningfully increase overhead. And it'd improve insight into how indexes operate significantly, even leaving autovacuum etc aside. Greetings, Andres Freund
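A rough sketch of what such per-index reporting might look like, with hypothetical field and function names (these are not existing pgstat fields):

typedef struct IndexCleanupStats
{
    long entries_inserted;
    long entries_killed_opportunistically;  /* kill_prior_tuple / bottom-up deletion */
    long entries_removed_by_vacuum;
} IndexCleanupStats;

/* nbtree already performs opportunistic deletion in batches, so one bulk
 * report per deletion pass would add little overhead. */
static void
report_opportunistic_deletion(IndexCleanupStats *stats, int ndeleted)
{
    stats->entries_killed_opportunistically += ndeleted;
}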
On Thu, Apr 22, 2021 at 11:44 AM Andres Freund <andres@anarazel.de> wrote: > I'm honestly getting a bit annoyed about this stuff. You're easily annoyed. > Yes it's a cool > improvement, but no, it doesn't mean that there aren't still relevant > issues in important cases. It doesn't help that you repeatedly imply > that people that don't see it your way need to have their view "cleared > up". I don't think that anything that I've said about it contradicts anything that you or Robert said. What I said was that you're missing a couple of important subtleties (or that you seem to be). It's not really about the optimization in particular -- it's about the subtleties that it exploits. I think that they're generalizable. Even if there was only a 1% chance of that being true, it would still be worth exploring in depth. I think that everybody's beliefs about VACUUM tend to be correct. It almost doesn't matter if scenario A is the problem in 90% of cases versus 10% of cases for scenario B (or vice-versa). What actually matters is that we have good handling for both. (It's probably some weird combination of scenario A and scenario B in any case.) > "Bottom up index deletion" is practically *irrelevant* for a significant > set of workloads. You're missing the broader point. Which is that we don't know how much it helps in each case, just as we don't know how much some other complementary optimization helps. It's important to develop complementary techniques precisely because (say) bottom-up index deletion only solves one class of problem. And because it's so hard to predict. I actually went on at length about the cases that the optimization *doesn't* help. Because that'll be a disproportionate source of problems now. And you really need to avoid all of the big sources of trouble to get a really good outcome. Avoiding each and every source of trouble might be much, much more useful than avoiding all but one. > > You both seem to be assuming that everything would be fine if you > > could somehow inexpensively know the total number of undeleted dead > > tuples in each index at all times. > > I don't think we'd need an exact number. Just a reasonable approximation > so we know whether it's worth spending time vacuuming some index. I agree. > You also have to assume that you have roughly evenly distributed index > insertions and deletions. But workloads that insert into some parts of a > value range and delete from another range are common. > > I even would say that *precisely* because "Bottom up index deletion" can > be very efficient in some workloads it is useful to have per-index stats > determining whether an index should be vacuumed or not. Exactly! > Except that heap bloat not index bloat might be the more pressing > concern. Or that there will be no meaningful amount of bottom-up > deletions. Or ... Exactly! -- Peter Geoghegan
On Thu, Apr 22, 2021 at 3:11 PM Peter Geoghegan <pg@bowt.ie> wrote: > I think that everybody's beliefs about VACUUM tend to be correct. It > almost doesn't matter if scenario A is the problem in 90% or cases > versus 10% of cases for scenario B (or vice-versa). What actually > matters is that we have good handling for both. (It's probably some > weird combination of scenario A and scenario B in any case.) I agree strongly with this. In fact, I seem to remember saying similar things to you in the past. If something wins $1 in 90% of cases and loses $5 in 10% of cases, is it a good idea? Well, it depends on how the losses are distributed. If every user can be expected to hit both winning and losing cases with approximately those frequencies, then yes, it's a good idea, because everyone will come out ahead on average. But if 90% of users will see only wins and 10% of users will see only losses, it sucks. That being said, I don't know what this really has to do with the proposal on the table, except in the most general sense. If you're just saying that decoupling stuff is good because different indexes have different needs, I am in agreement, as I said in my OP. It sort of sounded like you were saying that it's not important to try to estimate the number of undeleted dead tuples in each index, which puzzled me, because while knowing doesn't mean everything is wonderful, not knowing it sure seems worse. But I guess maybe that's not what you were saying, so I don't know. I feel like we're in danger of drifting into discussions about whether we're disagreeing with each other rather than, as I would like, focusing on how to design a system for $SUBJECT. -- Robert Haas EDB: http://www.enterprisedb.com
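The arithmetic behind that example, spelled out as a tiny C program; the dollar figures are just the ones from the example above, nothing more:

#include <stdio.h>

int main(void)
{
    /* Case 1: every user sees the same 90/10 mix of wins and losses. */
    double uniform_mix = 0.9 * 1.0 + 0.1 * (-5.0);   /* +$0.40 per event */

    /* Case 2: 90% of users only ever win, 10% only ever lose. */
    double winners = 1.0;
    double losers  = -5.0;

    printf("uniform mix: %+.2f per event, for everyone\n", uniform_mix);
    printf("skewed mix:  %+.2f for most users, %+.2f for the unlucky 10%%\n",
           winners, losers);
    return 0;
}

The aggregate average is positive in both cases; only the second distribution is unacceptable, which is the point.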
Hi, On 2021-04-22 12:15:27 -0400, Robert Haas wrote: > On Wed, Apr 21, 2021 at 5:38 PM Andres Freund <andres@anarazel.de> wrote: > > I'm not sure that's the only way to deal with this. While some form of > > generic "conveyor belt" infrastructure would be a useful building block, > > and it'd be sensible to use it here if it existed, it seems feasible to > > store dead tids in a different way here. You could e.g. have per-heap-vacuum > > files with a header containing LSNs that indicate the age of the > > contents. > > That's true, but I have some reservations about being overly reliant on > the filesystem to provide structure here. There are good reasons to be > worried about bloating the number of files in the data directory. Hmm, > but maybe we could mitigate that. First, we could skip this for small > relations. If you can vacuum the table and all of its indexes using > the naive algorithm in <10 seconds, you probably shouldn't do anything > fancy. That would *greatly* reduce the number of additional files > generated. Second, we could forget about treating them as separate > relation forks and make them some other kind of thing entirely, in a > separate directory. I'm not *too* worried about this issue. IMO the big difference to the cost of additional relation forks is that such files would only exist when the table is modified to a somewhat meaningful degree. IME the practical issues with the number of files due to forks are cases where a huge number of tables that are practically never modified exist. That's not to say that I am sure that some form of "conveyor belt" storage *wouldn't* be the right thing. How were you thinking of dealing with the per-relation aspects of this? One conveyor belt per relation? > especially if we adopted Sawada-san's proposal to skip WAL logging. I > don't know if that proposal is actually a good idea, because it > effectively adds a performance penalty when you crash or fail over, > and that sort of thing can be an unpleasant surprise. But it's > something to think about. I'm doubtful about skipping WAL logging entirely - I'd have to think harder about it, but I think that'd mean we'd restart from scratch after crashes / immediate restarts as well, because we couldn't rely on the contents of the "dead tid" files to be accurate. In addition to the replication issues you mention. > > One thing that you didn't mention so far is that this'd allow us to add > > dead TIDs to the "dead tid" file outside of vacuum too. In some > > workloads most of the dead tuple removal happens as part of on-access > > HOT pruning. While some indexes are likely to see that via the > > killtuples logic, others may not. Being able to have more aggressive > > index vacuum for the one or two bloated index, without needing to rescan > > the heap, seems like it'd be a significant improvement. > > Oh, that's a very interesting idea. It does impose some additional > requirements on any such system, though, because it means you have to > be able to efficiently add single TIDs. For example, you mention a > per-heap-VACUUM file above, but you can't get away with creating a new > file per HOT prune no matter how you arrange things at the FS level. I agree that it'd be an issue, even though I think it's not too common that only one tuple gets pruned. It might be possible to have a per-relation file per backend or such... But yes, we'd definitely have to think about it.
I've previously pondered adding some cross-page batching and deferring of hot pruning in the read case, which I guess might be more advantageous with this. The main reason for thinking about batching & deferring of HOT pruning is that I found during the AIO work that there are speed gains to be had if we pad xlog pages instead of partially filling them - obviously risking increasing WAL usage. One idea to reduce the cost of that was to fill the padded space with actually useful things, like FPIs or hot pruning records. A related speedup opportunity with AIO is to perform useful work while waiting for WAL flushes during commit (i.e. after initiating IO to flush the commit record, but before that IO has completed). > Actually, though, I think the big problem here is deduplication. A > full-blown VACUUM can perhaps read all the already-known-to-be-dead > TIDs into some kind of data structure and avoid re-adding them, but > that's impractical for a HOT prune. What is there to deduplicate during HOT pruning? It seems that hot pruning would need to log all items that it marks dead, but nothing else? And that VACUUM can't yet have put those items onto the dead tuple map, because they weren't dead yet? This actually brings to the fore a question I vaguely had: How are you assuming indexes would access the list of dead tids? As far as I can see the on-disk data would not be fully sorted even without adding things during HOT pruning - the dead tids from a single heap pass will be, but there'll be tids from multiple passes, right? Are you assuming that we'd read the data into memory and then merge-sort between each of the pre-sorted "runs"? Or that we'd read and cache parts of the on-disk data during index checks? > > Have you thought about how we would do the scheduling of vacuums for the > > different indexes? We don't really have useful stats for the number of > > dead index entries to be expected in an index. It'd not be hard to track > > how many entries are removed in an index via killtuples, but > > e.g. estimating how many dead entries there are in a partial index seems > > quite hard (at least without introducing significant overhead). > > No, I don't have any good ideas about that, really. Partial indexes > seem like a hard problem, and so do GIN indexes or other kinds of > things where you may have multiple index entries per heap tuple. We > might have to accept some known-to-be-wrong approximations in such > cases. The gin case seems a bit easier than the partial index case. Keeping stats about the number of new entries in a GIN index doesn't seem too hard, nor does tracking the number of cleaned up index entries. But knowing which indexes are affected when a heap tuple becomes dead seems harder. I guess we could just start doing a stats-only version of ExecInsertIndexTuples() for deletes, but obviously the cost of that is not enticing. Perhaps it'd not be too bad if we only did it when there's an index with predicates? > > > One rather serious objection to this whole line of attack is that we'd > > > ideally like VACUUM to reclaim disk space without using any more, in > > > case that's the motivation for running VACUUM in the first place. > > > > I suspect we'd need a global limit of space used for this data. If above > > that limit we'd switch to immediately performing the work required to > > remove some of that space. > > I think that's entirely the wrong approach.
> On the one hand, it > doesn't prevent you from running out of disk space during emergency > maintenance, because the disk overall can be full even though you're > below your quota of space for this particular purpose. On the other > hand, it does subject you to random breakage when your database gets > big enough that the critical information can't be stored within the > configured quota. What random breakage are you thinking of? I'm not thinking of a hard limit that may not be crossed at any cost, by even a single byte, but that [auto]VACUUMs would start to be more aggressive about performing index vacuums once the limit is reached. > I think we'd end up with pathological cases very much like what used > to happen with the fixed-size free space map. What happened there was > that your database got big enough that you couldn't track all the free > space any more and it just started bloating out the wazoo. What would > happen here is that you'd silently lose the well-optimized version of > VACUUM when your database gets too big. That does not seem like > something anybody wants. I don't think the consequences would really be that comparable. Once the FSM size was reached in the bad old days, we'd just lose track of free space. Whereas here we'd start to be more aggressive about cleaning up once the "dead tids" data reaches a certain size. Of course that would have efficiency impacts, but I think "global free space wasted" is a valid input in deciding when to perform index vacuums. I think max_wal_size has worked out pretty well, even if not perfect. Greetings, Andres Freund
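On the question raised above about merging pre-sorted runs, here is a minimal standalone sketch of what stream-merging the per-heap-pass runs could look like. DeadTid, TidRun, and next_merged_tid are hypothetical names used only for illustration:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DeadTid { uint32_t block; uint16_t offset; } DeadTid;

typedef struct TidRun
{
    const DeadTid *tids;   /* one heap pass worth of TIDs, already in heap order */
    size_t         ntids;
    size_t         pos;
} TidRun;

static bool
tid_less(const DeadTid *a, const DeadTid *b)
{
    if (a->block != b->block)
        return a->block < b->block;
    return a->offset < b->offset;
}

/* Return the next TID in global heap order across all runs; false when every
 * run is exhausted.  A real implementation would use a binary heap for many
 * runs; a linear scan keeps the sketch short. */
static bool
next_merged_tid(TidRun *runs, int nruns, DeadTid *out)
{
    int best = -1;

    for (int i = 0; i < nruns; i++)
    {
        if (runs[i].pos >= runs[i].ntids)
            continue;
        if (best < 0 ||
            tid_less(&runs[i].tids[runs[i].pos], &runs[best].tids[runs[best].pos]))
            best = i;
    }
    if (best < 0)
        return false;
    *out = runs[best].tids[runs[best].pos++];
    return true;
}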
On Thu, Apr 22, 2021 at 12:27 PM Robert Haas <robertmhaas@gmail.com> wrote: > I agree strongly with this. In fact, I seem to remember saying similar > things to you in the past. If something wins $1 in 90% of cases and > loses $5 in 10% of cases, is it a good idea? Well, it depends on how > the losses are distributed. If every user can be expected to hit both > winning and losing cases with approximately those frequencies, then > yes, it's a good idea, because everyone will come out ahead on > average. But if 90% of users will see only wins and 10% of users will > see only losses, it sucks. Right. It's essential that we not disadvantage any workload by more than a small fixed amount (and only with a huge upside elsewhere). The even more general version is this: the average probably doesn't even exist in any meaningful sense. Bottom-up index deletion tends to be effective either 100% of the time or 0% of the time, which varies on an index by index basis. Does that mean we should split the difference, and assume that it's effective 50% of the time? Clearly not. Clearly that particular framing is just wrong. And clearly it basically doesn't matter if it's half of all indexes, or a quarter, or none, whatever. Because it's all of those proportions, and also because who cares. > That being said, I don't know what this really has to do with the > proposal on the table, except in the most general sense. If you're > just saying that decoupling stuff is good because different indexes > have different needs, I am in agreement, as I said in my OP. Mostly what I'm saying is that I would like to put together a rough list of things that we could do to improve VACUUM along the lines we've discussed -- all of which stem from $SUBJECT. There are literally dozens of goals (some of which are quite disparate) that we could conceivably set out to pursue under the banner of $SUBJECT. Ideally there would be soft agreement about which ideas were more promising. Ideally we'd avoid painting ourselves into a corner with respect to one of these goals, in pursuit of any other goal. I suspect that we'll need somewhat more of a top-down approach to this work, which is something that we as a community don't have much experience with. It might be useful to set the parameters of the discussion up-front, which seems weird to me too, but might actually help. (A lot of the current problems with VACUUM seem like they might be consequences of pgsql-hackers not usually working like this.) > It sort > of sounded like you were saying that it's not important to try to > estimate the number of undeleted dead tuples in each index, which > puzzled me, because while knowing doesn't mean everything is > wonderful, not knowing it sure seems worse. But I guess maybe that's > not what you were saying, so I don't know. I agree that it matters that we are able to characterize how bloated a partial index is, because an improved VACUUM implementation will need to know that. My main point about that was that it's complicated in surprising ways that actually matter. An approximate solution seems quite possible to me, but I think that that will probably have to involve the index AM directly. Sometimes 10% - 30% of the extant physical index tuples will be dead and it'll be totally fine in every practical sense -- the index won't have grown by even one page since the last VACUUM! 
Other times it might be as few as 2% - 5% that are now dead when VACUUM is considered, which will in fact be a serious problem (e.g., it's concentrated in one part of the keyspace, say). I would say that having some rough idea of which case we have on our hands is extremely important here. Even if the distinction only arises in rare cases (though FWIW I don't think that these differences will be rare at all). (I also tried to clarify what I mean about qualitative bloat in passing in my response about the case of a bloated partial index.) > I feel like we're in danger > of drifting into discussions about whether we're disagreeing with each > other rather than, as I would like, focusing on how to design a system > for $SUBJECT. While I am certainly guilty of being kind of hand-wavy and talking about lots of stuff all at once here, it's still kind of unclear what practical benefits you hope to attain through $SUBJECT. Apart from the thing about global indexes, which matters but is hardly the overwhelming reason to do all this. I myself don't expect your goals to be super crisp just yet. As I said, I'm happy to talk about it in very general terms at first -- isn't that what you were doing yourself? Or did I misunderstand -- are global indexes mostly all that you're thinking about here? (Even if they are all you care about, it still seems like you're still somewhat obligated to generalize the dead TID fork/map thing to help with a bunch of other things, just to justify the complexity of adding a dead TID relfork.) -- Peter Geoghegan
On Thu, Apr 22, 2021 at 11:16 AM Robert Haas <robertmhaas@gmail.com> wrote: > > My most ambitious goal is finding a way to remove the need to freeze > > or to set hint bits. I think that we can do this by inventing a new > > kind of VACUUM just for aborted transactions, which doesn't do index > > vacuuming. You'd need something like an ARIES-style dirty page table > > to make this cheap -- so it's a little like UNDO, but not very much. > > I don't see how that works. An aborted transaction can have made index > entries, and those index entries can have already been moved by page > splits, and there can be arbitrarily many of them, so that you can't > keep track of them all in RAM. Also, you can crash after making the > index entries and writing them to the disk and before the abort > happens. Anyway, this is probably a topic for a separate thread. This is a topic for a separate thread, but I will briefly address your question. Under the scheme I've sketched, we never do index vacuuming when invoking an autovacuum worker (or something like it) to clean-up after an aborted transaction. We track the pages that all transactions have modified. If a transaction commits then we quickly discard the relevant dirty page table metadata. If a transaction aborts (presumably a much rarer event), then we launch an autovacuum worker that visits precisely those heap blocks that were modified by the aborted transaction, and just prune each page, one by one. We have a cutoff that works a little like relfrozenxid, except that it tracks the point in the XID space before which we know any XIDs (any XIDs that we can read from extant tuple headers) must be committed. The idea of a "Dirty page table" is standard ARIES. It'd be tricky to get it working, but still quite possible. The overall goal of this design is for the system to be able to reason about committed-ness inexpensively (to obviate the need for hint bits and per-tuple freezing). We want to be able to say for sure that almost all heap blocks in the database only contain heap tuples whose headers contain only committed XIDs, or LP_DEAD items that are simply dead (the exact provenance of these LP_DEAD items is not a concern, just like today). The XID cutoff for committed-ness could be kept quite recent due to the fact that aborted transactions are naturally rare. And because we can do relatively little work to "logically roll back" aborted transactions. Note that a heap tuple whose xmin and xmax are committed might also be dead under this scheme, since of course it might have been updated or deleted by an xact that committed. We've effectively decoupled things by making aborted transactions special, and subject to very eager cleanup. I'm sure that there are significant challenges with making something like this work. But to me this design seems roughly the right combination of radical and conservative. -- Peter Geoghegan
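A deliberately simplified, standalone sketch of that "dirty page table for aborted transactions" idea follows. None of these types or functions exist in PostgreSQL, and a real version would have to be crash-safe (the ARIES part), which is ignored here:

#include <stdint.h>
#include <stdlib.h>

typedef uint32_t BlockNumber;

typedef struct XactDirtyPages
{
    BlockNumber *blocks;    /* heap blocks this transaction modified */
    int          nblocks;
    int          capacity;
} XactDirtyPages;

/* Remember that the transaction touched blkno (duplicates tolerated here). */
static void
record_modified_block(XactDirtyPages *dpt, BlockNumber blkno)
{
    if (dpt->nblocks == dpt->capacity)
    {
        dpt->capacity = dpt->capacity ? dpt->capacity * 2 : 16;
        dpt->blocks = realloc(dpt->blocks, dpt->capacity * sizeof(BlockNumber));
    }
    dpt->blocks[dpt->nblocks++] = blkno;
}

/* Commit is the common case: the metadata is simply thrown away. */
static void
on_commit(XactDirtyPages *dpt)
{
    free(dpt->blocks);
    dpt->blocks = NULL;
    dpt->nblocks = dpt->capacity = 0;
}

/* Abort: hand exactly these blocks to a worker that prunes them one by one,
 * with no full table scan and no index vacuuming. */
static void
on_abort(const XactDirtyPages *dpt, void (*prune_block)(BlockNumber blkno))
{
    for (int i = 0; i < dpt->nblocks; i++)
        prune_block(dpt->blocks[i]);
}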
Thread: Getting rid of freezing and hint bits by eagerly vacuuming aborted xacts (was: decoupling table and index vacuum)
From: Peter Geoghegan
On Thu, Apr 22, 2021 at 3:52 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, Apr 22, 2021 at 11:16 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > My most ambitious goal is finding a way to remove the need to freeze > > > or to set hint bits. I think that we can do this by inventing a new > > > kind of VACUUM just for aborted transactions, which doesn't do index > > > vacuuming. You'd need something like an ARIES-style dirty page table > > > to make this cheap -- so it's a little like UNDO, but not very much. > > > > I don't see how that works. An aborted transaction can have made index > > entries, and those index entries can have already been moved by page > > splits, and there can be arbitrarily many of them, so that you can't > > keep track of them all in RAM. Also, you can crash after making the > > index entries and writing them to the disk and before the abort > > happens. Anyway, this is probably a topic for a separate thread. > > This is a topic for a separate thread, but I will briefly address your question. > > Under the scheme I've sketched, we never do index vacuuming when > invoking an autovacuum worker (or something like it) to clean-up after > an aborted transaction. We track the pages that all transactions have > modified. If a transaction commits then we quickly discard the > relevant dirty page table metadata. If a transaction aborts > (presumably a much rarer event), then we launch an autovacuum worker > that visits precisely those heap blocks that were modified by the > aborted transaction, and just prune each page, one by one. We have a > cutoff that works a little like relfrozenxid, except that it tracks > the point in the XID space before which we know any XIDs (any XIDs > that we can read from extant tuple headers) must be committed. > > The idea of a "Dirty page table" is standard ARIES. It'd be tricky to > get it working, but still quite possible. > > The overall goal of this design is for the system to be able to reason > about committed-ness inexpensively (to obviate the need for hint bits > and per-tuple freezing). We want to be able to say for sure that > almost all heap blocks in the database only contain heap tuples whose > headers contain only committed XIDs, or LP_DEAD items that are simply > dead (the exact provenance of these LP_DEAD items is not a concern, > just like today). The XID cutoff for committed-ness could be kept > quite recent due to the fact that aborted transactions are naturally > rare. And because we can do relatively little work to "logically roll > back" aborted transactions. > > Note that a heap tuple whose xmin and xmax are committed might also be > dead under this scheme, since of course it might have been updated or > deleted by an xact that committed. We've effectively decoupled things > by making aborted transactions special, and subject to very eager > cleanup. > > I'm sure that there are significant challenges with making something > like this work. But to me this design seems roughly the right > combination of radical and conservative. I'll start a new thread now, as a placeholder for further discussion. This would be an incredibly ambitious project, and I'm sure that this thread will be very hand-wavy at first. But you've got to start somewhere. -- Peter Geoghegan
On Fri, Apr 23, 2021 at 3:47 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Apr 22, 2021 at 10:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > The dead TID fork needs to also be efficiently searched. If the heap > > scan runs twice, the collected dead TIDs on each heap pass could be > > overlapped. But we would not be able to merge them if we did index > > vacuuming on one of indexes at between those two heap scans. The > > second time heap scan would need to record only TIDs that are not > > collected by the first time heap scan. > > I agree that there's a problem here. It seems to me that it's probably > possible to have a dead TID fork that implements "throw away the > oldest stuff" efficiently, and it's probably also possible to have a > TID fork that can be searched efficiently. However, I am not sure that > it's possible to have a dead TID fork that does both of those things > efficiently. Maybe you have an idea. My intuition is that if we have > to pick one, it's MUCH more important to be able to throw away the > oldest stuff efficiently. I think we can work around the lack of > efficient lookup, but I don't see a way to work around the lack of an > efficient operation to discard the oldest stuff. Agreed. I think we can divide the TID fork into 16MB or 32MB chunks like WAL segment files so that we can easily remove old chunks. Regarding the efficient search part, I think we need to consider the case where the TID fork gets bigger than maintenance_work_mem. In that case, during the heap scan, we need to check if the discovered TID exists in a chunk of the TID fork that could be on the disk. Even if all known-dead-TIDs are loaded into an array on the memory, it could get much slower than the current heap scan to bsearch over the array for each dead TID discovered during heap scan. So it would be better to have a way to skip searching by already recorded TIDs. For example, during heap scan or HOT pruning, I think that when marking TIDs dead and recording it to the dead TID fork we can mark them “dead and recorded” instead of just “dead” so that future heap scans can skip those TIDs without existence check. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Fri, Apr 23, 2021 at 7:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I think we can divide the TID fork into 16MB or 32MB chunks like WAL > segment files so that we can easily remove old chunks. Regarding the > efficient search part, I think we need to consider the case where the > TID fork gets bigger than maintenance_work_mem. In that case, during > the heap scan, we need to check if the discovered TID exists in a > chunk of the TID fork that could be on the disk. Even if all > known-dead-TIDs are loaded into an array on the memory, it could get > much slower than the current heap scan to bsearch over the array for > each dead TID discovered during heap scan. So it would be better to > have a way to skip searching by already recorded TIDs. For example, > during heap scan or HOT pruning, I think that when marking TIDs dead > and recording it to the dead TID fork we can mark them “dead and > recorded” instead of just “dead” so that future heap scans can skip > those TIDs without existence check. I'm not very excited about this. If we did this, and if we ever generated dead-but-not-recorded TIDs, then we will potentially dirty those blocks again later to mark them recorded. Also, if bsearch() is a bottleneck, how about just using an O(1) algorithm instead of an O(lg n) algorithm, rather than changing the on-disk format? Also, can you clarify exactly what you think the problem case is here? It seems to me that: 1. If we're pruning the heap to collect dead TIDs, we should stop when the number of TIDs we've accumulated reaches maintenance_work_mem. It is possible that we could find when starting to prune that there are *already* more dead TIDs than will fit, because maintenance_work_mem might have been reduced since they were gathered. But that's OK: we can figure out that there are more than will fit without loading them all, and since we shouldn't do additional pruning in this case, there's no issue. 2. If we're sanitizing indexes, we should normally discover that there are few enough TIDs that we can still fit them all in memory. But if that proves not to be the case, again because for example maintenance_work_mem has been reduced, then we can handle that with multiple index passes just as we do today. 3. If we're going back to the heap to permit TIDs to be recycled by setting dead line pointers to unused, we can load in as many of those as will fit in maintenance_work_mem, sort them by block number, and go through block by block and DTRT. Then, we can release all that memory and, if necessary, do the whole thing again. This isn't even particularly inefficient. -- Robert Haas EDB: http://www.enterprisedb.com
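Point 3 above, sketched as standalone C with placeholder helpers; read_next_recyclable_tids and set_unused are assumed to exist only for the sake of the example and are not real functions:

#include <stdint.h>
#include <stdlib.h>

typedef struct DeadTid { uint32_t block; uint16_t offset; } DeadTid;

/* Placeholders: fetch the next batch of TIDs that every index has already
 * removed, and mark one line pointer unused. */
extern size_t read_next_recyclable_tids(DeadTid *buf, size_t max_tids);
extern void   set_unused(uint32_t block, uint16_t offset);

static int
cmp_by_block(const void *a, const void *b)
{
    const DeadTid *x = a;
    const DeadTid *y = b;

    if (x->block != y->block)
        return (x->block < y->block) ? -1 : 1;
    if (x->offset != y->offset)
        return (x->offset < y->offset) ? -1 : 1;
    return 0;
}

/* Load as many recyclable TIDs as fit in memory, sort them by block so each
 * heap page is visited once per batch, process them, then repeat. */
static void
recycle_in_batches(size_t max_tids_in_memory)
{
    DeadTid *buf = malloc(max_tids_in_memory * sizeof(DeadTid));
    size_t   n;

    if (buf == NULL)
        return;
    while ((n = read_next_recyclable_tids(buf, max_tids_in_memory)) > 0)
    {
        qsort(buf, n, sizeof(DeadTid), cmp_by_block);
        for (size_t i = 0; i < n; i++)
            set_unused(buf[i].block, buf[i].offset);
    }
    free(buf);
}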
On Thu, Apr 22, 2021 at 4:52 PM Peter Geoghegan <pg@bowt.ie> wrote: > Mostly what I'm saying is that I would like to put together a rough > list of things that we could do to improve VACUUM along the lines > we've discussed -- all of which stem from $SUBJECT. There are > literally dozens of goals (some of which are quite disparate) that we > could conceivably set out to pursue under the banner of $SUBJECT. I hope not. I don't have a clue why there would be dozens of possible goals here, or why it matters. I think if we're going to do something like $SUBJECT, we should just concentrate on the best way to make that particular change happen with minimal change to anything else. Otherwise, we risk conflating this engineering effort with others that really should be separate endeavors. For example, as far as possible, I think it would be best to try to do this without changing the statistics that are currently gathered, and just make the best decisions we can with the information we already have. Ideally, I'd like to avoid introducing a new kind of relation fork that uses a different on-disk storage format (e.g. 16MB segments that are dropped from the tail) rather than the one used by the other forks, but I'm not sure we can get away with that, because conveyor-belt storage looks pretty appealing here. Regardless, the more we have to change to accomplish the immediate goal, the more likely we are to introduce instability into places where it could have been avoided, or to get tangled up in endless bikeshedding. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Apr 23, 2021 at 8:44 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Apr 22, 2021 at 4:52 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Mostly what I'm saying is that I would like to put together a rough > > list of things that we could do to improve VACUUM along the lines > > we've discussed -- all of which stem from $SUBJECT. There are > > literally dozens of goals (some of which are quite disparate) that we > > could conceivably set out to pursue under the banner of $SUBJECT. > > I hope not. I don't have a clue why there would be dozens of possible > goals here, or why it matters. Not completely distinct goals, for the most part, but I can certainly see dozens of benefits. For example, if we know before index vacuuming starts that heap vacuuming definitely won't go ahead (quite possible when we decide that we're only vacuuming a subset of indexes), we can then tell the index AM about that fact. It can then safely vacuum in an "approximate" fashion, for example by skipping pages whose LSNs are from before the last VACUUM, and by not bothering with a super-exclusive lock in the case of nbtree. The risk of a conflict between this goal and another goal that we may want to pursue (which might be a bit contrived) is that we fail to do the right thing when a large range deletion has taken place, which must be accounted for in the statistics, but creates a tension with the global index stuff. It's probably only safe to do this when we know that there have been hardly any DELETEs. There is also the question of how the TID map thing interacts with the visibility map, and how that affects how VACUUM behaves (both in general and in order to attain some kind of specific new benefit from this synergy). Who knows? We're never going to get on exactly the same page, but some rough idea of which page each of us is on might save everybody time. The stuff that I went into about making aborted transactions special as a means of decoupling transaction status management from garbage collection is arguably totally unrelated -- perhaps it's just too much of a stretch to link that to what you want to do now. I suppose it's hard to invest the time to engage with me on that stuff, and I wouldn't be surprised if you never did so. If it doesn't happen it would be understandable, though quite possibly a missed opportunity for both of us. My basic intuition there is that it's another variety of decoupling, so (for better or worse) it does *seem* related to me. (I am an intuitive thinker, which has advantages and disadvantages.)
I'll add the basic idea of doing index vacuuming for some indexes but not others to the list -- this will necessitate that we teach index AMs to assess how much bloat the index has accumulated since the last VACUUM, which presumably must work in some generalized, composable way. > For example, as far as possible, > I think it would be best to try to do this without changing the > statistics that are currently gathered, and just make the best > decisions we can with the information we already have. I have no idea if that's the right way to do it. In any case the statistics that we gather influence the behavior of autovacuum.c, but nothing stops us from doing our own dynamic gathering of statistics to decide what we should do within vacuumlazy.c each time. We don't have to change the basic triggering conditions to change the work each VACUUM performs. As I've said before, I think that we're likely to get more benefit (at least at first) from making the actual reality of what VACUUM does simpler and more predictable in practice than we are from changing how reality is modeled inside autovacuum.c. I'll go further with that now: if we do change that modeling at some point, I think that it should work in an additive way, which can probably be compatible with how the statistics and so on work already. For example, maybe vacuumlazy.c asks autovacuum.c to do a VACUUM earlier next time. This can be structured as an exception to the general rule of autovacuum scheduling, probably -- something that occurs when it becomes evident that the generic schedule isn't quite cutting it in some important, specific way. > Ideally, I'd > like to avoid introducing a new kind of relation fork that uses a > different on-disk storage format (e.g. 16MB segments that are dropped > from the tail) rather than the one used by the other forks, but I'm > not sure we can get away with that, because conveyor-belt storage > looks pretty appealing here. No opinion on that just yet. > Regardless, the more we have to change to > accomplish the immediate goal, the more likely we are to introduce > instability into places where it could have been avoided, or to get > tangled up in endless bikeshedding. Certainly true. I'm not really trying to convince you of specific actionable points just yet, though. Perhaps that was the problem (or perhaps it simply led to miscommunication). It would be so much easier to discuss some of this stuff at an event like pgCon. Oh well. -- Peter Geoghegan
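A small sketch of the "approximate" index vacuuming idea mentioned above -- skipping leaf pages by LSN when no second heap pass will follow. The types and the helper are hypothetical; only the idea comes from the discussion:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

typedef struct LeafPage
{
    XLogRecPtr page_lsn;   /* LSN of the last change to this page */
    /* ... tuples elided ... */
} LeafPage;

static bool
leaf_page_needs_visit(const LeafPage *page,
                      XLogRecPtr lsn_of_previous_vacuum,
                      bool heap_pass_will_follow)
{
    /* If dead TIDs will later be set LP_UNUSED, every page must be visited so
     * no index entry can be left pointing at a recycled line pointer. */
    if (heap_pass_will_follow)
        return true;

    /* Otherwise, a page untouched since the previous VACUUM has accumulated
     * no new garbage and can be skipped. */
    return page->page_lsn > lsn_of_previous_vacuum;
}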
On Thu, Apr 22, 2021 at 1:01 PM Andres Freund <andres@anarazel.de> wrote: > The gin case seems a bit easier than the partial index case. Keeping > stats about the number of new entries in a GIN index doesn't seem too > hard, nor does tracking the number of cleaned up index entries. But > knowing which indexes are affected when a heap tuple becomes dead seems > harder. I guess we could just start doing a stats-only version of > ExecInsertIndexTuples() for deletes, but obviously the cost of that is > not enticing. Perhaps it'd not be too bad if we only did it when there's > an index with predicates? Though I agree that we need some handling here, I doubt that an index with a predicate is truly a special case. Suppose you have a partial index that covers 10% of the table. How is that meaningfully different from an index without a predicate that is otherwise equivalent? If the churn occurs in the same keyspace in either case, and if that's the part of the keyspace that queries care about, then ISTM that the practical difference is fairly insignificant. (If you have some churn all over the standard index but queries are only interested in the same 10% of the full keyspace, then this will be less true, but still roughly true.) There is an understandable tendency to focus on the total size of the index in each case, and to be alarmed that the partial index has (say) doubled in size, while at the same time not being overly concerned about lower *proportionate* growth for the standard index case (assuming otherwise identical workload/conditions). The page splits that affect the same 10% of the keyspace will be approximately as harmful in each case, though. We can expect the same growth in leaf pages in each case, which will look very similar. It should be obvious that it's somewhat of a problem that 90% of the standard index is apparently not useful (this is an unrelated problem). But if the DBA fixes this unrelated problem (by making the standard index a partial index), surely it would be absurd to then conclude that that helpful intervention somehow had the effect of making the index bloat situation much worse! I think that a simple heuristic could work very well here, but it needs to be at least a little sensitive to the extremes. And I mean all of the extremes, not just the one from my example -- every variation exists and will cause problems if given zero weight. -- Peter Geoghegan
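A tiny numeric illustration of that point, with made-up sizes: the proportional growth looks much worse for the partial index, but the absolute number of version-churn page splits -- the thing that actually hurts -- is identical:

#include <stdio.h>

int main(void)
{
    double standard_index_pages = 100000; /* index on the whole table (assumed) */
    double partial_index_pages  = 10000;  /* index on the interesting 10% only (assumed) */
    double split_pages          = 5000;   /* leaf pages split by version churn in the
                                           * hot part of the keyspace, same in both */

    printf("standard index growth: +%.1f%%\n",
           100.0 * split_pages / standard_index_pages);
    printf("partial index growth:  +%.1f%%\n",
           100.0 * split_pages / partial_index_pages);
    printf("unnecessary page splits in both cases: %.0f\n", split_pages);
    return 0;
}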
On Fri, Apr 23, 2021 at 1:04 PM Peter Geoghegan <pg@bowt.ie> wrote: > I think that a simple heuristic could work very well here, but it > needs to be at least a little sensitive to the extremes. And I mean > all of the extremes, not just the one from my example -- every > variation exists and will cause problems if given zero weight. To expand on this a bit, my objection to counting the number of live tuples in the index (as a means to determining how aggressively each individual index needs to be vacuumed) is this: it's driven by positive feedback, not negative feedback. We should focus on *extreme* adverse events (e.g., version-driven page splits) instead. We don't even need to understand ordinary adverse events (e.g., how many dead tuples are in the index). The cost of accumulating dead tuples in an index (could be almost any index AM) grows very slowly at first, and then suddenly explodes (actually it's more like a cascade of correlated explosions, but for the purposes of this explanation that doesn't matter). In a way, this makes life easy for us. The cost of accumulating dead tuples rises so dramatically at a certain inflection point that we can reasonably assume that that's all that matters -- just stop the explosions. An extremely simple heuristic that prevents these extreme adverse events can work very well because that's where almost all of the possible downside is. We can be sure that these extreme adverse events are universally very harmful (workload doesn't matter). Note that the same is not true for an approach driven by positive feedback -- it'll be fragile because it depends on workload characteristics in unfathomably many ways. We should focus on what we can understand with a high degree of confidence. We just need to identify what the extreme adverse event is in each index AM, count them, and focus on those (could be a VACUUM thing, could be local to the index AM like bottom-up deletion is). We need to notice when things are *starting* to go really badly and intervene aggressively. So we need to be willing to try a generic index vacuuming strategy first, and then notice that it has just failed, or is just about to fail. Something like version-driven page splits really shouldn't ever happen, so even a very crude approach will probably work very well. -- Peter Geoghegan
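A minimal sketch of what triggering on such an extreme adverse event could look like, using a hypothetical counter (nothing like this exists in the statistics today):

#include <stdbool.h>

typedef struct IndexVacuumSignal
{
    long version_driven_splits;  /* splits believed to have happened only because of
                                  * dead versions that deletion could not clear */
    long total_splits;
} IndexVacuumSignal;

static bool
index_needs_aggressive_vacuum(const IndexVacuumSignal *sig)
{
    /* The threshold is deliberately tiny: this event "shouldn't ever happen",
     * so even a handful of occurrences is strong negative feedback. */
    return sig->version_driven_splits > 0;
}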
Hi, On 2021-04-24 11:21:49 -0700, Peter Geoghegan wrote: > To expand on this a bit, my objection to counting the number of live > tuples in the index (as a means to determining how aggressively each > individual index needs to be vacuumed) is this: it's driven by > positive feedback, not negative feedback. We should focus on *extreme* > adverse events (e.g., version-driven page splits) instead. We don't > even need to understand ordinary adverse events (e.g., how many dead > tuples are in the index). I don't see how that's good enough as a general approach. It won't work on indexes that insert on one end, delete from the other (think inserted_at or serial primary keys in many workloads). Greetings, Andres Freund
On Sat, Apr 24, 2021 at 11:43 AM Andres Freund <andres@anarazel.de> wrote: > I don't see how that's good enough as a general approach. It won't work > on indexes that insert on one end, delete from the other (think > inserted_at or serial primary keys in many workloads). That can be treated as another extreme that we need to treat as negative feedback. There may also be other types of negative feedback that occur only in some index AMs, that neither of us have thought of just yet. But that's okay -- we can just add that to the list. Some varieties of negative feedback might be much more common in practice than others. This shouldn't matter. The number of live tuples (or even dead tuples) in the whole entire index is simply not a useful proxy for what actually matters -- this is 100% clear. There are many cases where this will do completely the wrong thing, even if we have perfectly accurate information. I'll spare you a repeat of the example of bottom-up index deletion and "Schrodinger's dead index tuples" (it's not the only example, just the purest). -- Peter Geoghegan
Hi, On 2021-04-24 11:59:29 -0700, Peter Geoghegan wrote: > The number of live tuples (or even dead tuples) in the whole entire > index is simply not a useful proxy for what actually matters -- this > is 100% clear. Did anybody actually argue for using #live entries directly? I think *dead* entries are more relevant, particularly because various forms of local cleanup can be taken into account. Live tuples might come in to put the number of dead tuples into perspective, but otherwise not that much? > There are many cases where this will do completely the wrong thing, > even if we have perfectly accurate information. Imo the question isn't really whether criteria will ever do something wrong, but how often and how consequential such mistakes will be. E.g. unnecessarily vacuuming an index isn't fun, but it's better than ending up never cleaning up dead index pointers despite repeat accesses (think bitmap scans). Greetings, Andres Freund
On Sat, Apr 24, 2021 at 12:56 PM Andres Freund <andres@anarazel.de> wrote: > Did anybody actually argue for using #live entries directly? I think > *dead* entries are more relevant, particularly because various forms of > local cleanup can be taken into account. Live tuples might come in to > put the number of dead tuples into perspective, but otherwise not that > much? I was unclear. I can't imagine how you'd do anything like this without using both together. Or if you didn't use live tuples you'd use heap blocks instead. Something like that. > > There are many cases where this will do completely the wrong thing, > > even if we have perfectly accurate information. > > Imo the question isn't really whether criteria will ever do something > wrong, but how often and how consequential such mistakes will > be. E.g. unnecessarily vacuuming an index isn't fun, but it's better > than ending up never cleaning up dead index pointers despite repeat > accesses (think bitmap scans). I strongly agree. The risk with what I propose is that we'd somehow overlook a relevant extreme cost. But I think that that's an acceptable risk. Plus I see no workable alternative -- your "indexes that insert on one end, delete from the other" example works much better as an argument against what you propose than an argument against my own alternative proposal. Which reminds me: how would your framework for index bloat/skipping indexes in VACUUM cope with this same scenario? Though I don't think that it's useful to use quantitative thinking as a starting point here, that doesn't mean there is exactly zero role for it. Not sure about how far I'd go here. But I would probably not argue that we shouldn't vacuum an index that is known to (say) be more than 60% dead tuples. I guess I'd always prefer to have a better metric, but speaking hypothetically: Why take a chance? This is not because it's definitely worth it -- it really isn't! It's just because the benefit of being right is low compared to the cost of being wrong -- as you point out, that is really important.
On Sat, Apr 24, 2021 at 1:17 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Sat, Apr 24, 2021 at 12:56 PM Andres Freund <andres@anarazel.de> wrote: > > Imo the question isn't really whether criteria will ever do something > > wrong, but how often and how consequential such mistakes will > > be. E.g. unnecessarily vacuuming an index isn't fun, but it's better > > than ending up never cleaning up dead index pointers despite repeat > > accesses (think bitmap scans). > > I strongly agree. The risk with what I propose is that we'd somehow > overlook a relevant extreme cost. But I think that that's an > acceptable risk. IMV the goal here is not really to skip index vacuuming when it's unnecessary. The goal is to do *more* index vacuuming when and where it *is* necessary (in one problematic index among several) -- maybe even much much more. We currently treat index vacuuming as an all-or-nothing thing at the level of the table, which makes this impossible. This is another reason why we can be pretty conservative about skipping. We only need to skip index vacuuming for those indexes that we're pretty confident just don't need it -- that's sufficient to be able to do vastly more index vacuuming where it is needed in almost all cases. There is some gray area, but that seems much less interesting to me. -- Peter Geoghegan
On Sat, Apr 24, 2021 at 12:22 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Apr 23, 2021 at 7:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I think we can divide the TID fork into 16MB or 32MB chunks like WAL > > segment files so that we can easily remove old chunks. Regarding the > > efficient search part, I think we need to consider the case where the > > TID fork gets bigger than maintenance_work_mem. In that case, during > > the heap scan, we need to check if the discovered TID exists in a > > chunk of the TID fork that could be on the disk. Even if all > > known-dead-TIDs are loaded into an array on the memory, it could get > > much slower than the current heap scan to bsearch over the array for > > each dead TID discovered during heap scan. So it would be better to > > have a way to skip searching by already recorded TIDs. For example, > > during heap scan or HOT pruning, I think that when marking TIDs dead > > and recording it to the dead TID fork we can mark them “dead and > > recorded” instead of just “dead” so that future heap scans can skip > > those TIDs without existence check. > > I'm not very excited about this. If we did this, and if we ever > generated dead-but-not-recorded TIDs, then we will potentially dirty > those blocks again later to mark them recorded. The idea I imagined is that we always mark a TID recorded at the same time as marking it dead, so we don't dirty the page again -- but, yes, if we do that, the recorded flag is not necessary. We can simply assume that a TID marked dead has been recorded in the TID fork. Future vacuums can skip TIDs that are already marked dead. > > Also, if bsearch() is a bottleneck, how about just using an O(1) > algorithm instead of an O(lg n) algorithm, rather than changing the > on-disk format? > > Also, can you clarify exactly what you think the problem case is here? > It seems to me that: > > 1. If we're pruning the heap to collect dead TIDs, we should stop when > the number of TIDs we've accumulated reaches maintenance_work_mem. It > is possible that we could find when starting to prune that there are > *already* more dead TIDs than will fit, because maintenance_work_mem > might have been reduced since they were gathered. But that's OK: we > can figure out that there are more than will fit without loading them > all, and since we shouldn't do additional pruning in this case, > there's no issue. The case I'm thinking of is that pruning the heap and sanitizing indexes run concurrently -- as you mentioned, concurrency is one of the benefits of decoupling the vacuum phases. In that case, one process is doing index vacuuming using the known dead TIDs in the TID fork while another process is appending new dead TIDs. We could suspend heap pruning until the TID fork gets smaller, as you mentioned, but that seems inefficient. > > 2. If we're sanitizing indexes, we should normally discover that there > are few enough TIDs that we can still fit them all in memory. But if > that proves not to be the case, again because for example > maintenance_work_mem has been reduced, then we can handle that with > multiple index passes just as we do today. Yeah, there seems to be room for improvement, but it's no worse than today. I imagine users will want to set a high maintenance_work_mem for sanitizing a global index, separately from the setting for heap pruning. > > 3. 
If we're going back to the heap to permit TIDs to be recycled by > setting dead line pointers to unused, we can load in as many of those > as will fit in maintenance_work_mem, sort them by block number, and go > through block by block and DTRT. Then, we can release all that memory > and, if necessary, do the whole thing again. This isn't even > particularly inefficient. Agreed. Just an idea: while pruning the heap, we could buffer the collected dead TIDs before writing them out to the TID fork, so that we can sort the dead TIDs within a chunk (say a 16MB chunk consisting of 8KB blocks). We would write the chunk to disk either when it is filled with dead TIDs or when index sanitizing starts. The latter timing is needed to remember the chunk ID (or the uint64 ID of the dead TID) indicating how far index sanitizing has removed dead TIDs. One of the benefits would be to reduce the disk I/O for the dead TID fork. Another would be that we're likely to complete the recycle phase in one heap scan, since we load only one block per chunk while scanning the heap. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
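To make the chunk idea above a bit more concrete, here is a minimal sketch of buffering and sorting dead TIDs per chunk. It is purely illustrative: DeadTidChunk, deadtid_chunk_add, and write_chunk_to_tid_fork are made-up names, and any real patch might organize this quite differently.

#include "postgres.h"
#include "storage/itemptr.h"

/* Hypothetical in-memory buffer for one 16MB conveyor-belt chunk. */
typedef struct DeadTidChunk
{
    uint64          first_id;   /* conveyor-belt ID assigned to tids[0] */
    int             ntids;      /* number of TIDs currently buffered */
    int             maxtids;    /* capacity: roughly 16MB worth of TIDs */
    ItemPointerData tids[FLEXIBLE_ARRAY_MEMBER];
} DeadTidChunk;

/* Hypothetical I/O layer that appends one sorted chunk to the TID fork. */
extern void write_chunk_to_tid_fork(DeadTidChunk *chunk);

static int
deadtid_cmp(const void *a, const void *b)
{
    return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
}

/*
 * Add one dead TID found during heap pruning.  The chunk is flushed when it
 * fills up; the caller would also flush it explicitly when index sanitizing
 * starts, so that there is a stable boundary (first_id + ntids) for each
 * index to remember its progress against.
 */
static void
deadtid_chunk_add(DeadTidChunk *chunk, ItemPointer tid)
{
    chunk->tids[chunk->ntids++] = *tid;

    if (chunk->ntids == chunk->maxtids)
    {
        /* Sort so that readers can binary-search within a single chunk. */
        qsort(chunk->tids, chunk->ntids, sizeof(ItemPointerData), deadtid_cmp);
        write_chunk_to_tid_fork(chunk);
        chunk->first_id += chunk->ntids;
        chunk->ntids = 0;
    }
}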
On Fri, Apr 23, 2021 at 5:01 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2021-04-22 12:15:27 -0400, Robert Haas wrote: > > On Wed, Apr 21, 2021 at 5:38 PM Andres Freund <andres@anarazel.de> wrote: > > > I'm not sure that's the only way to deal with this. While some form of > > > generic "conveyor belt" infrastructure would be a useful building block, > > > and it'd be sensible to use it here if it existed, it seems feasible to > > > dead tids in a different way here. You could e.g. have per-heap-vacuum > > > files with a header containing LSNs that indicate the age of the > > > contents. > > > > That's true, but have some reservations about being overly reliant on > > the filesystem to provide structure here. There are good reasons to be > > worried about bloating the number of files in the data directory. Hmm, > > but maybe we could mitigate that. First, we could skip this for small > > relations. If you can vacuum the table and all of its indexes using > > the naive algorithm in <10 seconds, you probably shouldn't do anything > > fancy. That would *greatly* reduce the number of additional files > > generated. Second, we could forget about treating them as separate > > relation forks and make them some other kind of thing entirely, in a > > separate directory > > I'm not *too* worried about this issue. IMO the big difference to the > cost of additional relation forks is that such files would only exist > when the table is modified to a somewhat meaningful degree. IME the > practical issues with the number of files due to forks are cases where > huge number of tables that are practically never modified exist. > > That's not to say that I am sure that some form of "conveyor belt" > storage *wouldn't* be the right thing. How were you thinking of dealing > with the per-relation aspects of this? One conveyor belt per relation? > > > > especially if we adopted Sawada-san's proposal to skip WAL logging. I > > don't know if that proposal is actually a good idea, because it > > effectively adds a performance penalty when you crash or fail over, > > and that sort of thing can be an unpleasant surprise. But it's > > something to think about. > > I'm doubtful about skipping WAL logging entirely - I'd have to think > harder about it, but I think that'd mean we'd restart from scratch after > crashes / immediate restarts as well, because we couldn't rely on the > contents of the "dead tid" files to be accurate. In addition to the > replication issues you mention. Yeah, not having WAL would have a big negative impact on other various aspects. Can we piggyback the WAL for the TID fork and XLOG_HEAP2_PRUNE? That is, we add the buffer for the TID fork to XLOG_HEAP2_PRUNE and record one 64-bit number of the first dead TID in the list so that we can add dead TIDs to the TID fork during replaying XLOG_HEAP2_PRUNE. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Thu, May 6, 2021 at 8:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I'm doubtful about skipping WAL logging entirely - I'd have to think > > harder about it, but I think that'd mean we'd restart from scratch after > > crashes / immediate restarts as well, because we couldn't rely on the > > contents of the "dead tid" files to be accurate. In addition to the > > replication issues you mention. > > Yeah, not having WAL would have a big negative impact on other various > aspects. Can we piggyback the WAL for the TID fork and > XLOG_HEAP2_PRUNE? That is, we add the buffer for the TID fork to > XLOG_HEAP2_PRUNE and record one 64-bit number of the first dead TID in > the list so that we can add dead TIDs to the TID fork during replaying > XLOG_HEAP2_PRUNE. That could be an option but we need to be careful about the buffer lock order because now we will have to hold the lock on the TID fork buffer as well as the heap buffer so that we don't create any deadlock. And there is also a possibility of holding the lock on multiple TID fork buffers, which will depend upon how many tid we have pruned. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, May 6, 2021 at 3:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, May 6, 2021 at 8:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > I'm doubtful about skipping WAL logging entirely - I'd have to think > > > harder about it, but I think that'd mean we'd restart from scratch after > > > crashes / immediate restarts as well, because we couldn't rely on the > > > contents of the "dead tid" files to be accurate. In addition to the > > > replication issues you mention. > > > > Yeah, not having WAL would have a big negative impact on other various > > aspects. Can we piggyback the WAL for the TID fork and > > XLOG_HEAP2_PRUNE? That is, we add the buffer for the TID fork to > > XLOG_HEAP2_PRUNE and record one 64-bit number of the first dead TID in > > the list so that we can add dead TIDs to the TID fork during replaying > > XLOG_HEAP2_PRUNE. > > That could be an option but we need to be careful about the buffer > lock order because now we will have to hold the lock on the TID fork > buffer as well as the heap buffer so that we don't create any > deadlock. And there is also a possibility of holding the lock on > multiple TID fork buffers, which will depend upon how many tid we have > pruned. Not sure we will need to hold buffer locks for both the TID fork and the heap at the same time but I agree that we could need to lock on multiple TID fork buffers. We could need to add dead TIDs to up to two pages for the TID fork during replaying XLOG_HEAP2_PRUNE since we write it per heap pages. Probably we can process one by one. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Thu, May 6, 2021 at 5:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Not sure we will need to hold buffer locks for both the TID fork and > the heap at the same time but I agree that we could need to lock on > multiple TID fork buffers. We could need to add dead TIDs to up to two > pages for the TID fork during replaying XLOG_HEAP2_PRUNE since we > write it per heap pages. Probably we can process one by one. It seems like we do need to hold them at the same time, because typically for a WAL record you lock all the buffers, modify them all while writing the WAL record, and then unlock them all. Now maybe there's some argument that we can dodge that requirement here, but I have reservations about departing from the usual locking pattern. It's easier to reason about the behavior when everybody follows the same set of rules. -- Robert Haas EDB: http://www.enterprisedb.com
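For reference, the usual pattern Robert describes looks roughly like the sketch below. The TID-fork pieces (tidfork_buf, tidfork_append_dead_tids, the extra registered buffer) are hypothetical, variable declarations are elided, and the real pruning code also registers the offset arrays as buffer data; the only point here is the lock-all / modify / log / unlock-all ordering.

/* Both pages are modified under one WAL record, so lock both up front. */
LockBuffer(heap_buf, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(tidfork_buf, BUFFER_LOCK_EXCLUSIVE);     /* hypothetical buffer */

START_CRIT_SECTION();

heap_page_prune_execute(heap_buf, redirected, nredirected,
                        nowdead, ndead, nowunused, nunused);
tidfork_append_dead_tids(tidfork_buf, nowdead, ndead);  /* hypothetical */

MarkBufferDirty(heap_buf);
MarkBufferDirty(tidfork_buf);

if (RelationNeedsWAL(relation))
{
    XLogRecPtr  recptr;

    XLogBeginInsert();
    XLogRegisterData((char *) &xlrec, SizeOfHeapPrune);
    XLogRegisterBuffer(0, heap_buf, REGBUF_STANDARD);
    XLogRegisterBuffer(1, tidfork_buf, REGBUF_STANDARD);   /* the new part */

    recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE);

    PageSetLSN(BufferGetPage(heap_buf), recptr);
    PageSetLSN(BufferGetPage(tidfork_buf), recptr);
}

END_CRIT_SECTION();

LockBuffer(tidfork_buf, BUFFER_LOCK_UNLOCK);
LockBuffer(heap_buf, BUFFER_LOCK_UNLOCK);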
On Thu, May 6, 2021 at 7:19 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, May 6, 2021 at 5:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Not sure we will need to hold buffer locks for both the TID fork and > > the heap at the same time but I agree that we could need to lock on > > multiple TID fork buffers. We could need to add dead TIDs to up to two > > pages for the TID fork during replaying XLOG_HEAP2_PRUNE since we > > write it per heap pages. Probably we can process one by one. > > It seems like we do need to hold them at the same time, because > typically for a WAL record you lock all the buffers, modify them all > while writing the WAL record, and then unlock them all. > > Now maybe there's some argument that we can dodge that requirement > here, but I have reservations about departing from the usual locking > pattern. It's easier to reason about the behavior when everybody > follows the same set of rules. Yes, agreed. I was thinking of replaying WAL, not writing WAL. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Thu, 6 May 2021 at 4:12 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Thu, May 6, 2021 at 7:19 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Thu, May 6, 2021 at 5:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > Not sure we will need to hold buffer locks for both the TID fork and
> > > the heap at the same time but I agree that we could need to lock on
> > > multiple TID fork buffers. We could need to add dead TIDs to up to two
> > > pages for the TID fork during replaying XLOG_HEAP2_PRUNE since we
> > > write it per heap pages. Probably we can process one by one.
> >
> > It seems like we do need to hold them at the same time, because
> > typically for a WAL record you lock all the buffers, modify them all
> > while writing the WAL record, and then unlock them all.
> >
> > Now maybe there's some argument that we can dodge that requirement
> > here, but I have reservations about departing from the usual locking
> > pattern. It's easier to reason about the behavior when everybody
> > follows the same set of rules.
>
> Yes, agreed. I was thinking of replaying WAL, not writing WAL.
Right, I was referring to writing the WAL.
Andres Freund <andres@anarazel.de> wrote: > On 2021-04-21 11:21:31 -0400, Robert Haas wrote: > > This scheme adds a lot of complexity, which is a concern, but it seems > > to me that it might have several benefits. One is concurrency. You > > could have one process gathering dead TIDs and adding them to the > > dead-TID fork while another process is vacuuming previously-gathered > > TIDs from some index. > > I think it might even open the door to using multiple processes > gathering dead TIDs for the same relation. I think the possible concurrency improvements are themselves a valid reason to do the decoupling. Or rather it's hard to imagine how the current implementation of VACUUM can get parallel workers involved in gathering the dead heap TIDs efficiently. Currently, a single backend gathers the heap TIDs, and it can then launch several parallel workers to remove the TIDs from indexes. If parallel workers gathered the heap TIDs, then (w/o the decoupling) the parallel index processing would be a problem because a parallel worker cannot launch other parallel workers. > > In fact, every index could be getting vacuumed at the same time, and > > different indexes could be removing different TID ranges. > > We kind of have this feature right now, due to parallel vacuum... -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Wed, Apr 21, 2021 at 8:21 AM Robert Haas <robertmhaas@gmail.com> wrote: > Now, the reason for this is that when we discover dead TIDs, we only > record them in memory, not on disk. So, as soon as VACUUM ends, we > lose all knowledge of whether those TIDs were and must rediscover > them. Suppose we didn't do this, and instead had a "dead TID" fork for > each table. Suppose further that this worked like a conveyor belt, > similar to WAL, where every dead TID we store into the fork is > assigned an identifying 64-bit number that is never reused. Have you started any work on this project? I think that it's a very good idea. Enabling index-only scans is a good enough reason to pursue this project, even on its own. The flexibility that this design offers allows VACUUM to run far more aggressively, with little possible downside. It makes it possible for VACUUM to run so frequently that it rarely dirties pages most of the time -- at least in many important cases. Imagine if VACUUM almost kept in lockstep with inserters into an append-mostly table -- that would be great. The main blocker to making VACUUM behave like that is of course indexes. Setting visibility map bits during VACUUM can make future vacuuming cheaper (for the obvious reason), which *also* makes it cheaper to set *most* visibility map bits as the table is further extended, which in turn makes future vacuuming cheaper...and so on. This virtuous circle seems like it might be really important. Especially once you factor in the cost of dirtying pages a second or a third time. I think that we can really keep the number of times VACUUM dirties pages under control, simply by decoupling. Decoupling is key to keeping the costs to a minimum. I attached a POC autovacuum logging instrumentation patch that shows how VACUUM uses *and* sets VM bits. I wrote this for my TPC-C + FSM work. Seeing both things together, and seeing how both things *change* over time was a real eye opener for me: it turns out that the master branch keeps setting and resetting VM bit pages in the two big append-mostly tables that are causing so much trouble for Postgres today. What we see right now is pretty disorderly -- the numbers don't trend in the right direction when they should. But it could be a lot more orderly, with a little work. This instrumentation helped me to discover a better approach to indexing within TPC-C, based on index-only scans [1]. It also made me realize that it's possible for a table to have real problems with dead tuple cleanup in indexes, while nevertheless being an effective target for index-only scans. There is actually no good reason to think that one condition should preclude the other -- they may very well go together. You did say this yourself when talking about global indexes, but there is no reason to think that it's limited to partitioning cases. The current "ANALYZE dead_tuples statistics" paradigm cannot recognize when both conditions go together, even though I now think that it's fairly common. I also like your idea here because it enables a more qualitative approach, based on recent information for recently modified blocks -- not whole-table statistics. Averages are notoriously misleading. [1] https://github.com/pgsql-io/benchmarksql/pull/16 -- Peter Geoghegan
On Thu, Sep 16, 2021 at 7:09 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Apr 21, 2021 at 8:21 AM Robert Haas <robertmhaas@gmail.com> wrote: > > Now, the reason for this is that when we discover dead TIDs, we only > > record them in memory, not on disk. So, as soon as VACUUM ends, we > > lose all knowledge of whether those TIDs were and must rediscover > > them. Suppose we didn't do this, and instead had a "dead TID" fork for > > each table. Suppose further that this worked like a conveyor belt, > > similar to WAL, where every dead TID we store into the fork is > > assigned an identifying 64-bit number that is never reused. > > Enabling index-only scans is a good enough reason to pursue this > project, even on its own. +1 > I attached a POC autovacuum logging instrumentation patch that shows > how VACUUM uses *and* sets VM bits. Logging how vacuum uses and sets VM bits seems a good idea. I've read the proposed PoC patch. Probably it's better to start a new thread for this patch and comment on it there, but let me leave one comment on the patch here: With the patch, we increment the allfrozen_pages counter, which is used to determine whether or not we advance relfrozenxid and relminmxid, at two places:

@@ -1141,7 +1201,9 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
          * in this case an approximate answer is OK.
          */
         if (aggressive || VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
-            vacrel->frozenskipped_pages++;
+            vacrel->allfrozen_pages++;
+        else
+            vacrel->allvisible_pages++;

         continue;

@@ -1338,6 +1400,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
          */
         if (!PageIsAllVisible(page))
         {
+            vacrel->allfrozen_pages++;
+

I think that we will end up doubly counting the page as scanned_pages and allfrozen_pages due to the newly added latter change. This seems wrong to me because we calculate as follows:

@@ -644,7 +656,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
      * NB: We need to check this before truncating the relation, because that
      * will change ->rel_pages.
      */
-    if ((vacrel->scanned_pages + vacrel->frozenskipped_pages)
+    if ((vacrel->scanned_pages + vacrel->allfrozen_pages)
         < vacrel->rel_pages)
     {
         Assert(!aggressive);

Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Wed, Sep 15, 2021 at 6:08 PM Peter Geoghegan <pg@bowt.ie> wrote: > Have you started any work on this project? I think that it's a very good idea. Actually, I have. I've been focusing on trying to create a general infrastructure for conveyor belt storage. An incomplete and likely quite buggy version of this can be found here: https://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/conveyor Mark Dilger has been helping me debug it, but it's still very early days. I was planning to wait until it was a little more baked before posting it to the list, but since you asked... Once that infrastructure is sufficiently mature, then the next step, I think, would be to try to use it to store dead TIDs. And then after that, one has to think about how autovacuum scheduling ought to work in a world where table vacuuming and index vacuuming are decoupled. This is a very hard problem, and I don't expect to solve it quickly. I do hope to keep plugging away at it, though. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Sep 23, 2021 at 10:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Thu, Sep 16, 2021 at 7:09 AM Peter Geoghegan <pg@bowt.ie> wrote: > > Enabling index-only scans is a good enough reason to pursue this > > project, even on its own. > > +1 I was hoping that you might be able to work on opportunistically freezing whole pages for Postgres 15. I think that it would make sense to opportunistically make a page that is about to become all_visible during VACUUM become all_frozen instead. Our goal is to make most pages skip all_visible, and go straight to all_frozen directly. Often the page won't need to be dirtied again, ever. Right now freezing is something that we mostly just think about as occurring at the level of tuples, which doesn't seem ideal. This seems related to Robert's project because both projects are connected to the question of how autovacuum scheduling works in general. We will probably need to rethink things like the vacuum_freeze_min_age GUC. (I also think that we might need to reconsider how aggressive/anti-wraparound VACUUMs work, but that's another story.) Obviously this is a case of performing work eagerly; a form of speculation that tries to lower costs in the aggregate, over time. Heuristics that work well on average seem possible, but even excellent heuristics could be wrong -- in the end we're trying to predict the future, which is inherently impossible to do reliably for all workloads. I think that that will be okay, provided that the cost of being wrong is kept low and *fixed* (the exact definition of "fixed" will need to be worked out, but the basic idea is that any regression is once per page, not once per page per VACUUM or something). Once it's much cheaper to freeze a whole page early (i.e. all tuple headers from all tuples), then the implementation can be wrong 95%+ of the time, and maybe we'll still win by a lot. That may sound bad, until you realize that it's 95% *per VACUUM* -- the entire situation is much better once you think about the picture for the entire table over time and across many different VACUUM operations, and once you think about FPIs in the WAL stream. We'll be paying the cost of freezing in smaller and more predictable increments, too, which can make the whole system more robust. Having many pages all go from all_visible to all_frozen at the same time (just because they crossed some usually-meaningless XID-based threshold) is actually quite risky (this is why I mentioned aggressive VACUUMs in passing). The hard part is getting the cost way down. lazy_scan_prune() uses xl_heap_freeze_tuple records for each tuple it freezes. These obviously have a lot of redundancy across tuples from the same page in practice. And the WAL overhead is much larger just because these are per-tuple records, not per-page records. Getting the cost down is hard because of issues with MultiXacts, freezing xmin but not freezing xmax at the same time, etc. > Logging how vacuum uses and sets VM bits seems a good idea. > I think that we will end up doubly counting the page as scanned_pages > and allfrozen_pages due to the newly added latter change. This seems > wrong to me because we calculate as follows: I agree that that's buggy. Oops. It was just a prototype that I wrote for my own work. I do think that we should have a patch that has some of this, for users, but I am not sure about the details just yet. This is probably too much information for users, but I think it will take me more time to decide what really does matter to users. -- Peter Geoghegan
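To illustrate the shape of the opportunistic whole-page freezing idea discussed above -- this is not from any posted patch; all_tuples_freezable() and heap_freeze_all_tuples() are hypothetical, and the real code would also have to handle WAL logging, critical sections, and the MultiXact issues just mentioned -- the decision could sit at the point where VACUUM is about to mark a page all_visible:

/* Hypothetical sketch: upgrade "about to become all_visible" to all_frozen. */
static void
maybe_freeze_whole_page(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
                        Buffer vmbuffer, LVPagePruneState *prunestate)
{
    Page    page = BufferGetPage(buf);
    uint8   flags = VISIBILITYMAP_ALL_VISIBLE;

    if (!prunestate->all_visible)
        return;

    if (!prunestate->all_frozen && all_tuples_freezable(page))
    {
        /* Freeze every tuple now, ideally with one per-page WAL record. */
        heap_freeze_all_tuples(vacrel->rel, buf);
        prunestate->all_frozen = true;
    }

    PageSetAllVisible(page);
    MarkBufferDirty(buf);

    if (prunestate->all_frozen)
        flags |= VISIBILITYMAP_ALL_FROZEN;

    visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
                      vmbuffer, prunestate->visibility_cutoff_xid, flags);
}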
On Fri, Sep 24, 2021 at 11:48 AM Robert Haas <robertmhaas@gmail.com> wrote: > Actually, I have. I've been focusing on trying to create a general > infrastructure for conveyor belt storage. An incomplete and likely > quite buggy version of this can be found here: > > https://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/conveyor That's great news! I think that this is the right high level direction. > Mark Dilger has been helping me debug it, but it's still very early > days. I was planning to wait until it was a little more baked before > posting it to the list, but since you asked... Reminds me of my FSM patch, in a way. It's ambitious, and still very rough, but maybe I should bite the bullet and post it as a POC soon. > Once that infrastructure is sufficiently mature, then the next step, I > think, would be to try to use it to store dead TIDs. +1. > And then after that, one has to think about how autovacuum scheduling > ought to work in a world where table vacuuming and index vacuuming are > decoupled. I'm excited about the possibility of using this infrastructure as a springboard for driving autovacuum's behavior using more or less authoritative information, rather than dubious statistics that can consistently lead us down the wrong path. ANALYZE style statistics are something that can only work under specific conditions that take their obvious limitations into account -- and even there (even within the optimizer) it's amazing that they work as well as they do. I fear that we assumed that the statistics driving autovacuum were good enough at some point in the distant past, and never really validated that assumption. Perhaps because anti-wraparound VACUUM was *accidentally* protective. The scheduling of autovacuum is itself a big problem for the two big BenchmarkSQL tables I'm always going on about -- though it did get a lot better with the introduction of the autovacuum_vacuum_insert_scale_factor stuff in Postgres 13. I recently noticed that the tables have *every* autovacuum driven by inserts (i.e. by the new autovacuum_vacuum_scale_factor stuff), and never by updates -- even though updates obviously produce significant bloat in the two tables. BenchmarkSQL on Postgres was far worse than it is now a few releases ago [1], and I think that this stats business was a big factor (on top of everything else). I can clearly see that autovacuum_vacuum_scale_factor is certainly accidentally protective with BenchmarkSQL today, in a way that wasn't particularly anticipated by anybody. The fact that the intellectual justifications for a lot of these things are so vague concerns me. For example, why do we apply autovacuum_vacuum_scale_factor based on reltuples at the end of the last VACUUM? That aspect of the design will make much less sense once we have this decoupling in place. Even with the happy accident of autovacuum_vacuum_insert_scale_factor helping BenchmarkSQL, the conventional dead tuples based approach to VACUUM still doesn't drive autovacuum sensibly -- we still systematically undercount LP_DEAD stubs because (with this workload) they're systemically concentrated in relatively few heap pages. So if this was a real app, the DBA would somehow have to work out that they should aggressively tune autovacuum_vacuum_scale_factor to clean up bloat from updates. I doubt any DBA could ever figure that out, because it doesn't make any sense. 
The problem goes both ways: in addition to undercounting dead tuples, we effectively overcount, which can lead to autovacuum chasing its own tail [2]. I think that we could do *way* better than we do today without enormous effort, and I think that it really matters. Maybe we could select from a few standard models for autovacuum scheduling using Bayesian inference -- converge on the more predictive model for a given table over time, using actual outcomes for each autovacuum. Be sensitive to how LP_DEAD stub line pointers can become concentrated in relatively few heap pages, and stuff like that. Maybe keep a little history to work off of. The problem with the current model is not that it might be wrong. The problem is that it might *never* be right (for a given table). The scheduling never learns any lessons, because it's fundamentally static -- it ought to be dynamic. How things change is much more informative than where things are at an arbitrary point in time. [1] https://www.postgresql.org/message-id/flat/0265f9e2-3e32-e67d-f106-8abde596c0e4%40commandprompt.com [2] https://postgr.es/m/CAH2-Wz=sJm3tm+FpXbyBhEhX5tbz1trQrhG6eOhYk4-+5uL=ww@mail.gmail.com -- Peter Geoghegan
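Purely as a toy illustration of the "converge on the more predictive model" brainstorm above -- nothing like this exists in autovacuum today, and the models, the variance constant, and every name here are invented -- the bookkeeping could be as small as a weight per candidate model, updated after each autovacuum from how well that model predicted what the autovacuum actually found:

#include <math.h>

#define NUM_MODELS 3            /* e.g. insert-driven, update-driven, mixed */

typedef struct BloatModelState
{
    double  weight[NUM_MODELS]; /* normalized weights, starting uniform */
} BloatModelState;

/*
 * predicted[i] is model i's prediction of, say, the fraction of heap pages
 * containing LP_DEAD items; observed is what the autovacuum actually found.
 * Models that predicted well gain weight (posterior ~ prior * likelihood).
 */
static void
bloat_models_update(BloatModelState *state,
                    const double *predicted, double observed)
{
    double  total = 0.0;

    for (int i = 0; i < NUM_MODELS; i++)
    {
        double  err = predicted[i] - observed;

        /* Gaussian-style likelihood; 0.02 is an arbitrary variance. */
        state->weight[i] *= exp(-(err * err) / 0.02);
        total += state->weight[i];
    }

    if (total <= 0.0)
    {
        /* Every model was badly wrong; fall back to uniform weights. */
        for (int i = 0; i < NUM_MODELS; i++)
            state->weight[i] = 1.0 / NUM_MODELS;
        return;
    }

    for (int i = 0; i < NUM_MODELS; i++)
        state->weight[i] /= total;
}

Scheduling would then lean on whichever model currently carries the most weight (or on the weighted mixture), and keeping a little history, as suggested above, would stop the weights from reacting too violently to one unusual vacuum.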
On Fri, Sep 24, 2021 at 7:44 PM Peter Geoghegan <pg@bowt.ie> wrote: > The scheduling of autovacuum is itself a big problem for the two big > BenchmarkSQL tables I'm always going on about -- though it did get a > lot better with the introduction of the > autovacuum_vacuum_insert_scale_factor stuff in Postgres 13. I recently > noticed that the tables have *every* autovacuum driven by inserts > (i.e. by the new autovacuum_vacuum_scale_factor stuff), and never by > updates -- even though updates obviously produce significant bloat in > the two tables. BenchmarkSQL on Postgres was far worse than it is now > a few releases ago [1], and I think that this stats business was a big > factor (on top of everything else). I can clearly see that > autovacuum_vacuum_scale_factor is certainly accidentally protective > with BenchmarkSQL today, in a way that wasn't particularly anticipated > by anybody. > So if this was a real app, the DBA would > somehow have to work out that they should aggressively tune > autovacuum_vacuum_scale_factor to clean up bloat from updates. I doubt > any DBA could ever figure that out, because it doesn't make any sense. Correction: I meant that the autovacuum_vacuum_insert_scale_factor GUC is accidentally protective with the BenchmarkSQL tables, and that no DBA could be expected to figure this out. That is, it helps to lower autovacuum_vacuum_insert_scale_factor from its default of 0.20, just to get autovacuum to better handle bloat from *updates*. This has nothing to do with inserts, or with freezing or setting VM bits -- and so overall it doesn't make any sense. -- Peter Geoghegan
On Sat, Sep 25, 2021 at 10:17 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Thu, Sep 23, 2021 at 10:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Sep 16, 2021 at 7:09 AM Peter Geoghegan <pg@bowt.ie> wrote: > > > Enabling index-only scans is a good enough reason to pursue this > > > project, even on its own. > > > > +1 > > I was hoping that you might be able to work on opportunistically > freezing whole pages for Postgres 15. I think that it would make sense > to opportunistically make a page that is about to become all_visible > during VACUUM become all_frozen instead. Our goal is to make most > pages skip all_visible, and go straight to all_frozen directly. Often > the page won't need to be dirtied again, ever. +1. I'm happy to work on this. There was a similar proposal before[1]: if we are going to freeze even one tuple on a page, and all tuples on the page can be frozen, then we freeze all of them and set the page all-frozen. This is also a good approach. > The hard part is getting the cost way down. lazy_scan_prune() uses > xl_heap_freeze_tuple records for each tuple it freezes. These > obviously have a lot of redundancy across tuples from the same page in > practice. And the WAL overhead is much larger just because these are > per-tuple records, not per-page records. xl_heap_freeze_page includes multiple xl_heap_freeze_tuple entries, but we write one XLOG_HEAP2_FREEZE_PAGE WAL record per page, don't we? Which WAL overhead did you refer to? Regards, [1] https://www.postgresql.org/message-id/CANP8%2Bj%2BEfLZMux6KLvb%2BumdeVYc%2BJZs5ReNSFq9WDLn%2BAKnhkg%40mail.gmail.com -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Mon, Sep 27, 2021 at 8:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Hi, Here is the first WIP patch for decoupling table and index vacuum. The first mail of the thread has already explained the complete background of why we want to do this, so instead of describing that I will directly jump into explaining what these patches do. Currently, the table vacuum and index vacuum are executed as a single operation. Basically, we vacuum the table by performing HOT pruning and remembering dead items in memory, then we perform the index vacuum, and then we perform the second pass of the heap vacuum, under which we mark items unused. In this patch, we make these multiple vacuum passes independent operations. The idea is that we provide multiple vacuum options, under which the user can perform the independent operations: "VACUUM (heap_hot_prune) tbl_name" for performing just the HOT prune (first vacuum) pass, and "VACUUM (heap_vacuum) tbl_name" for the second heap pass, which sets dead items unused once index vacuuming is done. Additionally, we now allow users to perform just the index vacuum, i.e. "VACUUM idx_name". So under the heap_hot_prune pass, we will generate the dead TIDs, and instead of directly performing the index vacuum we will flush those dead TIDs into the conveyor belt using the Deadtidstore interfaces. Then in the index pass, we will read the data from the conveyor belt and perform the index vacuum, and at last, in the heap_vacuum pass, we will read the data from the conveyor belt and mark all dead items unused. However, in the second pass, we can only mark those items unused which are dead and for which all the indexes of the table have already been vacuumed. To identify that, we store in the pg_class entry for each index the conveyor belt pageno up to which we have already done the index vacuum, and in the pg_class entry for the table the pageno up to which we have already done the heap_vacuum pass. Additionally, while doing the hot_prune pass, if we find an item that is already dead and for which the index vacuum is already done, we directly set it unused; for this, we use the Deadtidstore interfaces. Deadtidstore provides interfaces over the conveyor belt for storing and retrieving dead TIDs. This module maintains a DeadTidState, which keeps track of the current insertion progress, i.e. the first and the last conveyor belt page for the current vacuum run. On completion of the vacuum run, it takes care of setting the complete vacuum run bound, by storing the last conveyor belt pageno of the current vacuum run into the special space of the first conveyor belt page for this run. This module also provides the infrastructure to avoid adding duplicate TIDs to the conveyor belt. Basically, if we perform the first vacuum pass multiple times without executing the second vacuum pass, it is possible that we encounter the same dead TIDs in the conveyor belt again. So this module maintains a cache over the conveyor belt, such that it only loads the data relevant to the block the vacuum is currently processing, and we don't need to maintain a huge cache. Test example:

CREATE TABLE t (a int);
CREATE INDEX idx on t(a);
INSERT INTO t VALUES (generate_series(1,1000000));
DELETE FROM t where a > 300;
VACUUM (heap_hot_prune) t;
VACUUM idx;
VACUUM (heap_vacuum) t;

TODO:
- This is just a POC patch to discuss the design idea and needs a lot of improvement and testing.
- We are using a slightly different format for storing the dead TIDs in the conveyor belt, which is explained in the patch, but the traditional multi-pass vacuum is still using the old format (array of ItemPointerData), so we need to unify the two formats.
- Performance testing.
- Cleaner interfaces so that we can easily be integrated with autovacuum; currently, this is not provided for the manual vacuum.
- Add test cases.

Patches can be applied on top of the latest conveyor belt patches[1]

[1] https://www.postgresql.org/message-id/CAFiTN-sQUddO9JPiH3tz%2BvbNqRqi_pgndecy8k2yXAnO3ymqZA%40mail.gmail.com

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
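For readers trying to picture the Deadtidstore layer described above, its interface shape might look something like the declarations-only sketch below. These names and signatures are illustrative guesses, not the actual functions in the patch; the git branch referenced later in the thread has the real thing.

/* Hypothetical shape of a Deadtidstore-style API; not the patch's actual API. */
typedef struct DeadTidState DeadTidState;   /* tracks one vacuum run */

/* Begin/end one "first pass" insertion run over the conveyor belt. */
extern DeadTidState *DeadTidStoreBegin(Relation rel);
extern void DeadTidStoreEnd(DeadTidState *dts);     /* seals the run's bound */

/* Append the dead TIDs found while pruning one heap page. */
extern void DeadTidStoreAdd(DeadTidState *dts, BlockNumber blkno,
                            OffsetNumber *offsets, int noffsets);

/* Read TIDs from an index's last-vacuumed conveyor-belt page to the end. */
extern int DeadTidStoreRead(Relation rel, uint64 from_pageno,
                            ItemPointerData *dest, int maxtids);

/* Has this TID already been recorded?  Used by a later hot_prune pass. */
extern bool DeadTidStoreIsMember(DeadTidState *dts, ItemPointer tid);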
On Wed, Jan 26, 2022 at 8:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > TODO: > - This is just a POC patch to discuss the design idea and needs a lot > of improvement and testing. > - We are using a slightly different format for storing the dead tids > into the conveyor belt which is explained in the patch but the > traditional multi-pass vacuum is still using the same format (array of > ItemPointeData), so we need to unify that format. > - Performance testing. > - Cleaner interfaces so that we can easily be integrated with auto > vacuum, currently, this is not provided for the manual vacuum. > - Add test cases. I think this is a pretty interesting piece of work. I appreciate the effort you've obviously put into the comments, although I do think some of them are going to need some additional clarification. But I think the bigger questions here at the moment are things like (1) Is this the right idea? and if not (2) How could we change it to make it better? and (3) Is there any way that we can make it simpler? It was the last of these questions that prompted me to post http://postgr.es/m/CA+TgmoY18RzQqDm2jE2WDkiA8ngTEDHp7uLtHb3a-ABs+wbY_g@mail.gmail.com because, if that thought were to work out, then we could have more things in common between the conveyor-belt and non-conveyor-belt cases, and we might be able to start with some preliminary work to jigger more things in to the second phase, and then look to integrate the conveyor belt stuff separately. I think what we ought to do at this point is try to figure out some tests that might show how well this approach actually works in practice. Now one motivation for this work was the desire to someday have global indexes, but those don't exist yet, so it makes sense to consider other scenarios in which the patch might (or might not) be beneficial. And it seems to me that we should be looking for a scenario where we have multiple indexes with different vacuuming needs. How could that happen? Well, the first thing that occurred to me was a table with a partial index. If we have a column t whose values are randomly distributed between 1 and 10, and a partial index on some other column WHERE t = 1, then the partial index should only accumulate dead tuples about 10% as fast as a non-partial index on the same column. On the other hand, the partial index also has a much smaller number of total rows, so after a fixed number of updates, the partial index should have the same *percentage* of dead tuples as the non-partial index even though the absolute number is smaller. So maybe that's not a great idea. My second thought was that perhaps we can create a test scenario where, in one index, the deduplication and bottom-up index deletion and kill_prior_tuple mechanisms are very effective, and in another index, it's not effective at all. For example, maybe index A is an index on the primary key, and index B is a non-unique index on some column that we're updating with ever-increasing values (so that we never put new tuples into a page that could be productively cleaned up). I think what should happen in this case is that A should not grow in size even if it's never vacuumed, while B will require vacuuming to keep the size down. If this example isn't exactly right, maybe we can construct one where that does happen. Then we could try to demonstrate that with this patch we can do less vacuuming work and still keep up than what would be possible without the patch. 
We'll either be able to show that this is true, or we will see that it's false, or we won't be able to really see much difference. Any of those would be interesting findings. One thing we could try doing in order to make that easier would be: tweak things so that when autovacuum vacuums the table, it only vacuums the indexes if they meet some threshold for bloat. I'm not sure exactly what happens with the heap vacuuming then - do we do phases 1 and 2 always, or a combined heap pass, or what? But if we pick some criteria that vacuums indexes sometimes and not other times, we can probably start doing some meaningful measurement of whether this patch is making bloat better or worse, and whether it's using fewer or more resources to do it. Do you have a git branch for this work? -- Robert Haas EDB: http://www.enterprisedb.com
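One way to picture the "only vacuum the indexes that meet some threshold for bloat" experiment is a per-index filter around the index-vacuuming loop. This is a hypothetical sketch only -- bloat_threshold_met() and lazy_vacuum_one_index_from_conveyor() are invented names, not code from the conveyor-belt branch:

/* Hypothetical sketch: vacuum only the indexes that look bloated enough. */
static void
vacuum_indexes_selectively(LVRelState *vacrel, double bloat_threshold)
{
    for (int idx = 0; idx < vacrel->nindexes; idx++)
    {
        Relation    indrel = vacrel->indrels[idx];

        /*
         * bloat_threshold_met() stands in for whatever criterion we settle
         * on (index growth, dead-tuple estimates, etc.).  A skipped index
         * simply keeps its old conveyor-belt position, so its TIDs remain
         * remembered and can still be removed in a later pass.
         */
        if (!bloat_threshold_met(indrel, bloat_threshold))
            continue;

        /* Remove all conveyor-belt TIDs that this index hasn't seen yet. */
        lazy_vacuum_one_index_from_conveyor(vacrel, indrel);
    }
}

The interesting measurement is then exactly what is described above: whether total bloat and total resource usage come out better or worse than vacuuming every index every time.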
On Fri, Feb 4, 2022 at 1:15 PM Robert Haas <robertmhaas@gmail.com> wrote: > My second thought was that perhaps we can create a test scenario > where, in one index, the deduplication and bottom-up index deletion > and kill_prior_tuple mechanisms are very effective, and in another > index, it's not effective at all. For example, maybe index A is an > index on the primary key, and index B is a non-unique index on some > column that we're updating with ever-increasing values (so that we > never put new tuples into a page that could be productively cleaned > up). I think what should happen in this case is that A should not grow > in size even if it's never vacuumed, while B will require vacuuming to > keep the size down. That should work. All you need is a table with several indexes, and a workload consisting of updates that modify a column that is the key column for only one of the indexes. I would expect bottom-up index deletion to be 100% effective for the not-logically-modified indexes, in the sense that there will be no page splits -- provided there are no long held snapshots, and provided that the index isn't very small. If it is small (think of something like the pgbench_branches pkey), then even the occasional ANALYZE will act as a "long held snapshot" relative to the size of the index. And so then you might get one page split per original leaf page, but probably not a second, and very very probably not a third. The constantly modified index will be entirely dependent on index vacuuming here, and so an improved VACUUM design that allows that particular index to be vacuumed more frequently could really improve performance. BTW, it's a good idea to avoid unique indexes in test cases where there is an index that you don't want to set LP_DEAD bits for, since _bt_check_unique() tends to do a good job of setting LP_DEAD bits, independent of the kill_prior_tuple thing. You can avoid using kill_prior_tuple by forcing bitmap scans, of course. -- Peter Geoghegan
On Fri, Feb 4, 2022 at 1:46 PM Peter Geoghegan <pg@bowt.ie> wrote: > That should work. All you need is a table with several indexes, and a > workload consisting of updates that modify a column that is the key > column for only one of the indexes. I would expect bottom-up index > deletion to be 100% effective for the not-logically-modified indexes, > in the sense that there will be no page splits -- provided there are > no long held snapshots, and provided that the index isn't very small. > If it is small (think of something like the pgbench_branches pkey), > then even the occasional ANALYZE will act as a "long held snapshot" > relative to the size of the index. And so then you might get one page > split per original leaf page, but probably not a second, and very very > probably not a third. > > The constantly modified index will be entirely dependent on index > vacuuming here, and so an improved VACUUM design that allows that > particular index to be vacuumed more frequently could really improve > performance. Thanks for checking my work here - I wasn't 100% sure I had the right idea. > BTW, it's a good idea to avoid unique indexes in test cases where > there is an index that you don't want to set LP_DEAD bits for, since > _bt_check_unique() tends to do a good job of setting LP_DEAD bits, > independent of the kill_prior_tuple thing. You can avoid using > kill_prior_tuple by forcing bitmap scans, of course. Thanks for this tip, too. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Feb 4, 2022 at 1:54 PM Robert Haas <robertmhaas@gmail.com> wrote: > > The constantly modified index will be entirely dependent on index > > vacuuming here, and so an improved VACUUM design that allows that > > particular index to be vacuumed more frequently could really improve > > performance. > > Thanks for checking my work here - I wasn't 100% sure I had the right idea. I should perhaps have emphasized individual leaf pages, rather than total index size. Presumably we only need to store so many extra versions per logical row at any one time, and we have a fair amount of free space for extra versions on leaf pages. Typically 10%-30% of the space on the page (at least when it isn't already inevitable that the page will eventually split due to simple inserts). A traditional guarantee with B-Trees is that we get `ln(2)` space utilization with random insertions, which leaves just over 30% of the page free for later updates -- that's where I got 30% from. There is a complementary effect with deduplication, since that buys us time before the page has to split, making it much more likely that the split will be avoided entirely. It's very nonlinear. As I said, the competition between older snapshots and garbage collection can still lead to version-driven page splits (especially when non-hot updates are concentrated in one part of the key space, or one leaf page). But that's arguably a good thing -- it naturally relieves contention. There are actually designs that artificially split B-Tree pages early [1], detecting concurrency control related contention. Other systems need concurrency control in indexes, which we avoid by having versions live in indexes. [1] http://cidrdb.org/cidr2021/papers/cidr2021_paper21.pdf -- Peter Geoghegan
On Fri, Feb 4, 2022 at 11:45 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Jan 26, 2022 at 8:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > TODO: > > - This is just a POC patch to discuss the design idea and needs a lot > > of improvement and testing. > > - We are using a slightly different format for storing the dead tids > > into the conveyor belt which is explained in the patch but the > > traditional multi-pass vacuum is still using the same format (array of > > ItemPointeData), so we need to unify that format. > > - Performance testing. > > - Cleaner interfaces so that we can easily be integrated with auto > > vacuum, currently, this is not provided for the manual vacuum. > > - Add test cases. > > I think this is a pretty interesting piece of work. I appreciate the > effort you've obviously put into the comments, although I do think > some of them are going to need some additional clarification. But I > think the bigger questions here at the moment are things like (1) Is > this the right idea? and if not (2) How could we change it to make it > better? and (3) Is there any way that we can make it simpler? It was > the last of these questions that prompted me to post > http://postgr.es/m/CA+TgmoY18RzQqDm2jE2WDkiA8ngTEDHp7uLtHb3a-ABs+wbY_g@mail.gmail.com > because, if that thought were to work out, then we could have more > things in common between the conveyor-belt and non-conveyor-belt > cases, and we might be able to start with some preliminary work to > jigger more things in to the second phase, and then look to integrate > the conveyor belt stuff separately. I agree that if we can do something like that then integrating the conveyor belt will be much cleaner. > My second thought was that perhaps we can create a test scenario > where, in one index, the deduplication and bottom-up index deletion > and kill_prior_tuple mechanisms are very effective, and in another > index, it's not effective at all. For example, maybe index A is an > index on the primary key, and index B is a non-unique index on some > column that we're updating with ever-increasing values (so that we > never put new tuples into a page that could be productively cleaned > up). I think what should happen in this case is that A should not grow > in size even if it's never vacuumed, while B will require vacuuming to > keep the size down. If this example isn't exactly right, maybe we can > construct one where that does happen. Then we could try to demonstrate > that with this patch we can do less vacuuming work and still keep up > than what would be possible without the patch. We'll either be able to > show that this is true, or we will see that it's false, or we won't be > able to really see much difference. Any of those would be interesting > findings. +1 > One thing we could try doing in order to make that easier would be: > tweak things so that when autovacuum vacuums the table, it only > vacuums the indexes if they meet some threshold for bloat. I'm not > sure exactly what happens with the heap vacuuming then - do we do > phases 1 and 2 always, or a combined heap pass, or what? But if we > pick some criteria that vacuums indexes sometimes and not other times, > we can probably start doing some meaningful measurement of whether > this patch is making bloat better or worse, and whether it's using > fewer or more resources to do it. 
I think we can always trigger phase 1 and 2 and phase 2 will only vacuum conditionally based on if all the indexes are vacuumed for some conveyor belt pages so we don't have risk of scanning without marking anything unused. And we can try to measure with other approaches as well where we completely avoid phase 2 and it will be done only along with phase 1 whenever applicable. > Do you have a git branch for this work? Yeah, my repository: https://github.com/dilipbalaut11/conveyor_test branch: DecouplingIndexAndHeapVacuumUsingCB -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sun, Feb 6, 2022 at 11:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > One thing we could try doing in order to make that easier would be: > > tweak things so that when autovacuum vacuums the table, it only > > vacuums the indexes if they meet some threshold for bloat. I'm not > > sure exactly what happens with the heap vacuuming then - do we do > > phases 1 and 2 always, or a combined heap pass, or what? But if we > > pick some criteria that vacuums indexes sometimes and not other times, > > we can probably start doing some meaningful measurement of whether > > this patch is making bloat better or worse, and whether it's using > > fewer or more resources to do it. > > I think we can always trigger phase 1 and 2 and phase 2 will only > vacuum conditionally based on if all the indexes are vacuumed for some > conveyor belt pages so we don't have risk of scanning without marking > anything unused. Not sure what you mean about a risk of scanning without marking any LP_DEAD items as LP_UNUSED. If VACUUM always does some amount of this, then it follows that the new mechanism added by the patch just can't safely avoid any work at all, making it all pointless. We have to expect heap vacuuming to take place much less often with the patch. Simply because that's what the invariant described in comments above lazy_scan_heap() requires. Note that this is not the same thing as saying that we do less *absolute* heap vacuuming with the conveyor belt -- my statement about less heap vacuuming taking place is *only* true relative to the amount of other work that happens in any individual "shortened" VACUUM operation. We could do exactly the same total amount of heap vacuuming as before (in a version of Postgres without the conveyor belt but with the same settings), but much *more* index vacuuming (at least for one or two problematic indexes). > And we can try to measure with other approaches as > well where we completely avoid phase 2 and it will be done only along > with phase 1 whenever applicable. I believe that the main benefit of the dead TID conveyor belt (outside of global index use cases) will be to enable us to do more (much more) index vacuuming for one index in particular. So it's not really about doing less index vacuuming or less heap vacuuming -- it's about doing a *greater* amount of *useful* index vacuuming, in less time. There is often some way in which failing to vacuum one index for a long time does lasting damage to the index structure. -- Peter Geoghegan
On Tue, Feb 8, 2022 at 12:12 PM Peter Geoghegan <pg@bowt.ie> wrote: > I believe that the main benefit of the dead TID conveyor belt (outside > of global index use cases) will be to enable us to do more (much more) > index vacuuming for one index in particular. So it's not really about > doing less index vacuuming or less heap vacuuming -- it's about doing > a *greater* amount of *useful* index vacuuming, in less time. There is > often some way in which failing to vacuum one index for a long time > does lasting damage to the index structure. This makes sense to me, and I think it's a good insight. It's not clear to me that we have enough information to make good decisions about which indexes to vacuum and which indexes to skip. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Feb 8, 2022 at 9:33 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Feb 8, 2022 at 12:12 PM Peter Geoghegan <pg@bowt.ie> wrote: > > I believe that the main benefit of the dead TID conveyor belt (outside > > of global index use cases) will be to enable us to do more (much more) > > index vacuuming for one index in particular. So it's not really about > > doing less index vacuuming or less heap vacuuming -- it's about doing > > a *greater* amount of *useful* index vacuuming, in less time. There is > > often some way in which failing to vacuum one index for a long time > > does lasting damage to the index structure. > > This makes sense to me, and I think it's a good insight. > > It's not clear to me that we have enough information to make good > decisions about which indexes to vacuum and which indexes to skip. What if "extra vacuuming, not skipping vacuuming" was not just an abstract goal, but an actual first-class part of the implementation, and the index AM API? Then the question we're asking the index/index AM is no longer "Do you [an index] *not* require index vacuuming, even though you are entitled to it according to the conventional rules of autovacuum scheduling?". The question is instead more like "Could you use an extra, early VACUUM?". If we invert the question like this, then we have something that makes more sense at the index AM level, but requires significant improvements at the level of autovacuum scheduling. On the other hand, I think that you already need to do at least some work in that area. -- Peter Geoghegan
On Tue, Feb 8, 2022 at 12:50 PM Peter Geoghegan <pg@bowt.ie> wrote: > > It's not clear to me that we have enough information to make good > > decisions about which indexes to vacuum and which indexes to skip. > > What if "extra vacuuming, not skipping vacuuming" was not just an > abstract goal, but an actual first-class part of the implementation, > and the index AM API? Then the question we're asking the index/index > AM is no longer "Do you [an index] *not* require index vacuuming, even > though you are entitled to it according to the conventional rules of > autovacuum scheduling?". The question is instead more like "Could you > use an extra, early VACUUM?". > > if we invert the question like this then we have something that makes > more sense at the index AM level, but requires significant > improvements at the level of autovacuum scheduling. On the other hand > I think that you already need to do at least some work in that area. Right, that's why I asked the question. If we're going to ask the index AM whether it would like to be vacuumed right now, we're going to have to put some logic into the index AM that knows how to answer that question. But if we don't have any useful statistics that would let us answer the question correctly, then we have problems. While I basically agree with everything that you just wrote, I'm somewhat inclined to think that the question is not best phrased as either extra-vacuum or skip-a-vacuum. Either of those supposes a normative amount of vacuuming from which we could deviate in one direction or the other. I think it would be better to phrase it in a way that doesn't make such a supposition. Maybe something like: "Hi, we are vacuuming the heap right now and we are also going to vacuum any indexes that would like it, and does that include you?" The point is that it's a continuum. If we decide that we're asking the index "do you want extra vacuuming?" then that phrasing suggests that you should only say yes if you really need it. If we decide we're asking the index "can we skip vacuuming you this time?" then the phrasing suggests that you should not feel bad about insisting on a vacuum right now, and only surrender your claim if you're sure you don't need it. But in reality, no bias either way is warranted. It is either better that this index should be vacuumed right now, or better that it should not be vacuumed right now, and whichever is better should be what we choose to do. To expand on that just a bit, if I'm a btree index and someone asks me "can we skip vacuuming you this time?" I might say "return dead_tups < tiny_amount" and if they ask me "do you want extra vacuuming" I might say "return dead_tups > quite_large_amount". But if they ask me "should we vacuum you now?" then I might say "return dead_tups > moderate_amount" which feels like the correct thing here. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Feb 8, 2022 at 10:58 AM Robert Haas <robertmhaas@gmail.com> wrote: > Right, that's why I asked the question. If we're going to ask the > index AM whether it would like to be vacuumed right now, we're going > to have to put some logic into the index AM that knows how to answer > that question. But if we don't have any useful statistics that would > let us answer the question correctly, then we have problems. I have very little faith in the use of statistical sampling for anything involving vacuuming. In fact, I think that the current way in which ANALYZE counts dead tuples is a misapplication of statistics. It isn't even wrong. One of the things that I really like about this project is that it can plausibly solve that problem by splitting up the work of VACUUM, at low cost -- it's less top-down. Not only do you get the obvious benefits of preventing bloat; you also get *continual* feedback about the actual physical reality in the table (and indexes, to a lesser extent). As I said recently, right now the more bloat we have, the more uncertainty there is about the total amount of bloat. We need to control both the bloat, and the uncertainty about the bloat. The basic high level idea behind how the optimizer uses statistics involves the assumption that *all* the rows in the table are *themselves* a sample taken from some larger distribution -- something from the real physical world (meeting this assumption is one reason why database/schema normalization really matters). And so on a good week it probably won't matter too much to the optimizer if ANALYZE doesn't run until the table size doubles (for a table that was already quite large). These are pretty delicate assumptions that (from the point of view of the optimizer) work out surprisingly well in practice. Bloat just isn't like that. Dead tuples are fundamentally cyclic and dynamic in nature -- conventional statistics just won't work with something like that. Worst of all, the process that counts dead tuples (ANALYZE) is really an active participant in the system -- the entire purpose of even looking is to *reduce* the number of dead tuples by making an autovacuum run. That's deeply weird. > The point is that it's a continuum. If we decide that we're asking the > index "do you want extra vacuuming?" then that phrasing suggests that > you should only say yes if you really need it. If we decide we're > asking the index "can we skip vacuuming you this time?" then the > phrasing suggests that you should not feel bad about insisting on a > vacuum right now, and only surrender your claim if you're sure you > don't need it. But in reality, no bias either way is warranted. Actually, I think that this particular bias *is* warranted. We should openly and plainly be biased in the direction of causing the least harm. What's wrong with that? Having accurate information is not an intrinsic good. I even think that having more information can be strictly worse, because you might actually believe it. Variance matters a lot -- the bias/variance tradeoff is pretty fundamental here. I'm also saying some of this stuff because of broader VACUUM design considerations. VACUUM fundamentally has to work at the table level, and I don't see that changing. The approach of making autovacuum do something akin to a plain VACUUM command in the simplest cases, and only later some extra "dynamic mini vacuums" (that pick up where the VACUUM command style VACUUM left off) has a lot to recommend it.
This approach allows most of the current autovacuum settings to continue to work in roughly the same way. They just need to have their documentation updated to make it clear that they're about the worst case. > To expand on that just a bit, if I'm a btree index and someone asks me > "can we skip vacuuming you this time?" I might say "return dead_tups < > tiny_amount" and if they ask me "do you want extra vacuuming" I might > say "return dead_tups > quite_large_amount". But if they ask me > "should we vacuum you now?" then I might say "return dead_tups > > moderate_amount" which feels like the correct thing here. The btree side of this shouldn't care at all about dead tuples (in general we focus way too much on dead tuples, and way too little on pages). With bottom-up index deletion the number of dead tuples in the index is just about completely irrelevant. It's entirely possible and often even likely that 20%+ of all index tuples will be dead at any one time, when the optimization perfectly preserves the index structure. The btree side of the index AM API should be focussing on the growth in index size, relative to some expectation (like maybe the growth of whichever index on the same table grew the least since last time, accounting for obvious special cases like partial indexes). Perhaps we'd give some consideration to bulk deletes, too. Overall, it should be pretty simple, and should sometimes force us to do one of these "dynamic mini vacuums" of the index just because we're not quite sure what to do. There is nothing wrong with admitting the uncertainty. -- Peter Geoghegan
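A rough standalone C sketch of the "growth in index size, relative to some expectation" idea above, using the least-grown index on the same table as the expectation. All names and the 10% fudge factor are invented for illustration; this is not the actual index AM API.

#include <stdbool.h>
#include <stdint.h>

typedef struct IndexGrowth
{
    int64_t pages_at_last_vacuum;
    int64_t pages_now;
} IndexGrowth;

static double
growth_ratio(const IndexGrowth *g)
{
    if (g->pages_at_last_vacuum <= 0)
        return 1.0;
    return (double) g->pages_now / (double) g->pages_at_last_vacuum;
}

/*
 * The least-grown index on the table stands in for the "expected"
 * growth; an index that has grown noticeably more than that is a
 * candidate for an early vacuum.  The 1.10 factor is arbitrary.
 */
static bool
could_use_early_vacuum(const IndexGrowth *idx,
                       const IndexGrowth *least_grown_on_table)
{
    return growth_ratio(idx) > growth_ratio(least_grown_on_table) * 1.10;
}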
On Tue, Feb 8, 2022 at 10:42 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Sun, Feb 6, 2022 at 11:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > One thing we could try doing in order to make that easier would be: > > > tweak things so that when autovacuum vacuums the table, it only > > > vacuums the indexes if they meet some threshold for bloat. I'm not > > > sure exactly what happens with the heap vacuuming then - do we do > > > phases 1 and 2 always, or a combined heap pass, or what? But if we > > > pick some criteria that vacuums indexes sometimes and not other times, > > > we can probably start doing some meaningful measurement of whether > > > this patch is making bloat better or worse, and whether it's using > > > fewer or more resources to do it. > > > > I think we can always trigger phase 1 and 2 and phase 2 will only > > vacuum conditionally based on if all the indexes are vacuumed for some > > conveyor belt pages so we don't have risk of scanning without marking > > anything unused. > > Not sure what you mean about a risk of scanning without marking any > LP_DEAD items as LP_UNUSED. I mean that, for testing purposes, we could integrate with autovacuum such that 1) we always do the first pass of the vacuum, 2) index vacuuming is done only for the indexes which have bloated more than some threshold, and then 3) we always trigger the second heap pass. So my point was that even if autovacuum triggers the second vacuum pass every time, it will not do anything for dead items whose index vacuuming is not yet done. > If VACUUM always does some amount of this, > then it follows that the new mechanism added by the patch just can't > safely avoid any work at all, making it all pointless. We have to > expect heap vacuuming to take place much less often with the patch. > Simply because that's what the invariant described in comments above > lazy_scan_heap() requires. In the second pass we make sure that we don't set any LP_DEAD item to LP_UNUSED for which index vacuuming is not yet done. Basically, we are storing the dead items in the conveyor belt, and whenever we do an index pass we remember up to which conveyor belt page the index vacuuming is done. And before starting the second heap pass we will find the minimum conveyor belt page up to which all of the indexes have been vacuumed. > Note that this is not the same thing as saying that we do less > *absolute* heap vacuuming with the conveyor belt -- my statement about > less heap vacuuming taking place is *only* true relative to the amount > of other work that happens in any individual "shortened" VACUUM > operation. We could do exactly the same total amount of heap vacuuming > as before (in a version of Postgres without the conveyor belt but with > the same settings), but much *more* index vacuuming (at least for one > or two problematic indexes). > > > And we can try to measure with other approaches as > > well where we completely avoid phase 2 and it will be done only along > > with phase 1 whenever applicable. > > I believe that the main benefit of the dead TID conveyor belt (outside > of global index use cases) will be to enable us to do more (much more) > index vacuuming for one index in particular. So it's not really about > doing less index vacuuming or less heap vacuuming -- it's about doing > a *greater* amount of *useful* index vacuuming, in less time. There is > often some way in which failing to vacuum one index for a long time > does lasting damage to the index structure. I agree with the point.
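A minimal C sketch of that rule, with invented names and types (the real conveyor belt code will of course look different). The point is just that the second heap pass is limited by the least-advanced index:

#include <stdint.h>

typedef uint64_t ConveyorPos;   /* position in the dead-TID conveyor belt */

/*
 * The second heap pass may only set LP_DEAD items LP_UNUSED up to the
 * oldest conveyor-belt position that *every* index has already been
 * vacuumed past; one never-vacuumed index holds everything back.
 */
static ConveyorPos
second_pass_limit(const ConveyorPos *vacuumed_upto, int nindexes)
{
    ConveyorPos limit = UINT64_MAX;     /* no indexes: reclaim everything */

    for (int i = 0; i < nindexes; i++)
    {
        if (vacuumed_upto[i] < limit)
            limit = vacuumed_upto[i];
    }
    return limit;
}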
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 9, 2022 at 1:21 AM Peter Geoghegan <pg@bowt.ie> wrote: > > The btree side of this shouldn't care at all about dead tuples (in > general we focus way too much on dead tuples, and way too little on > pages). With bottom-up index deletion the number of dead tuples in the > index is just about completely irrelevant. It's entirely possible and > often even likely that 20%+ of all index tuples will be dead at any > one time, when the optimization perfectly preserves the index > structure. > > The btree side of the index AM API should be focussing on the growth > in index size, relative to some expectation (like maybe the growth of > whichever index on the same table grew the least since last time, > accounting for obvious special cases like partial indexes). Perhaps > we'd give some consideration to bulk deletes, too. Overall, it should > be pretty simple, and should sometimes force us to do one of these > "dynamic mini vacuums" of the index just because we're not quite sure > what to do. There is nothing wrong with admitting the uncertainty. I agree with the point that we should be focusing more on index size growth compared to dead tuples. But I don't think that we can completely ignore the number of dead tuples. Although we have bottom-up index deletion, whether the index structure will be preserved or not will depend upon what keys we are inserting next. So, for example, if there are 80% dead tuples but so far the index size is fine, can we avoid vacuuming? If we avoid vacuuming then it is very much possible that in some cases we will create huge bloat, e.g. if we are inserting some keys which cannot take advantage of bottom-up deletion. So IMHO the decision should be a combination of index size bloat and % of dead tuples. Maybe we can add more weight to the size bloat and less weight to the % of dead tuples, but we should not completely ignore it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
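A rough sketch of the weighted-combination idea above. The weights and the trigger threshold are invented, purely to illustrate giving size growth more weight than the dead-tuple fraction without ignoring the latter:

#include <stdbool.h>

static bool
index_needs_vacuum(double size_growth_fraction,   /* 0.25 = grew by 25% */
                   double dead_tuple_fraction)    /* 0.80 = 80% dead */
{
    const double size_weight = 2.0;     /* more weight on size bloat */
    const double dead_weight = 1.0;     /* less weight on dead tuples */

    /*
     * With these (made-up) numbers, 50% growth alone triggers a vacuum,
     * and a very high dead-tuple fraction (>80%) can still trigger one
     * even when the index has not grown at all.
     */
    return (size_weight * size_growth_fraction +
            dead_weight * dead_tuple_fraction) > 0.8;
}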
On Wed, Feb 9, 2022 at 1:18 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I agree with the point that we should be focusing more on index size > growth compared to dead tuples. But I don't think that we can > completely ignore the number of dead tuples. Although we have > bottom-up index deletion, whether the index structure will be > preserved or not will depend upon what keys we are inserting next. So, > for example, if there are 80% dead tuples but so far the index size is > fine, can we avoid vacuuming? If we avoid vacuuming then it is very much > possible that in some cases we will create huge bloat, e.g. if we are > inserting some keys which cannot take advantage of bottom-up > deletion. So IMHO the decision should be a combination of index size > bloat and % of dead tuples. Maybe we can add more weight to the size > bloat and less weight to the % of dead tuples, but we should not completely > ignore it. I think that dead index tuples really don't matter if they're going to get removed anyway before a page split happens. In particular, if we're going to do a bottom-up index deletion pass before splitting the page, then who cares if there are going to be dead tuples around until then? You might think that they'd have the unfortunate effect of slowing down scans, and they could slow down ONE scan, but if they do, then I think kill_prior_tuple will hint them dead and they won't matter any more. Now, if we have a page that is going to split, because it's going to receive inserts but neither kill_prior_tuple nor bottom-up index deletion are going to keep us out of trouble, then the dead tuples matter. And if we have a page where all the tuples are dead and no further inserts are ever going to happen, those dead tuples also matter, because getting rid of them would let us recycle the page. Just to be clear, when I say that the dead index tuples don't matter here, I mean from the point of view of the index. From the point of view of the table, the presence of dead index tuples (or even the potential presence of dead tuples) pointing to dead line pointers is an issue that can drive heap bloat. But from the point of view of the index, because we don't ever merge sibling index pages, and because we have kill_prior_tuple, there's not much value in freeing up space in index pages unless it either prevents a split or lets us free the whole page. So I agree with Peter that index growth is what really matters. However, I have a concern that Peter's idea to use the previous index growth to drive future index vacuuming decisions is retrospective rather than prospective. If the index is growing more than it should based on the data volume, then evidently we didn't do enough vacuuming at some point in the past. It's reasonable to step up our efforts in the present to make sure that the problem doesn't continue, but in some sense it's already too late. What we would really like is a measure that answers the question: is the index going to bloat in the relatively near future if we don't vacuum it now? I think that the dead tuple count is trying, however imperfectly, to figure that out. All other things being equal, the more dead tuples there are in the index, the more bloat we're going to have later if we don't clean them out now. The problem is not with that core idea, which IMHO is actually good, but that all other things are NOT equal.
Peter has shown pretty convincingly that in some workloads, essentially 100% of dead tuples are going to get removed without causing a page split and the index growth will be 0, whereas in other workloads 0% of dead tuples are going to get removed without causing index growth. If you knew that you had the second case, then counting dead index tuples to decide when to vacuum would, in my opinion, be a very sensible thing to do. It would still not be perfect, because dead tuples in pages that are going to get split are a lot worse than dead tuples in pages that aren't going to be split, but it doesn't seem meaningless. However, if all of the dead index tuples are going to get removed in a timely fashion anyway, then it's as useful as a stopped clock: it will be right whenever it says the index doesn't need to be vacuumed, and wrong when it says anything else. In a certain sense, bottom-up index deletion may have exacerbated the problems in this area. The more ways we add to remove dead tuples from indexes without vacuum, the less useful dead tuples will become as a predictor of index growth. Maybe #-of-dead-tuples and future-index-growth weren't that tightly coupled even before bottom-up index deletion, but it must be worse now. I'm not hung up on using the # of dead tuples specifically as the metric for index vacuuming, and it may be best to pick some other measure. But I am a little suspicious that if the only measure is past index growth, we will let some situations go too far before we wake up and do something about them. My intuition is that it would be a good idea to come up with something we could measure, even if it's imperfect, that would give us some clue that trouble is brewing before pages actually start splitting. Now maybe my intuition is wrong and there is nothing better, but I think it's worth a thought. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Feb 9, 2022 at 6:13 AM Robert Haas <robertmhaas@gmail.com> wrote: > Just to be clear, when I say that the dead index tuples don't matter > here, I mean from the point of view of the index. From the point of > view of the table, the presence of dead index tuples (or even the > potential presence of dead tuples) pointing to dead line pointers is > an issue that can drive heap bloat. But from the point of view of the > index, because we don't ever merge sibling index pages, and because we > have kill_prior_tuple, there's not much value in freeing up space in > index pages unless it either prevents a split or lets us free the > whole page. So I agree with Peter that index growth is what really > matters. One small caveat that I'd add is this: heap fragmentation likely makes it harder to avoid page splits in indexes, to a degree. It is arguably one cause of the page splits that do happen in a table like pgbench_tellers, with standard pgbench (and lots of throughput, lots of clients). The tellers (and also branches) primary key tends to double in size in my recent tests (still way better than a 20x increase in size, which is what happened in Postgres 11 and maybe even 13). I think that it might be possible to perfectly preserve the original index size (even with ANALYZE running now and again) by setting heap fill factor very low, maybe 50 or less. This is a minor quibble, though. It still makes sense to think of heap fragmentation as a problem for the heap itself, and not for indexes at all, since the effect I describe is relatively insignificant, and just about impossible to model. The problem really is that the heap pages are failing to hold their original logical rows in place -- the index size issue is more of a symptom than a problem unto itself. > However, I have a concern that Peter's idea to use the previous index > growth to drive future index vacuuming decisions is retrospective > rather than prospective. If the index is growing more than it should > based on the data volume, then evidently we didn't do enough vacuuming > at some point in the past. That's a very valid concern. As you know, the great advantage of retrospectively considering what's not going well (and reducing everything to some highly informative measure like growth in index size) is that it's reliable -- you don't have to understand precisely how things got that way, which is just too complicated to get right. And as you point out, the great disadvantage is that it has already happened -- which might already be too late. More on that later... > It's reasonable to step up our efforts in > the present to make sure that the problem doesn't continue, but in > some sense it's already too late. What we would really like is a > measure that answers the question: is the index going to bloat in the > relatively near future if we don't vacuum it now? I think that the > dead tuple count is trying, however imperfectly, to figure that out. > All other things being equal, the more dead tuples there are in the > index, the more bloat we're going to have later if we don't clean them > out now. One of the key intuitions behind bottom-up index deletion is to treat the state of an index page as a dynamic thing, not a static thing. The information that we take from the page that drives our decisions is very reliable on average, over time, in the aggregate. At the same time, the information is very noisy, and could be wrong in important ways at just about any time.
The fundamental idea was to take advantage of the first property, without ever getting killed by the second property. To me this seems conceptually similar to how one manages risk when playing a game of chance while applying the Kelly criterion. The criterion provides probabilistic certainty on the best strategy to use in a situation where we have known favorable odds, an infinite series of bets, and a personal bankroll. I recently came across a great blog post about it, which gets the idea across well: https://explore.paulbutler.org/bet/ If you scroll down to the bottom of the page, there are some general conclusions, some of which are pretty counterintuitive. Especially "Maximizing expected value can lead to a bad long-run strategy". Growth over time is much more important than anything else, since you can play as many individual games as you like -- provided you never go bankrupt. I think that *a lot* of problems can be usefully analyzed that way, or at least benefit from a general awareness of certain paradoxical aspects of probability. Bankruptcy must be recognized as qualitatively different to any non-zero bankroll. A little bit like how splitting a leaf page unnecessarily is truly special. Once we simply avoid ruin, we can get the benefit of playing a game with favorable odds -- growth over time is what matters, not necessarily our current bankroll. It sounds like a fairly small difference, but that's deceptive -- it's a huge difference. > The problem is not with that core idea, which IMHO is actually good, > but that all other things are NOT equal. ...now to get back to talking about VACUUM itself, and to this project. I couldn't agree more -- all other things are NOT equal. We need a way to manage the risk that things could change quickly when an index that we believe has many dead tuples hasn't grown at all just yet. It's probably also true that we should try to predict the near future, and not 100% rely on the fact that what we've been doing seems to have worked so far -- I do accept that. We should probably dispense with the idea that we'll be making these decisions about what to do with an index like this (bloated in a way that bottom-up index deletion just won't help with) in an environment that is similar to the one in which the current "skip index scan when # heap pages with one or more LP_DEAD items < 2% of rel_pages" thing operates. That mechanism has to be very conservative because we just don't know when the next opportunity to vacuum indexes will be -- we almost have to assume that the decision will be static, and made exactly once, so we better be defensive. But why should that continue to be true with the conveyor belt stuff in place, and with periodic mini-vacuums that coordinate over time? I don't think it has to be like that. We can make it much more dynamic. I can imagine a two-way dialog between the index and vacuumlazy.c that takes place over time. The index AM might be able to report something along the lines of: "While I think that this index has more dead index tuples than it really should, the fact is that it hasn't grown at all, even by one single leaf page. And so you [vacuumlazy.c] should not make me vacuum the index right now. But be careful -- check back in again in another minute or two, because the situation must be assumed to be pretty volatile for now." This is just an example, not a concrete design. My point is that approaching the problem dynamically makes it *vastly* easier to do the right thing.
It's far easier to manage the consequences of being wrong than it is to be right all the time. We're going to be wrong anyway, so better to be wrong on our own terms. > I'm not hung up on using the # of dead tuples specifically as the > metric for index vacuuming, and it may be best to pick some other > measure. But I am a little suspicious that if the only measure is past > index growth, we will let some situations go too far before we wake up > and do something about them. My intuition is that it would be a good > idea to come up with something we could measure, even if it's > imperfect, that would give us some clue that trouble is brewing before > pages actually start splitting. Now maybe my intuition is wrong and > there is nothing better, but I think it's worth a thought. We will need something like that. I think that LP_DEAD items (or would-be LP_DEAD items -- tuples with storage that would get pruned into LP_DEAD items if we were to prune) in the table are much more interesting than dead heap-only tuples, and also more interesting than dead index tuples. Especially the distribution of such LP_DEAD items in the table, and their concentration. That does seem much more likely to be robust as a quantitative driver of index vacuuming. In the extreme case when there are a huge number of LP_DEAD items in the table, we're going to want to make them LP_UNUSED anyway, which implies that we'll do index vacuuming to make it safe. Since that's already going to be true, maybe we should try to find a way to usefully scale the behavior, so that some indexes are vacuumed sooner when the number of LP_DEAD items is increasing. Not really sure, but that seems more promising than anything else. -- Peter Geoghegan
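As a purely hypothetical sketch of what "usefully scale the behavior" might mean: the 2% figure below matches the existing skip-indexes threshold mentioned earlier in the thread, while the scaling rule itself and all the names are invented for illustration.

#include <stdbool.h>
#include <stdint.h>

/*
 * Below the existing 2%-of-rel_pages threshold, don't force index
 * vacuuming at all.  Above it, an index is vacuumed sooner the more
 * concentrated LP_DEAD items are in the heap; "pressure" is whatever
 * AM-specific measure of need the index reports (0.0 .. 1.0).
 */
static bool
index_vacuum_wanted(int64_t rel_pages,
                    int64_t pages_with_lpdead,
                    double pressure)
{
    double lpdead_frac;

    if (rel_pages <= 0)
        return false;
    lpdead_frac = (double) pages_with_lpdead / (double) rel_pages;

    if (lpdead_frac < 0.02)
        return false;

    /* The more of the heap needs line pointers reclaimed, the lower the
     * bar for vacuuming any particular index right now. */
    return pressure >= 1.0 - lpdead_frac;
}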
On Wed, Feb 9, 2022 at 2:27 PM Peter Geoghegan <pg@bowt.ie> wrote: > We should probably dispense with the idea that we'll be making these > decisions about what to do with an index like this (bloated in a way > that bottom-up index deletion just won't help with) in an environment > that is similar to the one in which the current "skip index scan when > # heap pages with one or more LP_DEAD items < 2% of rel_pages" thing > operates. That > mechanism has to be very conservative because we just don't know when > the next opportunity to vacuum indexes will be -- we almost have to > assume that the decision will be static, and made exactly once, so we > better be defensive. But why should that continue to be true with the > conveyor belt stuff in place, and with periodic mini-vacuums that > coordinate over time? I don't think it has to be like that. We can > make it much more dynamic. I'm not sure that we can. I mean, there's still only going to be ~3 autovacuum workers, and there could be arbitrarily many tables. Even if the vacuum load is within the bounds of what the system can sustain, individual tables can't be assured of being visited frequently (or so it seems to me) and it could be that there are actually not enough resources to vacuum everything, and we have to try to cope as best we can. Less unnecessary vacuuming of large indexes can help, of course, but I'm not sure it fundamentally changes the calculus. > We will need something like that. I think that LP_DEAD items (or > would-be LP_DEAD items -- tuples with storage that would get pruned > into LP_DEAD items if we were to prune) in the table are much more > interesting than dead heap-only tuples, and also more interesting than > dead index tuples. Especially the distribution of such LP_DEAD items > in the table, and their concentration. That does seem much more likely > to be robust as a quantitative driver of index vacuuming. Hmm... why would the answer have to do with dead items in the heap? I was thinking along the lines of trying to figure out either a more reliable count of dead tuples in the index, subtracting out whatever we save by kill_prior_tuple and bottom-up vacuuming; or else maybe a count of the subset of dead tuples that are likely not to get opportunistically pruned in one way or another, if there's some way to guess that. Or maybe something where when we see an index page filling up we try to figure out (or guess) that it's close to really needing a split - i.e. that it's not full of tuples that we could just junk to make space - and notice how often that's happening. I realize I'm hand-waving, but if the property is a property of the heap rather than the index, how will different indexes get different treatment? -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Feb 9, 2022 at 1:41 PM Robert Haas <robertmhaas@gmail.com> wrote: > I'm not sure that we can. I mean, there's still only going to be ~3 > autovacuum workers, and there could be arbitrarily many tables. Even > if the vacuum load is within the bounds of what the system can > sustain, individual tables can't be assured of being visited > frequently (or so it seems to me) and it could be that there are > actually not enough resources to vacuum and have to try to cope as > best we can. Less unnecessary vacuuming of large indexes can help, of > course, but I'm not sure it fundamentally changes the calculus. You seem to be vastly underestimating the value in being able to spread out and reschedule the work, and manage costs more generally. If you can multiplex autovacuum workers across tables, by splitting up work across a table's index over time, then it might not matter at all that you only have 3 workers. If you can spread out the work over time, then you make things much cheaper (fewer FPIs by aligning to checkpoint boundaries). And, because you have a schedule that can be dynamically updated, you get to update your global view of the world (not just one table) before you've fully committed to it -- if you provisionally say that you think that a certain index won't need to be vacuumed for a long time, that isn't the last word anymore. Costs are paid by the whole system, but benefits only go to individual tables and indexes. Being able to manage costs over time with a sense of the benefits, and a sense of high level priorities will be *huge* for us. Managing debt at the level of the entire system (not just one table or index) is also really important. (Though maybe we should just focus on the v1, just because that's what is needed right now.) > > We will need something like that. I think that LP_DEAD items (or > > would-be LP_DEAD items -- tuples with storage that would get pruned > > into LP_DEAD items if we were to prune) in the table are much more > > interesting than dead heap-only tuples, and also more interesting that > > dead index tuples. Especially the distribution of such LP_DEAD items > > in the table, and their concentration. That does seem much more likely > > to be robust as a quantitative driver of index vacuuming. > > Hmm... why would the answer have to do with dead items in the heap? We're eventually going to have to make the LP_DEAD items LP_UNUSED anyway here. So we might as well get started on that, with the index that we *also* think is the one that might need it the most, for its own reasons. We're making a decision on the basis of multiple factors, knowing that in the worst case (when the index really didn't need anything at all) we will have at least had the benefit of doing some actually-useful work sooner rather than later. We should probably consider multiple reasons to do any unit of work. > I was thinking along the lines of trying to figure out either a more > reliable count of dead tuples in the index, subtracting out whatever > we save by kill_prior_tuple and bottom-up vacuuming; or else maybe a > count of the subset of dead tuples that are likely not to get > opportunistically pruned in one way or another, if there's some way to > guess that. I don't know how to build something like that, since that works by understanding what's working, not by noticing that some existing strategy plainly isn't working. The only positive information that I have confidence in is the extreme case where you have zero index growth. 
Which is certainly possible, but perhaps not that interesting with a real workload. There are emergent behaviors with bottom-up deletion. Purely useful behaviors, as far as I know, but still very hard to precisely nail down. For example, Victor Yegorov came up with an adversarial benchmark [1] that showed that the technique dealt with index bloat from queue-like inserts and deletes that recycled the same distinct key values over time, since they happened to be mixed with non-hot updates. It dealt very well with that, even though *I had no clue* that it would work *at all*, and might have even incorrectly predicted the opposite if Victor had asked about it in advance. > I realize I'm > hand-waving, but if the property is a property of the heap rather than > the index, how will different indexes get different treatment? Maybe by making the primary key growth an indicator of what is reasonable for the other indexes (or other B-Tree indexes) -- it has a natural tendency to be the least bloated possible index. If you have something like a GiST index, or if you have a B-Tree index that constantly gets non-HOT updates that logically modify an indexed column, then it should become reasonably obvious. Maybe there'd be some kind of feedback behavior to lock in "bloat prone index" for a time. If we can bring costs into it too (e.g., spreading out the burden of index vacuuming over time), then it becomes acceptable to incorrectly determine which index needed special attention. We will still remember that that one index has been vacuumed up to a certain point, which is still useful -- that work would have to have been completed either way, so it's really no real loss. Plus we've spread the burden out over time, which is always useful. The cost control stuff could easily more than make up for the fact that we don't have a mythical perfect model that always knows exactly what to do, when, based on the needs of indexes. I think that expanding the scope to cover cost management actually makes this project easier, not harder. Costs really matter, and are much easier to understand. Cost control makes it okay to guess about benefits for the index/queries and be wrong. [1] https://www.postgresql.org/message-id/CAGnEbogATZS1mWMVX8FzZHMXzuDEcb10AnVwwhCtXtiBpg3XLQ@mail.gmail.com -- Peter Geoghegan
On Wed, Feb 9, 2022 at 7:43 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Feb 9, 2022 at 1:18 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I think that dead index tuples really don't matter if they're going to > get removed anyway before a page split happens. In particular, if > we're going to do a bottom-up index deletion pass before splitting the > page, then who cares if there are going to be dead tuples around until > then? You might think that they'd have the unfortunate effect of > slowing down scans, and they could slow down ONE scan, but if they do, > then I think kill_prior_tuple will hint them dead and they won't > matter any more. Actually I was not worried about the scan getting slow. What I was worried about is that if we keep ignoring the dead tuples for a long time, then in the worst case we might have a huge number of dead tuples in the index, maybe 80% to 90%, and then suddenly get a lot of insertions for keys which cannot use bottom-up deletion (due to the key range). Now we have a lot of pages which contain only dead tuples, but we will still allocate new pages, because we ignored the dead tuple % and did not vacuum for a long time. In short, I am worried about creating sudden bloat in the index due to a lot of existing dead tuples. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 9, 2022 at 6:18 PM Peter Geoghegan <pg@bowt.ie> wrote: > You seem to be vastly underestimating the value in being able to > spread out and reschedule the work, and manage costs more generally. Hmm. I think you're vastly overestimating the extent to which it's possible to spread out and reschedule the work. I don't know which of us is wrong. From my point of view, if VACUUM is going to do a full phase 1 heap pass and a full phase 2 heap pass on either side of whatever index work it does, there is no way that things are going to get that much more dynamic than they are today. And even if we didn't do that, in order to make any progress setting LP_DEAD pointers to LP_UNUSED, you have to vacuum the entire index, which might be BIG. It would be great to have a lot of granularity here but it doesn't seem achievable. > > I was thinking along the lines of trying to figure out either a more > > reliable count of dead tuples in the index, subtracting out whatever > > we save by kill_prior_tuple and bottom-up vacuuming; or else maybe a > > count of the subset of dead tuples that are likely not to get > > opportunistically pruned in one way or another, if there's some way to > > guess that. > > I don't know how to build something like that, since that works by > understanding what's working, not by noticing that some existing > strategy plainly isn't working. The only positive information that I have > confidence in is the extreme case where you have zero index growth. > Which is certainly possible, but perhaps not that interesting with a > real workload. > > There are emergent behaviors with bottom-up deletion. Purely useful > behaviors, as far as I know, but still very hard to precisely nail > down. For example, Victor Yegorov came up with an adversarial > benchmark [1] that showed that the technique dealt with index bloat > from queue-like inserts and deletes that recycled the same distinct > key values over time, since they happened to be mixed with non-hot > updates. It dealt very well with that, even though *I had no clue* > that it would work *at all*, and might have even incorrectly predicted > the opposite if Victor had asked about it in advance. I don't understand what your point is in these two paragraphs. I'm just arguing that, if a raw dead tuple count is meaningless because a lot of them are going to disappear harmlessly with or without vacuum, it's reasonable to try to get around that problem by counting the subset of dead tuples where that isn't true. I agree that it's unclear how to do that, but that doesn't mean that it can't be done. > > I realize I'm > > hand-waving, but if the property is a property of the heap rather than > > the index, how will different indexes get different treatment? > > Maybe by making the primary key growth an indicator of what is > reasonable for the other indexes (or other B-Tree indexes) -- it has a > natural tendency to be the least bloated possible index. If you have > something like a GiST index, or if you have a B-Tree index that > constantly gets non-HOT updates that logically modify an indexed > column, then it should become reasonably obvious. Maybe there'd be > some kind of feedback behavior to lock in "bloat prone index" for a > time. I have the same concern about this as what I mentioned before: it's purely retrospective. Therefore in my mind it's a very reasonable choice for a backstop, but a somewhat unsatisfying choice for a primary mechanism. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Feb 10, 2022 at 3:10 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Actually I was not worried about the scan getting slow. What I was > worried about is that if we keep ignoring the dead tuples for a long > time, then in the worst case we might have a huge number of dead > tuples in the index, maybe 80% to 90%, and then suddenly get a lot of > insertions for keys which cannot use bottom-up deletion (due to the key > range). Now we have a lot of pages which contain only dead tuples, but > we will still allocate new pages, because we ignored the dead tuple % > and did not vacuum for a long time. It seems like a reasonable concern to me ... and I think it's somewhat related to my comments about trying to distinguish which dead tuples matter vs. which don't. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Feb 10, 2022 at 11:16 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Feb 10, 2022 at 3:10 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Actually I was not worried about the scan getting slow. What I was > > worried about is that if we keep ignoring the dead tuples for a long > > time, then in the worst case we might have a huge number of dead > > tuples in the index, maybe 80% to 90%, and then suddenly get a lot of > > insertions for keys which cannot use bottom-up deletion (due to the key > > range). Now we have a lot of pages which contain only dead tuples, but > > we will still allocate new pages, because we ignored the dead tuple % > > and did not vacuum for a long time. > > It seems like a reasonable concern to me ... and I think it's somewhat > related to my comments about trying to distinguish which dead tuples > matter vs. which don't. It's definitely a reasonable concern. But once you find yourself in this situation, *every* index will need to be vacuumed anyway, pretty much as soon as possible. There will be many LP_DEAD items in the heap, which will be enough to force index vacuuming of all indexes. -- Peter Geoghegan
On Thu, Feb 10, 2022 at 11:14 AM Robert Haas <robertmhaas@gmail.com> wrote: > Hmm. I think you're vastly overestimating the extent to which it's > possible to spread out and reschedule the work. I don't know which of > us is wrong. From my point of view, if VACUUM is going to do a full > phase 1 heap pass and a full phase 2 heap pass on either side of > whatever index work it does, there is no way that things are going to > get that much more dynamic than they are today. Waiting to vacuum each index until the next VACUUM operation on the table gives us more TIDs to remove when we do go to vacuum one of these large indexes. Making decisions dynamically seems very promising because it gives us the most flexibility. In principle the workload might not allow for any of that, but in practice I think that it will. > I don't understand what your point is in these two paragraphs. I'm > just arguing that, if a raw dead tuple count is meaningless because a > lot of them are going to disappear harmlessly with or without vacuum, > it's reasonable to try to get around that problem by counting the > subset of dead tuples where that isn't true. I agree that it's unclear > how to do that, but that doesn't mean that it can't be done. VACUUM is a participant in the system -- it sees how physical relations are affected by the workload, but it also sees how physical relations are affected by previous VACUUM operations. If it goes to VACUUM an index on the basis of a relatively small difference (that might just be noise), and does so systematically and consistently, that might have unintended consequences. In particular, we might do the wrong thing, again and again, because we're overinterpreting noise again and again. > I have the same concern about this as what I mentioned before: it's > purely retrospective. Therefore in my mind it's a very reasonable > choice for a backstop, but a somewhat unsatisfying choice for a > primary mechanism. I'm not saying that it's impossible or even unreasonable to do something based on the current or anticipated state of the index, exactly. Just that you have to be realistic about how accurate that model is going to be in practice. In practice it'll be quite noisy, and that must be accounted for. For example, we could deliberately coarsen the information, so that only relatively large differences in apparent-bloatedness are visible to the model. The other thing is that VACUUM itself cannot be expected to operate with all that much precision, just because of how it works at a high level. Any quantitative measure will only be meaningful as a way of prioritizing work. That is going to be far easier if we make the behavior dynamic and continually reassess. Once a relatively large difference between two indexes first emerges, we can be relatively confident about what to do. But smaller differences are likely just noise. -- Peter Geoghegan