decoupling table and index vacuum - Mailing list pgsql-hackers
From: Robert Haas
Subject: decoupling table and index vacuum
Msg-id: CA+TgmoZgapzekbTqdBrcH8O8Yifi10_nB7uWLB8ajAhGL21M6A@mail.gmail.com
List: pgsql-hackers
Hi,

We are used to thinking about table vacuum and index vacuum as parts of a single, indivisible operation. You vacuum the table -- among other things by performing HOT pruning and remembering dead TIDs -- and then you vacuum the indexes -- removing the remembered TIDs from the index -- and then you vacuum the table some more, setting those dead TIDs unused -- and then you're done. And along the way you do some other things too, like considering truncation, that aren't relevant to the point I want to make here. Now, the problem with this is that every index has its own needs, which are separate from the needs of the table, as I think Peter Geoghegan and Masahiko Sawada were also discussing recently. Opportunistic index cleanup strategies like kill_prior_tuple and bottom-up deletion may work much better for some indexes than others, meaning that you could have some indexes that badly need to be vacuumed because they are full of garbage, and other indexes on the same table where the opportunistic cleanup has worked perfectly and there is no need for vacuuming at all. Separately, the table may or may not need to get some dead line pointers set back to unused to avoid table bloat. But, as things stand today, strategy options to deal with such situations are limited.

Leaving aside what the code actually does right now, let's talk about what options we have in theory with the technology as it exists now. They basically all boil down to stopping early and then redoing the work later. We must always start with a pass over the heap to collect dead TIDs, because otherwise there's nothing else to do. Now we can choose to stop, but then the next VACUUM will have to collect all those TIDs again. It may get to skip more all-visible pages than the current vacuum did, but the pages that still have dead TIDs will all have to be visited again. If we don't stop, then we can choose to vacuum all of the indexes or just some of them, and then afterwards stop. But if we do this, the next VACUUM will have to scan all indexes again for the same TIDs. Here, we don't even have the visibility map to allow skipping known-clean pages, so it's *really* a lot of work we have to redo. Thus what we normally do is press on to the third step, where we mark dead line pointers unused after scanning every index in its entirety, and now they're gone and we don't have to worry about them again.

Barring emergency escape valves, as things stand today, the frequency of table vacuuming is the same as the frequency of index vacuuming, even though the *required* frequency of vacuuming is not the same, and also varies from index to index.

Now, the reason for this is that when we discover dead TIDs, we only record them in memory, not on disk. So, as soon as VACUUM ends, we lose all knowledge of what those TIDs were and must rediscover them. Suppose we didn't do this, and instead had a "dead TID" fork for each table. Suppose further that this worked like a conveyor belt, similar to WAL, where every dead TID we store into the fork is assigned an identifying 64-bit number that is never reused. Then, suppose that for each index, we store the number of the oldest entry that might still need to be vacuumed from the index. Every time you perform what we now call the first heap pass of a VACUUM, you add the new TIDs you find to the dead TID fork. Every time you vacuum an index, the TIDs that need to be removed are those between the oldest-entry pointer for that index and the current end of the TID fork.
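To make the conveyor-belt bookkeeping a little more concrete, here is a rough sketch in C. It is purely illustrative -- every name here is invented, nothing like this exists in PostgreSQL today, and a real version would of course keep the fork on disk rather than in a struct:

#include <stdint.h>
#include <stdio.h>

typedef struct DeadTid
{
    uint32_t    block;          /* heap block number */
    uint16_t    offset;         /* line pointer offset within the block */
} DeadTid;

typedef struct DeadTidFork
{
    uint64_t    next_entry;     /* entry number the next appended TID will get */
    /* the TIDs themselves would live on disk; storage elided here */
} DeadTidFork;

typedef struct IndexCleanupState
{
    uint64_t    oldest_entry;   /* oldest entry not yet removed from this index */
} IndexCleanupState;

/* First heap pass: append newly found dead TIDs, handing out entry numbers. */
static uint64_t
dead_tid_append(DeadTidFork *fork, const DeadTid *tids, int ntids)
{
    uint64_t    first = fork->next_entry;

    (void) tids;                /* a real version would write these out */
    fork->next_entry += (uint64_t) ntids;
    return first;               /* entry number assigned to tids[0] */
}

/*
 * Index vacuum: the entries this index still needs removed are exactly
 * [oldest_entry, fork->next_entry); afterwards oldest_entry is advanced,
 * possibly by less than the whole range if memory is limited.
 */
static void
index_vacuum_range(const DeadTidFork *fork, const IndexCleanupState *idx,
                   uint64_t *start, uint64_t *end)
{
    *start = idx->oldest_entry;
    *end = fork->next_entry;
}

int
main(void)
{
    DeadTidFork fork = {0};
    IndexCleanupState idx_a = {0};
    DeadTid     tids[3] = {{10, 2}, {10, 5}, {17, 1}};
    uint64_t    start, end;

    dead_tid_append(&fork, tids, 3);
    index_vacuum_range(&fork, &idx_a, &start, &end);
    printf("index A must remove entries [%llu, %llu)\n",
           (unsigned long long) start, (unsigned long long) end);
    idx_a.oldest_entry = end;   /* all of them were removed this time */
    return 0;
}

The only properties that really matter are that entry numbers increase monotonically and are never reused, and that each index's oldest-entry pointer only ever moves forward.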
You remove all of those TIDs and then advance your oldest-entry pointer accordingly. If that's too many TIDs to fit in maintenance_work_mem, you can just read as many as will fit and advance your oldest-entry pointer less far. Every time you perform what we now call the second heap pass of a VACUUM, you find all the TIDs that precede every index's oldest-entry pointer and set them unused. You then throw away the associated storage at the OS level. This requires a scheme where relations can be efficiently truncated from the beginning rather than only at the end, which is why I said "a conveyor belt" and "similar to WAL". Details deliberately vague since I am just brainstorming here.

This scheme adds a lot of complexity, which is a concern, but it seems to me that it might have several benefits. One is concurrency. You could have one process gathering dead TIDs and adding them to the dead-TID fork while another process is vacuuming previously-gathered TIDs from some index. In fact, every index could be getting vacuumed at the same time, and different indexes could be removing different TID ranges. At the same time, you could have another process setting dead TIDs that all indexes have previously removed to unused. Furthermore, all of these operations can start in any order, and any of them can be repeated any number of times during a single run of any particular other one, or indeed without that particular one ever being run at all.

Both heap phases can also now be done in smaller chunks, if desired. You can gather TIDs from a portion of the table, remember where you left off, and come back and pick up from that point later, if you wish. You can likewise pick a subset of the dead TIDs already retired from all indexes, set just that many unused, and then at a later time come back and do some more.

Also, you can now make mostly-independent decisions about how to perform each of these operations. It's not completely independent: if you need to set some dead TIDs in the table to unused, you may have to force index vacuuming that isn't needed for bloat control. However, you only need to force it for indexes that haven't been vacuumed recently enough for some other reason, rather than for every index. If you have a target of reclaiming 30,000 TIDs, you can just pick the indexes that have fewer than 30,000 dead TIDs behind their oldest-entry pointers and force vacuuming only of those, as sketched below. By the time that's done, there will be at least 30,000 dead line pointers you can mark unused, and maybe more, minus whatever reclamation someone else did concurrently.

But is this worthwhile? I think it depends a lot on what you think the comparative required frequencies are for the various operations. If index A needs to be vacuumed every 40 minutes and index B needs to be vacuumed every 55 minutes and the table that owns both of them needs to be vacuumed every 70 minutes, I am not sure there is a whole lot here. I think you will be pretty much equally well off if you just do what we do today every 40 minutes and call it good. Also, you will not benefit very much if the driving force is reclaiming dead line pointers in the table itself. If that has to happen frequently, then the indexes have to be scanned frequently, and this whole thing is a lot of work for not much. But maybe that's not the case. Suppose index A needs to be vacuumed every hour to avoid bloat, index B needs to be vacuumed every 4 hours to avoid bloat, and the table needs dead line pointers reclaimed every 5.5 hours.
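To make the "force only the indexes that are behind" idea concrete, here is another purely illustrative sketch, with the same caveats as before (invented names, nothing real): the 64-bit values are conveyor-belt entry numbers, and only entries older than every index's oldest-entry pointer can be set unused.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Entries before "reclaimed_upto" have already been set unused in the heap.
 * Only entries that every index has already removed may be set unused, so
 * the reachable horizon is the minimum of the oldest-entry pointers.
 */
static uint64_t
reclaimable_horizon(const uint64_t *oldest_entry, int nindexes)
{
    uint64_t    horizon = UINT64_MAX;

    for (int i = 0; i < nindexes; i++)
        if (oldest_entry[i] < horizon)
            horizon = oldest_entry[i];
    return horizon;
}

/*
 * To reclaim at least "goal" more dead line pointers, every index's
 * oldest-entry pointer must reach reclaimed_upto + goal; only the indexes
 * still short of that point need a forced vacuum.
 */
static void
choose_indexes_to_force(const uint64_t *oldest_entry, int nindexes,
                        uint64_t reclaimed_upto, uint64_t goal, bool *force)
{
    uint64_t    target = reclaimed_upto + goal;

    for (int i = 0; i < nindexes; i++)
        force[i] = (oldest_entry[i] < target);
}

int
main(void)
{
    /* index A was vacuumed recently, index B is lagging far behind */
    uint64_t    oldest_entry[2] = {100000, 25000};
    uint64_t    reclaimed_upto = 20000;
    bool        force[2];

    /* only 5,000 entries are reclaimable until index B catches up */
    printf("can set unused up to entry %llu\n",
           (unsigned long long) reclaimable_horizon(oldest_entry, 2));

    /* to reclaim 30,000 more, force-vacuum only the indexes that are behind */
    choose_indexes_to_force(oldest_entry, 2, reclaimed_upto, 30000, force);
    for (int i = 0; i < 2; i++)
        printf("index %c needs forced vacuum: %s\n",
               'A' + i, force[i] ? "yes" : "no");
    return 0;
}

With the numbers in main(), index A is already far enough ahead and is left alone, while index B has to be force-vacuumed before 30,000 more dead line pointers can be reclaimed.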
Well, in the second scenario, you can gain a lot. You can vacuum index A frequently while vacuuming index B only as often as it needs, and you can reclaim dead line pointers on their own schedule based on whatever index vacuuming was already done for bloat avoidance. Without this scheme, there's just no way to give everybody what they need without some of the participants being "dragged along for the ride" and forced into work that they don't actually need done, simply because "that's how it works."

One thing I don't know is whether the kind of scenario that I describe above is common. That is, is the main reason we need to vacuum to control index bloat, where this kind of approach seems likely to help, or is it to reclaim dead line pointers in the heap, where it's not? I'd be interested in hearing from people who have some experience in this area, or at least better intuition than I do.

I'm interested in this idea partly because I think it would be much more likely to help in a hypothetical world where we had global indexes. Imagine a partitioned table where each partition has a local index, and then there is also a global index which indexes tuples from every partition. Waving away the difficulty of making such a thing work, there's a vacuuming problem here, which has been discussed before. In short, if you tried to handle this in the naive way, you'd end up having to vacuum the global index every time you vacuumed any partition. That sucks. Suppose that there are 1000 partitions, each partition is 1GB, and each local index is 50MB. All things being equal, the global index should end up being about as big as all of the local indexes put together, which in this case would be 50GB. Clearly, we do not want to vacuum each partition by scanning the 1GB partition + the 50MB local index + the 50GB global index. That's insane.

With the above system, since everything's decoupled, you can vacuum the partition tables individually as often as required, and similarly for their local indexes, but put off vacuuming the global index until you've vacuumed a bunch, maybe all, of the partitions, so that the work of cleaning up the global index cleans up dead TIDs from many or all partitions instead of just one at a time.

Now, the fly in the ointment here is that this supposes that we don't get forced into vacuuming the global index too quickly because of dead line pointer accumulation. But I think that if that does happen, with careful scheduling, we might not really be worse off than we would have been without partitioning. If we scan the table for just one partition and, say, exhaust maintenance_work_mem, we have some expensive index vacuuming to do immediately, but that would've also happened in pretty much the same way with an unpartitioned table. If we don't fill maintenance_work_mem but we do notice that the table for this partition is full of dead line pointers that we need to reclaim, we can still choose to scan some other partitions and collect some more dead TIDs before cleaning the global index. That could delay actually getting those line pointers reclaimed, but an unpartitioned table would have suffered from at least as much delay, because it wouldn't even consider the possibility of stopping before scanning every available table page, whereas here we could choose to stop after dealing with only some partitions but not all. It's probably tricky to get the autovacuum algorithm right here, but there seems to be some room for optimism.
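Sticking with the partitioned-table case for a moment, the scheduling idea could be sketched the same way: each partition gets its own dead-TID conveyor belt, the global index remembers how far it has cleaned up in each one, and it is only vacuumed once the combined backlog is big enough to justify a pass over the whole 50GB structure. Again, this is purely illustrative -- the threshold heuristic is just made up and all the names are invented:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NPARTS 4

/*
 * A global index spans every partition, so it tracks a separate
 * cleaned-up-to position per partition's dead-TID conveyor belt.
 */
typedef struct GlobalIndexProgress
{
    uint64_t    cleaned_upto[NPARTS];
} GlobalIndexProgress;

/*
 * Vacuum the (large) global index only once the total backlog across all
 * partitions justifies a full pass over it; partitions and their local
 * indexes can be vacuumed individually in the meantime.
 */
static bool
global_index_vacuum_needed(const GlobalIndexProgress *gip,
                           const uint64_t *fork_next_entry,
                           uint64_t backlog_threshold)
{
    uint64_t    backlog = 0;

    for (int p = 0; p < NPARTS; p++)
        backlog += fork_next_entry[p] - gip->cleaned_upto[p];
    return backlog >= backlog_threshold;
}

int
main(void)
{
    GlobalIndexProgress gip = {{0, 0, 0, 0}};

    /* three partitions have been heap-vacuumed recently, one not at all */
    uint64_t    fork_next_entry[NPARTS] = {40000, 5000, 0, 12000};

    printf("vacuum the global index now? %s\n",
           global_index_vacuum_needed(&gip, fork_next_entry, 100000)
           ? "yes" : "no");
    return 0;
}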
Even if global indexes never happened, though, I think this could have other benefits. For example, the wraparound failsafe mechanism recently added by Masahiko Sawada and Peter Geoghegan bypasses index vacuuming when wraparound danger is imminent. The only problem is that making that decision means throwing away the accumulated list of dead TIDs, which then need to be rediscovered whenever we get around to vacuuming the indexes. But that's avoidable if they're stored on disk rather than in RAM.

One rather serious objection to this whole line of attack is that we'd ideally like VACUUM to reclaim disk space without using any more, in case the motivation for running VACUUM in the first place is that we're short on disk space. A related objection is that, in cases where it's agreeable to do everything all at once as we currently do, the extra I/O could be avoided entirely. I think we'd probably have to retain a code path that buffers the dead TIDs in memory to account, at least, for the low-on-disk-space case, and maybe that can also be used to avoid I/O in some other cases, too. I haven't thought through all the details here. It seems to me that the actual I/O avoidance is probably not all that much -- each dead TID is much smaller than the deleted tuple that gave rise to it, and typically you don't delete all the tuples at once -- but it might be material in some cases, and it's definitely material if you don't have enough disk space left for it to complete without error.

Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com