heapam and bottom-up garbage collection, keeping version chains short (Was: Deleting older versions in unique indexes to avoid page splits) - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | heapam and bottom-up garbage collection, keeping version chains short (Was: Deleting older versions in unique indexes to avoid page splits) |
Date | |
Msg-id | CAH2-Wznu00SnGSw-idMj9DLy59XJxSLWvshRCO5sGC--eaguMQ@mail.gmail.com |
In response to | Re: Deleting older versions in unique indexes to avoid page splits (Robert Haas <robertmhaas@gmail.com>) |
List | pgsql-hackers |
On Thu, Oct 22, 2020 at 6:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> But that being said, I'm not trying to derail this patch. It isn't,
> and shouldn't be, the job of this patch to solve that problem. It's
> just better if it doesn't regress things, or maybe even (as you say)
> makes them a little better. I think the idea you've got here is
> basically good, and a lot of it comes down to how well it works in
> practice.

Thanks. I would like to have a more general conversation about how some of the principles embodied by this patch could be generalized to heapam in the future. I think that we could do something similar with pruning, or at least with the truncation of HOT chains. Here are a few of the principles I'm thinking of, plus some details of how they might be applied in heapam:

* The most important metric is not the total amount of dead tuples in the whole table. Rather, it's something like the 90th and 99th percentile length of version chains for each logical row, or maybe only for those logical rows actually accessed by index scans recently. (I'm not saying that we should try to target this in some explicit way -- just that it should be kept low with real workloads as a natural consequence of the design.)

* Have a variety of strategies for dealing with pruning that compete with each other. Allow them to fight it out organically. Start with the cheapest thing and work your way up. So, for example, truncation of HOT chains is always preferred, and works in just the same way as it does today. Then it gets more complicated and more expensive, but in a way that is a natural adjunct of the current proven design.

I believe that this creates a bit more pain earlier on, but avoids much more pain later. There is no point powering ahead if we'll end up hopelessly indebted with garbage tuples in the end.

For heapam we could escalate from regular pruning by using an old version store for versions that were originally part of HOT chains, but were found not to fit on the same page at the point of opportunistic pruning. The old version store is for committed versions only -- it isn't really much like UNDO. We leave forwarding information on the original heap page.

We add old committed versions to the version store at the point when a heap page is about to experience an event that is roughly the equivalent of a B-Tree page split -- for heapam, a "split" is being unable to keep the same logical row on the same heap page over time in the presence of many updates. This is our last line of defense against this so-called page split situation. It occurs after regular pruning (which is an earlier line of defense) fails to resolve the situation inexpensively.

We can make moving tuples into the old version store WAL-efficient by making it work like an actual B-Tree page split -- it can describe the changes in a way that's mostly logical, and based on the existing page image. And like an actual page split, we're amortizing costs in a localized way.

It is also possible, for example, to apply differential compression to whole HOT chains rather than storing them in the old version store, if that happens to look favorable (think of rows with something like a pgbench_accounts.filler column -- not uncommon). We have options, and we can add new options in the future as new requirements come to light. We allow the best option to present itself to us in a bottom-up fashion.
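To make that escalation idea a little more concrete, here is a toy, standalone sketch of the kind of decision ladder I have in mind. To be clear, this is not PostgreSQL code -- every name, struct, and number here is made up purely for illustration; the real thing would have to be expressed in terms of the existing pruning and page infrastructure:

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical escalation ladder: try the cheapest way of keeping the
 * incoming successor version on its original heap page first, and only
 * fall back to more expensive strategies when that fails.
 */
typedef enum
{
    STRATEGY_HOT_PRUNE,      /* classic opportunistic pruning/HOT truncation */
    STRATEGY_DIFF_COMPRESS,  /* differentially compress whole HOT chains */
    STRATEGY_VERSION_STORE,  /* relocate old committed versions to the old
                              * version store, leaving forwarding info */
    STRATEGY_ACCEPT_SPLIT    /* give up: successor version goes to some
                              * other heap block */
} GcStrategy;

/* Made-up per-page accounting used by the toy decision function below */
typedef struct
{
    int bytes_needed;        /* space the incoming version requires */
    int prunable_bytes;      /* reclaimable by plain pruning */
    int compressible_bytes;  /* additionally reclaimable by compression */
    int relocatable_bytes;   /* additionally reclaimable by moving old
                              * committed versions to the version store */
} PageGcState;

static GcStrategy
choose_gc_strategy(const PageGcState *page)
{
    int     avail = page->prunable_bytes;

    if (avail >= page->bytes_needed)
        return STRATEGY_HOT_PRUNE;

    avail += page->compressible_bytes;
    if (avail >= page->bytes_needed)
        return STRATEGY_DIFF_COMPRESS;

    avail += page->relocatable_bytes;
    if (avail >= page->bytes_needed)
        return STRATEGY_VERSION_STORE;

    return STRATEGY_ACCEPT_SPLIT;
}

int
main(void)
{
    /* A page where pruning alone is not enough, but compression is */
    PageGcState page = {180, 60, 150, 400};

    printf("chosen strategy: %d\n", (int) choose_gc_strategy(&page));
    return 0;
}
```

The point is just that the cheapest strategy that actually keeps the row on its original page wins, and the expensive version store path is reached only after the cheaper strategies have been tried against the same page and found wanting.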
Sometimes the best strategy at the local level is actually a combination of two different strategies applied alternately over time. For example, we use differential compression first when we fail to prune, then we prune the same page later (a little like merge deduplication in my recent nbtree delete dedup patch). Maybe these two strategies (differential compression + traditional heap HOT chain truncation) get applied alternately against the same heap page over time, in a tick-tock fashion. We naturally avoid resorting to the old version store structure, which is good, since that is a relatively expensive strategy that we should apply only as a last resort.

This tick-tock behavior is an emergent property of the workload rather than something planned or intelligent, and yet it kind of appears to be an intelligent strategy. (Maybe it works that way permanently in some parts of the heap, or maybe the same heap blocks only tick-tock like this on Tuesdays. It may be possible for stuff like that to happen sanely with well-chosen simple heuristics that exploit asymmetry.)

* Work with (not against) the way that Postgres strongly decouples the physical representation of data from the logical contents of the database, compared to other DB systems. But at the same time, work hard to make the physical representation of the data as close as is practically possible to an idealized, imaginary logical version of the data. Do this because it makes queries faster, not because it's strictly necessary for concurrency control/recovery/whatever.

Concretely, this mostly means that we try our best to keep each logical row (i.e. the latest physical version or two of each row) located on the same physical heap block over time, using the escalation strategy I described or something like it. Notice that we're imposing a cost on queries that are arguably responsible for creating garbage, but only when we really can't tolerate more garbage collection debt. But if it doesn't work out, that's okay -- we tried.

I have a feeling that it will in fact mostly work out. Why shouldn't it be possible to have no more than one or two uncommitted row versions in a heap page at any given time, just like with my nbtree patch? (I think that the answer to that question is "weird workloads", but I'm okay with the idea that they're a little slower or whatever.) Notice that this makes the visibility map work better in practice.

I also think that the FSM needs to be taught that it isn't a good thing to reuse a little fragment of space on its own, because that works against our goal of trying to avoid relocating rows. The current logic seems focused on using every little scrap of free space no matter what, which seems pretty misguided. Penny-wise, pound-foolish.
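To illustrate what I mean, here is another toy sketch -- again, not the actual FSM code, and the threshold is entirely made up -- of a placement rule that declines to reuse a page when only a small scrap of free space would remain afterwards:

```c
#include <stdbool.h>
#include <stdio.h>

/* Purely illustrative threshold; the real number would need real thought */
#define USEFUL_LEFTOVER_BYTES   512

/*
 * Toy placement rule: only reuse an existing page for a new row version
 * if, after the insertion, enough free space would remain for future
 * versions of rows that already live there.  Otherwise, leave the scrap
 * of space alone rather than working against the goal of keeping logical
 * rows on their original block.
 */
static bool
page_worth_reusing(int free_bytes, int needed_bytes)
{
    if (free_bytes < needed_bytes)
        return false;       /* doesn't fit at all */

    return (free_bytes - needed_bytes) >= USEFUL_LEFTOVER_BYTES;
}

int
main(void)
{
    printf("%d\n", page_worth_reusing(900, 200));   /* 1: plenty left over */
    printf("%d\n", page_worth_reusing(230, 200));   /* 0: penny-wise scrap */
    return 0;
}
```

Obviously the real policy would have to take things like fillfactor and the workload into account, but the point stands: a tiny leftover fragment is often worth less than nothing here.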
Also notice that fighting to keep the same logical row on the same block has a non-linear payoff. We don't need to give up on that goal at the first sign of trouble. If it's hard to do a conventional prune after succeeding a thousand times, then it's okay to work harder. Only a sucker gives up at the first sign of trouble. We're playing a long game here.

If it becomes so hard that even applying a more aggressive strategy fails, then it's probably also true that it has become inherently impossible to sustain the original layout. We graciously accept defeat and cut our losses, having only actually wasted a little effort to learn that we need to move our incoming successor version to some other heap block (especially because the cost is amortized across versions that live on the same heap block).

* Don't try to completely remove VACUUM. Treat its top-down approach as complementary to the new bottom-up approaches we add. There is nothing wrong with taking a long time to clean up garbage tuples in heap pages that have very little garbage in total. In fact I think that it might be a good thing. Why be in a hurry to dirty the page again? If it becomes a real problem in the short term, then the bottom-up stuff can take care of it.

Under this old-but-new paradigm, maybe VACUUM has to visit indexes a lot less (maybe it just decides not to sometimes, based on stats about the indexes, like we see today with vacuum_index_cleanup = off). VACUUM is for making infrequent "clean sweeps", though it typically leaves most of the work to the new bottom-up approaches, which are well placed to understand the needs of queries that touch nearby data. Autovacuum does not presume to understand the needs of queries at all.

It would also be possible under this model for VACUUM to make more regular sweeps over the version store without disturbing the main relation. The version store isn't that different from a separate heap relation, but it naturally doesn't require traditional index cleanup or freezing, and it naturally stores things in approximately temporal order. So we recycle space in the version store in a relatively eager, circular fashion, because it naturally contains data that favors such an approach. We make up the fixed cost of using the separate old version store structure by reducing deferred costs like this -- and by only using it when it is demonstrably helpful.

It might also make sense for us to prioritize putting heap tuples that represent versions whose indexed columns were changed by an update (basically a non-HOT update) in the version store -- we work extra hard on that (and just leave behind an LP_DEAD line pointer). That way VACUUM can do retail index tuple deletion for the index whose columns were modified when it finds them in the version store (presumably this nbtree patch of mine works well enough with the logically unchanged index entries for the other indexes that we don't need to go out of our way).

I'm sure that there are more than a few holes in this sketch of mine. It's not really worked out, but it has certain properties that are generally underappreciated -- especially the point about the worst-case number of versions per logical row being extremely important, as well as the idea that back pressure can be a good thing when push comes to shove -- provided it is experienced locally, and only by queries that update the same very hot logical rows. Back pressure needs to be proportionate and approximately fair.

Another important point that I want to call out again here is that we should try to exploit cost/benefit asymmetry opportunistically. That seems to have worked out extremely well in my recent B-Tree delete dedup patch.

I don't really expect anybody to take this seriously -- especially as a total blueprint. Maybe some elements of what I've sketched can be used as part of some future big effort. You've got to start somewhere. It has to be incremental.

--
Peter Geoghegan