The Free Space Map: Problems and Opportunities - Mailing list pgsql-hackers
From:    Peter Geoghegan
Subject: The Free Space Map: Problems and Opportunities
Msg-id:  CAH2-Wzk56qXpK4bSbiySauVyvZ4B4hF9Gm99T1Vtat+aomcaWQ@mail.gmail.com
List:    pgsql-hackers
I have suspected that there are serious design issues with the FSM (and related heapam code like hio.c) for several years now [1]. This has a lot to do with the experience of using Jan Wieck's BenchmarkSQL implementation of TPC-C [2][3][4]. It clearly makes Postgres exhibit pathological performance issues, especially heap fragmentation. There is a clear way in which free space in the two largest tables (orders and order lines) ought to be managed, given the fixed pattern of the workload, but that isn't what we see.

I had the opportunity to work with Jan Wieck and Greg Smith directly on this, shortly before I left Crunchy Data several weeks ago. Jan was able to fix some issues on the BenchmarkSQL side, which helped a lot. But the real problems remain. It is generally understood that the FSM leads to Postgres doing badly with TPC-C. I personally believe that this is representative of real practical problems, and not just a problem with this one benchmark.

This email is an attempt to impose some order on a disorderly problem space. I'd like to compare notes with other people that are interested in the problem. I suspect that some experienced hackers have yet to be convinced of the importance of the FSM when it comes to bloat. Hopefully I'll manage to make that case convincingly on this thread. If any of my more concrete claims about problems in the FSM seem like they need to be justified, please let me know. I am omitting details in this initial overview email for the sake of brevity. It's already too long.

Problems
--------

The FSM gives out space without considering the passage of time, or the likelihood that a particular transaction or client will consume its requested free space in a more or less predictable and steady way over time. It has no memory. The FSM therefore fails to capture naturally occurring locality, or at least doesn't specifically care about it. This is the central problem, from which all other problems with the FSM seem to stem in one way or another.

The FSM should group related rows (e.g. rows from the same transaction or backend) together, so that they reliably go on the same heap page -- careless mixing of unrelated rows should be avoided. When I say "related", I mean related in whatever sense the application or end user thinks of them as related. As a general rule we should expect groups of rows that were inserted by the same transaction to also be read, updated, deleted, or removed by VACUUM together, as a group. While this isn't guaranteed, it's a good working assumption for the FSM. It avoids unnecessary fragmentation.

The current FSM design causes significant fragmentation with workloads where users *expect* no fragmentation at all. I see this with TPC-C. The way that the FSM *ought* to behave for the workload in question is intuitively obvious, once you lay it all out. But that's not what we see in practice -- far from it. (Happy to go into more detail on this.)

You don't even need temporal locality to see problems. Even random inserts show up certain FSM implementation problems. The FSM lacks even a basic understanding of the *aggregate* result of backends applying various FSM heuristics over time, and with concurrent access.

Inserting backends currently behave like an amateur soccer team where every individual soccer player chases the ball, without any coordination among team members. The players in this analogy somehow fail to consider where the ball *is about to be* -- there is no temporal awareness.
They also miss the importance of cooperating with each other as a team -- there is also no spatial awareness, and no thought given to second order effects. This leads to increased buffer lock contention and fragmentation. (Other known FSM problems have been omitted for the sake of brevity.)

Background
----------

To recap, the FSM tracks how much free space is in every heap page. Most FSM maintenance takes place during VACUUM, though ordinary connection backends occasionally help to keep the FSM current, via calls to RecordAndGetPageWithFreeSpace() made from hio.c. There is also a bulk extension mechanism added by commit 719c84c1be, which is used when we detect multiple concurrent inserters. This bulk extension mechanism does change things a little (it gives some thought to systematic effects), though it still has the same basic issues: it doesn't do nearly enough to group likely-related rows together.

I suspect that improving the bulk extension mechanism will be an important part of improving the overall picture. That mechanism needs to be more explicit, and more careful about who gets what free space when. I'll go into the significance of this mechanism to the FSM below, under "Opportunities". But first, more background information.
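
To make the status quo concrete before going further, here is a minimal standalone sketch of the shape of that interaction. It is emphatically not the real code -- the real entry points are GetPageWithFreeSpace() and RecordAndGetPageWithFreeSpace() in freespace.c, driven by RelationGetBufferForTuple() in hio.c -- and everything named toy_* below is invented for illustration only:

    /*
     * Toy model (standalone C, not PostgreSQL code) of how a backend obtains
     * a heap page with enough free space today.  All toy_* names are invented
     * for illustration only.
     */
    #include <stdint.h>
    #include <stdlib.h>

    typedef uint32_t BlockNumber;               /* stand-in for the Postgres type */
    #define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

    typedef struct ToyFSM
    {
        size_t     *avail;                      /* recorded free bytes per heap block */
        BlockNumber nblocks;
    } ToyFSM;

    /*
     * Return any block whose *recorded* free space satisfies the request.
     * Note what is missing: no notion of which backend or transaction is
     * asking, no notion of when the space was freed, and no attempt to keep
     * one requester's rows together.
     */
    static BlockNumber
    toy_get_page_with_free_space(ToyFSM *fsm, size_t needed)
    {
        for (BlockNumber blk = 0; blk < fsm->nblocks; blk++)
        {
            if (fsm->avail[blk] >= needed)
                return blk;
        }
        return InvalidBlockNumber;              /* caller extends the relation */
    }

    /*
     * Record the true free space of a page that turned out to be too full,
     * then retry -- the overall shape of RecordAndGetPageWithFreeSpace().
     */
    static BlockNumber
    toy_record_and_get(ToyFSM *fsm, BlockNumber old_blk, size_t old_avail,
                       size_t needed)
    {
        fsm->avail[old_blk] = old_avail;
        return toy_get_page_with_free_space(fsm, needed);
    }

The point of the sketch is what is *not* there: the request carries no information about which transaction or backend is asking, or what it asked for a moment ago, so there is nothing for the FSM to remember.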

Papers
------

I've taken time to review the database research literature, which doesn't have all that much to say here. But there are a couple of relevant classic papers that I know of [6][7].

A lot of the considerations for free space management in heap-based systems like DB2 and Oracle are naturally intertwined with concurrency control, recovery, and even heavyweight locking. These systems must treat TIDs as stable identifiers of a logical row, and not identifiers of a physical tuple -- another important difference. "Free space" is only truly available to reuse in these systems some time after a deleter commits and releases its heavyweight locks. This is why their FSM equivalents must be WAL-logged. There is even a need for logical UNDO to cover these FSM-equivalent data structures, which have their own tricky rollback path issues that align with those in the heap structure. Everything is tied together to avoid rollback safety hazards. Physical rollback cannot be allowed to find that the tuple it needs to restore from UNDO no longer physically fits on the original heap page. Plus the line pointer can't just be recycled. It's all quite delicate.

At first I thought that this UNDO rollback safety business meant that I had nothing to learn from these old papers. Postgres naturally doesn't have to care about a transaction reusing space before another transaction that deleted the space commits (or after an inserter aborts) -- in Postgres, space is just that: space. We always have the *option* to create a new HOT chain on another page. However, I eventually changed my mind -- these old papers are relevant. In a way Postgres actually does have rollback. It doesn't involve delicate physical handling, in large part because TID stability isn't guaranteed in Postgres. But Postgres style "logical rollback" is more similar than it is different.

Logical database
----------------

While it certainly is good that Postgres has the freedom to allow the physical representation of the logical database to diverge in all kinds of ways, that doesn't mean that we should be totally unconcerned about just how far the divergence goes. It's good that we have the option to retain a few extra versions without considering rollback hazards. But an option isn't an obligation. That's what makes an option useful -- it gives us the ability to have our cake, and eat it too. Clearly we should try to avoid the need for a new HOT chain on another heap block, for example. OTOH we shouldn't bend over backwards to make sure of it -- that can't be worth it, since it isn't a matter of fundamental correctness with Postgres style concurrency control. As I said, we should try to be a little bit like ARIES with full UNDO rollback, purely to avoid bloat and improve temporal locality -- concurrent inserters should not clobber the same heap pages.

I believe that something a little like FSM transaction rollback is just what we need here. A design that is inspired by other existing designs, including even the UNDO + rollback parts, which are pretty essential. Transaction boundaries and ownership semantics for free space really do seem to matter when it comes to avoiding bloat. Roughly the same high level approach led to my developing bottom-up index deletion for Postgres 14 [5] -- the same "logical vs physical" tension exists in Postgres B-Tree indexes.

Using "logical database surrogates" in the physical database is actually a standard family of techniques used in DB systems design. This family of techniques has been underexploited within Postgres, probably because it isn't so natural in a world without UNDO style rollback. This family of techniques is something to keep in your "bag of tricks" as a Postgres hacker (as Hellerstein once put it). I'll return to how we might actually do something a bit like UNDO style rollback in the FSM later on. It is counterintuitive, but stay with me.

Open and closed pages
---------------------

Note that other designs for FSM style structures (for a heap table access method) all seem to have fewer than 10 possible "page has this much free space" increments [6] -- the granularity is far coarser than our own FSM_CATEGORIES granularity (which has a massive 255 distinct increments of free space for each heap page). One reason for this is that ignoring tiny differences avoids contention and fragmentation from chasing insignificant differences between heap blocks, which I don't need to go into again (my earlier amateur soccer team analogy is enough for now). But it's also related to the logical database/rollback stuff that I just went into. It sometimes makes sense to think of a physical heap page as more than just a data structure. Even things that we tend to naturally think of as strictly physical may need to be rethought just a little.

Take DB2's version of the FSM, FSIP [7]. This design usually doesn't ever end up inserting *any* new logical rows on a heap page after the page first fills -- it is initially "open" when first allocated, and then quickly becomes "closed" to inserts of new logical rows, once its remaining free space falls below a certain small threshold. The heap page usually stays "closed" forever. While the design does not *completely* ignore the possibility that the page will eventually become empty enough that reuse really does look like a good idea, it makes that rare. A heap page needs to have rather a lot of the original logical rows deleted before that becomes possible again. It's rare for it to go backwards because the threshold that makes a closed page become open again is much lower (i.e. it requires much more free space) than the threshold that initially made a (newly allocated) heap page closed. This one-way stickiness quality is important. As in simulated annealing algorithms, heap pages naturally settle.
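
Concretely, the kind of coarse, asymmetric state transition I have in mind might look something like the toy sketch below. The thresholds are made-up numbers chosen only to show the shape of the idea, not a proposal, and none of this is actual Postgres code:

    /*
     * Standalone sketch of FSIP-style "open"/"closed" heap page states with
     * hysteresis.  Thresholds are invented for illustration.
     */
    #include <stddef.h>

    #define TOY_BLCKSZ 8192                     /* stand-in for BLCKSZ */

    typedef enum ToyPageState
    {
        TOY_PAGE_OPEN,                          /* still accepts new logical rows */
        TOY_PAGE_CLOSED                         /* ignored when looking for space */
    } ToyPageState;

    /*
     * Asymmetric transitions: a page closes as soon as its free space drops
     * below ~10% of the block, but only reopens once deletes leave it at
     * least ~75% empty.  The wide gap between the two thresholds is the
     * hysteresis: noise-level changes in free space can never flip a page
     * back and forth.
     */
    static ToyPageState
    toy_next_state(ToyPageState cur, size_t free_bytes)
    {
        if (cur == TOY_PAGE_OPEN && free_bytes < TOY_BLCKSZ / 10)
            return TOY_PAGE_CLOSED;
        if (cur == TOY_PAGE_CLOSED && free_bytes > (TOY_BLCKSZ * 3) / 4)
            return TOY_PAGE_OPEN;
        return cur;                             /* otherwise the page stays settled */
    }

A handful of coarse states like this is much closer to the fewer-than-10 increments used by the designs in [6] than to our current 255 FSM_CATEGORIES.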

The algorithm makes it hard to change things once they become settled -- it has to be really worth it before we'll flip a page back to "open" again. There is a general bias against disturbing the settled state. (I think that the terms "open" and "closed" come from the world of online bin packing problems, where the same tension and uncertainties exist.)

This stickiness concept is called "hysteresis" by some DB researchers, often when discussing UNDO stuff [8]. Having *far less* granularity than FSM_CATEGORIES/255 seems essential to make that work as intended. Pages need to be able to settle without being disturbed by noise-level differences. That's all that super fine grained values buy you: more noise, more confusion.

Visibility map
--------------

If the logical database and natural locality are important to the FSM, then what about the visibility map? And what about the relationship between the FSM and the visibility map, in light of all this?

Currently VACUUM doesn't care about how its FSM behavior interacts with how it sets all-frozen/all-visible bits for the same heap page. To me this seems completely unreasonable -- they're *obviously* related! By ignoring the VM/FSM relationship we probably only gain a minimal amount of free space on one occasion, and we pay a high cost for it. Worst of all we're *perpetuating the cycle* of dirtying and redirtying the same pages over time. Maybe we should go as far as merging the FSM and VM, even -- that seems like a natural consequence of logical-ish/qualitative definitions of "page is full".

Opportunities
-------------

Now more about what I think a new FSM design for Postgres *should* do. This is almost a description of an actual data structure. A new FSM. But not quite.

As we've seen, free space management in a true ARIES-style system (like the DB2 FSIP design) is forced to think of free space in terms of "leases", or bulk-allocated free lists of usually-contiguous pages. Under a true ARIES design, it is strictly necessary to track which deleter XID freed up which individual lease of free space, lest some concurrent inserter reuse the space before it is truly safe to do so -- the deleter xact must commit before that reuse can be safely allowed. While delete + commit does seem mostly irrelevant to the Postgres FSM, insert + abort case handling seems to be something that we can learn from. Maybe the same principles can be applied in Postgres, where we need leases to avoid leaks in the presence of errors (not necessarily rollback per se). We must be able to tolerate incorrect speculations about where free space will be needed next. Transactions must be given the opportunity to make use of the heap free space lease that they asked for -- but not forever. It's a balancing act.

Imagine if the FSM tracked recent free space requests that have been satisfied already. Requesting backends would maintain this same metadata as they burn through their bulk allocation of reserved space/heap pages. This new structure would contain metadata like the requesting XID/backend for a given lease of heap pages, the number of leases during the last bulk extension operation, and maybe the number of concurrently waiting backends at that time. Now we have some sense of recent history, and can start to recognize trends over time. The FSM is now able to *take back* some of the space that it gives out, based on new information. Now the FSM can afford to make mistakes, because mistakes can always be corrected a little later.
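
As a strawman, the bookkeeping for one such lease might look something like the standalone sketch below. Every name and field here is invented for illustration; it is the general shape of the idea, not a concrete design:

    /*
     * Strawman bookkeeping for free space "leases" (standalone C, invented
     * names -- not a concrete design, and not PostgreSQL code).
     */
    #include <stdint.h>

    typedef uint32_t BlockNumber;               /* stand-ins for the Postgres types */
    typedef uint32_t TransactionId;

    /* One contiguous run of heap pages handed out to a single requester. */
    typedef struct ToyFreeSpaceLease
    {
        TransactionId owner_xid;                /* transaction the range was leased to */
        int           owner_backend;            /* requesting backend, for later cleanup */
        BlockNumber   start_block;              /* first heap block in the range */
        int           nblocks;                  /* length of the contiguous range */
        int64_t       granted_at;               /* when the lease was handed out */
    } ToyFreeSpaceLease;

    /* What the FSM remembers about the most recent bulk extension operation. */
    typedef struct ToyBulkExtensionRecord
    {
        int               nwaiters;             /* backends waiting at the time */
        int               nleases;              /* leases carved out of the extension */
        ToyFreeSpaceLease leases[64];           /* arbitrary fixed cap, for the sketch */
    } ToyBulkExtensionRecord;

With something along these lines, the leader backend in the bulk extension path finally has a way to hand a specific contiguous range to a specific follower, and the FSM has enough context to later notice which ranges were never used up.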

The FSM can be generous based on speculation, heuristics, whatever it might be. It should be possible for the FSM to have it both ways, more or less.

Consider the bulk extension mechanism for concurrent inserters that was added by commit 719c84c1be once again. The mechanism's leader backend process performs the actual bulk heap relation extension, allocating up to 512 new heap pages all at once. Even when the leader backend ends up extending a heap relation by quite a few pages (hundreds), the leader is still unable to *directly* hand off free space to the particular backends that drove the leader to extend aggressively in the first place. It would be quite natural for the leader backend to say to each individual follower backend: thank you for waiting for me, here is a big contiguous range of heap pages (e.g., as many as 50 heap pages or more per xact when things truly ramp up), which should be enough to allow you to avoid going to the FSM again for quite a long time. But the leader backend doesn't do that at all -- it just puts all of the pages (up to 512 total) in the FSM. All of the waiting follower backends are left to senselessly fight it out for free space by simply going back to the FSM when they wake up again. Why can't they cooperate intelligently? What stops that today?

FSM "rollback"
--------------

(This is where my argument comes together, finally.)

FSM rollback in the FSIP paper is really just about reliable ownership semantics: ownership of free space in heap pages that is scoped at the level of transactions. It seems to me that ownership of free space in heap pages by particular transactions is really all that the bulk extension FSM logic is missing -- and getting that right seems like a big part of fixing the FSM comprehensively.

Today the bulk extension mechanism cannot *trust* backends with more than a little space at a time. While on average each backend's use of free space probably *is* predictable over time, it only takes one or two straggler backends per bulk operation to mess everything up -- that's enough for the system to end up with very poor space utilization. Since we don't have any sense of ownership of space in heap pages by transactions today, the code must hedge against stragglers/space leaks by making the space *generally* available to all. Of course, this hedging comes at a high cost: it prevents the whole system (heapam, FSM, and backends) from capturing naturally occurring locality in a variety of important cases. Including the important TPC-C case I mentioned.

Explicit scoped ownership of free space in heap pages removes the need to hedge. It addresses the problem of would-be space leaks directly, while *also* fixing the FSM locality problems implicitly. Metadata about recent requests gives us the *context* needed to do all this well. Sure, we don't need it for transaction rollback using ARIES style UNDO, but we can still use it for this.

Perhaps this can be pushed much further. Long running transactions ought to have as much time as they need to use the large lease of free space that they were provided. But when the transaction commits or aborts, we might then make the FSM code much more aggressive about stealing space back from the backend that inherits the lease from the now-committed transaction. It all depends. Maybe we can teach VACUUM to eagerly clean up after aborted transactions that are known to have consumed a lot of free space that turns out not to be needed for the inserted rows.
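
To make that concrete, here is a hedged sketch of what end-of-transaction handling of a lease could look like, continuing the invented toy_* naming from the earlier strawman (none of this is a real API): eager reclamation after an abort, lazier stealing-back after a commit.

    /*
     * Sketch of free space lease ownership ending at transaction commit or
     * abort (standalone C, invented names -- not PostgreSQL code).
     */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;               /* stand-in for the Postgres type */

    typedef enum ToyXactOutcome
    {
        TOY_XACT_COMMIT,
        TOY_XACT_ABORT
    } ToyXactOutcome;

    /* Stub standing in for "hand these blocks back to the shared FSM". */
    static void
    toy_return_blocks(BlockNumber start, int nblocks, bool eager)
    {
        (void) start;
        (void) nblocks;
        (void) eager;                           /* no-op in this sketch */
    }

    /*
     * End-of-transaction hook for a lease.  The unused tail of the leased
     * range becomes generally available again, but *when* differs: pages
     * left over by an aborted transaction hold little of value and can be
     * reclaimed eagerly (perhaps with VACUUM's help), while pages left over
     * by a committed transaction can be stolen back lazily, or simply
     * inherited by the same backend's next transaction.
     */
    static void
    toy_release_lease(BlockNumber start_block, int leased_blocks,
                      int blocks_actually_used, ToyXactOutcome outcome)
    {
        int leftover = leased_blocks - blocks_actually_used;

        if (leftover <= 0)
            return;                             /* the whole lease was consumed */

        toy_return_blocks(start_block + blocks_actually_used, leftover,
                          outcome == TOY_XACT_ABORT);
    }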

It might also be feasible to make our new slimmed-down FSM crash-safe. The way that the REDO routines for VACUUM handle FSM maintenance is pretty sloppy right now.

Bean counting
-------------

In general I find the idea that an algorithm can always make better decisions with more information to be dubious. Accounting for every tiny little fragment of free space ("bean counting") is definitely not an intrinsic good. Maybe keeping that information is not just wasteful -- maybe it's actually harmful. There is a real risk of overinterpreting noise-level differences [9]. I'd prefer to err in the direction of underfitting.

I now wonder if the FSM is fundamentally doing the wrong thing by keeping track of all "free space" in every page over time. Why wouldn't we just have a free space management strategy that is generally only concerned with recent events? If we find a way to make almost all newly allocated heap pages become "closed" quickly (maybe they're marked all-frozen quickly too), and make sure that that condition is sticky, then this can work out well. We may not even need to store explicit free space information for most heap pages in the database -- a page being "closed" can be made implicit by the FSM and/or VM. Making heap pages age out like this (and mostly stay that way over time) has obvious benefits in many different areas.

This won't help workloads like stock pgbench, which is pretty unrealistic anyway. But it needn't hurt them either.

[1] https://postgr.es/m/CAH2-WzkEP6wz2Eg7mgKbF-qTPi21+BWrJgNm0qfU5kf0iJBV2g@mail.gmail.com
[2] https://postgr.es/m/0265f9e2-3e32-e67d-f106-8abde596c0e4@commandprompt.com
[3] https://github.com/wieck/benchmarksql/commit/6ac326ad23d55de2edc18bfbffb10b21f1b39b48
[4] https://github.com/wieck/benchmarksql/commit/0830c5f526061529eb2b45831c544539caa65435
[5] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d168b666823b6e0bcf60ed19ce24fb5fb91b8ccf
[6] https://dl.acm.org/doi/abs/10.1145/235968.233355
[7] https://www.semanticscholar.org/paper/Algorithms-for-Flexible-Space-Management-in-Systems-Mohan-Haderle/9db1a8104daade31949b6399cac9169751fa1605
[8] https://queue.acm.org/detail.cfm?id=3099561
[9] https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff#Approaches

--
Peter Geoghegan