Thread: The Free Space Map: Problems and Opportunities
I have suspected that there are serious design issues with the FSM (and related heapam code like hio.c) for several years now [1]. This has a lot to do with the experience of using Jan Wieck's BenchmarkSQL implementation of TPC-C [2][3][4]. It clearly makes Postgres exhibit pathological performance issues, especially heap fragmentation.

There is a clear way in which free space in the two largest tables (orders and order lines) ought to be managed, given the fixed pattern of the workload, but that isn't what we see. I had the opportunity to work with Jan Wieck and Greg Smith directly on this, shortly before I left Crunchy Data several weeks ago. Jan was able to fix some issues on the BenchmarkSQL side, which helped a lot. But the real problems remain. It is generally understood that the FSM leads to Postgres doing badly with TPC-C. I personally believe that this is representative of real practical problems, and not just a problem with this one benchmark.

This email is an attempt to impose some order on a disorderly problem space. I'd like to compare notes with other people that are interested in the problem. I suspect that some experienced hackers have yet to be convinced of the importance of the FSM when it comes to bloat. Hopefully I'll manage to make that case convincingly on this thread. If any of my more concrete claims about problems in the FSM seem like they need to be justified, please let me know. I am omitting details in this initial overview email for the sake of brevity. It's already too long.

Problems
--------

The FSM gives out space without considering the passage of time, or the likelihood that a particular transaction or client will consume its requested free space in a more or less predictable and steady way over time. It has no memory. The FSM therefore fails to capture naturally occurring locality, or at least doesn't specifically care about it.
This is the central problem, that all other problems with the FSM seem to stem from in one way or another.

The FSM should group related rows (e.g. rows from the same transaction or backend) together, so that they reliably go on the same heap page -- careless mixing of unrelated rows should be avoided. When I say "related", I mean related in whatever sense the application or end user thinks of them as related. As a general rule we should expect groups of rows that were inserted by the same transaction to also be read, updated, deleted, or removed by VACUUM together, as a group. While this isn't guaranteed, it's a good working assumption for the FSM. It avoids unnecessary fragmentation.

The current FSM design causes significant fragmentation with workloads where users *expect* no fragmentation at all. I see this with TPC-C. The way that the FSM *ought* to behave for the workload in question is intuitively obvious, once you lay it all out. But that's not what we see in practice -- far from it. (Happy to go into more detail on this.)

You don't even need temporal locality to see problems. Even random inserts show up certain FSM implementation problems. The FSM lacks even a basic understanding of the *aggregate* result of backends applying various FSM heuristics over time, and with concurrent access.

Inserting backends currently behave like an amateur soccer team where every individual soccer player chases the ball, without any coordination among team members. The players in this analogy somehow fail to consider where the ball *is about to be* -- there is no temporal awareness. They also miss the importance of cooperating with each other as a team -- there is also no spatial awareness, and no thought given to second order effects. This leads to increased buffer lock contention and fragmentation.

(Other known FSM problems have been omitted for the sake of brevity.)

Background
----------

To recap, the FSM tracks how much free space is in every heap page.
Most FSM maintenance takes place during VACUUM, though ordinary connection backends occasionally help to keep the FSM current, via calls to RecordAndGetPageWithFreeSpace() made from hio.c.

There is also a bulk extension mechanism added by commit 719c84c1be, which is used when we detect multiple concurrent inserters. This bulk extension mechanism does change things a little (it gives some thought to systematic effects), though it still has the same basic issues: it doesn't do nearly enough to group likely-related rows together. I suspect that improving the bulk extension mechanism will be an important part of improving the overall picture. That mechanism needs to be more explicit, and more careful about who gets what free space when. I'll go into the significance of this mechanism to the FSM below, under "Opportunities". But first, more background information.

Papers
------

I've taken time to review the database research literature, which doesn't have all that much to say here. But there are a couple of relevant classic papers that I know of [6][7].

A lot of the considerations for free space management in heap-based systems like DB2 and Oracle are naturally intertwined with concurrency control, recovery, and even heavyweight locking. These systems must treat TIDs as stable identifiers of a logical row, and not identifiers of a physical tuple -- another important difference. "Free space" is only truly available to reuse in these systems some time after a deleter commits and releases its heavyweight locks. This is why their FSM equivalents must be WAL-logged. There is even a need for logical UNDO to cover these FSM-equivalent data structures, which have their own tricky rollback path issues that align with those in the heap structure. Everything is tied together to avoid rollback safety hazards. Physical rollback cannot be allowed to find that the tuple it needs to restore from UNDO will no longer physically fit on the original heap page. Plus the line pointer can't just be recycled.
It's all quite delicate.

At first I thought that this UNDO rollback safety business meant that I had nothing to learn from these old papers. Postgres naturally doesn't have to care about a transaction reusing space before another transaction that deleted the space commits (or after an inserter aborts) -- in Postgres, space is just that: space. We always have the *option* to create a new HOT chain on another page. However, I eventually changed my mind -- these old papers are relevant. In a way Postgres actually does have rollback. It doesn't involve delicate physical handling, in large part because TID stability isn't guaranteed in Postgres. But Postgres style "logical rollback" is more similar than it is different.

Logical database
----------------

While it certainly is good that Postgres has the freedom to allow the physical representation of the logical database to diverge in all kinds of ways, that doesn't mean that we should be totally unconcerned about just how far the divergence goes. It's good that we have the option to retain a few extra versions without considering rollback hazards. But an option isn't an obligation. That's what makes an option useful -- it gives us the ability to have our cake, and eat it too.

Clearly we should try to avoid the need for a new HOT chain on another heap block, for example. OTOH we shouldn't bend over backwards to make sure of it -- that can't be worth it, since it isn't a matter of fundamental correctness with Postgres style concurrency control. As I said, we should try to be a little bit like ARIES with full UNDO rollback, purely to avoid bloat and improve temporal locality -- concurrent inserters should not clobber the same heap pages.

I believe that something a little like FSM transaction rollback is just what we need here. A design that is inspired by other existing designs, including even the UNDO + rollback parts, which are pretty essential.
Transaction boundaries and ownership semantics for free space really do seem to matter when it comes to avoiding bloat. Roughly the same high level approach led to my developing bottom-up index deletion for Postgres 14 [5] -- the same "logical vs physical" tension exists in Postgres B-Tree indexes.

Using "logical database surrogates" in the physical database is actually a standard family of techniques used in DB systems design. This family of techniques has been underexploited within Postgres, probably because it isn't so natural in a world without UNDO style rollback. It is something to keep in your "bag of tricks" as a Postgres hacker (as Hellerstein once put it). I'll return to how we might actually do something a bit like UNDO style rollback in the FSM later on. It is counterintuitive, but stay with me.

Open and closed pages
---------------------

Note that other designs for FSM style structures (for a heap table access method) all seem to have fewer than 10 possible "page has this much free space" increments [6] -- the granularity is far coarser than our own FSM_CATEGORIES granularity (which has a massive 255 distinct increments of free space for each heap page).

One reason for this is that ignoring tiny differences avoids contention and fragmentation from chasing insignificant differences between heap blocks, which I don't need to go into again (my earlier amateur soccer team analogy is enough for now). But it's also related to the logical database/rollback stuff that I just went into. It sometimes makes sense to think of a physical heap page as more than just a data structure. Even things that we tend to naturally think of as strictly physical may need to be rethought just a little.

Take DB2's version of the FSM, FSIP [7].
This design usually doesn't ever end up inserting *any* new logical rows on a heap page after the page first fills -- it is initially "open" when first allocated, and then quickly becomes "closed" to inserts of new logical rows, once it crosses a certain almost-empty space threshold. The heap page usually stays "closed" forever. While the design does not *completely* ignore the possibility that the page won't eventually get so empty that reuse really does look like a good idea, it makes it rare. A heap page needs to have rather a lot of the original logical rows deleted before that becomes possible again. It's rare for it to go backwards because the threshold that makes a closed page become open again is much lower (i.e. involves much more free space) than the threshold that initially made a (newly allocated) heap page closed. This one-way stickiness quality is important.

As with simulated annealing algorithms, our algorithm has heap pages that naturally settle. Our algorithm makes it hard to change things once they become settled -- it has to be really worth it before we'll flip a page back to "open" again. There is a general bias against disturbing that which has become the settled state. (I think that the terms "open" and "closed" come from the world of online bin packing problems, where the same tension and uncertainties exist.)

This stickiness concept is called "hysteresis" by some DB researchers, often when discussing UNDO stuff [8]. Having *far less* granularity than FSM_CATEGORIES/255 seems essential to make that work as intended. Pages need to be able to settle without being disturbed by noise-level differences. That's all that super fine grained values buy you: more noise, more confusion.

Visibility map
--------------

If the logical database and natural locality are important to the FSM, then what about the visibility map? And what about the relationship between the FSM and the visibility map, in light of all this?
Currently VACUUM doesn't care about how its FSM behavior interacts with how it sets all-frozen/all-visible bits for the same heap page. To me this seems completely unreasonable -- they're *obviously* related! We're probably only gaining a minimal amount of free space on one occasion by ignoring the VM/FSM relationship, for which we pay a high cost. Worst of all we're *perpetuating the cycle* of dirtying and redirtying the same pages over time. Maybe we should go as far as merging the FSM and VM, even -- that seems like a natural consequence of logical-ish/qualitative definitions of "page is full".

Opportunities
-------------

Now more about what I think a new FSM design for Postgres *should* do. This is almost a description of an actual data structure. A new FSM. But not quite.

As we've seen, free space management in a true ARIES-style system (like the DB2 FSIP design) is forced to think of free space in terms of "leases", or bulk-allocated free lists of usually-contiguous pages. Under a true ARIES design, it is strictly necessary to track which deleter XID freed-up which individual lease of free space, lest some concurrent inserter reuse the space before it is truly safe to do so -- the deleter xact must commit before that reuse can be safely allowed. While delete + commit does seem mostly irrelevant to the Postgres FSM, insert + abort case handling seems to be something that we can learn from.

Maybe the same principles can be applied in Postgres, where we need leases to avoid leaks in the presence of errors (not necessarily rollback per se). We must be able to tolerate incorrect speculations about where free space will be needed next. Transactions must be given the opportunity to make use of the heap free space lease that they asked for -- but not forever. It's a balancing act.

Imagine if the FSM tracked recent free space requests that have been satisfied already.
Requesting backends will maintain this same metadata when they burn through their bulk allocation of reserved space/heap pages. This new structure would contain metadata like the requesting XID/backend for a given lease of heap pages, the number of leases during the last bulk extension operation, and maybe the number of concurrently waiting backends at that time. Now we have some sense of recent history, and start to recognize trends over time. The FSM is now able to *take back* some of the space that it gives out, based on new information. Now the FSM can afford to make mistakes because the mistakes can always be corrected a little later. The FSM can be generous based on speculation, heuristics, whatever it might be. It should be possible for the FSM to have it both ways, more or less.

Consider the bulk extension mechanism for concurrent inserters that was added by commit 719c84c1be once again. The mechanism's leader backend process performs the actual bulk heap relation extension, allocating up to 512 new heap pages all at once. Even when the leader backend ends up extending a heap relation by quite a few pages (hundreds), the leader is still unable to *directly* hand off free space to particular backends that drove the leader to extend aggressively in the first place.

It would be quite natural for the leader backend to say to each individual follower backend: thank you for waiting for me, here is a big contiguous range of heap pages (e.g., as many as 50 heap pages or more per xact when things truly ramp up), which should be enough to allow you to avoid going to the FSM again for quite a long time. But the leader backend doesn't do that at all -- it just puts all of the pages (up to 512 total) in the FSM. All of the waiting follower backends are left to senselessly fight it out for free space by simply going back to the FSM when they wake up again. Why can't they cooperate intelligently? What stops that today?
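The leader-to-follower handoff imagined above could be sketched as follows. This is hypothetical Python pseudocode standing in for what would really be C in hio.c; every name and the fixed even-split policy are my own invention, not anything that exists in Postgres today:

```python
# Hypothetical sketch: a leader backend splits one bulk relation
# extension into contiguous per-backend leases, instead of dumping all
# of the new pages into a shared FSM. All names are invented.
from dataclasses import dataclass

@dataclass
class Lease:
    backend_id: int
    start_block: int
    n_blocks: int

def hand_out_leases(first_new_block, n_new_blocks, waiting_backends):
    """Split a bulk extension into one contiguous lease per waiter."""
    per_backend = n_new_blocks // len(waiting_backends)
    leases, block = [], first_new_block
    for backend in waiting_backends:
        leases.append(Lease(backend, block, per_backend))
        block += per_backend
    return leases

# 512 new pages, 8 waiting backends: 64 contiguous pages each, so each
# backend can keep inserting related rows together for a long while.
leases = hand_out_leases(first_new_block=1000, n_new_blocks=512,
                         waiting_backends=list(range(8)))
print(len(leases), leases[0].n_blocks)  # 8 64
```

A real design would of course need the lease bookkeeping described above (requesting XID/backend per lease, reclaim on commit/abort) so that straggler backends can't leak whole leases.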
FSM "rollback"
--------------

(This is where my argument comes together, finally.)

FSM rollback in the FSIP paper is really just about reliable ownership semantics. Ownership of free space in heap pages that is scoped at the level of transactions. It seems to me that ownership of free space in heap pages by particular transactions is really all that the bulk extension FSM logic is missing -- and getting that right seems like a big part of fixing the FSM comprehensively.

The bulk extension mechanism cannot *trust* the backends to use more than a little space at a time today. While on average each backend's use of free space probably *is* predictable over time, it only takes one or two straggler backends per bulk operation to mess everything up -- that's enough for the system to end up with very poor space utilization. Since we don't have any sense of ownership of space in heap pages by transactions today, the code must hedge against stragglers/space leaks by making the space *generally* available to all. Of course, this hedging comes at a high cost: it prevents the whole system (heapam, FSM, and backends) from capturing naturally occurring locality in a variety of important cases. Including the important TPC-C case I mentioned.

Explicit scoped ownership of free space in heap pages removes the need to hedge. It addresses the problem of would-be space leaks directly, while *also* fixing the FSM locality problems implicitly. Metadata about recent requests gives us the *context* needed to do all this well. Sure, we don't need it for transaction rollback using ARIES style UNDO, but we can still use it for this.

Perhaps this can be pushed much further. Long running transactions ought to have as much time as they need to use the large lease of free space that they were provided.
But when the transaction commits or aborts, we might then make the FSM code much more aggressive about stealing space back from the backend that inherits the lease from the now-committed transaction. It all depends. Maybe we can teach VACUUM to eagerly clean up aborted transactions that are known to have consumed a lot of free space that turns out to not be needed for the inserted rows. It might also be feasible to make our new slimmed-down FSM crash-safe. The way that REDO routines for VACUUM handle FSM maintenance is pretty sloppy right now.

Bean counting
-------------

In general I find the idea that an algorithm can always make better decisions with more information dubious. Accounting for every tiny little fragment of free space ("bean counting") is definitely not an intrinsic good. Maybe keeping that information is not just wasteful -- maybe it's actually harmful. There is a real risk of overinterpreting noise-level differences [9]. I'd prefer to err in the direction of underfitting.

I now wonder if the FSM is fundamentally doing the wrong thing by keeping track of all "free space" in every page over time. Why wouldn't we just have a free space management strategy that is generally only concerned with recent events? If we find a way to make almost all newly allocated heap pages become "closed" quickly (maybe they're marked all-frozen quickly too), and make sure that that condition is sticky, then this can work out well. We may not even need to store explicit freespace information for most heap pages in the database -- a page being "closed" can be made implicit by the FSM and/or VM. Making heap pages age-out like this (and mostly stay that way over time) has obvious benefits in many different areas.

This won't help workloads like stock pgbench, which is pretty unrealistic anyway. But it needn't hurt them either.
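The coarse open/closed hysteresis I keep coming back to can be sketched in a few lines. The two thresholds below are invented for illustration (DB2's actual FSIP values differ); the point is only their asymmetry. For contrast, note that the current FSM's 255 nonzero categories each cover BLCKSZ/256 = 32 bytes of an 8kB page -- fine enough to register the noise-level differences described above:

```python
# Hypothetical sketch of FSIP-style page states with hysteresis.
# Thresholds are invented; only their asymmetry matters.
BLCKSZ = 8192
FSM_CATEGORIES = 256
assert BLCKSZ // FSM_CATEGORIES == 32  # bytes per FSM category step today

CLOSE_AT = 0.10   # close once less than 10% of the page remains free
REOPEN_AT = 0.50  # reopen only once at least 50% is free again

class PageState:
    def __init__(self):
        self.open = True  # newly allocated pages accept new rows

    def update(self, free_fraction):
        if self.open and free_fraction < CLOSE_AT:
            self.open = False  # settles: new rows stop landing here
        elif not self.open and free_fraction >= REOPEN_AT:
            self.open = True   # rare: most original rows were deleted

page = PageState()
page.update(0.05)  # page fills up: closes
page.update(0.30)  # some deletes, but not enough: stays closed
print(page.open)   # False
```

The gap between the two thresholds is what lets pages settle: a page that has closed is not reopened by a deletion or two, only by losing most of its original rows.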
[1] https://postgr.es/m/CAH2-WzkEP6wz2Eg7mgKbF-qTPi21+BWrJgNm0qfU5kf0iJBV2g@mail.gmail.com
[2] https://postgr.es/m/0265f9e2-3e32-e67d-f106-8abde596c0e4@commandprompt.com
[3] https://github.com/wieck/benchmarksql/commit/6ac326ad23d55de2edc18bfbffb10b21f1b39b48
[4] https://github.com/wieck/benchmarksql/commit/0830c5f526061529eb2b45831c544539caa65435
[5] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d168b666823b6e0bcf60ed19ce24fb5fb91b8ccf
[6] https://dl.acm.org/doi/abs/10.1145/235968.233355
[7] https://www.semanticscholar.org/paper/Algorithms-for-Flexible-Space-Management-in-Systems-Mohan-Haderle/9db1a8104daade31949b6399cac9169751fa1605
[8] https://queue.acm.org/detail.cfm?id=3099561
[9] https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff#Approaches

--
Peter Geoghegan
On Mon, Aug 16, 2021 at 10:35:45AM -0700, Peter Geoghegan wrote:
> I have suspected that there are serious design issues with the FSM
> (and related heapam code like hio.c) for several years now [1]. This
> has a lot to do with the experience of using Jan Wieck's BenchmarkSQL
> implementation of TPC-C [2][3][4]. It clearly makes Postgres exhibit
> pathological performance issues, especially heap fragmentation.

OK, here is my feedback. First, I understand the space reservation/soccer idea, but I am also concerned that adding space reservation could improve some things and make others worse.

Second, let's look at a concrete example, and see how it can be improved more simply. As I understand it, there are three operations that need free space:

1. INSERT/COPY
2. UPDATE where there is insufficient space on the page of the old row
3. UPDATE where there is sufficient page space

The third option already can use the existing page for a HOT update, so the FSM doesn't matter.

For 1 & 2, suppose you have a table with 10 8k pages, all 80% full. As I understand it, the first page will go from 80% to 81%, then the next row addition will take the second page from 80% to 81%, until all pages are 81%, and then it starts over again. Is that accurate? Is that the non-temporal memory issue you are concerned about? Doesn't spreading the new rows out increase the ability to do HOT updates?

--
Bruce Momjian <bruce@momjian.us>  https://momjian.us
EDB  https://enterprisedb.com

If only the physical world exists, free will is an illusion.
On Mon, Aug 16, 2021 at 3:49 PM Bruce Momjian <bruce@momjian.us> wrote:
> OK, here is my feedback. First, I understand the space
> reservation/soccer idea, but I am also concerned that adding space
> reservation could improve some things and make others worse.

That is definitely a legitimate concern. There is a trade-off here: if we do too much preallocation, there may be cases where the preallocated space that we expected to be used quickly is never used by anybody.

But that doesn't seem like a problem, provided we don't lose track of the fact that it happened. Then the worst that can happen is that we have empty pages that nobody will ever use, because nobody inserts into the table ever again (not the backend with the leaked space lease, not anybody else). That seems manageable. We can add some kind of ramp-up behavior that gets more and more aggressive as demand for new space from backends is steady or increasing.

> Second, let's look at a concrete example, and see how it can be improved
> more simply. As I understand it, there are three operations that need
> free space:
>
> 1. INSERT/COPY
> 2. UPDATE where there is insufficient space on the page of the old row
> 3. UPDATE where there is sufficient page space

Right.

> The third option already can use the existing page for a HOT update, so
> the FSM doesn't matter.

I agree. All good questions. Thank you for diving in on this.

> For 1 & 2, suppose you have a table with 10 8k
> pages, all 80% full. As I understand it, the first page will go from
> 80% to 81%, then the next row addition will take the second page from
> 80% to 81%, until all pages are 81%, and then it starts over again. Is
> that accurate?

No. Generally backends have their own target block (a simple BlockNumber) that they store in the relcache, one per table recently modified. Backends/heapam uses RelationGetTargetBlock() to access this local cache. So there is "backend affinity" for particular individual heap pages, even today.
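To make the locality difference concrete, here is a toy simulation. This is hypothetical Python, nothing like the real heapam code, with an invented tiny page capacity; it contrasts concurrent inserters that all chase the same current page with inserters that each hold a contiguous lease of pages:

```python
# Toy model: rows are tagged with the backend that inserted them, so we
# can count how many heap pages end up mixing rows from different
# backends (i.e. mixing unrelated logical rows).
PAGE_CAPACITY = 4  # rows per heap page, kept tiny for illustration

def interleaved_placement(n_backends, rows_per_backend):
    """Every backend inserts into the same 'current' page, FSM-style."""
    pages, current = [[]], 0
    for _row in range(rows_per_backend):
        for backend in range(n_backends):
            if len(pages[current]) == PAGE_CAPACITY:
                pages.append([])
                current += 1
            pages[current].append(backend)
    return pages

def leased_placement(n_backends, rows_per_backend):
    """Each backend fills its own lease of contiguous pages."""
    pages = []
    for backend in range(n_backends):
        page = []
        for _row in range(rows_per_backend):
            if len(page) == PAGE_CAPACITY:
                pages.append(page)
                page = []
            page.append(backend)
        pages.append(page)
    return pages

def mixed_pages(pages):
    """Count pages holding rows from more than one backend."""
    return sum(1 for p in pages if len(set(p)) > 1)

print(mixed_pages(interleaved_placement(4, 8)))  # 8 -- every page is mixed
print(mixed_pages(leased_placement(4, 8)))       # 0 -- no page is shared
```

Both strategies use the same number of pages here; only the grouping of related rows differs, which is exactly the property that matters later, when related rows are updated and vacuumed together.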
However, that still has many problems. It doesn't make sense to have a local cache for a shared resource -- that's the problem. You actually need some kind of basic locking or lease system, so that 10 backends don't all decide at the same time that one particular heap block is fully empty, and therefore a good target block for that backend alone. It's as if the backends believe that they're special little snowflakes, and that no other backend could possibly be thinking the same thing at the same time about the same heap page. And so when TPC-C does its initial bulk insert, distinct orders are already shuffled together in a hodge-podge, just because concurrent bulk inserters all insert on the same heap pages.

If we could safely give out 10 or 50 pages directly to each backend, then clearly the related data would be stored together in this scenario -- each backend would be able to work through its lease of contiguous heap pages for quite a while before revisiting the FSM/relation extension. The trick is to teach the implementation to do that without allowing the backend to permanently leak whole entire leases with maybe dozens of free pages.

Systems like DB2 and Oracle probably can't have this problem. Even the bulk lease case is just an extension of something they had to do anyway. You see, some kind of FSM lease system that knows about transaction lifetime for any given piece of free space is strictly necessary with UNDO based rollback. Without that, transaction rollback breaks in certain edge cases involving concurrent inserts into the same page, right before a rollback that needs to put an updated version back in place. If the updated version from undo happens to be physically larger than the now-aborted successor version, and if there is little or no space left to cover that, what can rollback do about it? Nothing. It's a nasty bug. They use heavyweight locks to avoid this.

> Is that the non-temporal memory issue you are concerned
> about?
When I talk about memory or time, what I'm usually referring to is the ability to manage larger allocations of multiple blocks for a while after they're originally requested. So that a requesting transaction/backend is given the opportunity to use the space that they asked for. They shouldn't be completely trusted. We should trust but verify.

> Doesn't spreading the new rows out increase the ability to do
> HOT updates?

It's true that the problem scenario with TPC-C does involve updates, and it's true that that's when things really go badly. But I deliberately haven't gone into the role that the updates play there too much. Not just yet. It's important to note that the problems really do start when we insert, even if that's not obvious -- that's when the locality is lost, right when the original order transaction comes in.

We lose locality just because we have concurrent inserters into the same table, where each transaction inserts multiple related rows. We fail to group related rows together right from the start, setting us up for failure later on. This is actually quite simple -- the chaos that follows is where it gets truly complicated (and not necessarily worth getting into just yet). The update of a given order and its order lines entries takes place hours later, within a delivery transaction.

Another concern is knock on effects *after* any initial problematic updates -- that's certainly not where it ends. Every action has consequences, and these consequences themselves have consequences. By migrating a row from an earlier block into a later block (both physically and temporally early/late), we're not just changing things for that original order row or rows (the order non-HOT-updated by the delivery transaction). Since the old/original version left behind by a delivery transaction will itself get vacuumed eventually, that space goes in the FSM eventually. And so this same block will get some entirely unrelated version (could be a new insert or update).
It's systematic. We dare not waste a tiny amount of free space. Surely we should cut our losses at some point and accept that we're "wasting space", but there is never any natural backpressure. Nothing to break the cycle of fragmentation. (This is another thing involving time, not quite related to my central point about free space leases.)

--
Peter Geoghegan
On Mon, Aug 16, 2021 at 5:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Another concern is knock on effects *after* any initial problematic
> updates -- that's certainly not where it ends. Every action has
> consequences, and these consequences themselves have consequences. By
> migrating a row from an earlier block into a later block (both
> physically and temporally early/late), we're not just changing things
> for that original order row or rows (the order non-HOT-updated by the
> delivery transaction).

To be clear, TPC-C looks like this: each order row and each order line row will be inserted just once, and then later updated just once. But that's it, forever -- no more modifications. Both tables grow and grow for as long as the benchmark runs.

Users with workloads like this will naturally expect that performance will be steady over time, even over days or weeks. But that's not what we see. What we actually see is that the FSM can never quite resist the temptation to dirty older pages that should be aging out. And so performance degrades over days and weeks -- that's how long Jan has said that it can take.

The FSM does have a bias in favor of using earlier pages over later pages, in order to make relation truncation by VACUUM more likely in general. That bias certainly isn't helping us here, and might be another thing that hurts us. I think that the fundamental problem is that the FSM just doesn't have any way of recognizing that its behavior is penny wise, pound foolish.

I don't believe that there is any fundamental reason why Postgres can't have *stable* long term performance for this workload in the way that it's really expected to. That seems possible within the confines of the existing design for HOT and VACUUM, which already work very well for the first few hours.

--
Peter Geoghegan
Hi,

On 2021-08-16 10:35:45 -0700, Peter Geoghegan wrote:
> Problems
> --------
>
> The FSM gives out space without considering the passage of time, or
> the likelihood that a particular transaction or client will consume
> its requested free space in a more or less predictable and steady way
> over time. It has no memory. The FSM therefore fails to capture
> naturally occurring locality, or at least doesn't specifically care
> about it. This is the central problem, that all other problems with
> the FSM seem to stem from in one way or another.

I suspect that one other fairly fundamental issue is that we are very reticent about updating the FSM with information about free space. As long as that's the case, a smarter placement logic would often not have a better place to choose from.

> Background
> ----------
>
> To recap, the FSM tracks how much free space is in every heap page.
> Most FSM maintenance takes place during VACUUM, though ordinary
> connection backends occasionally help to keep the FSM current, via
> calls to RecordAndGetPageWithFreeSpace() made from hio.c.
>
> There is also a bulk extension mechanism added by commit 719c84c1be,
> which is used when we detect multiple concurrent inserters. This bulk
> extension mechanism does change things a little (it gives some thought
> to systematic effects), though it still has the same basic issues: it
> doesn't do nearly enough to group likely-related rows together. I
> suspect that improving the bulk extension mechanism will be an
> important part of improving the overall picture.

I think the design of the current bulk extension mechanism is quite backward (it only takes contention into account, not other needs for bulk work; it does expensive stuff like finding victim buffers under lock; it doesn't release locks until all blocks are extended; ...). But: I don't think a bulk extension mechanism should be tasked with doing grouping or anything like that.

> Take DB2's version of the FSM, FSIP [7].
> This design usually doesn't
> ever end up inserting *any* new logical rows on a heap page after the
> page first fills -- it is initially "open" when first allocated, and
> then quickly becomes "closed" to inserts of new logical rows, once it
> crosses a certain almost-empty space threshold. The heap page usually
> stays "closed" forever. While the design does not *completely* ignore
> the possibility that the page won't eventually get so empty that reuse
> really does look like a good idea, it makes it rare. A heap page needs
> to have rather a lot of the original logical rows deleted before that
> becomes possible again. It's rare for it to go backwards because the
> threshold that makes a closed page become open again is much lower
> (i.e. involves much more free space) than the threshold that initially
> made a (newly allocated) heap page closed. This one-way stickiness
> quality is important.

I suspect that we'd get a *LOT* of complaints if we introduced that degree of stickiness.

> This stickiness concept is called "hysteresis" by some DB researchers,
> often when discussing UNDO stuff [8]. Having *far less* granularity
> than FSM_CATEGORIES/255 seems essential to make that work as intended.
> Pages need to be able to settle without being disturbed by noise-level
> differences. That's all that super fine grained values buy you: more
> noise, more confusion.

Why is that directly tied to FSM_CATEGORIES? ISTM that there's ways to benefit from a fairly granular FSM_CATEGORIES, while still achieving better locality. You could e.g. search for free space for an out-of-page update in nearby pages with a lower free-percentage threshold, while having a different threshold and selection criteria for placement of inserts.

> Visibility map
> --------------
>
> If the logical database and natural locality are important to the FSM,
> then what about the visibility map? And what about the relationship
> between the FSM and the visibility map, in light of all this?
> > Currently VACUUM doesn't care about how its FSM behavior interacts > with how it sets all-frozen/all-visible bits for the same heap page. > To me this seems completely unreasonable -- they're *obviously* > related! We're probably only gaining a minimal amount of free space on > one occasion by ignoring the VM/FSM relationship, for which we pay a > high cost. Worst of all we're *perpetuating the cycle* of dirtying and > redirtying the same pages over time. Maybe we should go as far as > merging the FSM and VM, even -- that seems like a natural consequence > of logical-ish/qualitative definitions of "page is full". The density of the VM is fairly crucial for efficient IOS, so I'm doubtful merging FSM and VM is a good direction to go to. Additionally we need to make sure that the VM is accurately durable, which isn't the case for the FSM (which I think we should use to maintain it more accurately). But perhaps I'm just missing something here. What is the strong interlink between FSM and all-frozen/all-visible you see on the side of VACUUM? ISTM that the main linkage is on the side of inserts/update that choose a target page from the FSM. Isn't it better to make that side smarter? > Imagine if the FSM tracked recent free space requests that have been > satisfied already. Requesting backends will maintain this same > metadata when they burn through their bulk allocation of reserved > space/heap pages. This new structure would contain metadata like the > requesting XID/backend for a given lease of heap pages, the number of > leases during the last bulk extension operation, and maybe the number > of concurrently waiting backends at that time. Now we have some sense > of recent history, and start to recognize trends over time. The FSM is > now able to *take back* some of the space that it gives out, based on > new information. Now the FSM can afford to make mistakes because the > mistakes can always be corrected a little later. 
> The FSM can be generous based on speculation, heuristics, whatever it
> might be. It should be possible for the FSM to have it both ways, more
> or less.

Hm. To me this seems like a somewhat separate layer from the FSM. Having
that kind of data go through shared buffers -- with a need to worry about
on-disk consistency/cleanup etc., the inability to add granular locks on
a sub-page level, the need to go through buffer mapping / pinning -- all
seems like something you wouldn't actually want.

To me this seems like it'd be better addressed by a shared,
per-relfilenode, in-memory datastructure. Thomas Munro has been working
on keeping accurate per-relfilenode relation size information. ISTM that
that'd be a better place to hook in for this.

> would be quite natural for the leader backend to say to each individual
> follower backend: thank you for waiting for me, here is a big contiguous
> range of heap pages (e.g., as many as 50 heap pages or more per xact when
> things truly ramp up), which should be enough to allow you to avoid going
> to the FSM again for quite a long time.

That'd end up badly if we did it for a relation that's hotly inserted
into by a lot of connections, but where each connection only inserts a
few rows. I suspect we'd either need information from the inserting
backends about the number of pages they're likely going to need (e.g. a
smarter "extension queue" where each entry lists the number of pages
requested) and/or stats about the number of consecutive insertions a
relation typically gets.

> But the leader backend doesn't do that at all -- it just puts all of
> the pages (up to 512 total) in the FSM. All of the waiting follower
> backends are left to senselessly fight it out for free space by simply
> going back to the FSM when they wake up again. Why can't they
> cooperate intelligently? What stops that today?

The lack of suitable in-memory coordination facilities.
It turns out that I looked a bunch at this area because of AIO + DIO:
with DIO the typical Linux filesystems (ext4, xfs) do not perform any
delayed allocation, which means that non-bulk extension leads to
*horrible* filesystem fragmentation. And even the bulk paths don't
allocate enough blocks at once to avoid fragmentation with common
amounts of concurrency.

But just increasing the degree of bulk extension isn't helpful -- it
just exacerbates already bad thundering herd issues: all backends will
decide to try to extend at the same time, all but one will wait, one
backend will extend and then wake up the rest, and they all then contend
for the same locks, etc. I addressed (or rather sidestepped) this to a
small degree in the AIO+DIO patchset, by making extension faster and
doing a few improvements around locking. But it's not good enough.

If the relation extension infrastructure instead had information about
the number of blocks each waiter wants, and we had Thomas' "accurate
shared relation size" infrastructure, we could do a lot better:

1) Whenever a backend needs new space and most of the previously
   bulk-extended space is used, opportunistically try to bulk-extend. In
   the happy path this avoids the thundering herd issue, because other
   backends can continue inserting while extension is in progress.

2) Whenever bulk extending, release pages to individual backends by
   looking in the queue of backends needing extension, and wake up
   backends individually whenever we've bulk extended sufficiently for
   that backend.

This provides two main benefits: First, it avoids all backends trying to
acquire exactly the same resources just after waking up, without being
able to benefit (because the first backend will often just fill the
page). Second, it improves average latency during extension
substantially, because individual backends can continue as soon as bulk
extension has progressed enough for them, rather than needing to wait
for the entire extension.
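[Editor's note: the two-part scheme above -- an extension queue recording how many blocks each waiter wants, with waiters woken individually as their portion of the bulk extension becomes available -- can be sketched as a toy simulation. All names are invented for illustration; this is not PostgreSQL code, and a real implementation would do the extension and wake-ups incrementally under a lock.]

```python
from collections import deque

def bulk_extend(rel_size, waiters):
    """waiters: deque of (backend_id, nblocks_wanted) entries, in queue order.

    Returns (new_rel_size, grants), where grants maps each backend to the
    contiguous block range it was handed. In a real system each backend
    would be woken as soon as next_block covers its range, rather than
    after the whole extension completes; here we just record the ranges
    in wake-up order to show the bookkeeping."""
    total = sum(n for _, n in waiters)
    grants = {}
    next_block = rel_size
    for backend, n in waiters:
        # Grant a contiguous, exclusively-owned range to this waiter.
        grants[backend] = range(next_block, next_block + n)
        next_block += n
    return rel_size + total, grants

# Three waiters queue up with different demands; one bulk extension
# satisfies all of them with non-overlapping contiguous ranges.
size, grants = bulk_extend(100, deque([(1, 4), (2, 1), (3, 8)]))
assert size == 113
assert list(grants[1]) == [100, 101, 102, 103]
assert list(grants[2]) == [104]
assert list(grants[3]) == list(range(105, 113))
```

Because each range is disjoint, the woken backends never chase the same block, which is the first benefit described above.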
> FSM "rollback"
> --------------
>
> (This is where my argument comes together, finally.)
>
> FSM rollback in the FSIP paper is really just about reliable ownership
> semantics. Ownership of free space in heap pages that is scoped at the
> level of transactions.
>
> It seems to me that ownership of free space in heap pages by
> particular transactions is really all that the bulk extension FSM
> logic is missing -- and getting that right seems like a big part of
> fixing the FSM comprehensively. The bulk extension mechanism cannot
> *trust* the backends to use more than a little space at a time today.
> While on average each backend's use of free space probably *is*
> predictable over time, it only takes one or two straggler backends per
> bulk operation to mess everything up -- that's enough for the system
> to end up with very poor space utilization. Since we don't have any
> sense of ownership of space in heap pages by transactions today, the
> code must hedge against stragglers/space leaks by making the space
> *generally* available to all. Of course, this hedging comes at a high
> cost: it prevents the whole system (heapam, FSM, and backends) from
> capturing naturally occurring locality in a variety of important
> cases. Including the important TPC-C case I mentioned.

I think the fundamental issue is more that bulk extension shouldn't
actually get pages from the FSM at all. That's just an ugly
implementation detail of the hack we have currently, rather than
something that should live on. We may want to decide to continue
inserting bulk-extended pages into the FSM to handle edge cases of
backends erroring out before space is used, but the happy path should
not touch the FSM at all.

> Explicit scoped ownership of free space in heap pages removes the need
> to hedge. It addresses the problem of would-be space leaks directly,
> while *also* fixing the FSM locality problems implicitly.
Metadata > about recent requests gives us the *context* needed to do all this > well. It sounds like you're thinking of using some global history, or at least a backend local history? I think it might be better to have backend local, per-command state. E.g. for a COPY aggressively increase the amount of space reserved, but not for a single-row insert. > I now wonder if the FSM is fundamentally doing the wrong thing by > keeping track of all "free space" in every page over time. Why > wouldn't we just have a free space management strategy that is > generally only concerned with recent events? I guess it depends on what you define as recent, but I don't see a way to bound the timeframe usefully. We need to deal with cases where table sizes go up and down over time, e.g. because there's periodic batch activity and ongoing activity. What's the time bound for unused space somewhere in a relation? I don't see any, unless we add compaction logic to vacuum. Greetings, Andres Freund
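[Editor's note: the FSIP-style "open"/"closed" stickiness discussed upthread can be sketched as a tiny state machine with asymmetric thresholds. The specific thresholds below are invented for illustration -- they are not taken from DB2, the FSIP paper, or any patch; the point is only the wide gap between the close and reopen thresholds, which is what stops pages from flapping on noise-level changes in free space.]

```python
PAGE_SIZE = 8192
CLOSE_AT_FREE = int(PAGE_SIZE * 0.10)   # close once < 10% free remains
REOPEN_AT_FREE = int(PAGE_SIZE * 0.70)  # reopen only when > 70% free again

class PageState:
    """Tracks whether a heap page accepts inserts of new logical rows."""

    def __init__(self):
        self.free = PAGE_SIZE
        self.is_open = True              # newly allocated pages start open

    def note_free_space(self, free_bytes):
        """Update open/closed state after inserts or deletes on the page."""
        self.free = free_bytes
        if self.is_open and self.free < CLOSE_AT_FREE:
            self.is_open = False         # the common, effectively one-way path
        elif not self.is_open and self.free > REOPEN_AT_FREE:
            self.is_open = True          # rare: most original rows were deleted

page = PageState()
page.note_free_space(500)    # nearly full -> page closes
assert not page.is_open
page.note_free_space(3000)   # some deletes, but not enough -> stays closed
assert not page.is_open
page.note_free_space(6500)   # bulk of the rows are gone -> page reopens
assert page.is_open
```

The middle step is the hysteresis in action: a page that loses a few rows does not immediately become an insertion target again.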
On Tue, Aug 17, 2021 at 9:18 AM Andres Freund <andres@anarazel.de> wrote: > > Take DB2's version of the FSM, FSIP [7]. This design usually doesn't > > ever end up inserting *any* new logical rows on a heap page after the > > page first fills -- it is initially "open" when first allocated, and > > then quickly becomes "closed" to inserts of new logical rows, once it > > crosses a certain almost-empty space threshold. The heap page usually > > stays "closed" forever. While the design does not *completely* ignore > > the possibility that the page won't eventually get so empty that reuse > > really does look like a good idea, it makes it rare. A heap page needs > > to have rather a lot of the original logical rows deleted before that > > becomes possible again. It's rare for it to go backwards because the > > threshold that makes a closed page become open again is much lower > > (i.e. involves much more free space) than the threshold that initially > > made a (newly allocated) heap page closed. This one-way stickiness > > quality is important. > > I suspect that we'd get a *LOT* of complaints if we introduced that degree of > stickiness. I don't know what the right degree of stickiness is, but I can easily believe that we want to have more than none. Part of Peter's point, at least as I understand it, is that if a page has 100 tuples and you delete 1 or 2 or 3 of them, it is not smart to fill up the space thus created with 1 or 2 or 3 new tuples. You should instead hope that space will optimize future HOT updates, or else wait for the page to lose some larger number of tuples and then fill up all the space at the same time with tuples that are, hopefully, related to each other in some useful way. Now what's the threshold? 20 out of 100? 50? 80? That's not really clear to me. But it's probably not 1 out of 100. > > This stickiness concept is called "hysteresis" by some DB researchers, > > often when discussing UNDO stuff [8]. 
Having *far less* granularity > > than FSM_CATEGORIES/255 seems essential to make that work as intended. > > Pages need to be able to settle without being disturbed by noise-level > > differences. That's all that super fine grained values buy you: more > > noise, more confusion. > > Why is that directly tied to FSM_CATEGORIES? ISTM that there's ways to benefit > from a fairly granular FSM_CATEGORIES, while still achieving better > locality. You could e.g. search for free space for an out-of-page update in > nearby pages with a lower free-percentage threshold, while having a different > threshold and selection criteria for placement of inserts. Yeah, I don't think that reducing FSM_CATEGORIES is likely to have much value for its own sake. But it might be useful as a way of accomplishing some other goal. For example if we decided that we need a bit per page to track "open" vs. "closed" status or something of that sort, I don't think we'd lose much by having fewer categories. > The density of the VM is fairly crucial for efficient IOS, so I'm doubtful > merging FSM and VM is a good direction to go to. Additionally we need to make > sure that the VM is accurately durable, which isn't the case for the FSM > (which I think we should use to maintain it more accurately). > > But perhaps I'm just missing something here. What is the strong interlink > between FSM and all-frozen/all-visible you see on the side of VACUUM? ISTM > that the main linkage is on the side of inserts/update that choose a target > page from the FSM. Isn't it better to make that side smarter? I don't have a well-formed opinion on this point yet. I am also skeptical of the idea of merging the FSM and VM, because it seems to me that which pages are all-visible and which pages are full are two different things. However, there's something to Peter's point too. 
An empty page is all-visible but still valid as an insertion target, but this is not as much of a contradiction to the idea of merging them as it might seem, because marking the empty page as all-visible is not nearly as useful as marking a full page all-visible. An index-only scan can't care about the all-visible status of a page that doesn't contain any tuples, and we're also likely to make the page non-empty in the near future, in which case the work we did to set the all-visible bit and log the change was wasted. The only thing we're accomplishing by making the page all-visible is letting VACUUM skip it, which will work out to a win if it stays empty for multiple VACUUM cycles, but not otherwise. Conceptually, you might think of the merged data structure as a "quiescence map." Pages that aren't being actively updated are probably all-visible and are probably not great insertion targets. Those that are being actively updated are probably not all-visible and have better chances of being good insertion targets. However, there are a lot of gremlins buried in the word "probably." A page could get full but still not be all-visible for a long time, due to long-running transactions, and it doesn't seem likely to me that we can just ignore that possibility. Nor on the other hand does the difference between an old-but-mostly-full page and an old-but-mostly-empty page seem like something we just want to ignore. So I don't know how a merged data structure would actually end up being an improvement, exactly. > To me this seems like it'd be better addressed by a shared, per-relfilenode, > in-memory datastructure. Thomas Munro has been working on keeping accurate > per-relfilenode relation size information. ISTM that that that'd be a better > place to hook in for this. +1. I had this same thought reading Peter's email. 
I'm not sure if it makes sense to hook that into Thomas's work, but I think it makes a ton of sense to use shared memory to coordinate transient state like "hey, right now I'm inserting into this block, you guys leave it alone" while using the disk for durable state like "here's how much space this block has left." -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Aug 16, 2021 at 05:15:36PM -0700, Peter Geoghegan wrote:
> It doesn't make sense to have a local cache for a shared resource --
> that's the problem. You actually need some kind of basic locking or
> lease system, so that 10 backends don't all decide at the same time
> that one particular heap block is fully empty, and therefore a good
> target block for that backend alone. It's as if the backends believe
> that they're special little snowflakes, and that no other backend
> could possibly be thinking the same thing at the same time about the
> same heap page. And so when TPC-C does its initial bulk insert,
> distinct orders are already shuffled together in a hodge-podge, just
> because concurrent bulk inserters all insert on the same heap pages.

OK, I am trying to think of something simple we could test to see the
benefit, with few downsides. I assume the case you are considering is a
table of ten 8kB pages, where one page is 80% full and the others are
81% full, and if several backends start adding rows at the same time,
they will all choose the 80%-full page.

What if we change how we select pages with this:

1. find the page with the most free space
2. find all pages with up to 10% less free space than page #1
3. count the number of pages in #2
4. compute the proc_id modulus of the count from step #3, and use that
   offset into the list from step #2

For example:

1. page with the most free space is 95% free
2. pages 2, 4, 6, 8, 10 have between 86%-95% free
3. five pages
4. proc id 14293 % 5 = 3, so use the page at offset 3 in that list,
   page 8

This should spread out page usage to be more even, but still favor pages
with more free space. Yes, this is simplistic, but it would seem to have
few downsides and I would be interested to see how much it helps.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.
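[Editor's note: the four-step selection heuristic above translates almost directly into code. A hedged sketch, with invented names and data -- pages are (page_number, free_fraction) pairs, proc_id stands in for the backend's process id, and offsets are zero-based:]

```python
def choose_page(pages, proc_id, slack=0.10):
    """Pick an insertion target, spreading backends across the pages
    that have roughly the most free space."""
    # 1. find the page with the most free space
    best_free = max(free for _, free in pages)
    # 2. find all pages with up to `slack` less free space than that page
    candidates = [pno for pno, free in pages if free >= best_free - slack]
    # 3./4. use proc_id modulo the candidate count as an offset
    return candidates[proc_id % len(candidates)]

pages = [(1, 0.50), (2, 0.95), (3, 0.40), (4, 0.90),
         (5, 0.20), (6, 0.92), (7, 0.10), (8, 0.88),
         (9, 0.30), (10, 0.86)]
# Candidates are pages 2, 4, 6, 8, 10 (all >= 0.85 free). A backend with
# proc id 7 lands on offset 7 % 5 = 2, i.e. page 6; a different proc id
# lands on a different candidate, spreading out concurrent inserters.
assert choose_page(pages, 7) == 6
assert choose_page(pages, 8) == 8
```

Distinct proc ids map to distinct candidates, which is the whole point: concurrent inserters stop converging on the single emptiest page.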
On Mon, Aug 16, 2021 at 1:36 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> Open and closed pages
> ---------------------
> This stickiness concept is called "hysteresis" by some DB researchers,
> often when discussing UNDO stuff [8]. Having *far less* granularity
> than FSM_CATEGORIES/255 seems essential to make that work as intended.
> Pages need to be able to settle without being disturbed by noise-level
> differences. That's all that super fine grained values buy you: more
> noise, more confusion.
I'm not sure it's essential to have "far" fewer categories, if the closed-to-open transition is made less granular through some other mechanism. We can certainly get by with fewer categories, freeing up bits -- it seems we'd need at least one bit to track a block's open-close state.
> Visibility map
> --------------
>
> If the logical database and natural locality are important to the FSM,
> then what about the visibility map? And what about the relationship
> between the FSM and the visibility map, in light of all this?
>
> Currently VACUUM doesn't care about how its FSM behavior interacts
> with how it sets all-frozen/all-visible bits for the same heap page.
> To me this seems completely unreasonable -- they're *obviously*
> related! We're probably only gaining a minimal amount of free space on
> one occasion by ignoring the VM/FSM relationship, for which we pay a
> high cost. Worst of all we're *perpetuating the cycle* of dirtying and
> redirtying the same pages over time. Maybe we should go as far as
> merging the FSM and VM, even -- that seems like a natural consequence
> of logical-ish/qualitative definitions of "page is full".
> [...]
> I now wonder if the FSM is fundamentally doing the wrong thing by
> keeping track of all "free space" in every page over time. Why
> wouldn't we just have a free space management strategy that is
> generally only concerned with recent events? If we find a way to make
> almost all newly allocated heap pages become "closed" quickly (maybe
> they're marked all-frozen quickly too), and make sure that that
> condition is sticky, then this can work out well. We may not even need
> to store explicit freespace information for most heap pages in the
> database -- a page being "closed" can be made implicit by the FSM
> and/or VM. Making heap pages age-out like this (and mostly stay that
> way over time) has obvious benefits in many different areas.
The second paragraph here is an interesting idea and makes a great deal of sense. It would lead to smaller FSMs that are navigated more quickly and locked for shorter durations.
Implicit "closure" seems riskier in my view if you want to bring VM qualities into it, however. Currently, setting an all-visible or all-frozen flag must be correct and crash-safe, but clearing those is just a lost optimization. If either of those qualities is implicit by lack of reference, it seems more vulnerable to bugs.
On Tue, Aug 17, 2021 at 12:48 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Aug 17, 2021 at 9:18 AM Andres Freund <andres@anarazel.de> wrote:
>
> > To me this seems like it'd be better addressed by a shared, per-relfilenode,
> > in-memory datastructure. Thomas Munro has been working on keeping accurate
> > per-relfilenode relation size information. ISTM that that that'd be a better
> > place to hook in for this.
>
> +1. I had this same thought reading Peter's email. I'm not sure if it
> makes sense to hook that into Thomas's work, but I think it makes a
> ton of sense to use shared memory to coordinate transient state like
> "hey, right now I'm inserting into this block, you guys leave it
> alone" while using the disk for durable state like "here's how much
> space this block has left."
This makes sense as well. Shared memory for more recent / highly contended state, and disk for less recent / less contended / stickier state. This also might have the advantage of smaller, more focused projects from a coding standpoint.
--
John Naylor
EDB: http://www.enterprisedb.com
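[Editor's note: the bit accounting implied above -- fewer free space categories in exchange for an explicit open/closed flag -- can be illustrated by packing both into the single byte the FSM already stores per page. The encoding is purely hypothetical: PostgreSQL's FSM_CATEGORIES is currently 255, and nothing below is proposed code; the bucket boundaries are arbitrary.]

```python
PAGE_SIZE = 8192

def encode(free_bytes, is_open):
    """Pack a coarse 3-bit free space category (8 buckets) and a 1-bit
    open/closed flag into one byte, leaving 4 bits spare."""
    category = min(free_bytes * 8 // (PAGE_SIZE + 1), 7)  # 0..7
    return (int(is_open) << 3) | category

def decode(byte):
    """Return (category, is_open) from the packed byte."""
    return byte & 0x7, bool(byte >> 3)

b = encode(5000, True)
cat, is_open = decode(b)
assert is_open and cat == 4          # ~61% free falls in bucket 4
assert decode(encode(100, False)) == (0, False)
```

With only 8 buckets, small insertions and deletions rarely cross a bucket boundary, so the externally stored value (and hence the page it lives on) is dirtied far less often.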
On Tue, Aug 17, 2021 at 6:18 AM Andres Freund <andres@anarazel.de> wrote: > I suspect that one other fairly fundamental issue is that we are very > reticent updating the FSM with information about free space. As long as > that's the case a smarter placement logic would often not have a better > place to choose from. I think so too. But that is made a lot harder by the supposed need to represent the amount of freespace in FSM_CATEGORIES (255) distinct increments. As opposed to maybe 4 distinct increments per page, like in other systems. You only have to update the externally stored metadata when a threshold is crossed. (This isn't the main reason why I don't like that; more on that later.) > I think the design of the current bulk extension mechanism is quite backward > (it only takes contention into account not other needs for bulk work, it does > expensive stuff like finding victim buffers under lock, it doesn't release > locks until all blocks are extended, ...). But: I don't think a bulk > extension mechanism should be tasked with doing grouping or anything like > that. Maybe I'm just drawing the boundaries differently in my head, which doesn't seem like a real difference of opinion. > > Take DB2's version of the FSM, FSIP [7]. This design usually doesn't > > ever end up inserting *any* new logical rows on a heap page after the > > page first fills -- it is initially "open" when first allocated, and > > then quickly becomes "closed" to inserts of new logical rows, once it > > crosses a certain almost-empty space threshold. > I suspect that we'd get a *LOT* of complaints if we introduced that degree of > stickiness. Maybe. It would be very different to our current behavior, which would undoubtedly have many consequences. Maybe this will be a big problem, maybe it barely matters. I make no specific claims about it, either way. Not just yet. > Why is that directly tied to FSM_CATEGORIES? 
ISTM that there's ways to benefit > from a fairly granular FSM_CATEGORIES, while still achieving better > locality. You could e.g. search for free space for an out-of-page update in > nearby pages with a lower free-percentage threshold, while having a different > threshold and selection criteria for placement of inserts. Again, it's all about *systemic* effects. A FSM request is basically a choice *among* blocks that could in principle satisfy the request -- often the number of basically-satisfactory blocks is huge (way too large, even). You have to bear in mind that concurrent requests are in competition with each other in much of the time. They are usually all looking for exactly the same thing (e.g., same free space requirement), or close to the same thing -- tuples within a given table tend to all be about as wide as each other. I can clearly sometimes see very high contention cases, where backends do way too much spinning inside the RelationGetBufferForTuple() loop. They're all more or less chasing the same scraps of free space, in a highly inefficient way. Even though there is probably an abundance of free space. Right now the heap pages that have the most free space are also the ones that are least likely to be used. Though I think that these backends should cooperate more, some amount of competition is probably necessary. If FSM_CATEGORIES used 3 bits instead of 8, then the backends could fall back on caring about some other thing that distinguished heap pages from each other that actually matters. Leading to less contention, and maybe better space utilization. I have presented this FSM_CATEGORIES contention issue as a distinct problem to the first one (the locality problem), though honestly that was a bit arbitrary -- it's just blurry. Thank you for putting up with that. > The density of the VM is fairly crucial for efficient IOS, so I'm doubtful > merging FSM and VM is a good direction to go to. 
Additionally we need to make > sure that the VM is accurately durable, which isn't the case for the FSM > (which I think we should use to maintain it more accurately). I explained this part badly. I don't think that we should literally unite the VM and FSM data structures. I was really just pointing out that a logical-ish notion of a page being full (to use the term of art I've seen, a "closed" page) is almost the same thing as a page that is marked all-visible/all-frozen. Presumably VACUUM does that with the optimistic assumption that it will probably never again need to touch the same page. And while that can never be guaranteed, it certainly seems like we might want to weigh things in favor of that happening. I'm pretty sure that carelessly remembering 200 bytes of free space in the FSM for an all-frozen page is practically guaranteed to be a bad idea. While I don't have a great idea of where the break-even point is just yet, I am prepared to say that I'm sure that it's not 0 bytes, which is how it works today. A squishy logical definition of "page is full" seems pretty natural to me. > But perhaps I'm just missing something here. What is the strong interlink > between FSM and all-frozen/all-visible you see on the side of VACUUM? ISTM > that the main linkage is on the side of inserts/update that choose a target > page from the FSM. Isn't it better to make that side smarter? I'll respond to Robert about this separately. > To me this seems like it'd be better addressed by a shared, per-relfilenode, > in-memory datastructure. Thomas Munro has been working on keeping accurate > per-relfilenode relation size information. ISTM that that that'd be a better > place to hook in for this. Actually, that's the approach that I'm taking in my current prototyping. I would like to have a WAL-logged component to this as well, but I haven't got that far. I think that it's possible that I'll eventually conclude that the whole idea of a FSM is just the wrong idea. 
Remembering anything at all about most pages may just be unnecessary in a world where we fix everything. Maybe the end result is that the common case is that most individual heap pages are "closed". Perhaps that can just be implicit -- there is no information to remember. > > would be quite natural for the leader backend to say to each individual > > follower backend: thank you for waiting for me, here is a big contiguous > > range of heap pages (e.g., as many as 50 heap pages or more per xact when > > things truly ramp up), which should be enough to allow you to avoid going to > > the FSM again for quite a long time. > > That'd end up badly if we did for a relation that's hotly inserted into by a > lot of connections, but where each connection only inserts a few rows. I > suspect we'd either need information from the inserting backends about the > number of pages they're likely going to need (e.g. a smarter "extension queue" > where each entry lists the number of pages requested) and/or stats about the > number of consecutive insertions a relation typically gets. That's a part of it, certainly. The heuristics would mostly be driven by recent information. In particular, we would probably only ramp things up based on a clear observed failure to keep up with demand. We could be very aggressive indeed this way, without too much extra risk. We'd only ramp up because we *knew* that the most recent batch/set of free lists for the last bulk request was inadequate. For that you need the context, the recent history. > > But the leader backend doesn't do that at all -- it just puts all of > > the pages (up to 512 total) in the FSM. All of the waiting follower > > backends are left to senselessly fight it out for free space by simply > > going back to the FSM when they wake up again. Why can't they > > cooperate intelligently? What stops that today? > > The lack of suitable in-memory coordination facilities. 
> It turns out that I looked a bunch at this area because of AIO + DIO: With DIO
> the typical linux filesystems (ext4, xfs) do not perform any delayed
> allocation, which means that the non-bulk extension leads to *horrible*
> filesystem fragmentation.

It's the coordination facilities -- agreed. The only way that I seem to be going further than what you've said here is the logical database stuff. That is, I intend to specifically involve metadata like XIDs, as well as a general understanding of things like commit/abort boundaries for said XIDs. Ownership of a free space lease shouldn't be strictly tied to an XID, but that seems like a good starting point to me. This is the kind of thing that an OS engineer would never think of, but that other database systems seem to do a lot of.

This design involving waiting on XIDs could also be reused for nbtree page deletion, where page recycling is deferred (same with GiST page deletion). We could model this problem as: the physically-deleted-by-VACUUM page's safexid value should be treated as "the maybe-still-ongoing transaction that owns this page". This is kind of a lie, but close enough to the truth. That way we can put every deleted page in the FSM/whatever immediately, while reusing the same XID-wait stuff to solve the recycle safety problem. We push the recycle safety stuff on to the consumer side this way. This is currently a problem that mostly belongs to VACUUM itself (the producer side) -- which makes zero sense. I think that the only reason we do it that way is because the current FSM doesn't care at all about XID boundaries. It's almost the same problem, maybe even exactly the same.

> I addressed (or rather sidestepped) this to a small degree in the AIO+DIO
> patchset, by making extension faster and doing a few improvements around
> locking. But it's not good enough.
> If the relation extension infrastructure instead had information about the
> number of blocks each waiter wants and we had Thomas' "accurate shared
> relation size" infrastructure, we could do a lot better:

Sounds great to me!

> 1) Whenever a backend needs new space and most of the previously bulk-extended
> space is used opportunistically try to bulk-extend. In the happy path this
> avoids the thundering herd issue because other backends can continue
> inserting while extension is in progress.

Right -- my thoughts exactly.

> 2) Whenever bulk extending, release pages to individual backends by looking in
> the queue of backends needing extension and wake up backends individually
> whenever we've bulk extended sufficiently for that backend.

This is also exactly what I was thinking of, albeit for slightly different reasons. I want to more or less say directly to each backend at the point it wakes up: here you go, here are those pages you asked for. Somebody might check in about this free space lease later on, but in general you can think of these pages as your exclusive property. Making it backend/XID specific is probably helpful in many ways. Like maybe there is a natural variation in demand among backends, even though they're all bulk inserting into the same table. Who knows?

> I think the fundamental issue of bulk extension issue is more that bulk
> extension shouldn't actually get pages from the FSM at all. That's just an
> ugly implementation detail of the hack we have currently, rather than
> something that should live on.

Actually, I agree. I was just trying to use our existing terms to introduce my ideas.

> We may want to decide to continue inserting
> bulk extended pages into the FSM to handle edge cases of backends erroring out
> before space is used, but the happy path should not touch the FSM at all.

You don't have to touch the FSM if you don't even have one in the first place.
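The "release pages to individual backends" idea can be sketched in a few lines. This is plain Python with invented names, not the actual relation extension code -- just the bookkeeping a leader might do when carving one bulk-extended run of blocks into per-backend leases sized by what each waiter asked for:

```python
from dataclasses import dataclass

@dataclass
class Waiter:
    backend_id: int
    npages_wanted: int  # from a smarter "extension queue" entry

def distribute_bulk_extension(first_new_block: int, waiters: list[Waiter]):
    """Carve one bulk-extended run of blocks into per-backend leases.

    Returns ({backend_id: (first_block, last_block)}, total_blocks), where
    total_blocks is how far the leader must extend the relation. Each
    follower wakes up owning a contiguous range, so it never goes back to
    a shared FSM to fight over the same pages.
    """
    leases = {}
    next_block = first_new_block
    for w in waiters:
        leases[w.backend_id] = (next_block, next_block + w.npages_wanted - 1)
        next_block += w.npages_wanted
    return leases, next_block - first_new_block

# Backend 1 asked for 4 pages, backend 2 for 2: one extension of 6 blocks
# satisfies both, with no contention between them afterwards.
leases, total = distribute_bulk_extension(1000, [Waiter(1, 4), Waiter(2, 2)])
```

The point of the sketch is the shape of the contract: the leader extends once, and the FSM is never touched on the happy path.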
:-)

> It sounds like you're thinking of using some global history, or at least a
> backend local history? I think it might be better to have backend local,
> per-command state. E.g. for a COPY aggressively increase the amount of space
> reserved, but not for a single-row insert.

That seems quite possible. Not sure just yet. This feels like the right general idea to me, even though many of the details are not worked out.

> I guess it depends on what you define as recent, but I don't see a way to
> bound the timeframe usefully. We need to deal with cases where table sizes go
> up and down over time, e.g. because there's periodic batch activity and
> ongoing activity. What's the time bound for unused space somewhere in a
> relation? I don't see any, unless we add compaction logic to vacuum.

I don't want to bound the timeframe, exactly. I want to look at what happened during the last bulk operation, and compare that to what I see for the current/ongoing operation -- which could involve a fair amount of detail. The idea is to make decisions based on inferences about what is *changing*. What's working, what's not working -- the *direction* of things is at least as important as the state of things at this exact point in time.

This does mean that you have to accept that some number of heap pages can be "leaked" if inserts against a table that was being bulk-extended suddenly go away completely. Of course it should be possible to get this so-called leaked space back, just as soon as the table receives more inserts (any inserts should do that). My current best guess is that this will work fine in practice.

--
Peter Geoghegan
On Tue, Aug 17, 2021 at 9:48 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't know what the right degree of stickiness is, but I can easily
> believe that we want to have more than none. Part of Peter's point, at
> least as I understand it, is that if a page has 100 tuples and you
> delete 1 or 2 or 3 of them, it is not smart to fill up the space thus
> created with 1 or 2 or 3 new tuples. You should instead hope that
> space will optimize future HOT updates, or else wait for the page to
> lose some larger number of tuples and then fill up all the space at
> the same time with tuples that are, hopefully, related to each other
> in some useful way.

That is a big part of it. But it gets worse than that. It is truly insidious, once you follow the chain of problems over time -- maybe over hours and hours. BenchmarkSQL makes this a lot easier to see.

Right now when we set heap fill factor to 85 (which is what BenchmarkSQL does), we effectively have this chunk of special reserved free space left behind on each page. The remaining 15% of tuple space is of course reserved for successor versions of the same logical rows -- those rows from the time that the page first fills (presumably they all come from straight inserts). That's our starting point -- simple enough.

The approach we take effectively gives up on the *whole entire page* at literally the first sign of trouble -- the first time we fail to hold a successor version on that same page. Even though it may just as well have been a once-off perturbation -- what we might call bad luck, as opposed to an inevitability. For example, maybe ANALYZE ran that second, which made the crucial difference for that one heap page that one time. That's enough to make the implementation assume all bets are off for the entire page -- it's as if we pessimistically assume that there is no chance of keeping anything like the original rows on the heap page, so we might as well let it all get mixed up even more.
This is way too pessimistic -- it's black and white thinking. While in general that level of pessimism might turn out to be justified (anything is possible), we should hold the line and find out for sure how bad it is. I happen to know that our problematic BenchmarkSQL/TPC-C tables just don't have any skew, so this pessimism is particularly wrong-headed there.

Bear in mind that the new unrelated tuples that ultimately replace our original couldn't-fit tuples (probably after the next VACUUM) are probably not "average tuples". Maybe they're newly inserted tuples, which is bad in the BenchmarkSQL case, because they're all *guaranteed* to be updated in the future (owing to the details of the workload). With some other typical workload they might often be successor versions from updates instead -- most likely for rows that are disproportionately likely to be updated many times. It's truly pernicious. If we only accepted the original row migration, and didn't throw good money after bad, then the cycle would be broken. To me this is a bit like certain heuristic algorithms, where you technically accept a suboptimal outcome to avoid getting stuck in a local maximum.

> Now what's the threshold? 20 out of 100? 50? 80?

I'm not going to pretend to know the answer. But I will point out that one DB system whose heap fill factor defaults to 90 seems to have a symmetric setting for the "open up page again" point -- the default for that is 40. Not sure that that really matters to us, but that does seem pretty low to me. It's very sticky indeed.

> Yeah, I don't think that reducing FSM_CATEGORIES is likely to have
> much value for its own sake. But it might be useful as a way of
> accomplishing some other goal. For example if we decided that we need
> a bit per page to track "open" vs. "closed" status or something of
> that sort, I don't think we'd lose much by having fewer categories.

This is probably one of my less important points.
I addressed this point in my remarks on this to Andres from earlier.

> I don't have a well-formed opinion on this point yet. I am also
> skeptical of the idea of merging the FSM and VM, because it seems to
> me that which pages are all-visible and which pages are full are two
> different things.

As I clarified in my email to Andres, this was just a bad choice of words. I meant that they're at least conceptually much closer, provided you're willing to accept my general view of things.

> However, there's something to Peter's point too. An
> empty page is all-visible but still valid as an insertion target, but
> this is not as much of a contradiction to the idea of merging them as
> it might seem, because marking the empty page as all-visible is not
> nearly as useful as marking a full page all-visible. An index-only
> scan can't care about the all-visible status of a page that doesn't
> contain any tuples, and we're also likely to make the page non-empty
> in the near future, in which case the work we did to set the
> all-visible bit and log the change was wasted. The only thing we're
> accomplishing by making the page all-visible is letting VACUUM skip
> it, which will work out to a win if it stays empty for multiple VACUUM
> cycles, but not otherwise.

That's a big part of it. The other thing is systemic effects, which is related to the part of this email above about the chain of events. I'll talk about that some more now.

Let's say that VACUUM puts a non-zero usable amount of free space in the FSM for a page that it marks all-visible/all-frozen at the same time -- this is possible today, of course. To me this seems contradictory, at least when there isn't much space -- which I believe is common. It's at least a minor waste to set the VM bit. Now let's say we change VACUUM to not waste effort on setting the VM bit for such a heap page -- we're now avoiding minor waste, which is an okay goal.
But we're also doing something else, that I find very objectionable: we're effectively betting that the situation will improve for the same page at some point in the future, during a future VACUUM. It is possible that we could get just one or two inserts on the same page, and then see it again, and then we're done. If that happens, we win our bet. Otherwise, we lose our bet.

Here's the problem with making that bet: it's typically an awful bet. The potential upside is clearly pretty low. And more importantly, there is a decent chance that we'll actually get some row that is going to mess everything up -- not an "average row" (whatever that may mean). Again, very frequently updated rows are often what'll end up on the page, and they are enough to make the page never have its all-visible bit set.

Here is why I find this *extremely* objectionable: we *actually* risk experiencing a downside for the heap page by taking this awful bet *many many times* in the future -- it's not just a once-off downside, I believe. This is not in fact a choice about one heap page on one occasion -- it is actually a *policy* affecting the page again and again (and every other heap page too). It is only by making a not-quite-physically-full page "closed" in the way that I advocate that we avoid taking this awful bet -- that is how it's possible to artificially *make* this a question about one page on one occasion. We're bounding the downside, lessening our exposure to the worst case. Intuitively, I think that the possible downside is essentially unlimited, if we assume systemic destabilization risks (which seems wise). Whereas the possible upside of taking that awful bet is clearly, measurably small. A once-off upside. Peanuts, really.

(Of course my argument won't work if you assume that even with the "closed" page choice we naturally get more updates to one of the aged frozen rows anyway.
But we're counting on that not happening in either scenario/with *either* strategy, so I don't think that it invalidates anything I've said.)

> Conceptually, you might think of the merged data structure as a
> "quiescence map." Pages that aren't being actively updated are
> probably all-visible and are probably not great insertion targets.
> Those that are being actively updated are probably not all-visible and
> have better chances of being good insertion targets. However, there
> are a lot of gremlins buried in the word "probably." A page could get
> full but still not be all-visible for a long time, due to long-running
> transactions, and it doesn't seem likely to me that we can just ignore
> that possibility.

I think that we should opportunistically be setting the all-frozen bit in heap pages (and freezing tuples earlier than required) anyway. For example, we could notice that a whole heap page is eligible to be marked all-frozen, and freeze the tuples early -- we could restart processing of the page once we realized that freezing early is a good idea. Why even have a distinct all-visible VM bit? The per-tuple overhead of the WAL records for freezing is kind of bulky, but that could be fixed. It's not a good reason. (Of course I'm not arguing that we should break pg_upgrade, just that there is no reason not to freeze early when it's cheap to do so in passing.)

> +1. I had this same thought reading Peter's email. I'm not sure if it
> makes sense to hook that into Thomas's work, but I think it makes a
> ton of sense to use shared memory to coordinate transient state like
> "hey, right now I'm inserting into this block, you guys leave it
> alone" while using the disk for durable state like "here's how much
> space this block has left."

I agree that that's the best starting point for this.

--
Peter Geoghegan
On Tue, Aug 17, 2021 at 11:24 AM Bruce Momjian <bruce@momjian.us> wrote:
> OK, I am trying to think of something simple we could test to see the
> benefit, with few downsides. I assume the case you are considering is
> that you have a 10 8k-page table, and one page is 80% full and the
> others are 81% full, and if several backends start adding rows at the
> same time, they will all choose the 80%-full page.

That's one problem (there are so many). They start with that same page, and then fight each other over other pages again and again. This emergent behavior is a consequence of the heuristic that has the FSM look for the heap page with the least space that still satisfies a given request. I mentioned this problem earlier.

> For example:
>
> 1. page with most freespace is 95% free
> 2. pages 2,4,6,8,10 have between 86%-95% free
> 3. five pages
> 4. proc id 14293 % 5 = 3 so use the third page from #2, page 6

Right now my main concern is giving backends/XIDs their own space, in bulk. Mostly for concurrent bulk inserts, just so we can say that we're getting the easy cases right -- that is a good starting point for my prototype to target, I think. But I am also thinking about stuff like this, as one way to address contention.

> This should spread out page usage to be more even, but still favor pages
> with more freespace. Yes, this is simplistic, but it would seem to have
> few downsides and I would be interested to see how much it helps.

I thought about having whole free lists that were owned by XIDs, as well as shared free lists that are determined by hashing MyProcPid -- or something like that. So I agree that it might very well make sense.

--
Peter Geoghegan
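Bruce's spreading heuristic is simple enough to sketch directly. This is plain Python with an invented function name, not a patch: among the pages that all satisfy the space request, pick one keyed on the requesting backend's PID instead of deterministically choosing the same "best" page for everyone:

```python
def pick_target_page(candidate_pages: list[int], pid: int) -> int:
    """Spread concurrent inserters across the pages that all have enough
    free space, e.g. Bruce's pages 2, 4, 6, 8, 10 with 86%-95% free.

    Because each backend's PID gives a different remainder, backends fan
    out instead of all pouncing on one page.
    """
    return candidate_pages[pid % len(candidate_pages)]

# Reproducing the example from the email: proc id 14293 % 5 = 3. With a
# 0-based index that remainder selects page 8; Bruce reads the remainder
# as a 1-based ordinal ("the third page, page 6"). Either convention
# spreads the load, which is the point.
page = pick_target_page([2, 4, 6, 8, 10], 14293)
```

Neighboring PIDs land on different pages (14294 % 5 = 4 selects page 10), which is exactly the decorrelation the deterministic FSM lacks.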
On Tue, Aug 17, 2021 at 12:11 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> I'm not sure it's essential to have "far" fewer categories, if the
> closed-to-open transition is made less granular through some other
> mechanism.

See my remarks to Andres about FSM_CATEGORIES from earlier. (It may not matter if you don't believe that it helps us to have such high granularity -- that means it won't hurt to get rid of it, which might be a good enough reason on its own.)

> > I now wonder if the FSM is fundamentally doing the wrong thing by
> > keeping track of all "free space" in every page over time. Why
> > wouldn't we just have a free space management strategy that is
> > generally only concerned with recent events?
>
> The second paragraph here is an interesting idea and makes a great deal
> of sense. It would lead to smaller FSMs that are navigated more quickly
> and locked for shorter durations.

Most importantly, the FSM/whatever doesn't have to be locked at all if you don't go to it. And so going to the FSM much less often with space leases or whatever seems like it might be a big win for that reason (not just the locality stuff that I care about).

> Implicit "closure" seems riskier in my view if you want to bring VM
> qualities into it, however.

I didn't mean to imply that we'd be changing that data structure. As I clarified earlier, "merging the FSM and the VM" was meant in a conceptual way. Not the best choice of words on my part.

> This makes sense as well. Shared memory for more recent / highly
> contended state, and disk for less recent / less contended / stickier
> state.

I would like to also make it WAL-logged, which is easier this way. That isn't my main goal, but why not do it? The way that the FSM works right now within VACUUM REDO routines is a problem.

--
Peter Geoghegan
On Tue, Aug 17, 2021 at 6:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Again, it's all about *systemic* effects. A FSM request is basically a
> choice *among* blocks that could in principle satisfy the request --
> often the number of basically-satisfactory blocks is huge (way too
> large, even). You have to bear in mind that concurrent requests are in
> competition with each other much of the time. They are usually all
> looking for exactly the same thing (e.g., same free space
> requirement), or close to the same thing -- tuples within a given
> table tend to all be about as wide as each other.
>
> I can clearly sometimes see very high contention cases, where backends
> do way too much spinning inside the RelationGetBufferForTuple() loop.
> They're all more or less chasing the same scraps of free space, in a
> highly inefficient way. Even though there is probably an abundance of
> free space. Right now the heap pages that have the most free space are
> also the ones that are least likely to be used.
>
> Though I think that these backends should cooperate more, some amount
> of competition is probably necessary. If FSM_CATEGORIES used 3 bits
> instead of 8, then the backends could fall back on caring about some
> other thing that distinguished heap pages from each other that
> actually matters. Leading to less contention, and maybe better space
> utilization.

To me, it feels like the real issue here is that the free space map is completely deterministic. As it's currently designed, you can reduce the number of bits as much as you want, and it doesn't change anything. Concurrent requests get the same answer, which is not what we want.

Just as a thought experiment, imagine that backends preferentially choose pages whose block numbers are congruent to their PID modulo some integer. Now you have a reason - perhaps not the best possible reason, but *a* reason - for every single backend not to pounce on the same target page.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Aug 17, 2021 at 8:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Let's say that VACUUM puts a non-zero usable amount of free space in
> the FSM for a page that it marks all-visible/all-frozen at the same
> time -- this is possible today, of course. To me this seems
> contradictory, at least when there isn't much space -- which I believe
> is common. It's at least a minor waste to set the VM bit.

It seems to me that you are talking about the VM bit but making an argument that is more applicable to the FSM. If we think that the small amount of freespace in the page is going to result in the page being dirtied repeatedly as it cycles through a bunch of different tuples none of which stick around long term, then we might not want to advertise the freespace in the FSM, or we might want the FSM to disregard the fact that a small amount of freespace is present there. However, if we do that, then we certainly want to also set the VM bit. Otherwise, we've got a page that we've effectively quiesced, by refusing to reuse the freespace, but the VM doesn't know that, so future index-only scans and VACUUM operations are penalized.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Aug 18, 2021 at 7:36 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Though I think that these backends should cooperate more, some amount
> > of competition is probably necessary. If FSM_CATEGORIES used 3 bits
> > instead of 8, then the backends could fall back on caring about some
> > other thing that distinguished heap pages from each other that
> > actually matters. Leading to less contention, and maybe better space
> > utilization.
>
> To me, it feels like the real issue here is that the free space map is
> completely deterministic.

I agree. Though note the separate problem with FSM_CATEGORIES that I mentioned to Andres: if you want to maintain the FSM value for pages eagerly within backends, then practically speaking a far lower FSM_CATEGORIES-like constant seems necessary.

I'm really just saying that the fine granularity seems to me to be basically the wrong approach. In general the single most important thing is the amount of free space available -- we agree on that. However, the only reason to remember very small *differences* in free space between heap pages is because you intend to do *something* with that extra information. But caring about noise-level differences seems inherently bad -- there is bound to be a better fall-back behavior that reduces contention, and actually *increases* space utilization in practice (even though theoretically coarser-grained information might hurt utilization a little).

> As it's currently designed, you can reduce
> the number of bits as much as you want, and it doesn't change
> anything. Concurrent requests get the same answer, which is not what
> we want.

I agree that reducing FSM_CATEGORIES alone will not buy us anything.

--
Peter Geoghegan
On Wed, Aug 18, 2021 at 7:45 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 17, 2021 at 8:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Let's say that VACUUM puts a non-zero usable amount of free space in
> > the FSM for a page that it marks all-visible/all-frozen at the same
> > time -- this is possible today, of course. To me this seems
> > contradictory, at least when there isn't much space -- which I believe
> > is common. It's at least a minor waste to set the VM bit.
>
> It seems to me that you are talking about the VM bit but making an
> argument that is more applicable to the FSM. If we think that the
> small amount of freespace in the page is going to result in the page
> being dirtied repeatedly as it cycles through a bunch of different
> tuples none of which stick around long term, then we might not want to
> advertise the freespace in the FSM, or we might want the FSM to
> disregard the fact that a small amount of freespace is present there.

I'm not quite sure if my argument applies to the VM or the FSM. It's a bit tricky to talk about because I'm imagining a world in which the concepts have shifted. And because the exact details of the data structures are less clear (even if they were clear to me, the lack of a shared vocabulary might confuse our discussion).

Maybe there isn't even a conventional FSM in this new world. Maybe free space management works by focusing on recent events, and on outcomes. This means that we store much less information about the entire database, and much more information about very recent events. In particular, I hope that free space management won't have to care about how much free space most individual heap pages have. Perhaps most heap pages are simply "closed", unavailable for reuse.

> However, if we do that, then we certainly want to also set the VM bit.
> Otherwise, we've got a page that we've effectively quiesced, by
> refusing to reuse the freespace, but the VM doesn't know that, so
> future index-only scans and VACUUM operations are penalized.

I agree that if we do that we'd want to also set the VM bit. I'm imagining a world in which a "closed page" is synonymous with an all-visible/all-frozen page. That doesn't mean that they're exactly the same thing. For example, maybe a closed page is a page that is at least *expected* to have its all-visible/all-frozen bits set by VACUUM the next time it runs -- though more often they'll have been set already. If you're looking at the system in real time with this improved design, you'll see that pages that become closed will then also have their VM bit set on a predictable timeline.

Maybe we even set the VM bit when some of the closed page's closed rows happen to be updated (freeze cutoff permitting) -- there isn't that much downside to VACUUM betting that these "closed row updates" will be the last such updates forever, or at least for a long time. This should at least be the case once we add an optimized approach to freezing whole pages (pages whose tuples start out unfrozen).

--
Peter Geoghegan
On Tue, Aug 17, 2021 at 5:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Now what's the threshold? 20 out of 100? 50? 80?
>
> I'm not going to pretend to know the answer. But I will point out that
> one DB system whose heap fill factor defaults to 90 seems to have a
> symmetric setting for the "open up page again" point -- the default
> for that is 40. Not sure that that really matters to us, but that does
> seem pretty low to me. It's very sticky indeed.

Correction: it's actually 60, not 40. It's true that the actual default is 40, but it works the other way around relative to Postgres (as does the related fill-factor-like setting, which defaults to 90 instead of 100). And so we would think of this other "open up closed page once again" setting as having a default of 60. (Or perhaps we'd think of it as having a default that is 2/3 of the closely related fill factor setting's default.)

--
Peter Geoghegan
On Wed, Aug 18, 2021 at 3:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Aug 17, 2021 at 5:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > > Now what's the threshold? 20 out of 100? 50? 80?
> >
> > I'm not going to pretend to know the answer. But I will point out that
> > one DB system whose heap fill factor defaults to 90 seems to have a
> > symmetric setting for the "open up page again" point -- the default
> > for that is 40. Not sure that that really matters to us, but that does
> > seem pretty low to me. It's very sticky indeed.
>
> Correction: it's actually 60, not 40.
>
> It's true that the actual default is 40, but it works the other way
> around relative to Postgres (as does the related fill factor like
> setting, which defaults to 90 instead of 100). And so we would think
> of this other "open up closed page once again" setting as having a
> default of 60. (Or perhaps we'd think of it as having a default that
> is 2/3 of the closely related fill factor setting's default.)

I don't know whether 60 is optimal or not, but it doesn't seem crazy.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Aug 18, 2021 at 1:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Maybe there isn't even a conventional FSM in this new world. Maybe
> free space management works by focusing on recent events, and on
> outcomes. This means that we store much less information about the
> entire database, and much more information about very recent events.
> In particular, I hope that free space management won't have to care
> about how much free space most individual heap pages have. Perhaps
> most heap pages are simply "closed", unavailable for reuse.

I very much doubt that you can get away without some sort of free space map. Even if in most cases most pages are closed to insertions, there will be important corner cases where lots of pages are open for insertions, like when you just deleted a ton of rows and then ran VACUUM. And we cannot lose track of even one of those open pages or, if the pre-8.4 state of the world is any indication, we will be super sad. I suppose technically it doesn't need to be a map; it could be structured some other way. But it is going to have to be able to keep track of an arbitrary number of open pages without losing any, and a map is one good way of doing that.

Regarding the FSM_CATEGORIES question, I think the issue is big tuples vs. small tuples. You don't want to keep on finding pages with some free space and then find out that they don't actually have enough for the tuple you want to insert. Now that does become less of an issue with this open/closed system because, at least initially, when a page is re-opened to inserts, more than a quarter of the page will be free (I presume) and so the details don't matter. But, as it gets closer to filling up, it might matter again. Perhaps we don't want to waste time looking through a bunch of pages with 1kB free when we're trying to insert a 2kB tuple.
We could close all those pages as we try them, to prevent repeating the same process over and over again, but maybe the 2kB tuple we're now trying to insert is unusual, and most tuples are 40 bytes. Then we look silly.

One idea might be to have a CLOSED state in the FSM and then a variety of OPEN states that are distinguished by amount of available free space. For example, imagine something like CLOSED, OPEN (>2kB), OPEN (1.5-2kB), OPEN (1-1.5kB), OPEN (768-1023 bytes), OPEN (512-767 bytes), OPEN (256-511 bytes), OPEN (128-255 bytes), OPEN (64-127 bytes), OPEN (<64 bytes). I think that would give us enough information to pack open pages tightly full before having them become closed. I don't know if those are exactly the right boundaries, and 10 categories might be worse than 8 or 16, but I think it's likely correct to suppose that (a) we don't really care at all how much space is present in closed pages, and (b) for open pages, exactitude is most important when the amount of available space is small.

--
Robert Haas
EDB: http://www.enterprisedb.com
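As a way of making the proposed scheme concrete, here is a minimal sketch (plain Python, invented function name) that maps free bytes to the hypothetical 10-state CLOSED/OPEN scheme described above, with the boundaries taken straight from the email. It is deliberately unlike today's FSM, which maps free space onto 256 uniform-width categories:

```python
def page_state(free_bytes: int, closed: bool) -> str:
    """Classify a heap page per the hypothetical CLOSED/OPEN scheme.

    Boundaries are finer where little space remains, matching point (b):
    exactitude matters most when available space is small.
    """
    if closed:
        return "CLOSED"  # per (a), free space in closed pages is ignored
    for limit, label in [
        (64, "OPEN (<64 bytes)"),
        (128, "OPEN (64-127 bytes)"),
        (256, "OPEN (128-255 bytes)"),
        (512, "OPEN (256-511 bytes)"),
        (768, "OPEN (512-767 bytes)"),
        (1024, "OPEN (768-1023 bytes)"),
        (1536, "OPEN (1-1.5kB)"),
        (2048, "OPEN (1.5-2kB)"),
    ]:
        if free_bytes < limit:
            return label
    return "OPEN (>2kB)"
```

A backend with a 2kB tuple would then only consult "OPEN (>2kB)" pages, instead of trying (and rejecting) every page that has "some" free space.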
On Fri, Aug 20, 2021 at 7:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't know whether 60 is optimal or not, but it doesn't seem crazy.

Uh, I had it right the first time. Only the fill factor setting is "inverted relative to Postgres". This other setting really does default to 40. So it's very low.

--
Peter Geoghegan
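The open/reopen hysteresis being discussed can be sketched in a few lines. This is plain Python with invented names, using the defaults mentioned in the thread (fill to 90%, reopen below 40%) purely as illustration of why such a page is "very sticky":

```python
FILL_FACTOR_PCT = 90   # an open page closes once utilization reaches this
REOPEN_PCT = 40        # a closed page reopens only below this

def page_open_for_inserts(currently_open: bool, used_pct: float) -> bool:
    """Return whether the page should accept new inserts.

    The wide gap between the two thresholds is the "stickiness": a page
    that closed at 90% full stays closed through moderate deletions,
    instead of flapping open and closed near a single boundary.
    """
    if currently_open:
        return used_pct < FILL_FACTOR_PCT
    return used_pct < REOPEN_PCT
```

So a closed page sitting at 70% utilization keeps refusing inserts even though it has room; only after deletions drop it below 40% does it rejoin the pool of insertion targets.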
On Fri, Aug 20, 2021 at 11:06 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Aug 20, 2021 at 7:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > I don't know whether 60 is optimal or not, but it doesn't seem crazy.
>
> Uh, I had it right the first time. Only the fill factor setting is
> "inverted relative to Postgres". This other setting really does
> default to 40. So it's very low.

I expect they ran more than zero tests before selecting that value, so it's probably a decent choice in their system. However, that does seem rather low. I would have guessed that a good value would be in the 50-80 percent range.

It's hard to know, though, partly because everything is workload dependent, and partly because you're balancing two good things that are qualitatively different. A lower value figures to reduce the degree of "mixing" of older and newer data within the same pages, but it also risks permanently wasting space that could have been put to efficient use. Imagine a table that is written some number of times and then, all at once, you stop all writes forever. You probably don't want to discover at that point in time that you have a large number of pages that are not even close to full, but you can't know when, if ever, such an event will happen for any given table.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Aug 20, 2021 at 7:45 AM Robert Haas <robertmhaas@gmail.com> wrote: > I very much doubt that you can get away without some sort of free > space map. Even if in most cases most pages are closed to insertions, > there will be important corner cases where lots of pages are open for > insertions, like when you just deleted a ton of rows and then ran > VACUUM. And we cannot lose track of even one of those open pages or, > if the pre-8.4 state of the world is any indication, we will be super > sad. I agree with all that. The point I was making is that this new FSM design will have an on-disk size that is "a function of the workload". The new FSM can be required to store information about every single page, but that is the worst case. And a rather atypical case. I imagine that we'll find that the new FSM on-disk structure stores far less information than the current FSM in most cases, even though we're operating within the confines of what you've said. I think of this whole area as making heap pages a bit like B-Tree leaf pages. TIDs are stable logical identifiers of rows (that happen to have a physical component, a block number) in the other DB systems that I have referenced -- the heap pages from these systems are therefore intrinsically more like B-Tree leaf pages than those from Postgres heapam. ISTM that that's relevant to total space utilization. Users with sparse deletion patterns in their heap structure will get low space utilization -- an issue that we're familiar with as a problem for B-Tree indexing. I don't think that having a smaller on-disk FSM should be a goal of this project (though I suppose you could say that that aspect enables FSM WAL-logging, which would be nice). Smaller on-disk footprints seem like a natural consequence of this whole direction -- that's all. I also don't think that you're going to see huge space utilization benefits. This project is mostly aimed at reducing fragmentation, and all of the many problems that stem from it. 
> I don't know > if those are exactly the right boundaries, and 10 categories might be > worse than 8 or 16, but I think it's likely correct to suppose that > (a) we don't really care at all how much space is present in closed > pages, and (b) for open pages, exactitude is most important when the > amount of available space is small. I really don't have a detailed opinion on the appropriate number of categories just yet, except that it should be maybe 16 or 20 at the very most -- only real testing is likely to help me to refine my thinking on that. Note that the paper "Towards Effective and Efficient Free Space Management" recommends logarithmic intervals (called "buckets"), with 15 total. Details are under "4 Implementing Object Placement". I think that it's quite possible that the final scheme will not be a linear scale. Plus we may have to account for fill factor settings. -- Peter Geoghegan
On Fri, Aug 20, 2021 at 8:34 AM Robert Haas <robertmhaas@gmail.com> wrote: > I expect they ran more than zero tests before selecting that value, so > it's probably a decent choice in their system. However, that does seem > rather low. I would have guessed that a good value would be in the > 50-80 percent range. They don't have to deal with non-HOT updates, which effectively move the row to another block. Actually, that's not quite true -- they do have a concept called row migration. But it's the strategy of last resort, and is presumably very rare -- much rarer than a non-HOT update that moves a row. My point is that it's easier to believe that you run into sparse deletion patterns in a version of Postgres with open and closed heap pages -- because it's not only deletes that you have to worry about. > It's hard to know, though, partly because > everything is workload dependent, and partly because you're balancing > two good things that are qualitatively different. A lower value > figures to reduce the degree of "mixing" of older and newer data > within the same pages, but it also risks permanently wasting space > that could have been put to efficient use. All true -- to me this is all about adapting to workloads at a fine granularity. If we have closed pages, then non-HOT updates that cannot keep the row on the same heap page now actually "relieve the pressure", which is not currently true. If we started out with a heap page that had 100 tuples (say with fill factor 90), and then we find that we cannot keep the entire page together...then maybe we'll have more luck with 99, or just 90, or even less. By firmly sticking with our original goal, we have some chance to learn what really will be stable for a given heap page. There is nothing wrong with learning that lesson through trial and error, based on negative information from the workload. Right now we go ahead and reuse the space at the next opportunity (we particularly favor pages like this, even). 
So we never learn from our mistakes. -- Peter Geoghegan
On Fri, Aug 20, 2021 at 11:48 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Aug 20, 2021 at 7:45 AM Robert Haas <robertmhaas@gmail.com> wrote: > > I very much doubt that you can get away without some sort of free > > space map. Even if in most cases most pages are closed to insertions, > > there will be important corner cases where lots of pages are open for > > insertions, like when you just deleted a ton of rows and then ran > > VACUUM. And we cannot lose track of even one of those open pages or, > > if the pre-8.4 state of the world is any indication, we will be super > > sad. > > I agree with all that. The point I was making is that this new FSM > design will have an on-disk size that is "a function of the workload". That could be the right decision, but nothing said up to this point really seems to suggest it. The open/closed distinction and the changes in how to bin available space could all be done with the present model. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Aug 20, 2021 at 9:03 AM Robert Haas <robertmhaas@gmail.com> wrote: > That could be the right decision, but nothing said up to this point > really seems to suggest it. The open/closed distinction and the > changes in how to bin available space could all be done with the > present model. I guess that's true. Isn't this just semantics, though? The new FSM implementation certainly cannot forget about pages that it deems open, or fail to have information about pages that should be considered open -- that would be absolutely unacceptable. And the implementation cannot assume that it won't have to maintain information about every single heap page. All I'm saying is that in practice this will be very rare. Making the FSM structure smaller isn't really the goal, though. -- Peter Geoghegan
On Fri, Aug 20, 2021 at 12:17 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Aug 20, 2021 at 9:03 AM Robert Haas <robertmhaas@gmail.com> wrote: > > That could be the right decision, but nothing said up to this point > > really seems to suggest it. The open/closed distinction and the > > changes in how to bin available space could all be done with the > > present model. > > I guess that's true. Isn't this just semantics, though? Not entirely. On one level, as long as we end up with an FSM implementation that does good things, who cares how it works internally? Well, the answer is, hackers care, and we're hackers, so it doesn't seem out of bounds to talk about thoughts and ideas around that. My own opinion is that the multi-level implementation of the current FSM is confusing and doesn't seem to do what we want, but the basic idea of a direct-mapped data structure seems good to me. Life is a lot simpler if you can update the status of your own page without having to do any complex calculation to figure out where that status is stored, let alone having to shuffle things around to make room to insert your status. Now if you or someone else comes up with an absolutely amazing new design that isn't direct-mapped and it's awesome, I'm not going to sit here and hold my breath; I like awesome as well as the next person. But if you or someone else said to me "hey, Robert, I'm thinking of rewriting the FSM, what tips do you have for maximizing the chances of success?" I'd say "well, you should try to make the replacement data structure as simple as it can be while still having the other properties that you care about ... and in particular I would suggest something that uses a fixed amount of state per heap page, because that seems much simpler to me." But nobody's got to agree with that, and it doesn't have to be right. It's just what I think. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Aug 20, 2021 at 10:15 AM Robert Haas <robertmhaas@gmail.com> wrote: > > I guess that's true. Isn't this just semantics, though? > > Not entirely. On one level, as long as we end up with an FSM > implementation that does good things, who cares how it works > internally? > I'd say "well, you should try > to make the replacement data structure as simple as it can be while > still having the other properties that you care about ... and in > particular I would suggest something that uses a fixed amount of state > per heap page, because that seems much simpler to me." But nobody's > got to agree with that, and it doesn't have to be right. It's just > what I think. But not storing information about a heap page implicitly means that the page is closed. And so we effectively remember the state of every single heap page with the FSM design I am working on. Not storing any information in the data structure is just an optimization. In any case we're obligated to have accurate information about all individual heap pages. Doing it that way seems pretty natural to me, even though the overhead of maintaining the FSM data structure isn't a priority for me -- not at all. It might actually be easier to have that optimization than to not have it. The starting point for the new FSM data structure is lists of blocks that are usable, that have free space. Whereas the starting point for the existing FSM structure is every single extant heap page in the table -- it's strictly obligated to have explicit information about every single page (otherwise we get nasty smgr errors). It's either a small difference or a big difference, depending on how you think about it. -- Peter Geoghegan
On Fri, Aug 20, 2021 at 10:32 AM Peter Geoghegan <pg@bowt.ie> wrote: > But not storing information about a heap page implicitly means that > the page is closed. And so we effectively remember the state of every > single heap page with the FSM design I am working on. Not storing any > information in the data structure is just an optimization. > Doing it that way seems pretty natural to me, even though the overhead > of maintaining the FSM data structure isn't a priority for me -- not > at all. It might actually be easier to have that optimization than to > not have it. The starting point for the new FSM data structure is > lists of blocks that are usable, that have free space. There is one problem with what I've said here, which might be behind your concern. That problem is: if a page being closed is treated as an implicit thing, just from its absence in this new FSM structure, then we need to tighten things up elsewhere. In particular, this code at the end of RelationGetBufferForTuple() is now a problem: """ /* * Remember the new page as our target for future insertions. * * XXX should we enter the new page into the free space map immediately, * or just keep it for this backend's exclusive use in the short run * (until VACUUM sees it)? Seems to depend on whether you expect the * current backend to make more insertions or not, which is probably a * good bet most of the time. So for now, don't add it to FSM yet. */ RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer)); """ Note that the relation bulk extension mechanism that you committed (in commit 719c84c1be) will always allocate one more block, just for the leader backend (the backend that bulk extends for other backends) -- we don't tell the FSM about this same block, presumably because to do so would risk live lock inside the loop in the same function. We just assume that it's okay that the leader backend remembers this in its local relcache "target block". 
Right now it isn't so bad that we're kind of sloppy here, at least relative to everything else. Although the leader backend may well "leak" the page before it has the opportunity to write very much to it (e.g., client disconnects), in the long run we should get that space back. At least we know that VACUUM will ultimately be able to find the space again, and put it in the FSM for reuse. Since there is currently no question about "closed vs open" pages, there is no question about VACUUM/the FSM becoming confused about which state applies to which page in this scenario. In short: yeah, this "closed vs open" page business more or less necessitates tightening things up here. Though this new requirement seems to have always been a good idea. Just because we can lean on VACUUM like this (an option that other DB systems don't have) doesn't mean that we should do so. -- Peter Geoghegan
On Fri, Aug 20, 2021 at 3:13 PM Peter Geoghegan <pg@bowt.ie> wrote: > In short: yeah, this "closed vs open" page business more or less > necessitates tightening things up here. Though this new requirement > seems to have always been a good idea. Just because we can lean on > VACUUM like this (an option that other DB systems don't have) doesn't > mean that we should do so. That's all true, but it's not what I was talking about. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Aug 20, 2021 at 12:20 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Aug 20, 2021 at 3:13 PM Peter Geoghegan <pg@bowt.ie> wrote: > > In short: yeah, this "closed vs open" page business more or less > > necessitates tightening things up here. Though this new requirement > > seems to have always been a good idea. Just because we can lean on > > VACUUM like this (an option that other DB systems don't have) doesn't > > mean that we should do so. > > That's all true, but it's not what I was talking about. Then I am not sure that I have understood you fully. I would like to make sure that I have the nuance of it. As long as we have the idea of open vs closed pages then we cannot assume that we know that status from the page alone. Unless we're willing to go update the pages again, and possibly dirty them again, which seems unappealing. My concern here is really the data structure and its overall complexity. If there has to be an explicit state for every page on the FSM, then this new FSM needs to merge freelists that have been emptied into that structure, and needs to handle the question of which structure (map or free lists) is currently authoritative for a given heap page. That's a lot of complexity compared to just forgetting the free lists that are known to be fully closed, which will probably be very common. That can work a little like discarding old UNDO -- it can be simple and cheap *because* it's largely logical and implicit at the level of the physical data structures. -- Peter Geoghegan
On Fri, Aug 20, 2021 at 3:40 PM Peter Geoghegan <pg@bowt.ie> wrote: > My concern here is really the data structure and its overall > complexity. If there has to be an explicit state for every page on the > FSM, then this new FSM needs to merge freelists that have been emptied > into that structure, and needs to handle the question of which > structure (map or free lists) is currently authoritative for a given > heap page. That's a lot of complexity compared to just forgetting the > free lists that are known to be fully closed, which will probably be > very common. That can work a little like discarding old UNDO -- it can > be simple and cheap *because* it's largely logical and implicit at the > level of the physical data structures. I don't think I understand the data structure that you have in mind well enough to comment intelligently. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Aug 20, 2021 at 1:30 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Aug 20, 2021 at 3:40 PM Peter Geoghegan <pg@bowt.ie> wrote: > > My concern here is really the data structure and its overall > > complexity. > I don't think I understand the data structure that you have in mind > well enough to comment intelligently. Right now my prototype has a centralized table in shared memory, with a hash table. One entry per relation, generally multiple freelists per relation. And with per-freelist metadata such as owner and original leader backend XID values. Plus of course the lists of free blocks themselves. The prototype already clearly fixes the worst problems with BenchmarkSQL, but that's only one of my goals. That's just the starting point. I appreciate your input on this. And not just on the implementation details -- the actual requirements themselves are still in flux. This isn't so much a project to replace the FSM as it is a project that adds a new rich abstraction layer that goes between access methods and smgr.c -- free space management is only one of the responsibilities. Access methods like heapam can talk to the new abstraction layer, sometimes exploiting semantics from the logical database -- owning transaction IDs for a given lease, things like that. The new abstraction may thereby help us when working on other problems, including problems in areas that might seem totally unrelated. That's the big idea. Here are some (admittedly quite speculative) examples of places that the infrastructure could help with: * Parallel INSERT/COPY. I imagine parallel DML is hard because of coordination problems. With the infrastructure I have in mind, a parallel DML implementation could request a large free space lease, and distribute blocks from that lease among parallel workers during execution, on demand. This naturally avoids unnecessary buffer lock contention among concurrent writes from the same transaction (but different workers). But that's not all. 
I'm also thinking of things like LWLock/buffer lock deadlock avoidance -- issues of fundamental correctness. That stuff becomes easier, too. Individual parallel workers from a parallel INSERT could safely assume that the blocks that the parallel leader tells them about really are only going to be accessed by them, for as long as the owning transaction is around (barring read-only access by VACUUM, things like that). There is simply no question of any other backend/parallel worker thinking that it has the right to use the same block before the transaction commits. The transaction-scoped ownership semantics must be truly robust for this to work at all, but that's a good idea anyway. * Truncate tables without a disruptive relation-level lock. AFAICT it isn't actually logically necessary to get a heavyweight relation extension lock to truncate a relation during VACUUM. The only reason we need that right now is because it's really hard to nail down which backends might have a usable BlockNumber that they still expect to not point past the end of the physical relfilenode (or much worse, point to new and logically unrelated data). Perhaps we can solve the underlying problem using something like Lanin & Shasha's "drain technique", which is used inside nbtree for page deletion today. The general idea is to defer the physical unlink/recycling until we know for sure that any backends that might have had unsafe references have gone away. This is a very broad technique. * Truncate away empty heap table segments that happen to come before other segments that still have useful data. * Eager cleanup of aborted transactions by VACUUM (or some variation of VACUUM that just handles aborts). Again, we're using information about the logical database to solve problems with the physical database -- that information is what we seem to currently be missing in all these areas. That's really the central idea behind the new infrastructure. -- Peter Geoghegan
On Mon, Aug 23, 2021 at 5:55 PM Peter Geoghegan <pg@bowt.ie> wrote: > Right now my prototype has a centralized table in shared memory, with > a hash table. One entry per relation, generally multiple freelists per > relation. And with per-freelist metadata such as owner and original > leader backend XID values. Plus of course the lists of free blocks > themselves. The prototype already clearly fixes the worst problems > with BenchmarkSQL, but that's only one of my goals. That's just the > starting point. > > I appreciate your input on this. And not just on the implementation > details -- the actual requirements themselves are still in flux. This > isn't so much a project to replace the FSM as it is a project that > adds a new rich abstraction layer that goes between access methods and > smgr.c -- free space management is only one of the responsibilities. Makes sense. I think one of the big implementation challenges here is coping with the scenario where there's not enough shared memory available ... or else somehow making that impossible without reserving an unreasonable amount of shared memory. If you allowed space for every buffer to belong to a different relation and have the maximum number of leases and whatever, you'd probably have no possibility of OOM, but you'd probably be pre-reserving too much memory. I also think there are some implementation challenges around locking. You probably need some, because the data structure is shared, but because it's complex, it's not easy to create locking that allows for good concurrency. Or so I think. Andres has been working -- I think for years now -- on replacing the buffer mapping table with a radix tree of some kind. That strikes me as very similar to what you're doing here. The per-relation data can then include not only the kind of stuff you're talking about but very fundamental things like how long it is and where its buffers are in the buffer pool. Hopefully we don't end up with dueling patches. 
-- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Aug 25, 2021 at 10:58 AM Robert Haas <robertmhaas@gmail.com> wrote: > Makes sense. I'm glad that the big picture stuff makes sense to you. > I think one of the big implementation challenges here is > coping with the scenario where there's not enough shared memory > available ... or else somehow making that impossible without reserving > an unreasonable amount of shared memory. Yes, it'll definitely be necessary to nail that down. > If you allowed space for > every buffer to belong to a different relation and have the maximum > number of leases and whatever, you'd probably have no possibility of > OOM, but you'd probably be pre-reserving too much memory. I hope that we can control the shared memory space overhead by making it a function of max_connections, plus some configurable number of relations that get modified within a single transaction. This approach must behave in the same way when the number of tables that each transaction actually modifies is high -- perhaps a transaction that does this then pays a penalty in WAL logging within the FSM. I think that that can be made manageable, especially if we can pretty much impose the cost directly on those transactions that need to modify lots of relations all at once. (If we can reuse the shared memory over time it'll help too.) > I also think > there are some implementation challenges around locking. That seems likely. > You probably > need some, because the data structure is shared, but because it's > complex, it's not easy to create locking that allows for good > concurrency. Or so I think. My hope is that this design more than makes up for it by relieving contention in other areas. Like buffer lock contention, or relation extension lock contention. > Andres has been working -- I think for years now -- on replacing the > buffer mapping table with a radix tree of some kind. That strikes me > as very similar to what you're doing here. 
The per-relation data can > then include not only the kind of stuff you're talking about but very > fundamental things like how long it is and where its buffers are in > the buffer pool. Hopefully we don't end up with dueling patches. I agree that there is definitely some overlap. I see no risk of a real conflict, though. I have mostly been approaching this project as an effort to fix the locality problems, mostly by looking for fixes to the BenchmarkSQL workload's problems. I have to admit that the big picture stuff about exploiting transactional semantics with free space management is still pretty aspirational. The resource management parts of my prototype patch are by far the kludgiest parts. I hope that I can benefit from whatever work Andres has already done on this, particularly when it comes to managing per-relation metadata in shared memory. -- Peter Geoghegan
On Mon, Aug 16, 2021 at 5:15 PM Peter Geoghegan <pg@bowt.ie> wrote: > This is actually quite simple -- the chaos that follows is where it > gets truly complicated (and not necessarily worth getting into just > yet). The update of a given order and its order lines entries takes > place hours later, within a delivery transaction. Update on this: aborted transactions play an important role in the chaos that we see with BenchmarkSQL/TPC-C. This is surprising, in one way. Only 1% of all transactions are aborted (per the TPC-C spec). But in another way it's not surprising. It's totally consistent with what I've been saying about open and closed pages. When a page that's initially allocated and opened by a backend becomes closed following some inserts, the page should be left in a more or less pristine state -- owning backends that open and close pages need to be "good citizens". The definition of "pristine" must also include "no already-aborted tuples" -- aborted xact tuples are a problem that we can and should nip in the bud. This is really an extension of what I've been saying about not disrupting naturally occurring temporal and spatial locality. We're far better off if the space freed by removing aborted heap tuples is reused by the very next transaction running in the very same backend. Plus it's not that hard; we can know for sure that this space is immediately available, which is not generally true of pruning. And so this free space from aborted xact tuples is quite special. Even with the master branch the problems with the FSM are much less noticeable once you configure BenchmarkSQL to not abort any transactions [1]. You need to consider the whole picture to understand how the issue of aborted transactions has such an outsized negative impact on performance and space efficiency. 
It's a whole *chain* of events: * If just 1% of all transactions abort, that's enough to leave an average of about 1 aborted tuple on every "order" table page (which naturally has small heap tuples [2]), and about 10 on a minority of the really big "new order" table's pages. Maybe 1 in 10 or 12 "order line" table heap pages are affected in this way. * The overall impact is that small (but not tiny) pockets of free space are left randomly strewn all over the place. Not just LP_DEAD line pointer stubs -- whole entire aborted heap tuples. * In theory we could address this problem using opportunistic pruning. But in practice that simply cannot happen today, because we don't touch the inserted rows until literally hours later. The same rows will eventually have updates and selects, but that won't help -- too little, too late. That's just how opportunistic pruning works: currently you need some kind of scan node for opportunistic pruning to kick in, so continually inserting transactions (from new orders) are currently fundamentally unable to do any opportunistic pruning. Much less opportunistic pruning that kicks in at exactly the right time. * When an autovacuum worker runs against these tables, it will of course notice this diffuse free space from aborted tuples, and put it in the FSM. That free space will be reused, but by totally unrelated logical rows. What we really seem to need here is some rudimentary form of "eager physical rollback"; the really important thing seems to be to not allow the problem to happen in the first place. I have a basic prototype implementation, built on top of the larger prototype patch. It opportunistically prunes away aborted heap tuples on target/open pages, bypassing the usual pd_prune_xid check we have in heap_page_prune_opt() -- that doesn't make sense with the new form of targeted, abort-oriented pruning. The new pruning mechanism makes a really big difference on its own, without even involving the FSM. 
The best cure for all kinds of bloat is prevention, which this comes pretty close to. I think that this general idea can be pushed further. I've talked about aborts, but what about the commit path? I don't see why we can't make inserting backends responsible for setting hint bits on recently inserted rows -- the cost of putting off that work can only go up, no matter what (unless maybe the user drops the table). Again, the missing piece seems to be a general sense of ownership of heap pages by transactions and/or backends. In order for a backend to be able to clean up its own mess after itself while it's still cheap, it has to understand what that actually means. Don't forget that the TPC-C workload doesn't have that many updates (at least not for the tables I'm focussed on here). Nothing that we tend to think of as an adversarial case for Postgres, except perhaps the updates against the "order" table, which are always non-HOT updates due to an issue with the indexes. That's the extent of it, though -- every other table easily has 99%+ of all updates use HOT (albeit with some heap fill factor tuning). We should *expect* long term stability here, based on the particulars of the workload; we only need to update every "order" row and every "order lines" row exactly once. After that they can just age out of shared_buffers forever (actually they might be read again, but never modified again, but reads were never the problem). The big "order line" table is by far the biggest problem, even though it manages to use HOT for practically all updates. [1] https://github.com/wieck/benchmarksql/commit/66b8db073545026dc76ef513d2b0e318d2f3d5a2 [2] https://apps.dtic.mil/dtic/tr/fulltext/u2/a264793.pdf -- "Table 1: Summary of Logical Database" -- Peter Geoghegan
When I have been thinking about this type of problem, it seems that the latest -- and correct :) -- place which should do all kinds of cleanup like removing aborted tuples, freezing committed tuples and setting any needed hint bits would be the background writer or CHECKPOINT. This would be more PostgreSQL-like, as it moves any work not immediately needed off the critical path, as an extension of how MVCC for PostgreSQL works in general. Another option could be one more background cleaner with a "needs cleanup" bitmap which is updated by "real work" backends to let the cleaner know what needs to be done. But doing it as part of checkpoint probably ends up with fewer WAL writes in the end. That said, CHECKPOINT can efficiently clean up only heap pages, unless we do some extra work to ensure buffer cache eviction ordering so that the heap is (almost) always cleaned first and thus has a chance to also clean the index pointers which point to it. There could also be a possibility to do a small amount of cleanup -- enough for TPC-C-like workloads, but not larger ones -- while waiting for the next command to arrive from the client over the network. This of course assumes that we will not improve our feeder mechanism to have back-to-back incoming commands, which can already be done today, but which I have seldom seen used. Cheers, ----- Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Mon, Sep 6, 2021 at 11:59 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Aug 16, 2021 at 5:15 PM Peter Geoghegan <pg@bowt.ie> wrote: > > This is actually quite simple -- the chaos that follows is where it > > gets truly complicated (and not necessarily worth getting into just > > yet). The update of a given order and its order lines entries takes > > place hours later, within a delivery transaction. > > Update on this: aborted transactions play an important role in the > chaos that we see with BenchmarkSQL/TPC-C. 
> > This is surprising, in one way. Only 1% of all transactions are > aborted (per the TPC-C spec). But in another way it's not surprising. > It's totally consistent with what I've been saying about open and > closed pages. When a page that's initially allocated and opened by a > backend becomes closed following some inserts, the page should be left > in a more or less pristine state -- owning backends that open and > close pages need to be "good citizens". > > The definition of "pristine" must also include "no already-aborted > tuples" -- aborted xact tuples are a problem that we can and should > nip in the bud. This is really an extension of what I've been saying > about not disrupting naturally occurring temporal and spatial locality. > We're far better off if the space reused by freeing aborted heap > tuples is reused by the very next transaction running in the very same > backend. Plus it's not that hard; we can know for sure that this space > is immediately available, which is not generally true of pruning. And > so this free space from aborted xact tuples is quite special. > > Even with the master branch the problems with the FSM are much less > noticeable once you configure BenchmarkSQL to not abort any > transactions [1]. You need to consider the whole picture to understand > how the issue of aborted transactions has such an outsized negative > impact on performance and space efficiency. > > It's a whole *chain* of events: > > * If just 1% of all transactions abort, that's enough to leave an > average of about 1 aborted tuple on every "order" table page (which > naturally has small heap tuples [2]), and about 10 on a minority of > the really big "new order" table's pages. Maybe 1 in 10 or 12 "order > line" table heap pages are affected in this way. > > * The overall impact is that small (but not tiny) pockets of free > space are left randomly strewn all over the place. Not just LP_DEAD > line pointer stubs -- whole entire aborted heap tuples. 
> > * In theory we could address this problem using opportunistic pruning. > But in practice that simply cannot happen today, because we don't > touch the inserted rows until literally hours later. > > The same rows will eventually have updates and selects, but that won't > help -- too little, too late. That's just how opportunistic pruning > works: currently you need some kind of scan node for opportunistic > pruning to kick in, so continually inserting transactions (from new > orders) are currently fundamentally unable to do any opportunistic > pruning. Much less opportunistic pruning that kicks in at exactly the > right time. > > * When an autovacuum worker runs against these tables, it will of > course notice this diffuse free space from aborted tuples, and put it > in the FSM. > > That free space will be reused, but by totally unrelated logical rows. > > What we really seem to need here is some rudimentary form of "eager > physical rollback"; the really important thing seems to be to not > allow the problem to happen in the first place. I have a basic > prototype implementation, built on top of the larger prototype patch. > It opportunistically prunes-away aborted heap tuples on target/open > pages, bypassing the usual pd_prune_xid check we have in > heap_page_prune_opt() -- that doesn't make sense with the new form of > targeted, abort-orientated pruning. The new pruning mechanism makes a > really big difference on its own, without even involving the FSM. The > best cure for all kinds of bloat is prevention, which this comes > pretty close to. > > I think that this general idea can be pushed further. I've talked > about aborts, but what about the commit path? I don't see why we can't > make inserting backends responsible for setting hint bits on recently > inserted rows -- the cost of putting off that work can only go up, no > matter what (unless maybe the user drops the table). 
Again, the > missing piece seems to be a general sense of ownership of heap pages > by transactions and/or backends. In order for a backend to be able to > clean up its own mess after itself while it's still cheap, it has to > understand what that actually means. > > Don't forget that the TPC-C workload doesn't have that many updates > (at least not for the tables I'm focussed on here). Nothing that we > tend to think of as an adversarial case for Postgres, except perhaps > the updates against the "order" table, which are always non-HOT > updates due to an issue with the indexes. That's the extent of it, > though -- every other table easily has 99%+ of all updates use HOT > (albeit with some heap fill factor tuning). We should *expect* long > term stability here, based on the particulars of the workload; we only > need to update every "order" row and every "order lines" row exactly > once. After that they can just age out of shared_buffers forever > (actually they might be read again, but never modified again, but > reads were never the problem). The big "order line" table is by far > the biggest problem, even though it manages to use HOT for practically > all updates. > > [1] https://github.com/wieck/benchmarksql/commit/66b8db073545026dc76ef513d2b0e318d2f3d5a2 > [2] https://apps.dtic.mil/dtic/tr/fulltext/u2/a264793.pdf -- "Table 1: > Summary of Logical Database" > -- > Peter Geoghegan > >
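To make the quoted arithmetic concrete, here is a small illustrative sketch (Python; the tuple sizes are assumed round numbers for illustration, not measurements from TPC-C):

```python
import math

# Back-of-the-envelope model of diffuse abort bloat: with small tuples,
# roughly 100 fit on an 8KB heap page, so a 1% abort rate leaves about
# one aborted tuple on every page.  All sizes are assumptions.

PAGE_SIZE = 8192          # Postgres heap page size, in bytes

def aborted_tuples_per_page(tuple_size, abort_rate, page_size=PAGE_SIZE):
    """Expected number of aborted tuples left on each full heap page."""
    tuples_per_page = page_size // tuple_size
    return tuples_per_page * abort_rate

def pages_with_an_abort(tuple_size, abort_rate, page_size=PAGE_SIZE):
    """Approximate fraction of pages holding at least one aborted tuple
    (Poisson approximation; assumes aborts land independently)."""
    lam = aborted_tuples_per_page(tuple_size, abort_rate, page_size)
    return 1 - math.exp(-lam)
```

With an assumed 80-byte tuple this gives about one aborted tuple per page; with an assumed 600-byte tuple, roughly one page in eight ends up with an abort on it -- the same ballpark as the figures quoted above.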
On Mon, Sep 6, 2021 at 4:33 PM Hannu Krosing <hannuk@google.com> wrote: > When I have been thinking of this type of problem it seems that the > latest -- and correct :) -- place which should do all kinds of > cleanup like removing aborted tuples, freezing committed tuples and > setting any needed hint bits would be background writer or CHECKPOINT. > > This would be more PostgreSQL-like, as it moves any work not > immediately needed from the critical path, as an extension of how MVCC > for PostgreSQL works in general. I think it depends. There is no need to do work in the background here, with TPC-C. With my patch series each backend can know that it just had an aborted transaction that affected a page that it more or less still owns. And has very close at hand, for further inserts. It's very easy to piggy-back the work once you have that sense of ownership of newly allocated heap pages by individual backends/transactions. > This would be more PostgreSQL-like, as it moves any work not > immediately needed from the critical path, as an extension of how MVCC > for PostgreSQL works in general. I think that it also makes sense to have what I've called "eager physical rollback" that runs in the background, as you suggest. I'm thinking of a specialized form of VACUUM that targets a specific aborted transaction's known-dirtied pages. That's my long term goal, actually. Originally I wanted to do this as a way of getting rid of SLRUs and tuple freezing, by representing that all heap pages must only have committed tuples implicitly. That seemed like a good enough reason to split VACUUM into specialized "eager physical rollback following abort" and "garbage collection" variants. The insight that making abort-related cleanup special will help free space management is totally new to me -- it emerged from working directly on this benchmark. But it nicely complements some of my existing ideas about improving VACUUM. 
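As a sketch of the bookkeeping that "eager physical rollback" seems to imply -- names and structures invented here purely for illustration, not taken from the actual prototype -- each transaction remembers the heap pages it inserted into, so that on abort exactly those tuples can be pruned while the pages are still close at hand:

```python
# Illustrative-only sketch: a transaction tracks its dirtied heap pages
# so that an abort can be "physically rolled back" eagerly, rather than
# leaving dead tuples for opportunistic pruning or VACUUM to find later.

class Transaction:
    def __init__(self, xid):
        self.xid = xid
        self.dirtied_pages = set()    # heap page numbers we inserted into

    def record_insert(self, page_no):
        self.dirtied_pages.add(page_no)

def eager_rollback(heap, xact):
    """Prune the aborted transaction's tuples from its known-dirtied
    pages; returns the number of tuples removed."""
    pruned = 0
    for page_no in xact.dirtied_pages:
        page = heap[page_no]          # list of tuples, each with an xmin
        before = len(page)
        page[:] = [t for t in page if t["xmin"] != xact.xid]
        pruned += before - len(page)
    return pruned
```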
> But doing it as part of checkpoint probably ends up with less WAL > writes in the end. I don't think that checkpoints are special in any way. They're very important in determining the total number of FPIs we'll generate, and so have huge importance today. But that seems accidental to me. > There could be a possibility to do a small amount of cleanup -- enough > for TPC-C-like workloads, but not larger ones -- while waiting for the > next command to arrive from the client over the network. This of > course assumes that we will not improve our feeder mechanism to have > back-to-back incoming commands, which can already be done today, but > which I have seen seldom used. That's what I meant, really. Doing the work of cleaning up a heap page that a transaction inserts into (say pruning away aborted tuples or setting hint bits) should ideally happen right after commit or abort -- at least for OLTP like workloads, which are the common case for Postgres. This cleanup doesn't have to be done by exactly the same transactions (and can't be in most interesting cases). It should be quite possible for the work to be done by approximately the same transaction, though -- the current transaction cleans up inserts made by the previous (now committed/aborted) transaction in the same backend (for the same table). The work of setting hint bits and pruning-away aborted heap tuples has to be treated as a logical part of the cost of inserting heap tuples -- backends pay this cost directly. At least with workloads where transactions naturally only insert a handful of rows each in almost all cases -- very much the common case. -- Peter Geoghegan
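The "approximately the same transaction" idea above could look something like this backend-local sketch (Python, invented names, purely illustrative): the next transaction in a backend pays the deferred cleanup cost for the previous one, while the pages are still hot:

```python
# Sketch of per-backend piggy-backed cleanup: pages touched by the
# previous (now committed or aborted) transaction are remembered
# backend-locally, and the next transaction in the same backend sets
# hint bits or prunes them before doing its own inserts.

class Backend:
    def __init__(self):
        self.pending_cleanup = []    # (page_no, outcome) from last xact

    def end_transaction(self, pages, outcome):
        # Defer cleanup of this transaction's pages to the next one.
        self.pending_cleanup = [(p, outcome) for p in pages]

    def begin_transaction(self, set_hint_bits, prune_aborted):
        # Pay the deferred cost up front, while the pages are still hot.
        for page_no, outcome in self.pending_cleanup:
            if outcome == "commit":
                set_hint_bits(page_no)
            else:
                prune_aborted(page_no)
        self.pending_cleanup = []
```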
On Thu, Aug 26, 2021 at 5:59 AM Robert Haas <robertmhaas@gmail.com> wrote: > Andres has been working -- I think for years now -- on replacing the > buffer mapping table with a radix tree of some kind. That strikes me > as very similar to what you're doing here. The per-relation data can > then include not only the kind of stuff you're talking about but very > fundamental things like how long it is and where its buffers are in > the buffer pool. Hopefully we don't end up with dueling patches. FTR I have a patch in development that adds a demand-paged (though Anastasia recently suggested reconsidering that) per-SMGR relation shmem object pool that initially tracks precisely "how long it is", but I hope it would provide a hook to hang many other things on in future that need coordination at the relation level, such as synchronized scan position, and I hope, radix-based buffer mappings. Could be relevant? https://commitfest.postgresql.org/34/2933/ (I'll rebase it soon).
On Mon, Sep 6, 2021 at 7:09 PM Thomas Munro <thomas.munro@gmail.com> wrote: > FTR I have a patch in development that adds a demand-paged (though > Anastasia recently suggested reconsidering that) per-SMGR relation > shmem object pool that initially tracks precisely "how long it is", > but I hope it would provide a hook to hang many other things on in > future that need coordination at the relation level, such as > synchronized scan position, and I hope, radix-based buffer mappings. Cool! > Could be relevant? https://commitfest.postgresql.org/34/2933/ (I'll > rebase it soon). I think that it is relevant. You're coming at the same thing from roughly the opposite direction, at least in my mind. But that seems like it might be a very good thing, given the specifics. I think that we can meet each other in the middle, without there being any particular risk of overlapping work or conflicting requirements. The resource management aspects of my own prototype are the least worked out and shakiest parts overall, so this seems likely. You've probably already solved numerous problems for me. -- Peter Geoghegan
On Tue, Sep 7, 2021 at 2:29 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Sep 6, 2021 at 4:33 PM Hannu Krosing <hannuk@google.com> wrote: > > When I have been thinking of this type of problem it seems that the > > latest -- and correct :) -- place which should do all kinds of > > cleanup like removing aborted tuples, freezing committed tuples and > > setting any needed hint bits would be background writer or CHECKPOINT. > > > > This would be more PostgreSQL-like, as it moves any work not > > immediately needed from the critical path, as an extension of how MVCC > > for PostgreSQL works in general. > > I think it depends. There is no need to do work in the background > here, with TPC-C. With my patch series each backend can know that it > just had an aborted transaction that affected a page that it more or > less still owns. And has very close at hand, for further inserts. It's > very easy to piggy-back the work once you have that sense of ownership > of newly allocated heap pages by individual backends/transactions. Are you speaking of just heap pages here or also index pages? It seems indeed easy for heap, but index pages can be mixed up by other parallel work, especially things like Serial Primary Keys. Or are you expecting these to be kept in good-enough shape by your earlier index manager work? > > This would be more PostgreSQL-like, as it moves any work not > > immediately needed from the critical path, as an extension of how MVCC > > for PostgreSQL works in general. > > I think that it also makes sense to have what I've called "eager > physical rollback" that runs in the background, as you suggest. > > I'm thinking of a specialized form of VACUUM that targets a specific > aborted transaction's known-dirtied pages. That's my long term goal, > actually. Originally I wanted to do this as a way of getting rid of > SLRUs and tuple freezing, by representing that all heap pages must > only have committed tuples implicitly. 
That seemed like a good enough > reason to split VACUUM into specialized "eager physical rollback > following abort" and "garbage collection" variants. > > The insight that making abort-related cleanup special will help free > space management is totally new to me -- it emerged from working > directly on this benchmark. But it nicely complements some of my > existing ideas about improving VACUUM. A minimal useful patch emerging from that understanding could be something which just adds hysteresis to FSM management. (TBH, I actually kind of expected some hysteresis to be there already, as it is in my mental model of "how things should be done" for managing almost any resource :) ) Adding hysteresis to FSM management can hopefully be done independent of all the other stuff and also seems to be something that is unobtrusive and non-controversial enough to fit in current release and possibly be even back-ported. > > But doing it as part of checkpoint probably ends up with less WAL > writes in the end. > > I don't think that checkpoints are special in any way. They're very > important in determining the total number of FPIs we'll generate, and > so have huge importance today. But that seems accidental to me. I did not mean CHECKPOINT as a command, but more the concept of writing back / un-dirtying the page. In this sense it *is* special because it is the last point in time where you are guaranteed to have the page available in buffercache and thus cheap to access for modifications, plus you will avoid a second full-page writeback because of cleanup. Also you do not want to postpone the cleanup to actual page eviction, as that is usually in the critical path for some user query or command. Of course this becomes most important for workloads where the active working set is larger than fits in memory, and this is not a typical case for OLTP any more. 
But at least freezing the page before write-out could have a very big impact on the need to freeze "old" pages available only on disk and thus would be a cheap way to improve the problems around running out of transaction ids. > > There could be a possibility to do a small amount of cleanup -- enough > > for TPC-C-like workloads, but not larger ones -- while waiting for the > > next command to arrive from the client over the network. This of > > course assumes that we will not improve our feeder mechanism to have > > back-to-back incoming commands, which can already be done today, but > > which I have seen seldom used. > > That's what I meant, really. Doing the work of cleaning up a heap page > that a transaction inserts into (say pruning away aborted tuples or > setting hint bits) should ideally happen right after commit or abort > -- at least for OLTP like workloads, which are the common case for > Postgres. This cleanup doesn't have to be done by exactly the same > transactions (and can't be in most interesting cases). It should be > quite possible for the work to be done by approximately the same > transaction, though -- the current transaction cleans up inserts made > by the previous (now committed/aborted) transaction in the same > backend (for the same table). Again, do I assume correctly that you are here mainly targeting heap and not indexes? > The work of setting hint bits and pruning-away aborted heap tuples has > to be treated as a logical part of the cost of inserting heap tuples > -- backends pay this cost directly. At least with workloads where > transactions naturally only insert a handful of rows each in almost > all cases -- very much the common case. I know that in general we are very reluctant to add threads to PostgreSQL, but this looks like a very valid case for having a "microthread" running on the same core as the DML as it will share all the CPU caches and can thus be really cheap without being actually in the critical path. 
Cheers, ----- Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
On Mon, Sep 6, 2021 at 8:29 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Mon, Sep 6, 2021 at 4:33 PM Hannu Krosing <hannuk@google.com> wrote: > > When I have been thinking of this type of problem it seems that the > > latest -- and correct :) -- place which should do all kinds of > > cleanup like removing aborted tuples, freezing committed tuples and > > setting any needed hint bits would be background writer or CHECKPOINT. > > > > This would be more PostgreSQL-like, as it moves any work not > > immediately needed from the critical path, as an extension of how MVCC > > for PostgreSQL works in general. > > I think it depends. There is no need to do work in the background > here, with TPC-C. With my patch series each backend can know that it > just had an aborted transaction that affected a page that it more or > less still owns. And has very close at hand, for further inserts. It's > very easy to piggy-back the work once you have that sense of ownership > of newly allocated heap pages by individual backends/transactions. Doing work in the background has some advantages, though. In particular, it has the possibly-large advantage of not slowing down foreground work. For me the key insight here is that HOT-pruning a heap page is a lot cheaper if you do it before you write the page. Once you've written the page, the eventual HOT-prune is going to need to dirty it, which will cause it to be written again. If you prune before writing it the first time, that cost is avoided. I'm not sure that it really matters whether the space consumed by aborted tuples gets reused by "the very next transaction" or, say, 10 transactions after that, or even 1000 transactions after that. If you wait for a million transactions, you've quite possibly lost enough temporality to matter, but at 10 or 1000 that's much less likely. 
The exact threshold is fuzzy: every moment you wait makes it less likely that you have sufficient locality, but you can always find a workload where even a very long wait is acceptable, and another one where even a tiny delay is catastrophic, and it's hard to say what the "real world" looks like. On the other hand, there's nothing fuzzy about the expense incurred by writing the page before it's HOT-pruned. That is essentially certain to incur an extra page write, except in the corner case where the relation gets dropped or truncated before then. So I think you might find that if you found a way to ensure that HOT-pruning -- and entry into the FSM -- always happens for every heap page just before it's written, if it hasn't already happened sooner and might be needed, you might end up in a pretty good spot. It wouldn't even be ignoring temporal locality, since at minimum dirty pages are written once per checkpoint cycle. -- Robert Haas EDB: http://www.enterprisedb.com
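The write-amplification argument can be reduced to a trivial model (illustrative only): a page that is pruned while still dirty is written once, while a page pruned after its first write-out must be dirtied and written again:

```python
def page_writes(prune_before_write):
    """Writes incurred by a freshly filled heap page, depending on
    whether HOT-pruning happens before or after its first write-out.
    (Toy model; ignores checkpoint cycles, FPIs, and drops/truncates.)"""
    writes = 1                   # first write-out of the dirty page;
                                 # if pruned beforehand, cleanup rides along
    if not prune_before_write:
        writes += 1              # a later prune re-dirties the page,
                                 # forcing a second write
    return writes
```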
On Tue, Sep 7, 2021 at 12:31 PM Robert Haas <robertmhaas@gmail.com> wrote: > Doing work in the background has some advantages, though. In > particular, it has the possibly-large advantage of not slowing down > foreground work. What I really like about the idea of doing the work in the foreground (when it's clearly not going to be too much of a latency hit) is that it would act as a natural form of backpressure. Sometimes slowing things down is a good thing -- fundamentally inefficient and unsustainable use of the system's resources ought to be slowed down by imposing the cost on the transaction that performs the relevant insert queries. To me this seems like a problem of negative externalities. With small transactions we can pretty much know for sure that the benefit of not leaving pages in good shape at the moment they become "closed" is fixed and low, while the cost to the system as a whole over time is potentially far higher. It seems almost unbounded, in a way, so we're bound to lose in the long run, regardless of workload. (The work can be piggybacked across transactions to make it more efficient, but that's mostly an implementation detail.) Updaters might actually be much less of a problem than inserters. They already keep costs down by pruning opportunistically -- they at least try. Most individual logical rows that are inserted have exactly zero chance of ever being updated, but a ~100% chance of needing to have hint bits set sooner or later. We know very little about workload characteristics, but these are some of the few things that we really do know for sure. > For me the key insight here is that HOT-pruning a heap page is a lot > cheaper if you do it before you write the page. Once you've written > the page, the eventual HOT-prune is going to need to dirty it, which > will cause it to be written again. If you prune before writing it the > first time, that cost is avoided. Right. 
The difficulty with applying that insight before now has been making the prune operation take place at the level of whole pages systematically. I believe that it's quite valuable to set all the hint bits on a page as we "close it out" for the first time. But setting hint bits for all of the heap tuples on the page except one or two is pretty much useless. You might as well not bother at all. It follows that concurrent inserters clobbering the same page senselessly are more or less not okay if we're going to pursue this technique. You might still have a concurrent update of one of the inserted rows shortly after the original inserting transaction commits, but before the page is closed -- that might make the idea not work out for that one page (obviously it depends on the specifics). That seems like something we can live with, though. > I'm not sure that it really matters > whether the space consumed by aborted tuples gets reused by "the very > next transaction" or, say, 10 transactions after that, or even 1000 > transactions after that. If you wait for a million transactions, > you've quite possibly lost enough temporality to matter, but at 10 or > 1000 that's much less likely. I didn't mean to suggest that it had to happen in perfect lockstep. But I do think that it should be pretty close to perfect. What I actually do right now is prune an open page when it *appears* to be full inside the loop in RelationGetBufferForTuple(). As crude as that is, it sets hint bits and prunes away aborted transactions in mostly the right way, at least as far as TPC-C is concerned. (This only works because it fits into the wider framework of my much larger patch, which introduces ownership semantics, open and closed pages, empty page freelists, etc.) 
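A toy model of the open/closed page life cycle described above (Python, invented structure -- the actual patch hooks this into the loop in RelationGetBufferForTuple()): when an insert no longer fits, the page is cleaned up once, as a whole, and closed:

```python
# Illustrative sketch of closing out an "open" heap page: the moment the
# page appears full, prune away aborted tuples and treat the whole page
# as hinted, leaving it in a pristine state.

PAGE_SIZE = 8192

class OpenPage:
    def __init__(self):
        self.tuples = []        # [size, aborted?] pairs
        self.used = 0
        self.closed = False

    def try_insert(self, size, aborted=False):
        if self.used + size > PAGE_SIZE:
            self.close()        # page *appears* full: clean it up now
            return False
        self.tuples.append([size, aborted])
        self.used += size
        return True

    def close(self):
        # Leave the page pristine: prune aborted tuples in one pass
        # (whole-page, all-or-nothing cleanup is what pays off).
        self.tuples = [t for t in self.tuples if not t[1]]
        self.used = sum(t[0] for t in self.tuples)
        self.closed = True
```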
> The exact threshold is fuzzy: every > moment you wait makes it less likely that you have sufficient > locality, but you can always find a workload where even a very long > wait is acceptable, and another one where even a tiny delay is > catastrophic, and it's hard to say what the "real world" looks like. I agree that offloading work in the background can be very useful, and that technique needs to be possible as future work (I actually really want this myself). But most individual apps probably won't really notice the latency hit from pruning if it's well managed -- as with pruning by updates today. We can make a decision about which strategy to use (lightweight foreground processing with a potential latency hit, or background batch cleanup?) at a very late point in the cycle, based on the specifics in each case. Offhand, I don't think that we need to characterize this fuzzy threshold in too much detail. I suspect that the main determinative question can be simple enough. Something like: Does this transaction (or this group of transactions) involve only trivial changes to maybe 3 - 5 heap pages at the most, or not? There is a separate question about what happens if our background cleanup process cannot keep up over time. I think that we can handle that by falling back on foreground processing (within reason). Let's say we have transactions that each insert on to dozens or perhaps even hundreds of heap pages each -- these are "medium sized'' write transactions. If we're already doing way too much background processing for similar transactions from earlier on, and if it's clear we just can't keep up, maybe we can fall back and force foreground processing, to keep the debt under control. (In general I think that this back pressure, negative externalities business will be important.) It seems to me that this leaves one harder question unanswered: at what point does a "medium sized" transaction become so large that it just doesn't make sense to do either? 
What's the crossover point at which background processing and foreground processing like this should be assumed to be not worth it? That I won't speculate about just yet. I suspect that at some point it really does make sense to leave it all up to a true table-level batch operation, like a conventional VACUUM. -- Peter Geoghegan
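One illustrative way the determinative question might be expressed (the thresholds are invented placeholders, not proposals):

```python
# Sketch of the strategy choice discussed above: tiny transactions clean
# up in the foreground, "medium" ones go to background cleanup (falling
# back to foreground when the background work can't keep up), and
# anything larger is left to a table-level VACUUM.

SMALL_PAGES = 5           # "3 - 5 heap pages at the most"
MEDIUM_PAGES = 200        # beyond this, leave it all to VACUUM
BACKLOG_LIMIT = 1000      # pages of pending background work we tolerate

def cleanup_strategy(pages_dirtied, background_backlog):
    if pages_dirtied <= SMALL_PAGES:
        return "foreground"
    if pages_dirtied <= MEDIUM_PAGES:
        # Back-pressure: if background cleanup is drowning, make the
        # writing transaction pay the cost itself.
        if background_backlog > BACKLOG_LIMIT:
            return "foreground"
        return "background"
    return "vacuum"
```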
On Tue, Sep 7, 2021 at 5:25 AM Hannu Krosing <hannuk@google.com> wrote: > Are you speaking of just heap pages here or also index pages? Mostly heap pages, but FWIW I think it could work for index tuples too, with retail index tuple deletion. Because that allows you to even remove would-be LP_DEAD item pointers. > Or are you expecting these to be kept in good-enough shape by your > earlier index manager work? It's very workload dependent. Some things were very much improved by bottom-up index deletion in Postgres 14, for example (non-hot updates with lots of logically unchanged indexes). Other things weren't helped at all, or were barely helped. I think it's important to cover all cases. > A minimal useful patch emerging from that understanding could be > something which just adds hysteresis to FSM management. (TBH, I > actually kind of expected some hysteresis to be there already, as it > is in my mental model of "how things should be done" for managing > almost any resource :) ) I think that you need to do the FSM before the aborted-heap-tuple cleanup. Otherwise you don't really know when or where to apply the special kind of pruning that the patch invents, which targets aborts specifically. > Adding hysteresis to FSM management can hopefully be done independent > of all the other stuff and also seems to be something that is > unobtrusive and non-controversial enough to fit in current release and > possibly be even back-ported. I don't know about that! Seems kind of an invasive patch to me. > I did not mean CHECKPOINT as a command, but more the concept of > writing back / un-dirtying the page. In this sense it *is* special > because it is the last point in time where you are guaranteed to have > the page available in buffercache and thus cheap to access for > modifications, plus you will avoid a second full-page writeback because > of cleanup. 
Also you do not want to postpone the cleanup to actual > page eviction, as that is usually in the critical path for some user > query or command. I intend to do some kind of batching, but only at the level of small groups of related transactions. Writing a page that was quickly opened for new inserts, filled with newly inserted heap tuples, then closed, and finally cleaned up doesn't seem like it needs to take any direct responsibility for writeback. -- Peter Geoghegan
On Wed, Sep 8, 2021 at 6:52 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Tue, Sep 7, 2021 at 5:25 AM Hannu Krosing <hannuk@google.com> wrote: > > Are you speaking of just heap pages here or also index pages? > > Mostly heap pages, but FWIW I think it could work for index tuples > too, with retail index tuple deletion. Because that allows you to even > remove would-be LP_DEAD item pointers. But it does need info from the heap page(s) the index entry points to, which can become quite expensive if that heap page is locked by other processes or even not cached at all. Also, parallel processes may have moved the index entry to other pages, etc. > > Or are you expecting these to be kept in good-enough shape by your > > earlier index manager work? > > It's very workload dependent. Some things were very much improved by > bottom-up index deletion in Postgres 14, for example (non-hot updates > with lots of logically unchanged indexes). Other things weren't helped > at all, or were barely helped. I think it's important to cover all > cases. > > > A minimal useful patch emerging from that understanding could be > > something which just adds hysteresis to FSM management. (TBH, I > > actually kind of expected some hysteresis to be there already, as it > > is in my mental model of "how things should be done" for managing > > almost any resource :) ) > > I think that you need to do the FSM before the aborted-heap-tuple > cleanup. Otherwise you don't really know when or where to apply the > special kind of pruning that the patch invents, which targets aborts > specifically. An alternative is to not actually prune the page at all at this time (to avoid re-dirtying it and/or a WAL write) but just inspect it and add the *potential* free space to the FSM. And then do the pruning at the time of the next INSERT or UPDATE, when the page is dirtied and WAL written anyway. 
> > Adding hysteresis to FSM management can hopefully be done independent > > of all the other stuff and also seems to be something that is > > unobtrusive and non-controversial enough to fit in current release and > > possibly be even back-ported. > > I don't know about that! Seems kind of an invasive patch to me. Why do you think that changing the calculation when a page is added back to FSM is "kind of invasive"? Not having looked at the code I would assume from the discussion here that we remove the page from FSM when free space goes above FILLFACTOR and we add it back when it goes back below FILLFACTOR. The code change would just change the "add it back" code to use FILLFACTOR - HYSTERESIS_FACTOR instead. What other parts does this need to touch? Or is it just that the calculation source code is not localised well in code and written multiple times all over the place? > > I did not mean CHECKPOINT as a command, but more the concept of > > writing back / un-dirtying the page. In this sense it *is* special > > because it is the last point in time where you are guaranteed to have > > the page available in buffercache and thus cheap to access for > > modifications, plus you will avoid a second full-page writeback because > > of cleanup. Also you do not want to postpone the cleanup to actual > > page eviction, as that is usually in the critical path for some user > > query or command. > > I intend to do some kind of batching, but only at the level of small > groups of related transactions. Writing a page that was quickly opened > for new inserts, filled with newly inserted heap tuples, then closed, > and finally cleaned up doesn't seem like it needs to take any direct > responsibility for writeback. Agreed that this code does not need to care about writeback. 
I approached it from the other end and proposed "just before writeback" as a strategy which is easy to understand conceptually and is also mostly "already there" codewise -- that is, there is already a location in the write-back code where the call to cleanup can be plugged in. Cheers ----- Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested.
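For what it's worth, the hysteresis rule discussed in the exchange above can be stated in a few lines (Python, expressed in terms of a page's used fraction; the constants and names are invented for illustration, not taken from the FSM code):

```python
# Minimal sketch of FSM hysteresis: a page stops being an insertion
# target once it fills past FILLFACTOR, and -- the proposed change --
# only becomes a target again when enough space opens up to drop it
# below FILLFACTOR - HYSTERESIS_FACTOR, rather than the instant it
# dips back under FILLFACTOR.

FILLFACTOR = 0.90
HYSTERESIS_FACTOR = 0.20

class FsmEntry:
    def __init__(self):
        self.used = 0.0          # fraction of the page in use
        self.target = True       # advertised in the FSM?

    def update(self, used):
        self.used = used
        if self.target and used >= FILLFACTOR:
            self.target = False          # page is full: stop advertising
        elif not self.target and used < FILLFACTOR - HYSTERESIS_FACTOR:
            self.target = True           # real room opened up: advertise
        return self.target
```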
On Tue, Sep 7, 2021 at 9:37 PM Peter Geoghegan <pg@bowt.ie> wrote:
> What I really like about the idea of doing the work in the foreground
> (when it's clearly not going to be too much of a latency hit) is that
> it would act as a natural form of backpressure.

Right, that's fair. But prune-before-evict can be done in the first
instance by the bgwriter, and then by the foreground process doing the
eviction if the bgwriter doesn't keep up. At least, assuming we had a
bgwriter that worked, which I think right now we don't -- but fixing
that might be a good idea anyway.

> I didn't mean to suggest that it had to happen in perfect lockstep.
> But I do think that it should be pretty close to perfect. What I
> actually do right now is prune an open page when it *appears* to be
> full inside the loop in RelationGetBufferForTuple().

That seems like a good idea.

> As crude as that
> is, it sets hint bits and prunes away aborted transactions in mostly
> the right way, at least as far as TPC-C is concerned. (This only works
> because it fits into the wider framework of my much larger patch,
> which introduces ownership semantics, open and closed pages, empty
> page freelists, etc.)

I don't know; I'm not really convinced that "much larger patches" that
change a lot of loosely related things all at once are good for the
project. It seems to me that there's a reasonably good chance of
replacing an annoying set of problems that existing PostgreSQL users
have worked around to some degree, knowingly or unknowingly, with a
different annoying set of problems that may cause fewer or more
problems in practice. Sometimes there's no way to improve something
short of a giant project that changes a lot of things at the same
time, but a series of incremental changes is a lot less risky.

> It seems to me that this leaves one harder question unanswered: at
> what point does a "medium sized" transaction become so large that it
> just doesn't make sense to do either?
> What's the crossover point at
> which background processing and foreground processing like this should
> be assumed to be not worth it? That I won't speculate about just yet.
> I suspect that at some point it really does make sense to leave it all
> up to a true table-level batch operation, like a conventional VACUUM.

I doubt it makes sense to define a limit here explicitly. At some
point strategies will naturally start to fail; e.g. prune-before-evict
won't work once the operation becomes large enough that pages have to
be evicted while the transaction is still running.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Sep 8, 2021 at 8:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > I didn't mean to suggest that it had to happen in perfect lockstep.
> > But I do think that it should be pretty close to perfect. What I
> > actually do right now is prune an open page when it *appears* to be
> > full inside the loop in RelationGetBufferForTuple().
>
> That seems like a good idea.

> I don't know, I'm not really convinced that "much larger patches" that
> change a lot of loosely related things all at once are good for the
> project. It seems to me that there's a reasonably good chance of
> replacing an annoying set of problems that existing PostgreSQL users
> have worked around to some degree, knowingly or unknowingly, with a
> different annoying set of problems that may cause fewer or more
> problems in practice. Sometimes there's no way to improve something
> short of a giant project that changes a lot of things at the same
> time, but a series of incremental changes is a lot less risky.

But these things are *highly* related.

The RelationGetBufferForTuple() prune mechanism I described (which
targets aborted-xact tuples and sets hint bits) is fundamentally built
on top of the idea of ownership of heap pages by backends/transactions
-- that was what I meant before. We *need* to have context. This isn't
an ordinary heap prune -- it doesn't have any of the prechecks to avoid
useless pruning that you see at the top of heap_page_prune_opt(). It's
possible that we won't be able to get a super-exclusive lock in the
specialized prune code path, but that's considered a rare corner case.
There is no question of concurrent inserters senselessly blocking the
prune, which is not at all true with the current approach to free
space management. So there is no way I could extract a minimal "prune
inside RelationGetBufferForTuple()" patch that would actually work.
Systems that follow ARIES closely and have UNDO *must* treat free
space as a qualitative thing -- something that is meaningful only with
associated information about a deleting or inserting transaction and
its status. There is logical UNDO for the free space management
structure, and even getting free space from a page can involve
heavyweight locking. Postgres works differently, but there is no
reason why Postgres should not do a lightweight, approximate version
of the same thing: the laws of physics favor carefully grouping
logically related data, and working to keep the physical database
representation as clean a representation of the logical database as
possible, right from the start.

> > It seems to me that this leaves one harder question unanswered: at
> > what point does a "medium sized" transaction become so large that it
> > just doesn't make sense to do either? What's the crossover point at
> > which background processing and foreground processing like this should
> > be assumed to be not worth it? That I won't speculate about just yet.
> > I suspect that at some point it really does make sense to leave it all
> > up to a true table-level batch operation, like a conventional VACUUM.
>
> I doubt it makes sense to define a limit here explicitly. At some
> point strategies will naturally start to fail, e.g. prune-before-evict
> won't work once the operation becomes large enough that pages have to
> be evicted while the transaction is still running.

Perhaps. As you know, I'm generally in favor of letting things fail
naturally, and then falling back on an alternative strategy.

--
Peter Geoghegan
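The control flow of the prune-on-apparent-full heuristic discussed in this exchange can be sketched roughly as follows. Everything here is hypothetical -- the `OpenPage` struct, its fields, and `fit_tuple_on_open_page` do not exist in PostgreSQL. The sketch captures only the decision logic described above, not real buffer locking, WAL logging, or the actual pruning machinery.

```c
#include <stdbool.h>

/*
 * Hypothetical model of a backend-owned "open" heap page.  Because
 * the owning backend is the only inserter, it has the context to
 * justify an opportunistic prune without the usual
 * heap_page_prune_opt()-style prechecks.
 */
typedef struct OpenPage
{
    int  free_bytes;           /* usable free space on the page */
    int  aborted_tuple_bytes;  /* space held by aborted-xact tuples */
    bool super_exclusive_ok;   /* stand-in for: lock attempt succeeds */
} OpenPage;

/*
 * Try to fit a tuple on the backend's open page.  If the page
 * appears full, first try pruning away aborted-transaction tuples,
 * then re-check.  Returns false only when the caller must move on
 * to a new page.
 */
static bool
fit_tuple_on_open_page(OpenPage *page, int tuple_len)
{
    if (page->free_bytes >= tuple_len)
        return true;            /* fits without any work */

    /*
     * Page appears full: prune if we can get the (super-exclusive)
     * lock.  With page ownership there are no concurrent inserters,
     * so failing to get the lock is a rare corner case.
     */
    if (page->super_exclusive_ok && page->aborted_tuple_bytes > 0)
    {
        page->free_bytes += page->aborted_tuple_bytes;
        page->aborted_tuple_bytes = 0;
        return page->free_bytes >= tuple_len;
    }

    return false;               /* caller opens a new page */
}
```

The point of the sketch is the dependency Peter describes: the prune step is only safe to attempt unconditionally because ownership guarantees no concurrent inserters, which is why it can't be extracted as a standalone patch against the current FSM.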
On Wed, Sep 8, 2021 at 12:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
> But these things are *highly* related.
>
> The RelationGetBufferForTuple() prune mechanism I described (that
> targets aborted xact tuples and sets hint bits) is fundamentally built
> on top of the idea of ownership of heap pages by backends/transactions
> -- that was what I meant before. We *need* to have context. This isn't
> an ordinary heap prune -- it doesn't have any of the prechecks to avoid
> useless pruning that you see at the top of heap_page_prune_opt(). It's
> possible that we won't be able to get a super-exclusive lock in the
> specialized prune code path, but that's considered a rare corner case.
> There is no question of concurrent inserters senselessly blocking the
> prune, which is not at all true with the current approach to free
> space management. So there is no way I could extract a minimal "prune
> inside RelationGetBufferForTuple()" patch that would actually work.

I'm not trying to argue for slimming down your patches to a size that
is so small that they no longer work. However, I *am* arguing that,
like bottom-up index deletion and the emergency vacuum mechanism, this
change sounds like something that could *easily* have unforeseen
consequences, and that therefore a lot of caution is needed. Part of
that caution is not changing more things at the same time than is
really necessary, and thinking *hard* about how much change is
*really* necessary. It's very easy to convince oneself that everything
is connected to everything else, and therefore that we have to change
a lot of things all at once, but that's often not true.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Sep 8, 2021 at 9:35 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not trying to argue for slimming down your patches to a size that
> is so small that they no longer work.

I was just explaining how the "eager physical rollback by pruning"
thing works.

> However, I *am* arguing that, like bottom-up index deletion and the
> emergency vacuum mechanism, this change sounds like something that
> could *easily* have unforeseen consequences. And therefore a lot of
> caution is needed.

I agree that a lot of caution is needed. One can justify aggressive
intervention when the likely alternative is a severe and irreversible
adverse event, like a version-driven page split. On the other hand, we
really should be very conservative when that isn't what we see.
Context is very important.

> And part of that caution is not changing more
> things at the same time than is really necessary. And that it's worth
> thinking *hard* about how much change is *really* necessary.

For the record, bottom-up index deletion involved very little code in
the grand scheme of things. Excluding documentation changes and
comments, it is probably less than 1000 lines of C. The emergency
mechanism (considered separately from the refactoring work) also had
very little code. So I don't know what you mean about changing lots of
things all at once. Huge big-bang style patches really aren't my style
at all; don't expect that to change.

> It's very easy to convince oneself that everything is connected to
> everything else and therefore we have to change a lot of things all at
> once, but that's often not true.

There is compelling empirical evidence for my "chain of events"
explanation. You can make things much better for BenchmarkSQL on the
master branch today, either by artificially disabling the FSM
altogether (which hurts a little but helps a lot), *or* by configuring
the benchmark to not abort transactions in the first place (which
effectively simulates eager physical rollback of aborted xacts).
I don't want to change very much. Despite everything I've said, I
think it's true that the system does the right thing in most
situations -- even with BenchmarkSQL/TPC-C! I just find it useful to
focus on the exceptions: the cases where the system clearly does the
wrong thing.

--
Peter Geoghegan