Re: The Free Space Map: Problems and Opportunities - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: The Free Space Map: Problems and Opportunities
Msg-id: CAH2-Wzk2dU7xnKrdfs0O_eg=uXdBBC9XbwBM4yk5jXLRht4xyQ@mail.gmail.com
In response to: Re: The Free Space Map: Problems and Opportunities (Bruce Momjian <bruce@momjian.us>)
Responses: Re: The Free Space Map: Problems and Opportunities
List: pgsql-hackers
On Mon, Aug 16, 2021 at 3:49 PM Bruce Momjian <bruce@momjian.us> wrote:
> OK, here is my feedback. First, I understand the space
> reservation/soccer idea, but I am also concerned that adding space
> reservation could improve some things and make others worse.

That is definitely a legitimate concern. There is a trade-off here: if we do too much preallocation, there may be cases where the preallocated space that we expected to be used quickly is never used by anybody.

But that doesn't seem like a problem, provided we don't lose track of the fact that it happened. Then the worst that can happen is that we have empty pages that nobody will ever use, because nobody inserts into the table ever again (not the backend with the leaked space lease, not anybody else). That seems manageable. We can add some kind of ramp-up behavior that gets more and more aggressive as demand for new space from backends is steady or increasing.

> Second, let's look at a concrete example, and see how it can be improved
> more simply. As I understand it, there are three operations that need
> free space:
>
> 1. INSERT/COPY
> 2. UPDATE where there is insufficient space on the page of the
>    old row
> 3. UPDATE where there is sufficient page space

Right.

> The third option already can use the existing page for a HOT update, so
> the FSM doesn't matter.

I agree. All good questions. Thank you for diving in on this.

> For 1 & 2, suppose you have a table with 10 8k
> pages, all 80% full. As I understand it, the first page will go from
> 80% to 81%, then the next row addition will take the second page from
> 80% to 81%, until all pages are 81%, and then it starts over again. Is
> that accurate?

No. Generally backends have their own target block (a simple BlockNumber) that they store in the relcache, one per recently modified table. Backends/heapam use RelationGetTargetBlock() to access this local cache. So there is "backend affinity" for particular individual heap pages, even today.
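To make the target-block behavior concrete, here is a toy model in plain Python. Every name here (Backend, FreeSpaceMap, PAGE_CAPACITY) is invented for illustration -- the real mechanism lives in PostgreSQL's relcache/heapam code, behind RelationGetTargetBlock() -- but the sketch shows both the per-backend cache and the interleaving problem discussed below:

```python
# Toy model of the per-backend "target block" cache. All names are
# invented for illustration; this is not PostgreSQL's actual code.

PAGE_CAPACITY = 4  # rows per page, scaled way down for the example


class FreeSpaceMap:
    """Shared structure: tracks how many rows each heap block holds."""
    def __init__(self):
        self.pages = []  # pages[i] = number of rows stored in block i

    def find_block_with_space(self):
        for blockno, used in enumerate(self.pages):
            if used < PAGE_CAPACITY:
                return blockno
        self.pages.append(0)          # "relation extension"
        return len(self.pages) - 1


class Backend:
    """Each backend privately caches the block it last inserted into."""
    def __init__(self, fsm):
        self.fsm = fsm
        self.target_block = None      # analogous to the relcache target block

    def insert_row(self):
        blk = self.target_block
        if blk is None or self.fsm.pages[blk] >= PAGE_CAPACITY:
            blk = self.fsm.find_block_with_space()   # consult the shared FSM
            self.target_block = blk                  # remember it locally
        self.fsm.pages[blk] += 1
        return blk


fsm = FreeSpaceMap()
a, b = Backend(fsm), Backend(fsm)
# Both backends consult the FSM around the same time: both are handed
# block 0, so their (logically unrelated) rows interleave on one page.
blocks_a = [a.insert_row() for _ in range(2)]
blocks_b = [b.insert_row() for _ in range(2)]
print(blocks_a, blocks_b)  # both backends land on block 0
```

The local cache gives each backend page affinity, but nothing stops two backends from caching the *same* page -- which is exactly the shared-resource problem described next.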
However, that still has many problems. It doesn't make sense to have a local cache for a shared resource -- that's the problem. You actually need some kind of basic locking or lease system, so that 10 backends don't all decide at the same time that one particular heap block is fully empty, and therefore a good target block for that backend alone. It's as if the backends believe that they're special little snowflakes, and that no other backend could possibly be thinking the same thing at the same time about the same heap page. And so when TPC-C does its initial bulk insert, distinct orders are already shuffled together in a hodge-podge, just because concurrent bulk inserters all insert on the same heap pages.

If we could safely give out 10 or 50 pages directly to each backend, then clearly the related data would be stored together in this scenario -- each backend would be able to work through its lease of contiguous heap pages for quite a while before revisiting the FSM/relation extension. The trick is to teach the implementation to do that without allowing the backend to permanently leak entire leases with maybe dozens of free pages.

Systems like DB2 and Oracle probably can't have this problem. Even the bulk lease case is just an extension of something they had to do anyway. You see, some kind of FSM lease system that knows about transaction lifetime for any given piece of free space is strictly necessary with UNDO-based rollback. Without that, transaction rollback breaks in certain edge cases involving concurrent inserts into the same page, right before a rollback that needs to put an updated version back in place. If the updated version from undo happens to be physically larger than the now-aborted successor version, and if there is little or no space left to cover that, what can rollback do about it? Nothing. It's a nasty bug. They use heavyweight locks to avoid this.

> Is that the non-temporal memory issue you are concerned
> about?
When I talk about memory or time, what I'm usually referring to is the ability to manage larger allocations of multiple blocks for a while after they're originally requested, so that a requesting transaction/backend is given the opportunity to use the space that it asked for. Backends shouldn't be completely trusted, though. We should trust, but verify.

> Doesn't spreading the new rows out increase the ability to do
> HOT updates?

It's true that the problem scenario with TPC-C does involve updates, and it's true that that's when things really go badly. But I deliberately haven't gone into the role that the updates play there too much. Not just yet.

It's important to note that the problems really do start when we insert, even if that's not obvious -- that's when the locality is lost, right when the original order transaction comes in. We lose locality just because we have concurrent inserters into the same table, where each transaction inserts multiple related rows. We fail to group related rows together right from the start, setting us up for failure later on. This is actually quite simple -- the chaos that follows is where it gets truly complicated (and not necessarily worth getting into just yet). The update of a given order and its order lines entries takes place hours later, within a delivery transaction.

Another concern is knock-on effects *after* any initial problematic updates -- that's certainly not where it ends. Every action has consequences, and these consequences themselves have consequences. By migrating a row from an earlier block into a later block (both physically and temporally early/late), we're not just changing things for that original order row or rows (the order non-HOT-updated by the delivery transaction). Since the old/original version left behind by a delivery transaction will itself get vacuumed eventually, that space goes in the FSM eventually. And so this same block will get some entirely unrelated version (could be a new insert or update).
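The "trust but verify" idea for leases can be sketched as follows. This is a hypothetical toy, not a proposed design: the manager grants a contiguous run of blocks, remembers every grant, and reclaims whatever a backend never fills -- so even a backend that disappears cannot permanently leak its lease. The 10-block lease size and all names are invented:

```python
# Toy lease manager: grants contiguous block ranges to backends, tracks
# every grant, and reclaims whatever a backend never used. Everything
# here is hypothetical illustration, not PostgreSQL code.

LEASE_SIZE = 10  # blocks granted per request (the "10 or 50 pages" above)


class LeaseManager:
    def __init__(self):
        self.next_block = 0
        self.grants = {}        # backend_id -> set of still-unused blocks
        self.reclaimed = []     # blocks recovered from abandoned leases

    def acquire(self, backend_id):
        """Grant a contiguous run of blocks, reusing reclaimed ones first."""
        if len(self.reclaimed) >= LEASE_SIZE:
            lease = [self.reclaimed.pop() for _ in range(LEASE_SIZE)]
        else:
            lease = list(range(self.next_block, self.next_block + LEASE_SIZE))
            self.next_block += LEASE_SIZE
        self.grants[backend_id] = set(lease)
        return lease

    def mark_used(self, backend_id, block):
        """Backend reports that it actually filled this block."""
        self.grants[backend_id].discard(block)

    def verify_and_reclaim(self, backend_id):
        """Trust, but verify: recover whatever the backend never filled."""
        unused = sorted(self.grants.pop(backend_id, set()))
        self.reclaimed.extend(unused)
        return unused


mgr = LeaseManager()
lease = mgr.acquire("backend-1")       # blocks 0..9, all for one backend
for block in lease[:3]:                # backend only ever fills 3 of them
    mgr.mark_used("backend-1", block)
# The backend goes away (crash, idle, whatever) without using the rest:
recovered = mgr.verify_and_reclaim("backend-1")
print(recovered)                       # the 7 unused blocks come back
```

Because every grant is remembered centrally, the worst case from a leaked lease is temporarily idle space, never permanently lost space -- which is the property argued for above.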
It's systematic. We dare not waste a tiny amount of free space. Surely we should cut our losses at some point and accept that we're "wasting space", but there is never any natural backpressure -- nothing to break the cycle of fragmentation. (This is another thing involving time, not quite related to my central point about free space leases.)

--
Peter Geoghegan