Re: The Free Space Map: Problems and Opportunities - Mailing list pgsql-hackers
From | Peter Geoghegan
---|---
Subject | Re: The Free Space Map: Problems and Opportunities
Date |
Msg-id | CAH2-Wzm+=o6A-dkDpXzb24ayau1XXQVeDO5V_aVbYsafu4Wh8w@mail.gmail.com
In response to | Re: The Free Space Map: Problems and Opportunities (Robert Haas <robertmhaas@gmail.com>)
Responses | Re: The Free Space Map: Problems and Opportunities; Re: The Free Space Map: Problems and Opportunities
List | pgsql-hackers
On Tue, Aug 17, 2021 at 9:48 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't know what the right degree of stickiness is, but I can easily
> believe that we want to have more than none. Part of Peter's point, at
> least as I understand it, is that if a page has 100 tuples and you
> delete 1 or 2 or 3 of them, it is not smart to fill up the space thus
> created with 1 or 2 or 3 new tuples. You should instead hope that
> space will optimize future HOT updates, or else wait for the page to
> lose some larger number of tuples and then fill up all the space at
> the same time with tuples that are, hopefully, related to each other
> in some useful way.

That is a big part of it. But it gets worse than that. It is truly
insidious, once you follow the chain of problems over time -- maybe over
hours and hours. BenchmarkSQL makes this a lot easier to see.

Right now, when we set heap fill factor to 85 (which is what BenchmarkSQL
does), we effectively leave a chunk of special reserved free space behind
on each page. The remaining 15% of tuple space is of course reserved for
successor versions of the same logical rows -- the rows from the time that
the page first fills (presumably they all come from straight inserts).
That's our starting point -- simple enough.

The approach we take effectively gives up on the *whole entire page* at
literally the first sign of trouble -- the first time we fail to hold a
successor version on that same page. Even though it may just as well have
been a once-off perturbation -- what we might call bad luck, as opposed to
an inevitability. For example, maybe ANALYZE ran at just that second,
which made the crucial difference for that one heap page that one time.
That alone is enough to make the implementation assume all bets are off
for the entire page -- it's as if we pessimistically assume that there is
no chance of keeping anything like the original rows on the heap page, so
we might as well let it all get mixed up even more.

This is way too pessimistic -- it's black and white thinking. While in
general that level of pessimism might turn out to be justified (anything
is possible), we should hold the line and find out for sure how bad it
is. I happen to know that our problematic BenchmarkSQL/TPC-C tables just
don't have any skew, so this pessimism is particularly wrong-headed there.

Bear in mind that the new unrelated tuples that ultimately replace our
original couldn't-fit tuples (probably after the next VACUUM) are probably
not "average tuples". Maybe they're newly inserted tuples, which is bad in
the BenchmarkSQL case, because they're all *guaranteed* to be updated in
the future (owing to the details of the workload). With some other typical
workload they might often be successor versions from updates instead --
most likely for rows that are disproportionately likely to be updated many
times. It's truly pernicious. If we only accepted the original row
migration, and didn't throw good money after bad, then the cycle would be
broken. To me this is a bit like certain heuristic algorithms, where you
technically accept a suboptimal outcome to avoid getting stuck in a local
maximum.

> Now what's the threshold? 20 out of 100? 50? 80?

I'm not going to pretend to know the answer. But I will point out that
one DB system whose heap fill factor defaults to 90 seems to have a
symmetric setting for the "open up the page again" point -- the default
for that is 40. Not sure that that really matters to us, but that does
seem pretty low to me. It's very sticky indeed.
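To make the "sticky" behavior I'm describing a bit more concrete, here is
a minimal sketch in C. To be clear, none of this is real PostgreSQL code:
PageState, the helper names, and the 40% "reopen" threshold (borrowed from
that other system's default) are all just illustrative.

```c
/*
 * Sketch only, not PostgreSQL code: one way to express a "sticky"
 * open/closed page policy.  PageState, the helpers, and the 40%
 * "reopen" threshold are all illustrative.
 */
#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ 8192

typedef struct PageState
{
    size_t  free_bytes;     /* current free space on the heap page */
    bool    closed;         /* no longer advertised as an insert target */
} PageState;

/* Close the page the first time it fails to fit a successor version. */
static void
page_failed_to_fit_update(PageState *page)
{
    page->closed = true;
}

/*
 * Only advertise the page for new, unrelated tuples again once enough
 * space has opened up that refilling it all at once seems worthwhile.
 */
static bool
page_is_insert_target(const PageState *page)
{
    if (!page->closed)
        return page->free_bytes > 0;

    return page->free_bytes >= (BLCKSZ * 40) / 100;  /* "reopen" point */
}
```

The point of the sketch is only that closing the page should be cheap and
easy to trigger, while reopening it should be deliberately hard.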
> Yeah, I don't think that reducing FSM_CATEGORIES is likely to have
> much value for its own sake. But it might be useful as a way of
> accomplishing some other goal. For example if we decided that we need
> a bit per page to track "open" vs. "closed" status or something of
> that sort, I don't think we'd lose much by having fewer categories.

This is probably one of my less important points. I addressed it in my
earlier remarks to Andres.

> I don't have a well-formed opinion on this point yet. I am also
> skeptical of the idea of merging the FSM and VM, because it seems to
> me that which pages are all-visible and which pages are full are two
> different things.

As I clarified in my email to Andres, this was just a bad choice of
words. I meant that they're at least conceptually much closer, provided
you're willing to accept my general view of things.

> However, there's something to Peter's point too. An
> empty page is all-visible but still valid as an insertion target, but
> this is not as much of a contradiction to the idea of merging them as
> it might seem, because marking the empty page as all-visible is not
> nearly as useful as marking a full page all-visible. An index-only
> scan can't care about the all-visible status of a page that doesn't
> contain any tuples, and we're also likely to make the page non-empty
> in the near future, in which case the work we did to set the
> all-visible bit and log the change was wasted. The only thing we're
> accomplishing by making the page all-visible is letting VACUUM skip
> it, which will work out to a win if it stays empty for multiple VACUUM
> cycles, but not otherwise.

That's a big part of it. The other thing is systemic effects, which is
related to the part of this email above about the chain of events. I'll
talk about that some more now.

Let's say that VACUUM puts a non-zero usable amount of free space in the
FSM for a page that it marks all-visible/all-frozen at the same time --
this is possible today, of course. To me this seems contradictory, at
least when there isn't much space -- which I believe is common. It's at
least a minor waste to set the VM bit.

Now let's say we change VACUUM to not waste effort on setting the VM bit
for that heap page -- we're now avoiding minor waste, which is an okay
goal. But we're also doing something else that I find very objectionable:
we're effectively betting that the situation will improve for the same
page at some point in the future, during a future VACUUM. It is possible
that we could get just one or two inserts on the same page, and then see
it again, and then we're done. If that happens, we win our bet.
Otherwise, we lose our bet.

Here's the problem with making that bet: it's typically an awful bet. The
potential upside is clearly pretty low. And more importantly, there is a
decent chance that we'll actually get some row that is going to mess
everything up -- not an "average row" (whatever that may mean). Again,
very frequently updated rows are often what'll end up on the page, and
they are enough to make sure the page never has its all-visible bit set.

Here is why I find this *extremely* objectionable: we *actually* risk
experiencing a downside for the heap page by taking this awful bet *many
many times* in the future -- it's not just a once-off downside, I believe.
This is not in fact a choice about one heap page on one occasion -- it is
actually a *policy* affecting the page again and again (and every other
heap page too).
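Just to show what declining to take the bet might look like, here is a
minimal sketch in C. Nothing like this exists in vacuumlazy.c today, and
the BLCKSZ/10 cutoff is an invented number, purely for illustration.

```c
/*
 * Sketch only: if VACUUM is about to set the page's VM bits and the
 * leftover free space is small, report the page to the FSM as having
 * no usable space at all, instead of betting that some future insert
 * onto it will work out.  The BLCKSZ/10 cutoff is invented.
 */
#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ 8192

static size_t
space_to_record_in_fsm(size_t free_bytes, bool will_set_all_visible)
{
    if (will_set_all_visible && free_bytes < BLCKSZ / 10)
        return 0;               /* keep the page "closed" */

    return free_bytes;
}
```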
It is only by making a not-quite-physically-full page "closed" in the way
that I advocate that we avoid taking this awful bet -- that is how it's
possible to artificially *make* this a question about one page on one
occasion. We're bounding the downside, lessening our exposure to the
worst case. Intuitively, I think that the possible downside is
essentially unlimited, if we assume systemic destabilization risks (which
seems wise). Whereas the possible upside of taking that awful bet is
clearly, measurably small. A once-off upside. Peanuts, really.

(Of course my argument won't work if you assume that even with the
"closed" page choice we naturally get more updates to one of the aged
frozen rows anyway. But we're counting on that not happening in either
scenario, with *either* strategy, so I don't think that it invalidates
anything I've said.)

> Conceptually, you might think of the merged data structure as a
> "quiescence map." Pages that aren't being actively updated are
> probably all-visible and are probably not great insertion targets.
> Those that are being actively updated are probably not all-visible and
> have better chances of being good insertion targets. However, there
> are a lot of gremlins buried in the word "probably." A page could get
> full but still not be all-visible for a long time, due to long-running
> transactions, and it doesn't seem likely to me that we can just ignore
> that possibility.

I think that we should opportunistically be setting the all-frozen bit in
heap pages (and freezing tuples earlier than required) anyway. For
example, we could notice that a whole heap page is eligible to be marked
all-frozen, and freeze the tuples early -- we could restart processing of
the page once we realized that freezing early is a good idea. Why even
have a distinct all-visible VM bit? The per-tuple overhead of the WAL
records for freezing is kind of bulky, but that could be fixed. It's not
a good reason. (Of course I'm not arguing that we should break
pg_upgrade, just that there is no reason not to freeze early when it's
cheap to do so in passing.)

> +1. I had this same thought reading Peter's email. I'm not sure if it
> makes sense to hook that into Thomas's work, but I think it makes a
> ton of sense to use shared memory to coordinate transient state like
> "hey, right now I'm inserting into this block, you guys leave it
> alone" while using the disk for durable state like "here's how much
> space this block has left."

I agree that that's the best starting point for this.

--
Peter Geoghegan
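PS: So that the shared memory idea doesn't stay completely abstract, here
is a toy sketch in C of the kind of transient "this backend is filling
this block" state I imagine. The struct, the array size, and the absent
locking are all invented for illustration -- real code would live in a
proper shared memory segment and take an LWLock around the array.

```c
/*
 * Toy sketch of transient insertion-target claims.  Invented types and
 * sizes; a real implementation would use shared memory plus locking.
 */
#include <stdbool.h>
#include <stdint.h>

#define MAX_OPEN_INSERT_TARGETS 64

typedef struct InsertTarget
{
    uint32_t    relid;          /* relation being filled */
    uint32_t    blkno;          /* block claimed by a backend */
    int32_t     backend_id;     /* -1 if the slot is free */
} InsertTarget;

static InsertTarget insert_targets[MAX_OPEN_INSERT_TARGETS];

/* Is some other backend already filling this block? */
static bool
block_is_claimed(uint32_t relid, uint32_t blkno, int32_t my_backend_id)
{
    for (int i = 0; i < MAX_OPEN_INSERT_TARGETS; i++)
    {
        if (insert_targets[i].backend_id >= 0 &&
            insert_targets[i].backend_id != my_backend_id &&
            insert_targets[i].relid == relid &&
            insert_targets[i].blkno == blkno)
            return true;
    }
    return false;
}
```

The durable "how much space this block has left" state would still come
from the FSM on disk, as quoted above; only the short-lived claims live in
memory.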