Thread: Re: Maybe we should reduce SKIP_PAGES_THRESHOLD a bit?
On Sun, Dec 15, 2024 at 4:47 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> What's happening is that if the fraction of updated rows is sufficiently
> high, the modified pages are within the SKIP_PAGES_THRESHOLD distance,
> and vacuum switches to seqscan. But with flash storage that happens to
> be harmful - I built master with different SKIP_PAGES_THRESHOLD (1, 2,
> 4, 8 and 16), and collected results for comparison.
>
> <--snip-->
>
> These are timings in seconds. Timings relative to master branch (with
> SKIP_PAGES_THRESHOLD=32) are
>
>     frac       master-1  master-2  master-4  master-8  master-16
>     -----------------------------------------------------------------
>     0.000001     100%       99%       96%       99%       89%
>     0.00001       85%       84%       87%       89%       94%
>     0.000025      63%       66%       65%       71%       80%
>     0.00005       45%       46%       46%       57%       65%
>     0.000075      37%       38%       39%       50%       69%
>     0.0001        34%       35%       36%       51%       70%
>     0.0002        36%       39%       44%       65%       86%
>     0.0005        88%      108%      132%      157%      144%
>     0.001        219%      274%      269%      203%      129%
>     0.01         114%      103%       99%      101%       99%
>
> So most of this shows significant speedups if we simply disable the
> skipping logic (by setting the threshold to 1). There are some
> regressions when scanning a substantial fraction of the table (0.001
> means ~10% pages are modified).

Thanks for running these benchmarks! I know you used explicit vacuum, but
I was just thinking that with the default autovacuum_vacuum_scale_factor
you'd have to modify 20% of the tuples to trigger an autovacuum, so
depending on the fill factor and tuple size, it may be normal for 10% of
the pages to need scanning by vacuum. However, we do tell users that 20%
is too high anyway.

On a related note, the other day I noticed another negative effect caused
in part by SKIP_PAGES_THRESHOLD. SKIP_PAGES_THRESHOLD interacts with the
opportunistic freeze heuristic [1], causing lots of all-frozen pages to be
scanned when checksums are enabled. You can easily end up with a table
that has very fragmented ranges of frozen, all-visible, and modified
pages. In this case, the opportunistic freeze heuristic bears most of the
blame. However, we are not close to coming up with a replacement
heuristic, so removing SKIP_PAGES_THRESHOLD would help. This wouldn't have
affected your results, but it is worth considering more generally.

> The results from the bigger machine with NVMe seem very similar, except
> that there are no regressions at all. It's always faster to not switch
> to sequential scans.

And we are suggesting to users with bigger, faster machines and SSDs to
tune autovacuum to run more often.

> Obviously, it may not be a good idea to make conclusions based on a
> single (and fairly simplistic) test. But I think it's reasonable to
> conclude the current SKIP_PAGES_THRESHOLD value (i.e. 32) may be a bit
> too high for modern storage. Based on "my" results it seems a value in
> the 4-8 range might be more appropriate.

A range of 4-8 would address some of the negative effects I saw with the
interaction of SKIP_PAGES_THRESHOLD and the opportunistic freeze heuristic
causing all-frozen pages to be read.

> Or perhaps it should be user-tunable? I'm very reluctant to add more and
> more knobs like this, but if we have effective_io_concurrency, isn't
> this a very similar thing? Or maybe it'd be possible to adjust this
> automatically based on existing parameters (esp. random_page_cost and
> effective_io_concurrency)?

I don't think we want to expose it to the user -- given that the plan in
the long run is to get rid of it.
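To make the mechanism under discussion concrete, the skipping decision
boils down to roughly the following -- a simplified sketch of what the
heap_vac_scan_next_block() logic in vacuumlazy.c effectively does today,
with the type as a stand-in and all of the VM and buffer bookkeeping
elided:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;   /* stand-in for PostgreSQL's BlockNumber */

    #define SKIP_PAGES_THRESHOLD ((BlockNumber) 32)

    /*
     * A run of skippable (all-visible) blocks is only actually skipped if
     * it is at least SKIP_PAGES_THRESHOLD blocks long; shorter runs are
     * scanned anyway, in the hope of getting sequential I/O and of
     * occasionally being able to advance relfrozenxid.
     */
    static bool
    skip_this_range(BlockNumber current_block,
                    BlockNumber next_unskippable_block)
    {
        return (next_unskippable_block - current_block) >= SKIP_PAGES_THRESHOLD;
    }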
> I do believe most of this applies even to vacuum with stream API patches
> [2], which leave SKIP_PAGES_THRESHOLD in place. In fact, isn't a similar
> logic built into the read stream itself? Isn't vacuumlazy.c the wrong
> level to do this anyway, as it's about I/O pattern - why should vacuum
> worry about that?

I agree about vacuumlazy.c being an unfortunate place to have this logic.
It makes the code there MUCH more complicated.

The read stream API isn't a drop-in replacement for SKIP_PAGES_THRESHOLD
because it only does read combining and prefetching -- not ensuring
sequential ranges are requested. It still relies on kernel readahead
logic. However, with AIO, Andres has pointed out SKIP_PAGES_THRESHOLD no
longer makes sense. With AIO, multiple reads will be initiated at the
same time, so the kernel won't be getting sequential range requests
anyway (even when using buffered IO). And with direct IO, there is no
kernel readahead logic to confuse.

However, I think your benchmarks show that SKIP_PAGES_THRESHOLD may not
be operating in the way it was intended anymore anyway.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_ZvkxDRtMAhYRTK9N60jfMMVkiQgnEP5m_asYnuvgmQOg%40mail.gmail.com
On Mon, Dec 16, 2024 at 10:37 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > On a related note, the other day I noticed another negative effect > caused in part by SKIP_PAGES_THRESHOLD. SKIP_PAGES_THRESHOLD interacts > with the opportunistic freeze heuristic [1] causing lots of all-frozen > pages to be scanned when checksums are enabled. You can easily end up > with a table that has very fragmented ranges of frozen, all-visible, > and modified pages. In this case, the opportunistic freeze heuristic > bears most of the blame. Bears most of the blame for what? Significantly reducing the total amount of WAL written? > However, we are not close to coming up with a > replacement heuristic, so removing SKIP_PAGES_THRESHOLD would help. > This wouldn't have affected your results, but it is worth considering > more generally. One of the reasons why we have SKIP_PAGES_THRESHOLD is that it makes it more likely that non-aggressive VACUUMs will advance relfrozenxid. Granted, it's probably not doing a particularly good job at that right now. But any effort to replace it should account for that. This is possible by making VACUUM consider the cost of scanning extra heap pages up-front. If the number of "extra heap pages to be scanned" to advance relfrozenxid happens to not be very high (or not so high *relative to the current age(relfrozenxid)*), then pay that cost now, in the current VACUUM operation. Even if age(relfrozenxid) is pretty far from the threshold for aggressive mode, if the added cost of advancing relfrozenxid is still not too high, why wouldn't we just do it? I think that aggressive mode is a bad idea more generally. The behavior around waiting for a cleanup lock (the second aggressive-mode-influenced behavior) is also a lot more brittle than it really needs to be, simply because we're not weighing costs and benefits. There's a bunch of relevant information that could be applied when deciding what to do (at the level of each individual heap page that cannot be cleanup locked right away), but we make no effort to apply that information -- we only care about the static choice of aggressive vs. non-aggressive there. -- Peter Geoghegan
On Mon, Dec 16, 2024 at 12:32 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Dec 16, 2024 at 10:37 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > On a related note, the other day I noticed another negative effect > > caused in part by SKIP_PAGES_THRESHOLD. SKIP_PAGES_THRESHOLD interacts > > with the opportunistic freeze heuristic [1] causing lots of all-frozen > > pages to be scanned when checksums are enabled. You can easily end up > > with a table that has very fragmented ranges of frozen, all-visible, > > and modified pages. In this case, the opportunistic freeze heuristic > > bears most of the blame. > > Bears most of the blame for what? Significantly reducing the total > amount of WAL written? No, I'm talking about the behavior of causing small pockets of all-frozen pages which end up being smaller than SKIP_PAGES_THRESHOLD and are then scanned (even though they are already frozen). What I describe in that email I cited is that because we freeze opportunistically when we have or will emit an FPI, and bgwriter will write out blocks in clocksweep order, we end up with random pockets of pages getting frozen during/after a checkpoint. Then in the next vacuum, we end up scanning those all-frozen pages again because the ranges of frozen pages are smaller than SKIP_PAGES_THRESHOLD. This is mostly going to happen for an insert-only workload. I'm not saying freezing the pages is bad, I'm saying that causing these pockets of frozen pages leads to scanning all-frozen pages on future vacuums. > > However, we are not close to coming up with a > > replacement heuristic, so removing SKIP_PAGES_THRESHOLD would help. > > This wouldn't have affected your results, but it is worth considering > > more generally. > > One of the reasons why we have SKIP_PAGES_THRESHOLD is that it makes > it more likely that non-aggressive VACUUMs will advance relfrozenxid. > Granted, it's probably not doing a particularly good job at that right > now. But any effort to replace it should account for that. > > This is possible by making VACUUM consider the cost of scanning extra > heap pages up-front. If the number of "extra heap pages to be scanned" > to advance relfrozenxid happens to not be very high (or not so high > *relative to the current age(relfrozenxid)*), then pay that cost now, > in the current VACUUM operation. Even if age(relfrozenxid) is pretty > far from the threshold for aggressive mode, if the added cost of > advancing relfrozenxid is still not too high, why wouldn't we just do > it? That's an interesting idea. And it seems like a much more effective way of getting some relfrozenxid advancement than hoping that the pages you scan due to SKIP_PAGES_THRESHOLD end up being enough to have scanned all unfrozen tuples. > I think that aggressive mode is a bad idea more generally. The > behavior around waiting for a cleanup lock (the second > aggressive-mode-influenced behavior) is also a lot more brittle than > it really needs to be, simply because we're not weighing costs and > benefits. There's a bunch of relevant information that could be > applied when deciding what to do (at the level of each individual heap > page that cannot be cleanup locked right away), but we make no effort > to apply that information -- we only care about the static choice of > aggressive vs. non-aggressive there. What kind of information? Could you say more? Andres mentioned the other day that we could set pages all-visible in the VM even if we don't get the cleanup lock (lazy_scan_noprune()) case. 
That seems like a good idea. - Melanie
On Mon, Dec 16, 2024 at 1:50 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Then in the next vacuum, we end up scanning those all-frozen pages
> again because the ranges of frozen pages are smaller than
> SKIP_PAGES_THRESHOLD. This is mostly going to happen for an insert-only
> workload. I'm not saying freezing the pages is bad, I'm saying that
> causing these pockets of frozen pages leads to scanning all-frozen
> pages on future vacuums.

I guess that it doesn't necessarily matter what component you define as
being at fault here, but FWIW I think that the problem is primarily
SKIP_PAGES_THRESHOLD itself. After all, SKIP_PAGES_THRESHOLD tries to
encourage certain desirable outcomes (namely readahead and non-aggressive
relfrozenxid advancement) without ever really verifying that that's
working out. If VACUUM doesn't truly get readahead (quite likely), it
still pays quite a high cost in CPU cycles. Similarly, if VACUUM skips
even one all-visible page, we can't expect much of any benefit for having
not skipped any other all-visible pages (whether or not we can safely
advance relfrozenxid at all is an all or nothing thing).

> That's an interesting idea. And it seems like a much more effective
> way of getting some relfrozenxid advancement than hoping that the
> pages you scan due to SKIP_PAGES_THRESHOLD end up being enough to have
> scanned all unfrozen tuples.

Sometimes this can be truly perverse. It's possible for a SELECT FOR
SHARE to unset the all-frozen bit, without unsetting the all-visible bit.
So all that it takes is one such SELECT FOR SHARE against one random
tuple in an originally-all-frozen heap page inside a large grouping of
all-frozen heap pages. That's almost certainly enough to obstruct
non-aggressive relfrozenxid advancement _for the entire table_. ISTM that
it's just criminal to miss out on non-aggressive relfrozenxid advancement
because of some tiny issue such as that. And so SKIP_PAGES_THRESHOLD
should be replaced by something that specifically has relfrozenxid as a
goal of reading all-visible pages.

Just how well this works out in practice will be very workload dependent.
But there are workloads/tables where it works quite well:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Mixed_inserts_and_deletes

> > I think that aggressive mode is a bad idea more generally. The
> > behavior around waiting for a cleanup lock (the second
> > aggressive-mode-influenced behavior) is also a lot more brittle than
> > it really needs to be, simply because we're not weighing costs and
> > benefits. There's a bunch of relevant information that could be
> > applied when deciding what to do (at the level of each individual heap
> > page that cannot be cleanup locked right away), but we make no effort
> > to apply that information -- we only care about the static choice of
> > aggressive vs. non-aggressive there.
>
> What kind of information? Could you say more?

Waiting or not waiting doesn't have to be an either/or choice. For
example, we could retry a few times, with a backoff, if and only if it
seemed to make sense. Testing that I did a few years ago showed that that
worked rather well.

Why should we necessarily have to advance relfrozenxid exactly up to
FreezeLimit during every aggressive VACUUM? Surely the picture over time
and across multiple VACUUM operations is what matters most? At the very
least, we should have an independent XID cutoff for "must advance
relfrozenxid up to here, no matter what" -- we shouldn't just reuse
FreezeLimit to control that behavior. We might very well "try quite hard
to advance relfrozenxid to a value >= FreezeLimit" -- we just don't have
to do it no matter what the cost is. There is a huge practical difference
between "try quite hard" (e.g., retry the cleanup lock acquisition 3
times, with a sleep between each) and "try infinitely hard" (i.e., wait
for a cleanup lock indefinitely).

VACUUM is the only thing that freezes. Right now we'll sometimes wait for
a cleanup lock forever, because "we have to freeze". But waiting like
that necessarily implies that we cannot freeze any other page in the
table. That might actually be the least worst thing that VACUUM can do,
but only in the extreme case where we've really put it off for as long as
we possibly could (by not waiting forever in prior VACUUM operations).
There is actually a lot of potential benefit from "kicking the can down
the road", if we go about it intelligently. Just waiting forever is a
pretty horrible strategy, and so it should only be the strategy of last
resort (we might reasonably be able to dump a log report about it in a
world where it's truly the strategy of last resort).

> Andres mentioned the other day that we could set pages all-visible in
> the VM even if we don't get the cleanup lock (lazy_scan_noprune())
> case. That seems like a good idea.

Maybe it's a good idea, but right now it poses a similar risk to my
scenario involving a random, isolated SELECT FOR SHARE that happens to
affect some random tuple on a cold/frozen page. Of course, this wouldn't
be all that hard to fix.

--
Peter Geoghegan
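To illustrate the kind of "try quite hard, but not forever" behavior
described above -- a hypothetical sketch only; the three-attempt count,
the sleep durations, and the helper name are made up and not from any
posted patch:

    #include "postgres.h"
    #include "storage/bufmgr.h"

    /*
     * Hypothetical helper: try "quite hard" (but not forever) to get a
     * cleanup lock on a heap page.  Returns true if the cleanup lock was
     * acquired.
     */
    static bool
    try_cleanup_lock_with_backoff(Buffer buf)
    {
        for (int attempt = 0; attempt < 3; attempt++)
        {
            if (ConditionalLockBufferForCleanup(buf))
                return true;

            /* back off a little before retrying: 10ms, then 20ms */
            if (attempt < 2)
                pg_usleep(10000L * (attempt + 1));
        }

        return false;
    }

The caller would fall back to the lazy_scan_noprune() path when this
returns false, much as it does today after a single failed
ConditionalLockBufferForCleanup() call.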
On Mon, Dec 16, 2024 at 2:17 PM Peter Geoghegan <pg@bowt.ie> wrote: > Maybe it's a good idea, but right now it poses a similar risk to my > scenario involving a random, isolated SELECT FOR SHARE that happens to > affect some random tuple on a cold/frozen page. Of course, this > wouldn't be all that hard to fix. BTW, one alternative to a "retry cleanup lock acquisition 3 times" heuristic is a version of freezing/pruning that doesn't require a cleanup lock at all. The only reason that we require a cleanup lock is to make it safe to defragment the page, to free tuple space. Nothing stops you from inventing a variant of pruning/freezing that works just like regular pruning/freezing, but without attempting to free up tuple space -- thus obviating the need for a cleanup lock. This process could still remove dead tuples by (say) setting their xmin to InvalidTransactionId -- that ought to still be safe. I don't think that problems with acquiring cleanup locks happen all that often -- they're probably very rare. But when they do happen they can be extremely serious -- they can even cause outages. And so there's a lot to be said for a design that removes those sorts of risks (or that at least significantly ameliorates them). It gives you the freedom to come up with better high-level designs that don't have to worry about not acquiring cleanup locks. -- Peter Geoghegan
On 12/16/24 19:49, Melanie Plageman wrote: > On Mon, Dec 16, 2024 at 12:32 PM Peter Geoghegan <pg@bowt.ie> wrote: >> >> On Mon, Dec 16, 2024 at 10:37 AM Melanie Plageman >> <melanieplageman@gmail.com> wrote: >>> On a related note, the other day I noticed another negative effect >>> caused in part by SKIP_PAGES_THRESHOLD. SKIP_PAGES_THRESHOLD interacts >>> with the opportunistic freeze heuristic [1] causing lots of all-frozen >>> pages to be scanned when checksums are enabled. You can easily end up >>> with a table that has very fragmented ranges of frozen, all-visible, >>> and modified pages. In this case, the opportunistic freeze heuristic >>> bears most of the blame. >> >> Bears most of the blame for what? Significantly reducing the total >> amount of WAL written? > > No, I'm talking about the behavior of causing small pockets of > all-frozen pages which end up being smaller than SKIP_PAGES_THRESHOLD > and are then scanned (even though they are already frozen). What I > describe in that email I cited is that because we freeze > opportunistically when we have or will emit an FPI, and bgwriter will > write out blocks in clocksweep order, we end up with random pockets of > pages getting frozen during/after a checkpoint. Then in the next > vacuum, we end up scanning those all-frozen pages again because the > ranges of frozen pages are smaller than SKIP_PAGES_THRESHOLD. This is > mostly going to happen for an insert-only workload. I'm not saying > freezing the pages is bad, I'm saying that causing these pockets of > frozen pages leads to scanning all-frozen pages on future vacuums. > Yeah, this interaction between the components is not great :-( But can we think of a way to reduce the fragmentation? What would need to change? I don't think bgwriter can help much - it's mostly oblivious to the contents of the buffer, I don't think it could consider stuff like this when deciding what to evict. Maybe the freezing code could check how many of the nearby pages are frozen, and consider that together with the FPI write? >>> However, we are not close to coming up with a >>> replacement heuristic, so removing SKIP_PAGES_THRESHOLD would help. >>> This wouldn't have affected your results, but it is worth considering >>> more generally. >> >> One of the reasons why we have SKIP_PAGES_THRESHOLD is that it makes >> it more likely that non-aggressive VACUUMs will advance relfrozenxid. >> Granted, it's probably not doing a particularly good job at that right >> now. But any effort to replace it should account for that. >> I don't follow. How could non-aggressive VACUUM advance relfrozenxid, ever? I mean, if it doesn't guarantee freezing all pages, how could it? >> This is possible by making VACUUM consider the cost of scanning extra >> heap pages up-front. If the number of "extra heap pages to be scanned" >> to advance relfrozenxid happens to not be very high (or not so high >> *relative to the current age(relfrozenxid)*), then pay that cost now, >> in the current VACUUM operation. Even if age(relfrozenxid) is pretty >> far from the threshold for aggressive mode, if the added cost of >> advancing relfrozenxid is still not too high, why wouldn't we just do >> it? > > That's an interesting idea. And it seems like a much more effective > way of getting some relfrozenxid advancement than hoping that the > pages you scan due to SKIP_PAGES_THRESHOLD end up being enough to have > scanned all unfrozen tuples. 
I agree it might be useful to formulate this as a "costing" problem, not
just in the context of a single vacuum, but for the overall maintenance
overhead - essentially accepting the vacuum gets slower, in exchange for
lower cost of maintenance later.

But I think that (a) is going to be fairly complex, because how do you
cost the future vacuum?, and (b) somewhat misses my point that on modern
NVMe SSD storage (SKIP_PAGES_THRESHOLD > 1) doesn't seem to be a win
*ever*.

So why shouldn't we reduce the SKIP_PAGES_THRESHOLD value (or perhaps
make it configurable)? We can still do the other stuff (decide how
aggressively to freeze stuff etc.) independently of that.

regards

--
Tomas Vondra
On Mon, Dec 16, 2024 at 2:38 PM Peter Geoghegan <pg@bowt.ie> wrote: > The only reason that we require a cleanup lock is to make it safe to > defragment the page, to free tuple space. Nothing stops you from > inventing a variant of pruning/freezing that works just like regular > pruning/freezing, but without attempting to free up tuple space -- > thus obviating the need for a cleanup lock. This process could still > remove dead tuples by (say) setting their xmin to InvalidTransactionId > -- that ought to still be safe. I think this amounts to inventing a new way to mark a tuple as dead, but I don't think we need to do that. We could just mark the line pointer LP_DEAD without actually touching the space that contains the tuple. At least, I think that would work. The only question in my mind is whether whatever we propose to do here would violate the locking rules documented in src/backend/storage/buffer/README. In my opinion, those rules are a bit vaguely worded in some places, but I interpret #1 and #2 as meaning that you can't look at the line pointer array without some kind of content lock, so an exclusive content lock should be good enough to mark a line pointer dead as long as you don't relocate any tuples. -- Robert Haas EDB: http://www.enterprisedb.com
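As a rough illustration of the idea (hypothetical code: WAL logging, HOT
chains, and the actual visibility checks are all elided, and
item_is_known_dead() is a made-up stand-in for the real dead-tuple test):

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"
    #include "storage/itemid.h"
    #include "storage/off.h"

    /* stand-in for the real visibility / dead-tuple checks */
    static bool item_is_known_dead(Page page, OffsetNumber offnum);

    /*
     * Sketch: mark known-dead line pointers LP_DEAD while holding only an
     * exclusive content lock (no cleanup lock), leaving the tuple storage
     * in place -- i.e. no PageRepairFragmentation() call, so no tuples
     * are relocated.
     */
    static void
    mark_dead_items_without_cleanup_lock(Buffer buf)
    {
        Page        page;
        OffsetNumber offnum,
                    maxoff;

        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

        page = BufferGetPage(buf);
        maxoff = PageGetMaxOffsetNumber(page);

        for (offnum = FirstOffsetNumber; offnum <= maxoff;
             offnum = OffsetNumberNext(offnum))
        {
            ItemId      itemid = PageGetItemId(page, offnum);

            if (ItemIdIsNormal(itemid) && item_is_known_dead(page, offnum))
                ItemIdSetDead(itemid);  /* tuple space is not reclaimed */
        }

        MarkBufferDirty(buf);           /* WAL record omitted in this sketch */
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    }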
On Tue, Dec 17, 2024 at 9:11 AM Tomas Vondra <tomas@vondra.me> wrote: > I don't follow. How could non-aggressive VACUUM advance relfrozenxid, > ever? I mean, if it doesn't guarantee freezing all pages, how could it? Although it's very workload dependent, it still happens all the time. Just look at the autovacuum log output from almost any autovacuum that runs when the regression tests run. Or look at the autovacuum output for the small pgbench tables. In general, relfrozenxid simply tracks the oldest possible extant XID in the table. VACUUM doesn't necessarily need to do any freezing to advance relfrozenxid/relminmxid. But VACUUM *must* exhaustively scan every heap page that could possibly contain an old XID in order to be able to advance relfrozenxid/relminmxid safely. > > That's an interesting idea. And it seems like a much more effective > > way of getting some relfrozenxid advancement than hoping that the > > pages you scan due to SKIP_PAGES_THRESHOLD end up being enough to have > > scanned all unfrozen tuples. > But I think that (a) is going to be fairly complex, because how do you > cost the future vacuum?, and (b) is somewhat misses my point that on > modern NVMe SSD storage (SKIP_PAGES_THRESHOLD > 1) doesn't seem to be a > win *ever*. I am not suggesting that the readahead argument for SKIP_PAGES_THRESHOLD is really valid. I think that the relfrozenxid argument is the only one that makes any sense. Clearly both arguments justified the introduction of SKIP_PAGES_THRESHOLD, after the earliest work on the visibility map back in 2009 -- see the commit message for bf136cf6. In short, I am envisaging a design that decides whether or not it'll advance relfrozenxid based on both the costs and the benefits/need. Under this scheme, VACUUM would either scan exactly all all-visible-not-all-frozen pages, or scan none at all. This decision would be almost completely independent of the decision to freeze or not freeze pages (it'd be loosely related because FreezeLimit can never be more than autovacuum_freeze_max_age/2 XIDs in age). Then we'd be free to just get rid of SKIP_PAGES_THRESHOLD, which presumably isn't doing much for readahead. -- Peter Geoghegan
On Tue, Dec 17, 2024 at 10:51 AM Robert Haas <robertmhaas@gmail.com> wrote: > I think this amounts to inventing a new way to mark a tuple as dead, > but I don't think we need to do that. We could just mark the line > pointer LP_DEAD without actually touching the space that contains the > tuple. At least, I think that would work. You might well be right about that. HOT chains might make things more complicated (a heap-only tuple can only be set LP_UNUSED), but we could probably figure that out if we had to. I'm not entirely sure that this would be worth the effort. I believe that having to wait for a cleanup lock is rare in general. On the other hand, it'd be convenient if the high-level design of VACUUM/autovacuum scheduling could freely ignore the possibility of having to wait for a cleanup lock. > The only question in my mind > is whether whatever we propose to do here would violate the locking > rules documented in src/backend/storage/buffer/README. In my opinion, > those rules are a bit vaguely worded in some places, but I interpret > #1 and #2 as meaning that you can't look at the line pointer array > without some kind of content lock, so an exclusive content lock should > be good enough to mark a line pointer dead as long as you don't > relocate any tuples. That is also my interpretation. -- Peter Geoghegan
On Mon, Dec 16, 2024 at 2:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Dec 16, 2024 at 1:50 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > Then in the next vacuum, we end up scanning those all-frozen pages
> > again because the ranges of frozen pages are smaller than
> > SKIP_PAGES_THRESHOLD. This is mostly going to happen for an
> > insert-only workload. I'm not saying freezing the pages is bad, I'm
> > saying that causing these pockets of frozen pages leads to scanning
> > all-frozen pages on future vacuums.
>
> I guess that it doesn't necessarily matter what component you define
> as being at fault here, but FWIW I think that the problem is primarily
> SKIP_PAGES_THRESHOLD itself. After all, SKIP_PAGES_THRESHOLD tries to
> encourage certain desirable outcomes (namely readahead and
> non-aggressive relfrozenxid advancement) without ever really verifying
> that that's working out. If VACUUM doesn't truly get readahead (quite
> likely), it still pays quite a high cost in CPU cycles. Similarly, if
> VACUUM skips even one all-visible page, we can't expect much of any
> benefit for having not skipped any other all-visible pages (whether or
> not we can safely advance relfrozenxid at all is an all or nothing
> thing).

Right. ISTM that coming up with a better strategy to enable relfrozenxid
advancement than SKIP_PAGES_THRESHOLD is something everyone agrees would
be worthwhile.

> > That's an interesting idea. And it seems like a much more effective
> > way of getting some relfrozenxid advancement than hoping that the
> > pages you scan due to SKIP_PAGES_THRESHOLD end up being enough to have
> > scanned all unfrozen tuples.
>
> Sometimes this can be truly perverse. It's possible for a SELECT FOR
> SHARE to unset the all-frozen bit, without unsetting the all-visible
> bit. So all that it takes is one such SELECT FOR SHARE against one
> random tuple in an originally-all-frozen heap page inside a large
> grouping of all-frozen heap pages. That's almost certainly enough to
> obstruct non-aggressive relfrozenxid advancement _for the entire
> table_. ISTM that it's just criminal to miss out on non-aggressive
> relfrozenxid advancement because of some tiny issue such as that. And
> so SKIP_PAGES_THRESHOLD should be replaced by something that
> specifically has relfrozenxid as a goal of reading all-visible pages.
>
> Just how well this works out in practice will be very workload
> dependent. But there are workloads/tables where it works quite well:
>
> https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Mixed_inserts_and_deletes

I did look at the wiki page a bit. But one thing I didn't quite grasp is
how you are proposing to measure the costs/benefits of scanning all
all-visible pages. When you first mentioned this, I imagined you would
use visibilitymap_count() at the beginning of the vacuum and consider
scanning all the all-visible pages if there aren't many (when compared to
the total number of pages needing scanning).

But then, I'm not sure I see how that lets you advance relfrozenxid more
often. It seems like the all-visible pages you would scan this way would
be younger and less likely to be required to freeze (per freeze limit),
so you'd end up just uselessly scanning them.

> Why should we necessarily have to advance relfrozenxid exactly up to
> FreezeLimit during every aggressive VACUUM? Surely the picture over
> time and across multiple VACUUM operations is what matters most? At
> the very least, we should have an independent XID cutoff for "must
> advance relfrozenxid up to here, no matter what" -- we shouldn't just
> reuse FreezeLimit to control that behavior. We might very well "try
> quite hard to advance relfrozenxid to a value >= FreezeLimit" -- we
> just don't have to do it no matter what the cost is. There is a huge
> practical difference between "try quite hard" (e.g., retry the cleanup
> lock acquisition 3 times, with a sleep between each) and "try
> infinitely hard" (i.e., wait for a cleanup lock indefinitely).

I got a bit confused here. Do you mean that because we call
lazy_scan_noprune() and visit tuples this way, we can still advance the
relfrozenxid to the oldest unfrozen xid value just based on what we see
in lazy_scan_noprune() (i.e. even if we don't get the cleanup lock)?

- Melanie
On Tue, Dec 17, 2024 at 9:11 AM Tomas Vondra <tomas@vondra.me> wrote: > > > > On 12/16/24 19:49, Melanie Plageman wrote: > > > No, I'm talking about the behavior of causing small pockets of > > all-frozen pages which end up being smaller than SKIP_PAGES_THRESHOLD > > and are then scanned (even though they are already frozen). What I > > describe in that email I cited is that because we freeze > > opportunistically when we have or will emit an FPI, and bgwriter will > > write out blocks in clocksweep order, we end up with random pockets of > > pages getting frozen during/after a checkpoint. Then in the next > > vacuum, we end up scanning those all-frozen pages again because the > > ranges of frozen pages are smaller than SKIP_PAGES_THRESHOLD. This is > > mostly going to happen for an insert-only workload. I'm not saying > > freezing the pages is bad, I'm saying that causing these pockets of > > frozen pages leads to scanning all-frozen pages on future vacuums. > > > > Yeah, this interaction between the components is not great :-( But can > we think of a way to reduce the fragmentation? What would need to change? Well reducing SKIP_PAGES_THRESHOLD would help. And unfortunately we do not know if the skippable pages are all-frozen without extra visibilitymap_get_status() calls -- so we can't decide to avoid scanning ranges of skippable pages because they are frozen. > I don't think bgwriter can help much - it's mostly oblivious to the > contents of the buffer, I don't think it could consider stuff like this > when deciding what to evict. Agreed. > Maybe the freezing code could check how many of the nearby pages are > frozen, and consider that together with the FPI write? That's an interesting idea. We wouldn't have any guaranteed info because we only have a lock on the page we are considering freezing. But we could keep track of the length of a run of pages we are freezing and opportunistically freeze pages that don't require freezing if they follow one or more pages requiring freezing. But I don't know how much more this buys us than removing SKIP_PAGES_THRESHOLD. Since it would "fix" the fragmentation, perhaps it makes larger future vacuum reads possible. But I wonder how much benefit it would be vs complexity. > >>> However, we are not close to coming up with a > >>> replacement heuristic, so removing SKIP_PAGES_THRESHOLD would help. > >>> This wouldn't have affected your results, but it is worth considering > >>> more generally. > >> > >> One of the reasons why we have SKIP_PAGES_THRESHOLD is that it makes > >> it more likely that non-aggressive VACUUMs will advance relfrozenxid. > >> Granted, it's probably not doing a particularly good job at that right > >> now. But any effort to replace it should account for that. > >> > > I don't follow. How could non-aggressive VACUUM advance relfrozenxid, > ever? I mean, if it doesn't guarantee freezing all pages, how could it? It may, coincidentally, not skip any all-visible pages. Peter points out that this happens all the time for small tables, but wouldn't the overhead of an aggressive vacuum be barely noticeable for small tables? It seems like there is little cost to waiting. > >> This is possible by making VACUUM consider the cost of scanning extra > >> heap pages up-front. If the number of "extra heap pages to be scanned" > >> to advance relfrozenxid happens to not be very high (or not so high > >> *relative to the current age(relfrozenxid)*), then pay that cost now, > >> in the current VACUUM operation. 
> >> Even if age(relfrozenxid) is pretty far from the threshold for
> >> aggressive mode, if the added cost of advancing relfrozenxid is still
> >> not too high, why wouldn't we just do it?
> >
> > That's an interesting idea. And it seems like a much more effective
> > way of getting some relfrozenxid advancement than hoping that the
> > pages you scan due to SKIP_PAGES_THRESHOLD end up being enough to have
> > scanned all unfrozen tuples.
> >
>
> I agree it might be useful to formulate this as a "costing" problem, not
> just in the context of a single vacuum, but for the overall maintenance
> overhead - essentially accepting the vacuum gets slower, in exchange for
> lower cost of maintenance later.

Yes, that costing sounds like a big research and benchmarking project on
its own.

> But I think that (a) is going to be fairly complex, because how do you
> cost the future vacuum?, and (b) somewhat misses my point that on
> modern NVMe SSD storage (SKIP_PAGES_THRESHOLD > 1) doesn't seem to be a
> win *ever*.
>
> So why shouldn't we reduce the SKIP_PAGES_THRESHOLD value (or perhaps
> make it configurable)? We can still do the other stuff (decide how
> aggressively to freeze stuff etc.) independently of that.

I think your tests show SKIP_PAGES_THRESHOLD has dubious if any benefit
related to readahead. But the question is if we care about it for
advancing relfrozenxid for small tables.

- Melanie
On Tue, Dec 17, 2024 at 12:06 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > I think your tests show SKIP_PAGES_THRESHOLD has dubious if any > benefit related to readahead. But the question is if we care about it > for advancing relfrozenxid for small tables. It seems like relfrozenxid advancement is really only a problem for big tables. If the table is small, the eventual aggressive vacuum doesn't cost that much. -- Robert Haas EDB: http://www.enterprisedb.com
On 12/17/24 18:06, Melanie Plageman wrote: > On Tue, Dec 17, 2024 at 9:11 AM Tomas Vondra <tomas@vondra.me> wrote: >> >> >> >> On 12/16/24 19:49, Melanie Plageman wrote: >> >>> No, I'm talking about the behavior of causing small pockets of >>> all-frozen pages which end up being smaller than SKIP_PAGES_THRESHOLD >>> and are then scanned (even though they are already frozen). What I >>> describe in that email I cited is that because we freeze >>> opportunistically when we have or will emit an FPI, and bgwriter will >>> write out blocks in clocksweep order, we end up with random pockets of >>> pages getting frozen during/after a checkpoint. Then in the next >>> vacuum, we end up scanning those all-frozen pages again because the >>> ranges of frozen pages are smaller than SKIP_PAGES_THRESHOLD. This is >>> mostly going to happen for an insert-only workload. I'm not saying >>> freezing the pages is bad, I'm saying that causing these pockets of >>> frozen pages leads to scanning all-frozen pages on future vacuums. >>> >> >> Yeah, this interaction between the components is not great :-( But can >> we think of a way to reduce the fragmentation? What would need to change? > > Well reducing SKIP_PAGES_THRESHOLD would help. How does SKIP_PAGES_THRESHOLD change the fragmentation? I think that's the side that's affected by the fragmentation, but it's really due to the eager freezing / bgwriter evictions, etc. If the threshold is set to 1 (i.e. to always skip), that just lowers the impact, but the relation is still as fragmented as before, no? > And unfortunately we do > not know if the skippable pages are all-frozen without extra > visibilitymap_get_status() calls -- so we can't decide to avoid > scanning ranges of skippable pages because they are frozen. > I may be missing something, but doesn't find_next_unskippable_block() already get those bits? In fact, it even checks VISIBILITYMAP_ALL_FROZEN but only for aggressive vacuum. But even if that wasn't the case, isn't checking the VM likely much cheaper than vacuuming the heap page? Also, I don't quite see why would this information help with reducing the fragmentation? Can you explain? >> I don't think bgwriter can help much - it's mostly oblivious to the >> contents of the buffer, I don't think it could consider stuff like this >> when deciding what to evict. > > Agreed. > >> Maybe the freezing code could check how many of the nearby pages are >> frozen, and consider that together with the FPI write? > > That's an interesting idea. We wouldn't have any guaranteed info > because we only have a lock on the page we are considering freezing. > But we could keep track of the length of a run of pages we are > freezing and opportunistically freeze pages that don't require > freezing if they follow one or more pages requiring freezing. I don't think we need a "guaranteed" information - a heuristics that's correct most of the time (say, >90%?) ought to be good enough. I mean, it has to be, because we'll never get a rule that's correct 100%. So even just looking at a batch of pages in VM should be enough, no? > But I don't know how much more this buys us than removing > SKIP_PAGES_THRESHOLD. Since it would "fix" the fragmentation, perhaps > it makes larger future vacuum reads possible. But I wonder how much > benefit it would be vs complexity. > I think that depends on which cost we're talking about. If we only talk about the efficiency of a single vacuum, then it probably does not help very much. 
I mean, if we assume the relation is already fragmented, then it seems to
be cheaper to vacuum just the pages that need it (as if with
SKIP_PAGES_THRESHOLD=1).

But if we're talking about long-term benefits, in reducing the amount of
freezing needed overall, maybe it'd be a win? I don't know.

>>>>> However, we are not close to coming up with a
>>>>> replacement heuristic, so removing SKIP_PAGES_THRESHOLD would help.
>>>>> This wouldn't have affected your results, but it is worth considering
>>>>> more generally.
>>>>
>>>> One of the reasons why we have SKIP_PAGES_THRESHOLD is that it makes
>>>> it more likely that non-aggressive VACUUMs will advance relfrozenxid.
>>>> Granted, it's probably not doing a particularly good job at that right
>>>> now. But any effort to replace it should account for that.
>>>>
>>
>> I don't follow. How could non-aggressive VACUUM advance relfrozenxid,
>> ever? I mean, if it doesn't guarantee freezing all pages, how could it?
>
> It may, coincidentally, not skip any all-visible pages. Peter points
> out that this happens all the time for small tables, but wouldn't the
> overhead of an aggressive vacuum be barely noticeable for small
> tables? It seems like there is little cost to waiting.
>

Yeah, that's kinda my point / confusion. For it to help we would have to
not skip any pages, but for large tables that seems quite unlikely
(because it only takes one page that gets skipped, and there are many
opportunities). And for small tables, I think it doesn't matter that
much, because even aggressive vacuum is cheap.

>>>> This is possible by making VACUUM consider the cost of scanning extra
>>>> heap pages up-front. If the number of "extra heap pages to be scanned"
>>>> to advance relfrozenxid happens to not be very high (or not so high
>>>> *relative to the current age(relfrozenxid)*), then pay that cost now,
>>>> in the current VACUUM operation. Even if age(relfrozenxid) is pretty
>>>> far from the threshold for aggressive mode, if the added cost of
>>>> advancing relfrozenxid is still not too high, why wouldn't we just do
>>>> it?
>>>
>>> That's an interesting idea. And it seems like a much more effective
>>> way of getting some relfrozenxid advancement than hoping that the
>>> pages you scan due to SKIP_PAGES_THRESHOLD end up being enough to have
>>> scanned all unfrozen tuples.
>>>
>>
>> I agree it might be useful to formulate this as a "costing" problem, not
>> just in the context of a single vacuum, but for the overall maintenance
>> overhead - essentially accepting the vacuum gets slower, in exchange for
>> lower cost of maintenance later.
>
> Yes, that costing sounds like a big research and benchmarking project
> on its own.
>

True. I don't know if it makes sense to try to construct a detailed /
accurate cost model for this. I meant more of a general model showing the
general relationship between the amount of work that has to happen in
different places.

regards

--
Tomas Vondra
On 12/17/24 18:52, Robert Haas wrote:
> On Tue, Dec 17, 2024 at 12:06 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>> I think your tests show SKIP_PAGES_THRESHOLD has dubious if any
>> benefit related to readahead. But the question is if we care about it
>> for advancing relfrozenxid for small tables.

Yeah, although I'm a bit hesitant to draw such a clear conclusion from
one set of benchmarks on one machine (or two, but both with flash). I
would not be surprised if the results were less clear on other types of
storage (say, something like EBS).

>
> It seems like relfrozenxid advancement is really only a problem for
> big tables. If the table is small, the eventual aggressive vacuum
> doesn't cost that much.
>

Yeah, I agree with this.

regards

--
Tomas Vondra
On Tue, Dec 17, 2024 at 11:44 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > I did look at the wiki page a bit. But one thing I didn't quite grasp > is how you are proposing to measure the costs/benefits of scanning all > all-visible pages. When you first mentioned this, I imagined you would > use visibilitymap_count() at the beginning of the vacuum and consider > scanning all the all-visible pages if there aren't many (when compared > to the total number of pages needing scanning). FWIW, the way that it actually worked in my abandoned patch set was that VACUUM would always work off of a copy of the VM, taken at the start -- a "VM snapshot" (I had an ambition that this VM snapshot concept would eventually be used to make VACUUM suspendable, since it was easy to serialize to disk in a temp file). Working off a VM snapshot gave VACUUM a fixed idea about which pages it needs to scan -- the cost model worked off of a count (of all-visible vs all-frozen pages) gathered when the VM snapshot was first established. There could never be any repeat VM accesses, since VACUUM just worked off of the snapshot for everything VM related other than setting VM bits. I'm not saying that you should do this exact thing. I'm just providing context. > But then, I'm not sure > I see how that lets you advance relfrozenxid more often. It seems like > the all-visible pages you would scan this way would be younger and > less likely to be required to freeze (per freeze limit), so you'd end > up just uselessly scanning them. I think that it would make sense to have a low "extra pages to scan" threshold of perhaps 5% of rel_pages, so that we automatically scan those pages regardless of the current value of age(relfrozenxid). It could be helpful with advancing relminmxid, in particular. There is a need to work out how that varies as age(relfrozenxid) gets closer to the cutoff for anti-wraparound autovacuum. The way I did that in the old patch set involved ramping up from 5% of rel_pages once age(relfrozenxid) reached the half-way point (half-way to requiring an anti-wraparound autovacuum). Maybe these heuristics aren't the best, but they seem like roughly the right idea to me. A conservative version might be to do something like this, without any of the ramp-up behavior (don't ramp up based on the current age(relfrozenxid)). > > Why should we necessarily have to advance relfrozenxid exactly up to > > FreezeLimit during every aggressive VACUUM? Surely the picture over > > time and across multiple VACUUM operations is what matters most? At > > the very least, we should have an independent XID cutoff for "must > > advance relfrozenxid up to here, no matter what" -- we should just > > reuse FreezeLimit to control that behavior. We might very well "try > > quite hard to advance relfrozenxid to a value >= FreezeLimit" -- we > > just don't have to do it no matter what the cost is. There is a huge > > practical difference between "try quite hard" (e.g., retry the cleanup > > lock acquisition 3 times, with a sleep between each) and "try > > infinitely hard" (i.e., wait for a cleanup lock indefinitely). > > I got a bit confused here. Do you mean that because we call > lazy_scan_noprune() and visit tuples this way, we can still advance > the relfrozenxid to the oldest unfrozen xid value just based on what > we see in lazy_scan_noprune() (i.e. even if we don't get the cleanup > lock)? 
What I meant is that the high-level rule that says that aggressive VACUUM must advance relfrozenxid to a value >= FreezeLimit is fairly arbitrary, and not very principled. It's an artefact of the way that FreezeLimit used to work -- nothing more. It does make sense to have a rule roughly like that, of course, but I think that it should be much looser. VACUUM could be more conservative about advancing relfrozenxid on average (i.e. it could advance relfrozenxid to a value more recent than FreezeLimit, and advance relfrozenxid in more individual VACUUM operations), while at the same time not making an absolute iron-clad guarantee about how much relfrozenxid has to be advanced within any given VACUUM. In short, VACUUM would promise less but deliver more. A part of that might be to teach lazy_scan_noprune to "try a little harder" when it made sense to wait a little while (but not forever) for a cleanup lock. Alternatively, as discussed with Robert today, it might be possible to freeze (and "prune") without requiring a cleanup lock in a way that was sufficient to always be able to advance relfrozenxid without hindrance from cursors with conflicting buffer pins and whatnot. -- Peter Geoghegan
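To put rough numbers on the ramp-up heuristic described a couple of
messages back (a 5% baseline that ramps up once age(relfrozenxid) passes
the half-way point to anti-wraparound autovacuum) -- a hypothetical
sketch, with the linear ramp and every constant chosen purely for
illustration:

    #include <stdint.h>

    typedef uint32_t BlockNumber;   /* stand-in for PostgreSQL's BlockNumber */

    /*
     * Budget of "extra" all-visible pages VACUUM is willing to scan for
     * the sake of relfrozenxid advancement: 5% of rel_pages by default,
     * ramping up linearly towards 100% once age(relfrozenxid) passes half
     * of the anti-wraparound limit (autovacuum_freeze_max_age).
     */
    static BlockNumber
    extra_scan_budget(BlockNumber rel_pages,
                      uint64_t relfrozenxid_age,
                      uint64_t antiwrap_age)
    {
        double      frac = 0.05;
        uint64_t    halfway = antiwrap_age / 2;

        if (halfway == 0)
            return rel_pages;       /* degenerate configuration: no cap */

        if (relfrozenxid_age > halfway)
        {
            /* linear ramp from 5% at the half-way point to 100% at the limit */
            double  p = (double) (relfrozenxid_age - halfway) / (double) halfway;

            frac = 0.05 + p * (1.0 - 0.05);
            if (frac > 1.0)
                frac = 1.0;
        }

        return (BlockNumber) (rel_pages * frac);
    }

The caller would compare this budget against the number of
all-visible-but-not-all-frozen pages counted up front (e.g. via
visibilitymap_count()) and then scan either all of those pages or none of
them.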
On Tue, Dec 17, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote: > > On 12/17/24 18:06, Melanie Plageman wrote: > > On Tue, Dec 17, 2024 at 9:11 AM Tomas Vondra <tomas@vondra.me> wrote: > >> > >> > >> > >> On 12/16/24 19:49, Melanie Plageman wrote: > >> > >>> No, I'm talking about the behavior of causing small pockets of > >>> all-frozen pages which end up being smaller than SKIP_PAGES_THRESHOLD > >>> and are then scanned (even though they are already frozen). What I > >>> describe in that email I cited is that because we freeze > >>> opportunistically when we have or will emit an FPI, and bgwriter will > >>> write out blocks in clocksweep order, we end up with random pockets of > >>> pages getting frozen during/after a checkpoint. Then in the next > >>> vacuum, we end up scanning those all-frozen pages again because the > >>> ranges of frozen pages are smaller than SKIP_PAGES_THRESHOLD. This is > >>> mostly going to happen for an insert-only workload. I'm not saying > >>> freezing the pages is bad, I'm saying that causing these pockets of > >>> frozen pages leads to scanning all-frozen pages on future vacuums. > >>> > >> > >> Yeah, this interaction between the components is not great :-( But can > >> we think of a way to reduce the fragmentation? What would need to change? > > > > Well reducing SKIP_PAGES_THRESHOLD would help. > > How does SKIP_PAGES_THRESHOLD change the fragmentation? I think that's > the side that's affected by the fragmentation, but it's really due to > the eager freezing / bgwriter evictions, etc. If the threshold is set to > 1 (i.e. to always skip), that just lowers the impact, but the relation > is still as fragmented as before, no? Yep, exactly. It doesn't help with fragmentation. It just helps us not scan all-frozen pages. The question is whether or not the fragmentation on its own matters. I think it would be better if we didn't have it -- we could potentially do larger reads, for example, if we have one continuous block of pages that are not all-frozen (most likely when using the read stream API). > > And unfortunately we do > > not know if the skippable pages are all-frozen without extra > > visibilitymap_get_status() calls -- so we can't decide to avoid > > scanning ranges of skippable pages because they are frozen. > > > > I may be missing something, but doesn't find_next_unskippable_block() > already get those bits? In fact, it even checks VISIBILITYMAP_ALL_FROZEN > but only for aggressive vacuum. But even if that wasn't the case, isn't > checking the VM likely much cheaper than vacuuming the heap page? find_next_unskippable_block() has them, but then if we do decide to skip a range of pages, back in heap_vac_scan_next_block() where we decide whether or not to skip a range using SKIP_PAGES_THRESHOLD, we know that the pages in the range are all-visible (otherwise they wouldn't be skippable) but we no longer know which of them were all-frozen. > >> Maybe the freezing code could check how many of the nearby pages are > >> frozen, and consider that together with the FPI write? > > > > That's an interesting idea. We wouldn't have any guaranteed info > > because we only have a lock on the page we are considering freezing. > > But we could keep track of the length of a run of pages we are > > freezing and opportunistically freeze pages that don't require > > freezing if they follow one or more pages requiring freezing. > > I don't think we need a "guaranteed" information - a heuristics that's > correct most of the time (say, >90%?) ought to be good enough. 
> I mean, it has to be, because we'll never get a rule that's correct
> 100%. So even just looking at a batch of pages in VM should be enough,
> no?
>
> > But I don't know how much more this buys us than removing
> > SKIP_PAGES_THRESHOLD. Since it would "fix" the fragmentation, perhaps
> > it makes larger future vacuum reads possible. But I wonder how much
> > benefit it would be vs complexity.
> >
>
> I think that depends on which cost we're talking about. If we only talk
> about the efficiency of a single vacuum, then it probably does not help
> very much. I mean, if we assume the relation is already fragmented, then
> it seems to be cheaper to vacuum just the pages that need it (as if with
> SKIP_PAGES_THRESHOLD=1).
>
> But if we're talking about long-term benefits, in reducing the amount of
> freezing needed overall, maybe it'd be a win? I don't know.

Yea, it just depends on whether or not the pages we freeze for this
reason are likely to stay frozen. I think I misspoke in saying we want to
freeze pages next to pages requiring freezing. What we really want to do
is freeze pages next to pages that are being opportunistically frozen --
because those are the ones that are creating the fragmentation. But, then
where do you draw the line? You won't know if you are creating lots of
random holes until after you've skipped opportunistically freezing some
pages -- and by then it's too late.

- Melanie
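One hypothetical way to "draw the line" would be for the
opportunistic-freeze path to peek at the VM bits of a few neighboring
blocks and lean toward freezing when skipping would punch a small hole
into an otherwise-frozen range. A sketch only, with the window size and
the majority rule chosen arbitrarily:

    #include "postgres.h"
    #include "access/visibilitymap.h"
    #include "utils/relcache.h"

    /*
     * Return true if most of the blocks immediately surrounding blkno are
     * already all-frozen according to the VM, in which case freezing blkno
     * opportunistically avoids leaving an isolated unfrozen page behind.
     */
    static bool
    neighbours_mostly_frozen(Relation rel, BlockNumber blkno,
                             BlockNumber rel_pages, Buffer *vmbuf)
    {
        int         frozen = 0,
                    checked = 0;

        for (int delta = -2; delta <= 2; delta++)
        {
            BlockNumber nb;

            if (delta == 0)
                continue;
            if (delta < 0 && blkno < (BlockNumber) (-delta))
                continue;
            nb = blkno + delta;
            if (nb >= rel_pages)
                continue;

            checked++;
            if (visibilitymap_get_status(rel, nb, vmbuf) & VISIBILITYMAP_ALL_FROZEN)
                frozen++;
        }

        return checked > 0 && frozen * 2 > checked;     /* majority frozen */
    }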