Thread: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Attached WIP patch series significantly simplifies the definition of scanned_pages inside vacuumlazy.c. Apart from making several very tricky things a lot simpler, and moving more complex code outside of the big "blkno" loop inside lazy_scan_heap (building on the Postgres 14 work), this refactoring directly facilitates 2 new optimizations (also in the patch): 1. We now collect LP_DEAD items into the dead_tuples array for all scanned pages -- even when we cannot get a cleanup lock. 2. We now don't give up on advancing relfrozenxid during a non-aggressive VACUUM when we happen to be unable to get a cleanup lock on a heap page. Both optimizations are much more natural with the refactoring in place. Especially #2, which can be thought of as making aggressive and non-aggressive VACUUM behave similarly. Sure, we shouldn't wait for a cleanup lock in a non-aggressive VACUUM (by definition) -- and we still don't in the patch (obviously). But why wouldn't we at least *check* if the page has tuples that need to be frozen in order for us to advance relfrozenxid? Why give up on advancing relfrozenxid in a non-aggressive VACUUM when there's no good reason to? See the draft commit messages from the patch series for many more details on the simplifications I am proposing. I'm not sure how much value the second optimization has on its own. But I am sure that the general idea of teaching non-aggressive VACUUM to be conscious of the value of advancing relfrozenxid is a good one -- and so #2 is a good start on that work, at least. I've discussed this idea with Andres (CC'd) a few times before now. Maybe we'll need another patch that makes VACUUM avoid setting heap pages to all-visible without also setting them to all-frozen (and freezing as necessary) in order to really get a benefit. Since, of course, a non-aggressive VACUUM still won't be able to advance relfrozenxid when it skipped over all-visible pages that are not also known to be all-frozen. Masahiko (CC'd) has expressed interest in working on opportunistic freezing. This refactoring patch seems related to that general area, too. At a high level, to me, this seems like the tuple freezing equivalent of the Postgres 14 work on bypassing index vacuuming when there are very few LP_DEAD items (interpret that as 0 LP_DEAD items, which is close to the truth anyway). There are probably quite a few interesting opportunities to make VACUUM better by not having such a sharp distinction between aggressive and non-aggressive VACUUM. Why should they be so different? A good medium term goal might be to completely eliminate aggressive VACUUMs. I have heard many stories about anti-wraparound/aggressive VACUUMs where the cure (which suddenly made autovacuum workers non-cancellable) was worse than the disease (not actually much danger of wraparound failure). For example: https://www.joyent.com/blog/manta-postmortem-7-27-2015 Yes, this problem report is from 2015, which is before we even had the freeze map stuff. I still think that the point about aggressive VACUUMs blocking DDL (leading to chaos) remains valid. There is another interesting area of future optimization within VACUUM, that also seems relevant to this patch: the general idea of *avoiding* pruning during VACUUM, when it just doesn't make sense to do so -- better to avoid dirtying the page for now. 
Needlessly pruning inside lazy_scan_prune is hardly rare -- standard pgbench (maybe only with heap fill factor reduced to 95) will have autovacuums that *constantly* do it (granted, it may not matter so much there because VACUUM is unlikely to re-dirty the page anyway). This patch seems relevant to that area because it recognizes that pruning during VACUUM is not necessarily special -- a new function called lazy_scan_noprune may be used instead of lazy_scan_prune (though only when a cleanup lock cannot be acquired). These pages are nevertheless considered fully processed by VACUUM (this is perhaps 99% true, so it seems reasonable to round up to 100% true). I find it easy to imagine generalizing the same basic idea -- recognizing more ways in which pruning by VACUUM isn't necessarily better than opportunistic pruning, at the level of each heap page. Of course we *need* to prune sometimes (e.g., might be necessary to do so to set the page all-visible in the visibility map), but why bother when we don't, and when there is no reason to think that it'll help anyway? Something to think about, at least. -- Peter Geoghegan
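To make the shape of that change concrete, the cleanup-lock-failure path of the per-block loop ends up looking roughly like this (a condensed excerpt based on the patch hunks quoted downthread; error handling, progress reporting, and the details of the fallback when lazy_scan_noprune can't finish the page are omitted):

    buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
                             RBM_NORMAL, vacrel->bstrategy);
    page = BufferGetPage(buf);
    vacrel->scanned_pages++;

    if (!ConditionalLockBufferForCleanup(buf))
    {
        bool    hastup;

        /* Settle for reduced processing with only a share lock */
        LockBuffer(buf, BUFFER_LOCK_SHARE);

        /* Check for new or empty pages before lazy_scan_noprune call */
        if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true, vmbuffer))
        {
            /* Lock and pin released for us */
            continue;
        }

        if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup))
        {
            /*
             * Page fully processed without a cleanup lock: preexisting
             * LP_DEAD items were still collected into dead_tuples, and
             * nothing on the page stands in the way of advancing
             * relfrozenxid later (optimizations 1 and 2 above).
             */
            UnlockReleaseBuffer(buf);
            if (hastup)
                vacrel->nonempty_pages = blkno + 1;
            continue;
        }

        /*
         * lazy_scan_noprune couldn't do all required processing for this
         * page; fall through to the usual cleanup-lock path.
         */
    }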
Hi, On 2021-11-21 18:13:51 -0800, Peter Geoghegan wrote: > I have heard many stories about anti-wraparound/aggressive VACUUMs > where the cure (which suddenly made autovacuum workers > non-cancellable) was worse than the disease (not actually much danger > of wraparound failure). For example: > > https://www.joyent.com/blog/manta-postmortem-7-27-2015 > > Yes, this problem report is from 2015, which is before we even had the > freeze map stuff. I still think that the point about aggressive > VACUUMs blocking DDL (leading to chaos) remains valid. As I noted below, I think this is a bit of a separate issue from what your changes address in this patch. > There is another interesting area of future optimization within > VACUUM, that also seems relevant to this patch: the general idea of > *avoiding* pruning during VACUUM, when it just doesn't make sense to > do so -- better to avoid dirtying the page for now. Needlessly pruning > inside lazy_scan_prune is hardly rare -- standard pgbench (maybe only > with heap fill factor reduced to 95) will have autovacuums that > *constantly* do it (granted, it may not matter so much there because > VACUUM is unlikely to re-dirty the page anyway). Hm. I'm a bit doubtful that there's all that many cases where it's worth not pruning during vacuum. However, it seems much more common for opportunistic pruning during non-write accesses. Perhaps checking whether we'd log an FPW would be a better criterion for deciding whether to prune or not compared to whether we're dirtying the page? IME the WAL volume impact of FPWs is a considerably bigger deal than unnecessarily dirtying a page that has previously been dirtied in the same checkpoint "cycle". > This patch seems relevant to that area because it recognizes that pruning > during VACUUM is not necessarily special -- a new function called > lazy_scan_noprune may be used instead of lazy_scan_prune (though only when a > cleanup lock cannot be acquired). These pages are nevertheless considered > fully processed by VACUUM (this is perhaps 99% true, so it seems reasonable > to round up to 100% true). IDK, the potential of not having usable space on an overly fragmented page doesn't seem that low. We can't just mark such pages as all-visible because then we'll potentially never reclaim that space. > Since any VACUUM (not just an aggressive VACUUM) can sometimes advance > relfrozenxid, we now make non-aggressive VACUUMs work just a little > harder in order to make that desirable outcome more likely in practice. > Aggressive VACUUMs have long checked contended pages with only a shared > lock, to avoid needlessly waiting on a cleanup lock (in the common case > where the contended page has no tuples that need to be frozen anyway). > We still don't make non-aggressive VACUUMs wait for a cleanup lock, of > course -- if we did that they'd no longer be non-aggressive. IMO the big difference between aggressive / non-aggressive isn't whether we wait for a cleanup lock, but that we don't skip all-visible pages... > But we now make the non-aggressive case notice that a failure to acquire a > cleanup lock on one particular heap page does not in itself make it unsafe > to advance relfrozenxid for the whole relation (which is what we usually see > in the aggressive case already). > > This new relfrozenxid optimization might not be all that valuable on its > own, but it may still facilitate future work that makes non-aggressive > VACUUMs more conscious of the benefit of advancing relfrozenxid sooner > rather than later. 
In general it would be useful for non-aggressive > VACUUMs to be "more aggressive" opportunistically (e.g., by waiting for > a cleanup lock once or twice if needed). What do you mean by "waiting once or twice"? A single wait may simply never end on a busy page that's constantly pinned by a lot of backends... > It would also be generally useful if aggressive VACUUMs were "less > aggressive" opportunistically (e.g. by being responsive to query > cancellations when the risk of wraparound failure is still very low). Being cancelable is already a different concept than anti-wraparound vacuums. We start aggressive autovacuums at vacuum_freeze_table_age, but anti-wrap only at autovacuum_freeze_max_age. The problem is that the autovacuum scheduling is way too naive for that to be a significant benefit - nothing tries to schedule autovacuums so that they have a chance to complete before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum. This is one of the most embarrassing issues around the whole anti-wrap topic. We kind of define it as an emergency that there's an anti-wraparound vacuum. But we have *absolutely no mechanism* to prevent them from occurring. > We now also collect LP_DEAD items in the dead_tuples array in the case > where we cannot immediately get a cleanup lock on the buffer. We cannot > prune without a cleanup lock, but opportunistic pruning may well have > left some LP_DEAD items behind in the past -- no reason to miss those. This has become *much* more important with the changes around deciding when to index vacuum. It's not just that opportunistic pruning could have left LP_DEAD items, it's that a previous vacuum is quite likely to have left them there, because the previous vacuum decided not to perform index cleanup. > Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic > technique is independently capable of cleaning up line pointer bloat), One thing we could do around this, btw, would be to aggressively replace LP_REDIRECT items with their target item. We can't do that in all situations (somebody might be following a ctid chain), but I think we have all the information needed to do so. Probably would require a new HTSV RECENTLY_LIVE state or something like that. I think that'd be quite a win - we right now often "migrate" to other pages for modifications not because we're out of space on a page, but because we run out of itemids (for debatable reasons MaxHeapTuplesPerPage constrains the number of line pointers, not just the number of actual tuples). Effectively doubling the number of available line items in common cases in a number of realistic / common scenarios would be quite the win. > Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE, > since there is barely any real practical sense in which we actually > miss doing useful work for these pages. Besides, this information > always seemed to have little practical value, even to Postgres hackers. -0.5. I think it provides some value, and I don't see why the removal of the information should be tied to this change. It's hard to diagnose why some dead tuples aren't cleaned up - a common cause for that on smaller tables is that nearly all pages are pinned nearly all the time. I wonder if we could have a more restrained version of heap_page_prune() that doesn't require a cleanup lock? 
Obviously we couldn't defragment the page, but it's not immediately obvious that we need it if we constrain ourselves to only modify tuple versions that cannot be visible to anybody. Random note: I really dislike that we talk about cleanup locks in some parts of the code, and super-exclusive locks in others :(. > + /* > + * Aggressive VACUUM (which is the same thing as anti-wraparound > + * autovacuum for most practical purposes) exists so that we'll reliably > + * advance relfrozenxid and relminmxid sooner or later. But we can often > + * opportunistically advance them even in a non-aggressive VACUUM. > + * Consider if that's possible now. I don't agree with the "most practical purposes" bit. There's a huge difference because manual VACUUMs end up aggressive but not anti-wrap once older than vacuum_freeze_table_age. > + * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want > + * the rel_pages used by lazy_scan_prune, from before a possible relation > + * truncation took place. (vacrel->rel_pages is now new_rel_pages.) > + */ I think it should be doable to add an isolation test for this path. There have been quite a few bugs around the wider topic... > + if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages || > + !vacrel->freeze_cutoffs_valid) > + { > + /* Cannot advance relfrozenxid/relminmxid -- just update pg_class */ > + Assert(!aggressive); > + vac_update_relstats(rel, new_rel_pages, new_live_tuples, > + new_rel_allvisible, vacrel->nindexes > 0, > + InvalidTransactionId, InvalidMultiXactId, false); > + } > + else > + { > + /* Can safely advance relfrozen and relminmxid, too */ > + Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages == > + orig_rel_pages); > + vac_update_relstats(rel, new_rel_pages, new_live_tuples, > + new_rel_allvisible, vacrel->nindexes > 0, > + FreezeLimit, MultiXactCutoff, false); > + } I wonder if this whole logic wouldn't become easier and less fragile if we just went for maintaining the "actually observed" horizon while scanning the relation. If we skip a page via VM set the horizon to invalid. Otherwise we can keep track of the accurate horizon and use that. No need to count pages and stuff. > @@ -1050,18 +1046,14 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive) > bool all_visible_according_to_vm = false; > LVPagePruneState prunestate; > > - /* > - * Consider need to skip blocks. See note above about forcing > - * scanning of last page. > - */ > -#define FORCE_CHECK_PAGE() \ > - (blkno == nblocks - 1 && should_attempt_truncation(vacrel)) > - > pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno); > > update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP, > blkno, InvalidOffsetNumber); > > + /* > + * Consider need to skip blocks > + */ > if (blkno == next_unskippable_block) > { > /* Time to advance next_unskippable_block */ > @@ -1110,13 +1102,19 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive) > else > { > /* > - * The current block is potentially skippable; if we've seen a > - * long enough run of skippable blocks to justify skipping it, and > - * we're not forced to check it, then go ahead and skip. > - * Otherwise, the page must be at least all-visible if not > - * all-frozen, so we can set all_visible_according_to_vm = true. > + * The current block can be skipped if we've seen a long enough > + * run of skippable blocks to justify skipping it. 
> + * > + * There is an exception: we will scan the table's last page to > + * determine whether it has tuples or not, even if it would > + * otherwise be skipped (unless it's clearly not worth trying to > + * truncate the table). This avoids having lazy_truncate_heap() > + * take access-exclusive lock on the table to attempt a truncation > + * that just fails immediately because there are tuples in the > + * last page. > */ > - if (skipping_blocks && !FORCE_CHECK_PAGE()) > + if (skipping_blocks && > + !(blkno == nblocks - 1 && should_attempt_truncation(vacrel))) > { > /* > * Tricky, tricky. If this is in aggressive vacuum, the page I find the FORCE_CHECK_PAGE macro decidedly unhelpful. But I don't like mixing such changes within a larger change doing many other things. > @@ -1204,156 +1214,52 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive) > > buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, > RBM_NORMAL, vacrel->bstrategy); > + page = BufferGetPage(buf); > + vacrel->scanned_pages++; I don't particularly like doing BufferGetPage() before holding a lock on the page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd be good to have a crosscheck that BufferGetPage() is only allowed when holding a page level lock. > /* > - * We need buffer cleanup lock so that we can prune HOT chains and > - * defragment the page. > + * We need a buffer cleanup lock to prune HOT chains and defragment > + * the page in lazy_scan_prune. But when it's not possible to acquire > + * a cleanup lock right away, we may be able to settle for reduced > + * processing in lazy_scan_noprune. > */ s/in lazy_scan_noprune/via lazy_scan_noprune/? > if (!ConditionalLockBufferForCleanup(buf)) > { > bool hastup; > > - /* > - * If we're not performing an aggressive scan to guard against XID > - * wraparound, and we don't want to forcibly check the page, then > - * it's OK to skip vacuuming pages we get a lock conflict on. They > - * will be dealt with in some future vacuum. > - */ > - if (!aggressive && !FORCE_CHECK_PAGE()) > + LockBuffer(buf, BUFFER_LOCK_SHARE); > + > + /* Check for new or empty pages before lazy_scan_noprune call */ > + if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true, > + vmbuffer)) > { > - ReleaseBuffer(buf); > - vacrel->pinskipped_pages++; > + /* Lock and pin released for us */ > + continue; > + } Why isn't this done in lazy_scan_noprune()? > + if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup)) > + { > + /* No need to wait for cleanup lock for this page */ > + UnlockReleaseBuffer(buf); > + if (hastup) > + vacrel->nonempty_pages = blkno + 1; > continue; > } Do we really need all of buf, blkno, page for both of these functions? Quite possible that yes, if so, could we add an assertion that BufferGetBlockNumber(buf) == blkno? > + /* Check for new or empty pages before lazy_scan_prune call */ > + if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer)) > { Maybe worth a note mentioning that we need to redo this even in the aggressive case, because we didn't continually hold a lock on the page? > +/* > + * Empty pages are not really a special case -- they're just heap pages that > + * have no allocated tuples (including even LP_UNUSED items). You might > + * wonder why we need to handle them here all the same. It's only necessary > + * because of a rare corner-case involving a hard crash during heap relation > + * extension. 
If we ever make relation-extension crash safe, then it should > + * no longer be necessary to deal with empty pages here (or new pages, for > + * that matter). I don't think it's actually that rare - the window for this is huge. You just need to crash / immediate shutdown at any time between the relation having been extended and the new page contents being written out (checkpoint or buffer replacement / ring writeout). That's often many minutes. I don't really see that as a realistic thing to ever reliably avoid, FWIW. I think the overhead would be prohibitive. We'd need to do synchronous WAL logging while holding the extension lock I think. Um, not fun. > + * Caller can either hold a buffer cleanup lock on the buffer, or a simple > + * shared lock. > + */ Kinda sounds like it'd be incorrect to call this with an exclusive lock, which made me wonder why that could be true. Perhaps just say that it needs to be called with at least a shared lock? > +static bool > +lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno, > + Page page, bool sharelock, Buffer vmbuffer) It'd be good to document the return value - for me it's not a case where it's so obvious that it's not worth it. > +/* > + * lazy_scan_noprune() -- lazy_scan_prune() variant without pruning > + * > + * Caller need only hold a pin and share lock on the buffer, unlike > + * lazy_scan_prune, which requires a full cleanup lock. I'd add something like "returns whether a cleanup lock is required". Having to read multiple paragraphs to understand the basic meaning of the return value isn't great. > + if (ItemIdIsRedirected(itemid)) > + { > + *hastup = true; /* page won't be truncatable */ > + continue; > + } It's not really new, but this comment is now a bit confusing, because it can be understood to be about PageTruncateLinePointerArray(). > + case HEAPTUPLE_DEAD: > + case HEAPTUPLE_RECENTLY_DEAD: > + > + /* > + * We count DEAD and RECENTLY_DEAD tuples in new_dead_tuples. > + * > + * lazy_scan_prune only does this for RECENTLY_DEAD tuples, > + * and never has to deal with DEAD tuples directly (they > + * reliably become LP_DEAD items through pruning). Our > + * approach to DEAD tuples is a bit arbitrary, but it seems > + * better than totally ignoring them. > + */ > + new_dead_tuples++; > + break; Why does it make sense to track DEAD tuples this way? Isn't that going to lead to counting them over-and-over again? I think it's quite misleading to include them in "dead but not yet removable". > + /* > + * Now save details of the LP_DEAD items from the page in the dead_tuples > + * array iff VACUUM uses two-pass strategy case > + */ Do we really need to have separate code for this in lazy_scan_prune() and lazy_scan_noprune()? > + } > + else > + { > + /* > + * We opt to skip FSM processing for the page on the grounds that it > + * is probably being modified by concurrent DML operations. Seems > + * best to assume that the space is best left behind for future > + * updates of existing tuples. This matches what opportunistic > + * pruning does. Why can we assume that there is concurrent DML rather than concurrent read-only operations? IME it's much more common for read-only operations to block cleanup locks than read-write ones (partially because the frequency makes it easier, partially because cursors allow long-held pins, partially because the EXCLUSIVE lock of a r/w operation wouldn't let us get here). I think this is a change mostly in the right direction. But as formulated this commit does *WAY* too much at once. 
Greetings, Andres Freund
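As a concrete illustration of the "actually observed horizon" suggestion above, the bookkeeping could look something like the following hypothetical sketch (struct, field, and helper names are illustrative, not from any posted patch; the tracker would start out valid, with oldest_xid initialized to OldestXmin, the most recent value that could possibly be claimed, and then be lowered as unfrozen XIDs are seen):

    typedef struct LVObservedHorizon
    {
        bool            valid;          /* false once an unsafe page is skipped */
        TransactionId   oldest_xid;     /* oldest unfrozen XID left behind */
        MultiXactId     oldest_mxid;    /* oldest unfrozen MultiXactId left behind */
    } LVObservedHorizon;

    /* A block was skipped based on the visibility map */
    static inline void
    observed_horizon_skipped_page(LVObservedHorizon *hor, bool all_frozen)
    {
        /* Skipping an all-frozen page cannot hold relfrozenxid back */
        if (!all_frozen)
            hor->valid = false;
    }

    /* An unfrozen xmin/xmax was left behind on a page we did process */
    static inline void
    observed_horizon_note_xid(LVObservedHorizon *hor, TransactionId xid)
    {
        if (TransactionIdIsNormal(xid) &&
            TransactionIdPrecedes(xid, hor->oldest_xid))
            hor->oldest_xid = xid;
    }

    /*
     * At the end of the heap pass there is no page counting -- either the
     * tracked horizon is still valid and becomes the new relfrozenxid /
     * relminmxid, or pg_class is updated without advancing them.  (The
     * MultiXactId side would be tracked analogously.)
     */
    vac_update_relstats(rel, new_rel_pages, new_live_tuples,
                        new_rel_allvisible, vacrel->nindexes > 0,
                        hor.valid ? hor.oldest_xid : InvalidTransactionId,
                        hor.valid ? hor.oldest_mxid : InvalidMultiXactId,
                        false);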
On Mon, Nov 22, 2021 at 11:29 AM Andres Freund <andres@anarazel.de> wrote: > Hm. I'm a bit doubtful that there's all that many cases where it's worth not > pruning during vacuum. However, it seems much more common for opportunistic > pruning during non-write accesses. Fair enough. I just wanted to suggest an exploratory conversation about pruning (among several other things). I'm mostly saying: hey, pruning during VACUUM isn't actually that special, at least not with this refactoring patch in place. So maybe it makes sense to go further, in light of that general observation about pruning in VACUUM. Maybe it wasn't useful to even mention this aspect now. I would rather focus on freezing optimizations for now -- that's much more promising. > Perhaps checking whether we'd log an FPW would be a better criteria for > deciding whether to prune or not compared to whether we're dirtying the page? > IME the WAL volume impact of FPWs is a considerably bigger deal than > unnecessarily dirtying a page that has previously been dirtied in the same > checkpoint "cycle". Agreed. (I tend to say the former when I really mean the latter, which I should try to avoid.) > IDK, the potential of not having usable space on an overfly fragmented page > doesn't seem that low. We can't just mark such pages as all-visible because > then we'll potentially never reclaim that space. Don't get me started on this - because I'll never stop. It makes zero sense that we don't think about free space holistically, using the whole context of what changed in the recent past. As I think you know already, a higher level concept (like open and closed pages) seems like the right direction to me -- because it isn't sensible to treat X bytes of free space in one heap page as essentially interchangeable with any other space on any other heap page. That misses an enormous amount of things that matter. The all-visible status of a page is just one such thing. > IMO the big difference between aggressive / non-aggressive isn't whether we > wait for a cleanup lock, but that we don't skip all-visible pages... I know what you mean by that, of course. But FWIW that definition seems too focused on what actually happens today, rather than what is essential given the invariants we have for VACUUM. And so I personally prefer to define it as "a VACUUM that *reliably* advances relfrozenxid". This looser definition will probably "age" well (ahem). > > This new relfrozenxid optimization might not be all that valuable on its > > own, but it may still facilitate future work that makes non-aggressive > > VACUUMs more conscious of the benefit of advancing relfrozenxid sooner > > rather than later. In general it would be useful for non-aggressive > > VACUUMs to be "more aggressive" opportunistically (e.g., by waiting for > > a cleanup lock once or twice if needed). > > What do you mean by "waiting once or twice"? A single wait may simply never > end on a busy page that's constantly pinned by a lot of backends... I was speculating about future work again. I think that you've taken my words too literally. This is just a draft commit message, just a way of framing what I'm really trying to do. Sure, it wouldn't be okay to wait *indefinitely* for any one pin in a non-aggressive VACUUM -- so "at least waiting for one or two pins during non-aggressive VACUUM" might not have been the best way of expressing the idea that I wanted to express. 
The important point is that _we can make a choice_ about stuff like this dynamically, based on the observed characteristics of the table, and some general ideas about the costs and benefits (of waiting or not waiting, or of how long we want to wait in total, whatever might be important). This probably just means adding some heuristics that are pretty sensitive to any reason to not do more work in a non-aggressive VACUUM, without *completely* balking at doing even a tiny bit more work. For example, we can definitely afford to wait a few more milliseconds to get a cleanup lock just once, especially if we're already pretty sure that that's all the extra work that it would take to ultimately be able to advance relfrozenxid in the ongoing (non-aggressive) VACUUM -- it's easy to make that case. Once you agree that it makes sense under these favorable circumstances, you've already made "aggressiveness" a continuous thing conceptually, at a high level. The current binary definition of "aggressive" is needlessly restrictive -- that much seems clear to me. I'm much less sure of what specific alternative should replace it. I've already prototyped advancing relfrozenxid using a dynamically determined value, so that our final relfrozenxid is just about the most recent safe value (not the original FreezeLimit). That's been interesting. Consider this log output from an autovacuum with the prototype patch (also uses my new instrumentation), based on standard pgbench (just tuned heap fill factor a bit):

LOG: automatic vacuum of table "regression.public.pgbench_accounts": index scans: 0
pages: 0 removed, 909091 remain, 33559 skipped using visibility map (3.69% of total)
tuples: 297113 removed, 50090880 remain, 90880 are dead but not yet removable
removal cutoff: oldest xmin was 29296744, which is now 203341 xact IDs behind
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
I/O timings: read: 55.574 ms, write: 0.000 ms
avg read rate: 17.805 MB/s, avg write rate: 4.389 MB/s
buffer usage: 1728273 hits, 23150 misses, 5706 dirtied
WAL usage: 594211 records, 0 full page images, 35065032 bytes
system usage: CPU: user: 6.85 s, system: 0.08 s, elapsed: 10.15 s

All of the autovacuums against the accounts table look similar to this one -- you don't see anything about relfrozenxid being advanced (because it isn't). Whereas for the smaller pgbench tables, every single VACUUM successfully advances relfrozenxid to a fairly recent XID (without there ever being an aggressive VACUUM) -- just because VACUUM needs to visit every page for the smaller tables. While the accounts table doesn't generally need to have 100% of all pages touched by VACUUM -- it's more like 95% there. Does that really make sense, though? I'm pretty sure that less aggressive VACUUMing (e.g. higher scale_factor setting) would lead to more aggressive setting of relfrozenxid here. I'm always suspicious when I see insignificant differences that lead to significant behavioral differences. Am I worried over nothing here? Perhaps -- we don't really need to advance relfrozenxid early with this table/workload anyway. But I'm not so sure. Again, my point is that there is a good chance that redefining aggressiveness in some way will be helpful. A more creative, flexible definition might be just what we need. The details are very much up in the air, though. > > It would also be generally useful if aggressive VACUUMs were "less > > aggressive" opportunistically (e.g. 
by being responsive to query > > cancellations when the risk of wraparound failure is still very low). > > Being canceleable is already a different concept than anti-wraparound > vacuums. We start aggressive autovacuums at vacuum_freeze_table_age, but > anti-wrap only at autovacuum_freeze_max_age. You know what I meant. Also, did *you* mean "being canceleable is already a different concept to *aggressive* vacuums"? :-) > The problem is that the > autovacuum scheduling is way too naive for that to be a significant benefit - > nothing tries to schedule autovacuums so that they have a chance to complete > before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is > to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum. Not sure what you mean about scheduling, since vacuum_freeze_table_age is only in place to make overnight (off hours low activity scripted VACUUMs) freeze tuples before any autovacuum worker gets the chance (since the latter may run at a much less convenient time). Sure, vacuum_freeze_table_age might also force a regular autovacuum worker to do an aggressive VACUUM -- but I think it's mostly intended for a manual overnight VACUUM. Not usually very helpful, but also not harmful. Oh, wait. I think that you're talking about how autovacuum workers in particular tend to be affected by this. We launch an av worker that wants to clean up bloat, but it ends up being aggressive (and maybe taking way longer), perhaps quite randomly, only due to vacuum_freeze_table_age (not due to autovacuum_freeze_max_age). Is that it? > This is one of the most embarassing issues around the whole anti-wrap > topic. We kind of define it as an emergency that there's an anti-wraparound > vacuum. But we have *absolutely no mechanism* to prevent them from occurring. What do you mean? Only an autovacuum worker can do an anti-wraparound VACUUM (which is not quite the same thing as an aggressive VACUUM). I agree that anti-wraparound autovacuum is way too unfriendly, though. > > We now also collect LP_DEAD items in the dead_tuples array in the case > > where we cannot immediately get a cleanup lock on the buffer. We cannot > > prune without a cleanup lock, but opportunistic pruning may well have > > left some LP_DEAD items behind in the past -- no reason to miss those. > > This has become *much* more important with the changes around deciding when to > index vacuum. It's not just that opportunistic pruning could have left LP_DEAD > items, it's that a previous vacuum is quite likely to have left them there, > because the previous vacuum decided not to perform index cleanup. I haven't seen any evidence of that myself (with the optimization added to Postgres 14 by commit 5100010ee4). I still don't understand why you doubted that work so much. I'm not saying that you're wrong to; I'm saying that I don't think that I understand your perspective on it. What I have seen in my own tests (particularly with BenchmarkSQL) is that most individual tables either never apply the optimization even once (because the table reliably has heap pages with many more LP_DEAD items than the 2%-of-relpages threshold), or will never need to (because there are precisely zero LP_DEAD items anyway). Remaining tables that *might* use the optimization tend to not go very long without actually getting a round of index vacuuming. It's just too easy for updates (and even aborted xact inserts) to introduce new LP_DEAD items for us to go long without doing index vacuuming. 
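For reference, the bypass test that the 2%-of-relpages figure comes from (added to lazy_vacuum() by commit 5100010ee4) is approximately the following; the constants and the absolute cap on remembered dead item TIDs are paraphrased from memory, so treat the details as approximate:

    /* BYPASS_THRESHOLD_PAGES is 0.02 -- the 2%-of-relpages threshold */
    bool        bypass = false;

    if (vacrel->consider_bypass_optimization && vacrel->rel_pages > 0)
    {
        BlockNumber threshold;

        threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
        bypass = (vacrel->lpdead_item_pages < threshold &&
                  vacrel->lpdead_items < dead_items_cap);  /* cap: TIDs fitting in ~32MB */
    }

    if (bypass)
    {
        /*
         * Skip index vacuuming this time; the LP_DEAD items stay behind,
         * to be picked up by some later VACUUM (possibly via the
         * lazy_scan_noprune path discussed in this thread).
         */
        vacrel->do_index_vacuuming = false;
    }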
If you can be more concrete about a problem you've seen, then I might be able to help. It's not like there are no options here already. I already thought about introducing a small degree of randomness into the process of deciding to skip or to not skip (in the consider_bypass_optimization path of lazy_vacuum() on Postgres 14). The optimization is mostly valuable because it allows us to do more useful work in VACUUM -- not because it allows us to do less useless work in VACUUM. In particular, it allows us to tune autovacuum_vacuum_insert_scale_factor very aggressively with an append-only table, without useless index vacuuming making it all but impossible for autovacuum to get to the useful work. > > Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic > > technique is independently capable of cleaning up line pointer bloat), > > One thing we could do around this, btw, would be to aggressively replace > LP_REDIRECT items with their target item. We can't do that in all situations > (somebody might be following a ctid chain), but I think we have all the > information needed to do so. Probably would require a new HTSV RECENTLY_LIVE > state or something like that. Another idea is to truncate the line pointer during pruning (including opportunistic pruning). Matthias van de Meent has a patch for that. I am not aware of a specific workload where the patch helps, but that doesn't mean that there isn't one, or that it doesn't matter. It's subtle enough that I might have just missed something. I *expect* the true damage over time to be very hard to model or understand -- I imagine the potential for weird feedback loops is there. > I think that'd be quite a win - we right now often "migrate" to other pages > for modifications not because we're out of space on a page, but because we run > out of itemids (for debatable reasons MaxHeapTuplesPerPage constrains the > number of line pointers, not just the number of actual tuples). Effectively > doubling the number of available line items in common cases in a number of > realistic / common scenarios would be quite the win. I believe Masahiko is working on this in the current cycle. It would be easier if we had a better sense of how increasing MaxHeapTuplesPerPage will affect tidbitmap.c. But the idea of increasing that seems sound to me. > > Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE, > > since there is barely any real practical sense in which we actually > > miss doing useful work for these pages. Besides, this information > > always seemed to have little practical value, even to Postgres hackers. > > -0.5. I think it provides some value, and I don't see why the removal of the > information should be tied to this change. It's hard to diagnose why some dead > tuples aren't cleaned up - a common cause for that on smaller tables is that > nearly all pages are pinned nearly all the time. Is that still true, though? If it turns out that we need to leave it in, then I can do that. But I'd prefer to wait until we have more information before making a final decision. Remember, the high level idea of this whole patch is that we do as much work as possible for any scanned_pages, which now includes pages that we never successfully acquired a cleanup lock on. And so we're justified in assuming that they're exactly equivalent to pages that we did get a cleanup on -- that's now the working assumption. 
I know that that's not literally true, but that doesn't mean it's not a useful fiction -- it should be very close to the truth. Also, I would like to put more information (much more useful information) in the same log output. Perhaps that will be less controversial if I take something useless away first. > I wonder if we could have a more restrained version of heap_page_prune() that > doesn't require a cleanup lock? Obviously we couldn't defragment the page, but > it's not immediately obvious that we need it if we constrain ourselves to only > modify tuple versions that cannot be visible to anybody. > > Random note: I really dislike that we talk about cleanup locks in some parts > of the code, and super-exclusive locks in others :(. Somebody should normalize that. > > + /* > > + * Aggressive VACUUM (which is the same thing as anti-wraparound > > + * autovacuum for most practical purposes) exists so that we'll reliably > > + * advance relfrozenxid and relminmxid sooner or later. But we can often > > + * opportunistically advance them even in a non-aggressive VACUUM. > > + * Consider if that's possible now. > > I don't agree with the "most practical purposes" bit. There's a huge > difference because manual VACUUMs end up aggressive but not anti-wrap once > older than vacuum_freeze_table_age. Okay. > > + * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want > > + * the rel_pages used by lazy_scan_prune, from before a possible relation > > + * truncation took place. (vacrel->rel_pages is now new_rel_pages.) > > + */ > > I think it should be doable to add an isolation test for this path. There have > been quite a few bugs around the wider topic... I would argue that we already have one -- vacuum-reltuples.spec. I had to update its expected output in the patch. I would argue that the behavioral change (count tuples on a pinned-by-cursor heap page) that necessitated updating the expected output for the test is an improvement overall. > > + { > > + /* Can safely advance relfrozen and relminmxid, too */ > > + Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages == > > + orig_rel_pages); > > + vac_update_relstats(rel, new_rel_pages, new_live_tuples, > > + new_rel_allvisible, vacrel->nindexes > 0, > > + FreezeLimit, MultiXactCutoff, false); > > + } > > I wonder if this whole logic wouldn't become easier and less fragile if we > just went for maintaining the "actually observed" horizon while scanning the > relation. If we skip a page via VM set the horizon to invalid. Otherwise we > can keep track of the accurate horizon and use that. No need to count pages > and stuff. There is no question that that makes sense as an optimization -- my prototype convinced me of that already. But I don't think that it can simplify anything (not even the call to vac_update_relstats itself, to actually update relfrozenxid at the end). Fundamentally, this will only work if we decide to only skip all-frozen pages, which (by definition) only happens within aggressive VACUUMs. Isn't it that simple? You recently said (on the heap-pruning-14-bug thread) that you don't think it would be practical to always set a page all-frozen when we see that we're going to set it all-visible -- apparently you feel that we could never opportunistically freeze early such that all-visible but not all-frozen pages practically cease to exist. I'm still not sure why you believe that (though you may be right, or I might have misunderstood, since it's complicated). 
It would certainly benefit this dynamic relfrozenxid business if it was possible, though. If we could somehow make that work, then almost every VACUUM would be able to advance relfrozenxid, independently of aggressive-ness -- because we wouldn't have any all-visible-but-not-all-frozen pages to skip (that important detail wouldn't be left to chance). > > - if (skipping_blocks && !FORCE_CHECK_PAGE()) > > + if (skipping_blocks && > > + !(blkno == nblocks - 1 && should_attempt_truncation(vacrel))) > > { > > /* > > * Tricky, tricky. If this is in aggressive vacuum, the page > > I find the FORCE_CHECK_PAGE macro decidedly unhelpful. But I don't like > mixing such changes within a larger change doing many other things. I got rid of FORCE_CHECK_PAGE() itself in this patch (not a later patch) because the patch also removes the only other FORCE_CHECK_PAGE() call -- and the latter change is very much in scope for the big patch (can't be broken down into smaller changes, I think). And so this felt natural to me. But if you prefer, I can break it out into a separate commit. > > @@ -1204,156 +1214,52 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive) > > > > buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, > > RBM_NORMAL, vacrel->bstrategy); > > + page = BufferGetPage(buf); > > + vacrel->scanned_pages++; > > I don't particularly like doing BufferGetPage() before holding a lock on the > page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd > be good to have a crosscheck that BufferGetPage() is only allowed when holding > a page level lock. I have occasionally wondered if the whole idea of reading heap pages with only a pin (and having cleanup locks in VACUUM) is really worth it -- alternative designs seem possible. Obviously that's a BIG discussion, and not one to have right now. But it seems kind of relevant. Since it is often legit to read a heap page without a buffer lock (only a pin), I can't see why BufferGetPage() without a buffer lock shouldn't also be okay -- if anything it seems safer. I think that I would agree with you if it wasn't for that inconsistency (which is rather a big "if", to be sure -- even for me). > > + /* Check for new or empty pages before lazy_scan_noprune call */ > > + if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true, > > + vmbuffer)) > > { > > - ReleaseBuffer(buf); > > - vacrel->pinskipped_pages++; > > + /* Lock and pin released for us */ > > + continue; > > + } > > Why isn't this done in lazy_scan_noprune()? No reason, really -- could be done that way (we'd then also give lazy_scan_prune the same treatment). I thought that it made a certain amount of sense to keep some of this in the main loop, but I can change it if you want. > > + if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup)) > > + { > > + /* No need to wait for cleanup lock for this page */ > > + UnlockReleaseBuffer(buf); > > + if (hastup) > > + vacrel->nonempty_pages = blkno + 1; > > continue; > > } > > Do we really need all of buf, blkno, page for both of these functions? Quite > possible that yes, if so, could we add an assertion that > BufferGetBockNumber(buf) == blkno? This just matches the existing lazy_scan_prune function (which doesn't mean all that much, since it was only added in Postgres 14). Will add the assertion to both. 
> > + /* Check for new or empty pages before lazy_scan_prune call */ > > + if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer)) > > { > > Maybe worth a note mentioning that we need to redo this even in the aggressive > case, because we didn't continually hold a lock on the page? Isn't that obvious? Either way it isn't the kind of thing that I'd try to optimize away. It's such a narrow issue. > > +/* > > + * Empty pages are not really a special case -- they're just heap pages that > > + * have no allocated tuples (including even LP_UNUSED items). You might > > + * wonder why we need to handle them here all the same. It's only necessary > > + * because of a rare corner-case involving a hard crash during heap relation > > + * extension. If we ever make relation-extension crash safe, then it should > > + * no longer be necessary to deal with empty pages here (or new pages, for > > + * that matter). > > I don't think it's actually that rare - the window for this is huge. I can just remove the comment, though it still makes sense to me. > I don't really see that as a realistic thing to ever reliably avoid, FWIW. I > think the overhead would be prohibitive. We'd need to do synchronous WAL > logging while holding the extension lock I think. Um, not fun. My long term goal for the FSM (the lease based design I talked about earlier this year) includes soft ownership of free space from preallocated pages by individual xacts -- the smgr layer itself becomes transactional and crash safe (at least to a limited degree). This includes bulk extension of relations, to make up for the new overhead implied by crash safe rel extension. I don't think that we should require VACUUM (or anything else) to be cool with random uninitialized pages -- to me that just seems backwards. We can't do true bulk extension right now (just an inferior version that doesn't give specific pages to specific backends) because the risk of losing a bunch of empty pages for way too long is not acceptable. But that doesn't seem fundamental to me -- that's one of the things we'd be fixing at the same time (through what I call soft ownership semantics). I think we'd come out ahead on performance, and *also* have a more robust approach to relation extension. > > + * Caller can either hold a buffer cleanup lock on the buffer, or a simple > > + * shared lock. > > + */ > > Kinda sounds like it'd be incorrect to call this with an exclusive lock, which > made me wonder why that could be true. Perhaps just say that it needs to be > called with at least a shared lock? Okay. > > +static bool > > +lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno, > > + Page page, bool sharelock, Buffer vmbuffer) > > It'd be good to document the return value - for me it's not a case where it's > so obvious that it's not worth it. Okay. > > +/* > > + * lazy_scan_noprune() -- lazy_scan_prune() variant without pruning > > + * > > + * Caller need only hold a pin and share lock on the buffer, unlike > > + * lazy_scan_prune, which requires a full cleanup lock. > > I'd add somethign like "returns whether a cleanup lock is required". Having to > read multiple paragraphs to understand the basic meaning of the return value > isn't great. Will fix. > > + if (ItemIdIsRedirected(itemid)) > > + { > > + *hastup = true; /* page won't be truncatable */ > > + continue; > > + } > > It's not really new, but this comment is now a bit confusing, because it can > be understood to be about PageTruncateLinePointerArray(). I didn't think of that. 
Will address it in the next version. > Why does it make sense to track DEAD tuples this way? Isn't that going to lead > to counting them over-and-over again? I think it's quite misleading to include > them in "dead bot not yet removable". Compared to what? Do we really want to invent a new kind of DEAD tuple (e.g., to report on), just to handle this rare case? I accept that this code is lying about the tuples being RECENTLY_DEAD, kind of. But isn't it still strictly closer to the truth, compared to HEAD? Counting it as RECENTLY_DEAD is far closer to the truth than not counting it at all. Note that we don't remember LP_DEAD items here, either (not here, in lazy_scan_noprune, and not in lazy_scan_prune on HEAD). Because we pretty much interpret LP_DEAD items as "future LP_UNUSED items" instead -- we make a soft assumption that we're going to go on to mark the same items LP_UNUSED during a second pass over the heap. My point is that there is no natural way to count "fully DEAD tuple that autovacuum didn't deal with" -- and so I picked RECENTLY_DEAD. > > + /* > > + * Now save details of the LP_DEAD items from the page in the dead_tuples > > + * array iff VACUUM uses two-pass strategy case > > + */ > > Do we really need to have separate code for this in lazy_scan_prune() and > lazy_scan_noprune()? There is hardly any repetition, though. > > + } > > + else > > + { > > + /* > > + * We opt to skip FSM processing for the page on the grounds that it > > + * is probably being modified by concurrent DML operations. Seems > > + * best to assume that the space is best left behind for future > > + * updates of existing tuples. This matches what opportunistic > > + * pruning does. > > Why can we assume that there concurrent DML rather than concurrent read-only > operations? IME it's much more common for read-only operations to block > cleanup locks than read-write ones (partially because the frequency makes it > easier, partially because cursors allow long-held pins, partially because the > EXCLUSIVE lock of a r/w operation wouldn't let us get here) I actually agree. It still probably isn't worth dealing with the FSM here, though. It's just too much mechanism for too little benefit in a very rare case. What do you think? -- Peter Geoghegan
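Putting the pieces of this exchange together, the per-item loop in lazy_scan_noprune has roughly the following shape (a condensed sketch based on the hunks and descriptions above; live-tuple accounting is omitted, the freeze check relies on the existing heap_tuple_needs_freeze() helper, and the local variable names are illustrative):

    OffsetNumber offnum,
                maxoff;
    OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
    int         lpdead_items = 0,
                new_dead_tuples = 0;
    bool        missed_freeze = false;

    maxoff = PageGetMaxOffsetNumber(page);
    for (offnum = FirstOffsetNumber;
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemid = PageGetItemId(page, offnum);
        HeapTupleData tuple;

        if (!ItemIdIsUsed(itemid))
            continue;

        if (ItemIdIsRedirected(itemid))
        {
            *hastup = true;     /* page has something, won't be truncated away */
            continue;
        }

        if (ItemIdIsDead(itemid))
        {
            /*
             * Optimization 1: remember LP_DEAD items left behind by earlier
             * opportunistic pruning, or by a VACUUM that bypassed index
             * vacuuming, even though this VACUUM cannot prune the page.
             */
            deadoffsets[lpdead_items++] = offnum;
            continue;
        }

        *hastup = true;
        ItemPointerSet(&(tuple.t_self), blkno, offnum);
        tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
        tuple.t_len = ItemIdGetLength(itemid);
        tuple.t_tableOid = RelationGetRelid(vacrel->rel);

        /*
         * Optimization 2: only give up on advancing relfrozenxid when the
         * page actually has tuples that need freezing under the current
         * cutoffs.
         */
        if (heap_tuple_needs_freeze(tuple.t_data, vacrel->FreezeLimit,
                                    vacrel->MultiXactCutoff, buf))
            missed_freeze = true;

        switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
        {
            case HEAPTUPLE_DEAD:
            case HEAPTUPLE_RECENTLY_DEAD:
                /* both counted as "dead, but not yet removable" here */
                new_dead_tuples++;
                break;
            default:
                break;
        }
    }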
Hi, On 2021-11-22 17:07:46 -0800, Peter Geoghegan wrote: > Sure, it wouldn't be okay to wait *indefinitely* for any one pin in a > non-aggressive VACUUM -- so "at least waiting for one or two pins > during non-aggressive VACUUM" might not have been the best way of > expressing the idea that I wanted to express. The important point is > that _we can make a choice_ about stuff like this dynamically, based > on the observed characteristics of the table, and some general ideas > about the costs and benefits (of waiting or not waiting, or of how > long we want to wait in total, whatever might be important). This > probably just means adding some heuristics that are pretty sensitive > to any reason to not do more work in a non-aggressive VACUUM, without > *completely* balking at doing even a tiny bit more work. > For example, we can definitely afford to wait a few more milliseconds > to get a cleanup lock just once We currently have no infrastructure to wait for an lwlock or pincount for a limited time. And at least for the former it'd not be easy to add. It may be worth adding that at some point, but I'm doubtful this is sufficient reason for nontrivial new infrastructure in very performance sensitive areas. > All of the autovacuums against the accounts table look similar to this > one -- you don't see anything about relfrozenxid being advanced > (because it isn't). Whereas for the smaller pgbench tables, every > single VACUUM successfully advances relfrozenxid to a fairly recent > XID (without there ever being an aggressive VACUUM) -- just because > VACUUM needs to visit every page for the smaller tables. While the > accounts table doesn't generally need to have 100% of all pages > touched by VACUUM -- it's more like 95% there. Does that really make > sense, though? Does what really make sense? > I'm pretty sure that less aggressive VACUUMing (e.g. higher > scale_factor setting) would lead to more aggressive setting of > relfrozenxid here. I'm always suspicious when I see insignificant > differences that lead to significant behavioral differences. Am I > worried over nothing here? Perhaps -- we don't really need to advance > relfrozenxid early with this table/workload anyway. But I'm not so > sure. I think pgbench_accounts is just a really poor showcase. Most importantly there are no even slightly longer running transactions that hold down the xid horizon. But in real workloads that's incredibly common IME. It's also quite uncommon in real workloads to have huge tables in which all records are updated. It's more common to have value ranges that are nearly static, and a more heavily changing range. I think the most interesting cases where using the "measured" horizon will be advantageous are anti-wrap vacuums. Those obviously have to happen for rarely modified tables, including completely static ones, too. Using the "measured" horizon will allow us to reduce the frequency of anti-wrap autovacuums on old tables, because we'll be able to set a much more recent relfrozenxid. This is becoming more common with the increased use of partitioning. > > The problem is that the > > autovacuum scheduling is way too naive for that to be a significant benefit - > > nothing tries to schedule autovacuums so that they have a chance to complete > > before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is > > to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum. 
> > Not sure what you mean about scheduling, since vacuum_freeze_table_age > is only in place to make overnight (off hours low activity scripted > VACUUMs) freeze tuples before any autovacuum worker gets the chance > (since the latter may run at a much less convenient time). Sure, > vacuum_freeze_table_age might also force a regular autovacuum worker > to do an aggressive VACUUM -- but I think it's mostly intended for a > manual overnight VACUUM. Not usually very helpful, but also not > harmful. > Oh, wait. I think that you're talking about how autovacuum workers in > particular tend to be affected by this. We launch an av worker that > wants to clean up bloat, but it ends up being aggressive (and maybe > taking way longer), perhaps quite randomly, only due to > vacuum_freeze_table_age (not due to autovacuum_freeze_max_age). Is > that it? No, not quite. We treat anti-wraparound vacuums as an emergency (including logging messages, not cancelling). But the only mechanism we have against anti-wrap vacuums happening is vacuum_freeze_table_age. But as you say, that's not really a "real" mechanism, because it requires an "independent" reason to vacuum a table. I've seen cases where anti-wraparound vacuums weren't a problem / never happened for important tables for a long time, because there always was an "independent" reason for autovacuum to start doing its thing before the table got to be autovacuum_freeze_max_age old. But at some point the important tables started to be big enough that autovacuum didn't schedule vacuums that got promoted to aggressive via vacuum_freeze_table_age before the anti-wrap vacuums. Then things started to burn, because of the unpaced anti-wrap vacuums clogging up all IO, or maybe it was the vacuums not cancelling - I don't quite remember the details. Behaviours that lead to a "sudden" falling over, rather than getting gradually worse, are bad - they somehow tend to happen on Friday evenings :). > > This is one of the most embarrassing issues around the whole anti-wrap > > topic. We kind of define it as an emergency that there's an anti-wraparound > > vacuum. But we have *absolutely no mechanism* to prevent them from occurring. > > What do you mean? Only an autovacuum worker can do an anti-wraparound > VACUUM (which is not quite the same thing as an aggressive VACUUM). Just that autovacuum should have a mechanism to trigger aggressive vacuums (i.e. ones that are guaranteed to be able to increase relfrozenxid unless cancelled) before getting to the "emergency"-ish anti-wraparound state. Or alternatively that we should have a separate threshold for the "harsher" anti-wraparound measures. > > > We now also collect LP_DEAD items in the dead_tuples array in the case > > > where we cannot immediately get a cleanup lock on the buffer. We cannot > > > prune without a cleanup lock, but opportunistic pruning may well have > > > left some LP_DEAD items behind in the past -- no reason to miss those. > > > > This has become *much* more important with the changes around deciding when to > > index vacuum. It's not just that opportunistic pruning could have left LP_DEAD > > items, it's that a previous vacuum is quite likely to have left them there, > > because the previous vacuum decided not to perform index cleanup. > > I haven't seen any evidence of that myself (with the optimization > added to Postgres 14 by commit 5100010ee4). I still don't understand > why you doubted that work so much. 
I'm not saying that you're wrong > to; I'm saying that I don't think that I understand your perspective > on it. I didn't (nor do) doubt that it can be useful - to the contrary, I think the unconditional index pass was a huge practical issue. I do however think that there are cases where it can cause trouble. The comment above wasn't meant as a criticism - just that it seems worth pointing out that one reason we might encounter a lot of LP_DEAD items is previous vacuums that didn't perform index cleanup. > What I have seen in my own tests (particularly with BenchmarkSQL) is > that most individual tables either never apply the optimization even > once (because the table reliably has heap pages with many more LP_DEAD > items than the 2%-of-relpages threshold), or will never need to > (because there are precisely zero LP_DEAD items anyway). Remaining > tables that *might* use the optimization tend to not go very long > without actually getting a round of index vacuuming. It's just too > easy for updates (and even aborted xact inserts) to introduce new > LP_DEAD items for us to go long without doing index vacuuming. I think workloads are a bit more varied than a realistic set of benchmarks that one person can run themselves. I gave you examples of cases that I see as likely being bitten by this, e.g. when the skipped index cleanup prevents IOS scans. When both the likely-to-be-modified and likely-to-be-queried value ranges are a small subset of the entire data, the 2% threshold can prevent vacuum from cleaning up LP_DEAD entries for a long time. Or when all index scans are bitmap index scans, and nothing ends up cleaning up the dead index entries in certain ranges, and even an explicit vacuum doesn't fix the issue. Even a relatively small rollback / non-HOT update rate can start to be really painful. > > > Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic > > > technique is independently capable of cleaning up line pointer bloat), > > > > One thing we could do around this, btw, would be to aggressively replace > > LP_REDIRECT items with their target item. We can't do that in all situations > > (somebody might be following a ctid chain), but I think we have all the > > information needed to do so. Probably would require a new HTSV RECENTLY_LIVE > > state or something like that. > > Another idea is to truncate the line pointer during pruning (including > opportunistic pruning). Matthias van de Meent has a patch for that. I'm a bit doubtful that's as important (which is not to say that it's not worth doing). For a heavily updated table the max space usage of the line pointer array just isn't as big a factor as ending up with only half the usable line pointers. > > > Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE, > > > since there is barely any real practical sense in which we actually > > > miss doing useful work for these pages. Besides, this information > > > always seemed to have little practical value, even to Postgres hackers. > > > > -0.5. I think it provides some value, and I don't see why the removal of the > > information should be tied to this change. It's hard to diagnose why some dead > > tuples aren't cleaned up - a common cause for that on smaller tables is that > > nearly all pages are pinned nearly all the time. > > Is that still true, though? If it turns out that we need to leave it > in, then I can do that. But I'd prefer to wait until we have more > information before making a final decision. 
Remember, the high level > idea of this whole patch is that we do as much work as possible for > any scanned_pages, which now includes pages that we never successfully > acquired a cleanup lock on. And so we're justified in assuming that > they're exactly equivalent to pages that we did get a cleanup on -- > that's now the working assumption. I know that that's not literally > true, but that doesn't mean it's not a useful fiction -- it should be > very close to the truth. IDK, it seems misleading to me. Small tables with a lot of churn - quite common - are highly reliant on LP_DEAD entries getting removed or the tiny table suddenly isn't so tiny anymore. And it's harder to diagnose why the cleanup isn't happening without knowledge that pages needing cleanup couldn't be cleaned up due to pins. If you want to improve the logic so that we only count pages that would have something to clean up, I'd be happy as well. It doesn't have to mean exactly what it means today. > > > + * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want > > > + * the rel_pages used by lazy_scan_prune, from before a possible relation > > > + * truncation took place. (vacrel->rel_pages is now new_rel_pages.) > > > + */ > > > > I think it should be doable to add an isolation test for this path. There have > > been quite a few bugs around the wider topic... > > I would argue that we already have one -- vacuum-reltuples.spec. I had > to update its expected output in the patch. I would argue that the > behavioral change (count tuples on a pinned-by-cursor heap page) that > necessitated updating the expected output for the test is an > improvement overall. I was thinking of truncations, which I don't think vacuum-reltuples.spec tests. > > > + { > > > + /* Can safely advance relfrozen and relminmxid, too */ > > > + Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages == > > > + orig_rel_pages); > > > + vac_update_relstats(rel, new_rel_pages, new_live_tuples, > > > + new_rel_allvisible, vacrel->nindexes > 0, > > > + FreezeLimit, MultiXactCutoff, false); > > > + } > > > > I wonder if this whole logic wouldn't become easier and less fragile if we > > just went for maintaining the "actually observed" horizon while scanning the > > relation. If we skip a page via VM set the horizon to invalid. Otherwise we > > can keep track of the accurate horizon and use that. No need to count pages > > and stuff. > > There is no question that that makes sense as an optimization -- my > prototype convinced me of that already. But I don't think that it can > simplify anything (not even the call to vac_update_relstats itself, to > actually update relfrozenxid at the end). Maybe. But we've had quite a few bugs because we ended up changing some detail of what is excluded in one of the counters, leading to wrong determination about whether we scanned everything or not. > Fundamentally, this will only work if we decide to only skip all-frozen > pages, which (by definition) only happens within aggressive VACUUMs. Hm? Or if there's just no runs of all-visible pages of sufficient length, so we don't end up skipping at all. > You recently said (on the heap-pruning-14-bug thread) that you don't > think it would be practical to always set a page all-frozen when we > see that we're going to set it all-visible -- apparently you feel that > we could never opportunistically freeze early such that all-visible > but not all-frozen pages practically cease to exist. 
I'm still not > sure why you believe that (though you may be right, or I might have > misunderstood, since it's complicated). Yes, I think it may not work out to do that. But it's not a very strongly held opinion. One reason for my doubt is the following: We can set all-visible on a page without a FPW image (well, as long as hint bits aren't logged). There's a significant difference between needing to WAL log FPIs for every heap page or not, and it's not that rare for data to live shorter than autovacuum_freeze_max_age or that limit never being reached. On a table with 40 million individually inserted rows, fully hintbitted via reads, I see a first VACUUM taking 1.6s and generating 11MB of WAL. A subsequent VACUUM FREEZE takes 5s and generates 500MB of WAL. That's quite a large multiplier... If we ever managed to not have a per-page all-visible flag this'd get even more extreme, because we'd then not even need to dirty the page for insert-only pages. But if we want to freeze, we'd need to (unless we just got rid of freezing). > It would certainly benefit this dynamic relfrozenxid business if it was > possible, though. If we could somehow make that work, then almost every > VACUUM would be able to advance relfrozenxid, independently of > aggressive-ness -- because we wouldn't have any > all-visible-but-not-all-frozen pages to skip (that important detail wouldn't > be left to chance). Perhaps we can have most of the benefit even without that. If we were to freeze whenever it didn't cause an additional FPWing, and perhaps didn't skip all-visible but not all-frozen pages if they were less than x% of the to-be-scanned data, we should be able to still increase relfrozenxid in a lot of cases? > > I don't particularly like doing BufferGetPage() before holding a lock on the > > page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd > > be good to have a crosscheck that BufferGetPage() is only allowed when holding > > a page level lock. > > I have occasionally wondered if the whole idea of reading heap pages > with only a pin (and having cleanup locks in VACUUM) is really worth > it -- alternative designs seem possible. Obviously that's a BIG > discussion, and not one to have right now. But it seems kind of > relevant. With 'reading' do you mean reads-from-os, or just references to buffer contents? > Since it is often legit to read a heap page without a buffer lock > (only a pin), I can't see why BufferGetPage() without a buffer lock > shouldn't also be okay -- if anything it seems safer. I think that I > would agree with you if it wasn't for that inconsistency (which is > rather a big "if", to be sure -- even for me). At least for heap it's rarely legit to read buffer contents via BufferGetPage() without a lock. It's legit to read data at already-determined offsets, but you can't look at much other than the tuple contents. > > Why does it make sense to track DEAD tuples this way? Isn't that going to lead > > to counting them over-and-over again? I think it's quite misleading to include > > them in "dead but not yet removable". > > Compared to what? Do we really want to invent a new kind of DEAD tuple > (e.g., to report on), just to handle this rare case? When looking at logs I use the "tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n" line to see whether the user is likely to have issues around an old transaction / slot / prepared xact preventing cleanup. 
If new_dead_tuples no longer identifies those cases, that line is no longer reliable. > I accept that this code is lying about the tuples being RECENTLY_DEAD, > kind of. But isn't it still strictly closer to the truth, compared to > HEAD? Counting it as RECENTLY_DEAD is far closer to the truth than not > counting it at all. I don't see how it's closer at all. There's imo a significant difference between not being able to remove tuples because of the xmin horizon, and not being able to remove them because we couldn't get a cleanup lock. Greetings, Andres Freund
On Mon, Nov 22, 2021 at 9:49 PM Andres Freund <andres@anarazel.de> wrote: > > For example, we can definitely afford to wait a few more milliseconds > > to get a cleanup lock just once > > We currently have no infrastructure to wait for an lwlock or pincount for a > limited time. And at least for the former it'd not be easy to add. It may be > worth adding that at some point, but I'm doubtful this is sufficient reason > for nontrivial new infrastructure in very performance sensitive areas. It was a hypothetical example. To be more practical about it: it seems likely that we won't really benefit from waiting some amount of time (not forever) for a cleanup lock in non-aggressive VACUUM, once we have some of the relfrozenxid stuff we've talked about in place. In a world where we're smarter about advancing relfrozenxid in non-aggressive VACUUMs, the choice between waiting for a cleanup lock, and not waiting (but also not advancing relfrozenxid at all) matters less -- it's no longer a binary choice. It's no longer a binary choice because we will have done away with the current rigid way in which our new relfrozenxid for the relation is either FreezeLimit, or nothing at all. So far we've only talked about the case where we can update relfrozenxid with a value that happens to be much newer than FreezeLimit. If we can do that, that's great. But what about setting relfrozenxid to an *older* value than FreezeLimit instead (in a non-aggressive VACUUM)? That's also pretty good! There is still a decent chance that the final "suboptimal" relfrozenxid that we determine can be safely set in pg_class at the end of our VACUUM will still be far more recent than the preexisting relfrozenxid. Especially with larger tables. Advancing relfrozenxid should be thought of as a totally independent thing to freezing tuples, at least in vacuumlazy.c itself. That's kinda the case today, even, but *explicitly* decoupling advancing relfrozenxid from actually freezing tuples seems like a good high level goal for this project. Remember, FreezeLimit is derived from vacuum_freeze_min_age in the obvious way: OldestXmin for the VACUUM, minus vacuum_freeze_min_age GUC/reloption setting. I'm pretty sure that this means that making autovacuum freeze tuples more aggressively (by reducing vacuum_freeze_min_age) could have the perverse effect of making non-aggressive VACUUMs less likely to advance relfrozenxid -- which is exactly backwards. This effect could easily be missed, even by expert users, since there is no convenient instrumentation that shows how and when relfrozenxid is advanced. > > All of the autovacuums against the accounts table look similar to this > > one -- you don't see anything about relfrozenxid being advanced > > (because it isn't). >> Does that really make > > sense, though? > > Does what make really sense? Well, my accounts table example wasn't a particularly good one (it was a conveniently available example). I am now sure that you got the point I was trying to make here already, based on what you go on to say about non-aggressive VACUUMs optionally *not* skipping all-visible-not-all-frozen heap pages in the hopes of advancing relfrozenxid earlier (more on that idea below, in my response). On reflection, the simplest way of expressing the same idea is what I just said about decoupling (decoupling advancing relfrozenxid from freezing). > I think pgbench_accounts is just a really poor showcase. Most importantly > there's no even slightly longer running transactions that hold down the xid > horizon. 
But in real workloads that's incredibly common IME. It's also quite > uncommon in real workloads to have huge tables in which all records are > updated. It's more common to have value ranges that are nearly static, and a > more heavily changing range. I agree. > I think the most interesting cases where using the "measured" horizon will be > advantageous are anti-wrap vacuums. Those obviously have to happen for rarely > modified tables, including completely static ones, too. Using the "measured" > horizon will allow us to reduce the frequency of anti-wrap autovacuums on old > tables, because we'll be able to set a much more recent relfrozenxid. That's probably true in practice -- but who knows these days, with the autovacuum_vacuum_insert_scale_factor stuff? Either way I see no reason to emphasize that case in the design itself. The "decoupling" concept now seems like the key design-level concept -- everything else follows naturally from that. > This is becoming more common with the increased use of partitioning. Also with bulk loading. There could easily be a tiny number of distinct XIDs that are close together in time, for many many rows -- practically one XID, or even exactly one XID. > No, not quite. We treat anti-wraparound vacuums as an emergency (including > logging messages, not cancelling). But the only mechanism we have against > anti-wrap vacuums happening is vacuum_freeze_table_age. But as you say, that's > not really a "real" mechanism, because it requires an "independent" reason to > vacuum a table. Got it. > I've seen cases where anti-wraparound vacuums weren't a problem / never > happened for important tables for a long time, because there always was an > "independent" reason for autovacuum to start doing its thing before the table > got to be autovacuum_freeze_max_age old. But at some point the important > tables started to be big enough that autovacuum didn't schedule vacuums that > got promoted to aggressive via vacuum_freeze_table_age before the anti-wrap > vacuums. Right. Not just because they were big; also because autovacuum runs at geometric intervals -- the final reltuples from last time is used to determine the point at which av runs this time. This might make sense, or it might not make any sense -- it all depends (mostly on index stuff). > Then things started to burn, because of the unpaced anti-wrap vacuums > clogging up all IO, or maybe it was the vacuums not cancelling - I don't quite > remember the details. Non-cancelling anti-wraparound VACUUMs that (all of a sudden) cause chaos because they interact badly with automated DDL are something I've seen several times -- I'm sure you have too. That was what the Manta/Joyent blogpost I referenced upthread went into. > Behaviours that lead to a "sudden" falling over, rather than getting gradually > worse, are bad - they somehow tend to happen on Friday evenings :). These are among our most important challenges IMV. > Just that autovacuum should have a mechanism to trigger aggressive vacuums > (i.e. ones that are guaranteed to be able to increase relfrozenxid unless > cancelled) before getting to the "emergency"-ish anti-wraparound state. Maybe, but that runs into the problem of needing another GUC that nobody will ever be able to remember the name of. I consider the idea of adding a variety of measures that make non-aggressive VACUUM much more likely to advance relfrozenxid in practice to be far more promising. > Or alternatively that we should have a separate threshold for the "harsher" > anti-wraparound measures. 
Or maybe just raise the default of autovacuum_freeze_max_age, which many people don't change? That might be a lot safer than it once was. Or will be, once we manage to teach VACUUM to advance relfrozenxid more often in non-aggressive VACUUMs on Postgres 15. Imagine a world in which we have that stuff in place, as well as related enhancements added in earlier releases: autovacuum_vacuum_insert_scale_factor, the freezemap, and the wraparound failsafe. These add up to a lot; with all of that in place, the risk we'd be introducing by increasing the default value of autovacuum_freeze_max_age would be *far* lower than the risk of making the same change back in 2006. I bring up 2006 because it was the year that commit 48188e1621 added autovacuum_freeze_max_age -- the default hasn't changed since that time. > I think workloads are a bit more varied than a realistic set of benchmarks > that one person can run themselves. No question. I absolutely accept that I only have to miss one important detail with something like this -- that just goes with the territory. Just saying that I have yet to see any evidence that the bypass-indexes behavior really hurt anything. I do take the idea that I might have missed something very seriously, despite all this. > I gave you examples of cases that I see as likely being bitten by this, > e.g. when the skipped index cleanup prevents IOS scans. When both the > likely-to-be-modified and likely-to-be-queried value ranges are a small subset > of the entire data, the 2% threshold can prevent vacuum from cleaning up > LP_DEAD entries for a long time. Or when all index scans are bitmap index > scans, and nothing ends up cleaning up the dead index entries in certain > ranges, and even an explicit vacuum doesn't fix the issue. Even a relatively > small rollback / non-HOT update rate can start to be really painful. That does seem possible. But I consider it very unlikely to appear as a regression caused by the bypass mechanism itself -- not in any way that was consistent over time. As far as I can tell, autovacuum scheduling just doesn't operate at that level of precision, and never has. I have personally observed that ANALYZE does a very bad job at noticing LP_DEAD items in tables/workloads where LP_DEAD items (not DEAD tuples) tend to concentrate [1]. The whole idea that ANALYZE should count these items as if they were normal tuples seems pretty bad to me. Put it this way: imagine you run into trouble with the bypass thing, and then you opt to disable it on that table (using the INDEX_CLEANUP reloption). Why should this step solve the problem on its own? In order for that to work, VACUUM would have to know to be very aggressive about these LP_DEAD items. But there is good reason to believe that it just won't ever notice them, as long as ANALYZE is expected to provide reliable statistics that drive autovacuum -- they're just too concentrated for the block-based approach to truly work. I'm not minimizing the risk. Just telling you my thoughts on this. > I'm a bit doubtful that's as important (which is not to say that it's not > worth doing). For a heavily updated table the max space usage of the line > pointer array just isn't as big a factor as ending up with only half the > usable line pointers. Agreed; by far the best chance we have of improving the line pointer bloat situation is preventing it in the first place, by increasing MaxHeapTuplesPerPage. 
Once we actually do that, our remaining options are going to be much less helpful -- then it really is mostly just up to VACUUM. > And it's harder to diagnose why the > cleanup isn't happening without knowledge that pages needing cleanup couldn't > be cleaned up due to pins. > > If you want to improve the logic so that we only count pages that would have > something to clean up, I'd be happy as well. It doesn't have to mean exactly > what it means today. It seems like what you really care about here are remaining cases where our inability to acquire a cleanup lock has real consequences -- you want to hear about it when it happens, however unlikely it may be. In other words, you want to keep something in log_autovacuum_* that indicates that "less than the expected amount of work was completed" due to an inability to acquire a cleanup lock. And so for you, this is a question of keeping instrumentation that might still be useful, not a question of how we define things fundamentally, at the design level. Sound right? If so, then this proposal might be acceptable to you: * Remaining DEAD tuples with storage (though not LP_DEAD items from previous opportunistic pruning) will get counted separately in the lazy_scan_noprune (no cleanup lock) path. Also count the total number of distinct pages that were found to contain one or more such DEAD tuples. * These two new counters will be reported on their own line in the log output, though only in the cases where we actually have any such tuples -- which will presumably be much rarer than simply failing to get a cleanup lock (that's now no big deal at all, because we now consistently do certain cleanup steps, and because FreezeLimit isn't the only viable thing that we can set relfrozenxid to, at least in the non-aggressive case). * There is still a limited sense in which the same items get counted as RECENTLY_DEAD -- though just those aspects that make the overall design simpler. So the helpful aspects of this are still preserved. We only need to tell pgstat_report_vacuum() that these items are "deadtuples" (remaining dead tuples). That can work by having its caller add a new int64 counter (same new tuple-based counter used for the new log line) to vacrel->new_dead_tuples. We'd also add the same new tuple counter in about the same way at the point where we determine a final vacrel->new_rel_tuples. So we wouldn't really be treating anything as RECENTLY_DEAD anymore -- pgstat_report_vacuum() and vacrel->new_dead_tuples don't specifically expect anything about RECENTLY_DEAD-ness already. > I was thinking of truncations, which I don't think vacuum-reltuples.spec > tests. Got it. I'll look into that for v2. > Maybe. But we've had quite a few bugs because we ended up changing some detail > of what is excluded in one of the counters, leading to wrong determination > about whether we scanned everything or not. Right. But let me just point out that my whole approach is to make that impossible, by not needing to count pages, except in scanned_pages (and in frozenskipped_pages + rel_pages). The processing performed for any page that we actually read during VACUUM should be uniform (or practically uniform), by definition. With minimal fudging in the cleanup lock case (because we mostly do the same work there too). There should be no reason for any more page counters now, except for non-critical instrumentation. 
For example, if you want to get the total number of pages skipped via the visibility map (not just all-frozen pages), then you simply subtract scanned_pages from rel_pages. > > Fundamentally, this will only work if we decide to only skip all-frozen > > pages, which (by definition) only happens within aggressive VACUUMs. > > Hm? Or if there's just no runs of all-visible pages of sufficient length, so > we don't end up skipping at all. Of course. But my point was: who knows when that'll happen? > On reason for my doubt is the following: > > We can set all-visible on a page without a FPW image (well, as long as hint > bits aren't logged). There's a significant difference between needing to WAL > log FPIs for every heap page or not, and it's not that rare for data to live > shorter than autovacuum_freeze_max_age or that limit never being reached. This sounds like an objection to one specific heuristic, and not an objection to the general idea. The only essential part is "opportunistic freezing during vacuum, when the cost is clearly very low, and the benefit is probably high". And so it now seems you were making a far more limited statement than I first believed. Obviously many variations are possible -- there is a spectrum. Example: a heuristic that makes VACUUM notice when it is going to freeze at least one tuple on a page, iff the page will be marked all-visible in any case -- we should instead freeze every tuple on the page, and mark the page all-frozen, batching work (could account for LP_DEAD items here too, not counting them on the assumption that they'll become LP_UNUSED during the second heap pass later on). If we see these conditions, then the likely explanation is that the tuples on the heap page happen to have XIDs that are "split" by the not-actually-important FreezeLimit cutoff, despite being essentially similar in any way that matters. If you want to make the same heuristic more conservative: only do this when no existing tuples are frozen, since that could be taken as a sign of the original heuristic not quite working on the same heap page at an earlier stage. I suspect that even very conservative versions of the same basic idea would still help a lot. > Perhaps we can have most of the benefit even without that. If we were to > freeze whenever it didn't cause an additional FPWing, and perhaps didn't skip > all-visible but not !all-frozen pages if they were less than x% of the > to-be-scanned data, we should be able to to still increase relfrozenxid in a > lot of cases? I bet that's true. I like that idea. If we had this policy, then the number of "extra" visited-in-non-aggressive-vacuum pages (all-visible but not yet all-frozen pages) could be managed over time through more opportunistic freezing. This might make it work even better. These all-visible (but not all-frozen) heap pages could be considered "tenured", since they have survived at least one full VACUUM cycle without being unset. So why not also freeze them based on the assumption that they'll probably stay that way forever? There won't be so many of the pages when we do this anyway, by definition -- since we'd have a heuristic that limited the total number (say to no more than 10% of the total relation size, something like that). We're smoothing out the work that currently takes place all together during an aggressive VACUUM this way. 
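To make that heuristic concrete, here is a minimal standalone C sketch -- hypothetical names and parameters, not the patch's actual lazy_scan_prune code:

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of the "batch the freezing" heuristic: if the page is about to be
 * set all-visible anyway, and at least one tuple on it must be frozen in any
 * case, freeze every tuple and set the page all-frozen instead. The
 * "conservative" flag is the stricter variant: back off when some tuples are
 * already frozen, taking that as a sign the heuristic fired here before.
 */
static bool
freeze_whole_page(bool will_set_all_visible,
                  int ntuples_needing_freeze,
                  int ntuples_already_frozen,
                  bool conservative)
{
    if (!will_set_all_visible)
        return false;           /* only batch work we would start anyway */
    if (ntuples_needing_freeze == 0)
        return false;           /* nothing forces freezing on this page yet */
    if (conservative && ntuples_already_frozen > 0)
        return false;           /* stricter variant backs off here */
    return true;
}

int
main(void)
{
    /* one tuple crosses FreezeLimit on a soon-to-be-all-visible page */
    printf("%d\n", freeze_whole_page(true, 1, 0, true));   /* prints 1 */
    /* all-visible page with nothing that must be frozen yet */
    printf("%d\n", freeze_whole_page(true, 0, 0, true));   /* prints 0 */
    return 0;
}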
Moreover, there is perhaps a good chance that the total number of all-visible-not all-frozen heap pages will *stay* low over time, as a result of this policy actually working -- there may be a virtuous cycle that totally prevents us from getting an aggressive VACUUM even once. > > I have occasionally wondered if the whole idea of reading heap pages > > with only a pin (and having cleanup locks in VACUUM) is really worth > > it -- alternative designs seem possible. Obviously that's a BIG > > discussion, and not one to have right now. But it seems kind of > > relevant. > > With 'reading' do you mean reads-from-os, or just references to buffer > contents? The latter. [1] https://postgr.es/m/CAH2-Wz=9R83wcwZcPUH4FVPeDM4znzbzMvp3rt21+XhQWMU8+g@mail.gmail.com -- Peter Geoghegan
Hi, On 2021-11-23 17:01:20 -0800, Peter Geoghegan wrote: > > One reason for my doubt is the following: > > > > We can set all-visible on a page without a FPW image (well, as long as hint > > bits aren't logged). There's a significant difference between needing to WAL > > log FPIs for every heap page or not, and it's not that rare for data to live > > shorter than autovacuum_freeze_max_age or that limit never being reached. > > This sounds like an objection to one specific heuristic, and not an > objection to the general idea. I understood you to propose that we do not have separate frozen and all-visible states. Which I think will be problematic, because of scenarios like the above. > The only essential part is "opportunistic freezing during vacuum, when the > cost is clearly very low, and the benefit is probably high". And so it now > seems you were making a far more limited statement than I first believed. I'm on board with freezing when we already dirty out the page, and when doing so doesn't cause an additional FPI. And I don't think I've argued against that in the past. > These all-visible (but not all-frozen) heap pages could be considered > "tenured", since they have survived at least one full VACUUM cycle > without being unset. So why not also freeze them based on the > assumption that they'll probably stay that way forever? Because it's a potentially massive increase in write volume? E.g. if you have an insert-only workload, and you discard old data by dropping old partitions, this will often add yet another rewrite, despite your data likely never getting old enough to need to be frozen. Given that we often immediately need to start another vacuum just when one finished, because the vacuum took long enough to reach thresholds of vacuuming again, I don't think the (auto-)vacuum count is a good proxy. Maybe you meant this as a more limited concept, i.e. only doing so when the percentage of all-visible but not all-frozen pages is small? We could perhaps do better if we had information about the system-wide rate of xid throughput and how often / how long past vacuums of a table took. Greetings, Andres Freund
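Stated as code, the narrower policy endorsed above amounts to a single cheap test. This is a hypothetical standalone sketch, not PostgreSQL's actual WAL or buffer machinery; in the real system the FPI question would be answered by the WAL insertion code rather than a pre-computed flag:

#include <stdbool.h>
#include <stdio.h>

/*
 * Opportunistically freeze a page's tuples only when pruning has already
 * dirtied the page and freezing would not force an extra full-page image
 * into WAL -- i.e. only when the marginal cost of freezing is near zero.
 */
static bool
opportunistic_freeze_is_cheap(bool page_already_dirtied,
                              bool freeze_forces_new_fpi)
{
    return page_already_dirtied && !freeze_forces_new_fpi;
}

int
main(void)
{
    printf("%d\n", opportunistic_freeze_is_cheap(true, false));    /* 1: freeze */
    printf("%d\n", opportunistic_freeze_is_cheap(false, true));    /* 0: skip */
    return 0;
}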
On Tue, Nov 23, 2021 at 5:01 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Behaviours that lead to a "sudden" falling over, rather than getting gradually > > worse, are bad - they somehow tend to happen on Friday evenings :). > > These are among our most important challenges IMV. I haven't had time to work through any of your feedback just yet -- though it's certainly a priority for me. I won't get to it until I return home from PGConf NYC next week. Even still, here is a rebased v2, just to fix the bitrot. This is just a courtesy to anybody interested in the patch. -- Peter Geoghegan
Attachment
On Tue, Nov 30, 2021 at 11:52 AM Peter Geoghegan <pg@bowt.ie> wrote: > I haven't had time to work through any of your feedback just yet -- > though it's certainly a priority for. I won't get to it until I return > home from PGConf NYC next week. Attached is v3, which works through most of your (Andres') feedback. Changes in v3: * While the first patch still gets rid of the "pinskipped_pages" instrumentation, the second patch adds back a replacement that's better targeted: it tracks and reports "missed_dead_tuples". This means that log output will show the number of fully DEAD tuples with storage that could not be pruned away due to the fact that that would have required waiting for a cleanup lock. But we *don't* generally report the number of pages that we couldn't get a cleanup lock on, because that in itself doesn't mean that we skipped any useful work (which is very much the point of all of the refactoring in the first patch). * We now have FSM processing in the lazy_scan_noprune case, which more or less matches the standard lazy_scan_prune case. * Many small tweaks, based on suggestions from Andres, and other things that I noticed. * Further simplification of the "consider skipping pages using visibility map" logic -- now we always don't skip the last block in the relation, without calling should_attempt_truncation() to make sure we have a reason. Note that this means that we'll always read the final page during VACUUM, even when doing so is provably unhelpful. I'd prefer to keep the code that deals with skipping pages using the visibility map as simple as possible. There isn't much downside to always doing that once my refactoring is in place: there is no risk that we'll wait for a cleanup lock (on the final page in the rel) for no good reason. We're only wasting one page access, at most. (I'm not 100% sure that this is the right trade-off, actually, but it's at least worth considering.) Not included in v3: * Still haven't added the isolation test for rel truncation, though it's on my TODO list. * I'm still working on the optimization that we discussed on this thread: the optimization that allows the final relfrozenxid (that we set in pg_class) to be determined dynamically, based on the actual XIDs we observed in the table (we don't just naively use FreezeLimit). I'm not ready to post that today, but it shouldn't take too much longer to be good enough to review. Thanks -- Peter Geoghegan
Attachment
On Fri, Dec 10, 2021 at 1:48 PM Peter Geoghegan <pg@bowt.ie> wrote: > * I'm still working on the optimization that we discussed on this > thread: the optimization that allows the final relfrozenxid (that we > set in pg_class) to be determined dynamically, based on the actual > XIDs we observed in the table (we don't just naively use FreezeLimit). Attached is v4 of the patch series, which now includes this optimization, broken out into its own patch. In addition, it includes a prototype of opportunistic freezing. My emphasis here has been on making non-aggressive VACUUMs *always* advance relfrozenxid, outside of certain obvious edge cases. And so with all the patches applied, up to and including the opportunistic freezing patch, every autovacuum of every table manages to advance relfrozenxid during benchmarking -- usually to a fairly recent value. I've focussed on making aggressive VACUUMs (especially anti-wraparound autovacuums) a rare occurrence, for truly exceptional cases (e.g., user keeps canceling autovacuums, maybe due to automated script that performs DDL). That has taken priority over other goals, for now. There is a kind of virtuous circle here, where successive non-aggressive autovacuums never fall behind on freezing, and so never fail to advance relfrozenxid (there are never any all_visible-but-not-all_frozen pages, and we can cope with not acquiring a cleanup lock quite well). When VACUUM chooses to freeze a tuple opportunistically, the frozen XIDs naturally cannot hold back the final safe relfrozenxid for the relation. Opportunistic freezing avoids setting all_visible (without setting all_frozen) in the visibility map. It's impossible for VACUUM to just set a page to all_visible now, which seems like an essential part of making a decent amount of relfrozenxid advancement take place in almost every VACUUM operation. Here is an example of what I'm calling a virtuous circle -- all pgbench_history autovacuums look like this with the patch applied:

LOG: automatic vacuum of table "regression.public.pgbench_history": index scans: 0
pages: 0 removed, 35503 remain, 31930 skipped using visibility map (89.94% of total)
tuples: 0 removed, 5568687 remain (547976 newly frozen), 0 are dead but not yet removable
removal cutoff: oldest xmin was 5570281, which is now 1177 xact IDs behind
relfrozenxid: advanced by 546618 xact IDs, new value: 5565226
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
I/O timings: read: 0.003 ms, write: 0.000 ms
avg read rate: 0.068 MB/s, avg write rate: 0.068 MB/s
buffer usage: 7169 hits, 1 misses, 1 dirtied
WAL usage: 7043 records, 1 full page images, 6974928 bytes
system usage: CPU: user: 0.10 s, system: 0.00 s, elapsed: 0.11 s

Note that relfrozenxid is almost the same as oldest xmin here. Note also that the log output shows the number of tuples newly frozen. I see the same general trends with *every* pgbench_history autovacuum. Actually, with every autovacuum. The history table tends to have ultra-recent relfrozenxid values, which isn't always what we see, but that difference may not matter. As far as I can tell, we can expect practically every table to have a relfrozenxid that would (at least traditionally) be considered very safe/recent. Barring weird application issues that make it totally impossible to advance relfrozenxid (e.g., idle cursors that hold onto a buffer pin forever), it seems as if relfrozenxid will now steadily march forward. 
Sure, relfrozenxid advancement might be held by the occasional inability to acquire a cleanup lock, but the effect isn't noticeable over time; what are the chances that a cleanup lock won't be available on the same page (with the same old XID) more than once or twice? The odds of that happening become astronomically tiny, long before there is any real danger (barring pathological cases). In the past, we've always talked about opportunistic freezing as a way of avoiding re-dirtying heap pages during successive VACUUM operations -- especially as a way of lowering the total volume of WAL. While I agree that that's important, I have deliberately ignored it for now, preferring to focus on the relfrozenxid stuff, and smoothing out the cost of freezing (avoiding big shocks from aggressive/anti-wraparound autovacuums). I care more about stable performance than absolute throughput, but even still I believe that the approach I've taken to opportunistic freezing is probably too aggressive. But it's dead simple, which will make it easier to understand and discuss the issue of central importance. It may be possible to optimize the WAL-logging used during freezing, getting the cost down to the point where freezing early just isn't a concern. The current prototype adds extra WAL overhead, to be sure, but even that's not wildly unreasonable (you make some of it back on FPIs, depending on the workload -- especially with tables like pgbench_history, where delaying freezing is a total loss). -- Peter Geoghegan
Attachment
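The "determined dynamically" relfrozenxid described above comes down to maintaining a running minimum of the XIDs that remain unfrozen while the table is scanned. Here is a rough standalone sketch of that bookkeeping -- hypothetical names rather than the patch's actual code, with the example numbers borrowed from the pgbench_history log output above:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* modulo-2^32 ordering, in the style of TransactionIdPrecedes() */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

typedef struct FrozenXidTracker
{
    TransactionId oldest_unfrozen;  /* running minimum, starts at OldestXmin */
    bool        valid;              /* false once a not-all-frozen page is skipped */
} FrozenXidTracker;

/* called for every XID that VACUUM decides to leave unfrozen */
static void
observe_unfrozen_xid(FrozenXidTracker *t, TransactionId xid)
{
    if (xid_precedes(xid, t->oldest_unfrozen))
        t->oldest_unfrozen = xid;
}

int
main(void)
{
    /* "oldest xmin was 5570281" in the log output above */
    FrozenXidTracker t = { .oldest_unfrozen = 5570281, .valid = true };

    observe_unfrozen_xid(&t, 5565226);
    observe_unfrozen_xid(&t, 5569000);

    if (t.valid)                        /* every page scanned or all-frozen */
        printf("candidate relfrozenxid: %u\n", (unsigned) t.oldest_unfrozen);
    return 0;
}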
On Thu, Dec 16, 2021 at 5:27 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Dec 10, 2021 at 1:48 PM Peter Geoghegan <pg@bowt.ie> wrote: > > * I'm still working on the optimization that we discussed on this > > thread: the optimization that allows the final relfrozenxid (that we > > set in pg_class) to be determined dynamically, based on the actual > > XIDs we observed in the table (we don't just naively use FreezeLimit). > > Attached is v4 of the patch series, which now includes this > optimization, broken out into its own patch. In addition, it includes > a prototype of opportunistic freezing. > > My emphasis here has been on making non-aggressive VACUUMs *always* > advance relfrozenxid, outside of certain obvious edge cases. And so > with all the patches applied, up to and including the opportunistic > freezing patch, every autovacuum of every table manages to advance > relfrozenxid during benchmarking -- usually to a fairly recent value. > I've focussed on making aggressive VACUUMs (especially anti-wraparound > autovacuums) a rare occurrence, for truly exceptional cases (e.g., > user keeps canceling autovacuums, maybe due to automated script that > performs DDL). That has taken priority over other goals, for now. Great! I've looked at the 0001 patch and here are some comments:

@@ -535,8 +540,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
                                             xidFullScanLimit);
     aggressive |= MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
                                               mxactFullScanLimit);
+    skipwithvm = true;
     if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
+    {
+        /*
+         * Force aggressive mode, and disable skipping blocks using the
+         * visibility map (even those set all-frozen)
+         */
         aggressive = true;
+        skipwithvm = false;
+    }

     vacrel = (LVRelState *) palloc0(sizeof(LVRelState));
@@ -544,6 +557,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
     vacrel->rel = rel;
     vac_open_indexes(vacrel->rel, RowExclusiveLock, &vacrel->nindexes,
                      &vacrel->indrels);
+    vacrel->aggressive = aggressive;
     vacrel->failsafe_active = false;
     vacrel->consider_bypass_optimization = true;

How about adding skipwithvm to LVRelState too?

---

         /*
-         * The current block is potentially skippable; if we've seen a
-         * long enough run of skippable blocks to justify skipping it, and
-         * we're not forced to check it, then go ahead and skip.
-         * Otherwise, the page must be at least all-visible if not
-         * all-frozen, so we can set all_visible_according_to_vm = true.
+         * The current page can be skipped if we've seen a long enough run
+         * of skippable blocks to justify skipping it -- provided it's not
+         * the last page in the relation (according to rel_pages/nblocks).
+         *
+         * We always scan the table's last page to determine whether it
+         * has tuples or not, even if it would otherwise be skipped
+         * (unless we're skipping every single page in the relation). This
+         * avoids having lazy_truncate_heap() take access-exclusive lock
+         * on the table to attempt a truncation that just fails
+         * immediately because there are tuples on the last page.
          */
-        if (skipping_blocks && !FORCE_CHECK_PAGE())
+        if (skipping_blocks && blkno < nblocks - 1)

Why do we always need to scan the last page even if heap truncation is disabled (or in the failsafe mode)? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Thu, Dec 16, 2021 at 10:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > My emphasis here has been on making non-aggressive VACUUMs *always* > > advance relfrozenxid, outside of certain obvious edge cases. And so > > with all the patches applied, up to and including the opportunistic > > freezing patch, every autovacuum of every table manages to advance > > relfrozenxid during benchmarking -- usually to a fairly recent value. > > I've focussed on making aggressive VACUUMs (especially anti-wraparound > > autovacuums) a rare occurrence, for truly exceptional cases (e.g., > > user keeps canceling autovacuums, maybe due to automated script that > > performs DDL). That has taken priority over other goals, for now. > > Great! Maybe this is a good time to revisit basic questions about VACUUM. I wonder if we can get rid of some of the GUCs for VACUUM now. Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for physical blocks, but we use logical units (XIDs). We probably shouldn't be using any units, but using XIDs "feels wrong" to me. Even with my patch, it is theoretically possible that we won't be able to advance relfrozenxid very much, because we cannot get a cleanup lock on one single heap page with one old XID. But even in this extreme case, how relevant is the "age" of this old XID, really? What really matters is whether or not we can advance relfrozenxid in time (with time to spare). And so the wraparound risk of the system is not affected all that much by the age of the single oldest XID. The risk mostly comes from how much total work we still need to do to advance relfrozenxid. If the single old XID is quite old indeed (~1.5 billion XIDs), but there is only one, then we just have to freeze one tuple to be able to safely advance relfrozenxid (maybe advance it by a huge amount!). How long can it take to freeze one tuple, with the freeze map, etc? On the other hand, the risk may be far greater if we have *many* tuples that are still unfrozen, whose XIDs are only "middle aged" right now. The idea behind vacuum_freeze_min_age seems to be to be lazy about work (tuple freezing) in the hope that we'll never need to do it, but that seems obsolete now. (It probably made a little more sense before the visibility map.) Using XIDs makes sense for things like autovacuum_freeze_max_age, because there we have to worry about wraparound and relfrozenxid (whether or not we like it). But with this patch, and with everything else (the failsafe, insert-driven autovacuums, everything we've done over the last several years) I think that it might be time to increase the autovacuum_freeze_max_age default. Maybe even to something as high as 800 million transaction IDs, but certainly to 400 million. What do you think? (Maybe don't answer just yet, something to think about.) > + vacrel->aggressive = aggressive; > vacrel->failsafe_active = false; > vacrel->consider_bypass_optimization = true; > > How about adding skipwithvm to LVRelState too? Agreed -- it's slightly better that way. Will change this. > */ > - if (skipping_blocks && !FORCE_CHECK_PAGE()) > + if (skipping_blocks && blkno < nblocks - 1) > > Why do we always need to scan the last page even if heap truncation is > disabled (or in the failsafe mode)? 
My goal here was to keep the behavior from commit e8429082, "Avoid useless truncation attempts during VACUUM", while simplifying things around skipping heap pages via the visibility map (including removing the FORCE_CHECK_PAGE() macro). Of course you're right that this particular change that you have highlighted does change the behavior a little -- now we will always treat the final page as a "scanned page", except perhaps when 100% of all pages in the relation are skipped using the visibility map. This was a deliberate choice (and perhaps even a good choice!). I think that avoiding accessing the last heap page like this isn't worth the complexity. Note that we may already access heap pages (making them "scanned pages") despite the fact that we know it's unnecessary: the SKIP_PAGES_THRESHOLD test leads to this behavior (and we don't even try to avoid wasting CPU cycles on these not-skipped-but-skippable pages). So I think that the performance cost for the last page isn't going to be noticeable. However, now that I think about it, I wonder...what do you think of SKIP_PAGES_THRESHOLD, in general? Is the optimal value still 32 today? SKIP_PAGES_THRESHOLD hasn't changed since commit bf136cf6e3, shortly after the original visibility map implementation was committed in 2009. The idea that it helps us to advance relfrozenxid outside of aggressive VACUUMs (per commit message from bf136cf6e3) seems like it might no longer matter with the patch -- because now we won't ever set a page all-visible but not all-frozen. Plus the idea that we need to do all this work just to get readahead from the OS seems...questionable. -- Peter Geoghegan
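As a point of reference, the skipping rule under discussion can be reduced to something like the following standalone sketch -- a hypothetical function, not the actual vacuumlazy.c code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

/* vacuumlazy.c value, unchanged since commit bf136cf6e3 in 2009 */
#define SKIP_PAGES_THRESHOLD ((BlockNumber) 32)

/*
 * Skip a block only when it sits inside a run of skippable (all-visible or
 * all-frozen) blocks at least SKIP_PAGES_THRESHOLD long, and never skip the
 * final block, so that the truncation decision doesn't require taking
 * AccessExclusiveLock just to find out that the last page has tuples.
 */
static bool
skip_this_block(BlockNumber blkno, BlockNumber rel_pages,
                BlockNumber skippable_run_len)
{
    if (blkno >= rel_pages - 1)
        return false;           /* always scan the last page */
    return skippable_run_len >= SKIP_PAGES_THRESHOLD;
}

int
main(void)
{
    printf("%d\n", skip_this_block(10, 1000, 64));  /* 1: inside a long run */
    printf("%d\n", skip_this_block(10, 1000, 8));   /* 0: run shorter than threshold */
    printf("%d\n", skip_this_block(999, 1000, 64)); /* 0: final block is always scanned */
    return 0;
}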
On Sat, Dec 18, 2021 at 11:29 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Thu, Dec 16, 2021 at 10:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > My emphasis here has been on making non-aggressive VACUUMs *always* > > > advance relfrozenxid, outside of certain obvious edge cases. And so > > > with all the patches applied, up to and including the opportunistic > > > freezing patch, every autovacuum of every table manages to advance > > > relfrozenxid during benchmarking -- usually to a fairly recent value. > > > I've focussed on making aggressive VACUUMs (especially anti-wraparound > > > autovacuums) a rare occurrence, for truly exceptional cases (e.g., > > > user keeps canceling autovacuums, maybe due to automated script that > > > performs DDL). That has taken priority over other goals, for now. > > > > Great! > > Maybe this is a good time to revisit basic questions about VACUUM. I > wonder if we can get rid of some of the GUCs for VACUUM now. > > Can we fully get rid of vacuum_freeze_table_age? Does it mean that a vacuum always is an aggressive vacuum? If opportunistic freezing works well on all tables, we might no longer need vacuum_freeze_table_age. But I’m not sure that’s true since the cost of freezing tuples is not 0. > We probably shouldn't be using any units, but using XIDs "feels wrong" > to me. Even with my patch, it is theoretically possible that we won't > be able to advance relfrozenxid very much, because we cannot get a > cleanup lock on one single heap page with one old XID. But even in > this extreme case, how relevant is the "age" of this old XID, really? > What really matters is whether or not we can advance relfrozenxid in > time (with time to spare). And so the wraparound risk of the system is > not affected all that much by the age of the single oldest XID. The > risk mostly comes from how much total work we still need to do to > advance relfrozenxid. If the single old XID is quite old indeed (~1.5 > billion XIDs), but there is only one, then we just have to freeze one > tuple to be able to safely advance relfrozenxid (maybe advance it by a > huge amount!). How long can it take to freeze one tuple, with the > freeze map, etc? I think that that's true for (mostly) static tables. But regarding constantly-updated tables, since autovacuum runs based on the number of garbage tuples (or inserted tuples) and how old the relfrozenxid is if an autovacuum could not advance the relfrozenxid because it could not get a cleanup lock on the page that has the single oldest XID, it's likely that when autovacuum runs next time it will have to process other pages too since the page will get dirty enough. It might be a good idea that we remember pages where we could not get a cleanup lock somewhere and revisit them after index cleanup. While revisiting the pages, we don’t prune the page but only freeze tuples. > > On the other hand, the risk may be far greater if we have *many* > tuples that are still unfrozen, whose XIDs are only "middle aged" > right now. The idea behind vacuum_freeze_min_age seems to be to be > lazy about work (tuple freezing) in the hope that we'll never need to > do it, but that seems obsolete now. (It probably made a little more > sense before the visibility map.) Why is it obsolete now? I guess that it's still valid depending on the cases, for example, heavily updated tables. > > Using XIDs makes sense for things like autovacuum_freeze_max_age, > because there we have to worry about wraparound and relfrozenxid > (whether or not we like it). 
But with this patch, and with everything > else (the failsafe, insert-driven autovacuums, everything we've done > over the last several years) I think that it might be time to increase > the autovacuum_freeze_max_age default. Maybe even to something as high > as 800 million transaction IDs, but certainly to 400 million. What do > you think? (Maybe don't answer just yet, something to think about.) I don’t have an objection to increasing autovacuum_freeze_max_age for now. One of my concerns with anti-wraparound vacuums is that too many tables (or several large tables) will reach autovacuum_freeze_max_age at once, using up autovacuum slots and preventing autovacuums from being launched on tables that are heavily being updated. Given these works, expanding the gap between vacuum_freeze_table_age and autovacuum_freeze_max_age would have better chances for the tables to advance its relfrozenxid by an aggressive vacuum instead of an anti-wraparound-aggressive vacuum. 400 million seems to be a good start. > > > + vacrel->aggressive = aggressive; > > vacrel->failsafe_active = false; > > vacrel->consider_bypass_optimization = true; > > > > How about adding skipwithvm to LVRelState too? > > Agreed -- it's slightly better that way. Will change this. > > > */ > > - if (skipping_blocks && !FORCE_CHECK_PAGE()) > > + if (skipping_blocks && blkno < nblocks - 1) > > > > Why do we always need to scan the last page even if heap truncation is > > disabled (or in the failsafe mode)? > > My goal here was to keep the behavior from commit e8429082, "Avoid > useless truncation attempts during VACUUM", while simplifying things > around skipping heap pages via the visibility map (including removing > the FORCE_CHECK_PAGE() macro). Of course you're right that this > particular change that you have highlighted does change the behavior a > little -- now we will always treat the final page as a "scanned page", > except perhaps when 100% of all pages in the relation are skipped > using the visibility map. > > This was a deliberate choice (and perhaps even a good choice!). I > think that avoiding accessing the last heap page like this isn't worth > the complexity. Note that we may already access heap pages (making > them "scanned pages") despite the fact that we know it's unnecessary: > the SKIP_PAGES_THRESHOLD test leads to this behavior (and we don't > even try to avoid wasting CPU cycles on these > not-skipped-but-skippable pages). So I think that the performance cost > for the last page isn't going to be noticeable. Agreed. > > However, now that I think about it, I wonder...what do you think of > SKIP_PAGES_THRESHOLD, in general? Is the optimal value still 32 today? > SKIP_PAGES_THRESHOLD hasn't changed since commit bf136cf6e3, shortly > after the original visibility map implementation was committed in > 2009. The idea that it helps us to advance relfrozenxid outside of > aggressive VACUUMs (per commit message from bf136cf6e3) seems like it > might no longer matter with the patch -- because now we won't ever set > a page all-visible but not all-frozen. Plus the idea that we need to > do all this work just to get readahead from the OS > seems...questionable. Given the opportunistic freezing, that's true but I'm concerned whether opportunistic freezing always works well on all tables since freezing tuples is not 0 cost. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
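For reference, the relationship between the two thresholds being discussed can be sketched as follows -- hypothetical standalone code that simplifies what autovacuum's scheduling and VACUUM's aggressiveness test actually do, using the stock 150 million vacuum_freeze_table_age default and the 400 million autovacuum_freeze_max_age value floated above:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
#define FirstNormalTransactionId ((TransactionId) 3)

/* modulo-2^32 ordering, in the style of TransactionIdPrecedes() */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/* cutoff that is "age" XIDs behind the next XID (reserved XIDs skipped) */
static TransactionId
limit_xid(TransactionId next_xid, uint32_t age)
{
    TransactionId limit = next_xid - age;

    if (limit < FirstNormalTransactionId)
        limit -= FirstNormalTransactionId;
    return limit;
}

int
main(void)
{
    TransactionId relfrozenxid = 200000000;
    TransactionId next_xid = 500000000;

    /* escalate an already-scheduled VACUUM to aggressive */
    bool aggressive = xid_precedes(relfrozenxid,
                                   limit_xid(next_xid, 150000000));
    /* force an anti-wraparound autovacuum even with no other reason to run */
    bool antiwraparound = xid_precedes(relfrozenxid,
                                       limit_xid(next_xid, 400000000));

    printf("aggressive=%d antiwraparound=%d\n", aggressive, antiwraparound);
    return 0;
}

With a wide gap between the two settings, a table in this state gets an ordinary, cancellable aggressive VACUUM long before the forced anti-wraparound kind -- which is the scheduling property being asked for above.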
On Mon, Dec 20, 2021 at 8:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Can we fully get rid of vacuum_freeze_table_age? > > Does it mean that a vacuum always is an aggressive vacuum? No. Just somewhat more like one. Still no waiting for cleanup locks, though. Also, autovacuum is still cancelable (that's technically from anti-wraparound VACUUM, but you know what I mean). And there shouldn't be a noticeable difference in terms of how many blocks can be skipped using the VM. > If opportunistic freezing works well on all tables, we might no longer > need vacuum_freeze_table_age. But I’m not sure that’s true since the > cost of freezing tuples is not 0. That's true, of course, but right now the only goal of opportunistic freezing is to advance relfrozenxid in every VACUUM. It needs to be shown to be worth it, of course. But let's assume that it is worth it, for a moment (perhaps only because we optimize freezing itself in passing) -- then there is little use for vacuum_freeze_table_age, that I can see. > I think that that's true for (mostly) static tables. But regarding > constantly-updated tables, since autovacuum runs based on the number > of garbage tuples (or inserted tuples) and how old the relfrozenxid is > if an autovacuum could not advance the relfrozenxid because it could > not get a cleanup lock on the page that has the single oldest XID, > it's likely that when autovacuum runs next time it will have to > process other pages too since the page will get dirty enough. I'm not arguing that the age of the single oldest XID is *totally* irrelevant. Just that it's typically much less important than the total amount of work we'd have to do (freezing) to be able to advance relfrozenxid. In any case, the extreme case where we just cannot get a cleanup lock on one particular page with an old XID is probably very rare. > It might be a good idea that we remember pages where we could not get > a cleanup lock somewhere and revisit them after index cleanup. While > revisiting the pages, we don’t prune the page but only freeze tuples. Maybe, but I think that it would make more sense to not use FreezeLimit for that at all. In an aggressive VACUUM (where we might actually have to wait for a cleanup lock), why should we wait once the age is over vacuum_freeze_min_age (usually 50 million XIDs)? The official answer is "because we need to advance relfrozenxid". But why not accept a much older relfrozenxid that is still sufficiently young/safe, in order to avoid waiting for a cleanup lock? In other words, what if our approach of "being diligent about advancing relfrozenxid" makes the relfrozenxid problem worse, not better? The problem with "being diligent" is that it is defined by FreezeLimit (which is more or less the same thing as vacuum_freeze_min_age), which is supposed to be about which tuples we will freeze. That's a very different thing to how old relfrozenxid should be or can be (after an aggressive VACUUM finishes). > > On the other hand, the risk may be far greater if we have *many* > > tuples that are still unfrozen, whose XIDs are only "middle aged" > > right now. The idea behind vacuum_freeze_min_age seems to be to be > > lazy about work (tuple freezing) in the hope that we'll never need to > > do it, but that seems obsolete now. (It probably made a little more > > sense before the visibility map.) > > Why is it obsolete now? I guess that it's still valid depending on the > cases, for example, heavily updated tables. 
Because after the 9.6 freezemap work we'll often set the all-visible bit in the VM, but not the all-frozen bit (unless we have the opportunistic freezing patch applied, which specifically avoids that). When that happens, affected heap pages will still have older-than-vacuum_freeze_min_age-XIDs after VACUUM runs, until we get to an aggressive VACUUM. There could be many VACUUMs before the aggressive VACUUM. This "freezing cliff" seems like it might be a big problem, in general. That's what I'm trying to address here. Either way, the system doesn't really respect vacuum_freeze_min_age in the way that it did before 9.6 -- which is what I meant by "obsolete". > I don’t have an objection to increasing autovacuum_freeze_max_age for > now. One of my concerns with anti-wraparound vacuums is that too many > tables (or several large tables) will reach autovacuum_freeze_max_age > at once, using up autovacuum slots and preventing autovacuums from > being launched on tables that are heavily being updated. I think that the patch helps with that, actually -- there tends to be "natural variation" in the relfrozenxid age of each table, which comes from per-table workload characteristics. > Given these > works, expanding the gap between vacuum_freeze_table_age and > autovacuum_freeze_max_age would have better chances for the tables to > advance its relfrozenxid by an aggressive vacuum instead of an > anti-wraparound-aggressive vacuum. 400 million seems to be a good > start. The idea behind getting rid of vacuum_freeze_table_age (not to be confused by the other idea about getting rid of vacuum_freeze_min_age) is this: with the patch series, we only tend to get an anti-wraparound VACUUM in extreme and relatively rare cases. For example, we will get aggressive anti-wraparound VACUUMs on tables that *never* grow, but constantly get HOT updates (e.g. the pgbench_accounts table with heap fill factor reduced to 90). We won't really be able to use the VM when this happens, either. With tables like this -- tables that still get aggressive VACUUMs -- maybe the patch doesn't make a huge difference. But that's truly the extreme case -- that is true only because there is already zero chance of there being a non-aggressive VACUUM. We'll get aggressive anti-wraparound VACUUMs every time we reach autovacuum_freeze_max_age, again and again -- no change, really. But since it's only these extreme cases that continue to get aggressive VACUUMs, why do we still need vacuum_freeze_table_age? It helps right now (without the patch) by "escalating" a regular VACUUM to an aggressive one. But the cases that we still expect an aggressive VACUUM (with the patch) are the cases where there is zero chance of that happening. Almost by definition. > Given the opportunistic freezing, that's true but I'm concerned > whether opportunistic freezing always works well on all tables since > freezing tuples is not 0 cost. That is the big question for this patch. -- Peter Geoghegan
On Mon, Dec 20, 2021 at 9:35 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Given the opportunistic freezing, that's true but I'm concerned > > whether opportunistic freezing always works well on all tables since > > freezing tuples is not 0 cost. > > That is the big question for this patch. Attached is a mechanical rebase of the patch series. This new version just fixes bitrot caused by Masahiko's recent vacuumlazy.c refactoring work. In other words, this revision has no significant changes compared to the v4 that I posted back in late December -- just want to keep CFTester green. I still have plenty of work to do here. Especially with the final patch (the v5-0005-* "freeze early" patch), which is generally more speculative than the other patches. I'm playing catch-up now, since I just returned from vacation. -- Peter Geoghegan
On Fri, Dec 17, 2021 at 9:30 PM Peter Geoghegan <pg@bowt.ie> wrote: > Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of > vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for > physical blocks, but we use logical units (XIDs). I don't see how we can get rid of these. We know that catastrophe will ensue if we fail to freeze old XIDs for a sufficiently long time --- where sufficiently long has to do with the number of XIDs that have been subsequently consumed. So it's natural to decide whether or not we're going to wait for cleanup locks on pages on the basis of how old the XIDs they contain actually are. Admittedly, that decision doesn't need to be made at the start of the vacuum, as we do today. We could happily skip waiting for a cleanup lock on pages that contain only newer XIDs, but if there is a page that both contains an old XID and stays pinned for a long time, we eventually have to sit there and wait for that pin to be released. And the best way to decide when to switch to that strategy is really based on the age of that XID, at least as I see it, because it is the age of that XID reaching 2 billion that is going to kill us. I think vacuum_freeze_min_age also serves a useful purpose: it prevents us from freezing data that's going to be modified again or even deleted in the near future. Since we can't know the future, we must base our decision on the assumption that the future will be like the past: if the page hasn't been modified for a while, then we should assume it's not likely to be modified again soon; otherwise not. If we knew the time at which the page had last been modified, it would be very reasonable to use that here - say, freeze the XIDs if the page hasn't been touched in an hour, or whatever. But since we lack such timestamps the XID age is the closest proxy we have. > The > risk mostly comes from how much total work we still need to do to > advance relfrozenxid. If the single old XID is quite old indeed (~1.5 > billion XIDs), but there is only one, then we just have to freeze one > tuple to be able to safely advance relfrozenxid (maybe advance it by a > huge amount!). How long can it take to freeze one tuple, with the > freeze map, etc? I don't really see any reason for optimism here. There could be a lot of unfrozen pages in the relation, and we'd have to troll through all of those in order to find that single old XID. Moreover, there is nothing whatsoever to focus autovacuum's attention on that single old XID rather than anything else. Nothing in the autovacuum algorithm will cause it to focus its efforts on that single old XID at a time when there's no pin on the page, or at a time when that XID becomes the thing that's holding back vacuuming throughout the cluster. A lot of vacuum problems that users experience today would be avoided if autovacuum had perfect knowledge of what it ought to be prioritizing at any given time, or even some knowledge. But it doesn't, and is often busy fiddling while Rome burns. IOW, the time that it takes to freeze that one tuple *in theory* might be small. But in practice it may be very large, because we won't necessarily get around to it on any meaningful time frame. -- Robert Haas EDB: http://www.enterprisedb.com
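For readers without vacuum.c open: the cutoff being defended here is derived today roughly as in the sketch below. This is modelled on vacuum_set_xid_limits() but simplified (warning paths and multixact handling omitted), and the function name is invented:

#include "postgres.h"
#include "access/transam.h"

/*
 * Simplified sketch of how FreezeLimit currently falls out of
 * vacuum_freeze_min_age: subtract the GUC from the oldest XID still
 * considered running, then clamp so the limit stays a normal XID and
 * never exceeds OldestXmin.
 */
static TransactionId
compute_freeze_limit(TransactionId oldestXmin, int freeze_min_age)
{
    TransactionId limit;

    limit = oldestXmin - freeze_min_age;    /* unsigned XID arithmetic wraps */
    if (!TransactionIdIsNormal(limit))
        limit = FirstNormalTransactionId;
    if (TransactionIdPrecedes(oldestXmin, limit))
        limit = oldestXmin;     /* e.g. the freeze_min_age = 0 case */

    return limit;
}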
On Thu, Jan 6, 2022 at 12:54 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Dec 17, 2021 at 9:30 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of > > vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for > > physical blocks, but we use logical units (XIDs). > > I don't see how we can get rid of these. We know that catastrophe will > ensue if we fail to freeze old XIDs for a sufficiently long time --- > where sufficiently long has to do with the number of XIDs that have > been subsequently consumed. I don't really disagree with anything you've said, I think. There are a few subtleties here. I'll try to tease them apart. I agree that we cannot do without something like vacrel->FreezeLimit for the foreseeable future -- but the closely related GUC (vacuum_freeze_min_age) is another matter. Although everything you've said in favor of the GUC seems true, the GUC is not a particularly effective (or natural) way of constraining the problem. It just doesn't make sense as a tunable. One obvious reason for this is that the opportunistic freezing stuff is expected to be the thing that usually forces freezing -- not vacuum_freeze_min_age, nor FreezeLimit, nor any other XID-based cutoff. As you more or less pointed out yourself, we still need FreezeLimit as a backstop mechanism. But the value of FreezeLimit can just come from autovacuum_freeze_max_age/2 in all cases (no separate GUC), or something along those lines. We don't particularly expect the value of FreezeLimit to matter, at least most of the time. It should only noticeably affect our behavior during anti-wraparound VACUUMs, which become rare with the patch (e.g. my pgbench_accounts example upthread). Most individual tables will never get even one anti-wraparound VACUUM -- it just doesn't ever come for most tables in practice. My big issue with vacuum_freeze_min_age is that it doesn't really work with the freeze map work in 9.6, which creates problems that I'm trying to address by freezing early and so on. After all, HEAD (and all stable branches) can easily set a page to all-visible (but not all-frozen) in the VM, meaning that the page's tuples won't be considered for freezing until the next aggressive VACUUM. This means that vacuum_freeze_min_age is already frequently ignored by the implementation -- it's conditioned on other things that are practically impossible to predict. Curious about your thoughts on this existing issue with vacuum_freeze_min_age. I am concerned about the "freezing cliff" that it creates. > So it's natural to decide whether or not > we're going to wait for cleanup locks on pages on the basis of how old > the XIDs they contain actually are. I agree, but again, it's only a backstop. With the patch we'd have to be rather unlucky to ever need to wait like this. What are the chances that we keep failing to freeze an old XID from one particular page, again and again? My testing indicates that it's a negligible concern in practice (barring pathological cases with idle cursors, etc). > I think vacuum_freeze_min_age also serves a useful purpose: it > prevents us from freezing data that's going to be modified again or > even deleted in the near future. Since we can't know the future, we > must base our decision on the assumption that the future will be like > the past: if the page hasn't been modified for a while, then we should > assume it's not likely to be modified again soon; otherwise not. 
But the "freeze early" heuristics work a bit like that anyway. We won't freeze all the tuples on a whole heap page early if we won't otherwise set the heap page to all-visible (not all-frozen) in the VM anyway. > If we > knew the time at which the page had last been modified, it would be > very reasonable to use that here - say, freeze the XIDs if the page > hasn't been touched in an hour, or whatever. But since we lack such > timestamps the XID age is the closest proxy we have. XID age is a *terrible* proxy. The age of an XID in a tuple header may advance quickly, even when nobody modifies the same table at all. I concede that it is true that we are (in some sense) "gambling" by freezing early -- we may end up freezing a tuple that we subsequently update anyway. But aren't we also "gambling" by *not* freezing early? By not freezing, we risk getting into "freezing debt" that will have to be paid off in one ruinously large installment. I would much rather "gamble" on something where we can tolerate consistently "losing" than gamble on something where I cannot ever afford to lose (even if it's much less likely that I'll lose during any given VACUUM operation). Besides all this, I think that we have a rather decent chance of coming out ahead in practice by freezing early. In practice the marginal cost of freezing early is consistently pretty low. Cost-control-driven (as opposed to need-driven) freezing is *supposed* to be cheaper, of course. And like it or not, freezing is really just part of the cost of storing data using Postgres (for the time being, at least). > > The > > risk mostly comes from how much total work we still need to do to > > advance relfrozenxid. If the single old XID is quite old indeed (~1.5 > > billion XIDs), but there is only one, then we just have to freeze one > > tuple to be able to safely advance relfrozenxid (maybe advance it by a > > huge amount!). How long can it take to freeze one tuple, with the > > freeze map, etc? > > I don't really see any reason for optimism here. > IOW, the time that it takes to freeze that one tuple *in theory* might > be small. But in practice it may be very large, because we won't > necessarily get around to it on any meaningful time frame. On second thought I agree that my specific example of 1.5 billion XIDs was a little too optimistic of me. But 50 million XIDs (i.e. the vacuum_freeze_min_age default) is too pessimistic. The important point is that FreezeLimit could plausibly become nothing more than a backstop mechanism, with the design from the patch series -- something that typically has no effect on what tuples actually get frozen. -- Peter Geoghegan
On Thu, Jan 6, 2022 at 2:45 PM Peter Geoghegan <pg@bowt.ie> wrote: > But the "freeze early" heuristics work a bit like that anyway. We > won't freeze all the tuples on a whole heap page early if we won't > otherwise set the heap page to all-visible (not all-frozen) in the VM > anyway. I believe that applications tend to update rows according to predictable patterns. Andy Pavlo made an observation about this at one point: https://youtu.be/AD1HW9mLlrg?t=3202 I think that we don't do a good enough job of keeping logically related tuples (tuples inserted around the same time) together, on the same original heap page, which motivated a lot of my experiments with the FSM from last year. Even still, it seems like a good idea for us to err in the direction of assuming that tuples on the same heap page are logically related. The tuples should all be frozen together when possible. And *not* frozen early when the heap page as a whole can't be frozen (barring cases with one *much* older XID before FreezeLimit). -- Peter Geoghegan
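The page-level framing might be easier to see as a tiny predicate. This is only a restatement of the heuristic being described, with made-up names -- not anything lifted from the v5-0005-* patch:

#include "postgres.h"
#include "access/transam.h"

/*
 * "Freeze the page together, or not at all": freeze every eligible tuple
 * when doing so lets us set the all-frozen VM bit, or when the backstop
 * limit forces our hand; otherwise leave the page's tuples unfrozen for
 * now, on the theory that tuples that were inserted together will likely
 * be modified (or frozen) together later.
 */
static bool
freeze_whole_page_early(bool page_will_be_all_visible,
                        bool all_tuples_freezable,
                        TransactionId oldest_tuple_xid,
                        TransactionId backstop_freeze_limit)
{
    if (TransactionIdPrecedes(oldest_tuple_xid, backstop_freeze_limit))
        return true;            /* backstop: an XID is too old to leave */

    return page_will_be_all_visible && all_tuples_freezable;
}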
On Thu, Jan 6, 2022 at 5:46 PM Peter Geoghegan <pg@bowt.ie> wrote: > One obvious reason for this is that the opportunistic freezing stuff > is expected to be the thing that usually forces freezing -- not > vacuum_freeze_min_age, nor FreezeLimit, nor any other XID-based > cutoff. As you more or less pointed out yourself, we still need > FreezeLimit as a backstop mechanism. But the value of FreezeLimit can > just come from autovacuum_freeze_max_age/2 in all cases (no separate > GUC), or something along those lines. We don't particularly expect the > value of FreezeLimit to matter, at least most of the time. It should > only noticeably affect our behavior during anti-wraparound VACUUMs, > which become rare with the patch (e.g. my pgbench_accounts example > upthread). Most individual tables will never get even one > anti-wraparound VACUUM -- it just doesn't ever come for most tables in > practice. This seems like a weak argument. Sure, you COULD hard-code the limit to be autovacuum_freeze_max_age/2 rather than making it a separate tunable, but I don't think it's better. I am generally very skeptical about the idea of using the same GUC value for multiple purposes, because it often turns out that the optimal value for one purpose is different than the optimal value for some other purpose. For example, the optimal amount of memory for a hash table is likely different than the optimal amount for a sort, which is why we now have hash_mem_multiplier. When it's not even the same value that's being used in both places, but the original value in one place and a value derived from some formula in the other, the chances of things working out are even less. I feel generally that a lot of the argument you're making here supposes that tables are going to get vacuumed regularly. I agree that IF tables are being vacuumed on a regular basis, and if as part of that we always push relfrozenxid forward as far as we can, we will rarely have a situation where aggressive strategies to avoid wraparound are required. However, I disagree strongly with the idea that we can assume that tables will get vacuumed regularly. That can fail to happen for all sorts of reasons. One of the common ones is a poor choice of autovacuum configuration. The most common problem in my experience is a cost limit that is too low to permit the amount of vacuuming that is actually required, but other kinds of problems like not enough workers (so tables get starved), too many workers (so the cost limit is being shared between many processes), autovacuum=off either globally or on one table (because of ... reasons), autovacuum_vacuum_insert_threshold = -1 plus not many updates (so nothing ever triggers the vacuum), autovacuum_naptime=1d (actually seen in the real world! ... and, no, it didn't work well), or stats collector problems are all possible. We can *hope* that there are going to be regular vacuums of the table long before wraparound becomes a danger, but realistically, we had better not assume that in our choice of algorithms, because the real world is a messy place where all sorts of crazy things happen. Now, I agree with you in part: I don't think it's obvious that it's useful to tune vacuum_freeze_table_age. When I advise customers on how to fix vacuum problems, I am usually telling them to increase autovacuum_vacuum_cost_limit, possibly also with an increase in autovacuum_max_workers; or to increase or decrease autovacuum_freeze_max_age depending on which problem they have; or occasionally to adjust settings like autovacuum_naptime.
It doesn't often seem to be necessary to change vacuum_freeze_table_age or, for that matter, vacuum_freeze_min_age. But if we remove them and then discover scenarios where tuning them would have been useful, we'll have no options for fixing PostgreSQL systems in the field. Waiting for the next major release in such a scenario, or even the next minor release, is not good. We should be VERY conservative about removing existing settings if there's any chance that somebody could use them to tune their way out of trouble. > My big issue with vacuum_freeze_min_age is that it doesn't really work > with the freeze map work in 9.6, which creates problems that I'm > trying to address by freezing early and so on. After all, HEAD (and > all stable branches) can easily set a page to all-visible (but not > all-frozen) in the VM, meaning that the page's tuples won't be > considered for freezing until the next aggressive VACUUM. This means > that vacuum_freeze_min_age is already frequently ignored by the > implementation -- it's conditioned on other things that are practically > impossible to predict. > > Curious about your thoughts on this existing issue with > vacuum_freeze_min_age. I am concerned about the "freezing cliff" that > it creates. So, let's see: if we see a page where the tuples are all-visible and we seize the opportunity to freeze it, we can spare ourselves the need to ever visit that page again (unless it gets modified). But if we only mark it all-visible and leave the freezing for later, the next aggressive vacuum will have to scan and dirty the page. I'm prepared to believe that it's worth the cost of freezing the page in that scenario. We've already dirtied the page and written some WAL and maybe generated an FPW, so doing the rest of the work now rather than saving it until later seems likely to be a win. I think it's OK to behave, in this situation, as if vacuum_freeze_min_age=0. There's another situation in which vacuum_freeze_min_age could apply, though: suppose the page isn't all-visible yet. I'd argue that in that case we don't want to run around freezing stuff unless it's quite old - like older than vacuum_freeze_table_age, say. Because we know we're going to have to revisit this page in the next vacuum anyway, and expending effort to freeze tuples that may be about to be modified again doesn't seem prudent. So, hmm, on further reflection, maybe it's OK to remove vacuum_freeze_min_age. But if we do, then I think we had better carefully distinguish between the case where the page can thereby be marked all-frozen and the case where it cannot. I guess you say the same, further down. > > So it's natural to decide whether or not > > we're going to wait for cleanup locks on pages on the basis of how old > > the XIDs they contain actually are. > > I agree, but again, it's only a backstop. With the patch we'd have to > be rather unlucky to ever need to wait like this. > > What are the chances that we keep failing to freeze an old XID from > one particular page, again and again? My testing indicates that it's a > negligible concern in practice (barring pathological cases with idle > cursors, etc). I mean, those kinds of pathological cases happen *all the time*. Sure, there are plenty of users who don't leave cursors open. But the ones who do don't leave them around for short periods of time on randomly selected pages of the table. They are disproportionately likely to leave them on the same table pages over and over, just like data can't in general be assumed to be uniformly accessed. 
And not uncommonly, they leave them around until the snow melts. And we need to worry about those kinds of users, actually much more than we need to worry about users doing normal things. Honestly, autovacuum on a system where things are mostly "normal" - no long-running transactions, adequate resources for autovacuum to do its job, reasonable configuration settings - isn't that bad. It's true that there are people who get surprised by an aggressive autovacuum kicking off unexpectedly, but it's usually the first one during the cluster lifetime (which is typically the biggest, since the initial load tends to be bigger than later ones) and it's usually annoying but survivable. The places where autovacuum becomes incredibly frustrating are the pathological cases. When insufficient resources are available to complete the work in a timely fashion, or difficult trade-offs have to be made, autovacuum is too dumb to make the right choices. And even if you call your favorite PostgreSQL support provider and they provide an expert, once it gets behind, autovacuum isn't very tractable: it will insist on vacuuming everything, right now, in an order that it chooses, and it's not going to listen to any nonsense from some human being who thinks they might have some useful advice to provide! > But the "freeze early" heuristics work a bit like that anyway. We > won't freeze all the tuples on a whole heap page early if we won't > otherwise set the heap page to all-visible (not all-frozen) in the VM > anyway. Hmm, I didn't realize that we had that. Is that an existing thing or something new you're proposing to do? If existing, where is it? > > IOW, the time that it takes to freeze that one tuple *in theory* might > > be small. But in practice it may be very large, because we won't > > necessarily get around to it on any meaningful time frame. > > On second thought I agree that my specific example of 1.5 billion XIDs > was a little too optimistic of me. But 50 million XIDs (i.e. the > vacuum_freeze_min_age default) is too pessimistic. The important point > is that FreezeLimit could plausibly become nothing more than a > backstop mechanism, with the design from the patch series -- something > that typically has no effect on what tuples actually get frozen. I agree that it's OK for this to become a purely backstop mechanism ... but again, I think that the design of such backstop mechanisms should be done as carefully as we know how, because users seem to hit the backstop all the time. We want it to be made of, you know, nylon twine, rather than, say, sharp nails. :-) -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jan 7, 2022 at 12:24 PM Robert Haas <robertmhaas@gmail.com> wrote: > This seems like a weak argument. Sure, you COULD hard-code the limit > to be autovacuum_freeze_max_age/2 rather than making it a separate > tunable, but I don't think it's better. I am generally very skeptical > about the idea of using the same GUC value for multiple purposes, > because it often turns out that the optimal value for one purpose is > different than the optimal value for some other purpose. I thought I was being conservative by suggesting autovacuum_freeze_max_age/2. My first thought was to teach VACUUM to make its FreezeLimit "OldestXmin - autovacuum_freeze_max_age". To me these two concepts really *are* the same thing: vacrel->FreezeLimit becomes a backstop, just as anti-wraparound autovacuum (the autovacuum_freeze_max_age cutoff) becomes a backstop. Of course, an anti-wraparound VACUUM will do early freezing in the same way as any other VACUUM will (with the patch series). So even when the FreezeLimit backstop XID cutoff actually affects the behavior of a given VACUUM operation, it may well not be the reason why most individual tuples that we freeze get frozen. That is, most individual heap pages will probably have tuples frozen for some other reason. Though it depends on workload characteristics, most individual heap pages will typically be frozen as a group, even here. This is a logical consequence of the fact that tuple freezing and advancing relfrozenxid are now only loosely coupled -- it's about as loose as the current relfrozenxid invariant will allow. > I feel generally that a lot of the argument you're making here > supposes that tables are going to get vacuumed regularly. > I agree that > IF tables are being vacuumed on a regular basis, and if as part of > that we always push relfrozenxid forward as far as we can, we will > rarely have a situation where aggressive strategies to avoid > wraparound are required. It's all relative. We hope that (with the patch) cases that only ever get anti-wraparound VACUUMs are limited to tables where nothing else drives VACUUM, for sensible reasons related to workload characteristics (like the pgbench_accounts example upthread). It's inevitable that some users will misconfigure the system, though -- no question about that. I don't see why users that misconfigure the system in this way should be any worse off than they would be today. They probably won't do substantially less freezing (usually somewhat more), and will advance pg_class.relfrozenxid in exactly the same way as today (usually a bit better, actually). What have I missed? Admittedly the design of the "Freeze tuples early to advance relfrozenxid" patch (i.e. v5-0005-*patch) is still unsettled; I need to verify that my claims about it are really robust. But as far as I know they are. Reviewers should certainly look at that with a critical eye. > Now, I agree with you in part: I don't think it's obvious that it's > useful to tune vacuum_freeze_table_age. That's definitely the easier argument to make. After all, vacuum_freeze_table_age will do nothing unless VACUUM runs before the anti-wraparound threshold (autovacuum_freeze_max_age) is reached. The patch series should be strictly better than that. Primarily because it's "continuous", and so isn't limited to cases where the table age falls within the "vacuum_freeze_table_age - autovacuum_freeze_max_age" goldilocks age range. 
> We should be VERY conservative about removing > existing settings if there's any chance that somebody could use them > to tune their way out of trouble. I agree, I suppose, but right now I honestly can't think of a reason why they would be useful. If I am wrong about this then I'm probably also wrong about some basic facet of the high-level design, in which case I should change course altogether. In other words, removing the GUCs is not an incidental thing. It's possible that I would never have pursued this project if I didn't first notice how wrong-headed the GUCs are. > So, let's see: if we see a page where the tuples are all-visible and > we seize the opportunity to freeze it, we can spare ourselves the need > to ever visit that page again (unless it gets modified). But if we > only mark it all-visible and leave the freezing for later, the next > aggressive vacuum will have to scan and dirty the page. I'm prepared > to believe that it's worth the cost of freezing the page in that > scenario. That's certainly the most compelling reason to perform early freezing. It's not completely free of downsides, but it's pretty close. > There's another situation in which vacuum_freeze_min_age could apply, > though: suppose the page isn't all-visible yet. I'd argue that in that > case we don't want to run around freezing stuff unless it's quite old > - like older than vacuum_freeze_table_age, say. Because we know we're > going to have to revisit this page in the next vacuum anyway, and > expending effort to freeze tuples that may be about to be modified > again doesn't seem prudent. So, hmm, on further reflection, maybe it's > OK to remove vacuum_freeze_min_age. But if we do, then I think we had > better carefully distinguish between the case where the page can > thereby be marked all-frozen and the case where it cannot. I guess you > say the same, further down. I do. Although the v5-0005-* patch still freezes early when the page is dirtied by pruning, I have my doubts about that particular "freeze early" criterion. I believe that everything I just said about misconfigured autovacuums doesn't rely on anything more than the "most compelling scenario for early freezing" mechanism that arranges to make us set the all-frozen bit (not just the all-visible bit). > I mean, those kinds of pathological cases happen *all the time*. Sure, > there are plenty of users who don't leave cursors open. But the ones > who do don't leave them around for short periods of time on randomly > selected pages of the table. They are disproportionately likely to > leave them on the same table pages over and over, just like data can't > in general be assumed to be uniformly accessed. And not uncommonly, > they leave them around until the snow melts. > And we need to worry about those kinds of users, actually much more > than we need to worry about users doing normal things. I couldn't agree more. In fact, I was mostly thinking about how to *help* these users. Insisting on waiting for a cleanup lock before it becomes strictly necessary (when the table age is only 50 million/vacuum_freeze_min_age) is actually a big part of the problem for these users. vacuum_freeze_min_age enforces a false dichotomy on aggressive VACUUMs that just isn't helpful. Why should waiting on a cleanup lock fix anything? Even in the extreme case where we are guaranteed to eventually have a wraparound failure in the end (due to an idle cursor in an unsupervised database), the user is still much better off, I think.
We will have at least managed to advance relfrozenxid to the exact oldest XID on the one heap page that somebody holds an idle cursor (conflicting buffer pin) on. And we'll usually have frozen most of the tuples that need to be frozen. Sure, the user may need to use single-user mode to run a manual VACUUM, but at least this process only needs to freeze approximately one tuple to get the system back online again. If the DBA notices the problem before the database starts to refuse to allocate XIDs, then they'll have a much better chance of avoiding a wraparound failure through simple intervention (like killing the backend with the idle cursor). We can pay down 99.9% of the "freeze debt" independently of this intractable problem of something holding onto an idle cursor. > Honestly, > autovacuum on a system where things are mostly "normal" - no > long-running transactions, adequate resources for autovacuum to do its > job, reasonable configuration settings - isn't that bad. Right. Autovacuum is "too big to fail". > > But the "freeze early" heuristics work a bit like that anyway. We > > won't freeze all the tuples on a whole heap page early if we won't > > otherwise set the heap page to all-visible (not all-frozen) in the VM > > anyway. > > Hmm, I didn't realize that we had that. Is that an existing thing or > something new you're proposing to do? If existing, where is it? It's part of v5-0005-*patch. Still in flux to some degree, because it's necessary to balance a few things. That shouldn't undermine the arguments I've made here. > I agree that it's OK for this to become a purely backstop mechanism > ... but again, I think that the design of such backstop mechanisms > should be done as carefully as we know how, because users seem to hit > the backstop all the time. We want it to be made of, you know, nylon > twine, rather than, say, sharp nails. :-) Absolutely. But if autovacuum can only ever run due to age(relfrozenxid) reaching autovacuum_freeze_max_age, then I can't see a downside. Again, the v5-0005-*patch needs to meet the standard that I've laid out. If it doesn't then I've messed up already. -- Peter Geoghegan
On Fri, Jan 7, 2022 at 5:20 PM Peter Geoghegan <pg@bowt.ie> wrote: > I thought I was being conservative by suggesting > autovacuum_freeze_max_age/2. My first thought was to teach VACUUM to > make its FreezeLimit "OldestXmin - autovacuum_freeze_max_age". To me > these two concepts really *are* the same thing: vacrel->FreezeLimit > becomes a backstop, just as anti-wraparound autovacuum (the > autovacuum_freeze_max_age cutoff) becomes a backstop. I can't follow this. If the idea is that we're going to opportunistically freeze a page whenever that allows us to mark it all-visible, then the remaining question is what XID age we should use to force freezing when that rule doesn't apply. It seems to me that there is a rebuttable presumption that that case ought to work just as it does today - and I think I hear you saying that it should NOT work as it does today, but should use some other threshold. Yet I can't understand why you think that. > I couldn't agree more. In fact, I was mostly thinking about how to > *help* these users. Insisting on waiting for a cleanup lock before it > becomes strictly necessary (when the table age is only 50 > million/vacuum_freeze_min_age) is actually a big part of the problem > for these users. vacuum_freeze_min_age enforces a false dichotomy on > aggressive VACUUMs, that just isn't unhelpful. Why should waiting on a > cleanup lock fix anything? Because waiting on a lock means that we'll acquire it as soon as it's available. If you repeatedly call your local Pizzeria Uno's and ask whether there is a wait, and head to the restaurant only when the answer is in the negative, you may never get there, because they may be busy every time you call - especially if you always call around lunch or dinner time. Even if you eventually get there, it may take multiple days before you find a time when a table is immediately available, whereas if you had just gone over there and stood in line, you likely would have been seated in under an hour and savoring the goodness of quality deep-dish pizza not too long thereafter. The same principle applies here. I do think that waiting for a cleanup lock when the age of the page is only vacuum_freeze_min_age seems like it might be too aggressive, but I don't think that's how it works. AFAICS, it's based on whether the vacuum is marked as aggressive, which has to do with vacuum_freeze_table_age, not vacuum_freeze_min_age. Let's turn the question around: if the age of the oldest XID on the page is >150 million transactions and the buffer cleanup lock is not available now, what makes you think that it's any more likely to be available when the XID age reaches 200 million or 300 million or 700 million? There is perhaps an argument for some kind of tunable that eventually shoots the other session in the head (if we can identify it, anyway) but it seems to me that regardless of what threshold we pick, polling is strictly less likely to find a time when the page is available than waiting for the cleanup lock. It has the counterbalancing advantage of allowing the autovacuum worker to do other useful work in the meantime and that is indeed a significant upside, but at some point you're going to have to give up and admit that polling is a failed strategy, and it's unclear why 150 million XIDs - or probably even 50 million XIDs - isn't long enough to say that we're not getting the job done with half measures. -- Robert Haas EDB: http://www.enterprisedb.com
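For anyone following along, the two behaviors being weighed correspond to two real bufmgr.c entry points. A simplified sketch is below; the policy flag is a stand-in, since the real decision in lazy_scan_heap() is tied to whether the VACUUM is aggressive:

#include "postgres.h"
#include "storage/bufmgr.h"

/*
 * Sketch of the choice under discussion.  Today an aggressive VACUUM waits
 * via LockBufferForCleanup(); a non-aggressive one only tries the
 * conditional variant and otherwise processes the page without freezing.
 */
static bool
acquire_cleanup_lock(Buffer buf, bool wait_for_cleanup_lock)
{
    if (ConditionalLockBufferForCleanup(buf))
        return true;            /* no one else has the page pinned */

    if (!wait_for_cleanup_lock)
        return false;           /* caller falls back to no-freeze processing */

    LockBufferForCleanup(buf);  /* may sleep behind an idle cursor's pin */
    return true;
}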
On Thu, Jan 13, 2022 at 12:19 PM Robert Haas <robertmhaas@gmail.com> wrote: > I can't follow this. If the idea is that we're going to > opportunistically freeze a page whenever that allows us to mark it > all-visible, then the remaining question is what XID age we should use > to force freezing when that rule doesn't apply. That is the idea, yes. > It seems to me that > there is a rebuttable presumption that that case ought to work just as > it does today - and I think I hear you saying that it should NOT work > as it does today, but should use some other threshold. Yet I can't > understand why you think that. Cases where we can not get a cleanup lock fall into 2 sharply distinct categories in my mind: 1. Cases where our inability to get a cleanup lock signifies nothing at all about the page in question, or any page in the same table, with the same workload. 2. Pathological cases. Cases where we're at least at the mercy of the application to do something about an idle cursor, where the situation may be entirely hopeless on a long enough timeline. (Whether or not it actually happens in the end is less significant.) As far as I can tell, based on testing, category 1 cases are fixed by the patch series: while a small number of pages from tables in category 1 cannot be cleanup-locked during each VACUUM, even with the patch series, it happens at random, with no discernable pattern. The overall result is that our ability to advance relfrozenxid is really not impacted *over time*. It's reasonable to suppose that lightning will not strike in the same place twice -- and it would really have to strike several times to invalidate this assumption. It's not impossible, but the chances over time are infinitesimal -- and the aggregate effect over time (not any one VACUUM operation) is what matters. There are seldom more than 5 or so of these pages, even on large tables. What are the chances that some random not-yet-all-frozen block (that we cannot freeze tuples on) will also have the oldest couldn't-be-frozen XID, even once? And when it is the oldest, why should it be the oldest by very many XIDs? And what are the chances that the same page has the same problem, again and again, without that being due to some pathological workload thing? Admittedly you may see a blip from this -- you might notice that the final relfrozenxid value for that one single VACUUM isn't quite as new as you'd like. But then the next VACUUM should catch up with the stable long term average again. It's hard to describe exactly why this effect is robust, but as I said, empirically, in practice, it appears to be robust. That might not be good enough as an explanation that justifies committing the patch series, but that's what I see. And I think I will be able to nail it down. AFAICT that just leaves concern for cases in category 2. More on that below. > Even if you eventually get there, it may take > multiple days before you find a time when a table is immediately > available, whereas if you had just gone over there and stood in line, > you likely would have been seated in under an hour and savoring the > goodness of quality deep-dish pizza not too long thereafter. The same > principle applies here. I think that you're focussing on individual VACUUM operations, whereas I'm more concerned about the aggregate effect of a particular policy over time. Let's assume for a moment that the only thing that we really care about is reliably keeping relfrozenxid reasonably recent. 
Even then, waiting for a cleanup lock (to freeze some tuples) might be the wrong thing to do. Waiting in line means that we're not freezing other tuples (nobody else can either). So we're allowing ourselves to fall behind on necessary, routine maintenance work that allows us to advance relfrozenxid....in order to advance relfrozenxid. > I do think that waiting for a cleanup lock when the age of the page is > only vacuum_freeze_min_age seems like it might be too aggressive, but > I don't think that's how it works. AFAICS, it's based on whether the > vacuum is marked as aggressive, which has to do with > vacuum_freeze_table_age, not vacuum_freeze_min_age. Let's turn the > question around: if the age of the oldest XID on the page is >150 > million transactions and the buffer cleanup lock is not available now, > what makes you think that it's any more likely to be available when > the XID age reaches 200 million or 300 million or 700 million? This is my concern -- what I've called category 2 cases have this exact quality. So given that, why not freeze what you can, elsewhere, on other pages that don't have the same issue (presumably the vast vast majority in the table)? That way you have the best possible chance of recovering once the DBA gets a clue and fixes the issue. > There > is perhaps an argument for some kind of tunable that eventually shoots > the other session in the head (if we can identify it, anyway) but it > seems to me that regardless of what threshold we pick, polling is > strictly less likely to find a time when the page is available than > waiting for the cleanup lock. It has the counterbalancing advantage of > allowing the autovacuum worker to do other useful work in the meantime > and that is indeed a significant upside, but at some point you're > going to have to give up and admit that polling is a failed strategy, > and it's unclear why 150 million XIDs - or probably even 50 million > XIDs - isn't long enough to say that we're not getting the job done > with half measures. That's kind of what I meant. The difference between 50 million and 150 million is rather unclear indeed. So having accepted that that might be true, why not be open to the possibility that it won't turn out to be true in the long run, for any given table? With the enhancements from the patch series in place (particularly the early freezing stuff), what do we have to lose by making the FreezeLimit XID cutoff for freezing much higher than your typical vacuum_freeze_min_age? Maybe the same as autovacuum_freeze_max_age or vacuum_freeze_table_age (it can't be higher than that without also making these other settings become meaningless, of course). Taking a wait-and-see approach like this (not being too quick to decide that a table is in category 1 or category 2) doesn't seem to make wraparound failure any more likely in any particular scenario, but makes it less likely in other scenarios. It also gives us early visibility into the problem, because we'll see that autovacuum can no longer advance relfrozenxid (using the enhanced log output) where that's generally expected. -- Peter Geoghegan
On Thu, Jan 13, 2022 at 1:27 PM Peter Geoghegan <pg@bowt.ie> wrote: > Admittedly you may see a blip from this -- you might notice that the > final relfrozenxid value for that one single VACUUM isn't quite as new > as you'd like. But then the next VACUUM should catch up with the > stable long term average again. It's hard to describe exactly why this > effect is robust, but as I said, empirically, in practice, it appears > to be robust. That might not be good enough as an explanation that > justifies committing the patch series, but that's what I see. And I > think I will be able to nail it down. Attached is v6, which like v5 is a rebased version that I'm posting to keep CFTester happy. I pushed a commit that consolidates VACUUM VERBOSE and autovacuum logging earlier (commit 49c9d9fc), which bitrot v5. So no real changes, nothing to note. Although it technically has nothing to do with this patch series, I will point out that it's now a lot easier to debug using VACUUM VERBOSE, which will directly display information about how we've advanced relfrozenxid, tuples frozen, etc:

pg@regression:5432 =# delete from mytenk2 where hundred < 15;
DELETE 1500
pg@regression:5432 =# vacuum VERBOSE mytenk2;
INFO: vacuuming "regression.public.mytenk2"
INFO: finished vacuuming "regression.public.mytenk2": index scans: 1
pages: 0 removed, 345 remain, 0 skipped using visibility map (0.00% of total)
tuples: 1500 removed, 8500 remain (8500 newly frozen), 0 are dead but not yet removable
removable cutoff: 17411, which is 0 xids behind next
new relfrozenxid: 17411, which is 3 xids ahead of previous value
index scan needed: 341 pages from table (98.84% of total) had 1500 dead item identifiers removed
index "mytenk2_unique1_idx": pages: 39 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "mytenk2_unique2_idx": pages: 30 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "mytenk2_hundred_idx": pages: 11 in total, 1 newly deleted, 1 currently deleted, 0 reusable
I/O timings: read: 0.011 ms, write: 0.000 ms
avg read rate: 1.428 MB/s, avg write rate: 2.141 MB/s
buffer usage: 1133 hits, 2 misses, 3 dirtied
WAL usage: 1446 records, 1 full page images, 199702 bytes
system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
VACUUM

-- Peter Geoghegan
On Thu, Jan 13, 2022 at 4:27 PM Peter Geoghegan <pg@bowt.ie> wrote: > 1. Cases where our inability to get a cleanup lock signifies nothing > at all about the page in question, or any page in the same table, with > the same workload. > > 2. Pathological cases. Cases where we're at least at the mercy of the > application to do something about an idle cursor, where the situation > may be entirely hopeless on a long enough timeline. (Whether or not it > actually happens in the end is less significant.) Sure. I'm worrying about case (2). I agree that in case (1) waiting for the lock is almost always the wrong idea. > I think that you're focussing on individual VACUUM operations, whereas > I'm more concerned about the aggregate effect of a particular policy > over time. I don't think so. I think I'm worrying about the aggregate effect of a particular policy over time *in the pathological cases* i.e. (2). > This is my concern -- what I've called category 2 cases have this > exact quality. So given that, why not freeze what you can, elsewhere, > on other pages that don't have the same issue (presumably the vast > vast majority in the table)? That way you have the best possible > chance of recovering once the DBA gets a clue and fixes the issue. That's the part I'm not sure I believe. Imagine a table with a gigantic number of pages that are not yet all-visible, a small number of all-visible pages, and one page containing very old XIDs on which a cursor holds a pin. I don't think it's obvious that not waiting is best. Maybe you're going to end up vacuuming the table repeatedly and doing nothing useful. If you avoid vacuuming it repeatedly, you still have a lot of work to do once the DBA locates a clue. I think there's probably an important principle buried in here: the XID threshold that forces a vacuum had better also force waiting for pins. If it doesn't, you can tight-loop on that table without getting anything done. > That's kind of what I meant. The difference between 50 million and 150 > million is rather unclear indeed. So having accepted that that might > be true, why not be open to the possibility that it won't turn out to > be true in the long run, for any given table? With the enhancements > from the patch series in place (particularly the early freezing > stuff), what do we have to lose by making the FreezeLimit XID cutoff > for freezing much higher than your typical vacuum_freeze_min_age? > Maybe the same as autovacuum_freeze_max_age or vacuum_freeze_table_age > (it can't be higher than that without also making these other settings > become meaningless, of course). We should probably distinguish between the situation where (a) an adverse pin is held continuously and effectively forever and (b) adverse pins are held frequently but for short periods of time. I think it's possible to imagine a small, very hot table (or portion of a table) where very high concurrency means there are often pins. In case (a), it's not obvious that waiting will ever resolve anything, although it might prevent other problems like infinite looping. In case (b), a brief wait will do a lot of good. But maybe that doesn't even matter. I think part of your argument is that if we fail to update relfrozenxid for a while, that really isn't that bad. I think I agree, up to a point. One consequence of failing to immediately advance relfrozenxid might be that pg_clog and friends are bigger, but that's pretty minor. Another consequence might be that we might vacuum the table more times, which is more serious. 
I'm not really sure that can happen to a degree that is meaningful, apart from the infinite loop case already described, but I'm also not entirely sure that it can't. -- Robert Haas EDB: http://www.enterprisedb.com
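Robert's principle can be written down as a small predicate: whatever XID age is old enough to force the vacuum in the first place must also be old enough to make that vacuum wait for the pin, or else the table can be vacuumed over and over without the old XID ever being dealt with. Illustrative only, with invented names:

#include "postgres.h"
#include "access/transam.h"

/*
 * Illustration of the "no tight-looping" invariant: once a page's oldest
 * XID has crossed the same limit that forces (anti-wraparound) vacuuming
 * of the table, the vacuum must be willing to wait for the cleanup lock
 * rather than skip the page yet again.
 */
static bool
must_wait_for_cleanup_lock(TransactionId page_oldest_xid,
                           TransactionId next_xid, int force_vacuum_age)
{
    TransactionId force_limit = next_xid - (TransactionId) force_vacuum_age;

    if (!TransactionIdIsNormal(force_limit))
        force_limit = FirstNormalTransactionId;

    return TransactionIdPrecedes(page_oldest_xid, force_limit);
}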
On Mon, Jan 17, 2022 at 7:12 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jan 13, 2022 at 4:27 PM Peter Geoghegan <pg@bowt.ie> wrote: > > 1. Cases where our inability to get a cleanup lock signifies nothing > > at all about the page in question, or any page in the same table, with > > the same workload. > > > > 2. Pathological cases. Cases where we're at least at the mercy of the > > application to do something about an idle cursor, where the situation > > may be entirely hopeless on a long enough timeline. (Whether or not it > > actually happens in the end is less significant.) > > Sure. I'm worrying about case (2). I agree that in case (1) waiting > for the lock is almost always the wrong idea. I don't doubt that we'd each have little difficulty determining which category (1 or 2) a given real world case should be placed in, using a variety of methods that put the issue in context (e.g., looking at the application code, talking to the developers or the DBA). Of course, it doesn't follow that it would be easy to teach vacuumlazy.c how to determine which category the same "can't get cleanup lock" falls under, since (just for starters) there is no practical way for VACUUM to see all that context. That's what I'm effectively trying to work around with this "wait and see approach" that demotes FreezeLimit to a backstop (and so justifies removing the vacuum_freeze_min_age GUC that directly dictates our FreezeLimit today). The cure may be worse than the disease, and the cure isn't actually all that great at the best of times, so we should wait until the disease visibly gets pretty bad before being "interventionist" by waiting for a cleanup lock. I've already said plenty about why I don't like vacuum_freeze_min_age (or FreezeLimit) due to XIDs being fundamentally the wrong unit. But that's not the only fundamental problem that I see. The other problem is this: vacuum_freeze_min_age also dictates when an aggressive VACUUM will start to wait for a cleanup lock. But why should the first thing be the same as the second thing? I see absolutely no reason for it. (Hence the idea of making FreezeLimit a backstop, and getting rid of the GUC itself.) > > This is my concern -- what I've called category 2 cases have this > > exact quality. So given that, why not freeze what you can, elsewhere, > > on other pages that don't have the same issue (presumably the vast > > vast majority in the table)? That way you have the best possible > > chance of recovering once the DBA gets a clue and fixes the issue. > > That's the part I'm not sure I believe. To be clear, I think that I have yet to adequately demonstrate that this is true. It's a bit tricky to do so -- absence of evidence isn't evidence of absence. I think that your principled skepticism makes sense right now. Fortunately the early refactoring patches should be uncontroversial. The controversial parts are all in the last patch in the patch series, which isn't too much code. (Plus another patch to at least get rid of vacuum_freeze_min_age, and maybe vacuum_freeze_table_age too, that hasn't been written just yet.) > Imagine a table with a > gigantic number of pages that are not yet all-visible, a small number > of all-visible pages, and one page containing very old XIDs on which a > cursor holds a pin. I don't think it's obvious that not waiting is > best. Maybe you're going to end up vacuuming the table repeatedly and > doing nothing useful. If you avoid vacuuming it repeatedly, you still > have a lot of work to do once the DBA locates a clue. 
Maybe this is a simpler way of putting it: I want to delay waiting on a pin until it's pretty clear that we truly have a pathological case, which should in practice be limited to an anti-wraparound VACUUM, which will now be naturally rare -- most individual tables will literally never have even one anti-wraparound VACUUM. We don't need to reason about the vacuuming schedule this way, since anti-wraparound VACUUMs are driven by age(relfrozenxid) -- we don't really have to predict anything. Maybe we'll need to do an anti-wraparound VACUUM immediately after a non-aggressive autovacuum runs, without getting a cleanup lock (due to an idle cursor pathological case). We won't be able to advance relfrozenxid until the anti-wraparound VACUUM runs (at the earliest) in this scenario, but it makes no difference. Rather than predicting the future, we're covering every possible outcome (at least to the extent that that's possible). > I think there's probably an important principle buried in here: the > XID threshold that forces a vacuum had better also force waiting for > pins. If it doesn't, you can tight-loop on that table without getting > anything done. I absolutely agree -- that's why I think that we still need FreezeLimit. Just as a backstop, that in practice very rarely influences our behavior. Probably just in those remaining cases that are never vacuumed except for the occasional anti-wraparound VACUUM (even then it might not be very important). > We should probably distinguish between the situation where (a) an > adverse pin is held continuously and effectively forever and (b) > adverse pins are held frequently but for short periods of time. I agree. It's just hard to do that from vacuumlazy.c, during a routine non-aggressive VACUUM operation. > I think it's possible to imagine a small, very hot table (or portion of > a table) where very high concurrency means there are often pins. In > case (a), it's not obvious that waiting will ever resolve anything, > although it might prevent other problems like infinite looping. In > case (b), a brief wait will do a lot of good. But maybe that doesn't > even matter. I think part of your argument is that if we fail to > update relfrozenxid for a while, that really isn't that bad. Yeah, that is a part of it -- it doesn't matter (until it really matters), and we should be careful to avoid making the situation worse by waiting for a cleanup lock unnecessarily. That's actually a very drastic thing to do, at least in a world where freezing has been decoupled from advancing relfrozenxid. Updating relfrozenxid should now be thought of as a continuous thing, not a discrete thing. And so it's highly unlikely that any given VACUUM will ever *completely* fail to advance relfrozenxid -- that fact alone signals a pathological case (things that are supposed to be continuous should not ever appear to be discrete). But you need multiple VACUUMs to see this "signal". It is only revealed over time. It seems wise to make the most modest possible assumptions about what's going on here. We might well "get lucky" before the next VACUUM comes around when we encounter what at first appears to be a problematic case involving an idle cursor -- for all kinds of reasons. Like maybe an opportunistic prune gets rid of the old XID for us, without any freezing, during some brief window where the application doesn't have a cursor. We're only talking about one or two heap pages here. We might also *not* "get lucky" with the application and its use of idle cursors, of course. 
But in that case we must have been doomed all along. And we'll at least have put things on a much better footing in this disaster scenario -- there is relatively little freezing left to do in single-user mode, and relfrozenxid should already be the same as the exact oldest XID in that one page. > I think I agree, up to a point. One consequence of failing to > immediately advance relfrozenxid might be that pg_clog and friends are > bigger, but that's pretty minor. My arguments are probabilistic (sort of), which makes it tricky. Actual test cases/benchmarks should bear out the claims that I've made. If anything fully convinces you, it'll be that, I think. > Another consequence might be that we > might vacuum the table more times, which is more serious. I'm not > really sure that can happen to a degree that is meaningful, apart from > the infinite loop case already described, but I'm also not entirely > sure that it can't. It's definitely true that this overall strategy could result in there being more individual VACUUM operations. But that naturally follows from teaching VACUUM to avoid waiting indefinitely. Obviously the important question is whether we'll do meaningfully more work for less benefit (in Postgres 15, relative to Postgres 14). Your concern is very reasonable. I just can't imagine how we could lose out to any notable degree. Which is a start. -- Peter Geoghegan
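The "continuous" advancement being described is easiest to picture as a running minimum carried across the whole scan. Something in this spirit (the names are placeholders, not taken from the patch):

#include "postgres.h"
#include "access/transam.h"

/*
 * Sketch: the VACUUM starts with an optimistic candidate relfrozenxid
 * (e.g. OldestXmin) and ratchets it down to the oldest XID it had to
 * leave behind, including XIDs on pages it could not cleanup-lock.  The
 * final candidate is therefore the exact oldest remaining XID, even in
 * the idle-cursor case.
 */
static void
note_unfrozen_xid(TransactionId unfrozen_xid,
                  TransactionId *candidate_relfrozenxid)
{
    if (TransactionIdIsNormal(unfrozen_xid) &&
        TransactionIdPrecedes(unfrozen_xid, *candidate_relfrozenxid))
        *candidate_relfrozenxid = unfrozen_xid;
}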
On Mon, Jan 17, 2022 at 4:28 PM Peter Geoghegan <pg@bowt.ie> wrote: > Updating relfrozenxid should now be thought of as a continuous thing, > not a discrete thing. I think that's pretty nearly 100% wrong. The most simplistic way of expressing that is to say - clearly it can only happen when VACUUM runs, which is not all the time. That's a bit facile, though; let me try to say something a little smarter. There are real production systems that exist today where essentially all vacuums are anti-wraparound vacuums. And there are also real production systems that exist today where virtually none of the vacuums are anti-wraparound vacuums. So if we ship your proposed patches, the frequency with which relfrozenxid gets updated is going to increase by a large multiple, perhaps 100x, for the second group of people, who will then perceive the movement of relfrozenxid to be much closer to continuous than it is today even though, technically, it's still a step function. But the people in the first category are not going to see any difference at all. And therefore the reasoning that says - anti-wraparound vacuums just aren't going to happen any more - or - relfrozenxid will advance continuously seems like dangerous wishful thinking to me. It's only true if (# of vacuums) / (# of wraparound vacuums) >> 1. And that need not be true in any particular environment, which to me means that all conclusions based on the idea that it has to be true are pretty dubious. There's no doubt in my mind that advancing relfrozenxid opportunistically is a good idea. However, I'm not sure how reasonable it is to change any other behavior on the basis of the fact that we're doing it, because we don't know how often it really happens. If someone says "every time I travel to Europe on business, I will use the opportunity to bring you back a nice present," you can't evaluate how much impact that will have on your life without knowing how often they travel to Europe on business. And that varies radically from "never" to "a lot" based on the person. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Jan 17, 2022 at 2:13 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Jan 17, 2022 at 4:28 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Updating relfrozenxid should now be thought of as a continuous thing, > > not a discrete thing. > > I think that's pretty nearly 100% wrong. The most simplistic way of > expressing that is to say - clearly it can only happen when VACUUM > runs, which is not all the time. That just seems like semantics to me. The very next sentence after the one you quoted in your reply was "And so it's highly unlikely that any given VACUUM will ever *completely* fail to advance relfrozenxid". It's continuous *within* each VACUUM. As far as I can tell there is pretty much no way that the patch series will ever fail to advance relfrozenxid *by at least a little bit*, barring pathological cases with cursors and whatnot. > That's a bit facile, though; let me > try to say something a little smarter. There are real production > systems that exist today where essentially all vacuums are > anti-wraparound vacuums. And there are also real production systems > that exist today where virtually none of the vacuums are > anti-wraparound vacuums. So if we ship your proposed patches, the > frequency with which relfrozenxid gets updated is going to increase by > a large multiple, perhaps 100x, for the second group of people, who > will then perceive the movement of relfrozenxid to be much closer to > continuous than it is today even though, technically, it's still a > step function. But the people in the first category are not going to > see any difference at all. Actually, I think that even the people in the first category might well have about the same improved experience. Not just because of this patch series, mind you. It would also have a lot to do with the autovacuum_vacuum_insert_scale_factor stuff in Postgres 13. Not to mention the freeze map. What version are these users on? I have actually seen this for myself. With BenchmarkSQL, the largest table (the order lines table) starts out having its autovacuums driven entirely by autovacuum_vacuum_insert_scale_factor, even though there is a fair amount of bloat from updates. It stays like that for hours on HEAD. But even with my reasonably tuned setup, there is eventually a switchover point. Eventually all autovacuums end up as aggressive anti-wraparound VACUUMs -- this happens once the table gets sufficiently large (this is one of the two that is append-only, with one update to every inserted row from the delivery transaction, which happens hours after the initial insert). With the patch series, we have a kind of virtuous circle with freezing and with advancing relfrozenxid with the same order lines table. As far as I can tell, we fix the problem with the patch series. Because there are about 10 tuples inserted per new order transaction, the actual "XID consumption rate of the table" is much lower than the "worst case XID consumption" for such a table. It's also true that even with the patch we still get anti-wraparound VACUUMs for two fixed-size, hot-update-only tables: the stock table, and the customers table. But that's no big deal. It only happens because nothing else will ever trigger an autovacuum, no matter the autovacuum_freeze_max_age setting. > And therefore the reasoning that says - anti-wraparound vacuums just > aren't going to happen any more - or - relfrozenxid will advance > continuously seems like dangerous wishful thinking to me. I never said that anti-wraparound vacuums just won't happen anymore. 
I said that they'll be limited to cases like the stock table or customers table. I was very clear on that point. With pgbench, whether or not you ever see any anti-wraparound VACUUMs will depend on how heap fillfactor is set for the accounts table -- set it low enough (maybe to 90) and you will still get them, since there won't be any other reason to VACUUM. As for the branches table, and the tellers table, they'll get VACUUMs in any case, regardless of heap fillfactor. And so they'll always advance relfrozenxid during each VACUUM, and never have even one anti-wraparound VACUUM. > It's only > true if (# of vacuums) / (# of wraparound vacuums) >> 1. And that need > not be true in any particular environment, which to me means that all > conclusions based on the idea that it has to be true are pretty > dubious. There's no doubt in my mind that advancing relfrozenxid > opportunistically is a good idea. However, I'm not sure how reasonable > it is to change any other behavior on the basis of the fact that we're > doing it, because we don't know how often it really happens. It isn't that hard to see that the cases where we continue to get any anti-wraparound VACUUMs with the patch seem to be limited to cases like the stock/customers table, or cases like the pathological idle cursor cases we've been discussing. Pretty narrow cases, overall. Don't take my word for it - see for yourself. -- Peter Geoghegan
On Mon, Jan 17, 2022 at 5:41 PM Peter Geoghegan <pg@bowt.ie> wrote: > That just seems like semantics to me. The very next sentence after the > one you quoted in your reply was "And so it's highly unlikely that any > given VACUUM will ever *completely* fail to advance relfrozenxid". > It's continuous *within* each VACUUM. As far as I can tell there is > pretty much no way that the patch series will ever fail to advance > relfrozenxid *by at least a little bit*, barring pathological cases > with cursors and whatnot. I mean this boils down to saying that VACUUM will advance relfrozenxid except when it doesn't. > Actually, I think that even the people in the first category might > well have about the same improved experience. Not just because of this > patch series, mind you. It would also have a lot to do with the > autovacuum_vacuum_insert_scale_factor stuff in Postgres 13. Not to > mention the freeze map. What version are these users on? I think it varies. I expect the increase in the default cost limit to have had a much more salutary effect than autovacuum_vacuum_insert_scale_factor, but I don't know for sure. At any rate, if you make the database big enough and generate dirty data fast enough, it doesn't matter what the default limits are. > I never said that anti-wraparound vacuums just won't happen anymore. I > said that they'll be limited to cases like the stock table or > customers table case. I was very clear on that point. I don't know how I'm supposed to sensibly respond to a statement like this. If you were very clear, then I'm being deliberately obtuse if I fail to understand. If I say you weren't very clear, then we're just contradicting each other. > It isn't that hard to see that the cases where we continue to get any > anti-wraparound VACUUMs with the patch seem to be limited to cases > like the stock/customers table, or cases like the pathological idle > cursor cases we've been discussing. Pretty narrow cases, overall. > Don't take my word for it - see for yourself. I don't think that's really possible. Words like "narrow" and "pathological" are value judgments, not factual statements. If I do an experiment where no wraparound autovacuums happen, as I'm sure I can, then those are the normal cases where the patch helps. If I do an experiment where they do happen, as I'm sure that I also can, you'll probably say either that the case in question is like the stock/customers table, or that it's pathological. What will any of this prove? I think we're reaching the point of diminishing returns in this conversation. What I want to know is that users aren't going to be harmed - even in cases where they have behavior that is like the stock/customers table, or that you consider pathological, or whatever other words we want to use to describe the weird things that happen to people. And I think we've made perhaps a bit of modest progress in exploring that issue, but certainly less than I'd like. I don't want to spend the next several days going around in circles about it though. That does not seem likely to make anyone happy. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Jan 17, 2022 at 8:13 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Jan 17, 2022 at 5:41 PM Peter Geoghegan <pg@bowt.ie> wrote: > > That just seems like semantics to me. The very next sentence after the > > one you quoted in your reply was "And so it's highly unlikely that any > > given VACUUM will ever *completely* fail to advance relfrozenxid". > > It's continuous *within* each VACUUM. As far as I can tell there is > > pretty much no way that the patch series will ever fail to advance > > relfrozenxid *by at least a little bit*, barring pathological cases > > with cursors and whatnot. > > I mean this boils down to saying that VACUUM will advance relfrozenxid > except when it doesn't. It actually doesn't boil down, at all. The world is complicated and messy, whether we like it or not. > > I never said that anti-wraparound vacuums just won't happen anymore. I > > said that they'll be limited to cases like the stock table or > > customers table case. I was very clear on that point. > > I don't know how I'm supposed to sensibly respond to a statement like > this. If you were very clear, then I'm being deliberately obtuse if I > fail to understand. I don't know if I'd accuse you of being obtuse, exactly. Mostly I just think it's strange that you don't seem to take what I say seriously when it cannot be proven very easily. I don't think that you intend this to be disrespectful, and I don't take it personally. I just don't understand it. > > It isn't that hard to see that the cases where we continue to get any > > anti-wraparound VACUUMs with the patch seem to be limited to cases > > like the stock/customers table, or cases like the pathological idle > > cursor cases we've been discussing. Pretty narrow cases, overall. > > Don't take my word for it - see for yourself. > > I don't think that's really possible. Words like "narrow" and > "pathological" are value judgments, not factual statements. If I do an > experiment where no wraparound autovacuums happen, as I'm sure I can, > then those are the normal cases where the patch helps. If I do an > experiment where they do happen, as I'm sure that I also can, you'll > probably say either that the case in question is like the > stock/customers table, or that it's pathological. What will any of > this prove? You seem to be suggesting that I used words like "pathological" in some kind of highly informal, totally subjective way, when I did no such thing. I quite clearly said that you'll only get an anti-wraparound VACUUM with the patch applied when the only factor that *ever* causes *any* autovacuum worker to VACUUM the table (assuming the workload is stable) is the anti-wraparound/autovacuum_freeze_max_age cutoff. With a table like this, even increasing autovacuum_freeze_max_age to its absolute maximum of 2 billion would not make it any more likely that we'd get a non-aggressive VACUUM -- it would merely make the anti-wraparound VACUUMs less frequent. No big change should be expected with a table like that. Also, since the patch is not magic, and doesn't even change the basic invariants for relfrozenxid, it's still true that any scenario in which it's fundamentally impossible for VACUUM to keep up will also have anti-wraparound VACUUMs. But that's the least of the user's trouble -- in the long run we're going to have the system refuse to allocate new XIDs with such a workload. The claim that I have made is 100% testable. Even if it was flat out incorrect, not getting anti-wraparound VACUUMs per se is not the important part. 
The important part is that the work is managed intelligently, and the burden is spread out over time. I am particularly concerned about the "freezing cliff" we get when many pages are all-visible but not also all-frozen. Consistently avoiding an anti-wraparound VACUUM (except with very particular workload characteristics) is really just a side effect -- it's something that makes the overall benefit relatively obvious, and relatively easy to measure. I thought that you'd appreciate that. -- Peter Geoghegan
On Tue, Jan 18, 2022 at 12:14 AM Peter Geoghegan <pg@bowt.ie> wrote: > I quite clearly said that you'll only get an anti-wraparound VACUUM > with the patch applied when the only factor that *ever* causes *any* > autovacuum worker to VACUUM the table (assuming the workload is > stable) is the anti-wraparound/autovacuum_freeze_max_age cutoff. With > a table like this, even increasing autovacuum_freeze_max_age to its > absolute maximum of 2 billion would not make it any more likely that > we'd get a non-aggressive VACUUM -- it would merely make the > anti-wraparound VACUUMs less frequent. No big change should be > expected with a table like that. Sure, I don't disagree with any of that. I don't see how I could. But I don't see how it detracts from the points I was trying to make either. > Also, since the patch is not magic, and doesn't even change the basic > invariants for relfrozenxid, it's still true that any scenario in > which it's fundamentally impossible for VACUUM to keep up will also > have anti-wraparound VACUUMs. But that's the least of the user's > trouble -- in the long run we're going to have the system refuse to > allocate new XIDs with such a workload. Also true. But again, it's just about making sure that the patch doesn't make other decisions that make things worse for people in that situation. That's what I was expressing uncertainty about. > The claim that I have made is 100% testable. Even if it was flat out > incorrect, not getting anti-wraparound VACUUMs per se is not the > important part. The important part is that the work is managed > intelligently, and the burden is spread out over time. I am > particularly concerned about the "freezing cliff" we get when many > pages are all-visible but not also all-frozen. Consistently avoiding > an anti-wraparound VACUUM (except with very particular workload > characteristics) is really just a side effect -- it's something that > makes the overall benefit relatively obvious, and relatively easy to > measure. I thought that you'd appreciate that. I do. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jan 18, 2022 at 6:11 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jan 18, 2022 at 12:14 AM Peter Geoghegan <pg@bowt.ie> wrote: > > I quite clearly said that you'll only get an anti-wraparound VACUUM > > with the patch applied when the only factor that *ever* causes *any* > > autovacuum worker to VACUUM the table (assuming the workload is > > stable) is the anti-wraparound/autovacuum_freeze_max_age cutoff. With > > a table like this, even increasing autovacuum_freeze_max_age to its > > absolute maximum of 2 billion would not make it any more likely that > > we'd get a non-aggressive VACUUM -- it would merely make the > > anti-wraparound VACUUMs less frequent. No big change should be > > expected with a table like that. > > Sure, I don't disagree with any of that. I don't see how I could. But > I don't see how it detracts from the points I was trying to make > either. You said "...the reasoning that says - anti-wraparound vacuums just aren't going to happen any more - or - relfrozenxid will advance continuously seems like dangerous wishful thinking to me". You then proceeded to attack a straw man -- a view that I couldn't possibly hold. This certainly surprised me, because my actual claims seemed well within the bounds of what is possible, and in any case can be verified with a fairly modest effort. That's what I was reacting to -- it had nothing to do with any concerns you may have had. I wasn't thinking about long-idle cursors at all. I was defending myself, because I was put in a position where I had to defend myself. > > Also, since the patch is not magic, and doesn't even change the basic > > invariants for relfrozenxid, it's still true that any scenario in > > which it's fundamentally impossible for VACUUM to keep up will also > > have anti-wraparound VACUUMs. But that's the least of the user's > > trouble -- in the long run we're going to have the system refuse to > > allocate new XIDs with such a workload. > > Also true. But again, it's just about making sure that the patch > doesn't make other decisions that make things worse for people in that > situation. That's what I was expressing uncertainty about. I am not just trying to avoid making things worse when users are in this situation. I actually want to give users every chance to avoid being in this situation in the first place. In fact, almost everything I've said about this aspect of things was about improving things for these users. It was not about covering myself -- not at all. It would be easy for me to throw up my hands, and change nothing here (keep the behavior that makes FreezeLimit derived from the vacuum_freeze_min_age GUC), since it's all incidental to the main goals of this patch series. I still don't understand why you think that my idea (not yet implemented) of making FreezeLimit into a backstop (making it autovacuum_freeze_max_age/2 or something) and relying on the new "early freezing" criteria for almost everything is going to make the situation worse in this scenario with long idle cursors. It's intended to make it better. Why do you think that the current vacuum_freeze_min_age-based FreezeLimit isn't actually the main problem in these scenarios? I think that the way that that works right now (in particular during aggressive VACUUMs) is just an accident of history. It's all path dependence -- each incremental step may have made sense, but what we have now doesn't seem to. Waiting for a cleanup lock might feel like the diligent thing to do, but that doesn't make it so.
My sense is that there are very few apps that are hopelessly incapable of advancing relfrozenxid from day one. I find it much easier to believe that users that had this experience got away with it for a very long time, until their luck ran out, somehow. I would like to minimize the chance of that ever happening, to the extent that that's possible within the confines of the basic heapam/vacuumlazy.c invariants. -- Peter Geoghegan
On Tue, Jan 18, 2022 at 1:48 PM Peter Geoghegan <pg@bowt.ie> wrote: > That's what I was reacting to -- it had nothing to do with any > concerns you may have had. I wasn't thinking about long-idle cursors > at all. I was defending myself, because I was put in a position where > I had to defend myself. I don't think I've said anything on this thread that is an attack on you. I am getting pretty frustrated with the tenor of the discussion, though. I feel like you're the one attacking me, and I don't like it. > I still don't understand why you think that my idea (not yet > implemented) of making FreezeLimit into a backstop (making it > autovacuum_freeze_max_age/2 or something) and relying on the new > "early freezing" criteria for almost everything is going to make the > situation worse in this scenario with long idle cursors. It's intended > to make it better. I just don't understand how I haven't been able to convey my concern here by now. I've already written multiple emails about it. If none of them were clear enough for you to understand, I'm not sure how saying the same thing over again can help. When I say I've already written about this, I'm referring specifically to the following: - https://postgr.es/m/CA+TgmobKJm9BsZR3ETeb6MJdLKWxKK5ZXx0XhLf-W9kUgvOcNA@mail.gmail.com in the second-to-last paragraph, beginning with "I don't really see" - https://www.postgresql.org/message-id/CA%2BTgmoaGoZ2wX6T4sj0eL5YAOQKW3tS8ViMuN%2BtcqWJqFPKFaA%40mail.gmail.com in the second paragraph beginning with "Because waiting on a lock" - https://www.postgresql.org/message-id/CA%2BTgmoZYri_LUp4od_aea%3DA8RtjC%2B-Z1YmTc7ABzTf%2BtRD2Opw%40mail.gmail.com in the paragraph beginning with "That's the part I'm not sure I believe." For all of that, I'm not even convinced that you're wrong. I just think you might be wrong. I don't really know. It seems to me however that you're understating the value of waiting, which I've tried to explain in the above places. Waiting does have the very real disadvantage of starving the rest of the system of the work that autovacuum worker would have been doing, and that's why I think you might be right. However, there are cases where waiting, and only waiting, gets the job done. If you're not willing to admit that those cases exist, or you think they don't matter, then we disagree. If you admit that they exist and think they matter but believe that there's some reason why increasing FreezeLimit can't cause any damage, then either (a) you have a good reason for that belief which I have thus far been unable to understand or (b) you're more optimistic about the proposed change than can be entirely justified. > My sense is that there are very few apps that are hopelessly incapable > of advancing relfrozenxid from day one. I find it much easier to > believe that users that had this experience got away with it for a > very long time, until their luck ran out, somehow. I would like to > minimize the chance of that ever happening, to the extent that that's > possible within the confines of the basic heapam/vacuumlazy.c > invariants. I agree with the idea that most people are OK at the beginning and then at some point their luck runs out and catastrophe strikes. I think there are a couple of different kinds of catastrophe that can happen. For instance, somebody could park a cursor in the middle of a table someplace and leave it there until the snow melts. Or, somebody could take a table lock and sit on it forever. 
Or, there could be a corrupted page in the table that causes VACUUM to error out every time it's reached. In the second and third situations, it doesn't matter a bit what we do with FreezeLimit, but in the first one it might. If the user is going to leave that cursor sitting there literally forever, the best solution is to raise FreezeLimit as high as we possibly can. The system is bound to shut down due to wraparound at some point, but we at least might as well vacuum other stuff while we're waiting for that to happen. On the other hand if that user is going to close that cursor after 10 minutes and open a new one in the same place 10 seconds later, the best thing to do is to keep FreezeLimit as low as possible, because the first time we wait for the pin to be released we're guaranteed to advance relfrozenxid within 10 minutes, whereas if we don't do that we may keep missing the brief windows in which no cursor is held for a very long time. But we have absolutely no way of knowing which of those things is going to happen on any particular system, or of estimating which one is more common in general. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Jan 19, 2022 at 6:56 AM Robert Haas <robertmhaas@gmail.com> wrote: > I don't think I've said anything on this thread that is an attack on > you. I am getting pretty frustrated with the tenor of the discussion, > though. I feel like you're the one attacking me, and I don't like it. "Attack" is a strong word (much stronger than "defend"), and I don't think I'd use it to describe anything that has happened on this thread. All I said was that you misrepresented my views when you pounced on my use of the word "continuous". Which, honestly, I was very surprised by. > For all of that, I'm not even convinced that you're wrong. I just > think you might be wrong. I don't really know. I agree that I might be wrong, though of course I think that I'm probably correct. I value your input as a critical voice -- that's generally how we get really good designs. > However, there are cases where waiting, and only > waiting, gets the job done. If you're not willing to admit that those > cases exist, or you think they don't matter, then we disagree. They exist, of course. That's why I don't want to completely eliminate the idea of waiting for a cleanup lock. Rather, I want to change the design to recognize that that's an extreme measure, that should be delayed for as long as possible. There are many ways that the problem could naturally resolve itself. Waiting for a cleanup lock after only 50 million XIDs (the vacuum_freeze_min_age default) is like performing brain surgery to treat somebody with a headache (at least with the infrastructure from the earlier patches in place). It's not impossible that "surgery" could help, in theory (could be a tumor, better to catch these things early!), but that fact alone can hardly justify such a drastic measure. That doesn't mean that brain surgery isn't ever appropriate, of course. It should be delayed until it starts to become obvious that it's really necessary (but before it really is too late). > If you > admit that they exist and think they matter but believe that there's > some reason why increasing FreezeLimit can't cause any damage, then > either (a) you have a good reason for that belief which I have thus > far been unable to understand or (b) you're more optimistic about the > proposed change than can be entirely justified. I don't deny that it's just about possible that the changes that I'm thinking of could make the situation worse in some cases, but I think that the overwhelming likelihood is that things will be improved across the board. Consider the age of the tables from BenchmarkSQL, with the patch series:

     relname      │     age     │ mxid_age
──────────────────┼─────────────┼──────────
 bmsql_district   │         657 │        0
 bmsql_warehouse  │         696 │        0
 bmsql_item       │   1,371,978 │        0
 bmsql_config     │   1,372,061 │        0
 bmsql_new_order  │   3,754,163 │        0
 bmsql_history    │  11,545,940 │        0
 bmsql_order_line │  23,095,678 │        0
 bmsql_oorder     │  40,653,743 │        0
 bmsql_customer   │  51,371,610 │        0
 bmsql_stock      │  51,371,610 │        0
(10 rows)

We see significant "natural variation" here, unlike HEAD, where the age of all tables is exactly the same at all times, or close to it (incidentally, this leads to the largest tables all being anti-wraparound VACUUMed at the same time). There is a kind of natural ebb and flow for each table over time, as relfrozenxid is advanced, due in part to workload characteristics.
Less than half of all XIDs will ever modify the two largest tables, for example, and so autovacuum should probably never be launched because of the age of either table (barring some change in workload conditions, perhaps). As I've said a few times now, XIDs are generally "the wrong unit", except when needed as a backstop against wraparound failure. The natural variation that I see contributes to my optimism. A situation where we cannot get a cleanup lock may well resolve itself, for many reasons, that are hard to precisely nail down but are nevertheless very real. The vacuum_freeze_min_age design (particularly within an aggressive VACUUM) is needlessly rigid, probably just because the assumption before now has always been that we can only advance relfrozenxid in an aggressive VACUUM (it might happen in a non-aggressive VACUUM if we get very lucky, which cannot be accounted for). Because it is rigid, it is brittle. Because it is brittle, it will (on a long enough timeline, for a susceptible workload) actually break. > On the other hand if that user is going to close that > cursor after 10 minutes and open a new one in the same place 10 > seconds later, the best thing to do is to keep FreezeLimit as low as > possible, because the first time we wait for the pin to be released > we're guaranteed to advance relfrozenxid within 10 minutes, whereas if > we don't do that we may keep missing the brief windows in which no > cursor is held for a very long time. But we have absolutely no way of > knowing which of those things is going to happen on any particular > system, or of estimating which one is more common in general. I agree with all that, and I think that this particular scenario is the crux of the issue. The first time this happens (and we don't get a cleanup lock), then we will at least be able to set relfrozenxid to the exact oldest unfrozen XID. So that'll already have bought us some wallclock time -- often a great deal (why should the oldest XID on such a page be particularly old?). Furthermore, there will often be many more VACUUMs before we need to do an aggressive VACUUM -- each of these VACUUM operations is an opportunity to freeze the oldest tuple that holds up cleanup. Or maybe this XID is in a dead tuple, and so somebody's opportunistic pruning operation does the right thing for us. Never underestimate the power of dumb luck, especially in a situation where there are many individual "trials", and we only have to get lucky once. If and when that doesn't work out, and we actually have to do an anti-wraparound VACUUM, then something will have to give. Since anti-wraparound VACUUMs are naturally confined to certain kinds of tables/workloads with the patch series, we can now be pretty confident that the problem really is with this one problematic heap page, with the idle cursor. We could even verify this directly if we wanted to, by noticing that the preexisting relfrozenxid is an exact match for one XID on some can't-cleanup-lock page -- we could emit a WARNING about the page/tuple if we wanted to. To return to my colorful analogy from earlier, we now know that the patient almost certainly has a brain tumor. What new risk is implied by delaying the wait like this? Very little, I believe. Let's say we derive FreezeLimit from autovacuum_freeze_max_age/2 (instead of vacuum_freeze_min_age). We still ought to have the opportunity to wait for the cleanup lock for rather a long time -- if the XID consumption rate is so high that that isn't true, then we're doomed anyway.
All told, there seems to be a huge net reduction in risk with this design. -- Peter Geoghegan
On Wed, Jan 19, 2022 at 2:54 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On the other hand if that user is going to close that > > cursor after 10 minutes and open a new one in the same place 10 > > seconds later, the best thing to do is to keep FreezeLimit as low as > > possible, because the first time we wait for the pin to be released > > we're guaranteed to advance relfrozenxid within 10 minutes, whereas if > > we don't do that we may keep missing the brief windows in which no > > cursor is held for a very long time. But we have absolutely no way of > > knowing which of those things is going to happen on any particular > > system, or of estimating which one is more common in general. > > I agree with all that, and I think that this particular scenario is > the crux of the issue. Great, I'm glad we agree on that much. I would be interested in hearing what other people think about this scenario. > The first time this happens (and we don't get a cleanup lock), then we > will at least be able to set relfrozenxid to the exact oldest unfrozen > XID. So that'll already have bought us some wallclock time -- often a > great deal (why should the oldest XID on such a page be particularly > old?). Furthermore, there will often be many more VACUUMs before we > need to do an aggressive VACUUM -- each of these VACUUM operations is > an opportunity to freeze the oldest tuple that holds up cleanup. Or > maybe this XID is in a dead tuple, and so somebody's opportunistic > pruning operation does the right thing for us. Never underestimate the > power of dumb luck, especially in a situation where there are many > individual "trials", and we only have to get lucky once. > > If and when that doesn't work out, and we actually have to do an > anti-wraparound VACUUM, then something will have to give. Since > anti-wraparound VACUUMs are naturally confined to certain kinds of > tables/workloads with the patch series, we can now be pretty confident > that the problem really is with this one problematic heap page, with > the idle cursor. We could even verify this directly if we wanted to, > by noticing that the preexisting relfrozenxid is an exact match for > one XID on some can't-cleanup-lock page -- we could emit a WARNING > about the page/tuple if we wanted to. To return to my colorful analogy > from earlier, we now know that the patient almost certainly has a > brain tumor. > > What new risk is implied by delaying the wait like this? Very little, > I believe. Lets say we derive FreezeLimit from > autovacuum_freeze_max_age/2 (instead of vacuum_freeze_min_age). We > still ought to have the opportunity to wait for the cleanup lock for > rather a long time -- if the XID consumption rate is so high that that > isn't true, then we're doomed anyway. All told, there seems to be a > huge net reduction in risk with this design. I'm just being honest here when I say that I can't see any huge reduction in risk. Nor a huge increase in risk. It just seems speculative to me. If I knew something about the system or the workload, then I could say what would likely work out best on that system, but in the abstract I neither know nor understand how it's possible to know. My gut feeling is that it's going to make very little difference either way. People who never release their cursors or locks or whatever are going to be sad either way, and people who usually do will be happy either way. There's some in-between category of people who release sometimes but not too often for whom it may matter, possibly quite a lot. 
It also seems possible that one decision rather than another will make the happy people MORE happy, or the sad people MORE sad. For most people, though, I think it's going to be irrelevant. The fact that you seem to view the situation quite differently is a big part of what worries me here. At least one of us is missing something. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 20, 2022 at 6:55 AM Robert Haas <robertmhaas@gmail.com> wrote: > Great, I'm glad we agree on that much. I would be interested in > hearing what other people think about this scenario. Agreed. > I'm just being honest here when I say that I can't see any huge > reduction in risk. Nor a huge increase in risk. It just seems > speculative to me. If I knew something about the system or the > workload, then I could say what would likely work out best on that > system, but in the abstract I neither know nor understand how it's > possible to know. I think that it's very hard to predict the timeline with a scenario like this -- no question. But I often imagine idealized scenarios like the one you brought up with cursors, with the intention of lowering the overall exposure to problems to the extent that that's possible; if it was obvious, we'd have fixed it by now already. I cannot think of any reason why making FreezeLimit into what I've been calling a backstop introduces any new risk, but I can think of ways in which it avoids risk. We shouldn't be waiting indefinitely for something totally outside our control or understanding, and so blocking all freezing and other maintenance on the table, until it's provably necessary. More fundamentally, freezing should be thought of as an overhead of storing tuples in heap blocks, as opposed to an overhead of transactions (that allocate XIDs). Meaning that FreezeLimit becomes almost an emergency thing, closely associated with aggressive anti-wraparound VACUUMs. > My gut feeling is that it's going to make very little difference > either way. People who never release their cursors or locks or > whatever are going to be sad either way, and people who usually do > will be happy either way. In a real world scenario, the rate at which XIDs are used could be very low. Buying a few hundred million more XIDs until the pain begins could amount to buying weeks or months for the user in practice. Plus they have visibility into the issue, in that they can potentially see exactly when they stopped being able to advance relfrozenxid by looking at the autovacuum logs. My thinking on vacuum_freeze_min_age has shifted very slightly. I now think that I'll probably need to keep it around, just so things like VACUUM FREEZE (which sets vacuum_freeze_min_age to 0 internally) continue to work. So maybe its default should be changed to -1, which is interpreted as "whatever autovacuum_freeze_max_age/2 is". But it should still be greatly deemphasized in user docs. -- Peter Geoghegan
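To illustrate the proposed -1 default concretely, here is a minimal standalone C sketch; resolve_freeze_min_age() is a hypothetical helper invented for illustration only, not code from the patch series, and the exact clamping behavior is an assumption:

#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative only: a vacuum_freeze_min_age of -1 is taken to mean "derive
 * the value from autovacuum_freeze_max_age / 2"; explicit settings (such as
 * the 0 used internally by VACUUM FREEZE) are honored, but never allowed to
 * exceed that derived ceiling.
 */
static int64_t
resolve_freeze_min_age(int64_t vacuum_freeze_min_age,
                       int64_t autovacuum_freeze_max_age)
{
    int64_t derived = autovacuum_freeze_max_age / 2;

    if (vacuum_freeze_min_age < 0)
        return derived;
    return (vacuum_freeze_min_age < derived) ? vacuum_freeze_min_age : derived;
}

int
main(void)
{
    /* with the stock autovacuum_freeze_max_age of 200 million ... */
    printf("default (-1): %lld\n",
           (long long) resolve_freeze_min_age(-1, 200000000));   /* 100000000 */
    /* ... while VACUUM FREEZE still freezes everything eligible */
    printf("VACUUM FREEZE (0): %lld\n",
           (long long) resolve_freeze_min_age(0, 200000000));    /* 0 */
    return 0;
}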
On Thu, Jan 20, 2022 at 11:45 AM Peter Geoghegan <pg@bowt.ie> wrote: > My thinking on vacuum_freeze_min_age has shifted very slightly. I now > think that I'll probably need to keep it around, just so things like > VACUUM FREEZE (which sets vacuum_freeze_min_age to 0 internally) > continue to work. So maybe its default should be changed to -1, which > is interpreted as "whatever autovacuum_freeze_max_age/2 is". But it > should still be greatly deemphasized in user docs. I like that better, because it lets us retain an escape valve in case we should need it. I suggest that the documentation should say things like "The default is believed to be suitable for most use cases" or "We are not aware of a reason to change the default" rather than something like "There is almost certainly no good reason to change this" or "What kind of idiot are you, anyway?" :-) -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 20, 2022 at 11:33 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jan 20, 2022 at 11:45 AM Peter Geoghegan <pg@bowt.ie> wrote: > > My thinking on vacuum_freeze_min_age has shifted very slightly. I now > > think that I'll probably need to keep it around, just so things like > > VACUUM FREEZE (which sets vacuum_freeze_min_age to 0 internally) > > continue to work. So maybe its default should be changed to -1, which > > is interpreted as "whatever autovacuum_freeze_max_age/2 is". But it > > should still be greatly deemphasized in user docs. > > I like that better, because it lets us retain an escape valve in case > we should need it. I do see some value in that, too. Though it's not going to be a way of turning off the early freezing stuff, which seems unnecessary (though I do still have work to do on getting the overhead for that down). > I suggest that the documentation should say things > like "The default is believed to be suitable for most use cases" or > "We are not aware of a reason to change the default" rather than > something like "There is almost certainly no good reason to change > this" or "What kind of idiot are you, anyway?" :-) I will admit to having a big bias here: I absolutely *loathe* these GUCs. I really, really hate them. Consider how we have to include messy caveats about autovacuum_freeze_min_age when talking about autovacuum_vacuum_insert_scale_factor. Then there's the fact that you really cannot think about the rate of XID consumption intuitively -- it has at best a weak, unpredictable relationship with anything that users can understand, such as data stored or wall clock time. Then there are the problems with the equivalent MultiXact GUCs, which somehow, against all odds, are even worse: https://buttondown.email/nelhage/archive/notes-on-some-postgresql-implementation-details/ -- Peter Geoghegan
On Thu, 20 Jan 2022 at 17:01, Peter Geoghegan <pg@bowt.ie> wrote: > > Then there's the fact that you > really cannot think about the rate of XID consumption intuitively -- > it has at best a weak, unpredictable relationship with anything that > users can understand, such as data stored or wall clock time. This confuses me. "Transactions per second" is a headline database metric that lots of users actually focus on quite heavily -- rather too heavily imho. Ok, XID consumption is only a subset of transactions that are not read-only but that's a detail that's pretty easy to explain and users get pretty quickly. There are corner cases like transactions that look read-only but are actually read-write or transactions that consume multiple xids but complex systems are full of corner cases and people don't seem too surprised about these things. What I find confuses people much more is the concept of the oldestxmin. I think most of the autovacuum problems I've seen come from cases where autovacuum is happily kicking off useless vacuums because the oldestxmin hasn't actually advanced enough for them to do any useful work. -- greg
On Fri, Jan 21, 2022 at 12:07 PM Greg Stark <stark@mit.edu> wrote: > This confuses me. "Transactions per second" is a headline database > metric that lots of users actually focus on quite heavily -- rather > too heavily imho. But transactions per second is for the whole database, not for individual tables. It's also really a benchmarking thing, where the size and variety of transactions is fixed. With something like pgbench it actually is exactly the same thing, but such a workload is not at all realistic. Even BenchmarkSQL/TPC-C isn't like that, despite the fact that it is a fairly synthetic workload (it's just not super synthetic). > Ok, XID consumption is only a subset of transactions > that are not read-only but that's a detail that's pretty easy to > explain and users get pretty quickly. My point was mostly this: the number of distinct extant unfrozen tuple headers (and the range of the relevant XIDs) is generally highly unpredictable today. And the number of tuples we'll have to freeze to be able to advance relfrozenxid by a good amount is quite variable, in general. For example, if we bulk extend a relation as part of an ETL process, then the number of distinct XIDs could be as low as 1, even though we can expect a great deal of "freeze debt" that will have to be paid off at some point (with the current design, in the common case where the user doesn't account for this effect because they're not already an expert). There are other common cases that are not quite as extreme as that, that still have the same effect -- even an expert will find it hard or impossible to tune autovacuum_freeze_min_age for that. Another case of interest (that illustrates the general principle) is something like pgbench_tellers. We'll never have an aggressive VACUUM of the table with the patch, and we shouldn't ever need to freeze any tuples. But, owing to workload characteristics, we'll constantly be able to keep its relfrozenxid very current, because (even if we introduce skew) each individual row cannot go very long without being updated, allowing old XIDs to age out that way. There is also an interesting middle ground, where you get a mixture of both tendencies due to skew. The tuple that's most likely to get updated was the one that was just updated. How are you as a DBA ever supposed to tune autovacuum_freeze_min_age if tuples happen to be qualitatively different in this way? > What I find confuses people much more is the concept of the > oldestxmin. I think most of the autovacuum problems I've seen come > from cases where autovacuum is happily kicking off useless vacuums > because the oldestxmin hasn't actually advanced enough for them to do > any useful work. As it happens, the proposed log output won't use the term oldestxmin anymore -- I think that it makes sense to rename it to "removable cutoff". 
Here's an example:

LOG: automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
    pages: 0 removed, 317308 remain, 250258 skipped using visibility map (78.87% of total)
    tuples: 70 removed, 34105925 remain (6830471 newly frozen), 2528 are dead but not yet removable
    removable cutoff: 37574752, which is 230115 xids behind next
    new relfrozenxid: 35221275, which is 5219310 xids ahead of previous value
    index scan needed: 55540 pages from table (17.50% of total) had 3339809 dead item identifiers removed
    index "bmsql_oorder_pkey": pages: 144257 in total, 0 newly deleted, 0 currently deleted, 0 reusable
    index "bmsql_oorder_idx2": pages: 330083 in total, 0 newly deleted, 0 currently deleted, 0 reusable
    I/O timings: read: 7928.207 ms, write: 1386.662 ms
    avg read rate: 33.107 MB/s, avg write rate: 26.218 MB/s
    buffer usage: 220825 hits, 443331 misses, 351084 dirtied
    WAL usage: 576110 records, 364797 full page images, 2046767817 bytes
    system usage: CPU: user: 10.62 s, system: 7.56 s, elapsed: 104.61 s

Note also that I deliberately made the "new relfrozenxid" line that immediately follows (information that we haven't shown before now) similar, to highlight that they're now closely related concepts. Now if you VACUUM a table that is either empty or has only frozen tuples, VACUUM will set relfrozenxid to oldestxmin/removable cutoff. Internally, oldestxmin is the "starting point" for our final/target relfrozenxid for the table. We ratchet it back dynamically, whenever we see an older-than-current-target XID that cannot be immediately frozen (e.g., when we can't easily get a cleanup lock on the page). -- Peter Geoghegan
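A rough standalone C sketch of that ratcheting rule (all names are hypothetical, and wraparound-aware XID comparisons are deliberately ignored to keep the illustration short -- the real tracking would happen during per-page processing in vacuumlazy.c):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;   /* simplified; real code must handle XID wraparound */

/*
 * Start with the most optimistic target (OldestXmin, the "removable cutoff")
 * and ratchet it back for every unfrozen XID that will remain in the table
 * after this VACUUM -- for example an XID on a page whose cleanup lock could
 * not be acquired.  The final value is what relfrozenxid can safely become.
 */
static TransactionId
ratchet_relfrozenxid_target(TransactionId target, TransactionId unfrozen_xid)
{
    return (unfrozen_xid < target) ? unfrozen_xid : target;
}

int
main(void)
{
    TransactionId target = 37574752;    /* OldestXmin / removable cutoff */

    /* an older XID that had to be left behind unfrozen */
    target = ratchet_relfrozenxid_target(target, 35221275);
    printf("final relfrozenxid target: %u\n", target);   /* prints 35221275 */
    return 0;
}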
On Thu, Jan 20, 2022 at 2:00 PM Peter Geoghegan <pg@bowt.ie> wrote: > I do see some value in that, too. Though it's not going to be a way of > turning off the early freezing stuff, which seems unnecessary (though > I do still have work to do on getting the overhead for that down). Attached is v7, a revision that overhauls the algorithm that decides what to freeze. I'm now calling it block-driven freezing in the commit message. Also included is a new patch that makes VACUUM record zero free space in the FSM for an all-visible page, unless the total amount of free space happens to be greater than one half of BLCKSZ. The fact that I am now including this new FSM patch (v7-0006-*patch) may seem like a case of expanding the scope of something that could well do without it. But hear me out! It's true that the new FSM patch isn't essential. I'm including it now because it seems relevant to the approach taken with block-driven freezing -- it may even make my general approach easier to understand. The new approach to freezing is to freeze every tuple on a block that is about to be set all-visible (and thus set it all-frozen too), or to not freeze anything on the page at all (at least until one XID gets really old, which should be rare). This approach has all the benefits that I described upthread, and a new benefit: it effectively encourages the application to allow pages to "become settled". The main difference in how we freeze here (relative to v6 of the patch) is that I'm *not* freezing a page just because it was dirtied/pruned. I now think about freezing as an essentially page-level thing, barring edge cases where we have to freeze individual tuples, just because the XIDs really are getting old (it's an edge case when we can't freeze all the tuples together due to a mix of new and old, which is something we specifically set out to avoid now).

Freezing whole pages
====================

When VACUUM sees that all remaining/unpruned tuples on a page are all-visible, it isn't just important because of cost control considerations. It's deeper than that. It's also treated as a tentative signal from the application itself, about the data itself. Which is: this page looks "settled" -- it may never be updated again, but if there is an update it likely won't change too much about the whole page. Also, if the page is ever updated in the future, it's likely that that will happen at a much later time than you should expect for those *other* nearby pages, that *don't* appear to be settled. And so VACUUM infers that the page is *qualitatively* different to these other nearby pages. VACUUM therefore makes it hard (though not impossible) for future inserts or updates to disturb these settled pages, via this FSM behavior -- it is short sighted to just see the space remaining on the page as free space, equivalent to any other. This holistic approach seems to work well for TPC-C/BenchmarkSQL, and perhaps even in general. More on TPC-C below. This is not unlike the approach taken by other DB systems, where free space management is baked into concurrency control, and the concept of physical data independence as we know it from Postgres never really existed. My approach also seems related to the concept of a "tenured generation", which is key to generational garbage collection. The whole basis of generational garbage collection is the generational hypothesis: "most objects die young".
This is an empirical observation about how applications written in GC'd programming languages actually behave, not a rigorous principle, and yet in practice it appears to always hold. Intuitively, it seems to me like the hypothesis must work in practice because if it didn't then a counterexample nemesis application's behavior would be totally chaotic, in every way. Theoretically possible, but of no real concern, since the program makes zero practical sense *as an actual program*. A Java program must make sense to *somebody* (at least the person that wrote it), which, it turns out, helpfully constrains the space of possibilities that any industrial strength GC implementation needs to handle well. The same principles seem to apply here, with VACUUM. Grouping logical rows into pages that become their "permanent home until further notice" may be somewhat arbitrary, at first, but that doesn't mean it won't end up sticking. Just like with generational garbage collection, where the application isn't expected to instruct the GC about its plans for memory that it allocates, that can nevertheless be usefully organized into distinct generations through an adaptive process.

Second order effects
====================

Relating the FSM to page freezing/all-visible setting makes much more sense if you consider the second order effects. There is bound to be competition for free space among backends that access the free space map. By *not* freezing a page during VACUUM because it looks unsettled, we make its free space available in the traditional way instead. It follows that unsettled pages (in tables with lots of updates) are now the only place that backends that need more free space from the FSM can look -- unsettled pages therefore become a hot commodity, freespace-wise. A page that initially appeared "unsettled", that went on to become settled in this newly competitive environment might have that happen by pure chance -- but probably not. It *could* happen by chance, of course -- in which case the page will get dirtied again, and the cycle continues, for now. There will be further opportunities to figure it out, and freezing the tuples on the page "prematurely" still has plenty of benefits. Locality matters a lot, obviously. The goal with the FSM stuff is merely to make it *possible* for pages to settle naturally, to the extent that we can. We really just want to avoid hindering a naturally occurring process -- we want to avoid destroying naturally occurring locality. We must be willing to accept some cost for that. Even if it takes a few attempts for certain pages, constraining the application's choice of where to get free space from (can't be a page marked all-visible) allows pages to *systematically* become settled over time. The application is in charge, really -- not VACUUM. This is already the case, whether we like it or not. VACUUM needs to learn to live in that reality, rather than fighting it. When VACUUM considers a page settled, and the physical page still has a relatively large amount of free space (say 45% of BLCKSZ, a borderline case in the new FSM patch), "losing" so much free space certainly is unappealing. We set the free space to 0 in the free space map all the same, because we're cutting our losses at that point. While the exact threshold I've proposed is tentative, the underlying theory seems pretty sound to me.
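To make the two page-level policies concrete, here is a rough C sketch; the struct and function names are illustrative assumptions rather than code from the v7 patches, where the real logic would live in and around lazy_scan_prune and the FSM update path:

#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ 8192                 /* default Postgres block size */

typedef struct PageVacInfo
{
    bool   all_visible;             /* every remaining tuple visible to all? */
    bool   has_very_old_xid;        /* some XID old enough to force freezing? */
    size_t free_space;              /* usable free space on the page, bytes */
} PageVacInfo;

/*
 * Block-driven freezing: freeze all of a page's tuples when the page is
 * about to be set all-visible (so it can be set all-frozen at the same
 * time); otherwise freeze nothing, unless some XID has become old enough
 * that it must be frozen anyway (expected to be rare).
 */
bool
page_triggers_freezing(const PageVacInfo *page)
{
    return page->all_visible || page->has_very_old_xid;
}

/*
 * FSM heuristic: advertise zero free space for a "settled" (all-visible)
 * page, unless it still has more than half a block free -- in which case
 * it remains fair game for new tuples.
 */
size_t
free_space_to_record(const PageVacInfo *page)
{
    if (page->all_visible && page->free_space <= BLCKSZ / 2)
        return 0;
    return page->free_space;
}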
The BLCKSZ/2 cutoff (and the way that it extends the general rules for whole-page freezing) is intended to catch pages that are qualitatively different, as well as quantitatively different. It is a balancing act, between not wasting space, and the risk of systemic problems involving excessive amounts of non-HOT updates that must move a successor version to another page. It's possible that a higher cutoff (for example a cutoff of 80% of BLCKSZ, not 50%) will actually lead to *worse* space utilization, in addition to the downsides from fragmentation -- it's far from a simple trade-off. (Not that you should believe that 50% is special, it's just a starting point for me.)

TPC-C
=====

I'm going to talk about a benchmark that ran throughout the week, starting on Monday. Each run lasted 24 hours, and there were 2 runs in total, for both the patch and for master/baseline. So this benchmark lasted 4 days, not including the initial bulk loading, with databases that were over 450GB in size by the time I was done (that's 450GB+ for both the patch and master). Benchmarking for days at a time is pretty inconvenient, but it seems necessary to see certain effects in play. We need to wait until the baseline/master case starts to have anti-wraparound VACUUMs with default, realistic settings, which just takes days and days. I make available all of my data for the benchmark in question, which is way more information than anybody is likely to want -- I dump anything that even might be useful from the system views in an automated way. There are html reports for all four of the 24 hour long runs. Google drive link: https://drive.google.com/drive/folders/1A1g0YGLzluaIpv-d_4o4thgmWbVx3LuR?usp=sharing While the patch did well overall, and I will get to the particulars towards the end of the email, I want to start with what I consider to be the important part: the user/admin experience with VACUUM, and VACUUM's performance stability. This is about making VACUUM less scary. As I've said several times now, with an append-only table like pgbench_history we see a consistent pattern where relfrozenxid is set to a value very close to the same VACUUM's OldestXmin value (even precisely equal to OldestXmin) during each VACUUM operation, again and again, forever -- that case is easy to understand and appreciate, and has already been discussed. Now (with v7's new approach to freezing), a related pattern can be seen in the case of the two big, troublesome TPC-C tables, the orders and order lines tables. To recap, these tables are somewhat like the history table, in that new orders insert into both tables, again and again, forever. But they also have one huge difference from simple append-only tables, which is the source of most of our problems with TPC-C. The difference is: there are also delayed, correlated updates of each row from each table. Exactly one such update per row for both tables, which takes place hours after each order's insert, when the earlier order is processed by TPC-C's delivery transaction. In the long run we need the data to age out and not get re-dirtied, as the table grows and grows indefinitely, much like with a simple append-only table. At the same time, we don't want to have poor free space management for these deferred updates. It's adversarial, sort of, but in a way that is grounded in reality. With the order and order lines tables, relfrozenxid tends to be advanced up to the OldestXmin used by the *previous* VACUUM operation -- an unmistakable pattern.
I'll show you all of the autovacuum log output for the orders table during the second 24 hour long benchmark run:

2022-01-27 01:46:27 PST LOG: automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
    pages: 0 removed, 1205349 remain, 887225 skipped using visibility map (73.61% of total)
    tuples: 253872 removed, 134182902 remain (26482225 newly frozen), 27193 are dead but not yet removable
    removable cutoff: 243783407, older by 728844 xids when operation ended
    new relfrozenxid: 215400514, which is 26840669 xids ahead of previous value
    ...

2022-01-27 05:54:39 PST LOG: automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
    pages: 0 removed, 1345302 remain, 993924 skipped using visibility map (73.88% of total)
    tuples: 261656 removed, 150022816 remain (29757570 newly frozen), 29216 are dead but not yet removable
    removable cutoff: 276319403, older by 826850 xids when operation ended
    new relfrozenxid: 243838706, which is 28438192 xids ahead of previous value
    ...

2022-01-27 10:37:24 PST LOG: automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
    pages: 0 removed, 1504707 remain, 1110002 skipped using visibility map (73.77% of total)
    tuples: 316086 removed, 167990124 remain (33754949 newly frozen), 33326 are dead but not yet removable
    removable cutoff: 313328445, older by 987732 xids when operation ended
    new relfrozenxid: 276309397, which is 32470691 xids ahead of previous value
    ...

2022-01-27 15:49:51 PST LOG: automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
    pages: 0 removed, 1680649 remain, 1250525 skipped using visibility map (74.41% of total)
    tuples: 343946 removed, 187739072 remain (37346315 newly frozen), 38037 are dead but not yet removable
    removable cutoff: 354149019, older by 1222160 xids when operation ended
    new relfrozenxid: 313332249, which is 37022852 xids ahead of previous value
    ...

2022-01-27 21:55:34 PST LOG: automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
    pages: 0 removed, 1886336 remain, 1403800 skipped using visibility map (74.42% of total)
    tuples: 389748 removed, 210899148 remain (43453900 newly frozen), 45802 are dead but not yet removable
    removable cutoff: 401955979, older by 1458514 xids when operation ended
    new relfrozenxid: 354134615, which is 40802366 xids ahead of previous value

This mostly speaks for itself, I think. (Anybody that's interested can drill down to the logs for order lines, which look similar.) The effect we see with the order/order lines tables isn't perfectly reliable. Actually, it depends on how you define it. It's possible that we won't be able to acquire a cleanup lock on the wrong page at the wrong time, and as a result fail to advance relfrozenxid by the usual amount, once. But that effect appears to be both rare and of no real consequence. One could reasonably argue that we never fell behind, because we still did 99.9%+ of the required freezing -- we just didn't immediately get to advance relfrozenxid, because of a temporary hiccup on one page. We will still advance relfrozenxid by a small amount. Sometimes it'll be by only hundreds of XIDs when millions or tens of millions of XIDs were expected. Once we advance it by some amount, we can reasonably suppose that the issue was just a hiccup.
On the master branch, the first 24 hour period has no anti-wraparound VACUUMs, and so looking at that first 24 hour period gives you some idea of how much worse off we are in the short term -- the freezing stuff won't really start to pay for itself until the second 24 hour run with these mostly-default freeze related settings. The second 24 hour run on master almost exclusively has anti-wraparound VACUUMs for all the largest tables, though -- all at the same time. And not just the first time, either! This causes big spikes that the patch totally avoids, simply by avoiding anti-wraparound VACUUMs. With the patch, there are no anti-wraparound VACUUMs, barring tables that will never be vacuumed for any other reason, where it's still inevitable, limited to the stock table and customers table. It was a mistake for me to emphasize "no anti-wraparound VACUUMs outside pathological cases" before now. I stand by those statements as accurate, but anti-wraparound VACUUMs should not have been given so much emphasis. Let's assume that somehow we really were to get an anti-wraparound VACUUM against one of the tables where that's just not expected, like this orders table -- let's suppose that I got that part wrong, in some way. It would hardly matter at all! We'd still have avoided the freezing cliff during this anti-wraparound VACUUM, which is the real benefit. Chances are good that we needed to VACUUM anyway, just to clean any very old garbage tuples up -- relfrozenxid is now predictive of the age of the oldest garbage tuples, which might have been a good enough reason to VACUUM anyway. The stampede of anti-wraparound VACUUMs against multiple tables seems like it would still be fixed, since relfrozenxid now actually tells us something about the table (as opposed to telling us only about what the user set vacuum_freeze_min_age to). The only concerns that this leaves for me are all usability related, and not of primary importance (e.g. do we really need to make anti-wraparound VACUUMs non-cancelable now?).

TPC-C raw numbers
=================

The single most important number for the patch might be the decrease in both buffer misses and buffer hits, which I believe is caused by the patch being able to use index-only scans much more effectively (with modifications to BenchmarkSQL to improve the indexing strategy [1]). This is quite clear from pg_stat_database state at the end.

Patch:

 xact_commit   | 440,515,133
 xact_rollback | 1,871,142
 blks_read     | 3,754,614,188
 blks_hit      | 174,551,067,731
 tup_returned  | 341,222,714,073
 tup_fetched   | 124,797,772,450
 tup_inserted  | 2,900,197,655
 tup_updated   | 4,549,948,092
 tup_deleted   | 165,222,130

Here is the same pg_stat_database info for master:

 xact_commit   | 440,402,505
 xact_rollback | 1,871,536
 blks_read     | 4,002,682,052
 blks_hit      | 283,015,966,386
 tup_returned  | 346,448,070,798
 tup_fetched   | 237,052,965,901
 tup_inserted  | 2,899,735,420
 tup_updated   | 4,547,220,642
 tup_deleted   | 165,103,426

The blks_read is x0.938 of master/baseline for the patch -- not bad. More importantly, blks_hit is x0.616 for the patch -- quite a significant reduction in a key cost. Note that we start to get this particular benefit for individual read queries pretty early on -- avoiding unsetting visibility map bits like this matters right from the start. In TPC-C terms, the ORDER_STATUS transaction will have much lower latency, particularly tail latency, since it uses index-only scans to good effect.
There are 5 distinct transaction types from the benchmark, and it isn't unusual for an improvement to show up in only one particular transaction type -- so you often have to drill down and look at the full html report. The latency situation is improved across the board with the patch, by quite a bit, especially after the second run. This server can sustain much more throughput than the TPC-C spec formally permits -- I've increased the benchmark's TPM rate to 10x the spec-legal limit -- so query latency is the main TPC-C metric of interest here.

WAL
===

Then there's the WAL overhead. Like practically any workload, the WAL consumption for this workload is dominated by FPIs, despite the fact that I've tuned checkpoints reasonably well. The patch *does* write more WAL in the first set of runs -- it writes a total of ~3.991 TiB, versus ~3.834 TiB for master. In other words, during the first 24 hour run (before the trouble with the anti-wraparound freeze cliff even begins for the master branch), the patch writes x1.040 as much WAL in total. The good news is that the patch comes out ahead by the end, after the second set of 24 hour runs. By the time the second run finishes, it's 8.332 TiB of WAL total for the patch, versus 8.409 TiB for master, putting the patch at x0.990 in the end -- a small improvement. I believe that most of the WAL doesn't get generated by VACUUM here anyway -- opportunistic pruning works well for this workload.

I expect to be able to commit the first 2 patches in a couple of weeks, since that won't need to block on making the case for the final 3 or 4 patches from the patch series. The early stuff is mostly just refactoring work that removes needless differences between aggressive and non-aggressive VACUUM operations. It makes a lot of sense on its own.

[1] https://github.com/pgsql-io/benchmarksql/pull/16

--
Peter Geoghegan
Attachment
- v7-0004-Loosen-coupling-between-relfrozenxid-and-tuple-fr.patch
- v7-0005-Make-block-level-characteristics-drive-freezing.patch
- v7-0006-Add-all-visible-FSM-heuristic.patch
- v7-0003-Consolidate-VACUUM-xid-cutoff-logic.patch
- v7-0002-Add-VACUUM-instrumentation-for-scanned-pages-relf.patch
- v7-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patch
On Sat, Jan 29, 2022 at 11:43 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Thu, Jan 20, 2022 at 2:00 PM Peter Geoghegan <pg@bowt.ie> wrote: > > I do see some value in that, too. Though it's not going to be a way of > > turning off the early freezing stuff, which seems unnecessary (though > > I do still have work to do on getting the overhead for that down). > > Attached is v7, a revision that overhauls the algorithm that decides > what to freeze. I'm now calling it block-driven freezing in the commit > message. Also included is a new patch, that makes VACUUM record zero > free space in the FSM for an all-visible page, unless the total amount > of free space happens to be greater than one half of BLCKSZ. > > The fact that I am now including this new FSM patch (v7-0006-*patch) > may seem like a case of expanding the scope of something that could > well do without it. But hear me out! It's true that the new FSM patch > isn't essential. I'm including it now because it seems relevant to the > approach taken with block-driven freezing -- it may even make my > general approach easier to understand. Without having looked at the latest patches, there was something in the back of my mind while following the discussion upthread -- the proposed opportunistic freezing made a lot more sense if the earlier-proposed open/closed pages concept was already available. > Freezing whole pages > ==================== > It's possible that a higher cutoff (for example a cutoff of 80% of > BLCKSZ, not 50%) will actually lead to *worse* space utilization, in > addition to the downsides from fragmentation -- it's far from a simple > trade-off. (Not that you should believe that 50% is special, it's just > a starting point for me.) How was the space utilization with the 50% cutoff in the TPC-C test? > TPC-C raw numbers > ================= > > The single most important number for the patch might be the decrease > in both buffer misses and buffer hits, which I believe is caused by > the patch being able to use index-only scans much more effectively > (with modifications to BenchmarkSQL to improve the indexing strategy > [1]). This is quite clear from pg_stat_database state at the end. > > Patch: > blks_hit | 174,551,067,731 > tup_fetched | 124,797,772,450 > Here is the same pg_stat_database info for master: > blks_hit | 283,015,966,386 > tup_fetched | 237,052,965,901 That's impressive. -- John Naylor EDB: http://www.enterprisedb.com
On Fri, Feb 4, 2022 at 2:00 PM John Naylor <john.naylor@enterprisedb.com> wrote: > Without having looked at the latest patches, there was something in > the back of my mind while following the discussion upthread -- the > proposed opportunistic freezing made a lot more sense if the > earlier-proposed open/closed pages concept was already available. Yeah, sorry about that. The open/closed pages concept is still something I plan on working on. My prototype (which I never posted to the list) will be rebased, and I'll try to target Postgres 16. > > Freezing whole pages > > ==================== > > > It's possible that a higher cutoff (for example a cutoff of 80% of > > BLCKSZ, not 50%) will actually lead to *worse* space utilization, in > > addition to the downsides from fragmentation -- it's far from a simple > > trade-off. (Not that you should believe that 50% is special, it's just > > a starting point for me.) > > How was the space utilization with the 50% cutoff in the TPC-C test? The picture was mixed. To get the raw numbers, compare pg-relation-sizes-after-patch-2.out and pg-relation-sizes-after-master-2.out files from the drive link I provided (to repeat, get them from https://drive.google.com/drive/u/1/folders/1A1g0YGLzluaIpv-d_4o4thgmWbVx3LuR) Highlights: the largest table (the bmsql_order_line table) had a total size of x1.006 relative to master, meaning that we did slightly worse there. However, the index on the same table was slightly smaller instead, probably because reducing heap fragmentation tends to make the index deletion stuff work a bit better than before. Certain small tables (bmsql_district and bmsql_warehouse) were actually significantly smaller (less than half their size on master), probably just because the patch can reliably remove LP_DEAD items from heap pages, even when a cleanup lock isn't available. The bmsql_new_order table was quite a bit larger, but it's not that large anyway (1250 MB on master at the very end, versus 1433 MB with the patch). This is a clear trade-off, since we get much less fragmentation in the same table (as evidenced by the VACUUM output, where there are fewer pages with any LP_DEAD items per VACUUM with the patch). The workload for that table is characterized by inserting new orders together, and deleting the same orders as a group later on. So we're bound to pay a cost in space utilization to lower the fragmentation. > > blks_hit | 174,551,067,731 > > tup_fetched | 124,797,772,450 > > > Here is the same pg_stat_database info for master: > > > blks_hit | 283,015,966,386 > > tup_fetched | 237,052,965,901 > > That's impressive. Thanks! It's still possible to get a big improvement like that with something like TPC-C because there are certain behaviors that are clearly suboptimal -- once you look at the details of the workload, and compare an imaginary ideal to the actual behavior of the system. In particular, there is really only one way that the free space management can work for the two big tables that will perform acceptably -- the orders have to be stored in the same place to begin with, and stay in the same place forever (at least to the extent that that's possible). -- Peter Geoghegan
On Sat, Jan 29, 2022 at 11:43 PM Peter Geoghegan <pg@bowt.ie> wrote: > When VACUUM sees that all remaining/unpruned tuples on a page are > all-visible, it isn't just important because of cost control > considerations. It's deeper than that. It's also treated as a > tentative signal from the application itself, about the data itself. > Which is: this page looks "settled" -- it may never be updated again, > but if there is an update it likely won't change too much about the > whole page. While I agree that there's some case to be made for leaving settled pages well enough alone, your criterion for settled seems pretty much accidental. Imagine a system where there are two applications running, A and B. Application A runs all the time and all the transactions which it performs are short. Therefore, when a certain page is not modified by transaction A for a short period of time, the page will become all-visible and will be considered settled. Application B runs once a month and performs various transactions all of which are long, perhaps on a completely separate set of tables. While application B is running, pages take longer to settle not only for application B but also for application A. It doesn't make sense to say that the application is in control of the behavior when, in reality, it may be some completely separate application that is controlling the behavior. > The application is in charge, really -- not VACUUM. This is already > the case, whether we like it or not. VACUUM needs to learn to live in > that reality, rather than fighting it. When VACUUM considers a page > settled, and the physical page still has a relatively large amount of > free space (say 45% of BLCKSZ, a borderline case in the new FSM > patch), "losing" so much free space certainly is unappealing. We set > the free space to 0 in the free space map all the same, because we're > cutting our losses at that point. While the exact threshold I've > proposed is tentative, the underlying theory seems pretty sound to me. > The BLCKSZ/2 cutoff (and the way that it extends the general rules for > whole-page freezing) is intended to catch pages that are qualitatively > different, as well as quantitatively different. It is a balancing act, > between not wasting space, and the risk of systemic problems involving > excessive amounts of non-HOT updates that must move a successor > version to another page. I can see that this could have significant advantages under some circumstances. But I think it could easily be far worse under other circumstances. I mean, you can have workloads where you do some amount of read-write work on a table and then go read only and sequential scan it an infinite number of times. An algorithm that causes the table to be smaller at the point where we switch to read-only operations, even by a modest amount, wins infinitely over anything else. But even if you have no change in the access pattern, is it a good idea to allow the table to be, say, 5% larger if it means that correlated data is colocated? In general, probably yes. If that means that the table fails to fit in shared_buffers instead of fitting, no. If that means that the table fails to fit in the OS cache instead of fitting, definitely no. And to me, that kind of effect is why it's hard to gain much confidence in regards to stuff like this via laboratory testing. I mean, I'm glad you're doing such tests. 
But in a laboratory test, you tend not to have things like a sudden and complete change in the workload, or a random other application sometimes sharing the machine, or only being on the edge of running out of memory. I think in general people tend to avoid such things in benchmarking scenarios, but even if you include stuff like this, it's hard to know what to include that would be representative of real life, because just about anything *could* happen in real life. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Feb 4, 2022 at 2:45 PM Robert Haas <robertmhaas@gmail.com> wrote: > While I agree that there's some case to be made for leaving settled > pages well enough alone, your criterion for settled seems pretty much > accidental. I fully admit that I came up with the FSM heuristic with TPC-C in mind. But you have to start somewhere. Fortunately, the main benefits of this patch series (avoiding the freeze cliff during anti-wraparound VACUUMs, often avoiding anti-wraparound VACUUMs altogether) don't depend on the experimental FSM patch at all. I chose to post that now because it seemed to help with my more general point about qualitatively different pages, and freezing at the page level. > Imagine a system where there are two applications running, > A and B. Application A runs all the time and all the transactions > which it performs are short. Therefore, when a certain page is not > modified by transaction A for a short period of time, the page will > become all-visible and will be considered settled. Application B runs > once a month and performs various transactions all of which are long, > perhaps on a completely separate set of tables. While application B is > running, pages take longer to settle not only for application B but > also for application A. It doesn't make sense to say that the > application is in control of the behavior when, in reality, it may be > some completely separate application that is controlling the behavior. Application B will already block pruning by VACUUM operations against application A's table, and so effectively blocks recording of the resultant free space in the FSM in your scenario. And so application A and application B should be considered the same application already. That's just how VACUUM works. VACUUM isn't a passive observer of the system -- it's another participant. It both influences and is influenced by almost everything else in the system. > I can see that this could have significant advantages under some > circumstances. But I think it could easily be far worse under other > circumstances. I mean, you can have workloads where you do some amount > of read-write work on a table and then go read only and sequential > scan it an infinite number of times. An algorithm that causes the > table to be smaller at the point where we switch to read-only > operations, even by a modest amount, wins infinitely over anything > else. But even if you have no change in the access pattern, is it a > good idea to allow the table to be, say, 5% larger if it means that > correlated data is colocated? In general, probably yes. If that means > that the table fails to fit in shared_buffers instead of fitting, no. > If that means that the table fails to fit in the OS cache instead of > fitting, definitely no. 5% larger seems like a lot more than would be typical, based on what I've seen. I don't think that the regression in this scenario can be characterized as "infinitely worse", or anything like it. On a long enough timeline, the potential upside of something like this is nearly unlimited -- it could avoid a huge amount of write amplification. But the potential downside seems to be small and fixed -- which is the point (bounding the downside). The mere possibility of getting that big benefit (avoiding the costs from heap fragmentation) is itself a benefit, even when it turns out not to pay off in your particular case. It can be seen as insurance.
> And to me, that kind of effect is why it's hard to gain much > confidence in regards to stuff like this via laboratory testing. I > mean, I'm glad you're doing such tests. But in a laboratory test, you > tend not to have things like a sudden and complete change in the > workload, or a random other application sometimes sharing the machine, > or only being on the edge of running out of memory. I think in general > people tend to avoid such things in benchmarking scenarios, but even > if include stuff like this, it's hard to know what to include that > would be representative of real life, because just about anything > *could* happen in real life. Then what could you have confidence in? -- Peter Geoghegan
On Fri, Feb 4, 2022 at 3:31 PM Peter Geoghegan <pg@bowt.ie> wrote: > Application B will already block pruning by VACUUM operations against > application A's table, and so effectively blocks recording of the > resultant free space in the FSM in your scenario. And so application A > and application B should be considered the same application already. > That's just how VACUUM works. Sure ... but that also sucks. If we consider application A and application B to be the same application, then we're basing our decision about what to do on information that is inaccurate. > 5% larger seems like a lot more than would be typical, based on what > I've seen. I don't think that the regression in this scenario can be > characterized as "infinitely worse", or anything like it. On a long > enough timeline, the potential upside of something like this is nearly > unlimited -- it could avoid a huge amount of write amplification. But > the potential downside seems to be small and fixed -- which is the > point (bounding the downside). The mere possibility of getting that > big benefit (avoiding the costs from heap fragmentation) is itself a > benefit, even when it turns out not to pay off in your particular > case. It can be seen as insurance. I don't see it that way. There are cases where avoiding writes is better, and cases where trying to cram everything into the fewest possible pages is better. With the right test case you can make either strategy look superior. What I think your test case has going for it is that it is similar to something that a lot of people, really a ton of people, actually do with PostgreSQL. However, it's not going to be an accurate model of what everybody does, and therein lies some element of danger. > Then what could you have confidence in? Real-world experience. Which is hard to get if we don't ever commit any patches, but a good argument for (a) having them tested by multiple different hackers who invent test cases independently and (b) some configurability where we can reasonably include it, so that if anyone does experience problems they have an escape. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Feb 4, 2022 at 4:18 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Feb 4, 2022 at 3:31 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Application B will already block pruning by VACUUM operations against > > application A's table, and so effectively blocks recording of the > > resultant free space in the FSM in your scenario. And so application A > > and application B should be considered the same application already. > > That's just how VACUUM works. > > Sure ... but that also sucks. If we consider application A and > application B to be the same application, then we're basing our > decision about what to do on information that is inaccurate. I agree that it sucks, but I don't think that it's particularly relevant to the FSM prototype patch that I included with v7 of the patch series. A heap page cannot be considered "closed" (either in the specific sense from the patch, or in any informal sense) when it has recently dead tuples. At some point we should invent a fallback path for pruning, that migrates recently dead tuples to some other subsidiary structure, retaining only forwarding information in the heap page. But even that won't change what I just said about closed pages (it'll just make it easier to return and fix things up later on). > I don't see it that way. There are cases where avoiding writes is > better, and cases where trying to cram everything into the fewest > possible ages is better. With the right test case you can make either > strategy look superior. The cost of reads is effectively much lower than writes with modern SSDs, in TCO terms. Plus when a FSM strategy like the one from the patch does badly according to a naive measure such as total table size, that in itself doesn't mean that we do worse with reads. In fact, it's quite the opposite. The benchmark showed that v7 of the patch did very slightly worse on overall space utilization, but far, far better on reads. In fact, the benefits for reads were far in excess of any efficiency gains for writes/with WAL. The greatest bottleneck is almost always latency on modern hardware [1]. It follows that keeping logically related data grouped together is crucial. Far more important than potentially using very slightly more space. The story I wanted to tell with the FSM patch was about open and closed pages being the right long term direction. More generally, we should emphasize managing page-level costs, and deemphasize managing tuple-level costs, which are much less meaningful. > What I think your test case has going for it > is that it is similar to something that a lot of people, really a ton > of people, actually do with PostgreSQL. However, it's not going to be > an accurate model of what everybody does, and therein lies some > element of danger. No question -- agreed. > > Then what could you have confidence in? > > Real-world experience. Which is hard to get if we don't ever commit > any patches, but a good argument for (a) having them tested by > multiple different hackers who invent test cases independently and (b) > some configurability where we can reasonably include it, so that if > anyone does experience problems they have an escape. I agree. [1] https://dl.acm.org/doi/10.1145/1022594.1022596 -- Peter Geoghegan
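To make the heuristic under discussion concrete, here is a minimal sketch of the v7-0006 rule as described in this thread (illustrative names and a hard-coded 8KB block size -- not the actual patch code): an all-visible page advertises zero free space in the FSM, unless the page is still more than half empty.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define TOY_BLCKSZ 8192

/* Free space that VACUUM would advertise in the FSM for a heap page */
static size_t
advertised_free_space(bool all_visible, size_t actual_free)
{
    if (all_visible && actual_free <= TOY_BLCKSZ / 2)
        return 0;               /* "settled" page: keep new tuples away */

    return actual_free;
}

int
main(void)
{
    /* ~45% of the block free on an all-visible page: treated as closed */
    printf("%zu\n", advertised_free_space(true, 3686));
    /* ~60% free: still advertised, even though the page is all-visible */
    printf("%zu\n", advertised_free_space(true, 4915));
    /* not all-visible: unchanged behavior */
    printf("%zu\n", advertised_free_space(false, 3686));
    return 0;
}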
On Wed, 15 Dec 2021 at 15:30, Peter Geoghegan <pg@bowt.ie> wrote: > > My emphasis here has been on making non-aggressive VACUUMs *always* > advance relfrozenxid, outside of certain obvious edge cases. And so > with all the patches applied, up to and including the opportunistic > freezing patch, every autovacuum of every table manages to advance > relfrozenxid during benchmarking -- usually to a fairly recent value. > I've focussed on making aggressive VACUUMs (especially anti-wraparound > autovacuums) a rare occurrence, for truly exceptional cases (e.g., > user keeps canceling autovacuums, maybe due to automated script that > performs DDL). That has taken priority over other goals, for now. While I've seen all the above cases triggering anti-wraparound vacuums, by far the majority of the cases are not of these pathological forms. By far the majority of anti-wraparound vacuums are triggered by tables that are very large and so don't trigger regular vacuums for "long periods" of time and consistently hit the anti-wraparound threshold first. There's nothing limiting how long "long periods" is and nothing tying it to the rate of xid consumption. It's quite common to have some *very* large mostly static tables in databases that have other tables that are *very* busy. The worst I've seen is a table that took 36 hours to vacuum in a database that consumed about a billion transactions per day... That's extreme but these days it's quite common to see tables that get anti-wraparound vacuums every week or so despite having < 1% modified tuples. And databases are only getting bigger and transaction rates faster... -- greg
On Fri, Feb 4, 2022 at 10:21 PM Greg Stark <stark@mit.edu> wrote: > On Wed, 15 Dec 2021 at 15:30, Peter Geoghegan <pg@bowt.ie> wrote: > > My emphasis here has been on making non-aggressive VACUUMs *always* > > advance relfrozenxid, outside of certain obvious edge cases. And so > > with all the patches applied, up to and including the opportunistic > > freezing patch, every autovacuum of every table manages to advance > > relfrozenxid during benchmarking -- usually to a fairly recent value. > > I've focussed on making aggressive VACUUMs (especially anti-wraparound > > autovacuums) a rare occurrence, for truly exceptional cases (e.g., > > user keeps canceling autovacuums, maybe due to automated script that > > performs DDL). That has taken priority over other goals, for now. > > While I've seen all the above cases triggering anti-wraparound cases > by far the majority of the cases are not of these pathological forms. Right - it's practically inevitable that you'll need an anti-wraparound VACUUM to advance relfrozenxid right now. Technically it's possible to advance relfrozenxid in any VACUUM, but in practice it just never happens on a large table. You only need to get unlucky with one heap page, either by failing to get a cleanup lock, or (more likely) by setting even one single page all-visible but not all-frozen just once (once in any VACUUM that takes place between anti-wraparound VACUUMs). > By far the majority of anti-wraparound vacuums are triggered by tables > that are very large and so don't trigger regular vacuums for "long > periods" of time and consistently hit the anti-wraparound threshold > first. autovacuum_vacuum_insert_scale_factor can help with this on 13 and 14, but only if you tune autovacuum_freeze_min_age with that goal in mind. Which probably doesn't happen very often. > There's nothing limiting how long "long periods" is and nothing tying > it to the rate of xid consumption. It's quite common to have some > *very* large mostly static tables in databases that have other tables > that are *very* busy. > > The worst I've seen is a table that took 36 hours to vacuum in a > database that consumed about a billion transactions per day... That's > extreme but these days it's quite common to see tables that get > anti-wraparound vacuums every week or so despite having < 1% modified > tuples. And databases are only getting bigger and transaction rates > faster... Sounds very much like what I've been calling the freezing cliff. An anti-wraparound VACUUM throws things off by suddenly dirtying many more pages than the expected amount for a VACUUM against the table, despite there being no change in workload characteristics. If you just had to remove the dead tuples in such a table, then it probably wouldn't matter if it happened earlier than expected. -- Peter Geoghegan
On Fri, Feb 4, 2022 at 10:44 PM Peter Geoghegan <pg@bowt.ie> wrote: > Right - it's practically inevitable that you'll need an > anti-wraparound VACUUM to advance relfrozenxid right now. Technically > it's possible to advance relfrozenxid in any VACUUM, but in practice > it just never happens on a large table. You only need to get unlucky > with one heap page, either by failing to get a cleanup lock, or (more > likely) by setting even one single page all-visible but not all-frozen > just once (once in any VACUUM that takes place between anti-wraparound > VACUUMs). Minor correction: That's a slight exaggeration, since we won't skip groups of all-visible pages that don't exceed SKIP_PAGES_THRESHOLD blocks (32 blocks). -- Peter Geoghegan
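For anyone unfamiliar with SKIP_PAGES_THRESHOLD, the skipping behavior being referenced works roughly as follows (a simplified toy model, not the vacuumlazy.c implementation, which also distinguishes all-frozen pages and aggressive VACUUMs): a run of consecutive all-visible blocks is only skipped when the run is at least 32 blocks long, so short runs get scanned anyway.

#include <stdbool.h>
#include <stdio.h>

#define SKIP_PAGES_THRESHOLD 32

static int
count_scanned_blocks(const bool *all_visible, int nblocks)
{
    int scanned = 0;

    for (int b = 0; b < nblocks;)
    {
        if (!all_visible[b])
        {
            scanned++;
            b++;
            continue;
        }

        /* measure the run of consecutive all-visible blocks */
        int run = 0;
        while (b + run < nblocks && all_visible[b + run])
            run++;

        if (run < SKIP_PAGES_THRESHOLD)
            scanned += run;     /* too short to be worth skipping */

        b += run;
    }
    return scanned;
}

int
main(void)
{
    bool vm[100] = {false};

    /* a 20 block all-visible run (below the threshold): scanned anyway */
    for (int b = 10; b < 30; b++)
        vm[b] = true;
    /* a 46 block all-visible run (above the threshold): skipped */
    for (int b = 50; b < 96; b++)
        vm[b] = true;

    printf("scanned %d of 100 blocks\n", count_scanned_blocks(vm, 100));
    return 0;
}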
On Fri, Feb 4, 2022 at 10:21 PM Greg Stark <stark@mit.edu> wrote: > By far the majority of anti-wraparound vacuums are triggered by tables > that are very large and so don't trigger regular vacuums for "long > periods" of time and consistently hit the anti-wraparound threshold > first. That's interesting, because my experience is different. Most of the time when I get asked to look at a system, it turns out that there is a prepared transaction or a forgotten replication slot and nobody noticed until the system hit the wraparound threshold. Or occasionally a long-running transaction or a failing/stuck vacuum that has the same effect. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Feb 4, 2022 at 10:45 PM Peter Geoghegan <pg@bowt.ie> wrote: > > While I've seen all the above cases triggering anti-wraparound cases > > by far the majority of the cases are not of these pathological forms. > > Right - it's practically inevitable that you'll need an > anti-wraparound VACUUM to advance relfrozenxid right now. Technically > it's possible to advance relfrozenxid in any VACUUM, but in practice > it just never happens on a large table. You only need to get unlucky > with one heap page, either by failing to get a cleanup lock, or (more > likely) by setting even one single page all-visible but not all-frozen > just once (once in any VACUUM that takes place between anti-wraparound > VACUUMs). But ... if I'm not mistaken, in the kind of case that Greg is describing, relfrozenxid will be advanced exactly as often as it is today. That's because, if VACUUM is only ever getting triggered by XID age advancement and not by bloat, there's no opportunity for your patch set to advance relfrozenxid any sooner than we're doing now. So I think that people in this kind of situation will potentially be helped or hurt by other things the patch set does, but the eager relfrozenxid stuff won't make any difference for them. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Feb 7, 2022 at 10:08 AM Robert Haas <robertmhaas@gmail.com> wrote: > But ... if I'm not mistaken, in the kind of case that Greg is > describing, relfrozenxid will be advanced exactly as often as it is > today. But what happens today in a scenario like Greg's is pathological, despite being fairly common (common in large DBs). It doesn't seem informative to extrapolate too much from current experience for that reason. > That's because, if VACUUM is only ever getting triggered by XID > age advancement and not by bloat, there's no opportunity for your > patch set to advance relfrozenxid any sooner than we're doing now. We must distinguish between: 1. "VACUUM is fundamentally never going to need to run unless it is forced to, just to advance relfrozenxid" -- this applies to tables like the stock and customers tables from the benchmark. and: 2. "VACUUM must sometimes run to mark newly appended heap pages all-visible, and maybe to also remove dead tuples, but not that often -- and yet we currently only get expensive and inconveniently timed anti-wraparound VACUUMs, no matter what" -- this applies to all the other big tables in the benchmark, in particular to the orders and order lines tables, but also to simpler cases like pgbench_history. As I've said a few times now, the patch doesn't change anything for 1. But Greg's problem tables very much sound like they're from category 2. And what we see with the master branch for such tables is that they always get anti-wraparound VACUUMs, past a certain size (depends on things like exact XID rate and VACUUM settings, the insert-driven autovacuum scheduling stuff matters). The patch, meanwhile, never reaches that point in practice during my testing -- it doesn't even come close. It is true that in theory, as the size of one of these "category 2" tables tends to infinity, the patch ends up behaving the same as master anyway. But I'm pretty sure that that usually doesn't matter at all, or matters less than you'd think. As I emphasized when presenting the recent v7 TPC-C benchmark, neither of the two "TPC-C big problem tables" (which are particularly interesting/tricky examples of tables from category 2) comes close to getting an anti-wraparound VACUUM (plus, as I said in the same email, it wouldn't matter if they did). > So I think that people in this kind of situation will potentially be > helped or hurt by other things the patch set does, but the eager > relfrozenxid stuff won't make any difference for them. To be clear, I think it would if everything was in place, including the basic relfrozenxid advancement thing, plus the new freezing stuff (though you wouldn't need the experimental FSM thing to get this benefit). Here is a thought experiment that may make the general idea a bit clearer: Imagine I reran the same benchmark as before, with the same settings, and the expectation that everything would be the same as first time around for the patch series. But to make things more interesting, this time I add an adversarial element: I add an adversarial gizmo that burns XIDs steadily, without doing any useful work. This gizmo doubles the rate of XID consumption for the database as a whole, perhaps by calling "SELECT txid_current()" in a loop, followed by a timed sleep (with a delay chosen with the goal of doubling XID consumption).
I imagine that this would also burn CPU cycles, but probably not enough to make more than a noise level impact -- so we're severely stressing the implementation by adding this gizmo, but the stress is precisely targeted at XID consumption and related implementation details. It's a pretty clean experiment. What happens now? I believe (though haven't checked for myself) that nothing important would change. We'd still see the same VACUUM operations occur at approximately the same times (relative to the start of the benchmark) that we saw with the original benchmark, and each VACUUM operation would do approximately the same amount of physical work on each occasion. Of course, the autovacuum log output would show that the OldestXmin for each individual VACUUM operation had larger values than first time around for this newly initdb'd TPC-C database (purely as a consequence of the XID burning gizmo), but it would *also* show *concomitant* increases for our newly set relfrozenxid. The system should therefore hardly behave differently at all compared to the original benchmark run, despite this adversarial gizmo. It's fair to wonder: okay, but what if it was 4x, 8x, 16x? What then? That does get a bit more complicated, and we should get into why that is. But for now I'll just say that I think that even that kind of extreme would make much less difference than you might think -- since relfrozenxid advancement has been qualitatively improved by the patch series. It is especially likely that nothing would change if you were willing to increase autovacuum_freeze_max_age to get a bit more breathing room -- room to allow the autovacuums to run at their "natural" times. You wouldn't necessarily have to go too far -- the extra breathing room from increasing autovacuum_freeze_max_age buys more wall clock time *between* any two successive "naturally timed autovacuums". Again, a virtuous cycle. Does that make sense? It's pretty subtle, admittedly, and you no doubt have (very reasonable) concerns about the extremes, even if you accept all that. I just want to get the general idea across here, as a starting point for further discussion. -- Peter Geoghegan
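To make the thought experiment concrete, the XID-burning gizmo could be as simple as a libpq loop along these lines (a hypothetical test harness, not part of the patch series; the sleep would need tuning against the target XID rate):

/*
 * Hypothetical XID-burning gizmo: consume one XID per iteration by calling
 * txid_current(), sleeping in between.  Build with something like:
 *   cc xidburn.c -o xidburn -I$(pg_config --includedir) \
 *      -L$(pg_config --libdir) -lpq
 */
#include <stdio.h>
#include <unistd.h>

#include <libpq-fe.h>

int
main(void)
{
    PGconn *conn = PQconnectdb("");     /* use PG* environment variables */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    for (;;)
    {
        PGresult *res = PQexec(conn, "SELECT txid_current()");

        if (PQresultStatus(res) != PGRES_TUPLES_OK)
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        PQclear(res);

        usleep(10 * 1000);      /* tune the delay to hit the target XID rate */
    }

    PQfinish(conn);             /* not reached */
    return 0;
}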
On Mon, Feb 7, 2022 at 11:43 AM Peter Geoghegan <pg@bowt.ie> wrote: > > That's because, if VACUUM is only ever getting triggered by XID > > age advancement and not by bloat, there's no opportunity for your > > patch set to advance relfrozenxid any sooner than we're doing now. > > We must distinguish between: > > 1. "VACUUM is fundamentally never going to need to run unless it is > forced to, just to advance relfrozenxid" -- this applies to tables > like the stock and customers tables from the benchmark. > > and: > > 2. "VACUUM must sometimes run to mark newly appended heap pages > all-visible, and maybe to also remove dead tuples, but not that often > -- and yet we current only get expensive and inconveniently timed > anti-wraparound VACUUMs, no matter what" -- this applies to all the > other big tables in the benchmark, in particular to the orders and > order lines tables, but also to simpler cases like pgbench_history. It's not really very understandable for me when you refer to the way table X behaves in Y benchmark, because I haven't studied that in enough detail to know. If you say things like insert-only table, or a continuous-random-updates table, or whatever the case is, it's a lot easier to wrap my head around it. > Does that make sense? It's pretty subtle, admittedly, and you no doubt > have (very reasonable) concerns about the extremes, even if you accept > all that. I just want to get the general idea across here, as a > starting point for further discussion. Not really. I think you *might* be saying tables which currently get only wraparound vacuums will end up getting other kinds of vacuums with your patch because things will improve enough for other tables in the system that they will be able to get more attention than they do currently. But I'm not sure I am understanding you correctly, and even if I am I don't understand why that would be so, and even if it is I think it doesn't help if essentially all the tables in the system are suffering from the problem. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Feb 7, 2022 at 12:21 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Feb 7, 2022 at 11:43 AM Peter Geoghegan <pg@bowt.ie> wrote: > > > That's because, if VACUUM is only ever getting triggered by XID > > > age advancement and not by bloat, there's no opportunity for your > > > patch set to advance relfrozenxid any sooner than we're doing now. > > > > We must distinguish between: > > > > 1. "VACUUM is fundamentally never going to need to run unless it is > > forced to, just to advance relfrozenxid" -- this applies to tables > > like the stock and customers tables from the benchmark. > > > > and: > > > > 2. "VACUUM must sometimes run to mark newly appended heap pages > > all-visible, and maybe to also remove dead tuples, but not that often > > -- and yet we current only get expensive and inconveniently timed > > anti-wraparound VACUUMs, no matter what" -- this applies to all the > > other big tables in the benchmark, in particular to the orders and > > order lines tables, but also to simpler cases like pgbench_history. > > It's not really very understandable for me when you refer to the way > table X behaves in Y benchmark, because I haven't studied that in > enough detail to know. If you say things like insert-only table, or a > continuous-random-updates table, or whatever the case is, it's a lot > easier to wrap my head around it. What I've called category 2 tables are the vast majority of big tables in practice. They include pure append-only tables, but also tables that grow and grow from inserts, but also have some updates. The point of the TPC-C order + order lines examples was to show how broad the category really is. And how mixtures of inserts and bloat from updates on one single table confuse the implementation in general. > > Does that make sense? It's pretty subtle, admittedly, and you no doubt > > have (very reasonable) concerns about the extremes, even if you accept > > all that. I just want to get the general idea across here, as a > > starting point for further discussion. > > Not really. I think you *might* be saying tables which currently get > only wraparound vacuums will end up getting other kinds of vacuums > with your patch because things will improve enough for other tables in > the system that they will be able to get more attention than they do > currently. Yes, I am. > But I'm not sure I am understanding you correctly, and even > if I am I don't understand why that would be so, and even if it is I > think it doesn't help if essentially all the tables in the system are > suffering from the problem. When I say "relfrozenxid advancement has been qualitatively improved by the patch", what I mean is that the rate of relfrozenxid advancement is now far closer to the theoretically optimal rate for our current design, with freezing and with 32-bit XIDs, and with the invariants for freezing. Consider the extreme case, and generalize. In the simple append-only table case, it is most obvious. The final relfrozenxid is very close to OldestXmin (only tiny noise level differences appear), regardless of XID consumption by the system in general, and even within the append-only table in particular. Other cases are somewhat trickier, but have roughly the same quality, to a surprising degree. Lots of things that never really should have affected relfrozenxid to begin with do not, for the first time. -- Peter Geoghegan
On Sat, Jan 29, 2022 at 8:42 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v7, a revision that overhauls the algorithm that decides > what to freeze. I'm now calling it block-driven freezing in the commit > message. Also included is a new patch, that makes VACUUM record zero > free space in the FSM for an all-visible page, unless the total amount > of free space happens to be greater than one half of BLCKSZ. I pushed the earlier refactoring and instrumentation patches today. Attached is v8. No real changes -- just a rebased version. It will be easier to benchmark and test the page-driven freezing stuff now, since the master/baseline case will now output instrumentation showing how relfrozenxid has been advanced (if at all) -- whether (and to what extent) each VACUUM operation advances relfrozenxid can now be directly compared, just by monitoring the log_autovacuum_min_duration output for a given table over time. -- Peter Geoghegan
Attachment
On Fri, Feb 11, 2022 at 8:30 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v8. No real changes -- just a rebased version. Concerns about my general approach to this project (and even the Postgres 14 VACUUM work) were expressed by Robert and Andres over on the "Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM" thread. Some of what was said honestly shocked me. It now seems unwise to pursue this project on my original timeline. I even thought about shelving it indefinitely (which is still on the table). I propose the following compromise: the least contentious patch alone will be in scope for Postgres 15, while the other patches will not be. I'm referring to the first patch from v8, which adds dynamic tracking of the oldest extant XID in each heap table, in order to be able to use it as our new relfrozenxid. I can't imagine that I'll have difficulty convincing Andres of the merits of this idea, for one, since it was his idea in the first place. It makes a lot of sense, independent of any change to how and when we freeze. The first patch is tricky, but at least it won't require elaborate performance validation. It doesn't change any of the basic performance characteristics of VACUUM. It sometimes allows us to advance relfrozenxid to a value beyond FreezeLimit (typically only possible in an aggressive VACUUM), which is an intrinsic good. If it isn't effective then the overhead seems very unlikely to be noticeable. It's pretty much a strictly additive improvement. Are there any objections to this plan? -- Peter Geoghegan
On Fri, Feb 18, 2022 at 3:41 PM Peter Geoghegan <pg@bowt.ie> wrote: > Concerns about my general approach to this project (and even the > Postgres 14 VACUUM work) were expressed by Robert and Andres over on > the "Nonrandom scanned_pages distorts pg_class.reltuples set by > VACUUM" thread. Some of what was said honestly shocked me. It now > seems unwise to pursue this project on my original timeline. I even > thought about shelving it indefinitely (which is still on the table). > > I propose the following compromise: the least contentious patch alone > will be in scope for Postgres 15, while the other patches will not be. > I'm referring to the first patch from v8, which adds dynamic tracking > of the oldest extant XID in each heap table, in order to be able to > use it as our new relfrozenxid. I can't imagine that I'll have > difficulty convincing Andres of the merits of this idea, for one, > since it was his idea in the first place. It makes a lot of sense, > independent of any change to how and when we freeze. > > The first patch is tricky, but at least it won't require elaborate > performance validation. It doesn't change any of the basic performance > characteristics of VACUUM. It sometimes allows us to advance > relfrozenxid to a value beyond FreezeLimit (typically only possible in > an aggressive VACUUM), which is an intrinsic good. If it isn't > effective then the overhead seems very unlikely to be noticeable. It's > pretty much a strictly additive improvement. > > Are there any objections to this plan? I really like the idea of reducing the scope of what is being changed here, and I agree that eagerly advancing relfrozenxid carries much less risk than the other changes. I'd like to have a clearer idea of exactly what is in each of the remaining patches before forming a final opinion. What's tricky about 0001? Does it change any other behavior, either as a necessary component of advancing relfrozenxid more eagerly, or otherwise? If there's a way you can make the precise contents of 0002 and 0003 more clear, I would like that, too. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Feb 18, 2022 at 12:54 PM Robert Haas <robertmhaas@gmail.com> wrote: > I'd like to have a clearer idea of exactly what is in each of the > remaining patches before forming a final opinion. Great. > What's tricky about 0001? Does it change any other behavior, either as > a necessary component of advancing relfrozenxid more eagerly, or > otherwise? It does not change any other behavior. It's totally mechanical. 0001 is tricky in the sense that there are a lot of fine details, and if you get any one of them wrong the result might be a subtle bug. For example, the heap_tuple_needs_freeze() code path is only used when we cannot get a cleanup lock, which is rare -- and some of the branches within the function are relatively rare themselves. The obvious concern is: What if some detail of how we track the new relfrozenxid value (and new relminmxid value) in this seldom-hit codepath is just wrong, in whatever way we didn't think of? On the other hand, we must already be precise in almost the same way within heap_tuple_needs_freeze() today -- it's not all that different (we currently need to avoid leaving any XIDs < FreezeLimit behind, which isn't made that less complicated by the fact that it's a static XID cutoff). Plus, we have experience with bugs like this. There was hardening added to catch stuff like this back in 2017, following the "freeze the dead" bug. > If there's a way you can make the precise contents of 0002 and 0003 > more clear, I would like that, too. The really big one is 0002 -- even 0003 (the FSM PageIsAllVisible() thing) wasn't on the table before now. 0002 is the patch that changes the basic criteria for freezing, making it block-based rather than based on the FreezeLimit cutoff (barring edge cases that are important for correctness, but shouldn't noticeably affect freezing overhead). The single biggest practical improvement from 0002 is that it eliminates what I've called the freeze cliff, which is where many old tuples (much older than FreezeLimit/vacuum_freeze_min_age) must be frozen all at once, in a balloon payment during an eventual aggressive VACUUM. Although it's easy to see that that could be useful, it is harder to justify (much harder) than anything else. Because we're freezing more eagerly overall, we're also bound to do more freezing without benefit in certain cases. Although I think that this can be justified as the cost of doing business, that's a hard argument to make. In short, 0001 is mechanically tricky, but easy to understand at a high level. Whereas 0002 is mechanically simple, but tricky to understand at a high level (and therefore far trickier than 0001 overall). -- Peter Geoghegan
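To spell out the bookkeeping that makes the no-cleanup-lock path tricky, here is a toy model of the contract (illustration only -- the real heap_tuple_needs_freeze() also handles MultiXacts, and real XID comparisons are circular): the function reports whether an aggressive VACUUM would be forced to freeze the tuple, while ratcheting back the caller's running target relfrozenxid on the assumption that nothing will actually be frozen.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t ToyXid;

/* xmax == 0 stands in for "no xmax"; '<' ignores wraparound */
static bool
toy_tuple_needs_freeze(ToyXid xmin, ToyXid xmax, ToyXid freeze_limit,
                       ToyXid *new_relfrozenxid)
{
    bool needs_freeze = false;

    /* xmin will be left behind unfrozen, so it bounds new_relfrozenxid */
    if (xmin < *new_relfrozenxid)
        *new_relfrozenxid = xmin;
    if (xmin < freeze_limit)
        needs_freeze = true;

    /* xmax must be considered independently -- xmin may already be frozen */
    if (xmax != 0)
    {
        if (xmax < *new_relfrozenxid)
            *new_relfrozenxid = xmax;
        if (xmax < freeze_limit)
            needs_freeze = true;
    }

    return needs_freeze;
}

int
main(void)
{
    ToyXid new_relfrozenxid = 500000;   /* start at OldestXmin */

    /* old xmin: an aggressive VACUUM would have to wait for a cleanup lock */
    bool needs = toy_tuple_needs_freeze(120000, 0, 300000, &new_relfrozenxid);

    printf("needs_freeze=%d new_relfrozenxid=%u\n",
           needs, (unsigned) new_relfrozenxid);
    return 0;
}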
On Fri, Feb 18, 2022 at 4:10 PM Peter Geoghegan <pg@bowt.ie> wrote: > It does not change any other behavior. It's totally mechanical. > > 0001 is tricky in the sense that there are a lot of fine details, and > if you get any one of them wrong the result might be a subtle bug. For > example, the heap_tuple_needs_freeze() code path is only used when we > cannot get a cleanup lock, which is rare -- and some of the branches > within the function are relatively rare themselves. The obvious > concern is: What if some detail of how we track the new relfrozenxid > value (and new relminmxid value) in this seldom-hit codepath is just > wrong, in whatever way we didn't think of? Right. I think we have no choice but to accept such risks if we want to make any progress here, and every patch carries them to some degree. I hope that someone else will review this patch in more depth than I have just now, but what I notice reading through it is that some of the comments seem pretty opaque. For instance: + * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current + * target relfrozenxid and relminmxid for the relation. Assumption is that "maintains" is fuzzy. I think you should be saying something much more explicit, and the thing you are saying should make it clear that these arguments are input-output arguments: i.e. the caller must set them correctly before calling this function, and they will be updated by the function. I don't think you have to spell all of that out in every place where this comes up in the patch, but it needs to be clear from what you do say. For example, I would be happier with a comment that said something like "Every call to this function will either set HEAP_XMIN_FROZEN in the xl_heap_freeze_tuple struct passed as an argument, or else reduce *NewRelfrozenxid to the xmin of the tuple if it is currently newer than that. Thus, after a series of calls to this function, *NewRelfrozenxid represents a lower bound on unfrozen xmin values in the tuples examined. Before calling this function, caller should initialize *NewRelfrozenxid to <something>." + * Changing nothing, so might have to ratchet back NewRelminmxid, + * NewRelfrozenxid, or both together This comment I like. + * New multixact might have remaining XID older than + * NewRelfrozenxid This one's good, too. + * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current + * target relfrozenxid and relminmxid for the relation. Assumption is that + * caller will never freeze any of the XIDs from the tuple, even when we say + * that they should. If caller opts to go with our recommendation to freeze, + * then it must account for the fact that it shouldn't trust how we've set + * NewRelfrozenxid/NewRelminmxid. (In practice aggressive VACUUMs always take + * our recommendation because they must, and non-aggressive VACUUMs always opt + * to not freeze, preferring to ratchet back NewRelfrozenxid instead). I don't understand this one. + * (Actually, we maintain NewRelminmxid differently here, because we + * assume that XIDs that should be frozen according to cutoff_xid won't + * be, whereas heap_prepare_freeze_tuple makes the opposite assumption.) This one either. I haven't really grokked exactly what is happening in heap_tuple_needs_freeze yet, and may not have time to study it further in the near future. Not saying it's wrong, although improving the comments above would likely help me out. > > If there's a way you can make the precise contents of 0002 and 0003 > > more clear, I would like that, too. 
> > The really big one is 0002 -- even 0003 (the FSM PageIsAllVisible() > thing) wasn't on the table before now. 0002 is the patch that changes > the basic criteria for freezing, making it block-based rather than > based on the FreezeLimit cutoff (barring edge cases that are important > for correctness, but shouldn't noticeably affect freezing overhead). > > The single biggest practical improvement from 0002 is that it > eliminates what I've called the freeze cliff, which is where many old > tuples (much older than FreezeLimit/vacuum_freeze_min_age) must be > frozen all at once, in a balloon payment during an eventual aggressive > VACUUM. Although it's easy to see that that could be useful, it is > harder to justify (much harder) than anything else. Because we're > freezing more eagerly overall, we're also bound to do more freezing > without benefit in certain cases. Although I think that this can be > justified as the cost of doing business, that's a hard argument to > make. You've used the term "freezing cliff" repeatedly in earlier emails, and this is the first time I've been able to understand what you meant. I'm glad I do, now. But can you describe the algorithm that 0002 uses to accomplish this improvement? Like "if it sees that the page meets criteria X, then it freezes all tuples on the page, else if it sees that individual tuples on the page meet criteria Y, then it freezes just those." And like explain what of that is same/different vs. now. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2022-02-18 13:09:45 -0800, Peter Geoghegan wrote: > 0001 is tricky in the sense that there are a lot of fine details, and > if you get any one of them wrong the result might be a subtle bug. For > example, the heap_tuple_needs_freeze() code path is only used when we > cannot get a cleanup lock, which is rare -- and some of the branches > within the function are relatively rare themselves. The obvious > concern is: What if some detail of how we track the new relfrozenxid > value (and new relminmxid value) in this seldom-hit codepath is just > wrong, in whatever way we didn't think of? I think it'd be good to add a few isolationtest cases for the can't-get-cleanup-lock paths. I think it shouldn't be hard using cursors. The slightly harder part is verifying that VACUUM did something reasonable, but that still should be doable? Greetings, Andres Freund
Hi, On 2022-02-18 15:54:19 -0500, Robert Haas wrote: > > Are there any objections to this plan? > > I really like the idea of reducing the scope of what is being changed > here, and I agree that eagerly advancing relfrozenxid carries much > less risk than the other changes. Sounds good to me too! Greetings, Andres Freund
On Fri, Feb 18, 2022 at 1:56 PM Robert Haas <robertmhaas@gmail.com> wrote: > + * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current > + * target relfrozenxid and relminmxid for the relation. Assumption is that > > "maintains" is fuzzy. I think you should be saying something much more > explicit, and the thing you are saying should make it clear that these > arguments are input-output arguments: i.e. the caller must set them > correctly before calling this function, and they will be updated by > the function. Makes sense. > I don't think you have to spell all of that out in every > place where this comes up in the patch, but it needs to be clear from > what you do say. For example, I would be happier with a comment that > said something like "Every call to this function will either set > HEAP_XMIN_FROZEN in the xl_heap_freeze_tuple struct passed as an > argument, or else reduce *NewRelfrozenxid to the xmin of the tuple if > it is currently newer than that. Thus, after a series of calls to this > function, *NewRelfrozenxid represents a lower bound on unfrozen xmin > values in the tuples examined. Before calling this function, caller > should initialize *NewRelfrozenxid to <something>." We have to worry about XIDs from MultiXacts (and xmax values more generally). And we have to worry about the case where we start out with only xmin frozen (by an earlier VACUUM), and then have to freeze xmax too. I believe that we have to generally consider xmin and xmax independently. For example, we cannot ignore xmax, just because we looked at xmin, since in general xmin alone might have already been frozen. > + * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current > + * target relfrozenxid and relminmxid for the relation. Assumption is that > + * caller will never freeze any of the XIDs from the tuple, even when we say > + * that they should. If caller opts to go with our recommendation to freeze, > + * then it must account for the fact that it shouldn't trust how we've set > + * NewRelfrozenxid/NewRelminmxid. (In practice aggressive VACUUMs always take > + * our recommendation because they must, and non-aggressive VACUUMs always opt > + * to not freeze, preferring to ratchet back NewRelfrozenxid instead). > > I don't understand this one. > > + * (Actually, we maintain NewRelminmxid differently here, because we > + * assume that XIDs that should be frozen according to cutoff_xid won't > + * be, whereas heap_prepare_freeze_tuple makes the opposite assumption.) > > This one either. The difference between the cleanup lock path (in lazy_scan_prune/heap_prepare_freeze_tuple) and the share lock path (in lazy_scan_noprune/heap_tuple_needs_freeze) is what is at issue in both of these confusing comment blocks, really. Note that cutoff_xid is the name that both heap_prepare_freeze_tuple and heap_tuple_needs_freeze have for FreezeLimit (maybe we should rename every occurrence of cutoff_xid in heapam.c to FreezeLimit). At a high level, we aren't changing the fundamental definition of an aggressive VACUUM in any of the patches -- we still need to advance relfrozenxid up to FreezeLimit in an aggressive VACUUM, just like on HEAD, today (we may be able to advance it *past* FreezeLimit, but that's just a bonus).
But in a non-aggressive VACUUM, where there is still no strict requirement to advance relfrozenxid (by any amount), the code added by 0001 can set relfrozenxid to any known safe value, which could either be from before FreezeLimit, or after FreezeLimit -- almost anything is possible (provided we respect the relfrozenxid invariant, and provided we see that we didn't skip any all-visible-not-all-frozen pages). Since we still need to "respect FreezeLimit" in an aggressive VACUUM, the aggressive case might need to wait for a full cleanup lock the hard way, having tried and failed to do it the easy way within lazy_scan_noprune (lazy_scan_noprune will still return false when any call to heap_tuple_needs_freeze for any tuple returns true) -- same as on HEAD, today. And so the difference at issue here is: FreezeLimit/cutoff_xid only needs to affect the new NewRelfrozenxid value we use for relfrozenxid in heap_prepare_freeze_tuple, which is involved in real freezing -- not in heap_tuple_needs_freeze, whose main purpose is still to help us avoid freezing where a cleanup lock isn't immediately available. Meanwhile, the purpose of FreezeLimit/cutoff_xid within heap_tuple_needs_freeze is to determine its bool return value, which will only be of interest to the aggressive case (which might have to get a cleanup lock and do it the hard way), not the non-aggressive case (where ratcheting back NewRelfrozenxid is generally possible, and generally leaves us with almost as good of a value). In other words: the calls to heap_tuple_needs_freeze made from lazy_scan_noprune are simply concerned with the page as it actually is, whereas the similar/corresponding calls to heap_prepare_freeze_tuple from lazy_scan_prune are concerned with *what the page will actually become*, after freezing finishes, and after lazy_scan_prune is done with the page entirely (ultimately the final NewRelfrozenxid value set in pg_class.relfrozenxid only has to be <= the oldest extant XID *at the time the VACUUM operation is just about to end*, not some earlier time, so "being versus becoming" is an interesting distinction for us). Maybe the way that FreezeLimit/cutoff_xid is overloaded can be fixed here, to make all of this less confusing. I only now fully realized how confusing all of this stuff is -- very. > I haven't really grokked exactly what is happening in > heap_tuple_needs_freeze yet, and may not have time to study it further > in the near future. Not saying it's wrong, although improving the > comments above would likely help me out. Definitely needs more polishing. > You've used the term "freezing cliff" repeatedly in earlier emails, > and this is the first time I've been able to understand what you > meant. I'm glad I do, now. Ugh. I thought that a snappy term like that would catch on quickly. Guess not! > But can you describe the algorithm that 0002 uses to accomplish this > improvement? Like "if it sees that the page meets criteria X, then it > freezes all tuples on the page, else if it sees that that individual > tuples on the page meet criteria Y, then it freezes just those." And > like explain what of that is same/different vs. now. The mechanics themselves are quite simple (again, understanding the implications is the hard part). The approach taken within 0002 is still rough, to be honest, but wouldn't take long to clean up (there are XXX/FIXME comments about this in 0002). As a general rule, we try to freeze all of the remaining live tuples on a page (following pruning) together, as a group, or none at all.
Most of the time this is triggered by our noticing that the page is about to be set all-visible (but not all-frozen), and doing work sufficient to mark it fully all-frozen instead. Occasionally there is FreezeLimit to consider, which is now more of a backstop thing, used to make sure that we never get too far behind in terms of unfrozen XIDs. This is useful in part because it avoids any future non-aggressive VACUUM that is fundamentally unable to advance relfrozenxid (you can't skip all-visible pages if there are only all-frozen pages in the VM in practice). We're generally doing a lot more freezing with 0002, but we still manage to avoid freezing too much in tables like pgbench_tellers or pgbench_branches -- tables where it makes the least sense. Such tables will be updated so frequently that VACUUM is relatively unlikely to ever mark any page all-visible, avoiding the main criterion for freezing implicitly. It's also unlikely that they'll ever have an XID that is old enough to trigger the fallback FreezeLimit-style criterion for freezing. In practice, freezing tuples like this is generally not that expensive in most tables where VACUUM freezes the majority of pages immediately (tables that aren't like pgbench_tellers or pgbench_branches), because they're generally big tables, where the overhead of FPIs tends to dominate anyway (gambling that we can avoid more FPIs later on is not a bad gamble, as gambles go). This seems to make the overhead acceptable, on balance. Granted, you might be able to poke holes in that argument, and reasonable people might disagree on what acceptable should mean. There are many value judgements here, which makes it complicated. (On the other hand we might be able to do better if there was a particularly bad case for the 0002 work, if one came to light.) -- Peter Geoghegan
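[Editor's note: for concreteness, the all-or-nothing page-level policy described in the message above might be sketched roughly as follows. The function and parameter names are illustrative assumptions, not names from the actual 0002 patch.]

    /*
     * Sketch only: freeze every remaining live tuple on the page, or none at
     * all.  The caller is assumed to have already worked out whether the page
     * would become all-visible (and all-frozen) after this round of
     * pruning/freezing, and whether any XID/MXID on the page is older than
     * the FreezeLimit/MultiXactCutoff backstop.
     */
    static bool
    should_freeze_page(bool will_become_all_visible,
                       bool will_become_all_frozen,
                       bool has_xids_past_backstop_cutoff)
    {
        /* Freeze when that lets the page be set all-frozen, not just all-visible */
        if (will_become_all_visible && !will_become_all_frozen)
            return true;

        /* Backstop: never fall too far behind FreezeLimit/MultiXactCutoff */
        if (has_xids_past_backstop_cutoff)
            return true;

        /* Otherwise, freeze nothing at all on this page */
        return false;
    }

In this reading, the first test does almost all of the work in practice; the backstop test only matters for pages that keep failing to become all-visible.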
On Fri, Feb 18, 2022 at 2:11 PM Andres Freund <andres@anarazel.de> wrote: > I think it'd be good to add a few isolationtest cases for the > can't-get-cleanup-lock paths. I think it shouldn't be hard using cursors. The > slightly harder part is verifying that VACUUM did something reasonable, but > that still should be doable? We could even just extend existing, related tests, from vacuum-reltuples.spec. Another testing strategy occurs to me: we could stress-test the implementation by simulating an environment where the no-cleanup-lock path is hit an unusually large number of times, possibly a fixed percentage of the time (like 1%, 5%), say by making vacuumlazy.c's ConditionalLockBufferForCleanup() call return false randomly. Now that we have lazy_scan_noprune for the no-cleanup-lock path (which is as similar to the regular lazy_scan_prune path as possible), I wouldn't expect this ConditionalLockBufferForCleanup() testing gizmo to be too disruptive. -- Peter Geoghegan
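[Editor's note: the kind of testing gizmo being proposed here could presumably look something like the following; this is only a sketch of the idea with an assumed wrapper name and failure rate, not the patch that gets attached later in the thread.]

    #include "postgres.h"

    #include "storage/bufmgr.h"

    /*
     * Adversarial wrapper for vacuumlazy.c (assumed name): report that the
     * conditional cleanup lock attempt failed some fixed percentage of the
     * time, so that the lazy_scan_noprune path is exercised far more often
     * than it would be naturally.
     */
    static bool
    AdversarialConditionalLockBufferForCleanup(Buffer buffer)
    {
        /* Simulate contention: pretend the cleanup lock wasn't free 2% of the time */
        if (random() % 100 < 2)
            return false;

        return ConditionalLockBufferForCleanup(buffer);
    }

lazy_scan_heap would then call the wrapper in place of its existing ConditionalLockBufferForCleanup() call while the stress test runs.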
On Fri, Feb 18, 2022 at 5:00 PM Peter Geoghegan <pg@bowt.ie> wrote: > Another testing strategy occurs to me: we could stress-test the > implementation by simulating an environment where the no-cleanup-lock > path is hit an unusually large number of times, possibly a fixed > percentage of the time (like 1%, 5%), say by making vacuumlazy.c's > ConditionalLockBufferForCleanup() call return false randomly. Now that > we have lazy_scan_noprune for the no-cleanup-lock path (which is as > similar to the regular lazy_scan_prune path as possible), I wouldn't > expect this ConditionalLockBufferForCleanup() testing gizmo to be too > disruptive. I tried this out, using the attached patch. It was quite interesting, even when run against HEAD. I think that I might have found a bug on HEAD, though I'm not really sure. If you modify the patch to simulate conditions under which ConditionalLockBufferForCleanup() fails about 2% of the time, you get much better coverage of lazy_scan_noprune/heap_tuple_needs_freeze, without it being so aggressive as to make "make check-world" fail -- which is exactly what I expected. If you are much more aggressive about it, and make it 50% instead (which you can get just by using the patch as written), then some tests will fail, mostly for reasons that aren't surprising or interesting (e.g. plan changes). This is also what I'd have guessed would happen. However, it gets more interesting. One thing that I did not expect to happen at all also happened (with the current 50% rate of simulated ConditionalLockBufferForCleanup() failure from the patch): if I run "make check" from the pg_surgery directory, then the Postgres backend gets stuck in an infinite loop inside lazy_scan_prune, which has been a symptom of several tricky bugs in the past year (not every time, but usually). Specifically, the VACUUM statement launched by the SQL command "vacuum freeze htab2;" from the file contrib/pg_surgery/sql/heap_surgery.sql, at line 54 leads to this misbehavior. This is a temp table, which is a choice made by the tests specifically because they need to "use a temp table so that vacuum behavior doesn't depend on global xmin". This is a convenient way of avoiding spurious regression test failures (e.g. from autoanalyze), and relies on the GlobalVisTempRels behavior established by Andres' 2020 bugfix commit 94bc27b5. It's quite possible that this is nothing more than a bug in my adversarial gizmo patch -- since I don't think that ConditionalLockBufferForCleanup() can ever fail with a temp buffer (though even that's not completely clear right now). Even if the behavior that I saw does not indicate a bug on HEAD, it still seems informative. At the very least, it wouldn't hurt to Assert() that the target table isn't a temp table inside lazy_scan_noprune, documenting our assumptions around temp tables and ConditionalLockBufferForCleanup(). I haven't actually tried to debug the issue just yet, so take all this with a grain of salt. -- Peter Geoghegan
Attachment
Hi, (On phone, so crappy formatting and no source access) On February 19, 2022 3:08:41 PM PST, Peter Geoghegan <pg@bowt.ie> wrote: >On Fri, Feb 18, 2022 at 5:00 PM Peter Geoghegan <pg@bowt.ie> wrote: >> Another testing strategy occurs to me: we could stress-test the >> implementation by simulating an environment where the no-cleanup-lock >> path is hit an unusually large number of times, possibly a fixed >> percentage of the time (like 1%, 5%), say by making vacuumlazy.c's >> ConditionalLockBufferForCleanup() call return false randomly. Now that >> we have lazy_scan_noprune for the no-cleanup-lock path (which is as >> similar to the regular lazy_scan_prune path as possible), I wouldn't >> expect this ConditionalLockBufferForCleanup() testing gizmo to be too >> disruptive. > >I tried this out, using the attached patch. It was quite interesting, >even when run against HEAD. I think that I might have found a bug on >HEAD, though I'm not really sure. > >If you modify the patch to simulate conditions under which >ConditionalLockBufferForCleanup() fails about 2% of the time, you get >much better coverage of lazy_scan_noprune/heap_tuple_needs_freeze, >without it being so aggressive as to make "make check-world" fail -- >which is exactly what I expected. If you are much more aggressive >about it, and make it 50% instead (which you can get just by using the >patch as written), then some tests will fail, mostly for reasons that >aren't surprising or interesting (e.g. plan changes). This is also >what I'd have guessed would happen. > >However, it gets more interesting. One thing that I did not expect to >happen at all also happened (with the current 50% rate of simulated >ConditionalLockBufferForCleanup() failure from the patch): if I run >"make check" from the pg_surgery directory, then the Postgres backend >gets stuck in an infinite loop inside lazy_scan_prune, which has been >a symptom of several tricky bugs in the past year (not every time, but >usually). Specifically, the VACUUM statement launched by the SQL >command "vacuum freeze htab2;" from the file >contrib/pg_surgery/sql/heap_surgery.sql, at line 54 leads to this >misbehavior. >This is a temp table, which is a choice made by the tests specifically >because they need to "use a temp table so that vacuum behavior doesn't >depend on global xmin". This is a convenient way of avoiding spurious >regression test failures (e.g. from autoanalyze), and relies on the >GlobalVisTempRels behavior established by Andres' 2020 bugfix commit >94bc27b5. We don't have a blocking path for cleanup locks of temporary buffers IIRC (normally not reachable). So I wouldn't be surprised if a cleanup lock failing would cause some odd behavior. >It's quite possible that this is nothing more than a bug in my >adversarial gizmo patch -- since I don't think that >ConditionalLockBufferForCleanup() can ever fail with a temp buffer >(though even that's not completely clear right now). Even if the >behavior that I saw does not indicate a bug on HEAD, it still seems >informative. At the very least, it wouldn't hurt to Assert() that the >target table isn't a temp table inside lazy_scan_noprune, documenting >our assumptions around temp tables and >ConditionalLockBufferForCleanup(). Definitely worth looking into more. This reminds me of a recent thing I noticed in the aio patch. Spgist can end up busy looping when buffers are locked, instead of blocking. Not actually related, of course. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Sat, Feb 19, 2022 at 3:08 PM Peter Geoghegan <pg@bowt.ie> wrote: > It's quite possible that this is nothing more than a bug in my > adversarial gizmo patch -- since I don't think that > ConditionalLockBufferForCleanup() can ever fail with a temp buffer > (though even that's not completely clear right now). Even if the > behavior that I saw does not indicate a bug on HEAD, it still seems > informative. This very much looks like a bug in pg_surgery itself now -- attached is a draft fix. The temp table thing was a red herring. I found I could get exactly the same kind of failure when htab2 was a permanent table (which was how it originally appeared, before commit 0811f766fd made it into a temp table due to test flappiness issues). The relevant "vacuum freeze htab2" happens at a point after the test has already deliberately corrupted one of its tuples using heap_force_kill(). It's not that we aren't careful enough about the corruption at some point in vacuumlazy.c, which was my second theory. But I quickly discarded that idea, and came up with a third theory: the relevant heap_surgery.c path does the relevant ItemIdSetDead() to kill items, without also defragmenting the page to remove the tuples with storage, which is wrong. This meant that we depended on pruning happening (in this case during VACUUM) and defragmenting the page in passing. But there is no reason to not defragment the page within pg_surgery (at least no obvious reason), since we have a cleanup lock anyway. Theoretically you could blame this on lazy_scan_noprune instead, since it thinks it can collect LP_DEAD items while assuming that they have no storage, but that doesn't make much sense to me. There has never been any way of setting a heap item to LP_DEAD without also defragmenting the page. Since that's exactly what it means to prune a heap page. (Actually, the same used to be true about heap vacuuming, which worked more like heap pruning before Postgres 14, but that doesn't seem important.) -- Peter Geoghegan
Attachment
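[Editor's note: the draft fix described in the message above might amount to something like the following inside pg_surgery's heap_force_common(); this is only a sketch of the idea, not the attached patch, and "did_modify_page", "page" and "buf" are assumed names for the function's existing local state.]

        if (did_modify_page)
        {
            /*
             * We just set one or more items LP_DEAD while holding a cleanup
             * lock, so also remove the storage of the affected tuples here,
             * rather than leaving that to a later prune of the page.
             */
            PageRepairFragmentation(page);

            MarkBufferDirty(buf);
            /* the existing code already WAL-logs the page change after this */
        }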
On Sat, Feb 19, 2022 at 4:22 PM Peter Geoghegan <pg@bowt.ie> wrote: > This very much looks like a bug in pg_surgery itself now -- attached > is a draft fix. Wait, that's not it either. I jumped the gun -- this isn't sufficient (though the patch I posted might not be a bad idea anyway). Looks like pg_surgery isn't processing HOT chains as whole units, which it really should (at least in the context of killing items via the heap_force_kill() function). Killing a root item in a HOT chain is just hazardous -- disconnected/orphaned heap-only tuples are liable to cause chaos, and should be avoided everywhere (including during pruning, and within pg_surgery). It's likely that the hardening I already planned on adding to pruning [1] (as follow-up work to recent bugfix commit 18b87b201f) will prevent lazy_scan_prune from getting stuck like this, whatever the cause happens to be. The actual page image I see lazy_scan_prune choke on (i.e. exhibit the same infinite loop unpleasantness we've seen before on) is not in a consistent state at all (its tuples consist of tuples from a single HOT chain, and the HOT chain is totally inconsistent on account of having an LP_DEAD line pointer root item). pg_surgery could in principle do the right thing here by always treating HOT chains as whole units. Leaving behind disconnected/orphaned heap-only tuples is pretty much pointless anyway, since they'll never be accessible by index scans. Even after a REINDEX, since there is no root item from the heap page to go in the index. (A dump and restore might work better, though.) [1] https://postgr.es/m/CAH2-WzmNk6V6tqzuuabxoxM8HJRaWU6h12toaS-bqYcLiht16A@mail.gmail.com -- Peter Geoghegan
Hi, On 2022-02-19 17:22:33 -0800, Peter Geoghegan wrote: > Looks like pg_surgery isn't processing HOT chains as whole units, > which it really should (at least in the context of killing items via > the heap_force_kill() function). Killing a root item in a HOT chain is > just hazardous -- disconnected/orphaned heap-only tuples are liable to > cause chaos, and should be avoided everywhere (including during > pruning, and within pg_surgery). How does that cause the endless loop? It doesn't do so on HEAD + 0001-Add-adversarial-ConditionalLockBuff[...] for me. So something must have changed with your patch? > It's likely that the hardening I already planned on adding to pruning > [1] (as follow-up work to recent bugfix commit 18b87b201f) will > prevent lazy_scan_prune from getting stuck like this, whatever the > cause happens to be. Yea, we should pick that up again. Not just for robustness or performance. Also because it's just a lot easier to understand. > Leaving behind disconnected/orphaned heap-only tuples is pretty much > pointless anyway, since they'll never be accessible by index scans. > Even after a REINDEX, since there is no root item from the heap page > to go in the index. (A dump and restore might work better, though.) Given that heap_surgery's raison d'etre is correcting corruption etc, I think it makes sense for it to do as minimal work as possible. Iterating through a HOT chain would be a problem if you e.g. tried to repair a page with HOT corruption. Greetings, Andres Freund
On Sat, Feb 19, 2022 at 5:54 PM Andres Freund <andres@anarazel.de> wrote: > How does that cause the endless loop? Attached is the page image itself, dumped via gdb (and gzip'd). This was on recent HEAD (commit 8f388f6f, actually), plus 0001-Add-adversarial-ConditionalLockBuff[...]. No other changes. No defragmenting in pg_surgery, nothing like that. > It doesn't do so on HEAD + 0001-Add-adversarial-ConditionalLockBuff[...] for > me. So something needs have changed with your patch? It doesn't always happen -- only about half the time on my machine. Maybe it's timing sensitive? We hit the "goto retry" on offnum 2, which is the first tuple with storage (you can see "the ghost" of the tuple from the LP_DEAD item at offnum 1, since the page isn't defragmented in pg_surgery). I think that this happens because the heap-only tuple at offnum 2 is fully DEAD to lazy_scan_prune, but hasn't been recognized as such by heap_page_prune. There is no way that they'll ever "agree" on the tuple being DEAD right now, because pruning still doesn't assume that an orphaned heap-only tuple is fully DEAD. We can either do that, or we can throw an error concerning corruption when heap_page_prune notices orphaned tuples. Neither seems particularly appealing. But it definitely makes no sense to allow lazy_scan_prune to spin in a futile attempt to reach agreement with heap_page_prune about a DEAD tuple really being DEAD. > Given that heap_surgery's raison d'etre is correcting corruption etc, I think > it makes sense for it to do as minimal work as possible. Iterating through a > HOT chain would be a problem if you e.g. tried to repair a page with HOT > corruption. I guess that's also true. There is at least a legitimate argument to be made for not leaving behind any orphaned heap-only tuples. The interface is a TID, and so the user may already believe that they're killing the heap-only, not just the root item (since ctid suggests that the TID of a heap-only tuple is the TID of the root item, which is kind of misleading). Anyway, we can decide on what to do in heap_surgery later, once the main issue is under control. My point was mostly just that orphaned heap-only tuples are definitely not okay, in general. They are the least worst option when corruption has already happened, maybe -- but maybe not. -- Peter Geoghegan
Attachment
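[Editor's note: for readers who haven't seen it, the loop that gets stuck has roughly the shape below. This is a simplified paraphrase of lazy_scan_prune, not the exact source; "prune_page" and "tuple_is_dead_to_vacuum" are stand-ins for heap_page_prune and the HeapTupleSatisfiesVacuum check against OldestXmin, and items without storage are skipped in the real code.]

    #include "postgres.h"

    #include "storage/bufpage.h"
    #include "storage/off.h"

    /* stand-ins for heap_page_prune and the HTSV check, declared elsewhere */
    static void prune_page(Page page);
    static bool tuple_is_dead_to_vacuum(Page page, OffsetNumber offnum);

    static void
    scan_prune_paraphrase(Page page)
    {
        OffsetNumber offnum;
        OffsetNumber maxoff;

    retry:
        /* Pruning is expected to remove every tuple that is DEAD to VACUUM */
        prune_page(page);

        maxoff = PageGetMaxOffsetNumber(page);
        for (offnum = FirstOffsetNumber; offnum <= maxoff;
             offnum = OffsetNumberNext(offnum))
        {
            /*
             * If a tuple still looks DEAD after pruning, assume the situation
             * changed underneath us and start over.  With an orphaned DEAD
             * heap-only tuple that pruning can never reach, this retries
             * forever.
             */
            if (tuple_is_dead_to_vacuum(page, offnum))
                goto retry;
        }
    }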
Hi, On 2022-02-19 18:16:54 -0800, Peter Geoghegan wrote: > On Sat, Feb 19, 2022 at 5:54 PM Andres Freund <andres@anarazel.de> wrote: > > How does that cause the endless loop? > > Attached is the page image itself, dumped via gdb (and gzip'd). This > was on recent HEAD (commit 8f388f6f, actually), plus > 0001-Add-adversarial-ConditionalLockBuff[...]. No other changes. No > defragmenting in pg_surgery, nothing like that. > > It doesn't do so on HEAD + 0001-Add-adversarial-ConditionalLockBuff[...] for > > me. So something needs have changed with your patch? > > It doesn't always happen -- only about half the time on my machine. > Maybe it's timing sensitive? Ah, I'd only run the tests three times or so, without it happening. Trying a few more times repro'd it. It's kind of surprising that this needs this 0001-Add-adversarial-ConditionalLockBuff to break. I suspect it's a question of hint bits changing due to lazy_scan_noprune(), which then makes HeapTupleHeaderIsHotUpdated() have a different return value, preventing the "If the tuple is DEAD and doesn't chain to anything else" path from being taken. > We hit the "goto retry" on offnum 2, which is the first tuple with > storage (you can see "the ghost" of the tuple from the LP_DEAD item at > offnum 1, since the page isn't defragmented in pg_surgery). I think > that this happens because the heap-only tuple at offnum 2 is fully > DEAD to lazy_scan_prune, but hasn't been recognized as such by > heap_page_prune. There is no way that they'll ever "agree" on the > tuple being DEAD right now, because pruning still doesn't assume that > an orphaned heap-only tuple is fully DEAD. > We can either do that, or we can throw an error concerning corruption > when heap_page_prune notices orphaned tuples. Neither seems > particularly appealing. But it definitely makes no sense to allow > lazy_scan_prune to spin in a futile attempt to reach agreement with > heap_page_prune about a DEAD tuple really being DEAD. Yea, this sucks. I think we should go for the rewrite of the heap_prune_chain() logic. The current approach is just never going to be robust. Greetings, Andres Freund
On Sat, Feb 19, 2022 at 7:01 PM Andres Freund <andres@anarazel.de> wrote: > > We can either do that, or we can throw an error concerning corruption > > when heap_page_prune notices orphaned tuples. Neither seems > > particularly appealing. But it definitely makes no sense to allow > > lazy_scan_prune to spin in a futile attempt to reach agreement with > > heap_page_prune about a DEAD tuple really being DEAD. > > Yea, this sucks. I think we should go for the rewrite of the > heap_prune_chain() logic. The current approach is just never going to be > robust. No, it just isn't robust enough. But it's not that hard to fix. My patch really wasn't invasive. I confirmed that HeapTupleSatisfiesVacuum() and heap_prune_satisfies_vacuum() agree that the heap-only tuple at offnum 2 is HEAPTUPLE_DEAD -- they are in agreement, as expected (so no reason to think that there is a new bug involved). The problem here is indeed just that heap_prune_chain() can't "get to" the tuple, given its current design. For anybody else that doesn't follow what we're talking about: The "doesn't chain to anything else" code at the start of heap_prune_chain() won't get to the heap-only tuple at offnum 2, since the tuple is itself HeapTupleHeaderIsHotUpdated() -- the expectation is that it'll be processed later on, once we locate the HOT chain's root item. Since, of course, the "root item" was already LP_DEAD before we even reached heap_page_prune() (on account of the pg_surgery corruption), there is no possible way that that can happen later on. And so we cannot find the same heap-only tuple and mark it LP_UNUSED (which is how we always deal with HEAPTUPLE_DEAD heap-only tuples) during pruning. -- Peter Geoghegan
On Sat, Feb 19, 2022 at 7:01 PM Andres Freund <andres@anarazel.de> wrote: > It's kind of surprising that this needs this > 0001-Add-adversarial-ConditionalLockBuff to break. I suspect it's a question > of hint bits changing due to lazy_scan_noprune(), which then makes > HeapTupleHeaderIsHotUpdated() have a different return value, preventing the > "If the tuple is DEAD and doesn't chain to anything else" > path from being taken. That makes sense as an explanation. Goes to show just how fragile the "DEAD and doesn't chain to anything else" logic at the top of heap_prune_chain really is. -- Peter Geoghegan
Hi, On 2022-02-19 19:07:39 -0800, Peter Geoghegan wrote: > On Sat, Feb 19, 2022 at 7:01 PM Andres Freund <andres@anarazel.de> wrote: > > > We can either do that, or we can throw an error concerning corruption > > > when heap_page_prune notices orphaned tuples. Neither seems > > > particularly appealing. But it definitely makes no sense to allow > > > lazy_scan_prune to spin in a futile attempt to reach agreement with > > > heap_page_prune about a DEAD tuple really being DEAD. > > > > Yea, this sucks. I think we should go for the rewrite of the > > heap_prune_chain() logic. The current approach is just never going to be > > robust. > > No, it just isn't robust enough. But it's not that hard to fix. My > patch really wasn't invasive. I think we're in agreement there. We might think at some point about backpatching too, but I'd rather have it stew in HEAD for a bit first. > I confirmed that HeapTupleSatisfiesVacuum() and > heap_prune_satisfies_vacuum() agree that the heap-only tuple at offnum > 2 is HEAPTUPLE_DEAD -- they are in agreement, as expected (so no > reason to think that there is a new bug involved). The problem here is > indeed just that heap_prune_chain() can't "get to" the tuple, given > its current design. Right. The reason that the "adversarial" patch makes a difference is solely that it changes the heap_surgery test to actually kill an item, which it doesn't intend:

create temp table htab2(a int);
insert into htab2 values (100);
update htab2 set a = 200;
vacuum htab2;
-- redirected TIDs should be skipped
select heap_force_kill('htab2'::regclass, ARRAY['(0, 1)']::tid[]);

If the vacuum can get the cleanup lock despite the adversarial patch, the heap_force_kill() doesn't do anything, because the first item is a redirect. However if it *can't* get a cleanup lock, heap_force_kill() instead targets the root item, triggering the endless loop. Hm. I think this might be a mild regression in 14. In < 14 we'd just skip the tuple in lazy_scan_heap(), but now we have an uninterruptible endless loop. We'd do completely bogus stuff later in < 14 though, I think we'd just leave it in place despite being older than relfrozenxid, which obviously has its own set of issues. Greetings, Andres Freund
On Sat, Feb 19, 2022 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Given that heap_surgery's raison d'etre is correcting corruption etc, I think > > it makes sense for it to do as minimal work as possible. Iterating through a > > HOT chain would be a problem if you e.g. tried to repair a page with HOT > > corruption. > > I guess that's also true. There is at least a legitimate argument to > be made for not leaving behind any orphaned heap-only tuples. The > interface is a TID, and so the user may already believe that they're > killing the heap-only, not just the root item (since ctid suggests > that the TID of a heap-only tuple is the TID of the root item, which > is kind of misleading). Actually, I would say that heap_surgery's raison d'etre is making weird errors related to corruption of this or that TID go away, so that the user can cut their losses. That's how it's advertised. Let's assume that we don't want to make VACUUM/pruning just treat orphaned heap-only tuples as DEAD, regardless of their true HTSV-wise status -- let's say that we want to err in the direction of doing nothing at all with the page. Now we have to have a weird error in VACUUM instead (not great, but better than just spinning between lazy_scan_prune and heap_prune_page). And we've just created natural demand for heap_surgery to deal with the problem by deleting whole HOT chains (not just root items). If we allow VACUUM to treat orphaned heap-only tuples as DEAD right away, then we might as well do the same thing in heap_surgery, since there is little chance that the user will get to the heap-only tuples before VACUUM does (not something to rely on, at any rate). Either way, I think we probably end up needing to teach heap_surgery to kill entire HOT chains as a group, given a TID. -- Peter Geoghegan
On Sat, Feb 19, 2022 at 7:28 PM Andres Freund <andres@anarazel.de> wrote: > If the vacuum can get the cleanup lock due to the adversarial patch, the > heap_force_kill() doesn't do anything, because the first item is a > redirect. However if it *can't* get a cleanup lock, heap_force_kill() instead > targets the root item. Triggering the endless loop. But it shouldn't matter if the root item is an LP_REDIRECT or a normal (not heap-only) tuple with storage. Either way it's the root of a HOT chain. The fact that pg_surgery treats LP_REDIRECT items differently from the other kind of root items is just arbitrary. It seems to have more to do with freezing tuples than killing tuples. -- Peter Geoghegan
Hi, On 2022-02-19 19:31:21 -0800, Peter Geoghegan wrote: > On Sat, Feb 19, 2022 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote: > > > Given that heap_surgery's raison d'etre is correcting corruption etc, I think > > > it makes sense for it to do as minimal work as possible. Iterating through a > > > HOT chain would be a problem if you e.g. tried to repair a page with HOT > > > corruption. > > > > I guess that's also true. There is at least a legitimate argument to > > be made for not leaving behind any orphaned heap-only tuples. The > > interface is a TID, and so the user may already believe that they're > > killing the heap-only, not just the root item (since ctid suggests > > that the TID of a heap-only tuple is the TID of the root item, which > > is kind of misleading). > > Actually, I would say that heap_surgery's raison d'etre is making > weird errors related to corruption of this or that TID go away, so > that the user can cut their losses. That's how it's advertised. I'm not that sure those are that different... Imagine some corruption leading to two hot chains ending in the same tid, which our fancy new secure pruning algorithm might detect. Either way, I'm a bit surprised about the logic to not allow killing redirect items? What if you have a redirect pointing to an unused item? > Let's assume that we don't want to make VACUUM/pruning just treat > orphaned heap-only tuples as DEAD, regardless of their true HTSV-wise > status I don't think that'd ever be a good idea. Those tuples are visible to a seqscan after all. > -- let's say that we want to err in the direction of doing > nothing at all with the page. Now we have to have a weird error in > VACUUM instead (not great, but better than just spinning between > lazy_scan_prune and heap_prune_page). Non DEAD orphaned versions shouldn't cause a problem in lazy_scan_prune(). The problem here is a DEAD orphaned HOT tuples, and those we should be able to delete with the new page pruning logic, right? I think it might be worth getting rid of the need for the retry approach by reusing the same HTSV status array between heap_prune_page and lazy_scan_prune. Then the only legitimate reason for seeing a DEAD item in lazy_scan_prune() would be some form of corruption. And it'd be a pretty decent performance boost, HTSV ain't cheap. Greetings, Andres Freund
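[Editor's note: the array-sharing idea floated at the end of the message above might look something like this; the struct and field names are assumptions for illustration, not anything from an actual patch.]

    #include "postgres.h"

    #include "access/htup_details.h"

    /*
     * Sketch: pruning records the HTSV verdict it computes for each offset
     * number, so that lazy_scan_prune can reuse it instead of calling
     * HeapTupleSatisfiesVacuum a second time for every tuple.
     */
    typedef struct PruneResultSketch
    {
        int         ndeleted;       /* what pruning already returns today */

        /*
         * One entry per offset number (1-based); entries for items without
         * storage are left as -1, others hold an HTSV_Result value.
         */
        int8        htsv[MaxHeapTuplesPerPage + 1];
    } PruneResultSketch;

With something along these lines, lazy_scan_prune could treat a HEAPTUPLE_DEAD entry that survived pruning as corruption (and WARN or ERROR), instead of retrying in the hope that pruning will eventually agree.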
On Sat, Feb 19, 2022 at 7:47 PM Andres Freund <andres@anarazel.de> wrote: > I'm not that sure those are that different... Imagine some corruption leading > to two hot chains ending in the same tid, which our fancy new secure pruning > algorithm might detect. I suppose that's possible, but it doesn't seem all that likely to ever happen, what with the xmin -> xmax cross-tuple matching stuff. > Either way, I'm a bit surprised about the logic to not allow killing redirect > items? What if you have a redirect pointing to an unused item? Again, I simply think it boils down to having to treat HOT chains as a whole unit when killing TIDs. > > Let's assume that we don't want to make VACUUM/pruning just treat > > orphaned heap-only tuples as DEAD, regardless of their true HTSV-wise > > status > > I don't think that'd ever be a good idea. Those tuples are visible to a > seqscan after all. I agree (I don't hate it completely, but it seems mostly bad). This is what leads me to the conclusion that pg_surgery has to be able to do this instead. Surely it's not okay to have something that makes VACUUM always end in error, that cannot even be fixed by pg_surgery. > > -- let's say that we want to err in the direction of doing > > nothing at all with the page. Now we have to have a weird error in > > VACUUM instead (not great, but better than just spinning between > > lazy_scan_prune and heap_prune_page). > > Non DEAD orphaned versions shouldn't cause a problem in lazy_scan_prune(). The > problem here is a DEAD orphaned HOT tuples, and those we should be able to > delete with the new page pruning logic, right? Right. But what good does that really do? The problematic page had a third tuple (at offnum 3) that was LIVE. If we could have done something about the problematic tuple at offnum 2 (which is where we got stuck), then we'd still be left with a very unpleasant choice about what happens to the third tuple. > I think it might be worth getting rid of the need for the retry approach by > reusing the same HTSV status array between heap_prune_page and > lazy_scan_prune. Then the only legitimate reason for seeing a DEAD item in > lazy_scan_prune() would be some form of corruption. And it'd be a pretty > decent performance boost, HTSV ain't cheap. I guess it doesn't actually matter if we leave an aborted DEAD tuple behind, that we could have pruned away, but didn't. The important thing is to be consistent at the level of the page. -- Peter Geoghegan
Hi, On February 19, 2022 7:56:53 PM PST, Peter Geoghegan <pg@bowt.ie> wrote: >On Sat, Feb 19, 2022 at 7:47 PM Andres Freund <andres@anarazel.de> wrote: >> Non DEAD orphaned versions shouldn't cause a problem in lazy_scan_prune(). The >> problem here is a DEAD orphaned HOT tuples, and those we should be able to >> delete with the new page pruning logic, right? > >Right. But what good does that really do? The problematic page had a >third tuple (at offnum 3) that was LIVE. If we could have done >something about the problematic tuple at offnum 2 (which is where we >got stuck), then we'd still be left with a very unpleasant choice >about what happens to the third tuple. Why does anything need to happen to it from vacuum's POV? It'll not be a problem for freezing etc. Until it's deleted vacuum doesn't need to care. Probably worth a WARNING, and amcheck definitely needs to detect it, but otherwise I think it's fine to just continue. >> I think it might be worth getting rid of the need for the retry approach by >> reusing the same HTSV status array between heap_prune_page and >> lazy_scan_prune. Then the only legitimate reason for seeing a DEAD item in >> lazy_scan_prune() would be some form of corruption. And it'd be a pretty >> decent performance boost, HTSV ain't cheap. > >I guess it doesn't actually matter if we leave an aborted DEAD tuple >behind, that we could have pruned away, but didn't. The important >thing is to be consistent at the level of the page. That's not ok, because it opens up dangers of being interpreted differently after wraparound etc. But I don't see any cases where it would happen with the new pruning logic in your patch and sharing the HTSV status array? Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Sat, Feb 19, 2022 at 8:21 PM Andres Freund <andres@anarazel.de> wrote: > Why does anything need to happen to it from vacuum's POV? It'll not be a problem for freezing etc. Until it's deleted vacuum doesn't need to care. > > Probably worth a WARNING, and amcheck definitely needs to detect it, but otherwise I think it's fine to just continue. Maybe that's true, but it's just really weird to imagine not having an LP_REDIRECT that points to the LIVE item here, without throwing an error. Seems kind of iffy, to say the least. > >I guess it doesn't actually matter if we leave an aborted DEAD tuple > >behind, that we could have pruned away, but didn't. The important > >thing is to be consistent at the level of the page. > > That's not ok, because it opens up dangers of being interpreted differently after wraparound etc. > > But I don't see any cases where it would happen with the new pruning logic in your patch and sharing the HTSV status array? Right. Fundamentally, there isn't any reason why it should matter that VACUUM reached the heap page just before (rather than concurrent with or just after) some xact that inserted or updated on the page aborts. Just as long as we have a consistent idea about what's going on at the level of the whole page (or maybe the level of each HOT chain, but the whole page level seems simpler to me). -- Peter Geoghegan
On Sat, Feb 19, 2022 at 8:54 PM Andres Freund <andres@anarazel.de> wrote: > > Leaving behind disconnected/orphaned heap-only tuples is pretty much > > pointless anyway, since they'll never be accessible by index scans. > > Even after a REINDEX, since there is no root item from the heap page > > to go in the index. (A dump and restore might work better, though.) > > Given that heap_surgery's raison d'etre is correcting corruption etc, I think > it makes sense for it to do as minimal work as possible. Iterating through a > HOT chain would be a problem if you e.g. tried to repair a page with HOT > corruption. Yeah, I agree. I don't have time to respond to all of these emails thoroughly right now, but I think it's really important that pg_surgery do the exact surgery the user requested, and not any other work. I don't think that page defragmentation should EVER be REQUIRED as a condition of other work. If other code is relying on that, I'd say it's busted. I'm a little more uncertain about the case where we kill the root tuple of a HOT chain, because I can see that this might leave the page in a state where sequential scans see the tuple at the end of the chain and index scans don't. I'm not sure whether that should be the responsibility of pg_surgery itself to avoid, or whether that's your problem as a user of it -- although I lean mildly toward the latter view, at the moment. But in any case surely the pruning code can't just decide to go into an infinite loop if that happens. Code that manipulates the states of data pages needs to be as robust against arbitrary on-disk states as we can reasonably make it, because pages get garbled on disk all the time. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Feb 18, 2022 at 7:12 PM Peter Geoghegan <pg@bowt.ie> wrote: > We have to worry about XIDs from MultiXacts (and xmax values more > generally). And we have to worry about the case where we start out > with only xmin frozen (by an earlier VACUUM), and then have to freeze > xmax too. I believe that we have to generally consider xmin and xmax > independently. For example, we cannot ignore xmax, just because we > looked at xmin, since in general xmin alone might have already been > frozen. Right, so we at least need to add a similar comment to what I proposed for MXIDs, and maybe other changes are needed, too. > The difference between the cleanup lock path (in > lazy_scan_prune/heap_prepare_freeze_tuple) and the share lock path (in > lazy_scan_noprune/heap_tuple_needs_freeze) is what is at issue in both > of these confusing comment blocks, really. Note that cutoff_xid is the > name that both heap_prepare_freeze_tuple and heap_tuple_needs_freeze > have for FreezeLimit (maybe we should rename every occurence of > cutoff_xid in heapam.c to FreezeLimit). > > At a high level, we aren't changing the fundamental definition of an > aggressive VACUUM in any of the patches -- we still need to advance > relfrozenxid up to FreezeLimit in an aggressive VACUUM, just like on > HEAD, today (we may be able to advance it *past* FreezeLimit, but > that's just a bonus). But in a non-aggressive VACUUM, where there is > still no strict requirement to advance relfrozenxid (by any amount), > the code added by 0001 can set relfrozenxid to any known safe value, > which could either be from before FreezeLimit, or after FreezeLimit -- > almost anything is possible (provided we respect the relfrozenxid > invariant, and provided we see that we didn't skip any > all-visible-not-all-frozen pages). > > Since we still need to "respect FreezeLimit" in an aggressive VACUUM, > the aggressive case might need to wait for a full cleanup lock the > hard way, having tried and failed to do it the easy way within > lazy_scan_noprune (lazy_scan_noprune will still return false when any > call to heap_tuple_needs_freeze for any tuple returns false) -- same > as on HEAD, today. > > And so the difference at issue here is: FreezeLimit/cutoff_xid only > needs to affect the new NewRelfrozenxid value we use for relfrozenxid in > heap_prepare_freeze_tuple, which is involved in real freezing -- not > in heap_tuple_needs_freeze, whose main purpose is still to help us > avoid freezing where a cleanup lock isn't immediately available. While > the purpose of FreezeLimit/cutoff_xid within heap_tuple_needs_freeze > is to determine its bool return value, which will only be of interest > to the aggressive case (which might have to get a cleanup lock and do > it the hard way), not the non-aggressive case (where ratcheting back > NewRelfrozenxid is generally possible, and generally leaves us with > almost as good of a value). 
> > In other words: the calls to heap_tuple_needs_freeze made from > lazy_scan_noprune are simply concerned with the page as it actually > is, whereas the similar/corresponding calls to > heap_prepare_freeze_tuple from lazy_scan_prune are concerned with > *what the page will actually become*, after freezing finishes, and > after lazy_scan_prune is done with the page entirely (ultimately > the final NewRelfrozenxid value set in pg_class.relfrozenxid only has > to be <= the oldest extant XID *at the time the VACUUM operation is > just about to end*, not some earlier time, so "being versus becoming" > is an interesting distinction for us). > > Maybe the way that FreezeLimit/cutoff_xid is overloaded can be fixed > here, to make all of this less confusing. I only now fully realized > how confusing all of this stuff is -- very. Right. I think I understand all of this, or at least most of it -- but not from the comment. The question is how the comment can be more clear. My general suggestion is that function header comments should have more to do with the behavior of the function than how it fits into the bigger picture. If it's clear to the reader what conditions must hold before calling the function and which must hold on return, it helps a lot. IMHO, it's the job of the comments in the calling function to clarify why we then choose to call that function at the place and in the way that we do. > As a general rule, we try to freeze all of the remaining live tuples > on a page (following pruning) together, as a group, or none at all. > Most of the time this is triggered by our noticing that the page is > about to be set all-visible (but not all-frozen), and doing work > sufficient to mark it fully all-frozen instead. Occasionally there is > FreezeLimit to consider, which is now more of a backstop thing, used > to make sure that we never get too far behind in terms of unfrozen > XIDs. This is useful in part because it avoids any future > non-aggressive VACUUM that is fundamentally unable to advance > relfrozenxid (you can't skip all-visible pages if there are only > all-frozen pages in the VM in practice). > > We're generally doing a lot more freezing with 0002, but we still > manage to avoid freezing too much in tables like pgbench_tellers or > pgbench_branches -- tables where it makes the least sense. Such tables > will be updated so frequently that VACUUM is relatively unlikely to > ever mark any page all-visible, avoiding the main criteria for > freezing implicitly. It's also unlikely that they'll ever have an XID that is so > old to trigger the fallback FreezeLimit-style criteria for freezing. > > In practice, freezing tuples like this is generally not that expensive in > most tables where VACUUM freezes the majority of pages immediately > (tables that aren't like pgbench_tellers or pgbench_branches), because > they're generally big tables, where the overhead of FPIs tends > to dominate anyway (gambling that we can avoid more FPIs later on is not a > bad gamble, as gambles go). This seems to make the overhead > acceptable, on balance. Granted, you might be able to poke holes in > that argument, and reasonable people might disagree on what acceptable > should mean. There are many value judgements here, which makes it > complicated. (On the other hand we might be able to do better if there > was a particularly bad case for the 0002 work, if one came to light.) I think that the idea has potential, but I don't think that I understand yet what the *exact* algorithm is. 
Maybe I need to read the code, when I have some time for that. I can't form an intelligent opinion at this stage about whether this is likely to be a net positive. -- Robert Haas EDB: http://www.enterprisedb.com
On Sun, Feb 20, 2022 at 7:30 AM Robert Haas <robertmhaas@gmail.com> wrote: > Right, so we at least need to add a similar comment to what I proposed > for MXIDs, and maybe other changes are needed, too. Agreed. > > Maybe the way that FreezeLimit/cutoff_xid is overloaded can be fixed > > here, to make all of this less confusing. I only now fully realized > > how confusing all of this stuff is -- very. > > Right. I think I understand all of this, or at least most of it -- but > not from the comment. The question is how the comment can be more > clear. My general suggestion is that function header comments should > have more to do with the behavior of the function than how it fits > into the bigger picture. If it's clear to the reader what conditions > must hold before calling the function and which must hold on return, > it helps a lot. IMHO, it's the job of the comments in the calling > function to clarify why we then choose to call that function at the > place and in the way that we do. You've given me a lot of high quality feedback on all of this, which I'll work through soon. It's hard to get the balance right here, but it's made much easier by this kind of feedback. > I think that the idea has potential, but I don't think that I > understand yet what the *exact* algorithm is. The algorithm seems to exploit a natural tendency that Andres once described in a blog post about his snapshot scalability work [1]. To a surprising extent, we can usefully bucket all tuples/pages into two simple categories:

1. Very, very old ("infinitely old" for all practical purposes).

2. Very, very new.

There doesn't seem to be much need for a third "in-between" category in practice. This seems to be at least approximately true all of the time. Perhaps Andres wouldn't agree with this very general statement -- he actually said something more specific. I for one believe that the point he made generalizes surprisingly well, though. I have my own theories about why this appears to be true. (Executive summary: power laws are weird, and it seems as if the sparsity-of-effects principle makes it easy to bucket things at the highest level, in a way that generalizes well across disparate workloads.) > Maybe I need to read the > code, when I have some time for that. I can't form an intelligent > opinion at this stage about whether this is likely to be a net > positive. The code in the v8-0002 patch is a bit sloppy right now. I didn't quite get around to cleaning it up -- I was focussed on performance validation of the algorithm itself. So bear that in mind if you do look at v8-0002 (might want to wait for v9-0002 before looking). I believe that the only essential thing about the algorithm itself is that it freezes all the tuples on a page when it anticipates setting the page all-visible, or (barring edge cases) freezes none at all. (Note that setting the page all-visible/all-frozen may happen just after lazy_scan_prune returns, or in the second pass over the heap, after LP_DEAD items are set to LP_UNUSED -- lazy_scan_prune doesn't care which way it will happen.) There are one or two other design choices that we need to make, like what exact tuples we freeze in the edge case where FreezeLimit/XID age forces us to freeze in lazy_scan_prune. These other design choices don't seem relevant to the issue of central importance, which is whether or not we come out ahead overall with this new algorithm. 
FreezeLimit will seldom affect our choice to freeze or not freeze now, and so AFAICT the exact way that FreezeLimit affects which precise freezing-eligible tuples we freeze doesn't complicate performance validation. Remember when I got excited about how my big TPC-C benchmark run showed a predictable, tick/tock style pattern across VACUUM operations against the order and order lines table [2]? It seemed very significant to me that the OldestXmin of VACUUM operation n consistently went on to become the new relfrozenxid for the same table in VACUUM operation n + 1. It wasn't exactly the same XID, but very close to it (within the range of noise). This pattern was clearly present, even though VACUUM operation n + 1 might happen as long as 4 or 5 hours after VACUUM operation n (this was a big table). This pattern was encouraging to me because it showed (at least for the workload and tables in question) that the amount of unnecessary extra freezing can't have been too bad -- the fact that we can always advance relfrozenxid in the same way is evidence of that. Note that the vacuum_freeze_min_age setting can't have affected our choice of what to freeze (given what we see in the logs), and yet there is a clear pattern where the pages (it's really pages, not tuples) that the new algorithm doesn't freeze in VACUUM operation n will reliably get frozen in VACUUM operation n + 1 instead. And so this pattern seems to lend support to the general idea of letting the workload itself be the primary driver of what pages we freeze (not FreezeLimit, and not anything based on XIDs). That's really the underlying principle behind the new algorithm -- freezing is driven by workload characteristics (or page/block characteristics, if you prefer). ISTM that vacuum_freeze_min_age is almost impossible to tune -- XID age is just too squishy a concept for that to ever work. [1] https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/improving-postgres-connection-scalability-snapshots/ba-p/1806462#interlude-removing-the-need-for-recentglobalxminhorizon [2] https://postgr.es/m/CAH2-Wz=iLnf+0CsaB37efXCGMRJO1DyJw5HMzm7tp1AxG1NR2g@mail.gmail.com -- scroll down to "TPC-C", which has the relevant autovacuum log output for the orders table, covering a 24 hour period -- Peter Geoghegan
On Sun, Feb 20, 2022 at 12:27 PM Peter Geoghegan <pg@bowt.ie> wrote: > You've given me a lot of high quality feedback on all of this, which > I'll work through soon. It's hard to get the balance right here, but > it's made much easier by this kind of feedback. Attached is v9. Lots of changes. Highlights:

* Much improved 0001 ("loosen coupling" dynamic relfrozenxid tracking patch). Some of the improvements are due to recent feedback from Robert.

* Much improved 0002 ("Make page-level characteristics drive freezing" patch). Whole new approach to the implementation, though the same algorithm as before.

* No more FSM patch -- that was totally separate work, that I shouldn't have attached to this project.

* There are 2 new patches (these are now 0003 and 0004), both of which are concerned with allowing non-aggressive VACUUM to consistently advance relfrozenxid. I think that 0003 makes sense on general principle, but I'm much less sure about 0004. These aren't too important.

While working on the new approach to freezing taken by v9-0002, I had some insight about the issues that Robert raised around 0001, too. I wasn't expecting that to happen. 0002 makes page-level freezing a first class thing. heap_prepare_freeze_tuple now has some (limited) knowledge of how this works. heap_prepare_freeze_tuple's cutoff_xid argument is now always the VACUUM caller's OldestXmin (not its FreezeLimit, as before). We still have to pass FreezeLimit to heap_prepare_freeze_tuple, which helps us to respect FreezeLimit as a backstop, and so now it's passed via the new backstop_cutoff_xid argument instead. Whenever we opt to "freeze a page", the new page-level algorithm *always* uses the most recent possible XID and MXID values (OldestXmin and oldestMxact) to decide what XIDs/XMIDs need to be replaced. That might sound like it'd be too much, but it only applies to those pages that we actually decide to freeze (since page-level characteristics drive everything now). FreezeLimit is only one way of triggering that now (and one of the least interesting and rarest). 0002 also adds an alternative set of relfrozenxid/relminmxid tracker variables, to make the "don't freeze the page" path within lazy_scan_prune simpler (if you don't want to freeze the page, then use the set of tracker variables that go with that choice, which heap_prepare_freeze_tuple knows about and helps with). With page-level freezing, lazy_scan_prune wants to make a decision about the page as a whole, at the last minute, after all heap_prepare_freeze_tuple calls have already been made. So I think that heap_prepare_freeze_tuple needs to know about that aspect of lazy_scan_prune's behavior. When we *don't* want to freeze the page, we more or less need everything related to freezing inside lazy_scan_prune to behave like lazy_scan_noprune, which never freezes the page (that's mostly the point of lazy_scan_noprune). And that's almost what we actually do -- heap_prepare_freeze_tuple now outsources maintenance of this alternative set of "don't freeze the page" relfrozenxid/relminmxid tracker variables to its sibling function, heap_tuple_needs_freeze. That is the same function that lazy_scan_noprune itself actually calls. Now back to Robert's feedback on 0001, which had very complicated comments in the last version. This approach seems to make the "being versus becoming" or "going to freeze versus not going to freeze" distinctions much clearer. This is less true if you assume that 0002 won't be committed but 0001 will be. 
Even if that happens with Postgres 15, I have to imagine that adding something like 0002 must be the real goal, long term. Without 0002, the value from 0001 is far more limited. You need both together to get the virtuous cycle I've described. The approach with always using OldestXmin as cutoff_xid and oldestMxact as our cutoff_multi makes a lot of sense to me, in part because I think that it might well cut down on the tendency of VACUUM to allocate new MultiXacts in order to be able to freeze old ones. AFAICT the only reason that heap_prepare_freeze_tuple does that is because it has no flexibility on FreezeLimit and MultiXactCutoff. These are derived from vacuum_freeze_min_age and vacuum_multixact_freeze_min_age, respectively, and so they're two independent though fairly meaningless cutoffs. On the other hand, OldestXmin and OldestMxact are not independent in the same way. We get both of them at the same time and the same place, in vacuum_set_xid_limits. OldestMxact really is very close to OldestXmin -- only the units differ. It seems that heap_prepare_freeze_tuple allocates new MXIDs (when freezing old ones) in large part so it can NOT freeze XIDs that it would have been useful (and much cheaper) to remove anyway. On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM operation's OldestXmin at all (it actually just gets FreezeLimit passed as its cutoff_xid argument). It cannot possibly recognize any of this for itself. Does that theory about MultiXacts sound plausible? I'm not claiming that the patch makes it impossible that FreezeMultiXactId() will have to allocate a new MultiXact to freeze during VACUUM -- the freeze-the-dead isolation tests already show that that's not true. I just think that page-level freezing based on page characteristics with oldestXmin and oldestMxact (not FreezeLimit and MultiXactCutoff) cutoffs might make it a lot less likely in practice. oldestXmin and oldestMxact map to the same wall clock time, more or less -- that seems like it might be an important distinction, independent of everything else. Thanks -- Peter Geoghegan
Attachment
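[Editor's note: to make the "two tracker sets" idea from the message above a bit more concrete, the end of lazy_scan_prune might look roughly like this under 0002. "freeze_this_page" and the frozen_*/nofreeze_* locals are assumed names, while NewRelfrozenXid/NewRelminMxid are the tracker fields added by 0001; both candidate pairs would start out from the running values for the scan at the top of the page.]

        if (freeze_this_page)
        {
            /*
             * Freezing: the trackers reflect the page as it will be once all
             * of the prepared freeze plans have been executed, using
             * OldestXmin/oldestMxact as the cutoffs.
             */
            vacrel->NewRelfrozenXid = frozen_newrelfrozenxid;
            vacrel->NewRelminMxid = frozen_newrelminmxid;

            /* execute the prepared freeze plans for the page here */
        }
        else
        {
            /*
             * Not freezing: the trackers reflect the page exactly as it is,
             * maintained by heap_tuple_needs_freeze-style bookkeeping, just
             * as in lazy_scan_noprune.
             */
            vacrel->NewRelfrozenXid = nofreeze_newrelfrozenxid;
            vacrel->NewRelminMxid = nofreeze_newrelminmxid;
        }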
Hi, On 2022-02-24 20:53:08 -0800, Peter Geoghegan wrote: > 0002 makes page-level freezing a first class thing. > heap_prepare_freeze_tuple now has some (limited) knowledge of how this > works. heap_prepare_freeze_tuple's cutoff_xid argument is now always > the VACUUM caller's OldestXmin (not its FreezeLimit, as before). We > still have to pass FreezeLimit to heap_prepare_freeze_tuple, which > helps us to respect FreezeLimit as a backstop, and so now it's passed > via the new backstop_cutoff_xid argument instead. I am not a fan of the backstop terminology. It's still the reason we need to do freezing for correctness reasons. It'd make more sense to me to turn it around and call the "non-backstop" freezing opportunistic freezing or such. > Whenever we opt to > "freeze a page", the new page-level algorithm *always* uses the most > recent possible XID and MXID values (OldestXmin and oldestMxact) to > decide what XIDs/XMIDs need to be replaced. That might sound like it'd > be too much, but it only applies to those pages that we actually > decide to freeze (since page-level characteristics drive everything > now). FreezeLimit is only one way of triggering that now (and one of > the least interesting and rarest). That largely makes sense to me and doesn't seem weird. I'm a tad concerned about replacing mxids that have some members that are older than OldestXmin but not older than FreezeLimit. It's not too hard to imagine that accelerating mxid consumption considerably. But we can probably, if not already done, special case that. > It seems that heap_prepare_freeze_tuple allocates new MXIDs (when > freezing old ones) in large part so it can NOT freeze XIDs that it > would have been useful (and much cheaper) to remove anyway. Well, we may have to allocate a new mxid because some members are older than FreezeLimit but others are still running. When do we not remove xids that would have been cheaper to remove once we decide to actually do work? > On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM operation's > OldestXmin at all (it actually just gets FreezeLimit passed as its > cutoff_xid argument). It cannot possibly recognize any of this for itself. It does recognize something like OldestXmin in a more precise and expensive way - MultiXactIdIsRunning() and TransactionIdIsCurrentTransactionId(). > Does that theory about MultiXacts sound plausible? I'm not claiming > that the patch makes it impossible that FreezeMultiXactId() will have > to allocate a new MultiXact to freeze during VACUUM -- the > freeze-the-dead isolation tests already show that that's not true. I > just think that page-level freezing based on page characteristics with > oldestXmin and oldestMxact (not FreezeLimit and MultiXactCutoff) > cutoffs might make it a lot less likely in practice. Hm. I guess I'll have to look at the code for it. It doesn't immediately "feel" quite right. > oldestXmin and oldestMxact map to the same wall clock time, more or less -- > that seems like it might be an important distinction, independent of > everything else. Hm. Multis can be kept alive by fairly "young" member xids. So it may not be removable (without creating a newer multi) until much later than its creation time. So I don't think that's really true. > From 483bc8df203f9df058fcb53e7972e3912e223b30 Mon Sep 17 00:00:00 2001 > From: Peter Geoghegan <pg@bowt.ie> > Date: Mon, 22 Nov 2021 10:02:30 -0800 > Subject: [PATCH v9 1/4] Loosen coupling between relfrozenxid and freezing. 
> > When VACUUM set relfrozenxid before now, it set it to whatever value was > used to determine which tuples to freeze -- the FreezeLimit cutoff. > This approach was very naive: the relfrozenxid invariant only requires > that new relfrozenxid values be <= the oldest extant XID remaining in > the table (at the point that the VACUUM operation ends), which in > general might be much more recent than FreezeLimit. There is no fixed > relationship between the amount of physical work performed by VACUUM to > make it safe to advance relfrozenxid (freezing and pruning), and the > actual number of XIDs that relfrozenxid can be advanced by (at least in > principle) as a result. VACUUM might have to freeze all of the tuples > from a hundred million heap pages just to enable relfrozenxid to be > advanced by no more than one or two XIDs. On the other hand, VACUUM > might end up doing little or no work, and yet still be capable of > advancing relfrozenxid by hundreds of millions of XIDs as a result. > > VACUUM now sets relfrozenxid (and relminmxid) using the exact oldest > extant XID (and oldest extant MultiXactId) from the table, including > XIDs from the table's remaining/unfrozen MultiXacts. This requires that > VACUUM carefully track the oldest unfrozen XID/MultiXactId as it goes. > This optimization doesn't require any changes to the definition of > relfrozenxid, nor does it require changes to the core design of > freezing. > Final relfrozenxid values must still be >= FreezeLimit in an aggressive > VACUUM (FreezeLimit is still used as an XID-age based backstop there). > In non-aggressive VACUUMs (where there is still no strict guarantee that > relfrozenxid will be advanced at all), we now advance relfrozenxid by as > much as we possibly can. This exploits workload conditions that make it > easy to advance relfrozenxid by many more XIDs (for the same amount of > freezing/pruning work). Don't we now always advance relfrozenxid as much as we can, particularly also during aggressive vacuums? > * FRM_RETURN_IS_MULTI > * The return value is a new MultiXactId to set as new Xmax. > * (caller must obtain proper infomask bits using GetMultiXactIdHintBits) > + * > + * "relfrozenxid_out" is an output value; it's used to maintain target new > + * relfrozenxid for the relation. It can be ignored unless "flags" contains > + * either FRM_NOOP or FRM_RETURN_IS_MULTI, because we only handle multiXacts > + * here. This follows the general convention: only track XIDs that will still > + * be in the table after the ongoing VACUUM finishes. Note that it's up to > + * caller to maintain this when the Xid return value is itself an Xid. > + * > + * Note that we cannot depend on xmin to maintain relfrozenxid_out. What does it mean for xmin to maintain something? > + * See heap_prepare_freeze_tuple for information about the basic rules for the > + * cutoffs used here. > + * > + * Maintains *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out, which > + * are the current target relfrozenxid and relminmxid for the relation. We > + * assume that caller will never want to freeze its tuple, even when the tuple > + * "needs freezing" according to our return value. I don't understand the "will never want to" bit? > Caller should make temp > + * copies of global tracking variables before starting to process a page, so > + * that we can only scribble on copies. That way caller can just discard the > + * temp copies if it isn't okay with that assumption. 
> + * > + * Only aggressive VACUUM callers are expected to really care when a tuple > + * "needs freezing" according to us. It follows that non-aggressive VACUUMs > + * can use *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out in all > + * cases. Could it make sense to track can_freeze and need_freeze separately? > @@ -7158,57 +7256,59 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid, > if (tuple->t_infomask & HEAP_XMAX_IS_MULTI) > { > MultiXactId multi; > + MultiXactMember *members; > + int nmembers; > > multi = HeapTupleHeaderGetRawXmax(tuple); > - if (!MultiXactIdIsValid(multi)) > - { > - /* no xmax set, ignore */ > - ; > - } > - else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask)) > + if (MultiXactIdIsValid(multi) && > + MultiXactIdPrecedes(multi, *relminmxid_nofreeze_out)) > + *relminmxid_nofreeze_out = multi; I may be misreading the diff, but aren't we know continuing to use multi down below even if !MultiXactIdIsValid()? > + if (HEAP_LOCKED_UPGRADED(tuple->t_infomask)) > return true; > - else if (MultiXactIdPrecedes(multi, cutoff_multi)) > - return true; > - else > + else if (MultiXactIdPrecedes(multi, backstop_cutoff_multi)) > + needs_freeze = true; > + > + /* need to check whether any member of the mxact is too old */ > + nmembers = GetMultiXactIdMembers(multi, &members, false, > + HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask)); Doesn't this mean we unpack the members even if the multi is old enough to need freezing? Just to then do it again during freezing? Accessing multis isn't cheap... > + if (TransactionIdPrecedes(members[i].xid, backstop_cutoff_xid)) > + needs_freeze = true; > + if (TransactionIdPrecedes(members[i].xid, > + *relfrozenxid_nofreeze_out)) > + *relfrozenxid_nofreeze_out = xid; > } > + if (nmembers > 0) > + pfree(members); > } > else > { > xid = HeapTupleHeaderGetRawXmax(tuple); > - if (TransactionIdIsNormal(xid) && > - TransactionIdPrecedes(xid, cutoff_xid)) > - return true; > + if (TransactionIdIsNormal(xid)) > + { > + if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out)) > + *relfrozenxid_nofreeze_out = xid; > + if (TransactionIdPrecedes(xid, backstop_cutoff_xid)) > + needs_freeze = true; > + } > } > > if (tuple->t_infomask & HEAP_MOVED) > { > xid = HeapTupleHeaderGetXvac(tuple); > - if (TransactionIdIsNormal(xid) && > - TransactionIdPrecedes(xid, cutoff_xid)) > - return true; > + if (TransactionIdIsNormal(xid)) > + { > + if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out)) > + *relfrozenxid_nofreeze_out = xid; > + if (TransactionIdPrecedes(xid, backstop_cutoff_xid)) > + needs_freeze = true; > + } > } This stanza is repeated a bunch. Perhaps put it in a small static inline helper? > /* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */ > TransactionId FreezeLimit; > MultiXactId MultiXactCutoff; > - /* Are FreezeLimit/MultiXactCutoff still valid? */ > - bool freeze_cutoffs_valid; > + /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */ > + TransactionId NewRelfrozenXid; > + MultiXactId NewRelminMxid; Struct member names starting with an upper case look profoundly ugly to me... But this isn't the first one, so I guess... :( > From d10f42a1c091b4dc52670fca80a63fee4e73e20c Mon Sep 17 00:00:00 2001 > From: Peter Geoghegan <pg@bowt.ie> > Date: Mon, 13 Dec 2021 15:00:49 -0800 > Subject: [PATCH v9 2/4] Make page-level characteristics drive freezing. 
> > Teach VACUUM to freeze all of the tuples on a page whenever it notices > that it would otherwise mark the page all-visible, without also marking > it all-frozen. VACUUM typically won't freeze _any_ tuples on the page > unless _all_ tuples (that remain after pruning) are all-visible. This > makes the overhead of vacuuming much more predictable over time. We > avoid the need for large balloon payments during aggressive VACUUMs > (typically anti-wraparound autovacuums). Freezing is proactive, so > we're much less likely to get into "freezing debt". I still suspect this will cause a very substantial increase in WAL traffic in realistic workloads. It's common to have workloads where tuples are inserted once, and deleted once/ partition dropped. Freezing all the tuples is a lot more expensive than just marking the page all visible. It's not uncommon to be bound by WAL traffic rather than buffer dirtying rate (since the latter may be ameliorated by s_b and local storage, whereas WAL needs to be streamed/archived). This is particularly true because log_heap_visible() doesn't need an FPW if checksums aren't enabled. A small record vs an FPI is a *huge* difference. I think we'll have to make this less aggressive or tunable. Random ideas for heuristics: - Is it likely that freezing would not require an FPI or conversely that log_heap_visible() will also need an fpi? If the page already was recently modified / checksums are enabled the WAL overhead of the freezing doesn't play much of a role. - #dead items / #force-frozen items on the page - if we already need to do more than just setting all-visible, we can probably afford the WAL traffic. - relfrozenxid vs max_freeze_age / FreezeLimit. The closer they get, the more aggressively we should freeze all-visible pages. Might even make sense to start vacuuming an increasing percentage of all-visible pages during non-aggressive vacuums, the closer we get to FreezeLimit. - Keep stats about the age of dead and frozen tuples over time. If all tuples are removed within a reasonable fraction of freeze_max_age, there's no point in freezing them. > The new approach to freezing also enables relfrozenxid advancement in > non-aggressive VACUUMs, which might be enough to avoid aggressive > VACUUMs altogether (with many individual tables/workloads). While the > non-aggressive case continues to skip all-visible (but not all-frozen) > pages (thereby making relfrozenxid advancement impossible), that in > itself will no longer hinder relfrozenxid advancement (outside of > pg_upgrade scenarios). I don't know how to parse "thereby making relfrozenxid advancement impossible ... will no longer hinder relfrozenxid advancement"? > We now consistently avoid leaving behind all-visible (not all-frozen) pages. > This (as well as work from commit 44fa84881f) makes relfrozenxid advancement > in non-aggressive VACUUMs commonplace. s/consistently/try to/? > The system accumulates freezing debt in proportion to the number of > physical heap pages with unfrozen tuples, more or less. Anything based > on XID age is likely to be a poor proxy for the eventual cost of > freezing (during the inevitable anti-wraparound autovacuum). At a high > level, freezing is now treated as one of the costs of storing tuples in > physical heap pages -- not a cost of transactions that allocate XIDs. > Although vacuum_freeze_min_age and vacuum_multixact_freeze_min_age still > influence what we freeze, and when, they effectively become backstops. 
> It may still be necessary to "freeze a page" due to the presence of a > particularly old XID, from before VACUUM's FreezeLimit cutoff, though > that will be rare in practice -- FreezeLimit is just a backstop now. I don't really like the "rare in practice" bit. It'll be rare in some workloads but others will likely be much less affected. > + * Although this interface is primarily tuple-based, vacuumlazy.c caller > + * cooperates with us to decide on whether or not to freeze whole pages, > + * together as a single group. We prepare for freezing at the level of each > + * tuple, but the final decision is made for the page as a whole. All pages > + * that are frozen within a given VACUUM operation are frozen according to > + * cutoff_xid and cutoff_multi. Caller _must_ freeze the whole page when > + * we've set *force_freeze to true! > + * > + * cutoff_xid must be caller's oldest xmin to ensure that any XID older than > + * it could neither be running nor seen as running by any open transaction. > + * This ensures that the replacement will not change anyone's idea of the > + * tuple state. Similarly, cutoff_multi must be the smallest MultiXactId used > + * by any open transaction (at the time that the oldest xmin was acquired). I think this means my concern above about increasing mxid creation rate substantially may be warranted. > + * backstop_cutoff_xid must be <= cutoff_xid, and backstop_cutoff_multi must > + * be <= cutoff_multi. When any XID/XMID from before these backstop cutoffs > + * is encountered, we set *force_freeze to true, making caller freeze the page > + * (freezing-eligible XIDs/XMIDs will be frozen, at least). "Backstop > + * freezing" ensures that VACUUM won't allow XIDs/XMIDs to ever get too old. > + * This shouldn't be necessary very often. VACUUM should prefer to freeze > + * when it's cheap (not when it's urgent). Hm. Does this mean that we might call heap_prepare_freeze_tuple and then decide not to freeze? Doesn't that mean we might create new multis over and over, because we don't end up pulling the trigger on freezing the page? > + > + /* > + * We allocated a MultiXact for this, so force freezing to avoid > + * wasting it > + */ > + *force_freeze = true; Ah, I guess not. But it'd be nicer if I didn't have to scroll down to the body of the function to figure it out... > From d2190abf366f148bae5307442e8a6245c6922e78 Mon Sep 17 00:00:00 2001 > From: Peter Geoghegan <pg@bowt.ie> > Date: Mon, 21 Feb 2022 12:46:44 -0800 > Subject: [PATCH v9 3/4] Remove aggressive VACUUM skipping special case. > > Since it's simply never okay to miss out on advancing relfrozenxid > during an aggressive VACUUM (that's the whole point), the aggressive > case treated any page from a next_unskippable_block-wise skippable block > range as an all-frozen page (not a merely all-visible page) during > skipping. Such a page might not be all-visible/all-frozen at the point > that it actually gets skipped, but it could nevertheless be safely > skipped, and then counted in frozenskipped_pages (the page must have > been all-frozen back when we determined the extent of the range of > blocks to skip, since aggressive VACUUMs _must_ scan all-visible pages). > This is necessary to ensure that aggressive VACUUMs are always capable > of advancing relfrozenxid. > The non-aggressive case behaved slightly differently: it rechecked the > visibility map for each page at the point of skipping, and only counted > pages in frozenskipped_pages when they were still all-frozen at that > time. 
But it skipped the page either way (since we already committed to > skipping the page at the point of the recheck). This was correct, but > sometimes resulted in non-aggressive VACUUMs needlessly wasting an > opportunity to advance relfrozenxid (when a page was modified in just > the wrong way, at just the wrong time). It also resulted in a needless > recheck of the visibility map for each and every page skipped during > non-aggressive VACUUMs. > > Avoid these problems by conditioning the "skippable page was definitely > all-frozen when range of skippable pages was first determined" behavior > on what the visibility map _actually said_ about the range as a whole > back when we first determined the extent of the range (don't deduce what > must have happened at that time on the basis of aggressive-ness). This > allows us to reliably count skipped pages in frozenskipped_pages when > they were initially all-frozen. In particular, when a page's visibility > map bit is unset after the point where a skippable range of pages is > initially determined, but before the point where the page is actually > skipped, non-aggressive VACUUMs now count it in frozenskipped_pages, > just like aggressive VACUUMs always have [1]. It's not critical for the > non-aggressive case to get this right, but there is no reason not to. > > [1] Actually, it might not work that way when there happens to be a mix > of all-visible and all-frozen pages in a range of skippable pages. > There is no chance of VACUUM advancing relfrozenxid in this scenario > either way, though, so it doesn't matter. I think this commit message needs a good amount of polishing - it's very convoluted. It's late and I didn't sleep well, but I've tried to read it several times without really getting a sense of what this precisely does. > From 15dec1e572ac4da0540251253c3c219eadf46a83 Mon Sep 17 00:00:00 2001 > From: Peter Geoghegan <pg@bowt.ie> > Date: Thu, 24 Feb 2022 17:21:45 -0800 > Subject: [PATCH v9 4/4] Avoid setting a page all-visible but not all-frozen. To me the commit message body doesn't actually describe what this is doing... > This is pretty much an addendum to the work in the "Make page-level > characteristics drive freezing" commit. It has been broken out like > this because I'm not even sure if it's necessary. It seems like we > might want to be paranoid about losing out on the chance to advance > relfrozenxid in non-aggressive VACUUMs, though. > The only test that will trigger this case is the "freeze-the-dead" > isolation test. It's incredibly narrow. On the other hand, why take a > chance? All it takes is one heap page that's all-visible (and not also > all-frozen) nestled between some all-frozen heap pages to lose out on > relfrozenxid advancement. The SKIP_PAGES_THRESHOLD stuff won't save us > then [1]. FWIW, I'd really like to get rid of SKIP_PAGES_THRESHOLD. It often ends up spending a lot of time doing IO that we never need, completely trashing all CPU caches, while not actually causing decent readahead IO from what I've seen. Greetings, Andres Freund
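The "small static inline helper" suggestion is easy to picture from the xmax/xvac stanzas in the quoted diff. A sketch of what it might look like (the helper name and exact shape are invented here, not taken from the actual patch; only the TransactionId macros are real):

    /*
     * Track one still-extant XID against the caller's running minimum, and
     * report whether it is old enough to force freezing.  Sketch only.
     */
    static inline bool
    TrackExtantXid(TransactionId xid, TransactionId backstop_cutoff_xid,
                   TransactionId *relfrozenxid_nofreeze_out)
    {
        if (!TransactionIdIsNormal(xid))
            return false;

        if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
            *relfrozenxid_nofreeze_out = xid;

        return TransactionIdPrecedes(xid, backstop_cutoff_xid);
    }

With something along those lines, the repeated xmax and HEAP_MOVED stanzas in the diff each reduce to a single "needs_freeze |= TrackExtantXid(...)" call.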
On Thu, Feb 24, 2022 at 11:14 PM Andres Freund <andres@anarazel.de> wrote: > I am not a fan of the backstop terminology. It's still the reason we need to > do freezing for correctness reasons. Thanks for the review! I'm not wedded to that particular terminology, but I think that we need something like it. Open to suggestions. How about limit-based? Something like that? > It'd make more sense to me to turn it > around and call the "non-backstop" freezing opportunistic freezing or such. The problem with that scheme is that it leads to a world where "standard freezing" is incredibly rare (it often literally never happens), whereas "opportunistic freezing" is incredibly common. That doesn't make much sense to me. We tend to think of 50 million XIDs (the vacuum_freeze_min_age default) as being not that many. But I think that it can be a huge number, too. Even then, it's unpredictable -- I suspect that it can change without very much changing in the application, from the point of view of users. That's a big part of the problem I'm trying to address -- freezing outside of aggressive VACUUMs is way too rare (it might barely happen at all). FreezeLimit/vacuum_freeze_min_age was designed at a time when there was no visibility map at all, when it made somewhat more sense as the thing that drives freezing. Incidentally, this is part of the problem with anti-wraparound vacuums and freezing debt -- the fact that some quite busy databases take weeks or months to go through 50 million XIDs (or 200 million) increases the pain of the eventual aggressive VACUUM. It's not completely unbounded -- autovacuum_freeze_max_age is not 100% useless here. But the extent to which that stuff bounds the debt can vary enormously, for not-very-good reasons. > > Whenever we opt to > > "freeze a page", the new page-level algorithm *always* uses the most > > recent possible XID and MXID values (OldestXmin and oldestMxact) to > > decide what XIDs/XMIDs need to be replaced. That might sound like it'd > > be too much, but it only applies to those pages that we actually > > decide to freeze (since page-level characteristics drive everything > > now). FreezeLimit is only one way of triggering that now (and one of > > the least interesting and rarest). > > That largely makes sense to me and doesn't seem weird. I'm very pleased that the main intuition behind 0002 makes sense to you. That's a start, at least. > I'm a tad concerned about replacing mxids that have some members that are > older than OldestXmin but not older than FreezeLimit. It's not too hard to > imagine that accelerating mxid consumption considerably. But we can probably, > if not already done, special case that. Let's assume for a moment that this is a real problem. I'm not sure if it is or not myself (it's complicated), but let's say that it is. The problem may be more than offset by the positive impact on relminmxid advancement. I have placed a large emphasis on enabling relfrozenxid/relminmxid advancement in every non-aggressive VACUUM, for a number of reasons -- this is one of the reasons. Finding a way for every VACUUM operation to be "vacrel->scanned_pages + vacrel->frozenskipped_pages == orig_rel_pages" (i.e. making *some* amount of relfrozenxid/relminmxid advancement possible in every VACUUM) has a great deal of value. As I said recently on the "do only critical work during single-user vacuum?" thread, why should databases that consume too many MXIDs do so evenly, across all their tables? 
There are usually one or two large tables, and many more smaller tables. I think it's much more likely that the largest tables consume approximately zero MultiXactIds in these databases -- actual MultiXactId consumption is probably concentrated in just one or two smaller tables (even when we burn through MultiXacts very quickly). But we don't recognize these kinds of distinctions at all right now. Under these conditions, we will have many more opportunities to advance relminmxid for most of the tables (including the larger tables) all the way up to current-oldestMxact with the patch series. Without needing to freeze *any* MultiXacts early (just freezing some XIDs early) to get that benefit. The patch series is not just about spreading the burden of freezing, so that non-aggressive VACUUMs freeze more -- it's also making relfrozenxid and relminmxid more recent and therefore *reliable* indicators of where any wraparound problems *really* are. Does that make sense to you? This kind of "virtuous cycle" seems really important to me. It's a subtle point, so I have to ask. > > It seems that heap_prepare_freeze_tuple allocates new MXIDs (when > > freezing old ones) in large part so it can NOT freeze XIDs that it > > would have been useful (and much cheaper) to remove anyway. > > Well, we may have to allocate a new mxid because some members are older than > FreezeLimit but others are still running. When do we not remove xids that > would have been cheaper to remove once we decide to actually do work? My point was that today, on HEAD, there is nothing fundamentally special about FreezeLimit (aka cutoff_xid) as far as heap_prepare_freeze_tuple is concerned -- and yet that's the only cutoff it knows about, really. Why can't we do better, by "exploiting the difference" between FreezeLimit and OldestXmin? > > On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM operation's > > OldestXmin at all (it actually just gets FreezeLimit passed as its > > cutoff_xid argument). It cannot possibly recognize any of this for itself. > > It does recognize something like OldestXmin in a more precise and expensive > way - MultiXactIdIsRunning() and TransactionIdIsCurrentTransactionId(). It doesn't look that way to me. While it's true that FreezeMultiXactId() will call MultiXactIdIsRunning(), that's only a cross-check. This cross-check is made at a point where we've already determined that the MultiXact in question is < cutoff_multi. In other words, it catches cases where a "MultiXactId < cutoff_multi" Multi contains an XID *that's still running* -- a correctness issue. Nothing to do with being smart about avoiding allocating new MultiXacts during freezing, or exploiting the fact that "FreezeLimit < OldestXmin" (which is almost always true, very true). This correctness issue is the same issue discussed in "NB: cutoff_xid *must* be <= the current global xmin..." comments that appear at the top of heap_prepare_freeze_tuple. That's all. > Hm. I guess I'll have to look at the code for it. It doesn't immediately > "feel" quite right. I kinda think it might be. Please let me know if you see a problem with what I've said. > > oldestXmin and oldestMxact map to the same wall clock time, more or less -- > > that seems like it might be an important distinction, independent of > > everything else. > > Hm. Multis can be kept alive by fairly "young" member xids. So it may not be > removable (without creating a newer multi) until much later than its creation > time. So I don't think that's really true. 
Maybe what I said above is true, even though (at the same time) I have *also* created new problems with "young" member xids. I really don't know right now, though. > > Final relfrozenxid values must still be >= FreezeLimit in an aggressive > > VACUUM (FreezeLimit is still used as an XID-age based backstop there). > > In non-aggressive VACUUMs (where there is still no strict guarantee that > > relfrozenxid will be advanced at all), we now advance relfrozenxid by as > > much as we possibly can. This exploits workload conditions that make it > > easy to advance relfrozenxid by many more XIDs (for the same amount of > > freezing/pruning work). > > Don't we now always advance relfrozenxid as much as we can, particularly also > during aggressive vacuums? I just meant "we hope for the best and accept what we can get". Will fix. > > * FRM_RETURN_IS_MULTI > > * The return value is a new MultiXactId to set as new Xmax. > > * (caller must obtain proper infomask bits using GetMultiXactIdHintBits) > > + * > > + * "relfrozenxid_out" is an output value; it's used to maintain target new > > + * relfrozenxid for the relation. It can be ignored unless "flags" contains > > + * either FRM_NOOP or FRM_RETURN_IS_MULTI, because we only handle multiXacts > > + * here. This follows the general convention: only track XIDs that will still > > + * be in the table after the ongoing VACUUM finishes. Note that it's up to > > + * caller to maintain this when the Xid return value is itself an Xid. > > + * > > + * Note that we cannot depend on xmin to maintain relfrozenxid_out. > > What does it mean for xmin to maintain something? Will fix. > > + * See heap_prepare_freeze_tuple for information about the basic rules for the > > + * cutoffs used here. > > + * > > + * Maintains *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out, which > > + * are the current target relfrozenxid and relminmxid for the relation. We > > + * assume that caller will never want to freeze its tuple, even when the tuple > > + * "needs freezing" according to our return value. > > I don't understand the "will never want to" bit? I meant "even when it's a non-aggressive VACUUM, which will never want to wait for a cleanup lock the hard way, and will therefore always settle for these relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out values". Note the convention here, which is that relfrozenxid_nofreeze_out is not the same thing as relfrozenxid_out -- the former variable name is used for values in cases where we *don't* freeze, the latter for values in the cases where we do. Will try to clear that up. > > Caller should make temp > > + * copies of global tracking variables before starting to process a page, so > > + * that we can only scribble on copies. That way caller can just discard the > > + * temp copies if it isn't okay with that assumption. > > + * > > + * Only aggressive VACUUM callers are expected to really care when a tuple > > + * "needs freezing" according to us. It follows that non-aggressive VACUUMs > > + * can use *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out in all > > + * cases. > > Could it make sense to track can_freeze and need_freeze separately? You mean to change the signature of heap_tuple_needs_freeze, so it doesn't return a bool anymore? It just has two bool pointers as arguments, can_freeze and need_freeze? I suppose that could make sense. Don't feel strongly either way. > I may be misreading the diff, but aren't we know continuing to use multi down > below even if !MultiXactIdIsValid()? Will investigate. 
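For context, the concern here is that the rewritten block in the quoted diff drops the old early exit for an invalid multi, yet still unpacks members further down. Restoring the guard might look roughly like this (a sketch written against the quoted diff, not the eventual fix):

    multi = HeapTupleHeaderGetRawXmax(tuple);
    if (!MultiXactIdIsValid(multi))
    {
        /* no xmax set -- nothing to track here, and nothing to freeze */
    }
    else
    {
        if (MultiXactIdPrecedes(multi, *relminmxid_nofreeze_out))
            *relminmxid_nofreeze_out = multi;

        if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
            return true;
        if (MultiXactIdPrecedes(multi, backstop_cutoff_multi))
            needs_freeze = true;

        /* only now is it safe to unpack the members */
        nmembers = GetMultiXactIdMembers(multi, &members, false,
                                         HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
        /* ... member loop as in the diff above ... */
    }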
> Doesn't this mean we unpack the members even if the multi is old enough to > need freezing? Just to then do it again during freezing? Accessing multis > isn't cheap... Will investigate. > This stanza is repeated a bunch. Perhaps put it in a small static inline > helper? Will fix. > Struct member names starting with an upper case look profoundly ugly to > me... But this isn't the first one, so I guess... :( I am in 100% agreement, actually. But you know how it goes... > I still suspect this will cause a very substantial increase in WAL traffic in > realistic workloads. It's common to have workloads where tuples are inserted > once, and deleted once/ partition dropped. I agree with the principle that this kind of use case should be accommodated in some way. > I think we'll have to make this less aggressive or tunable. Random ideas for > heuristics: The problem that all of these heuristics have is that they will tend to make it impossible for future non-aggressive VACUUMs to be able to advance relfrozenxid. All that it takes is one single all-visible page to make that impossible. As I said upthread, I think that being able to advance relfrozenxid (and especially relminmxid) by *some* amount in every VACUUM has non-obvious value. Maybe you can address that by changing the behavior of non-aggressive VACUUMs, so that they are directly sensitive to this. Maybe they don't skip any all-visible pages when there aren't too many, that kind of thing. That needs to be in scope IMV. > I don't know how to parse "thereby making relfrozenxid advancement impossible > ... will no longer hinder relfrozenxid advancement"? Will fix. > > We now consistently avoid leaving behind all-visible (not all-frozen) pages. > > This (as well as work from commit 44fa84881f) makes relfrozenxid advancement > > in non-aggressive VACUUMs commonplace. > > s/consistently/try to/? Will fix. > > The system accumulates freezing debt in proportion to the number of > > physical heap pages with unfrozen tuples, more or less. Anything based > > on XID age is likely to be a poor proxy for the eventual cost of > > freezing (during the inevitable anti-wraparound autovacuum). At a high > > level, freezing is now treated as one of the costs of storing tuples in > > physical heap pages -- not a cost of transactions that allocate XIDs. > > Although vacuum_freeze_min_age and vacuum_multixact_freeze_min_age still > > influence what we freeze, and when, they effectively become backstops. > > It may still be necessary to "freeze a page" due to the presence of a > > particularly old XID, from before VACUUM's FreezeLimit cutoff, though > > that will be rare in practice -- FreezeLimit is just a backstop now. > > I don't really like the "rare in practice" bit. It'll be rare in some > workloads but others will likely be much less affected. Maybe. The first time one XID crosses FreezeLimit now will be enough to trigger freezing the page. So it's still very different to today. I'll change this, though. It's not important. > I think this means my concern above about increasing mxid creation rate > substantially may be warranted. Can you think of an adversarial workload, to get a sense of the extent of the problem? > > + * backstop_cutoff_xid must be <= cutoff_xid, and backstop_cutoff_multi must > > + * be <= cutoff_multi. When any XID/XMID from before these backstop cutoffs > > + * is encountered, we set *force_freeze to true, making caller freeze the page > > + * (freezing-eligible XIDs/XMIDs will be frozen, at least). 
"Backstop > > + * freezing" ensures that VACUUM won't allow XIDs/XMIDs to ever get too old. > > + * This shouldn't be necessary very often. VACUUM should prefer to freeze > > + * when it's cheap (not when it's urgent). > > Hm. Does this mean that we might call heap_prepare_freeze_tuple and then > decide not to freeze? Yes. And so heap_prepare_freeze_tuple is now a little more like its sibling function, heap_tuple_needs_freeze. > Doesn't that mean we might create new multis over and > over, because we don't end up pulling the trigger on freezing the page? > Ah, I guess not. But it'd be nicer if I didn't have to scroll down to the body > of the function to figure it out... Will fix. > I think this commit message needs a good amount of polishing - it's very > convoluted. It's late and I didn't sleep well, but I've tried to read it > several times without really getting a sense of what this precisely does. It received much less polishing than the others. Think of 0003 like this: The logic for skipping a range of blocks using the visibility map works by deciding the next_unskippable_block-wise range of skippable blocks up front. Later, we actually execute the skipping of this range of blocks (assuming it exceeds SKIP_PAGES_THRESHOLD). These are two separate steps. Right now, we do this: if (skipping_blocks && blkno < nblocks - 1) { /* * Tricky, tricky. If this is in aggressive vacuum, the page * must have been all-frozen at the time we checked whether it * was skippable, but it might not be any more. We must be * careful to count it as a skipped all-frozen page in that * case, or else we'll think we can't update relfrozenxid and * relminmxid. If it's not an aggressive vacuum, we don't * know whether it was initially all-frozen, so we have to * recheck. */ if (vacrel->aggressive || VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer)) vacrel->frozenskipped_pages++; continue; } The fact that this is conditioned in part on "vacrel->aggressive" concerns me here. Why should we have a special case for this, where we condition something on aggressive-ness that isn't actually strictly related to that? Why not just remember that the range that we're skipping was all-frozen up-front? That way non-aggressive VACUUMs are not unnecessarily at a disadvantage, when it comes to being able to advance relfrozenxid. What if we end up not incrementing vacrel->frozenskipped_pages when we easily could have, just because this is a non-aggressive VACUUM? I think that it's worth avoiding stuff like that whenever possible. Maybe this particular example isn't the most important one. For example it probably isn't as bad as the one was fixed by the lazy_scan_noprune work. But why even take a chance? Seems easier to remove the special case -- which is what this really is. > FWIW, I'd really like to get rid of SKIP_PAGES_THRESHOLD. It often ends up > causing a lot of time doing IO that we never need, completely trashing all CPU > caches, while not actually causing decent readaead IO from what I've seen. I am also suspicious of SKIP_PAGES_THRESHOLD. But if we want to get rid of it, we'll need to be sensitive to how that affects relfrozenxid advancement in non-aggressive VACUUMs IMV. Thanks again for the review! -- Peter Geoghegan
Hi, On 2022-02-25 14:00:12 -0800, Peter Geoghegan wrote: > On Thu, Feb 24, 2022 at 11:14 PM Andres Freund <andres@anarazel.de> wrote: > > I am not a fan of the backstop terminology. It's still the reason we need to > > do freezing for correctness reasons. > > Thanks for the review! > > I'm not wedded to that particular terminology, but I think that we > need something like it. Open to suggestions. > > How about limit-based? Something like that? freeze_required_limit, freeze_desired_limit? Or s/limit/cutoff/? Or s/limit/below/? I kind of like below because that answers < vs <= which I find hard to remember around freezing. > > I'm a tad concerned about replacing mxids that have some members that are > > older than OldestXmin but not older than FreezeLimit. It's not too hard to > > imagine that accelerating mxid consumption considerably. But we can probably, > > if not already done, special case that. > > Let's assume for a moment that this is a real problem. I'm not sure if > it is or not myself (it's complicated), but let's say that it is. The > problem may be more than offset by the positive impact on relminmxid > advancement. I have placed a large emphasis on enabling > relfrozenxid/relminmxid advancement in every non-aggressive VACUUM, > for a number of reasons -- this is one of the reasons. Finding a way > for every VACUUM operation to be "vacrel->scanned_pages + > vacrel->frozenskipped_pages == orig_rel_pages" (i.e. making *some* > amount of relfrozenxid/relminmxid advancement possible in every > VACUUM) has a great deal of value. That may be true, but I think working more incrementally is better in this area. I'd rather have a smaller improvement for a release, collect some data, get another improvement in the next, than see a bunch of reports of larger wins and large regressions. > As I said recently on the "do only critical work during single-user > vacuum?" thread, why should databases that > consume too many MXIDs do so evenly, across all their tables? There > are usually one or two large tables, and many more smaller tables. I > think it's much more likely that the largest tables consume > approximately zero MultiXactIds in these databases -- actual > MultiXactId consumption is probably concentrated in just one or two > smaller tables (even when we burn through MultiXacts very quickly). > But we don't recognize these kinds of distinctions at all right now. Recognizing those distinctions seems independent of freezing multixacts with live members. I am happy with freezing them more aggressively if they don't have live members. It's freezing mxids with live members that has me concerned. The limits you're proposing are quite aggressive and can advance quickly. I've seen large tables with plenty of multixacts. Typically concentrated over a value range (often changing over time). > Under these conditions, we will have many more opportunities to > advance relminmxid for most of the tables (including the larger > tables) all the way up to current-oldestMxact with the patch series. > Without needing to freeze *any* MultiXacts early (just freezing some > XIDs early) to get that benefit. The patch series is not just about > spreading the burden of freezing, so that non-aggressive VACUUMs > freeze more -- it's also making relfrozenxid and relminmxid more > recent and therefore *reliable* indicators of where any > wraparound problems *really* are. My concern was explicitly about the case where we have to create new multixacts... > Does that make sense to you? Yes. 
> > > On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM operation's > > > OldestXmin at all (it actually just gets FreezeLimit passed as its > > > cutoff_xid argument). It cannot possibly recognize any of this for itself. > > > > It does recognize something like OldestXmin in a more precise and expensive > > way - MultiXactIdIsRunning() and TransactionIdIsCurrentTransactionId(). > > It doesn't look that way to me. > > While it's true that FreezeMultiXactId() will call MultiXactIdIsRunning(), > that's only a cross-check. > This cross-check is made at a point where we've already determined that the > MultiXact in question is < cutoff_multi. In other words, it catches cases > where a "MultiXactId < cutoff_multi" Multi contains an XID *that's still > running* -- a correctness issue. Nothing to do with being smart about > avoiding allocating new MultiXacts during freezing, or exploiting the fact > that "FreezeLimit < OldestXmin" (which is almost always true, very true). If there's <= 1 live members in a mxact, we replace it with a plain xid iff the xid also would get frozen. With the current freezing logic I don't see what passing down OldestXmin would change. Or how it differs to a meaningful degree from heap_prepare_freeze_tuple()'s logic. I don't see how it'd avoid a single new mxact from being allocated. > > > Caller should make temp > > > + * copies of global tracking variables before starting to process a page, so > > > + * that we can only scribble on copies. That way caller can just discard the > > > + * temp copies if it isn't okay with that assumption. > > > + * > > > + * Only aggressive VACUUM callers are expected to really care when a tuple > > > + * "needs freezing" according to us. It follows that non-aggressive VACUUMs > > > + * can use *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out in all > > > + * cases. > > > > Could it make sense to track can_freeze and need_freeze separately? > > > > You mean to change the signature of heap_tuple_needs_freeze, so it > > doesn't return a bool anymore? It just has two bool pointers as > > arguments, can_freeze and need_freeze? Something like that. Or return true if there's anything to do, and then rely on can_freeze and need_freeze for finer details. But it doesn't matter that much. > > I still suspect this will cause a very substantial increase in WAL traffic in > > realistic workloads. It's common to have workloads where tuples are inserted > > once, and deleted once/ partition dropped. > > I agree with the principle that this kind of use case should be > accommodated in some way. > > > I think we'll have to make this less aggressive or tunable. Random ideas for > > heuristics: > > The problem that all of these heuristics have is that they will tend > to make it impossible for future non-aggressive VACUUMs to be able to > advance relfrozenxid. All that it takes is one single all-visible page > to make that impossible. As I said upthread, I think that being able > to advance relfrozenxid (and especially relminmxid) by *some* amount > in every VACUUM has non-obvious value. I think that's a laudable goal. But I don't think we should go there unless we are quite confident we've mitigated the potential downsides. Observed horizons for "never vacuumed before" tables and for aggressive vacuums alone would be a huge win. > Maybe you can address that by changing the behavior of non-aggressive > VACUUMs, so that they are directly sensitive to this. 
Maybe they don't > skip any all-visible pages when there aren't too many, that kind of > thing. That needs to be in scope IMV. Yea. I still like my idea to have vacuum process some all-visible pages every time and to increase that percentage based on how old the relfrozenxid is. We could slowly "refill" the number of all-visible pages VACUUM is allowed to process whenever dirtying a page for other reasons. > > I think this means my concern above about increasing mxid creation rate > > substantially may be warranted. > > Can you think of an adversarial workload, to get a sense of the extent > > of the problem? I'll try to come up with something. > > FWIW, I'd really like to get rid of SKIP_PAGES_THRESHOLD. It often ends up > > spending a lot of time doing IO that we never need, completely trashing all CPU > > caches, while not actually causing decent readahead IO from what I've seen. > > I am also suspicious of SKIP_PAGES_THRESHOLD. But if we want to get > > rid of it, we'll need to be sensitive to how that affects relfrozenxid > > advancement in non-aggressive VACUUMs IMV. It might make sense to separate the purposes of SKIP_PAGES_THRESHOLD. The relfrozenxid advancement doesn't benefit from visiting all-frozen pages, just because there are only 30 of them in a row. > Thanks again for the review! NP, I think we need a lot of improvements in this area. I wish somebody would tackle merging heap_page_prune() with vacuuming. Primarily so we only do a single WAL record. But also because the separation has caused a *lot* of complexity. I already have more projects than I should, otherwise I'd start on it... Greetings, Andres Freund
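The "process an increasing percentage of all-visible pages" idea could be sketched as something like the following -- purely illustrative, with an invented function name and formula, just to show the shape of the heuristic being floated:

    /*
     * Hypothetical heuristic: what fraction of the otherwise-skippable
     * all-visible (but not all-frozen) pages should this VACUUM visit
     * anyway?  Scale up from 0 to 1 as relfrozenxid ages toward
     * autovacuum_freeze_max_age.  Invented for illustration only.
     */
    static double
    allvisible_scan_fraction(TransactionId relfrozenxid)
    {
        double      age = (double) (ReadNextTransactionId() - relfrozenxid);
        double      frac = age / (double) autovacuum_freeze_max_age;

        return Min(Max(frac, 0.0), 1.0);
    }

The "refill" variant mentioned above would presumably replenish that budget whenever a page has to be dirtied for some other reason anyway.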
On Fri, Feb 25, 2022 at 2:00 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Hm. I guess I'll have to look at the code for it. It doesn't immediately > > "feel" quite right. > > I kinda think it might be. Please let me know if you see a problem > with what I've said. Oh, wait. I have a better idea of what you meant now. The loop towards the end of FreezeMultiXactId() will indeed "Determine whether to keep this member or ignore it." when we need a new MultiXactId. The loop is exact in the sense that it will only include those XIDs that are truly needed -- those that are still running. But why should we ever get to the FreezeMultiXactId() loop with the stuff from 0002 in place? The whole purpose of the loop is to handle cases where we have to remove *some* (not all) XIDs from before cutoff_xid that appear in a MultiXact, which requires careful checking of each XID (this is only possible when the MultiXactId is < cutoff_multi to begin with, which is OldestMxact in the patch, which is presumably very recent). It's not impossible that we'll get some number of "skewed MultiXacts" with the patch -- cases that really do necessitate allocating a new MultiXact, just to "freeze some XIDs from a MultiXact". That is, there will sometimes be some number of XIDs that are < OldestXmin, but nevertheless appear in some MultiXactIds >= OldestMxact. This seems likely to be rare with the patch, though, since VACUUM calculates its OldestXmin and OldestMxact (which are what cutoff_xid and cutoff_multi really are in the patch) at the same point in time. Which was the point I made in my email yesterday. How many of these "skewed MultiXacts" can we really expect? Seems like there might be very few in practice. But I'm really not sure about that. -- Peter Geoghegan
Hi, On 2022-02-25 15:28:17 -0800, Peter Geoghegan wrote: > But why should we ever get to the FreezeMultiXactId() loop with the > stuff from 0002 in place? The whole purpose of the loop is to handle > cases where we have to remove *some* (not all) XIDs from before > cutoff_xid that appear in a MultiXact, which requires careful checking > of each XID (this is only possible when the MultiXactId is < > cutoff_multi to begin with, which is OldestMxact in the patch, which > is presumably very recent). > > It's not impossible that we'll get some number of "skewed MultiXacts" > with the patch -- cases that really do necessitate allocating a new > MultiXact, just to "freeze some XIDs from a MultiXact". That is, there > will sometimes be some number of XIDs that are < OldestXmin, but > nevertheless appear in some MultiXactIds >= OldestMxact. This seems > likely to be rare with the patch, though, since VACUUM calculates its > OldestXmin and OldestMxact (which are what cutoff_xid and cutoff_multi > really are in the patch) at the same point in time. Which was the > point I made in my email yesterday. I don't see why it matters that OldestXmin and OldestMxact are computed at the same time? It's a question of the workload, not vacuum algorithm. OldestMxact inherently lags OldestXmin. OldestMxact can only advance after all members are older than OldestXmin (not quite true, but that's the bound), and they have always more than one member. > How many of these "skewed MultiXacts" can we really expect? I don't think they're skewed in any way. It's a fundamental aspect of multixacts. Greetings, Andres Freund
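A concrete example of that lag, with invented numbers: suppose XID 1000 and XID 5000 both acquire a row lock on the same tuple, creating MultiXactId 700 with members {1000, 5000}. XID 1000 commits right away, but XID 5000 stays open. While XID 5000 runs, OldestXmin can be no newer than 5000 and OldestMxact can be no newer than 700 -- a multi whose other member, XID 1000, may already be far behind OldestXmin. The MXID horizon can only move past a multi once every member is behind the XID horizon, so OldestMxact always corresponds to an earlier wall-clock moment than OldestXmin, and a multi at or after OldestMxact can still carry XIDs from well before OldestXmin.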
On Fri, Feb 25, 2022 at 3:48 PM Andres Freund <andres@anarazel.de> wrote: > I don't see why it matters that OldestXmin and OldestMxact are computed at the > same time? It's a question of the workload, not vacuum algorithm. I think it's both. > OldestMxact inherently lags OldestXmin. OldestMxact can only advance after all > members are older than OldestXmin (not quite true, but that's the bound), and > they have always more than one member. > > > > How many of these "skewed MultiXacts" can we really expect? > > I don't think they're skewed in any way. It's a fundamental aspect of > multixacts. Having this happen to some degree is fundamental to MultiXacts, sure. But also seems like the approach of using FreezeLimit and MultiXactCutoff in the way that we do right now seems like it might make the problem a lot worse. Because they're completely meaningless cutoffs. They are magic numbers that have no relationship whatsoever to each other. There are problems with assuming that OldestXmin and OldestMxact "align" -- no question. But at least it's approximately true -- which is a start. They are at least not arbitrarily, unpredictably different, like FreezeLimit and MultiXactCutoff are, and always will be. I think that that's a meaningful and useful distinction. I am okay with making the most pessimistic possible assumptions about how any changes to how we freeze might cause FreezeMultiXactId() to allocate more MultiXacts than before. And I accept that the patch series shouldn't "get credit" for "offsetting" any problem like that by making relminmxid advancement occur much more frequently (even though that does seem very valuable). All I'm really saying is this: in general, there are probably quite a few opportunities for FreezeMultiXactId() to avoid allocating new XMIDs (just to freeze XIDs) by having the full context. And maybe by making the dialog between lazy_scan_prune and heap_prepare_freeze_tuple a bit more nuanced. -- Peter Geoghegan
On Fri, Feb 25, 2022 at 3:26 PM Andres Freund <andres@anarazel.de> wrote: > freeze_required_limit, freeze_desired_limit? Or s/limit/cutoff/? Or > s/limit/below/? I kind of like below because that answers < vs <= which I find > hard to remember around freezing. I like freeze_required_limit the most. > That may be true, but I think working more incrementally is better in this > area. I'd rather have a smaller improvement for a release, collect some data, > get another improvement in the next, than see a bunch of reports of larger > wins and large regressions. I agree. There is an important practical way in which it makes sense to treat 0001 as separate to 0002. It is true that 0001 is independently quite useful. In practical terms, I'd be quite happy to just get 0001 into Postgres 15, without 0002. I think that that's what you meant here, in concrete terms, and we can agree on that now. However, it is *also* true that there is an important practical sense in which they *are* related. I don't want to ignore that either -- it does matter. Most of the value to be had here comes from the synergy between 0001 and 0002 -- or what I've been calling a "virtuous cycle", the thing that makes it possible to advance relfrozenxid/relminmxid in almost every VACUUM. Having both 0001 and 0002 together (or something along the same lines) is way more valuable than having just one. Perhaps we can even agree on this second point. I am encouraged by the fact that you at least recognize the general validity of the key ideas from 0002. If I am going to commit 0001 (and not 0002) ahead of feature freeze for 15, I better be pretty sure that I have at least roughly the right idea with 0002, too -- since that's the direction that 0001 is going in. It almost seems dishonest to pretend that I wasn't thinking of 0002 when I wrote 0001. I'm glad that you seem to agree that this business of accumulating freezing debt without any natural limit is just not okay. That is really fundamental to me. I mean, vacuum_freeze_min_age kind of doesn't work as designed. This is a huge problem for us. > > Under these conditions, we will have many more opportunities to > > advance relminmxid for most of the tables (including the larger > > tables) all the way up to current-oldestMxact with the patch series. > > Without needing to freeze *any* MultiXacts early (just freezing some > > XIDs early) to get that benefit. The patch series is not just about > > spreading the burden of freezing, so that non-aggressive VACUUMs > > freeze more -- it's also making relfrozenxid and relminmxid more > > recent and therefore *reliable* indicators of where any > > wraparound problems *really* are. > > My concern was explicitly about the case where we have to create new > multixacts... It was a mistake on my part to counter your point about that with this other point about eager relminmxid advancement. As I said in the last email, while that is very valuable, it's not something that needs to be brought into this. > > Does that make sense to you? > > Yes. Okay, great. The fact that you recognize the value in that comes as a relief. > > You mean to change the signature of heap_tuple_needs_freeze, so it > > doesn't return a bool anymore? It just has two bool pointers as > > arguments, can_freeze and need_freeze? > > Something like that. Or return true if there's anything to do, and then rely > on can_freeze and need_freeze for finer details. But it doesn't matter that much. Got it. 
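Spelled out, the signature being agreed to here might look something like this -- the parameter names are carried over from the discussion above purely for illustration, not from any committed code:

    /* return value: is there anything here worth doing at all? */
    extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple,
                                        TransactionId backstop_cutoff_xid,
                                        MultiXactId backstop_cutoff_multi,
                                        TransactionId *relfrozenxid_nofreeze_out,
                                        MultiXactId *relminmxid_nofreeze_out,
                                        bool *can_freeze,
                                        bool *need_freeze);

The two flags would carry the finer distinction: can_freeze meaning "freezing here would accomplish something", need_freeze meaning "something is already past the backstop cutoffs".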
> > The problem that all of these heuristics have is that they will tend > > to make it impossible for future non-aggressive VACUUMs to be able to > > advance relfrozenxid. All that it takes is one single all-visible page > > to make that impossible. As I said upthread, I think that being able > > to advance relfrozenxid (and especially relminmxid) by *some* amount > > in every VACUUM has non-obvious value. > > I think that's a laudable goal. But I don't think we should go there unless we > are quite confident we've mitigated the potential downsides. True. But that works both ways. We also shouldn't err in the direction of adding these kinds of heuristics (which have real downsides) until the idea of mostly swallowing the cost of freezing whole pages (while making it possible to disable it) has lost, fairly. Overall, it looks like the cost is acceptable in most cases. I think that users will find it very reassuring to regularly and reliably see confirmation that wraparound is being kept at bay, by every VACUUM operation, with details that they can relate to their workload. That has real value IMV -- even when it's theoretically unnecessary for us to be so eager with advancing relfrozenxid. I really don't like the idea of falling behind on freezing systematically. You always run the "risk" of freezing being wasted. But that way of looking at it can be penny wise, pound foolish -- maybe we should just accept that trying to predict what will happen in the future (whether or not freezing will be worth it) is mostly not helpful. Our users mostly complain about performance stability these days. Big shocks are really something we ought to avoid. That does have a cost. Why wouldn't it? > > Maybe you can address that by changing the behavior of non-aggressive > > VACUUMs, so that they are directly sensitive to this. Maybe they don't > > skip any all-visible pages when there aren't too many, that kind of > > thing. That needs to be in scope IMV. > > Yea. I still like my idea to have vacuum process some all-visible pages > every time and to increase that percentage based on how old the relfrozenxid > is. You can quite easily construct cases where the patch does much better than that, though -- very believable cases. Any table like pgbench_history. And so I lean towards quantifying the cost of page-level freezing carefully, making sure there is nothing pathological, and then just accepting it (with a GUC to disable). The reality is that freezing is really a cost of storing data in Postgres, and will be for the foreseeable future. > > Can you think of an adversarial workload, to get a sense of the extent > > of the problem? > > I'll try to come up with something. That would be very helpful. Thanks! > It might make sense to separate the purposes of SKIP_PAGES_THRESHOLD. The > relfrozenxid advancement doesn't benefit from visiting all-frozen pages, just > because there are only 30 of them in a row. Right. I imagine that SKIP_PAGES_THRESHOLD actually does help with this, but if we actually tried we'd find a much better way. > I wish somebody would tackle merging heap_page_prune() with > vacuuming. Primarily so we only do a single WAL record. But also because the > separation has caused a *lot* of complexity. I already have more projects than > I should, otherwise I'd start on it... That has value, but it doesn't feel as urgent. -- Peter Geoghegan
On Sun, Feb 20, 2022 at 3:27 PM Peter Geoghegan <pg@bowt.ie> wrote: > > I think that the idea has potential, but I don't think that I > > understand yet what the *exact* algorithm is. > > The algorithm seems to exploit a natural tendency that Andres once > described in a blog post about his snapshot scalability work [1]. To a > surprising extent, we can usefully bucket all tuples/pages into two > simple categories: > > 1. Very, very old ("infinitely old" for all practical purposes). > > 2. Very very new. > > There doesn't seem to be much need for a third "in-between" category > in practice. This seems to be at least approximately true all of the > time. > > Perhaps Andres wouldn't agree with this very general statement -- he > actually said something more specific. I for one believe that the > point he made generalizes surprisingly well, though. I have my own > theories about why this appears to be true. (Executive summary: power > laws are weird, and it seems as if the sparsity-of-effects principle > makes it easy to bucket things at the highest level, in a way that > generalizes well across disparate workloads.) I think that this is not really a description of an algorithm -- and I think that it is far from clear that the third "in-between" category does not need to exist. > Remember when I got excited about how my big TPC-C benchmark run > showed a predictable, tick/tock style pattern across VACUUM operations > against the order and order lines table [2]? It seemed very > significant to me that the OldestXmin of VACUUM operation n > consistently went on to become the new relfrozenxid for the same table > in VACUUM operation n + 1. It wasn't exactly the same XID, but very > close to it (within the range of noise). This pattern was clearly > present, even though VACUUM operation n + 1 might happen as long as 4 > or 5 hours after VACUUM operation n (this was a big table). I think findings like this are very unconvincing. TPC-C (or any benchmark really) is so simple as to be a terrible proxy for what vacuuming is going to look like on real-world systems. Like, it's nice that it works, and it shows that something's working, but it doesn't demonstrate that the patch is making the right trade-offs overall. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Mar 1, 2022 at 1:46 PM Robert Haas <robertmhaas@gmail.com> wrote: > I think that this is not really a description of an algorithm -- and I > think that it is far from clear that the third "in-between" category > does not need to exist. But I already described the algorithm. It is very simple mechanistically -- though that in itself means very little. As I have said multiple times now, the hard part is assessing what the implications are. And the even harder part is making a judgement about whether or not those implications are what we generally want. > I think findings like this are very unconvincing. TPC-C may be unrealistic in certain ways, but it is nevertheless vastly more realistic than pgbench. pgbench is really more of a stress test than a benchmark. The main reasons why TPC-C is interesting here are *very* simple, and would likely be equally true with TPC-E (just for example) -- even though TPC-E is a very different benchmark kind of OLTP workload overall. TPC-C (like TPC-E) features a diversity of transaction types, some of which are more complicated than others -- which is strictly more realistic than having only one highly synthetic OLTP transaction type. Each transaction type doesn't necessarily modify the same tables in the same way. This leads to natural diversity among tables and among transactions, including: * The typical or average number of distinct XIDs per heap page varies significantly among each table. There are way fewer distinct XIDs per "order line" table heap page than there are per "order" table heap page, for the obvious reason. * Roughly speaking, there are various different ways that free space management ought to work in a system like Postgres. For example it is necessary to make a "fragmentations vs space utilization" trade-off with the new orders table. * There are joins in some of the transactions! Maybe TPC-C is a crude approximation of reality, but it nevertheless exercises relevant parts of the system to a significant degree. What else would you expect me to use, for a project like this? To a significant degree the relfrozenxid tracking stuff is interesting because tables tend to have natural differences like the ones I have highlighted on this thread. How could that not be the case? Why wouldn't we want to take advantage of that? There might be some danger in over-optimizing for this particular benchmark, but right now that is so far from being the main problem that the idea seems strange to me. pgbench doesn't need the FSM, at all. In fact pgbench doesn't even really need VACUUM (except for antiwraparound), once heap fillfactor is lowered to 95 or so. pgbench simply isn't relevant, *at all*, except perhaps as a way of measuring regressions in certain synthetic cases that don't benefit. > TPC-C (or any > benchmark really) is so simple as to be a terrible proxy for what > vacuuming is going to look like on real-world systems. Doesn't that amount to "no amount of any kind of testing or benchmarking will convince me of anything, ever"? There is more than one type of real-world system. I think that TPC-C is representative of some real world systems in some regards. But even that's not the important point for me. I find TPC-C generally interesting for one reason: I can clearly see that Postgres does things in a way that just doesn't make much sense, which isn't particularly fundamental to how VACUUM works. 
My only long term goal is to teach Postgres to *avoid* various pathological cases exhibited by TPC-C (e.g., the B-Tree "split after new tuple" mechanism from commit f21668f328 *avoids* a pathological case from TPC-C). We don't necessarily have to agree on how important each individual case is "in the real world" (which is impossible to know anyway). We only have to agree that what we see is a pathological case (because some reasonable expectation is dramatically violated), and then work out a fix. I don't want to teach Postgres to be clever -- I want to teach it to avoid being stupid in cases where it exhibits behavior that really cannot be described any other way. You seem to talk about some of this work as if it was just as likely to have a detrimental effect elsewhere, for some equally plausible workload, which will have a downside that is roughly as bad as the advertised upside. I consider that very unlikely, though. Sure, regressions are quite possible, and a real concern -- but regressions *like that* are unlikely. Avoiding doing what is clearly the wrong thing just seems to work out that way, in general. -- Peter Geoghegan
On Fri, Feb 25, 2022 at 5:52 PM Peter Geoghegan <pg@bowt.ie> wrote: > There is an important practical way in which it makes sense to treat > 0001 as separate to 0002. It is true that 0001 is independently quite > useful. In practical terms, I'd be quite happy to just get 0001 into > Postgres 15, without 0002. I think that that's what you meant here, in > concrete terms, and we can agree on that now.

Attached is v10. While this does still include the freezing patch, it's not in scope for Postgres 15. As I've said, I still think that it makes sense to maintain the patch series with the freezing stuff, since it's structurally related. So, to be clear, the first two patches from the patch series are in scope for Postgres 15. But not the third.

Highlights:

* Changes to terminology and commit messages along the lines suggested by Andres.

* Bug fixes to heap_tuple_needs_freeze()'s MultiXact handling. My testing strategy here still needs work.

* Expanded refactoring in the v10-0002 patch.

The v10-0002 patch (which appeared for the first time in v9) was originally all about fixing a case where non-aggressive VACUUMs were at a gratuitous disadvantage (relative to aggressive VACUUMs) around advancing relfrozenxid -- very much like the lazy_scan_noprune work from commit 44fa8488. And that is still its main purpose. But the refactoring now seems related to Andres' idea of making non-aggressive VACUUMs decide to scan a few extra all-visible pages in order to be able to advance relfrozenxid.

The code that sets up skipping with the visibility map is made a lot clearer by v10-0002. That patch moves a significant amount of code from lazy_scan_heap() into a new helper routine (so it continues the trend started by the Postgres 14 work that added lazy_scan_prune()). Now skipping a range of pages with the visibility map is fundamentally based on setting up the range up front, and then using the same saved details about the range thereafter -- we no longer have ad-hoc VM_ALL_VISIBLE()/VM_ALL_FROZEN() calls for pages from a range that we already decided to skip (so no calls to those routines from lazy_scan_heap(), at least not until after we finish processing in lazy_scan_prune()).

This is more or less what we were doing all along for one special case: aggressive VACUUMs. We had to make sure to either increment frozenskipped_pages or increment scanned_pages for every page from rel_pages -- this issue is described by lazy_scan_heap() comments on HEAD that begin with "Tricky, tricky." (these date back to the freeze map work from 2016). Anyway, there is no reason not to go further with that: we should make whole ranges the basic unit that we deal with when skipping. It's a lot simpler to think in terms of entire ranges (not individual pages) that are determined to be all-visible or all-frozen up-front, without needing to recheck anything (regardless of whether it's an aggressive VACUUM). We don't need to track frozenskipped_pages this way. And it's much more obvious that it's safe for more complicated cases, in particular for aggressive VACUUMs.

This kind of approach seems necessary to make non-aggressive VACUUMs do a little more work opportunistically, when they realize that they can advance relfrozenxid relatively easily that way (which I believe Andres favors as part of overhauling freezing). That becomes a lot more natural when you have a clear and unambiguous separation between deciding what range of blocks to skip, and then actually skipping.
I can imagine the new helper function added by v10-0002 (which I've called lazy_scan_skip_range()) eventually being taught to do these kinds of tricks. In general I think that all of the details of what to skip need to be decided up front. The loop in lazy_scan_heap() should execute skipping based on the instructions it receives from the new helper function, in the simplest way possible. The helper function can become more intelligent about the costs and benefits of skipping in the future, without that impacting lazy_scan_heap(). -- Peter Geoghegan
Attachment
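A rough sketch of the range-based approach described above, with hypothetical names (SkipRange, setup_skip_range, and the callback are illustrative only -- they are not the code in the v10 patch): the details of a whole range are decided once, up front, and never rechecked.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;

    typedef struct SkipRange
    {
        BlockNumber start;          /* first block in the range */
        BlockNumber end;            /* first block *after* the range */
        bool        skip;           /* skip the whole range? */
        bool        all_frozen;     /* every page in the range was
                                     * all-frozen when it was set up */
    } SkipRange;

    /* assumption: vm_status() wraps the visibility map lookups */
    typedef enum { VM_NONE, VM_ALL_VISIBLE, VM_ALL_FROZEN } VMStatus;
    typedef VMStatus (*vm_status_fn) (BlockNumber blkno);

    static SkipRange
    setup_skip_range(BlockNumber next, BlockNumber rel_pages,
                     vm_status_fn vm_status, bool aggressive)
    {
        SkipRange   range = {next, next, false, true};

        /* Extend the range while pages remain skippable */
        while (range.end < rel_pages)
        {
            VMStatus    st = vm_status(range.end);

            if (st == VM_NONE)
                break;              /* must scan this page */
            if (st == VM_ALL_VISIBLE)
            {
                if (aggressive)
                    break;          /* aggressive VACUUM must scan it */
                range.all_frozen = false;
            }
            range.end++;
        }

        range.skip = (range.end - range.start) >= 32;   /* SKIP_PAGES_THRESHOLD */
        return range;
    }

With something like this, frozenskipped_pages bookkeeping becomes unnecessary: the range itself remembers whether everything it covers was all-frozen when the decision was made.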
On Sun, Mar 13, 2022 at 9:05 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v10. While this does still include the freezing patch, > it's not in scope for Postgres 15. As I've said, I still think that it > makes sense to maintain the patch series with the freezing stuff, > since it's structurally related.

Attached is v11. Changes:

* No longer includes the patch that adds page-level freezing. It was making it harder to assess code coverage for the patches that I'm targeting Postgres 15 with. And so including it with each new revision no longer seems useful. I'll pick it up for Postgres 16.

* Extensive isolation tests added to v11-0001-*, exercising a lot of hard-to-hit code paths that are reached when VACUUM is unable to immediately acquire a cleanup lock on some heap page. In particular, we now have test coverage for the code in heapam.c that handles tracking the oldest extant XID and MXID in the presence of MultiXacts (on a no-cleanup-lock heap page).

* v11-0002-* (which is the patch that avoids missing out on advancing relfrozenxid in non-aggressive VACUUMs due to a race condition on HEAD) now moves even more of the logic for deciding how VACUUM will skip using the visibility map into its own helper routine. Now lazy_scan_heap just follows what the state returned by the helper routine tells it about the current skippable range -- it doesn't make any decisions itself anymore. This is far simpler than what we do currently, on HEAD. There are no behavioral changes here, but this approach could be pushed further to improve performance. We could easily determine *every* page that we're going to scan (not skip) up-front in even the largest tables, very early, before we've even scanned one page. This could enable things like I/O prefetching, or capping the size of the dead_items array based on our final scanned_pages (not on rel_pages).

* A new patch (v11-0003-*) alters the behavior of VACUUM's DISABLE_PAGE_SKIPPING option. DISABLE_PAGE_SKIPPING no longer forces aggressive VACUUM -- now it only disables use of the visibility map for skipping (forcing every page to be scanned), since that behavior is totally independent of aggressiveness.

I don't feel too strongly about the DISABLE_PAGE_SKIPPING change. It just seems logical to decouple no-vm-skipping from aggressiveness -- it might actually be helpful in testing the work from the patch series in the future.

Any page counted in scanned_pages has essentially been processed by VACUUM with this work in place -- that was the idea behind the lazy_scan_noprune stuff from commit 44fa8488. Bear in mind that the relfrozenxid tracking stuff from v11-0001-* makes it almost certain that a DISABLE_PAGE_SKIPPING-without-aggressiveness VACUUM will still manage to advance relfrozenxid -- usually by the same amount as an equivalent aggressive VACUUM would anyway. (Failing to acquire a cleanup lock on some heap page might result in the final relfrozenxid being appreciably older, but probably not, and we'd still almost certainly manage to advance relfrozenxid by *some* small amount.)

Of course, anybody that wants both an aggressive VACUUM and a VACUUM that never skips even all-frozen pages in the visibility map will still be able to get that behavior quite easily. For example, VACUUM(DISABLE_PAGE_SKIPPING, FREEZE) will do that.
Several of our existing tests must already use both of these options together, because the tests require an effective vacuum_freeze_min_age of 0 (and vacuum_multixact_freeze_min_age of 0) -- DISABLE_PAGE_SKIPPING alone won't do that on HEAD, which seems to confuse the issue (see commit b700f96c for an example of that). In other words, since DISABLE_PAGE_SKIPPING doesn't *consistently* force lazy_scan_noprune to refuse to process a page on HEAD (it all depends on FreezeLimit/vacuum_freeze_min_age), it is logical for DISABLE_PAGE_SKIPPING to totally get out of the business of caring about that -- better to limit it to caring only about the visibility map (by no longer making it force aggressiveness). -- Peter Geoghegan
Attachment
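As a way of restating the v11-0003 idea, here is an illustrative sketch (hypothetical names, not the actual patch) of what decoupling "never skip pages using the VM" from "aggressive" might look like, as two independent behaviors derived from the options:

    #include <stdbool.h>

    typedef struct VacOptions
    {
        bool    disable_page_skipping;          /* VACUUM (DISABLE_PAGE_SKIPPING) */
        bool    freeze;                         /* VACUUM (FREEZE) */
        bool    table_age_forces_aggressive;    /* vacuum_freeze_table_age crossed */
    } VacOptions;

    typedef struct VacBehavior
    {
        bool    skipwithvm;     /* may skip pages based on the VM? */
        bool    aggressive;     /* must advance relfrozenxid to >= FreezeLimit */
    } VacBehavior;

    static VacBehavior
    decide_behavior(VacOptions opts)
    {
        VacBehavior b;

        /* DISABLE_PAGE_SKIPPING only forces every VM-skippable page to be scanned */
        b.skipwithvm = !opts.disable_page_skipping;

        /* aggressiveness now comes only from FREEZE or table age */
        b.aggressive = opts.freeze || opts.table_age_forces_aggressive;

        return b;
    }

Under this split, DISABLE_PAGE_SKIPPING only pins down skipwithvm, while FREEZE (or table age) is what forces aggressiveness -- hence the VACUUM(DISABLE_PAGE_SKIPPING, FREEZE) combination to get both.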
On Wed, Mar 23, 2022 at 3:59 PM Peter Geoghegan <pg@bowt.ie> wrote: > In other words, since DISABLE_PAGE_SKIPPING doesn't *consistently* > force lazy_scan_noprune to refuse to process a page on HEAD (it all > depends on FreezeLimit/vacuum_freeze_min_age), it is logical for > DISABLE_PAGE_SKIPPING to totally get out of the business of caring > about that -- better to limit it to caring only about the visibility > map (by no longer making it force aggressiveness). It seems to me that if DISABLE_PAGE_SKIPPING doesn't completely disable skipping pages, we have a problem. The option isn't named CARE_ABOUT_VISIBILITY_MAP. It's named DISABLE_PAGE_SKIPPING. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Mar 23, 2022 at 1:41 PM Robert Haas <robertmhaas@gmail.com> wrote: > It seems to me that if DISABLE_PAGE_SKIPPING doesn't completely > disable skipping pages, we have a problem.

It depends on how you define skipping. DISABLE_PAGE_SKIPPING was created at a time when a broader definition of skipping made a lot more sense.

> The option isn't named CARE_ABOUT_VISIBILITY_MAP. It's named > DISABLE_PAGE_SKIPPING.

VACUUM(DISABLE_PAGE_SKIPPING, VERBOSE) will still consistently show that 100% of the pages from rel_pages are scanned. A page that is "skipped" by lazy_scan_noprune isn't pruned, and won't have any of its tuples frozen. But every other aspect of processing the page happens in just the same way as it would in the cleanup lock/lazy_scan_prune path. We'll even still VACUUM the page if it happens to have some existing LP_DEAD items left behind by opportunistic pruning. We don't need a cleanup lock in lazy_scan_noprune (a share lock is all we need), nor do we need one in lazy_vacuum_heap_page (a regular exclusive lock is all we need).

-- Peter Geoghegan
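A simplified, standalone sketch of the control flow being described (the stubs stand in for ConditionalLockBufferForCleanup(), LockBuffer(), lazy_scan_noprune(), and lazy_scan_prune() -- this is not the actual vacuumlazy.c code):

    #include <stdbool.h>
    #include <stdio.h>

    typedef int Buffer;

    /* Trivial stand-ins for the buffer manager and vacuumlazy.c routines */
    static bool try_cleanup_lock(Buffer buf) { (void) buf; return false; }
    static void share_lock(Buffer buf) { (void) buf; }
    static void unlock(Buffer buf) { (void) buf; }
    static void cleanup_lock(Buffer buf) { (void) buf; }
    static bool scan_noprune(Buffer buf) { (void) buf; return true; }
    static void scan_prune(Buffer buf) { (void) buf; }

    /*
     * Shape of the fallback: when a cleanup lock isn't immediately
     * available, a share lock is enough to read tuple headers, remember
     * existing LP_DEAD items, and maintain the relfrozenxid/relminmxid
     * trackers.  Only when an aggressive VACUUM finds XIDs that must be
     * frozen does it wait for a real cleanup lock after all.
     */
    static void
    process_page(Buffer buf)
    {
        if (!try_cleanup_lock(buf))
        {
            share_lock(buf);
            if (scan_noprune(buf))
                return;             /* page still counts as scanned */
            unlock(buf);
            cleanup_lock(buf);      /* rare: must freeze, so wait */
        }
        scan_prune(buf);
    }

    int
    main(void)
    {
        process_page(1);
        printf("page processed\n");
        return 0;
    }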
On Wed, Mar 23, 2022 at 4:49 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Wed, Mar 23, 2022 at 1:41 PM Robert Haas <robertmhaas@gmail.com> wrote: > > It seems to me that if DISABLE_PAGE_SKIPPING doesn't completely > > disable skipping pages, we have a problem. > > It depends on how you define skipping. DISABLE_PAGE_SKIPPING was > created at a time when a broader definition of skipping made a lot > more sense. > > > The option isn't named CARE_ABOUT_VISIBILITY_MAP. It's named > > DISABLE_PAGE_SKIPPING. > > VACUUM(DISABLE_PAGE_SKIPPING, VERBOSE) will still consistently show > that 100% of all of the pages from rel_pages are scanned. A page that > is "skipped" by lazy_scan_noprune isn't pruned, and won't have any of > its tuples frozen. But every other aspect of processing the page > happens in just the same way as it would in the cleanup > lock/lazy_scan_prune path. I see what you mean about it depending on how you define "skipping". But I think that DISABLE_PAGE_SKIPPING is intended as a sort of emergency safeguard when you really, really don't want to leave anything out. And therefore I favor defining it to mean that we don't skip any work at all. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Mar 23, 2022 at 1:53 PM Robert Haas <robertmhaas@gmail.com> wrote: > I see what you mean about it depending on how you define "skipping". > But I think that DISABLE_PAGE_SKIPPING is intended as a sort of > emergency safeguard when you really, really don't want to leave > anything out. I agree. > And therefore I favor defining it to mean that we don't > skip any work at all. But even today DISABLE_PAGE_SKIPPING won't do pruning when we cannot acquire a cleanup lock on a page, unless it happens to have XIDs from before FreezeLimit (which is probably 50 million XIDs behind OldestXmin, the vacuum_freeze_min_age default). I don't see much difference. Anyway, this isn't important. I'll just drop the third patch. -- Peter Geoghegan
On Thu, Mar 24, 2022 at 9:59 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Wed, Mar 23, 2022 at 1:53 PM Robert Haas <robertmhaas@gmail.com> wrote: > > And therefore I favor defining it to mean that we don't > > skip any work at all. > > But even today DISABLE_PAGE_SKIPPING won't do pruning when we cannot > acquire a cleanup lock on a page, unless it happens to have XIDs from > before FreezeLimit (which is probably 50 million XIDs behind > OldestXmin, the vacuum_freeze_min_age default). I don't see much > difference. Yeah, I found it confusing that DISABLE_PAGE_SKIPPING doesn't disable all page skipping, so 3414099c turned out to be not enough.
On Wed, Mar 23, 2022 at 2:03 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Yeah, I found it confusing that DISABLE_PAGE_SKIPPING doesn't disable > all page skipping, so 3414099c turned out to be not enough. The proposed change to DISABLE_PAGE_SKIPPING is partly driven by that, and partly driven by a similar concern about aggressive VACUUM. It seems worth emphasizing the idea that an aggressive VACUUM is now just the same as any other VACUUM except for one detail: we're guaranteed to advance relfrozenxid to a value >= FreezeLimit at the end. The non-aggressive case has the choice to do things that make that impossible. But there are only two places where this can happen now: 1. Non-aggressive VACUUMs might decide to skip some all-visible pages in the new lazy_scan_skip() helper routine for skipping with the VM (see v11-0002-*). 2. A non-aggressive VACUUM can *always* decide to ratchet back its target relfrozenxid in lazy_scan_noprune, to avoid waiting for a cleanup lock -- a final value from before FreezeLimit is usually still pretty good. The first scenario is the only one where it becomes impossible for non-aggressive VACUUM to be able to advance relfrozenxid (with v11-0001-* in place) by any amount. Even that's a choice, made by weighing costs against benefits. There is no behavioral change in v11-0002-* (we're still using the old SKIP_PAGES_THRESHOLD strategy), but the lazy_scan_skip() helper routine could fairly easily be taught a lot more about the downside of skipping all-visible pages (namely how that makes it impossible to advance relfrozenxid). Maybe it's worth skipping all-visible pages (there are lots of them and age(relfrozenxid) is still low), and maybe it isn't worth it. We should get to decide, without implementation details making relfrozenxid advancement unsafe. It would be great if you could take a look v11-0002-*, Robert. Does it make sense to you? Thanks -- Peter Geoghegan
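The "ratchet back its target relfrozenxid" behavior in scenario 2 can be illustrated with a minimal standalone sketch (plain uint32 comparison stands in for the wraparound-aware TransactionIdPrecedes(), and the names are illustrative rather than taken from the patch):

    #include <stdint.h>

    typedef uint32_t TransactionId;

    typedef struct RelfrozenXidTracker
    {
        TransactionId NewRelfrozenXid;  /* starts out at OldestXmin */
    } RelfrozenXidTracker;

    static void
    observe_unfrozen_xid(RelfrozenXidTracker *t, TransactionId xid)
    {
        /* An extant, unfrozen XID caps how far relfrozenxid can advance */
        if (xid < t->NewRelfrozenXid)   /* assumption: no wraparound */
            t->NewRelfrozenXid = xid;
    }

The tracker starts out at OldestXmin, and every unfrozen XID that VACUUM leaves behind can only pull it backwards -- so the final value is simply the oldest extant XID, however each page happened to be processed.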
On Wed, Mar 23, 2022 at 6:28 PM Peter Geoghegan <pg@bowt.ie> wrote: > It would be great if you could take a look v11-0002-*, Robert. Does it > make sense to you? You're probably not going to love hearing this, but I think you're still explaining things here in ways that are too baroque and hard to follow. I do think it's probably better. But, for example, in the commit message for 0001, I think you could change the subject line to "Allow non-aggressive vacuums to advance relfrozenxid" and it would be clearer. And then I think you could eliminate about half of the first paragraph, starting with "There is no fixed relationship", and all of the third paragraph (which starts with "Later work..."), and I think removing all that material would make it strictly more clear than it is currently. I don't think it's the place of a commit message to speculate too much on future directions or to wax eloquent on theoretical points. If that belongs anywhere, it's in a mailing list discussion. It seems to me that 0002 mixes code movement with functional changes. I'm completely on board with moving the code that decides how much to skip into a function. That seems like a great idea, and probably overdue. But it is not easy for me to see what has changed functionally between the old and new code organization, and I bet it would be possible to split this into two patches, one of which creates a function, and the other of which fixes the problem, and I think that would be a useful service to future readers of the code. I have a hard time believing that if someone in the future bisects a problem back to this commit, they're going to have an easy time finding the behavior change in here. In fact I can't see it myself. I think the actual functional change is to fix what is described in the second paragraph of the commit message, but I haven't been able to figure out where the logic is actually changing to address that. Note that I would be happy with the behavior change happening either before or after the code reorganization. I also think that the commit message for 0002 is probably longer and more complex than is really helpful, and that the subject line is too vague, but since I don't yet understand exactly what's happening here, I cannot comment on how I think it should be revised at this point, except to say that the second paragraph of that commit message looks like the most useful part. I would also like to mention a few things that I do like about 0002. One is that it seems to collapse two different pieces of logic for page skipping into one. That seems good. As mentioned, it's especially good because that logic is abstracted into a function. Also, it looks like it is making a pretty localized change to one (1) aspect of what VACUUM does -- and I definitely prefer patches that change only one thing at a time. Hope that's helpful. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Mar 24, 2022 at 10:21 AM Robert Haas <robertmhaas@gmail.com> wrote: > You're probably not going to love hearing this, but I think you're > still explaining things here in ways that are too baroque and hard to > follow. I do think it's probably better. There are a lot of dimensions to this work. It's hard to know which to emphasize here. > But, for example, in the > commit message for 0001, I think you could change the subject line to > "Allow non-aggressive vacuums to advance relfrozenxid" and it would be > clearer. But non-aggressive VACUUMs have always been able to do that. How about: "Set relfrozenxid to oldest extant XID seen by VACUUM" > And then I think you could eliminate about half of the first > paragraph, starting with "There is no fixed relationship", and all of > the third paragraph (which starts with "Later work..."), and I think > removing all that material would make it strictly more clear than it > is currently. I don't think it's the place of a commit message to > speculate too much on future directions or to wax eloquent on > theoretical points. If that belongs anywhere, it's in a mailing list > discussion. Okay, I'll do that. > It seems to me that 0002 mixes code movement with functional changes. Believe it or not, I avoided functional changes in 0002 -- at least in one important sense. That's why you had difficulty spotting any. This must sound peculiar, since the commit message very clearly says that the commit avoids a problem seen only in the non-aggressive case. It's really quite subtle. You wrote this comment and code block (which I propose to remove in 0002), so clearly you already understand the race condition that I'm concerned with here: - if (skipping_blocks && blkno < rel_pages - 1) - { - /* - * Tricky, tricky. If this is in aggressive vacuum, the page - * must have been all-frozen at the time we checked whether it - * was skippable, but it might not be any more. We must be - * careful to count it as a skipped all-frozen page in that - * case, or else we'll think we can't update relfrozenxid and - * relminmxid. If it's not an aggressive vacuum, we don't - * know whether it was initially all-frozen, so we have to - * recheck. - */ - if (vacrel->aggressive || - VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer)) - vacrel->frozenskipped_pages++; - continue; - } What you're saying here boils down to this: it doesn't matter what the visibility map would say right this microsecond (in the aggressive case) were we to call VM_ALL_FROZEN(): we know for sure that the VM said that this page was all-frozen *in the recent past*. That's good enough; we will never fail to scan a page that might have an XID < OldestXmin (ditto for XMIDs) this way, which is all that really matters. This is absolutely mandatory in the aggressive case, because otherwise relfrozenxid advancement might be seen as unsafe. My observation is: Why should we accept the same race in the non-aggressive case? Why not do essentially the same thing in every VACUUM? In 0002 we now track if each range that we actually chose to skip had any all-visible (not all-frozen) pages -- if that happens then relfrozenxid advancement becomes unsafe. The existing code uses "vacrel->aggressive" as a proxy for the same condition -- the existing code reasons based on what the visibility map must have said about the page in the recent past. Which makes sense, but only works in the aggressive case. 
The approach taken in 0002 also makes the code simpler, which is what enabled putting the VM skipping code into its own helper function, but that was just a bonus. And so you could almost say that there is now behavioral change at all. We're skipping pages in the same way, based on the same information (from the visibility map) as before. We're just being a bit more careful than before about how that information is tracked, to avoid this race. A race that we always avoided in the aggressive case is now consistently avoided. > I'm completely on board with moving the code that decides how much to > skip into a function. That seems like a great idea, and probably > overdue. But it is not easy for me to see what has changed > functionally between the old and new code organization, and I bet it > would be possible to split this into two patches, one of which creates > a function, and the other of which fixes the problem, and I think that > would be a useful service to future readers of the code. It seems kinda tricky to split up 0002 like that. It's possible, but I'm not sure if it's possible to split it in a way that highlights the issue that I just described. Because we already avoided the race in the aggressive case. > I also think that the commit message for 0002 is probably longer and > more complex than is really helpful, and that the subject line is too > vague, but since I don't yet understand exactly what's happening here, > I cannot comment on how I think it should be revised at this point, > except to say that the second paragraph of that commit message looks > like the most useful part. I'll work on that. > I would also like to mention a few things that I do like about 0002. > One is that it seems to collapse two different pieces of logic for > page skipping into one. That seems good. As mentioned, it's especially > good because that logic is abstracted into a function. Also, it looks > like it is making a pretty localized change to one (1) aspect of what > VACUUM does -- and I definitely prefer patches that change only one > thing at a time. Totally embracing the idea that we don't necessarily need very recent information from the visibility map (it just has to be after OldestXmin was established) has a lot of advantages, architecturally. It could in principle be hours out of date in the longest VACUUM operations -- that should be fine. This is exactly the same principle that makes it okay to stick with our original rel_pages, even when the table has grown during a VACUUM operation (I documented this in commit 73f6ec3d3c recently). We could build on the approach taken by 0002 to create a totally comprehensive picture of the ranges we're skipping up-front, before we actually scan any pages, even with very large tables. We could in principle cache a very large number of skippable ranges up-front, without ever going back to the visibility map again later (unless we need to set a bit). It really doesn't matter if somebody else unsets a page's VM bit concurrently, at all. I see a lot of advantage to knowing our final scanned_pages almost immediately. Things like prefetching, capping the size of the dead_items array more intelligently (use final scanned_pages instead of rel_pages in dead_items_max_items()), improvements to progress reporting...not to mention more intelligent choices about whether we should try to advance relfrozenxid a bit earlier during non-aggressive VACUUMs. > Hope that's helpful. Very helpful -- thanks! -- Peter Geoghegan
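A sketch of the bookkeeping described above (hypothetical names, not the actual 0002 patch): the decision to skip a range is made from one read of the visibility map, and the only thing remembered afterwards is whether any skipped page was merely all-visible rather than all-frozen.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;
    typedef enum { PAGE_MUST_SCAN, PAGE_ALL_VISIBLE, PAGE_ALL_FROZEN } VMState;

    typedef struct SkipState
    {
        bool    skippedallvis;      /* any skipped page only all-visible? */
    } SkipState;

    static void
    skip_range(SkipState *state, const VMState *vm, BlockNumber start,
               BlockNumber end)
    {
        for (BlockNumber blkno = start; blkno < end; blkno++)
        {
            /*
             * The decision was made from a snapshot of the VM taken when
             * the range was set up; no recheck later, aggressive or not.
             */
            if (vm[blkno] == PAGE_ALL_VISIBLE)
                state->skippedallvis = true;
        }
    }

    /*
     * Later: relfrozenxid can only be advanced if !state->skippedallvis.
     */

An aggressive VACUUM never skips all-visible-only pages in the first place, so under this scheme it ends up with skippedallvis == false automatically -- the same rule covers both cases.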
On Thu, Mar 24, 2022 at 3:28 PM Peter Geoghegan <pg@bowt.ie> wrote: > But non-aggressive VACUUMs have always been able to do that. > > How about: "Set relfrozenxid to oldest extant XID seen by VACUUM" Sure, that sounds nice. > Believe it or not, I avoided functional changes in 0002 -- at least in > one important sense. That's why you had difficulty spotting any. This > must sound peculiar, since the commit message very clearly says that > the commit avoids a problem seen only in the non-aggressive case. It's > really quite subtle. Well, I think the goal in revising the code is to be as un-subtle as possible. Commits that people can't easily understand breed future bugs. > What you're saying here boils down to this: it doesn't matter what the > visibility map would say right this microsecond (in the aggressive > case) were we to call VM_ALL_FROZEN(): we know for sure that the VM > said that this page was all-frozen *in the recent past*. That's good > enough; we will never fail to scan a page that might have an XID < > OldestXmin (ditto for XMIDs) this way, which is all that really > matters. Makes sense. So maybe the commit message should try to emphasize this point e.g. "If a page is all-frozen at the time we check whether it can be skipped, don't allow it to affect the relfrozenxmin and relminmxid which we set for the relation. This was previously true for aggressive vacuums, but not for non-aggressive vacuums, which was inconsistent. (The reason this is a safe thing to do is that any new XIDs or MXIDs that appear on the page after we initially observe it to be frozen must be newer than any relfrozenxid or relminmxid the current vacuum could possibly consider storing into pg_class.)" > This is absolutely mandatory in the aggressive case, because otherwise > relfrozenxid advancement might be seen as unsafe. My observation is: > Why should we accept the same race in the non-aggressive case? Why not > do essentially the same thing in every VACUUM? Sure, that seems like a good idea. I think I basically agree with the goals of the patch. My concern is just about making the changes understandable to future readers. This area is notoriously subtle, and people are going to introduce more bugs even if the comments and code organization are fantastic. > And so you could almost say that there is now behavioral change at > all. I vigorously object to this part, though. We should always err on the side of saying that commits *do* have behavioral changes. We should go out of our way to call out in the commit message any possible way that someone might notice the difference between the post-commit situation and the pre-commit situation. It is fine, even good, to also be clear about how we're maintaining continuity and why we don't think it's a problem, but the only commits that should be described as not having any behavioral change are ones that do mechanical code movement, or are just changing comments, or something like that. > It seems kinda tricky to split up 0002 like that. It's possible, but > I'm not sure if it's possible to split it in a way that highlights the > issue that I just described. Because we already avoided the race in > the aggressive case. I do see that there are some difficulties there. I'm not sure what to do about that. I think a sufficiently clear commit message could possibly be enough, rather than trying to split the patch. But I also think splitting the patch should be considered, if that can reasonably be done. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Mar 24, 2022 at 1:21 PM Robert Haas <robertmhaas@gmail.com> wrote: > > How about: "Set relfrozenxid to oldest extant XID seen by VACUUM" > > Sure, that sounds nice. Cool. > > What you're saying here boils down to this: it doesn't matter what the > > visibility map would say right this microsecond (in the aggressive > > case) were we to call VM_ALL_FROZEN(): we know for sure that the VM > > said that this page was all-frozen *in the recent past*. That's good > > enough; we will never fail to scan a page that might have an XID < > > OldestXmin (ditto for XMIDs) this way, which is all that really > > matters. > > Makes sense. So maybe the commit message should try to emphasize this > point e.g. "If a page is all-frozen at the time we check whether it > can be skipped, don't allow it to affect the relfrozenxmin and > relminmxid which we set for the relation. This was previously true for > aggressive vacuums, but not for non-aggressive vacuums, which was > inconsistent. (The reason this is a safe thing to do is that any new > XIDs or MXIDs that appear on the page after we initially observe it to > be frozen must be newer than any relfrozenxid or relminmxid the > current vacuum could possibly consider storing into pg_class.)" Okay, I'll add something more like that. Almost every aspect of relfrozenxid advancement by VACUUM seems simpler when thought about in these terms IMV. Every VACUUM now scans all pages that might have XIDs < OldestXmin, and so every VACUUM can advance relfrozenxid to the oldest extant XID (barring non-aggressive VACUUMs that *choose* to skip some all-visible pages). There are a lot more important details, of course. My "Every VACUUM..." statement works well as an axiom because all of those other details don't create any awkward exceptions. > > This is absolutely mandatory in the aggressive case, because otherwise > > relfrozenxid advancement might be seen as unsafe. My observation is: > > Why should we accept the same race in the non-aggressive case? Why not > > do essentially the same thing in every VACUUM? > > Sure, that seems like a good idea. I think I basically agree with the > goals of the patch. Great. > My concern is just about making the changes > understandable to future readers. This area is notoriously subtle, and > people are going to introduce more bugs even if the comments and code > organization are fantastic. Makes sense. > > And so you could almost say that there is now behavioral change at > > all. > > I vigorously object to this part, though. We should always err on the > side of saying that commits *do* have behavioral changes. I think that you've taken my words too literally here. I would never conceal the intent of a piece of work like that. I thought that it would clarify matters to point out that I could in theory "get away with it if I wanted to" in this instance. This was only a means of conveying a subtle point about the behavioral changes from 0002 -- since you couldn't initially see them yourself (even with my commit message). Kind of like Tom Lane's 2011 talk on the query planner. The one where he lied to the audience several times. > > It seems kinda tricky to split up 0002 like that. It's possible, but > > I'm not sure if it's possible to split it in a way that highlights the > > issue that I just described. Because we already avoided the race in > > the aggressive case. > > I do see that there are some difficulties there. I'm not sure what to > do about that. 
I think a sufficiently clear commit message could > possibly be enough, rather than trying to split the patch. But I also > think splitting the patch should be considered, if that can reasonably > be done. I'll see if I can come up with something. It's hard to be sure about that kind of thing when you're this close to the code. -- Peter Geoghegan
On Thu, Mar 24, 2022 at 2:40 PM Peter Geoghegan <pg@bowt.ie> wrote: > > > This is absolutely mandatory in the aggressive case, because otherwise > > > relfrozenxid advancement might be seen as unsafe. My observation is: > > > Why should we accept the same race in the non-aggressive case? Why not > > > do essentially the same thing in every VACUUM? > > > > Sure, that seems like a good idea. I think I basically agree with the > > goals of the patch. > > Great. Attached is v12. My current goal is to commit all 3 patches before feature freeze. Note that this does not include the more complicated patch including with previous revisions of the patch series (the page-level freezing work that appeared in versions before v11). Changes that appear in this new revision, v12: * Reworking of the commit messages based on feedback from Robert. * General cleanup of the changes to heapam.c from 0001 (the changes to heap_prepare_freeze_tuple and related functions). New and existing code now fits together a bit better. I also added a couple of new documenting assertions, to make the flow a bit easier to understand. * Added new assertions that document OldestXmin/FreezeLimit/relfrozenxid invariants, right at the point we update pg_class within vacuumlazy.c. These assertions would have a decent chance of failing if there were any bugs in the code. * Removed patch that made DISABLE_PAGE_SKIPPING not force aggressive VACUUM, limiting the underlying mechanism to forcing scanning of all pages in lazy_scan_heap (v11 was the first and last revision that included this patch). * Adds a new small patch 0003. This just moves the last piece of resource allocation that still took place at the top of lazy_scan_heap() back into its caller, heap_vacuum_rel(). The work in 0003 probably should have happened as part of the patch that became commit 73f6ec3d -- same idea. It's totally mechanical stuff. With 0002 and 0003, there is hardly any lazy_scan_heap code before the main loop that iterates through blocks in rel_pages (and the code that's still there is obviously related to the loop in a direct and obvious way). This seems like a big overall improvement in maintainability. Didn't see a way to split up 0002, per Robert's suggestion 3 days ago. As I said at the time, it's possible to split it up, but not in a way that highlights the underlying issue (since the issue 0002 fixes was always limited to non-aggressive VACUUMs). The commit message may have to suffice. -- Peter Geoghegan
Attachment
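For reference, the invariants that those new assertions are meant to document can be restated compactly (simplified: plain integer comparisons that ignore XID wraparound, so this is only a restatement of the idea, not the assertions from the patch):

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    static void
    check_relfrozenxid_invariants(bool aggressive,
                                  TransactionId old_relfrozenxid,
                                  TransactionId NewRelfrozenXid,
                                  TransactionId FreezeLimit,
                                  TransactionId OldestXmin)
    {
        /* relfrozenxid never moves backwards, and never past OldestXmin */
        assert(old_relfrozenxid <= NewRelfrozenXid);
        assert(NewRelfrozenXid <= OldestXmin);

        /*
         * An aggressive VACUUM must reach at least FreezeLimit.  The
         * equality escape hatch is for OldestXmin itself "going
         * backwards" across VACUUMs, which the real code has to tolerate.
         */
        if (aggressive)
            assert(NewRelfrozenXid == OldestXmin ||
                   FreezeLimit <= NewRelfrozenXid);
    }

    int
    main(void)
    {
        /* e.g. an aggressive VACUUM that advanced all the way to OldestXmin */
        check_relfrozenxid_invariants(true, 100, 200, 150, 200);
        return 0;
    }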
On Sun, Mar 27, 2022 at 11:24 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v12. My current goal is to commit all 3 patches before > feature freeze. Note that this does not include the more complicated > patch including with previous revisions of the patch series (the > page-level freezing work that appeared in versions before v11). Reviewing 0001, focusing on the words in the patch file much more than the code: I can understand this version of the commit message. Woohoo! I like understanding things. I think the header comments for FreezeMultiXactId() focus way too much on what the caller is supposed to do and not nearly enough on what FreezeMultiXactId() itself does. I think to some extent this also applies to the comments within the function body. On the other hand, the header comments for heap_prepare_freeze_tuple() seem good to me. If I were thinking of calling this function, I would know how to use the new arguments. If I were looking for bugs in it, I could compare the logic in the function to what these comments say it should be doing. Yay. I think I understand what the first paragraph of the header comment for heap_tuple_needs_freeze() is trying to say, but the second one is quite confusing. I think this is again because it veers into talking about what the caller should do rather than explaining what the function itself does. I don't like the statement-free else block in lazy_scan_noprune(). I think you could delete the else{} and just put that same comment there with one less level of indentation. There's a clear "return false" just above so it shouldn't be confusing what's happening. The comment hunk at the end of lazy_scan_noprune() would probably be better if it said something more specific than "caller can tolerate reduced processing." My guess is that it would be something like "caller does not need to do something or other." I have my doubts about whether the overwrite-a-future-relfrozenxid behavior is any good, but that's a topic for another day. I suggest keeping the words "it seems best to", though, because they convey a level of tentativeness, which seems appropriate. I am surprised to see you write in maintenance.sgml that the VACUUM which most recently advanced relfrozenxid will typically be the most recent aggressive VACUUM. I would have expected something like "(often the most recent VACUUM)". -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Mar 29, 2022 at 10:03 AM Robert Haas <robertmhaas@gmail.com> wrote: > I can understand this version of the commit message. Woohoo! I like > understanding things. That's good news. > I think the header comments for FreezeMultiXactId() focus way too much > on what the caller is supposed to do and not nearly enough on what > FreezeMultiXactId() itself does. I think to some extent this also > applies to the comments within the function body. To some extent this is a legitimate difference in style. I myself don't think that it's intrinsically good to have these sorts of comments. I just think that it can be the least worst thing when a function is intrinsically written with one caller and one very specific set of requirements in mind. That is pretty much a matter of taste, though. > I think I understand what the first paragraph of the header comment > for heap_tuple_needs_freeze() is trying to say, but the second one is > quite confusing. I think this is again because it veers into talking > about what the caller should do rather than explaining what the > function itself does. I wouldn't have done it that way if the function wasn't called heap_tuple_needs_freeze(). I would be okay with removing this paragraph if the function was renamed to reflect the fact it now tells the caller something about the tuple having an old XID/MXID relative to the caller's own XID/MXID cutoffs. Maybe the function name should be heap_tuple_would_freeze(), making it clear that the function merely tells caller what heap_prepare_freeze_tuple() *would* do, without presuming to tell the vacuumlazy.c caller what it *should* do about any of the information it is provided. Then it becomes natural to see the boolean return value and the changes the function makes to caller's relfrozenxid/relminmxid tracker variables as independent. > I don't like the statement-free else block in lazy_scan_noprune(). I > think you could delete the else{} and just put that same comment there > with one less level of indentation. There's a clear "return false" > just above so it shouldn't be confusing what's happening. Okay, will fix. > The comment hunk at the end of lazy_scan_noprune() would probably be > better if it said something more specific than "caller can tolerate > reduced processing." My guess is that it would be something like > "caller does not need to do something or other." I meant "caller can tolerate not pruning or freezing this particular page". Will fix. > I have my doubts about whether the overwrite-a-future-relfrozenxid > behavior is any good, but that's a topic for another day. I suggest > keeping the words "it seems best to", though, because they convey a > level of tentativeness, which seems appropriate. I agree that it's best to keep a tentative tone here. That code was written following a very specific bug in pg_upgrade several years back. There was a very recent bug fixed only last year, by commit 74cf7d46. FWIW I tend to think that we'd have a much better chance of catching that sort of thing if we'd had better relfrozenxid instrumentation before now. Now you'd see a negative value in the "new relfrozenxid: %u, which is %d xids ahead of previous value" part of the autovacuum log message in the event of such a bug. That's weird enough that I bet somebody would notice and report it. > I am surprised to see you write in maintenance.sgml that the VACUUM > which most recently advanced relfrozenxid will typically be the most > recent aggressive VACUUM. 
I would have expected something like "(often > the most recent VACUUM)". That's always been true, and will only be slightly less true in Postgres 15 -- the fact is that we only need to skip one all-visible page to lose out, and that's not unlikely with tables that aren't quite small with all the patches from v12 applied (we're still much too naive). The work that I'll get into Postgres 15 on VACUUM is very valuable as a basis for future improvements, but not all that valuable to users (improved instrumentation might be the biggest benefit in 15, or maybe relminmxid advancement for certain types of applications). I still think that we need to do more proactive page-level freezing to make relfrozenxid advancement happen in almost every VACUUM, but even that won't quite be enough. There are still cases where we need to make a choice about giving up on relfrozenxid advancement in a non-aggressive VACUUM -- all-visible pages won't completely go away with page-level freezing. At a minimum we'll still have edge cases like the case where heap_lock_tuple() unsets the all-frozen bit. And pg_upgrade'd databases, too. 0002 structures the logic for skipping using the VM in a way that will make the choice to skip or not skip all-visible pages in non-aggressive VACUUMs quite natural. I suspect that SKIP_PAGES_THRESHOLD was always mostly just about relfrozenxid advancement in non-aggressive VACUUM, all along. We can do much better than SKIP_PAGES_THRESHOLD, especially if we preprocess the entire visibility map up-front -- we'll know the costs and benefits up-front, before committing to early relfrozenxid advancement. Overall, aggressive vs non-aggressive VACUUM seems like a false dichotomy to me. ISTM that it should be a totally dynamic set of behaviors. There should probably be several different "aggressive gradations''. Most VACUUMs start out completely non-aggressive (including even anti-wraparound autovacuums), but can escalate from there. The non-cancellable autovacuum behavior (technically an anti-wraparound thing, but really an aggressiveness thing) should be something we escalate to, as with the failsafe. Dynamic behavior works a lot better. And it makes scheduling of autovacuum workers a lot more straightforward -- the discontinuities seem to make that much harder, which is one more reason to avoid them altogether. -- Peter Geoghegan
On Tue, Mar 29, 2022 at 11:58 AM Peter Geoghegan <pg@bowt.ie> wrote: > > I think I understand what the first paragraph of the header comment > > for heap_tuple_needs_freeze() is trying to say, but the second one is > > quite confusing. I think this is again because it veers into talking > > about what the caller should do rather than explaining what the > > function itself does. > > I wouldn't have done it that way if the function wasn't called > heap_tuple_needs_freeze(). > > I would be okay with removing this paragraph if the function was > renamed to reflect the fact it now tells the caller something about > the tuple having an old XID/MXID relative to the caller's own XID/MXID > cutoffs. Maybe the function name should be heap_tuple_would_freeze(), > making it clear that the function merely tells caller what > heap_prepare_freeze_tuple() *would* do, without presuming to tell the > vacuumlazy.c caller what it *should* do about any of the information > it is provided. Attached is v13, which does it that way. This does seem like a real increase in clarity, albeit one that comes at the cost of renaming heap_tuple_needs_freeze(). v13 also addresses all of the other items from Robert's most recent round of feedback. I would like to commit something close to v13 on Friday or Saturday. Thanks -- Peter Geoghegan
Attachment
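A standalone sketch of the heap_tuple_would_freeze() contract as described above (simplified types, xmin/xmax only, no MultiXact handling -- not the actual heapam.c code): the return value reports what heap_prepare_freeze_tuple() would do given the caller's cutoff, while the caller's tracker is ratcheted independently of that answer.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    typedef struct TupleXids
    {
        TransactionId xmin;
        TransactionId xmax;         /* 0 when unset */
    } TupleXids;

    static bool
    tuple_would_freeze(const TupleXids *tup,
                       TransactionId freeze_cutoff,     /* FreezeLimit */
                       TransactionId *relfrozenxid_out)
    {
        bool        would_freeze = false;

        /* Ratchet the caller's tracker for every extant XID... */
        if (tup->xmin != 0 && tup->xmin < *relfrozenxid_out)
            *relfrozenxid_out = tup->xmin;
        if (tup->xmax != 0 && tup->xmax < *relfrozenxid_out)
            *relfrozenxid_out = tup->xmax;

        /* ...and separately report whether the cutoff forces freezing */
        if (tup->xmin != 0 && tup->xmin < freeze_cutoff)
            would_freeze = true;
        if (tup->xmax != 0 && tup->xmax < freeze_cutoff)
            would_freeze = true;

        return would_freeze;
    }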
+ diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid); + Assert(diff > 0); Did you see that this crashed on windows cfbot? https://api.cirrus-ci.com/v1/artifact/task/4592929254670336/log/tmp_check/postmaster.log TRAP: FailedAssertion("diff > 0", File: "c:\cirrus\src\backend\access\heap\vacuumlazy.c", Line: 724, PID: 5984) abort() has been called2022-03-30 03:48:30.267 GMT [5316][client backend] [pg_regress/tablefunc][3/15389:0] ERROR: infiniterecursion detected 2022-03-30 03:48:38.031 GMT [5592][postmaster] LOG: server process (PID 5984) was terminated by exception 0xC0000354 2022-03-30 03:48:38.031 GMT [5592][postmaster] DETAIL: Failed process was running: autovacuum: VACUUM ANALYZE pg_catalog.pg_database 2022-03-30 03:48:38.031 GMT [5592][postmaster] HINT: See C include file "ntstatus.h" for a description of the hexadecimalvalue. https://cirrus-ci.com/task/4592929254670336 00000000`007ff130 00000001`400b4ef8 postgres!ExceptionalCondition( char * conditionName = 0x00000001`40a915d8 "diff > 0", char * errorType = 0x00000001`40a915c8 "FailedAssertion", char * fileName = 0x00000001`40a91598 "c:\cirrus\src\backend\access\heap\vacuumlazy.c", int lineNumber = 0n724)+0x8d [c:\cirrus\src\backend\utils\error\assert.c @ 70] 00000000`007ff170 00000001`402a0914 postgres!heap_vacuum_rel( struct RelationData * rel = 0x00000000`00a51088, struct VacuumParams * params = 0x00000000`00a8420c, struct BufferAccessStrategyData * bstrategy = 0x00000000`00a842a0)+0x1038 [c:\cirrus\src\backend\access\heap\vacuumlazy.c@ 724] 00000000`007ff350 00000001`402a4686 postgres!table_relation_vacuum( struct RelationData * rel = 0x00000000`00a51088, struct VacuumParams * params = 0x00000000`00a8420c, struct BufferAccessStrategyData * bstrategy = 0x00000000`00a842a0)+0x34 [c:\cirrus\src\include\access\tableam.h@ 1681] 00000000`007ff380 00000001`402a1a2d postgres!vacuum_rel( unsigned int relid = 0x4ee, struct RangeVar * relation = 0x00000000`01799ae0, struct VacuumParams * params = 0x00000000`00a8420c)+0x5a6 [c:\cirrus\src\backend\commands\vacuum.c @ 2068] 00000000`007ff400 00000001`4050f1ef postgres!vacuum( struct List * relations = 0x00000000`0179df58, struct VacuumParams * params = 0x00000000`00a8420c, struct BufferAccessStrategyData * bstrategy = 0x00000000`00a842a0, bool isTopLevel = true)+0x69d [c:\cirrus\src\backend\commands\vacuum.c @ 482] 00000000`007ff5f0 00000001`4050dc95 postgres!autovacuum_do_vac_analyze( struct autovac_table * tab = 0x00000000`00a84208, struct BufferAccessStrategyData * bstrategy = 0x00000000`00a842a0)+0x8f [c:\cirrus\src\backend\postmaster\autovacuum.c@ 3248] 00000000`007ff640 00000001`4050b4e3 postgres!do_autovacuum(void)+0xef5 [c:\cirrus\src\backend\postmaster\autovacuum.c@ 2503] It seems like there should be even more logs, especially since it says: [03:48:43.119] Uploading 3 artifacts for c:\cirrus\**\*.diffs [03:48:43.122] Uploaded c:\cirrus\contrib\tsm_system_rows\regression.diffs [03:48:43.125] Uploaded c:\cirrus\contrib\tsm_system_time\regression.diffs
On Tue, Mar 29, 2022 at 11:10 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > + diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid); > + Assert(diff > 0); > > Did you see that this crashed on windows cfbot? > > https://api.cirrus-ci.com/v1/artifact/task/4592929254670336/log/tmp_check/postmaster.log > TRAP: FailedAssertion("diff > 0", File: "c:\cirrus\src\backend\access\heap\vacuumlazy.c", Line: 724, PID: 5984) That's weird. There are very similar assertions a little earlier, that must have *not* failed here, before the call to vac_update_relstats(). I was actually thinking of removing this assertion for that reason -- I thought that it was redundant. Perhaps something is amiss inside vac_update_relstats(), where the boolean flag that indicates that pg_class.relfrozenxid was advanced is set: if (frozenxid_updated) *frozenxid_updated = false; if (TransactionIdIsNormal(frozenxid) && pgcform->relfrozenxid != frozenxid && (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) || TransactionIdPrecedes(ReadNextTransactionId(), pgcform->relfrozenxid))) { if (frozenxid_updated) *frozenxid_updated = true; pgcform->relfrozenxid = frozenxid; dirty = true; } Maybe the "existing relfrozenxid is in the future, silently update relfrozenxid" part of the condition (which involves ReadNextTransactionId()) somehow does the wrong thing here. But how? The other assertions take into account the fact that OldestXmin can itself "go backwards" across VACUUM operations against the same table: Assert(!aggressive || vacrel->NewRelfrozenXid == OldestXmin || TransactionIdPrecedesOrEquals(FreezeLimit, vacrel->NewRelfrozenXid)); Note the "vacrel->NewRelfrozenXid == OldestXmin", without which the assertion will fail pretty easily when the regression tests are run. Perhaps I need to do something like that with the other assertion as well (or more likely just get rid of it). Will figure it out tomorrow. -- Peter Geoghegan
On Wed, Mar 30, 2022 at 12:01 AM Peter Geoghegan <pg@bowt.ie> wrote: > Perhaps something is amiss inside vac_update_relstats(), where the > boolean flag that indicates that pg_class.relfrozenxid was advanced is > set: > > if (frozenxid_updated) > *frozenxid_updated = false; > if (TransactionIdIsNormal(frozenxid) && > pgcform->relfrozenxid != frozenxid && > (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) || > TransactionIdPrecedes(ReadNextTransactionId(), > pgcform->relfrozenxid))) > { > if (frozenxid_updated) > *frozenxid_updated = true; > pgcform->relfrozenxid = frozenxid; > dirty = true; > } > > Maybe the "existing relfrozenxid is in the future, silently update > relfrozenxid" part of the condition (which involves > ReadNextTransactionId()) somehow does the wrong thing here. But how? I tried several times to recreate this issue on CI. No luck with that, though -- can't get it to fail again after 4 attempts. This was a VACUUM of pg_database, run from an autovacuum worker. I am vaguely reminded of the two bugs fixed by Andres in commit a54e1f15. Both were issues with the shared relcache init file affecting shared and nailed catalog relations. Those bugs had symptoms like " ERROR: found xmin ... from before relfrozenxid ..." for various system catalogs. We know that this particular assertion did not fail during the same VACUUM: Assert(vacrel->NewRelfrozenXid == OldestXmin || TransactionIdPrecedesOrEquals(vacrel->relfrozenxid, vacrel->NewRelfrozenXid)); So it's hard to see how this could be a bug in the patch -- the final new relfrozenxid is presumably equal to VACUUM's OldestXmin in the problem scenario seen on the CI Windows instance yesterday (that's why this earlier assertion didn't fail). The assertion I'm showing here needs the "vacrel->NewRelfrozenXid == OldestXmin" part of the condition to account for the fact that OldestXmin/GetOldestNonRemovableTransactionId() is known to "go backwards". Without that the regression tests will fail quite easily. The surprising part of the CI failure must have taken place just after this assertion, when VACUUM's call to vacuum_set_xid_limits() actually updates pg_class.relfrozenxid with vacrel->NewRelfrozenXid -- presumably because the existing relfrozenxid appeared to be "in the future" when we examine it in pg_class again. We see evidence that this must have happened afterwards, when the closely related assertion (used only in instrumentation code) fails: From my patch: > if (frozenxid_updated) > { > - diff = (int32) (FreezeLimit - vacrel->relfrozenxid); > + diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid); > + Assert(diff > 0); > appendStringInfo(&buf, > _("new relfrozenxid: %u, which is %d xids ahead of previous value\n"), > - FreezeLimit, diff); > + vacrel->NewRelfrozenXid, diff); > } Does anybody have any ideas about what might be going on here? -- Peter Geoghegan
Hi, On 2022-03-30 17:50:42 -0700, Peter Geoghegan wrote: > I tried several times to recreate this issue on CI. No luck with that, > though -- can't get it to fail again after 4 attempts. It's really annoying that we don't have Assert variants that show the compared values, that might make it easier to intepret what's going on. Something vaguely like EXPECT_EQ_U32 in regress.c. Maybe AssertCmp(type, a, op, b), Then the assertion could have been something like AssertCmp(int32, diff, >, 0) Does the line number in the failed run actually correspond to the xid, rather than the mxid case? I didn't check. You could try to increase the likelihood of reproducing the failure by duplicating the invocation that lead to the crash a few times in the .cirrus.yml file in your dev branch. That might allow hitting the problem more quickly. Maybe reduce autovacuum_naptime in src/tools/ci/pg_ci_base.conf? Or locally - one thing that windows CI does different from the other platforms is that it runs isolation, contrib and a bunch of other tests using the same cluster. Which of course increases the likelihood of autovacuum having stuff to do, *particularly* on shared relations - normally there's probably not enough changes for that. You can do something similar locally on linux with make -Otarget -C contrib/ -j48 -s USE_MODULE_DB=1 installcheck prove_installcheck=true (the prove_installcheck=true to prevent tap tests from running, we don't seem to have another way for that) I don't think windows uses USE_MODULE_DB=1, but it allows to cause a lot more load concurrently than running tests serially... > We know that this particular assertion did not fail during the same VACUUM: > > Assert(vacrel->NewRelfrozenXid == OldestXmin || > TransactionIdPrecedesOrEquals(vacrel->relfrozenxid, > vacrel->NewRelfrozenXid)); The comment in your patch says "is either older or newer than FreezeLimit" - I assume that's some rephrasing damage? > So it's hard to see how this could be a bug in the patch -- the final > new relfrozenxid is presumably equal to VACUUM's OldestXmin in the > problem scenario seen on the CI Windows instance yesterday (that's why > this earlier assertion didn't fail). Perhaps it's worth commiting improved assertions on master? If this is indeed a pre-existing bug, and we're just missing due to slightly less stringent asserts, we could rectify that separately. > The surprising part of the CI failure must have taken place just after > this assertion, when VACUUM's call to vacuum_set_xid_limits() actually > updates pg_class.relfrozenxid with vacrel->NewRelfrozenXid -- > presumably because the existing relfrozenxid appeared to be "in the > future" when we examine it in pg_class again. We see evidence that > this must have happened afterwards, when the closely related assertion > (used only in instrumentation code) fails: Hm. This triggers some vague memories. There's some oddities around shared relations being vacuumed separately in all the databases and thus having separate horizons. After "remembering" that, I looked in the cirrus log for the failed run, and the worker was processing shared a shared relation last: 2022-03-30 03:48:30.238 GMT [5984][autovacuum worker] LOG: automatic analyze of table "contrib_regression.pg_catalog.pg_authid" Obviously that's not a guarantee that the next table processed also is a shared catalog, but ... Oh, the relid is actually in the stack trace. 0x4ee = 1262 = pg_database. 
Which makes sense, the test ends up with a high percentage of dead rows in pg_database, due to all the different contrib tests creating/dropping a database.

> From my patch:
>
> >  if (frozenxid_updated)
> >  {
> > -     diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
> > +     diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
> > +     Assert(diff > 0);
> >      appendStringInfo(&buf,
> >                       _("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
> > -                     FreezeLimit, diff);
> > +                     vacrel->NewRelfrozenXid, diff);
> >  }

Perhaps this ought to be an elog() instead of an Assert()? Something has gone pear shaped if we get here... It's a bit annoying though, because it'd have to be a PANIC to be visible on the bf / CI :(.

Greetings,

Andres Freund
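To make the AssertCmp() idea above concrete, here is a rough sketch of what such a macro could look like -- purely hypothetical, since nothing like it exists in c.h or regress.c today, and the INT64_FORMAT-based reporting is just one way it might be done:

#ifdef USE_ASSERT_CHECKING
#define AssertCmp(type, a, op, b) \
    do { \
        type    a_ = (a); \
        type    b_ = (b); \
        if (!(a_ op b_)) \
            elog(PANIC, "assertion failed: " #a " " #op " " #b \
                 ", left: " INT64_FORMAT ", right: " INT64_FORMAT, \
                 (int64) a_, (int64) b_); \
    } while (0)
#else
#define AssertCmp(type, a, op, b) ((void) 0)
#endif

/* the failing check could then have been spelled */
AssertCmp(int32, diff, >, 0);

Unlike a bare Assert(), a failure would report both operands, which would have made the diff == 0 case here visible from the CI log alone.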
On Wed, Mar 30, 2022 at 7:00 PM Andres Freund <andres@anarazel.de> wrote:
> Something vaguely like EXPECT_EQ_U32 in regress.c. Maybe
> AssertCmp(type, a, op, b)?
>
> Then the assertion could have been something like
> AssertCmp(int32, diff, >, 0)

I'd definitely use them if they were there.

> Does the line number in the failed run actually correspond to the xid, rather
> than the mxid case? I didn't check.

Yes, I verified -- definitely relfrozenxid.

> You can do something similar locally on linux with
> make -Otarget -C contrib/ -j48 -s USE_MODULE_DB=1 installcheck prove_installcheck=true
> (the prove_installcheck=true is to prevent tap tests from running; we don't seem
> to have another way to do that)
>
> I don't think windows uses USE_MODULE_DB=1, but it allows you to generate a lot more
> concurrent load than running tests serially...

Can't get it to fail locally with that recipe.

> > Assert(vacrel->NewRelfrozenXid == OldestXmin ||
> >        TransactionIdPrecedesOrEquals(vacrel->relfrozenxid,
> >                                      vacrel->NewRelfrozenXid));
>
> The comment in your patch says "is either older or newer than FreezeLimit" - I
> assume that's some rephrasing damage?

Both the comment and the assertion are correct. I see what you mean, though.

> Perhaps it's worth committing improved assertions on master? If this is indeed
> a pre-existing bug, and we're just missing it due to slightly less stringent
> asserts, we could rectify that separately.

I don't think there's much chance of the assertion actually hitting without the rest of the patch series. The new relfrozenxid value is always going to be OldestXmin - vacuum_freeze_min_age on HEAD, while with the patch it's sometimes close to OldestXmin. Especially when you have lots of dead tuples that you churn through constantly (like pgbench_tellers, or like these system catalogs on the CI test machine).

> Hm. This triggers some vague memories. There are some oddities around shared
> relations being vacuumed separately in all the databases and thus having
> separate horizons.

That's what I was thinking of, obviously.

> After "remembering" that, I looked in the cirrus log for the failed run, and
> the worker was processing a shared relation last:
>
> 2022-03-30 03:48:30.238 GMT [5984][autovacuum worker] LOG: automatic analyze of table "contrib_regression.pg_catalog.pg_authid"

I noticed the same thing myself. Should have said sooner.

> Perhaps this ought to be an elog() instead of an Assert()? Something has gone
> pear shaped if we get here... It's a bit annoying though, because it'd have to
> be a PANIC to be visible on the bf / CI :(.

Yeah, a WARNING would be good here. I can write a new version of my patch series with a separate patch for that this evening. Actually, better make it a PANIC for now...
--
Peter Geoghegan
On Wed, Mar 30, 2022 at 7:37 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Yeah, a WARNING would be good here. I can write a new version of my
> patch series with a separate patch for that this evening. Actually,
> better make it a PANIC for now...

Attached is v14, which includes a new patch that PANICs like that in vac_update_relstats() --- 0003. This approach also covers manual VACUUMs, unlike the failing assertion, which is in instrumentation code (actually VACUUM VERBOSE might hit it).

I definitely think that something like this should be committed. Silently ignoring system catalog corruption isn't okay.
--
Peter Geoghegan
Hi, I was able to trigger the crash. cat ~/tmp/pgbench-createdb.sql CREATE DATABASE pgb_:client_id; DROP DATABASE pgb_:client_id; pgbench -n -P1 -c 10 -j10 -T100 -f ~/tmp/pgbench-createdb.sql while I was also running for i in $(seq 1 100); do echo iteration $i; make -Otarget -C contrib/ -s installcheck -j48 -s prove_installcheck=true USE_MODULE_DB=1> /tmp/ci-$i.log 2>&1; done I triggered twice now, but it took a while longer the second time. (gdb) bt full #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49 set = {__val = {4194304, 0, 0, 0, 0, 0, 216172782113783808, 2, 2377909399344644096, 18446497967838863616, 0, 0, 0,0, 0, 0}} pid = <optimized out> tid = <optimized out> ret = <optimized out> #1 0x00007fe49a2db546 in __GI_abort () at abort.c:79 save_stage = 1 act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0}}, sa_flags = 0, sa_restorer = 0x107e0} sigs = {__val = {32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}} #2 0x00007fe49b9706f1 in ExceptionalCondition (conditionName=0x7fe49ba0618d "diff > 0", errorType=0x7fe49ba05bd1 "FailedAssertion", fileName=0x7fe49ba05b90 "/home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c", lineNumber=724) at /home/andres/src/postgresql/src/backend/utils/error/assert.c:69 No locals. #3 0x00007fe49b2fc739 in heap_vacuum_rel (rel=0x7fe497a8d148, params=0x7fe49c130d7c, bstrategy=0x7fe49c130e10) at /home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:724 buf = { data = 0x7fe49c17e238 "automatic vacuum of table \"contrib_regression_dict_int.pg_catalog.pg_database\": indexscans: 1\npages: 0 removed, 3 remain, 3 scanned (100.00% of total)\ntuples: 49 removed, 53 remain, 9 are dead but no"...,len = 279, maxlen = 1024, cursor = 0} msgfmt = 0x7fe49ba06038 "automatic vacuum of table \"%s.%s.%s\": index scans: %d\n" diff = 0 endtime = 702011687982080 vacrel = 0x7fe49c19b5b8 verbose = false instrument = true ru0 = {tv = {tv_sec = 1648696487, tv_usec = 975963}, ru = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec= 0, tv_usec = 3086}, { --Type <RET> for more, q to quit, c to continue without paging--c ru_maxrss = 10824, __ru_maxrss_word = 10824}, {ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word= 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 449, __ru_minflt_word = 449}, {ru_majflt = 0, __ru_majflt_word= 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, __ru_inblock_word = 0}, {ru_oublock = 0, __ru_oublock_word= 0}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 0, __ru_nsignals_word= 0}, {ru_nvcsw = 2, __ru_nvcsw_word = 2}, {ru_nivcsw = 0, __ru_nivcsw_word = 0}}} starttime = 702011687975964 walusage_start = {wal_records = 0, wal_fpi = 0, wal_bytes = 0} walusage = {wal_records = 11, wal_fpi = 7, wal_bytes = 30847} secs = 0 usecs = 6116 read_rate = 16.606033355134073 write_rate = 7.6643230869849575 aggressive = false skipwithvm = true frozenxid_updated = true minmulti_updated = true orig_rel_pages = 3 new_rel_pages = 3 new_rel_allvisible = 0 indnames = 0x7fe49c19bb28 errcallback = {previous = 0x0, callback = 0x7fe49b3012fd <vacuum_error_callback>, arg = 0x7fe49c19b5b8} startreadtime = 180 startwritetime = 0 OldestXmin = 67552 FreezeLimit = 4245034848 OldestMxact = 224 MultiXactCutoff = 4289967520 __func__ = "heap_vacuum_rel" #4 0x00007fe49b523d92 in table_relation_vacuum (rel=0x7fe497a8d148, params=0x7fe49c130d7c, bstrategy=0x7fe49c130e10) 
at/home/andres/src/postgresql/src/include/access/tableam.h:1680 No locals. #5 0x00007fe49b527032 in vacuum_rel (relid=1262, relation=0x7fe49c1ae360, params=0x7fe49c130d7c) at /home/andres/src/postgresql/src/backend/commands/vacuum.c:2065 lmode = 4 rel = 0x7fe497a8d148 lockrelid = {relId = 1262, dbId = 0} toast_relid = 0 save_userid = 10 save_sec_context = 0 save_nestlevel = 2 __func__ = "vacuum_rel" #6 0x00007fe49b524c3b in vacuum (relations=0x7fe49c1b03a8, params=0x7fe49c130d7c, bstrategy=0x7fe49c130e10, isTopLevel=true)at /home/andres/src/postgresql/src/backend/commands/vacuum.c:482 vrel = 0x7fe49c1ae3b8 cur__state = {l = 0x7fe49c1b03a8, i = 0} cur = 0x7fe49c1b03c0 _save_exception_stack = 0x7fff97e35a10 _save_context_stack = 0x0 _local_sigjmp_buf = {{__jmpbuf = {140735741652128, 6126579318940970843, 9223372036854775747, 0, 0, 0, 6126579318957748059,6139499258682879835}, __mask_was_saved = 0, __saved_mask = {__val = {32, 140619848279000, 8590910454,140619848278592, 32, 140619848278944, 7784, 140619848278592, 140619848278816, 140735741647200, 140619839915137,8458711686435861857, 32, 4869, 140619848278592, 140619848279024}}}} _do_rethrow = false in_vacuum = true stmttype = 0x7fe49baff1a7 "VACUUM" in_outer_xact = false use_own_xacts = true __func__ = "vacuum" #7 0x00007fe49b6d483d in autovacuum_do_vac_analyze (tab=0x7fe49c130d78, bstrategy=0x7fe49c130e10) at /home/andres/src/postgresql/src/backend/postmaster/autovacuum.c:3247 rangevar = 0x7fe49c1ae360 rel = 0x7fe49c1ae3b8 rel_list = 0x7fe49c1ae3f0 #8 0x00007fe49b6d34bc in do_autovacuum () at /home/andres/src/postgresql/src/backend/postmaster/autovacuum.c:2495 _save_exception_stack = 0x7fff97e35d70 _save_context_stack = 0x0 _local_sigjmp_buf = {{__jmpbuf = {140735741652128, 6126579318779490139, 9223372036854775747, 0, 0, 0, 6126579319014371163,6139499700101525339}, __mask_was_saved = 0, __saved_mask = {__val = {140619840139982, 140735741647712,140619841923928, 957, 140619847223443, 140735741647656, 140619847312112, 140619847223451, 140619847223443,140619847224399, 0, 139637976727552, 140619817480714, 140735741647616, 140619839856340, 1024}}}} _do_rethrow = false tab = 0x7fe49c130d78 skipit = false stdVacuumCostDelay = 0 stdVacuumCostLimit = 200 iter = {cur = 0x7fe497668da0, end = 0x7fe497668da0} relid = 1262 classTup = 0x7fe497a6c568 isshared = true cell__state = {l = 0x7fe49c130d40, i = 0} classRel = 0x7fe497a5ae18 tuple = 0x0 relScan = 0x7fe49c130928 dbForm = 0x7fe497a64fb8 table_oids = 0x7fe49c130d40 orphan_oids = 0x0 ctl = {num_partitions = 0, ssize = 0, dsize = 1296236544, max_dsize = 140619847224424, keysize = 4, entrysize = 96,hash = 0x0, match = 0x0, keycopy = 0x0, alloc = 0x0, hcxt = 0x7fff97e35c50, hctl = 0x7fe49b9a787e <AllocSetFree+670>} table_toast_map = 0x7fe49c19d2f0 cell = 0x7fe49c130d58 shared = 0x7fe49c17c360 dbentry = 0x7fe49c18d7a0 bstrategy = 0x7fe49c130e10 key = {sk_flags = 0, sk_attno = 17, sk_strategy = 3, sk_subtype = 0, sk_collation = 950, sk_func = {fn_addr = 0x7fe49b809a6a<chareq>, fn_oid = 61, fn_nargs = 2, fn_strict = true, fn_retset = false, fn_stats = 2 '\002', fn_extra = 0x0,fn_mcxt = 0x7fe49c12f7f0, fn_expr = 0x0}, sk_argument = 116} pg_class_desc = 0x7fe49c12f910 effective_multixact_freeze_max_age = 400000000 did_vacuum = false found_concurrent_worker = false i = 32740 __func__ = "do_autovacuum" #9 0x00007fe49b6d21c4 in AutoVacWorkerMain (argc=0, argv=0x0) at /home/andres/src/postgresql/src/backend/postmaster/autovacuum.c:1719 dbname = 
"contrib_regression_dict_int\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000" local_sigjmp_buf = {{__jmpbuf = {140735741652128, 6126579318890639195, 9223372036854775747, 0, 0, 0, 6126579318785781595,6139499699353759579}, __mask_was_saved = 1, __saved_mask = {__val = {18446744066192964099, 8, 140735741648416,140735741648352, 3156423108750738944, 0, 30, 140735741647888, 140619835812981, 140735741648080, 32666874400,140735741648448, 140619836964693, 140735741652128, 2586778441, 140735741648448}}}} dbid = 205328 __func__ = "AutoVacWorkerMain" #10 0x00007fe49b6d1d5b in StartAutoVacWorker () at /home/andres/src/postgresql/src/backend/postmaster/autovacuum.c:1504 worker_pid = 0 __func__ = "StartAutoVacWorker" #11 0x00007fe49b6e79af in StartAutovacuumWorker () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5635 bn = 0x7fe49c0da920 __func__ = "StartAutovacuumWorker" #12 0x00007fe49b6e745d in sigusr1_handler (postgres_signal_arg=10) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5340 save_errno = 4 __func__ = "sigusr1_handler" #13 <signal handler called> No locals. #14 0x00007fe49a3a9fc4 in __GI___select (nfds=8, readfds=0x7fff97e36c20, writefds=0x0, exceptfds=0x0, timeout=0x7fff97e36ca0)at ../sysdeps/unix/sysv/linux/select.c:71 sc_ret = -4 sc_ret = <optimized out> s = <optimized out> us = <optimized out> ns = <optimized out> ts64 = {tv_sec = 59, tv_nsec = 765565741} pts64 = <optimized out> r = <optimized out> #15 0x00007fe49b6e26c7 in ServerLoop () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1765 timeout = {tv_sec = 60, tv_usec = 0} rmask = {fds_bits = {224, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}} selres = -1 now = 1648696487 readmask = {fds_bits = {224, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}} nSockets = 8 last_lockfile_recheck_time = 1648696432 last_touch_time = 1648696072 __func__ = "ServerLoop" #16 0x00007fe49b6e2031 in PostmasterMain (argc=55, argv=0x7fe49c0aa2d0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1473 opt = -1 status = 0 userDoption = 0x7fe49c0951d0 "/srv/dev/pgdev-dev/" listen_addr_saved = true i = 64 output_config_variable = 0x0 __func__ = "PostmasterMain" #17 0x00007fe49b5d2808 in main (argc=55, argv=0x7fe49c0aa2d0) at /home/andres/src/postgresql/src/backend/main/main.c:202 do_check_root = true Greetings, Andres Freund
On Wed, Mar 30, 2022 at 8:28 PM Andres Freund <andres@anarazel.de> wrote: > I triggered twice now, but it took a while longer the second time. Great. I wonder if you can get an RR recording... -- Peter Geoghegan
Hi,

On 2022-03-30 20:28:44 -0700, Andres Freund wrote:
> I was able to trigger the crash.
>
> cat ~/tmp/pgbench-createdb.sql
> CREATE DATABASE pgb_:client_id;
> DROP DATABASE pgb_:client_id;
>
> pgbench -n -P1 -c 10 -j10 -T100 -f ~/tmp/pgbench-createdb.sql
>
> while I was also running
>
> for i in $(seq 1 100); do echo iteration $i; make -Otarget -C contrib/ -s installcheck -j48 -s prove_installcheck=true USE_MODULE_DB=1 > /tmp/ci-$i.log 2>&1; done
>
> I triggered twice now, but it took a while longer the second time.

Forgot to say how postgres was started. Via my usual devenv script, which results in:

+ /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres -c hba_file=/home/andres/tmp/pgdev/pg_hba.conf -D /srv/dev/pgdev-dev/ -p 5440 -c shared_buffers=2GB -c wal_level=hot_standby -c max_wal_senders=10 -c track_io_timing=on -c restart_after_crash=false -c max_prepared_transactions=20 -c log_checkpoints=on -c min_wal_size=48MB -c max_wal_size=150GB -c 'cluster_name=dev assert' -c ssl_cert_file=/home/andres/tmp/pgdev/ssl-cert-snakeoil.pem -c ssl_key_file=/home/andres/tmp/pgdev/ssl-cert-snakeoil.key -c 'log_line_prefix=%m [%p][%b][%v:%x][%a] ' -c shared_buffers=16MB -c log_min_messages=debug1 -c log_connections=on -c allow_in_place_tablespaces=1 -c log_autovacuum_min_duration=0 -c log_lock_waits=true -c autovacuum_naptime=10s -c fsync=off

Greetings,

Andres Freund
Hi,

On 2022-03-30 20:35:25 -0700, Peter Geoghegan wrote:
> On Wed, Mar 30, 2022 at 8:28 PM Andres Freund <andres@anarazel.de> wrote:
> > I triggered twice now, but it took a while longer the second time.
>
> Great.
>
> I wonder if you can get an RR recording...

Started it, but looks like it's too slow.

(gdb) p MyProcPid
$1 = 2172500

(gdb) p vacrel->NewRelfrozenXid
$3 = 717
(gdb) p vacrel->relfrozenxid
$4 = 717
(gdb) p OldestXmin
$5 = 5112
(gdb) p aggressive
$6 = false

There was another autovacuum of pg_database 10s before:

2022-03-30 20:35:17.622 PDT [2165344][autovacuum worker][5/3:0][] LOG: automatic vacuum of table "postgres.pg_catalog.pg_database": index scans: 1
pages: 0 removed, 3 remain, 3 scanned (100.00% of total)
tuples: 61 removed, 4 remain, 1 are dead but not yet removable
removable cutoff: 1921, older by 3 xids when operation ended
new relfrozenxid: 717, which is 3 xids ahead of previous value
index scan needed: 3 pages from table (100.00% of total) had 599 dead item identifiers removed
index "pg_database_datname_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_database_oid_index": pages: 4 in total, 0 newly deleted, 0 currently deleted, 0 reusable
I/O timings: read: 0.029 ms, write: 0.034 ms
avg read rate: 134.120 MB/s, avg write rate: 89.413 MB/s
buffer usage: 35 hits, 12 misses, 8 dirtied
WAL usage: 12 records, 5 full page images, 27218 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

The dying backend:

2022-03-30 20:35:27.668 PDT [2172500][autovacuum worker][7/0:0][] DEBUG: autovacuum: processing database "contrib_regression_hstore"
...
2022-03-30 20:35:27.690 PDT [2172500][autovacuum worker][7/674:0][] CONTEXT: while cleaning up index "pg_database_oid_index" of relation "pg_catalog.pg_database"

Greetings,

Andres Freund
On Wed, Mar 30, 2022 at 9:04 PM Andres Freund <andres@anarazel.de> wrote: > (gdb) p vacrel->NewRelfrozenXid > $3 = 717 > (gdb) p vacrel->relfrozenxid > $4 = 717 > (gdb) p OldestXmin > $5 = 5112 > (gdb) p aggressive > $6 = false Does this OldestXmin seem reasonable at this point in execution, based on context? Does it look too high? Something else? -- Peter Geoghegan
Hi,

On 2022-03-30 21:04:07 -0700, Andres Freund wrote:
> On 2022-03-30 20:35:25 -0700, Peter Geoghegan wrote:
> > On Wed, Mar 30, 2022 at 8:28 PM Andres Freund <andres@anarazel.de> wrote:
> > > I triggered twice now, but it took a while longer the second time.
> >
> > Great.
> >
> > I wonder if you can get an RR recording...
>
> Started it, but looks like it's too slow.
>
> (gdb) p MyProcPid
> $1 = 2172500
>
> (gdb) p vacrel->NewRelfrozenXid
> $3 = 717
> (gdb) p vacrel->relfrozenxid
> $4 = 717
> (gdb) p OldestXmin
> $5 = 5112
> (gdb) p aggressive
> $6 = false

I added a bunch of debug elogs to see what sets *frozenxid_updated to true.

(gdb) p *vacrel
$1 = {rel = 0x7fe24f3e0148, indrels = 0x7fe255c17ef8, nindexes = 2, aggressive = false, skipwithvm = true, failsafe_active = false, consider_bypass_optimization = true, do_index_vacuuming = true, do_index_cleanup = true, do_rel_truncate = true, bstrategy = 0x7fe255bb0e28, pvs = 0x0, relfrozenxid = 717, relminmxid = 6, old_live_tuples = 42, OldestXmin = 20751, vistest = 0x7fe255058970 <GlobalVisSharedRels>, FreezeLimit = 4244988047, MultiXactCutoff = 4289967302, NewRelfrozenXid = 717, NewRelminMxid = 6, skippedallvis = false, relnamespace = 0x7fe255c17bf8 "pg_catalog", relname = 0x7fe255c17cb8 "pg_database", indname = 0x0, blkno = 4294967295, offnum = 0, phase = VACUUM_ERRCB_PHASE_SCAN_HEAP, verbose = false, dead_items = 0x7fe255c131d0, rel_pages = 8, scanned_pages = 8, removed_pages = 0, lpdead_item_pages = 0, missed_dead_pages = 0, nonempty_pages = 8, new_rel_tuples = 124, new_live_tuples = 42, indstats = 0x7fe255c18320, num_index_scans = 0, tuples_deleted = 0, lpdead_items = 0, live_tuples = 42, recently_dead_tuples = 82, missed_dead_tuples = 0}

But the debug elog reports that

relfrozenxid updated 714 -> 717
relminmxid updated 1 -> 6

The problem is that the crashing backend reads the relfrozenxid/relminmxid from the shared relcache init file written by another backend:

2022-03-30 21:10:47.626 PDT [2625038][autovacuum worker][6/433:0][] LOG: automatic vacuum of table "contrib_regression_postgres_fdw.pg_catalog.pg_database": index scans: 1
pages: 0 removed, 8 remain, 8 scanned (100.00% of total)
tuples: 4 removed, 114 remain, 72 are dead but not yet removable
removable cutoff: 20751, older by 596 xids when operation ended
new relfrozenxid: 717, which is 3 xids ahead of previous value
new relminmxid: 6, which is 5 mxids ahead of previous value
index scan needed: 3 pages from table (37.50% of total) had 8 dead item identifiers removed
index "pg_database_datname_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_database_oid_index": pages: 6 in total, 0 newly deleted, 2 currently deleted, 2 reusable
I/O timings: read: 0.050 ms, write: 0.102 ms
avg read rate: 209.860 MB/s, avg write rate: 76.313 MB/s
buffer usage: 42 hits, 22 misses, 8 dirtied
WAL usage: 13 records, 5 full page images, 33950 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
...
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][:0][] DEBUG: InitPostgres
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/0:0][] DEBUG: my backend ID is 6
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/0:0][] LOG: reading shared init file
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/443:0][] DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/sub>
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/443:0][] LOG: reading non-shared init file

This is basically the inverse of a54e1f15 - we read a *newer* horizon. That's normally fairly harmless - I think.

Perhaps we should just fetch the horizons from the "local" catalog for shared rels?

Greetings,

Andres Freund
Hi,

On 2022-03-30 21:11:48 -0700, Peter Geoghegan wrote:
> On Wed, Mar 30, 2022 at 9:04 PM Andres Freund <andres@anarazel.de> wrote:
> > (gdb) p vacrel->NewRelfrozenXid
> > $3 = 717
> > (gdb) p vacrel->relfrozenxid
> > $4 = 717
> > (gdb) p OldestXmin
> > $5 = 5112
> > (gdb) p aggressive
> > $6 = false
>
> Does this OldestXmin seem reasonable at this point in execution, based
> on context? Does it look too high? Something else?

Reasonable:

(gdb) p *ShmemVariableCache
$1 = {nextOid = 78969, oidCount = 2951, nextXid = {value = 21411}, oldestXid = 714, xidVacLimit = 200000714, xidWarnLimit = 2107484361, xidStopLimit = 2144484361, xidWrapLimit = 2147484361, oldestXidDB = 1, oldestCommitTsXid = 0, newestCommitTsXid = 0, latestCompletedXid = {value = 21408}, xactCompletionCount = 1635, oldestClogXid = 714}

I think the explanation I just sent explains the problem, without "in-memory" confusion about what's running and what's not.

Greetings,

Andres Freund
On Wed, Mar 30, 2022 at 9:20 PM Andres Freund <andres@anarazel.de> wrote:
> But the debug elog reports that
>
> relfrozenxid updated 714 -> 717
> relminmxid updated 1 -> 6
>
> The problem is that the crashing backend reads the relfrozenxid/relminmxid
> from the shared relcache init file written by another backend:

We should have added logging of relfrozenxid and relminmxid a long time ago.

> This is basically the inverse of a54e1f15 - we read a *newer* horizon. That's
> normally fairly harmless - I think.

Is this one pretty old?

> Perhaps we should just fetch the horizons from the "local" catalog for shared
> rels?

Not sure what you mean.
--
Peter Geoghegan
On Wed, Mar 30, 2022 at 9:29 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Perhaps we should just fetch the horizons from the "local" catalog for shared > > rels? > > Not sure what you mean. Wait, you mean use vacrel->relfrozenxid directly? Seems kind of ugly... -- Peter Geoghegan
Hi,

On 2022-03-30 21:29:16 -0700, Peter Geoghegan wrote:
> On Wed, Mar 30, 2022 at 9:20 PM Andres Freund <andres@anarazel.de> wrote:
> > But the debug elog reports that
> >
> > relfrozenxid updated 714 -> 717
> > relminmxid updated 1 -> 6
> >
> > The problem is that the crashing backend reads the relfrozenxid/relminmxid
> > from the shared relcache init file written by another backend:
>
> We should have added logging of relfrozenxid and relminmxid a long time ago.

At least at DEBUG1 or such.

> > This is basically the inverse of a54e1f15 - we read a *newer* horizon. That's
> > normally fairly harmless - I think.
>
> Is this one pretty old?

What do you mean by "this one"? The cause for the assert failure?

I'm not sure there's a proper bug on HEAD here. I think at worst it can delay the horizon increasing a bunch, by falsely not using an aggressive vacuum when we should have - might even be limited to a single autovacuum cycle.

> > Perhaps we should just fetch the horizons from the "local" catalog for shared
> > rels?
>
> Not sure what you mean.

Basically, instead of relying on the relcache, which for shared relations is vulnerable to seeing "too new" horizons due to the shared relcache init file, explicitly load relfrozenxid / relminmxid from the catalog / syscache.

I.e. fetch the relevant pg_class row in heap_vacuum_rel() (using SearchSysCache[Copy1](RELID)). And use that to set vacrel->relfrozenxid etc. Whereas right now we only fetch the pg_class row in vac_update_relstats(), but use the relcache before.

Greetings,

Andres Freund
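A minimal sketch of what that could look like near the top of heap_vacuum_rel() (illustrative only -- the exact placement and the error message wording are assumptions, not a tested patch):

    HeapTuple   ctup;
    Form_pg_class pgcform;

    /*
     * Read relfrozenxid/relminmxid for the target rel from its pg_class row
     * via the syscache, rather than trusting rel->rd_rel, which for a shared
     * catalog may have come from another database's shared relcache init
     * file.
     */
    ctup = SearchSysCache1(RELOID, ObjectIdGetDatum(RelationGetRelid(rel)));
    if (!HeapTupleIsValid(ctup))
        elog(ERROR, "pg_class entry for relid %u vanished during vacuuming",
             RelationGetRelid(rel));
    pgcform = (Form_pg_class) GETSTRUCT(ctup);

    vacrel->relfrozenxid = pgcform->relfrozenxid;
    vacrel->relminmxid = pgcform->relminmxid;

    ReleaseSysCache(ctup);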
Hi,

On 2022-03-30 21:59:15 -0700, Andres Freund wrote:
> On 2022-03-30 21:29:16 -0700, Peter Geoghegan wrote:
> > On Wed, Mar 30, 2022 at 9:20 PM Andres Freund <andres@anarazel.de> wrote:
> > > Perhaps we should just fetch the horizons from the "local" catalog for shared
> > > rels?
> >
> > Not sure what you mean.
>
> Basically, instead of relying on the relcache, which for shared relations is
> vulnerable to seeing "too new" horizons due to the shared relcache init file,
> explicitly load relfrozenxid / relminmxid from the catalog / syscache.
>
> I.e. fetch the relevant pg_class row in heap_vacuum_rel() (using
> SearchSysCache[Copy1](RELID)). And use that to set vacrel->relfrozenxid
> etc. Whereas right now we only fetch the pg_class row in
> vac_update_relstats(), but use the relcache before.

Perhaps we should explicitly mask out parts of relcache entries in the shared init file that we know to be unreliable. I.e. set relfrozenxid, relminmxid to Invalid* or such.

I even wonder if we should just generally move those out of the fields we have in the relcache, not just for shared rels loaded from the shared init file. Presumably by just moving them into the CATALOG_VARLEN ifdef. The only place that appears to access rd_rel->relfrozenxid outside of DDL is heap_abort_speculative().

Greetings,

Andres Freund
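The masking idea could be as small as the following, somewhere in the code that restores entries from the *shared* relcache init file (a hand-wavy sketch; the placement, and the assumption that a per-entry rel pointer and a shared flag are in scope, are mine):

    /*
     * Don't trust horizons restored from the shared relcache init file:
     * another database's VACUUM may have advanced them after the file was
     * written.  Clear them so nothing relies on them by accident.
     */
    if (shared)
    {
        rel->rd_rel->relfrozenxid = InvalidTransactionId;
        rel->rd_rel->relminmxid = InvalidMultiXactId;
    }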
On Thu, Mar 31, 2022 at 9:37 AM Andres Freund <andres@anarazel.de> wrote: > Perhaps we should explicitly mask out parts of relcache entries in the shared > init file that we know to be unreliable. I.e. set relfrozenxid, relminmxid to > Invalid* or such. That has the advantage of being more honest. If you're going to break the abstraction, then it seems best to break it in an obvious way, that leaves no doubts about what you're supposed to be relying on. This bug doesn't seem like the kind of thing that should be left as-is. If only because it makes it hard to add something like a WARNING when we make relfrozenxid go backwards (on the basis of the existing value apparently being in the future), which we really should have been doing all along. The whole reason why we overwrite pg_class.relfrozenxid values from the future is to ameliorate the effects of more serious bugs like the pg_upgrade/pg_resetwal one fixed in commit 74cf7d46 not so long ago (mid last year). We had essentially the same pg_upgrade "from the future" bug twice (once for relminmxid in the MultiXact bug era, another more recent version affecting relfrozenxid). > The only place that appears to access rd_rel->relfrozenxid outside of DDL is > heap_abort_speculative(). I wonder how necessary that really is. Even if the XID is before relfrozenxid, does that in itself really make it "in the future"? Obviously it's often necessary to make the assumption that allowing wraparound amounts to allowing XIDs "from the future" to exist, which is dangerous. But why here? Won't pruning by VACUUM eventually correct the issue anyway? -- Peter Geoghegan
Hi,

On 2022-03-31 09:58:18 -0700, Peter Geoghegan wrote:
> On Thu, Mar 31, 2022 at 9:37 AM Andres Freund <andres@anarazel.de> wrote:
> > The only place that appears to access rd_rel->relfrozenxid outside of DDL is
> > heap_abort_speculative().
>
> I wonder how necessary that really is. Even if the XID is before
> relfrozenxid, does that in itself really make it "in the future"?
> Obviously it's often necessary to make the assumption that allowing
> wraparound amounts to allowing XIDs "from the future" to exist, which
> is dangerous. But why here? Won't pruning by VACUUM eventually correct
> the issue anyway?

I don't think we should weaken defenses against xids from before relfrozenxid in vacuum / amcheck / .... If anything we should strengthen them.

Isn't it also just plainly required for correctness? We'd not necessarily trigger a vacuum in time to remove the xid before approaching wraparound if we put in an xid before relfrozenxid? That happening in prune_xid is obviously less bad than on actual data, but still.

ISTM we should just use our own xid. Yes, it might delay cleanup a bit longer. But unless there's already crud on the page (with prune_xid already set), the abort of the speculative insertion isn't likely to make much of a difference?

Greetings,

Andres Freund
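For concreteness, the "use our own xid" idea would seemingly boil down to a one-liner in heap_abort_speculative(), along these lines (a sketch only; the surrounding code is paraphrased rather than quoted):

    /*
     * The aborted tuple is dead, so hint that the page is prunable.  Use our
     * own XID for the hint instead of consulting rd_rel->relfrozenxid: we
     * inserted the tuple ourselves, so our XID is certainly not older than
     * the rel's relfrozenxid.
     */
    PageSetPrunable(page, GetCurrentTransactionId());

That would also remove the one non-DDL reader of rd_rel->relfrozenxid, at the cost of a slightly later prune hint.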
On Wed, Mar 30, 2022 at 9:59 PM Andres Freund <andres@anarazel.de> wrote: > I'm not sure there's a proper bug on HEAD here. I think at worst it can delay > the horizon increasing a bunch, by falsely not using an aggressive vacuum when > we should have - might even be limited to a single autovacuum cycle. So, to be clear: vac_update_relstats() never actually considered the new relfrozenxid value from its vacuumlazy.c caller to be "in the future"? It just looked that way to the failing assertion in vacuumlazy.c, because its own version of the original relfrozenxid was stale from the beginning? And so the worst problem is probably just that we don't use aggressive VACUUM when we really should in rare cases? -- Peter Geoghegan
On Thu, Mar 31, 2022 at 10:11 AM Andres Freund <andres@anarazel.de> wrote:
> I don't think we should weaken defenses against xids from before relfrozenxid
> in vacuum / amcheck / .... If anything we should strengthen them.
>
> Isn't it also just plainly required for correctness? We'd not necessarily
> trigger a vacuum in time to remove the xid before approaching wraparound if we
> put in an xid before relfrozenxid? That happening in prune_xid is obviously
> less bad than on actual data, but still.

Yeah, you're right. Ambiguity about stuff like this should be avoided on general principle.

> ISTM we should just use our own xid. Yes, it might delay cleanup a bit
> longer. But unless there's already crud on the page (with prune_xid already
> set), the abort of the speculative insertion isn't likely to make much of a
> difference?

Speculative insertion abort is pretty rare in the real world, I bet. The speculative insertion precheck is very likely to succeed almost every time with real workloads.
--
Peter Geoghegan
Hi, On 2022-03-31 10:12:49 -0700, Peter Geoghegan wrote: > On Wed, Mar 30, 2022 at 9:59 PM Andres Freund <andres@anarazel.de> wrote: > > I'm not sure there's a proper bug on HEAD here. I think at worst it can delay > > the horizon increasing a bunch, by falsely not using an aggressive vacuum when > > we should have - might even be limited to a single autovacuum cycle. > > So, to be clear: vac_update_relstats() never actually considered the > new relfrozenxid value from its vacuumlazy.c caller to be "in the > future"? No, I added separate debug messages for those, and also applied your patch, and it didn't trigger. I don't immediately see how we could end up computing a frozenxid value that would be problematic? The pgcform->relfrozenxid value will always be the "local" value, which afaics can be behind the other database's value (and thus behind the value from the relcache init file). But it can't be ahead, we have the proper invalidations for that (I think). I do think we should apply a version of the warnings you have (with a WARNING instead of PANIC obviously). I think it's bordering on insanity that we have so many paths to just silently fix stuff up around vacuum. It's like we want things to be undebuggable, and to give users no warnings about something being up. > It just looked that way to the failing assertion in > vacuumlazy.c, because its own version of the original relfrozenxid was > stale from the beginning? And so the worst problem is probably just > that we don't use aggressive VACUUM when we really should in rare > cases? Yes, I think that's right. Can you repro the issue with my recipe? FWIW, adding log_min_messages=debug5 and fsync=off made the crash trigger more quickly. Greetings, Andres Freund
On Thu, Mar 31, 2022 at 10:50 AM Andres Freund <andres@anarazel.de> wrote: > > So, to be clear: vac_update_relstats() never actually considered the > > new relfrozenxid value from its vacuumlazy.c caller to be "in the > > future"? > > No, I added separate debug messages for those, and also applied your patch, > and it didn't trigger. The assert is "Assert(diff > 0)", and not "Assert(diff >= 0)". Plus the other related assert I mentioned did not trigger. So when this "diff" assert did trigger, the value of "diff" must have been 0 (not a negative value). While this state does technically indicate that the "existing" relfrozenxid value (actually a stale version) appears to be "in the future" (because the OldestXmin XID might still never have been allocated), it won't ever be in the future according to vac_update_relstats() (even if it used that version). I suppose that I might be wrong about that, somehow -- anything is possible. The important point is that there is currently no evidence that this bug (or any very recent bug) could ever allow vac_update_relstats() to actually believe that it needs to update relfrozenxid/relminmxid, purely because the existing value is in the future. The fact that vac_update_relstats() doesn't log/warn when this happens is very unfortunate, but there is nevertheless no evidence that that would have informed us of any bug on HEAD, even including the actual bug here, which is a bug in vacuumlazy.c (not in vac_update_relstats). > I do think we should apply a version of the warnings you have (with a WARNING > instead of PANIC obviously). I think it's bordering on insanity that we have > so many paths to just silently fix stuff up around vacuum. It's like we want > things to be undebuggable, and to give users no warnings about something being > up. Yeah, it's just totally self defeating to not at least log it. I mean this is a code path that is only hit once per VACUUM, so there is practically no risk of that causing any new problems. > Can you repro the issue with my recipe? FWIW, adding log_min_messages=debug5 > and fsync=off made the crash trigger more quickly. I'll try to do that today. I'm not feeling the most energetic right now, to be honest. -- Peter Geoghegan
On Thu, Mar 31, 2022 at 11:19 AM Peter Geoghegan <pg@bowt.ie> wrote:
> The assert is "Assert(diff > 0)", and not "Assert(diff >= 0)".

Attached is v15. I plan to commit the first two patches (the most substantial two patches by far) in the next couple of days, barring objections.

v15 removes this "Assert(diff > 0)" assertion from 0001. It's not adding any value, now that the underlying issue that it accidentally brought to light is well understood (there are still more robust assertions covering the relfrozenxid/relminmxid invariants). "Assert(diff > 0)" is liable to fail until the underlying bug on HEAD is fixed, which can be treated as separate work.

I also refined the WARNING patch in v15. It now actually issues WARNINGs (rather than PANICs, which were just a temporary debugging measure in v14). Also fixed a compiler warning in this patch, based on a complaint from CFBot's CompilerWarnings task. I can delay committing this WARNING patch until right before feature freeze. Seems best to give others more opportunity for comments.
--
Peter Geoghegan
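For reference, the check that the WARNING patch adds to vac_update_relstats() presumably has roughly this shape (a reconstruction, not the actual patch text; the message wording is approximate):

    /* the existing relfrozenxid should never appear to be in the future */
    if (TransactionIdIsNormal(frozenxid) &&
        TransactionIdPrecedes(ReadNextTransactionId(), pgcform->relfrozenxid))
        ereport(WARNING,
                (errmsg_internal("overwriting invalid relfrozenxid %u of relation \"%s\" with new value %u",
                                 pgcform->relfrozenxid,
                                 RelationGetRelationName(relation),
                                 frozenxid)));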
Hi, On 2022-04-01 10:54:14 -0700, Peter Geoghegan wrote: > On Thu, Mar 31, 2022 at 11:19 AM Peter Geoghegan <pg@bowt.ie> wrote: > > The assert is "Assert(diff > 0)", and not "Assert(diff >= 0)". > > Attached is v15. I plan to commit the first two patches (the most > substantial two patches by far) in the next couple of days, barring > objections. Just saw that you committed: Wee! I think this will be a substantial improvement for our users. While I was writing the above I, again, realized that it'd be awfully nice to have some accumulated stats about (auto-)vacuum's effectiveness. For us to get feedback about improvements more easily and for users to know what aspects they need to tune. Knowing how many times a table was vacuumed doesn't really tell that much, and requiring to enable log_autovacuum_min_duration and then aggregating those results is pretty painful (and version dependent). If we just collected something like: - number of heap passes - time spent heap vacuuming - number of index scans - time spent index vacuuming - time spent delaying - percentage of non-yet-removable vs removable tuples it'd start to be a heck of a lot easier to judge how well autovacuum is coping. If we tracked the related pieces above in the index stats (or perhaps additionally there), it'd also make it easier to judge the cost of different indexes. - Andres
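As a strawman, the accumulated counters being described might look something like this (every name here is invented for illustration; this is not part of any existing or proposed pgstat struct):

typedef struct VacuumActivityCounters
{
    int64       heap_vacuum_passes;     /* passes over the heap removing LP_DEAD items */
    int64       index_scans;            /* index vacuuming cycles */
    double      heap_vacuum_ms;         /* time spent vacuuming heap pages */
    double      index_vacuum_ms;        /* time spent vacuuming indexes */
    double      delay_ms;               /* time spent sleeping in cost-based delays */
    int64       tuples_removed;         /* removable dead tuples */
    int64       tuples_not_removable;   /* dead but not-yet-removable tuples */
} VacuumActivityCounters;

The removable vs. not-yet-removable percentage falls out of the last two counters, and dividing the accumulated times by the number of vacuums gives a rough per-run profile.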
On Sun, Apr 3, 2022 at 12:05 PM Andres Freund <andres@anarazel.de> wrote: > Just saw that you committed: Wee! I think this will be a substantial > improvement for our users. I hope so! I think that it's much more useful as the basis for future work than as a standalone thing. Users of Postgres 15 might not notice a huge difference. But it opens up a lot of new directions to take VACUUM in. I would like to get rid of anti-wraparound VACUUMs and aggressive VACUUMs in Postgres 16. This isn't as radical as it sounds. It seems quite possible to find a way for *every* VACUUM to become aggressive progressively and dynamically. We'll still need to have autovacuum.c know about wraparound, but it should just be just another threshold, not fundamentally different to the other thresholds (except that it's still used when autovacuum is nominally disabled). The behavior around autovacuum cancellations is probably still going to be necessary when age(relfrozenxid) gets too high, but it shouldn't be conditioned on what age(relfrozenxid) *used to be*, when the autovacuum started. That could have been a long time ago. It should be based on what's happening *right now*. > While I was writing the above I, again, realized that it'd be awfully nice to > have some accumulated stats about (auto-)vacuum's effectiveness. For us to get > feedback about improvements more easily and for users to know what aspects > they need to tune. Strongly agree. And I'm excited about the potential of the shared memory stats patch to enable more thorough instrumentation, which allows us to improve things with feedback that we just can't get right now. VACUUM is still too complicated -- that makes this kind of analysis much harder, even for experts. You need more continuous behavior to get value from this kind of analysis. There are too many things that might end up mattering, that really shouldn't ever matter. Too much potential for strange illogical discontinuities in performance over time. Having only one type of VACUUM (excluding VACUUM FULL) will be much easier for users to reason about. But I also think that it'll be much easier for us to reason about. For example, better autovacuum scheduling will be made much easier if autovacuum.c can just assume that every VACUUM operation will do the same amount of work. (Another problem with the scheduling is that it uses ANALYZE statistics (sampling) in a way that just doesn't make any sense for something like VACUUM, which is an inherently dynamic and cyclic process.) None of this stuff has to rely on my patch for freezing. We don't necessarily have to make every VACUUM advance relfrozenxid to do all this. The important point is that we definitely shouldn't be putting off *all* freezing of all-visible pages in non-aggressive VACUUMs (or in VACUUMs that are not expected to advance relfrozenxid). Even a very conservative implementation could achieve all this; we need only spread out the burden of freezing all-visible pages over time, across multiple VACUUM operations. Make the behavior continuous. > Knowing how many times a table was vacuumed doesn't really tell that much, and > requiring to enable log_autovacuum_min_duration and then aggregating those > results is pretty painful (and version dependent). Yeah. Ideally we could avoid making the output of log_autovacuum_min_duration into an API, by having a real API instead. The output probably needs to evolve some more. A lot of very basic information wasn't there until recently. 
> If we just collected something like: > - number of heap passes > - time spent heap vacuuming > - number of index scans > - time spent index vacuuming > - time spent delaying You forgot FPIs. > - percentage of non-yet-removable vs removable tuples I think that we should address this directly too. By "taking a snapshot of the visibility map", so we at least don't scan/vacuum heap pages that don't really need it. This is also valuable because it makes slowing down VACUUM (maybe slowing it down a lot) have fewer downsides. At least we'll have "locked in" our scanned_pages, which we can figure out in full before we really scan even one page. > it'd start to be a heck of a lot easier to judge how well autovacuum is > coping. What about the potential of the shared memory stats stuff to totally replace the use of ANALYZE stats in autovacuum.c? Possibly with help from vacuumlazy.c, and the visibility map? I see a lot of potential for exploiting the visibility map more, both within vacuumlazy.c itself, and for autovacuum.c scheduling [1]. I'd probably start with the scheduling stuff, and only then work out how to show users more actionable information. [1] https://postgr.es/m/CAH2-Wzkt9Ey9NNm7q9nSaw5jdBjVsAq3yvb4UT4M93UaJVd_xg@mail.gmail.com -- Peter Geoghegan
On Fri, Apr 1, 2022 at 10:54 AM Peter Geoghegan <pg@bowt.ie> wrote: > I also refined the WARNING patch in v15. It now actually issues > WARNINGs (rather than PANICs, which were just a temporary debugging > measure in v14). Going to commit this remaining patch tomorrow, barring objections. -- Peter Geoghegan
Hi,

On 2022-04-04 19:32:13 -0700, Peter Geoghegan wrote:
> On Fri, Apr 1, 2022 at 10:54 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > I also refined the WARNING patch in v15. It now actually issues
> > WARNINGs (rather than PANICs, which were just a temporary debugging
> > measure in v14).
>
> Going to commit this remaining patch tomorrow, barring objections.

The remaining patch is the warnings in vac_update_relstats(), correct? I guess one could argue they should be LOG rather than WARNING, but I find the project stance on that pretty impractical. So warning's ok with me.

Not sure why you used errmsg_internal()? Otherwise LGTM.

Greetings,

Andres Freund
On Mon, Apr 4, 2022 at 8:18 PM Andres Freund <andres@anarazel.de> wrote:
> The remaining patch is the warnings in vac_update_relstats(), correct? I
> guess one could argue they should be LOG rather than WARNING, but I find the
> project stance on that pretty impractical. So warning's ok with me.

Right. The reason I used WARNINGs was because it matches vaguely related WARNINGs in vac_update_relstats()'s sibling function, vacuum_set_xid_limits().

> Not sure why you used errmsg_internal()?

The usual reason for using errmsg_internal(), I suppose. I tend to do that with corruption-related messages on the grounds that they're usually highly obscure issues that are (by definition) never supposed to happen. The only thing that a user can be expected to do with the information from the message is to report it to -bugs, or find some other similar report.
--
Peter Geoghegan
On Mon, Apr 4, 2022 at 8:25 PM Peter Geoghegan <pg@bowt.ie> wrote: > Right. The reason I used WARNINGs was because it matches vaguely > related WARNINGs in vac_update_relstats()'s sibling function, > vacuum_set_xid_limits(). Okay, pushed the relfrozenxid warning patch. Thanks -- Peter Geoghegan
On 4/3/22 12:05 PM, Andres Freund wrote:
> While I was writing the above I, again, realized that it'd be awfully nice to
> have some accumulated stats about (auto-)vacuum's effectiveness. For us to get
> feedback about improvements more easily and for users to know what aspects
> they need to tune.
>
> Knowing how many times a table was vacuumed doesn't really tell that much, and
> requiring to enable log_autovacuum_min_duration and then aggregating those
> results is pretty painful (and version dependent).
>
> If we just collected something like:
> - number of heap passes
> - time spent heap vacuuming
> - number of index scans
> - time spent index vacuuming
> - time spent delaying

The number of passes would let you know if maintenance_work_mem is too small (or that you should stop killing 187M+ tuples in one go). The timing info would give you an idea of the impact of throttling.

> - percentage of non-yet-removable vs removable tuples

This'd give you an idea how bad your long-running-transaction problem is.

Another metric I think would be useful is the average utilization of your autovac workers. No spare workers means you almost certainly have tables that need vacuuming but have to wait. As a single number, it'd also be much easier for users to understand. I'm no stats expert, but one way to handle that cheaply would be to maintain an exponentially weighted mean of the percentage of autovac workers that are in use at the end of each autovac launcher cycle (though that would probably not work great for people that have extreme values for launcher delay, or constantly muck with launcher_delay).

> it'd start to be a heck of a lot easier to judge how well autovacuum is
> coping.
>
> If we tracked the related pieces above in the index stats (or perhaps
> additionally there), it'd also make it easier to judge the cost of different
> indexes.
>
> - Andres
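A sketch of that weighted mean, updated once per launcher cycle (the names and the smoothing constant are invented for illustration):

#define AV_UTIL_SMOOTHING   0.1     /* weight given to the newest sample */

static double av_worker_utilization = 0.0;  /* smoothed busy fraction, 0.0 - 1.0 */

static void
update_av_worker_utilization(int workers_in_use, int max_workers)
{
    double      sample = (double) workers_in_use / max_workers;

    /* exponentially weighted moving average: recent cycles dominate */
    av_worker_utilization += AV_UTIL_SMOOTHING * (sample - av_worker_utilization);
}

A value that stays pinned near 1.0 would be a fairly clear signal that tables are routinely waiting for a free worker.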
On Thu, Apr 14, 2022 at 4:19 PM Jim Nasby <nasbyj@amazon.com> wrote:
> > - percentage of non-yet-removable vs removable tuples
>
> This'd give you an idea how bad your long-running-transaction problem is.

VACUUM fundamentally works by removing those tuples that are considered dead according to an XID-based cutoff established when the operation begins. And so many very long running VACUUM operations will see dead-but-not-removable tuples even when there are absolutely no long running transactions (nor any other VACUUM operations). The only long running thing involved might be our own long running VACUUM operation.

I would like to reduce the number of non-removable dead tuples encountered by VACUUM by "locking in" heap pages that we'd like to scan up front. This would work by having VACUUM create its own local in-memory copy of the visibility map before it even starts scanning heap pages. That way VACUUM won't end up visiting heap pages just because they were concurrently modified halfway through our VACUUM (by some other transactions). We don't really need to scan these pages at all -- they have dead tuples, but not tuples that are "dead to VACUUM".

The key idea here is to remove a big unnatural downside to slowing VACUUM down. The cutoff would almost work like an MVCC snapshot that describes precisely the work VACUUM needs to do (which pages to scan), established up front. Once that's locked in, the amount of work we're required to do cannot go up as we're doing it (or it'll be less of an issue, at least).

It would also help if VACUUM didn't scan pages that it already knows don't have any dead tuples. The current SKIP_PAGES_THRESHOLD rule could easily be improved. That's almost the same problem.
--
Peter Geoghegan
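To make the visibility map "snapshot" idea concrete, a very rough sketch (the helper name and where it would be called from are assumptions, not anything that exists today):

static uint8 *
vm_snapshot_acquire(Relation rel, BlockNumber rel_pages)
{
    uint8      *vmsnap = (uint8 *) palloc0(rel_pages * sizeof(uint8));
    Buffer      vmbuffer = InvalidBuffer;
    BlockNumber blkno;

    /* copy each heap page's VM bits as of the start of this VACUUM */
    for (blkno = 0; blkno < rel_pages; blkno++)
        vmsnap[blkno] = visibilitymap_get_status(rel, blkno, &vmbuffer);

    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);

    return vmsnap;
}

/* later, when deciding whether a given block needs to be scanned at all */
if ((vmsnap[blkno] & VISIBILITYMAP_ALL_VISIBLE) != 0)
    continue;   /* was all-visible when we started; nothing "dead to VACUUM" */

Concurrent clearing of VM bits after the snapshot is taken would no longer add pages to this VACUUM's workload, which is what "locking in" scanned_pages would mean in practice.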