Thread: Fixing a couple of buglets in how VACUUM sets visibility map bits
Attached are two patches, each of which fixes one of two historical buglets in VACUUM's approach to setting bits in the visibility map. (Whether this is actually refactoring work or hardening work is debatable, I suppose.)

The first patch makes sure that the snapshotConflictHorizon cutoff (XID cutoff for recovery conflicts) is never a special XID, unless that XID is InvalidTransactionId, which is interpreted as a snapshotConflictHorizon value that will never need a recovery conflict (per the general convention for snapshotConflictHorizon values explained above ResolveRecoveryConflictWithSnapshot). This patch establishes a hard rule that snapshotConflictHorizon values can never be a special XID value, unless it's InvalidTransactionId. An assertion enforces the rule for us in REDO routines (at the point that they call ResolveRecoveryConflictWithSnapshot with the WAL record's snapshotConflictHorizon XID value).

The second patch makes sure that VACUUM can never set a page all-frozen in the visibility map without also setting the same page all-visible in the same call to visibilitymap_set() -- regardless of what we think we know about the current state of the all-visible bit in the VM. It adjusts one of the visibilitymap_set() calls in vacuumlazy.c that would previously sometimes set a page's all-frozen bit without also setting its all-visible bit, which could allow VACUUM to leave a page all-frozen but not all-visible in the visibility map (since the value of all_visible_according_to_vm can go stale). I think that this should be treated as a basic visibility map invariant: an all-frozen page must also be all-visible, by definition, so why should it be physically possible for the VM to give a contradictory picture of the all_visible/all_frozen status of any one page? Assertions are added that more or less make this rule into an invariant. amcheck/pg_visibility coverage might make sense too, but I haven't done that here.
The second patch also adjusts a later visibilitymap_set() call site (the one used just after heap vacuuming runs in the final heap pass) in roughly the same way. It no longer reads from the visibility map to see what bits need to be changed. The existing approach here seems rather odd. The whole point of calling lazy_vacuum_heap_page() is to set LP_DEAD items referenced by VACUUM's dead_items array to LP_UNUSED -- there has to have been at least one LP_DEAD item on the page for us to end up here (which a Postgres 14-era assertion verifies for us). So we already know perfectly well that the visibility map shouldn't indicate that the page is all-visible yet -- why bother asking the VM? And besides, any call to visibilitymap_set() will only modify the VM when it directly observes that the bits have changed -- so why even attempt to duplicate that on the caller side? It seems to me that the visibilitymap_get_status() call inside lazy_vacuum_heap_page() is actually abused to work as a substitute for visibilitymap_pin(). Why not use top-level visibilitymap_pin() calls instead, just like we do in the first heap pass? That's how it's done in the second patch; it adds a visibilitymap_pin() call in lazy_vacuum_heap_rel()'s blkno-wise loop. That gives us parity between the first and second heap pass, which seems like a clear maintainability win -- everybody can pass the already-pinned/already-setup vmbuffer by value. -- Peter Geoghegan
On Sat, Dec 31, 2022 at 4:53 PM Peter Geoghegan <pg@bowt.ie> wrote: > The first patch makes sure that the snapshotConflictHorizon cutoff > (XID cutoff for recovery conflicts) is never a special XID, unless > that XID is InvalidTransactionId, which is interpreted as a > snapshotConflictHorizon value that will never need a recovery conflict > (per the general convention for snapshotConflictHorizon values > explained above ResolveRecoveryConflictWithSnapshot). Pushed this just now. Attached is another very simple refactoring patch for vacuumlazy.c. It makes vacuumlazy.c save the result of get_database_name() in vacrel, which matches what we already do with things like get_namespace_name(). Would be helpful if I could get a +1 on v1-0002-Never-just-set-the-all-frozen-bit-in-VM.patch, which is somewhat more substantial than the others. -- Peter Geoghegan
On Mon, Jan 2, 2023 at 10:31 AM Peter Geoghegan <pg@bowt.ie> wrote: > Would be helpful if I could get a +1 on > v1-0002-Never-just-set-the-all-frozen-bit-in-VM.patch, which is > somewhat more substantial than the others. There has been no response on this thread for over a full week at this point. I'm CC'ing Robert now, since the bug is from his commit a892234f83. Attached revision of the "don't unset all-visible bit while unsetting all-frozen bit" patch adds some assertions that verify that visibility_cutoff_xid is InvalidTransactionId as expected when we go to set any page all-frozen in the VM. It also broadens an existing nearby test for corruption, which gives us some chance of detecting and repairing corruption of this sort that might have slipped in in the field. My current plan is to commit something like this in another week or so, barring any objections. -- Peter Geoghegan
Hi, On 2023-01-08 14:39:19 -0800, Peter Geoghegan wrote: > One of the calls to visibilitymap_set() during VACUUM's initial heap > pass could unset a page's all-visible bit during the process of setting > the same page's all-frozen bit. How? visibilitymap_set() just adds flags, it doesn't remove any already existing bits: map[mapByte] |= (flags << mapOffset); It'll afaict lead to potentially unnecessary WAL records though, which does seem buggy: if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS)) here we check for *equivalence*, but then below we just or-in flags. So visibilitymap_set() with just one of the flags bits set in the parameters, but both set in the page would end up WAL logging unnecessarily. > @@ -2388,8 +2398,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel) > static void > lazy_vacuum_heap_rel(LVRelState *vacrel) > { > - int index; > - BlockNumber vacuumed_pages; > + int index = 0; > + BlockNumber vacuumed_pages = 0; > Buffer vmbuffer = InvalidBuffer; > LVSavedErrInfo saved_err_info; > > @@ -2406,42 +2416,42 @@ lazy_vacuum_heap_rel(LVRelState *vacrel) > VACUUM_ERRCB_PHASE_VACUUM_HEAP, > InvalidBlockNumber, InvalidOffsetNumber); > > - vacuumed_pages = 0; > - > - index = 0; > @@ -2473,12 +2484,12 @@ lazy_vacuum_heap_rel(LVRelState *vacrel) > */ > static int > lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer, > - int index, Buffer *vmbuffer) > + int index, Buffer vmbuffer) > { > VacDeadItems *dead_items = vacrel->dead_items; > Page page = BufferGetPage(buffer); > OffsetNumber unused[MaxHeapTuplesPerPage]; > - int uncnt = 0; > + int nunused = 0; > TransactionId visibility_cutoff_xid; > bool all_frozen; > LVSavedErrInfo saved_err_info; > @@ -2508,10 +2519,10 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer, > > Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid)); > ItemIdSetUnused(itemid); > - unused[uncnt++] = toff; > + unused[nunused++] = toff; > } > > - Assert(uncnt > 0); > + 
Assert(nunused > 0); > > /* Attempt to truncate line pointer array now */ > PageTruncateLinePointerArray(page); > @@ -2527,13 +2538,13 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer, > xl_heap_vacuum xlrec; > XLogRecPtr recptr; > > - xlrec.nunused = uncnt; > + xlrec.nunused = nunused; > > XLogBeginInsert(); > XLogRegisterData((char *) &xlrec, SizeOfHeapVacuum); > > XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); > - XLogRegisterBufData(0, (char *) unused, uncnt * sizeof(OffsetNumber)); > + XLogRegisterBufData(0, (char *) unused, nunused * sizeof(OffsetNumber)); > > recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VACUUM); > You have plenty of changes like this, which are afaict entirely unrelated to the issue the commit is fixing, in here. It just makes it hard to review the patch. Greetings, Andres Freund
On Sun, Jan 8, 2023 at 3:53 PM Andres Freund <andres@anarazel.de> wrote: > How? See the commit message for the scenario I have in mind, which involves a concurrent HOT update that aborts. We're vulnerable to allowing "all-frozen but not all-visible" inconsistencies because of two factors: this business with not passing VISIBILITYMAP_ALL_VISIBLE along with VISIBILITYMAP_ALL_FROZEN to visibilitymap_set(), *and* the use of all_visible_according_to_vm to set the VM (a local variable that can go stale). We sort of assume that all_visible_according_to_vm cannot go stale here without our detecting it. That's almost always the case, but it's not quite guaranteed. > visibilitymap_set() just adds flags, it doesn't remove any already > existing bits: I know. The concrete scenario I have in mind is very subtle (if the problem was this obvious I'm sure somebody would have noticed it by now, since we do hit this visibilitymap_set() call site reasonably often). A concurrent HOT update will certainly clear all the bits for us, which is enough. > You have plenty of changes like this, which are afaict entirely unrelated to > the issue the commit is fixing, in here. It just makes it hard to review the > patch. I didn't think that it was that big of a deal to tweak the style of one or two details in and around lazy_vacuum_heap_rel() in passing, for consistency with lazy_scan_heap(), since the patch already needs to do some of that. I do take your point, though. -- Peter Geoghegan
On Sun, Jan 8, 2023 at 4:27 PM Peter Geoghegan <pg@bowt.ie> wrote: > We're vulnerable to allowing "all-frozen but not all-visible" > inconsistencies because of two factors: this business with not passing > VISIBILITYMAP_ALL_VISIBLE along with VISIBILITYMAP_ALL_FROZEN to > visibilitymap_set(), *and* the use of all_visible_according_to_vm to > set the VM (a local variable that can go stale). We sort of assume > that all_visible_according_to_vm cannot go stale here without our > detecting it. That's almost always the case, but it's not quite > guaranteed. On further reflection even v2 won't repair the page-level PD_ALL_VISIBLE flag in passing in this scenario. ISTM that on HEAD we might actually leave the all-frozen bit set in the VM, while both the all-visible bit and the page-level PD_ALL_VISIBLE bit remain unset. Again, all due to the approach we take with all_visible_according_to_vm, which can go stale independently of both the VM bit being unset and the PD_ALL_VISIBLE bit being unset (in my example problem scenario). FWIW I don't have this remaining problem in my VACUUM freezing/scanning strategies patchset. It just gets rid of all_visible_according_to_vm altogether, which makes things a lot simpler at the point that we set VM bits at the end of lazy_scan_heap -- there is nothing left that can become stale. Quite a lot of the code is just removed; there is exactly one call to visibilitymap_set() at the end of lazy_scan_heap with the patchset, which does everything we need. The patchset also has logic for setting PD_ALL_VISIBLE when it needs to be set, which isn't (and shouldn't be) conditioned on whether we're doing an "all visible -> all frozen" transition or a "neither -> all visible" transition. What it actually needs to be conditioned on is whether it's unset now, and so needs to be set in passing, as part of setting one or both VM bits -- simple as that. -- Peter Geoghegan
On Sun, Jan 8, 2023 at 6:43 PM Peter Geoghegan <pg@bowt.ie> wrote: > On further reflection even v2 won't repair the page-level > PD_ALL_VISIBLE flag in passing in this scenario. ISTM that on HEAD we > might actually leave the all-frozen bit set in the VM, while both the > all-visible bit and the page-level PD_ALL_VISIBLE bit remain unset. > Again, all due to the approach we take with > all_visible_according_to_vm, which can go stale independently of both > the VM bit being unset and the PD_ALL_VISIBLE bit being unset (in my > example problem scenario). Attached is v3, which explicitly checks the need to set the PD_ALL_VISIBLE flag at the relevant visibilitymap_set() call site. It also has improved comments. -- Peter Geoghegan
Hi, On 2023-01-08 16:27:59 -0800, Peter Geoghegan wrote: > On Sun, Jan 8, 2023 at 3:53 PM Andres Freund <andres@anarazel.de> wrote: > > How? > > See the commit message for the scenario I have in mind, which involves > a concurrent HOT update that aborts. I looked at it. I specifically was wondering about this part of it: > One of the calls to visibilitymap_set() during VACUUM's initial heap > pass could unset a page's all-visible bit during the process of setting > the same page's all-frozen bit. Which I just don't see as possible, due to visibilitymap_set() simply never unsetting bits. I think that's just an imprecise formulation though - the problem is that we can call visibilitymap_set() with just VISIBILITYMAP_ALL_FROZEN, even though VISIBILITYMAP_ALL_VISIBLE was concurrently unset. ISTM that we ought to update all_visible_according_to_vm from PageIsAllVisible() once we've locked the page. Even if we avoid this specific case, it seems a recipe for future bugs to have a potentially outdated variable around. Greetings, Andres Freund
Hi, On 2023-01-09 10:16:03 -0800, Peter Geoghegan wrote: > Attached is v3, which explicitly checks the need to set the PD_ALL_VISIBLE > flag at the relevant visibilitymap_set() call site. It also has improved > comments. Afaict we'll need to backpatch this all the way? > From e7788ebdb589fb7c6f866cf53658cc369f9858b5 Mon Sep 17 00:00:00 2001 > From: Peter Geoghegan <pg@bowt.ie> > Date: Sat, 31 Dec 2022 15:13:01 -0800 > Subject: [PATCH v3] Don't accidentally unset all-visible bit in VM. > > One of the calls to visibilitymap_set() during VACUUM's initial heap > pass could unset a page's all-visible bit during the process of setting > the same page's all-frozen bit. As just mentioned upthread, this just seems wrong. > This could happen in the event of a > concurrent HOT update from a transaction that aborts soon after. Since > the all_visible_according_to_vm local variable that lazy_scan_heap works > off of when setting the VM doesn't reflect the current state of the VM, > and since visibilitymap_set() just requested that the all-frozen bit get > set in one case, there was a race condition. Heap pages could initially > be all-visible just as all_visible_according_to_vm is established, then > not be all-visible after the update, and then become eligible to be set > all-visible once more following pruning by VACUUM. There is no reason > why VACUUM can't remove a concurrently aborted heap-only tuple right > away, and so no reason why such a page won't be able to reach the > relevant visibilitymap_set() call site. Do you have a reproducer for this? > @@ -1120,8 +1123,8 @@ lazy_scan_heap(LVRelState *vacrel) > * got cleared after lazy_scan_skip() was called, so we must recheck > * with buffer lock before concluding that the VM is corrupt. 
> */ > - else if (all_visible_according_to_vm && !PageIsAllVisible(page) > - && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer)) > + else if (all_visible_according_to_vm && !PageIsAllVisible(page) && > + visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer) != 0) > { > elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u", > vacrel->relname, blkno); Hm. The message gets a bit less accurate with the change. Perhaps OK? OTOH, it might be useful to know what bit was wrong when debugging problems. > @@ -1164,12 +1167,34 @@ lazy_scan_heap(LVRelState *vacrel) > !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer)) > { > /* > - * We can pass InvalidTransactionId as the cutoff XID here, > - * because setting the all-frozen bit doesn't cause recovery > - * conflicts. > + * Avoid relying on all_visible_according_to_vm as a proxy for the > + * page-level PD_ALL_VISIBLE bit being set, since it might have > + * become stale -- even when all_visible is set in prunestate. > + * > + * Consider the example of a page that starts out all-visible and > + * then has a tuple concurrently deleted by an xact that aborts. > + * The page will be all_visible_according_to_vm, and will have > + * all_visible set in prunestate. It will nevertheless not have > + * PD_ALL_VISIBLE set by here (plus neither VM bit will be set). > + * And so we must check if PD_ALL_VISIBLE needs to be set. > */ > + if (!PageIsAllVisible(page)) > + { > + PageSetAllVisible(page); > + MarkBufferDirty(buf); > + } > + > + /* > + * Set the page all-frozen (and all-visible) in the VM. > + * > + * We can pass InvalidTransactionId as our visibility_cutoff_xid, > + * since a snapshotConflictHorizon sufficient to make everything > + * safe for REDO was logged when the page's tuples were frozen. 
> + */ > + Assert(!TransactionIdIsValid(prunestate.visibility_cutoff_xid)); > visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr, > vmbuffer, InvalidTransactionId, > + VISIBILITYMAP_ALL_VISIBLE | > VISIBILITYMAP_ALL_FROZEN); > } > > @@ -1311,7 +1336,11 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block, > > /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */ > if (!vacrel->skipwithvm) > + { > + /* Caller shouldn't rely on all_visible_according_to_vm */ > + *next_unskippable_allvis = false; > break; > + } > > /* > * Aggressive VACUUM caller can't skip pages just because they are > @@ -1818,7 +1847,11 @@ retry: > * cutoff by stepping back from OldestXmin. > */ > if (prunestate->all_visible && prunestate->all_frozen) > + { > + /* Using same cutoff when setting VM is now unnecessary */ > snapshotConflictHorizon = prunestate->visibility_cutoff_xid; > + prunestate->visibility_cutoff_xid = InvalidTransactionId; > + } > else > { > /* Avoids false conflicts when hot_standby_feedback in use */ > @@ -2388,8 +2421,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel) > static void > lazy_vacuum_heap_rel(LVRelState *vacrel) > { > - int index; > - BlockNumber vacuumed_pages; > + int index = 0; > + BlockNumber vacuumed_pages = 0; > Buffer vmbuffer = InvalidBuffer; > LVSavedErrInfo saved_err_info; > > @@ -2406,42 +2439,42 @@ lazy_vacuum_heap_rel(LVRelState *vacrel) > VACUUM_ERRCB_PHASE_VACUUM_HEAP, > InvalidBlockNumber, InvalidOffsetNumber); > > - vacuumed_pages = 0; > - > - index = 0; > while (index < vacrel->dead_items->num_items) > { > - BlockNumber tblk; > + BlockNumber blkno; > Buffer buf; > Page page; > Size freespace; > > vacuum_delay_point(); I still think such changes are inappropriate for a bugfix, particularly one that needs to be backpatched. Greetings, Andres Freund
On Mon, Jan 9, 2023 at 11:44 AM Andres Freund <andres@anarazel.de> wrote: > I think that's just an imprecise formulation though - the problem is that we > can call visibilitymap_set() with just VISIBILITYMAP_ALL_FROZEN, even though > VISIBILITYMAP_ALL_VISIBLE was concurrently unset. That's correct. You're right that my description of the problem from the commit message was confusing. But we're on the same page about the problem now. > ISTM that we ought to update all_visible_according_to_vm from > PageIsAllVisible() once we've locked the page. Even if we avoid this specific > case, it seems a recipe for future bugs to have a potentially outdated > variable around. I basically agree, but some of the details are tricky. As I mentioned already, my work on visibility map snapshots just gets rid of all_visible_according_to_vm, which is my preferred long term approach. We will very likely need to keep all_visible_according_to_vm as a cache for performance reasons for as long as we have SKIP_PAGES_THRESHOLD. Can we just update all_visible_according_to_vm using PageIsAllVisible(), without making all_visible_according_to_vm significantly less useful as a cache? Maybe. Not sure offhand. -- Peter Geoghegan
On Mon, Jan 2, 2023 at 1:32 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Sat, Dec 31, 2022 at 4:53 PM Peter Geoghegan <pg@bowt.ie> wrote: > > The first patch makes sure that the snapshotConflictHorizon cutoff > > (XID cutoff for recovery conflicts) is never a special XID, unless > > that XID is InvalidTransactionId, which is interpreted as a > > snapshotConflictHorizon value that will never need a recovery conflict > > (per the general convention for snapshotConflictHorizon values > > explained above ResolveRecoveryConflictWithSnapshot). > > Pushed this just now. > > Attached is another very simple refactoring patch for vacuumlazy.c. It > makes vacuumlazy.c save the result of get_database_name() in vacrel, > which matches what we already do with things like > get_namespace_name(). > > Would be helpful if I could get a +1 on > v1-0002-Never-just-set-the-all-frozen-bit-in-VM.patch, which is > somewhat more substantial than the others. I feel that you should at least have a reproducer for these problems posted to the thread, and ideally a regression test, before committing things. I think it's very hard to understand what the problems are right now. I don't particularly have a problem with the idea of 0001, because if we use InvalidTransactionId to mean that there cannot be any conflicts, we do not need FrozenTransactionId to mean the same thing. Picking one or the other makes sense. Perhaps we would need two values if we both needed a value that meant "conflict with nothing" and also a value that meant "conflict with everything," but in that case I suppose we would want FrozenTransactionId to be the one that meant conflict with nothing, since it logically precedes all other XIDs, and conflicts are with XIDs that precede the value in the record. However, I don't find the patch very clear, either. 
It doesn't update any comments, not even this one: /* * It's possible that we froze tuples and made the page's XID cutoff * (for recovery conflict purposes) FrozenTransactionId. This is okay * because visibility_cutoff_xid will be logged by our caller in a * moment. */ - Assert(cutoff == FrozenTransactionId || + Assert(!TransactionIdIsValid(cutoff) || cutoff == prunestate->visibility_cutoff_xid); Isn't the comment now incorrect as a direct result of the changes in the patch? As for 0002, I agree that it's bad if we can get into a state where the all-frozen bit is set and the all-visible bit is not. I'm not completely sure what concrete harm that will cause, but it does not seem good. But I also *entirely* agree with Andres that patches shouldn't run around adjusting nearby code - e.g. variable names - in ways that aren't truly necessary. That just makes life harder, not only for anyone who wants to review the patch now, but also for future readers who may need to understand what the patch changed and why. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Jan 9, 2023 at 11:57 AM Andres Freund <andres@anarazel.de> wrote: > Afaict we'll need to backpatch this all the way? I thought that we probably wouldn't need to, at first. But I now think that we really have to. I didn't realize that affected visibilitymap_set() calls could generate useless set-VM WAL records until you pointed it out. That's far more likely to happen than the race condition that I described -- it has nothing at all to do with concurrency. That's what clinches it for me. > > One of the calls to visibilitymap_set() during VACUUM's initial heap > > pass could unset a page's all-visible bit during the process of setting > > the same page's all-frozen bit. > > As just mentioned upthread, this just seems wrong. I don't know why this sentence ever made sense to me. Anyway, it's not important now. > Do you have a reproducer for this? No, but I'm quite certain that the race can happen. If it's important to have a reproducer then I can probably come up with one. I could likely figure out a way to write an isolation test that reliably triggers the issue. It would have to work by playing games with cleanup lock/buffer pin waits, since that's the only thing that the test can hook into to make things happen in just the right/wrong order. > > elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u", > > vacrel->relname, blkno); > > Hm. The message gets a bit less accurate with the change. Perhaps OK? OTOH, it > might be useful to know what bit was wrong when debugging problems. Theoretically it might change again, if we call visibilitymap_get_status() again. Maybe I should just broaden the error message a bit instead? > I still think such changes are inappropriate for a bugfix, particularly one > that needs to be backpatched. I'll remove the changes that are inessential in the next revision. I wouldn't have done it if I'd fully understood the seriousness of the issue from the start. 
If you're really concerned about diff size then I should point out that the changes to lazy_vacuum_heap_rel() aren't strictly necessary, and probably shouldn't be backpatched. I deemed that in scope because it's part of the same overall problem of updating the visibility map based on potentially stale information. It makes zero sense to check with the visibility map before updating it when we already know that the page is all-visible. I mean, are we trying to avoid the work of needlessly updating the visibility map in cases where its state was corrupt, but then became uncorrupt (relative to the heap page) by mistake? -- Peter Geoghegan
On Mon, Jan 9, 2023 at 12:51 PM Robert Haas <robertmhaas@gmail.com> wrote: > I feel that you should at least have a reproducer for these problems > posted to the thread, and ideally a regression test, before committing > things. I think it's very hard to understand what the problems are > right now. Hard to understand relative to what, exactly? We're talking about a very subtle race condition here. I'll try to come up with a reproducer, but I *utterly* reject your assertion that it's a hard requirement, sight unseen. Why should those be the parameters of the discussion? For one thing I'm quite confident that I'm right, with or without a reproducer. And my argument isn't all that hard to follow, if you have relevant expertise, and actually take the time. But even this is unlikely to matter much. Even if I somehow turn out to have been completely wrong about the race condition, it is still self-evident that the problem of uselessly WAL logging non-changes to the VM exists. That doesn't require any concurrent access at all. It's a natural consequence of calling visibilitymap_set() with VISIBILITYMAP_ALL_FROZEN-only flags. You need only look at the code for 2 minutes to see it. > I don't particularly have a problem with the idea of 0001, because if > we use InvalidTransactionId to mean that there cannot be any > conflicts, we do not need FrozenTransactionId to mean the same thing. > Picking one or the other makes sense. We've already picked one, many years ago -- InvalidTransactionId. This is a long established convention, common to all REDO routines that are capable of creating granular conflicts. I already committed 0001 over a week ago. We were calling ResolveRecoveryConflictWithSnapshot with FrozenTransactionId arguments before now, which was 100% guaranteed to be a waste of cycles. I saw no need to wait more than a few days for a +1, given that this particular issue was so completely clear cut. -- Peter Geoghegan
On Mon, Jan 9, 2023 at 12:58 PM Peter Geoghegan <pg@bowt.ie> wrote: > I didn't realize that affected visibilitymap_set() calls could > generate useless set-VM WAL records until you pointed it out. That's > far more likely to happen than the race condition that I described -- > it has nothing at all to do with concurrency. That's what clinches it > for me. I didn't spend as much time on this as I'd like to so far, but I think that this concern about visibilitymap_set() actually turns out to not apply. The visibilitymap_set() call in question is gated by a "!VM_ALL_FROZEN()", which is enough to avoid the problem with writing useless VM set records. That doesn't make me doubt my original concern about races where the all-frozen bit can be set, without setting the all-visible bit, and without accounting for the fact that it changed underneath us. That scenario will have !VM_ALL_FROZEN(), so that won't save us. (And we won't test VM_ALL_VISIBLE() or PD_ALL_VISIBLE in a way that is sufficient to realize that all_visible_according_to_vm is stale. prunestate.all_visible being set doesn't reliably indicate that it's not stale, but lazy_scan_heap incorrectly believes that it really does work that way.) -- Peter Geoghegan
On Mon, Jan 9, 2023 at 5:59 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Mon, Jan 9, 2023 at 12:51 PM Robert Haas <robertmhaas@gmail.com> wrote: > > I feel that you should at least have a reproducer for these problems > > posted to the thread, and ideally a regression test, before committing > > things. I think it's very hard to understand what the problems are > > right now. > > Hard to understand relative to what, exactly? We're talking about a > very subtle race condition here. > > I'll try to come up with a reproducer, but I *utterly* reject your > assertion that it's a hard requirement, sight unseen. Why should those > be the parameters of the discussion? > > For one thing I'm quite confident that I'm right, with or without a > reproducer. And my argument isn't all that hard to follow, if you have > relevant expertise, and actually take the time. Look, I don't want to spend time arguing about what seem to me to be basic principles of good software engineering. When I don't put test cases into my patches, people complain at me and tell me that I'm a bad software engineer because I didn't include test cases. Your argument here seems to be that you're such a good software engineer that you don't need any test cases to know what the bug is or that you've fixed it correctly. That seems like a surprising argument, but even if it's true, test cases can have considerable value to future code authors, because it allows them to avoid reintroducing bugs that have previously been fixed. In my opinion, it's not worth trying to have automated test cases for absolutely every bug we fix, because many of them would be really hard to develop and executing all of them every time we do anything would be unduly time-consuming. But I can't remember the last time before this that someone wanted to commit a patch for a data corruption issue without even providing a test case that other people can run manually. 
If you think that is or ought to be standard practice, I can only say that I disagree. I don't particularly appreciate the implication that I either lack relevant expertise or don't actually take the time, either. I spent an hour yesterday looking at your patches and didn't feel I was very close to understanding 0002 in that time. If the patches were better-written, with relevant comments and test cases and really good commit messages and a lack of extraneous changes, I believe I probably would have gotten a lot further in the same amount of time. There is certainly an alternate explanation, which is that I am stupid. I'm inclined to think that's not the correct explanation, but most stupid people believe that they aren't, so that doesn't really prove anything. > But even this is > unlikely to matter much. Even if I somehow turn out to have been > completely wrong about the race condition, it is still self-evident > that the problem of uselessly WAL logging non-changes to the VM > exists. That doesn't require any concurrent access at all. It's a > natural consequence of calling visibilitymap_set() with > VISIBILITYMAP_ALL_FROZEN-only flags. You need only look at the code > for 2 minutes to see it. Apparently not, because I spent more time than that. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jan 10, 2023 at 10:50 AM Robert Haas <robertmhaas@gmail.com> wrote: > Look, I don't want to spend time arguing about what seem to me to be > basic principles of good software engineering. When I don't put test > cases into my patches, people complain at me and tell me that I'm a > bad software engineer because I didn't include test cases. Your > argument here seems to be that you're such a good software engineer > that you don't need any test cases to know what the bug is or that > you've fixed it correctly. That's not what I said. This is a straw man. What I actually said was that there is no reason to declare up front that the only circumstances under which a fix could be committed is when a clean repro is available. I never said that a test case has little or no value, and I certainly didn't assert that we definitely don't need a test case to proceed with a commit -- since I am not in the habit of presumptuously attaching conditions to such things well in advance. > I don't particularly appreciate the implication that I either lack > relevant or expertise or don't actually take time, either. The implication was only that you didn't take the time. Clearly you have the expertise. Obviously you're very far from stupid. I have been unable to reproduce the problem, and think it's possible that the issue cannot be triggered in practice. Though only through sheer luck. Here's why that is: While pruning will remove aborted dead tuples, freezing will not remove the xmax of an aborted update unless the XID happens to be < OldestXmin. With my problem scenario, the page will be all_visible in prunestate, but not all_frozen -- so it dodges the relevant visibilitymap_set() call site. That just leaves inserts that abort, I think. An aborted insert will be totally undone by pruning, but that does still leave behind an LP_DEAD item that needs to be vacuumed in the second heap pass. 
This means that we can only set the page all-visible/all-frozen in the
VM in the second heap pass, which also dodges the relevant
visibilitymap_set() call site.

In summary, I think that there is currently no way that we can have the
VM (or the PD_ALL_VISIBLE flag) concurrently unset, while leaving the
page all_frozen. It can happen and leave the page all_visible, but not
all_frozen, due to these very fine details. (Assuming I haven't missed
another path to the problem with aborted Multis or something, but it
looks like I haven't.)

-- 
Peter Geoghegan
On Tue, Jan 10, 2023 at 11:47 AM Peter Geoghegan <pg@bowt.ie> wrote:
> In summary, I think that there is currently no way that we can have
> the VM (or the PD_ALL_VISIBLE flag) concurrently unset, while leaving
> the page all_frozen. It can happen and leave the page all_visible, but
> not all_frozen, due to these very fine details. (Assuming I haven't
> missed another path to the problem with aborted Multis or something,
> but it looks like I haven't.)

Actually, FreezeMultiXactId() can fully remove an xmax that has some
member XIDs >= OldestXmin, provided FRM_NOOP processing isn't possible,
at least when no individual member is still running. Doesn't have to
involve transaction aborts at all.

Let me go try to break it that way...

-- 
Peter Geoghegan
On Tue, Jan 10, 2023 at 2:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
> What I actually said was that there is no reason to declare up front
> that the only circumstances under which a fix could be committed is
> when a clean repro is available. I never said that a test case has
> little or no value, and I certainly didn't assert that we definitely
> don't need a test case to proceed with a commit -- since I am not in
> the habit of presumptuously attaching conditions to such things well
> in advance.

I don't understand what distinction you're making. It seems like
hair-splitting to me. We should be able to reproduce problems like this
reliably, at least with the aid of a debugger and some breakpoints,
before we go changing the code. The risk of being wrong is quite high
because the code is subtle, and the consequences of being wrong are
potentially very bad because the code is critical to data integrity. If
the reproducer doesn't require a debugger or other extreme contortions,
then we should consider reducing it to a TAP test that can be
committed. If you agree with that, then I'm not sure what your last
email was complaining about. If you disagree, then I don't know why.

> I have been unable to reproduce the problem, and think it's possible
> that the issue cannot be triggered in practice. Though only through
> sheer luck. Here's why that is:
>
> While pruning will remove aborted dead tuples, freezing will not
> remove the xmax of an aborted update unless the XID happens to be <
> OldestXmin. With my problem scenario, the page will be all_visible in
> prunestate, but not all_frozen -- so it dodges the relevant
> visibilitymap_set() call site.
>
> That just leaves inserts that abort, I think. An aborted insert will
> be totally undone by pruning, but that does still leave behind an
> LP_DEAD item that needs to be vacuumed in the second heap pass.
> This means that we can only set the page all-visible/all-frozen in
> the VM in the second heap pass, which also dodges the relevant
> visibilitymap_set() call site.

I guess I'm not very sure that this is sheer luck. It seems like we
could equally well suppose that the people who wrote the code correctly
understood the circumstances under which we needed to avoid calling
visibilitymap_set(), and wrote the code in a way that accomplished that
purpose. Maybe there's contrary evidence or maybe it is actually broken
somehow, but that's not currently clear to me.

For the purposes of clarifying my understanding, is this the code
you're principally worried about?

    /*
     * If the all-visible page is all-frozen but not marked as such yet,
     * mark it as all-frozen.  Note that all_frozen is only valid if
     * all_visible is true, so we must check both prunestate fields.
     */
    else if (all_visible_according_to_vm && prunestate.all_visible &&
             prunestate.all_frozen &&
             !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
    {
        /*
         * We can pass InvalidTransactionId as the cutoff XID here,
         * because setting the all-frozen bit doesn't cause recovery
         * conflicts.
         */
        visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
                          vmbuffer, InvalidTransactionId,
                          VISIBILITYMAP_ALL_FROZEN);
    }

Or maybe this one?

    if (PageIsAllVisible(page))
    {
        uint8       flags = 0;
        uint8       vm_status = visibilitymap_get_status(vacrel->rel,
                                                         blkno, vmbuffer);

        /* Set the VM all-frozen bit to flag, if needed */
        if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
            flags |= VISIBILITYMAP_ALL_VISIBLE;
        if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0 && all_frozen)
            flags |= VISIBILITYMAP_ALL_FROZEN;

        Assert(BufferIsValid(*vmbuffer));
        if (flags != 0)
            visibilitymap_set(vacrel->rel, blkno, buffer,
                              InvalidXLogRecPtr, *vmbuffer,
                              visibility_cutoff_xid, flags);
    }

These are the only two call sites in vacuumlazy.c where I can see there
being a theoretical risk of the kind of problem that you're describing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Jan 10, 2023 at 12:19 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't understand what distinction you're making. It seems like
> hair-splitting to me. We should be able to reproduce problems like
> this reliably, at least with the aid of a debugger and some
> breakpoints, before we go changing the code.

So we can *never* change something defensively, on the basis of a
suspected or theoretical hazard, either in backbranches or just on
HEAD? Not under any circumstances, ever?

> The risk of being wrong is quite high because the code is subtle, and
> the consequences of being wrong are potentially very bad because the
> code is critical to data integrity. If the reproducer doesn't require
> a debugger or other extreme contortions, then we should consider
> reducing it to a TAP test that can be committed. If you agree with
> that, then I'm not sure what your last email was complaining about.

I was complaining about your prescribing conditions on proceeding with
a commit, based on an understanding of things that you yourself
acknowledged as incomplete. I cannot imagine how you read that as an
unwillingness to test the issue, especially given that I agreed to work
on that before you chimed in.

> > I have been unable to reproduce the problem, and think it's possible
> > that the issue cannot be triggered in practice. Though only through
> > sheer luck. Here's why that is:
>
> I guess I'm not very sure that this is sheer luck.

That's just my characterization. Other people can make up their own
minds.

> For the purposes of clarifying my understanding, is this the code
> you're principally worried about?
>
>     visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
>                       vmbuffer, InvalidTransactionId,
>                       VISIBILITYMAP_ALL_FROZEN);

Obviously I meant this call site, since it's the only one that passes
VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
VISIBILITYMAP_ALL_VISIBLE -- in vacuumlazy.c, and in general.
The other visibilitymap_set() call site that you quoted is from the
second heap pass, where LP_DEAD items are vacuumed and become LP_UNUSED
items. That isn't buggy, but it is a silly approach, in that it cares
about what the visibility map says about the page being all-visible, as
if it might take a dissenting view that needs to be taken into
consideration (obviously we know what's going on with the page because
we just scanned it ourselves, and determined that it was at least
all-visible).

-- 
Peter Geoghegan
On Tue, Jan 10, 2023 at 12:08 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Actually, FreezeMultiXactId() can fully remove an xmax that has some
> member XIDs >= OldestXmin, provided FRM_NOOP processing isn't
> possible, at least when no individual member is still running. Doesn't
> have to involve transaction aborts at all.
>
> Let me go try to break it that way...

Attached patch shows how this could break. It adds an assertion that
checks that the expected PD_ALL_VISIBLE/VM_ALL_VISIBLE() conditions
hold at the right point. It also comments out FreezeMultiXactId()'s
FRM_NOOP handling. The FRM_NOOP case is really just an optimization,
and shouldn't be needed for correctness. This is amply demonstrated by
running "meson test" with the patch applied, which will pass without
incident.

I can get the PD_ALL_VISIBLE assertion to fail by following this
procedure with the patch applied:

* Run a plain VACUUM to set all the pages from a table all-visible,
but not all-frozen.

* Set a breakpoint that will hit after all_visible_according_to_vm is
set to true, for an interesting blkno.

* Run VACUUM FREEZE. We need FREEZE in order to be able to hit the
relevant visibilitymap_set() call site (the one that passes
VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
VISIBILITYMAP_ALL_VISIBLE).

Now all_visible_according_to_vm is set to true, but we don't have a
lock/pin on the same heap page just yet.

* Acquire several non-conflicting row locks on a row on the block in
question, so that a new multi is allocated.

* End every session whose XID is stored in our multi (commit/abort).

* Within GDB, continue from before -- observe assertion failure.

Obviously this scenario doesn't demonstrate the presence of a bug --
not quite. But it does prove that we rely on FRM_NOOP to not allow the
VM to become corrupt, which just doesn't make any sense, and can't have
been intended. At a minimum, it strongly suggests that the current
approach is very fragile.

-- 
Peter Geoghegan
On Tue, Jan 10, 2023 at 4:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> * Run VACUUM FREEZE. We need FREEZE in order to be able to hit the
> relevant visibilitymap_set() call site (the one that passes
> VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
> VISIBILITYMAP_ALL_VISIBLE).
>
> Now all_visible_according_to_vm is set to true, but we don't have a
> lock/pin on the same heap page just yet.
>
> * Acquire several non-conflicting row locks on a row on the block in
> question, so that a new multi is allocated.

Forgot to mention that there needs to be a HOT update mixed in with
these SELECT ... FOR SHARE row lockers, too, which must abort once its
XID has been added to a multi. Obviously heap_lock_tuple() won't ever
unset VISIBILITYMAP_ALL_VISIBLE or PD_ALL_VISIBLE (it only ever clears
VISIBILITYMAP_ALL_FROZEN) -- so we need a heap_update() to clear all of
these status bits.

This enables the assertion to fail because:

* Pruning can get rid of the aborted successor heap-only tuple right
away, so it is not going to block us from setting the page all_visible
(that just leaves the original tuple to consider).

* The original tuple's xmax is a Multi, so it won't automatically be
ineligible for freezing because it's > OldestXmin in this scenario.

* FreezeMultiXactId() processing will completely remove xmax, without
caring too much about cutoffs like OldestXmin -- it only cares about
whether each individual XID needs to be kept or not.

(Granted, FreezeMultiXactId() will only remove xmax like this because I
deliberately removed its FRM_NOOP handling, but that is a very delicate
thing to rely on, especially from such a great distance. I can't
imagine that it doesn't fail on HEAD for any reason beyond sheer luck.)

-- 
Peter Geoghegan
On Mon, Jan 9, 2023 at 12:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jan 9, 2023 at 11:57 AM Andres Freund <andres@anarazel.de> wrote:
> > Afaict we'll need to backpatch this all the way?
>
> I thought that we probably wouldn't need to, at first. But I now think
> that we really have to.

Attached is v4. This is almost the same as v3. The only notable change
is in how the issue is explained in comments, and in the commit
message.

I have revised my opinion on this question once more. In light of what
has come to light about the issue from recent testing, I lean towards a
HEAD-only commit once again. What do you think?

I still hope to be able to commit this on my original timeline (on
Monday or so), without the issue taking up too much more of everybody's
time.

Thanks
-- 
Peter Geoghegan