Re: Why clearing the VM doesn't require registering vm buffer in wal record - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: Why clearing the VM doesn't require registering vm buffer in wal record
Date
Msg-id CAAKRu_Z_KAAZAHtuorifR2MRc_MkEcHf-C_t4b9HZaHpa3nriw@mail.gmail.com
In response to Re: Why clearing the VM doesn't require registering vm buffer in wal record  (Andres Freund <andres@anarazel.de>)
Responses Re: Why clearing the VM doesn't require registering vm buffer in wal record
List pgsql-hackers
Thanks for the review! I incorporated all your suggestions in the
attached v2 unless otherwise noted below.

Based on an LLM-assisted self-review of this code, I think there is a
pre-existing bug in heap_lock_updated_tuple_rec(): we don't reset
cleared_all_frozen, so it may hold a stale value if we iterate more
than once (new tuple versions) or take the l4 goto label (retries for
the same tuple version). This could mean we unnecessarily clear
all-frozen on the standby (and, after my patch, unnecessarily register
the VM buffer). I couldn't come up with a repro, though, so I didn't
include a fix; I wanted to first confirm that someone else also thinks
this is a bug.
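To illustrate, here is a standalone sketch of the suspected hazard. The
flag name mirrors heap_lock_updated_tuple_rec(), but the loop and the
emit_cleared_flags() helper are hypothetical simplifications, not the
actual heapam code:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Standalone sketch of the suspected bug: a flag that is set inside a
 * loop but never reset carries a stale value into later iterations.
 * The flag name loosely mirrors heap_lock_updated_tuple_rec(); the loop
 * itself is hypothetical and much simplified.
 *
 * For each tuple version i, out[i] records the cleared_all_frozen value
 * that the WAL record for that version would carry.
 */
void
emit_cleared_flags(const bool clears_frozen[], bool out[], int nversions,
                   bool reset_each_iteration)
{
    bool        cleared_all_frozen = false;

    for (int i = 0; i < nversions; i++)
    {
        if (reset_each_iteration)
            cleared_all_frozen = false; /* the fix: reset per version */

        if (clears_frozen[i])
            cleared_all_frozen = true;  /* this version cleared all-frozen */

        out[i] = cleared_all_frozen;    /* value logged for version i */
    }
}
```

With {true, false} as input, the unfixed variant reports a stale "true"
for the second version, which is the standby-side over-clearing described
above.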

On Fri, May 1, 2026 at 3:14 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2026-04-30 17:44:28 -0400, Melanie Plageman wrote:
>
> > Attached is a patch set to fix the issue based largely on the work you
> > started on your branch. I attached the version targeting master/19
> > which is prefixed with v1_PGMASTER and the version targeting 18,
> > prefixed v1_PG18. The pg 18 changes aren't a straight cherry-pick to
> > 17 (the earliest I'll backpatch because that was when incremental
> > backup was introduced) because the redo functions live in a different
> > file in 18 than in 17, but I want to avoid discussing three different
> > versions of this patch set on this thread.
>
> Makes sense.
>
> I think it's worth explicitly bringing up that this means that we leave an
> arguable corruption case in the earlier branches.

But can there be such a case, since we read the VM with ZERO_AND_LOCK?
Are there other cases that rely on only the registered buffers? Is it
possible it isn't reachable before 17?

> I thought I'd start looking at the 18 patches.

Attached v2_PG18 and v2_PG19 include all the updates from your
previous review applied to both branches.

> > @@ -331,7 +332,10 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
> >
> >       UnlockReleaseBuffer(buf);
> >
> > -     if (vmbuf != InvalidBuffer)
> > +
> > +     if (did_modify_vm)
> > +         UnlockReleaseBuffer(vmbuf);
> > +     else if (BufferIsValid(vmbuf))
> >           ReleaseBuffer(vmbuf);
> >
> >       /* Update the current_start_ptr before moving to the next page. */
>
> Why did you touch heap_surgery in 18?  That shouldn't be necessary, right?

It does visibilitymap_clear() and is also broken. Though, as you
pointed out off-list, my fix didn't even go far enough: it logged one
WAL record for the heap page and a separate one for the VM update,
which isn't correct. v2 emits a single WAL record with both buffers
registered.

> It probably doesn't matter, but it seems like it'd be ever so slightly more
> correct to first do visibilitymap_clear_locked() and then
> PageClearAllVisible(), given
> a) the rule that there should never be a VM bit set when !PD_ALL_VISIBLE
> b) VM reads happening without page locks
>
> I don't think it can ever lead to a wrong result, but still seems better to do
> it the "right" way round.

I've changed this throughout the patch set.

> > @@ -2228,13 +2231,22 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
> >       /* filtering by origin on a row level is much more efficient */
> >       XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
> >
> > +     if (all_visible_cleared)
> > +         XLogRegisterBuffer(1, vmbuffer, 0);
> > +
>
> For update you created HEAP_UPDATE_BLKREF_* to distinguish the different block
> references.  I think that was a good call to make the code at least somewhat
> understandable.  I think once we do it for heap_update(), we should just do it
> for heap_{insert,multi_insert,delete,update} too.

I've done this in both versions as 0001.

> Am I blind, or have not documented the lock ordering rules for heap vs VM
> anywhere?

I don't see this documented anywhere. Though I'm not entirely sure
what you mean. In the cases with one VM block and one heap block, do
you mean we should have documented that you must acquire both locks
before making changes to either page?

Thinking about this made me wonder if I should refactor the patches a
bit to acquire the vmbuffer lock before the critical section instead.
This would make the heap_insert() go from this

    START_CRIT_SECTION();

    RelationPutHeapTuple(relation, buffer, heaptup,
                         (options & HEAP_INSERT_SPECULATIVE) != 0);

    if (PageIsAllVisible(page))
    {
        all_visible_cleared = true;

        LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
        visibilitymap_clear(relation->rd_locator,
                            ItemPointerGetBlockNumber(&(heaptup->t_self)),
                            vmbuffer, VISIBILITYMAP_VALID_BITS,
                            InvalidXLogRecPtr);
        PageClearAllVisible(BufferGetPage(buffer));
    }

to something more like this:

    if (PageIsAllVisible(page))
    {
        clear_all_visible = true;
        LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
    }

    START_CRIT_SECTION();

    RelationPutHeapTuple(relation, buffer, heaptup,
                         (options & HEAP_INSERT_SPECULATIVE) != 0);

    if (clear_all_visible)
    {
        visibilitymap_clear(relation->rd_locator,
                            ItemPointerGetBlockNumber(&(heaptup->t_self)),
                            vmbuffer, VISIBILITYMAP_VALID_BITS,
                            InvalidXLogRecPtr);
        PageClearAllVisible(BufferGetPage(buffer));
    }


> I think here we actually have a lock ordering issue:
<-- snip -->
> So I think we need to be careful and acquire the exlusive locks on the two vm
> pages in ascending block number order.  That'd turn the above into:
>
> B1: lock VM1
> B2: lock VM1 <waits>
> B1: lock VM2
> B1: xloginsert etc
> B1: release locks
> B2: lock VM1 <completes>
> B2: lock VM2

Right, good catch. I'd not thought of that at all. I've done this in
attached v2.
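For the archives, here is a minimal standalone sketch of the rule v2
follows. lock_page() is a stand-in that just records the order locks
were taken, not a PostgreSQL API:

```c
#include <assert.h>

typedef unsigned int BlockNumber;

/*
 * Sketch of the deadlock-avoidance rule: when an operation must
 * exclusive-lock two different VM pages, every backend takes them in
 * ascending block-number order, so no two backends can wait on each
 * other. lock_page() is a hypothetical stand-in that records the order
 * in which locks were acquired.
 */
void
lock_page(BlockNumber blk, BlockNumber taken[], int *ntaken)
{
    taken[(*ntaken)++] = blk;
}

void
lock_vm_pages_in_order(BlockNumber vmblk_old, BlockNumber vmblk_new,
                       BlockNumber taken[], int *ntaken)
{
    if (vmblk_old == vmblk_new)
        lock_page(vmblk_old, taken, ntaken);    /* one VM page: one lock */
    else if (vmblk_old < vmblk_new)
    {
        lock_page(vmblk_old, taken, ntaken);
        lock_page(vmblk_new, taken, ntaken);
    }
    else
    {
        lock_page(vmblk_new, taken, ntaken);
        lock_page(vmblk_old, taken, ntaken);
    }
}
```

Whichever of the two VM blocks is passed first, the locks come out in
ascending order, matching the B1/B2 interleaving you describe.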

> > @@ -363,9 +363,24 @@ heap_xlog_delete(XLogReaderState *record)
> >       Relation    reln = CreateFakeRelcacheEntry(target_locator);
> >       Buffer      vmbuffer = InvalidBuffer;
> >
> > -     visibilitymap_pin(reln, blkno, &vmbuffer);
> > -     visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS);
> > -     ReleaseBuffer(vmbuffer);
> > +     if (XLogRecHasBlockRef(record, 1))

> I think we need a comment explaining why we need this if
> (XLogRecHasBlockRef()) dance.  Otherwise whoever has to look at this for some
> bug in the future will be rather confused.

I've added a comment by the new helper function that I added.

> Perhaps just have one comment, with the other places referring to it?  Might
> also be sane to leave a breadcrumb on the WAL logging side, that it used to
> look different in the same major version.

What do you mean by the WAL logging side? This already is the redo
routine. Or do you mean something else by breadcrumb?

> Wonder if we could package this into some kind of helper, there are a bunch of
> copies of this now. But there's enough small differences that it's not
> trivial.

I did this.

Just to confirm: we do _not_ want to bump XLOG_PAGE_MAGIC on the back
branches, because this change doesn't break WAL format compatibility.

> Hm. Could this be simplified to something like:
>
>                 if (has_vm_new)
>                 {
>                         Buffer          vmbuffer = InvalidBuffer;
>
>                         if (XLogReadBufferForRedo(record, HEAP_UPDATE_BLKREF_VM_NEW, &vmbuffer) ==
>                                 BLK_NEEDS_REDO)
>                         {
>                                 if (new_cleared)
>                                         visibilitymap_clear_locked(reln, newblk, vmbuffer,
>                                                                                            VISIBILITYMAP_VALID_BITS);
>                                 if (old_cleared && !has_vm_old)
>                                         visibilitymap_clear_locked(reln, oldblk, vmbuffer,
>                                                                                            VISIBILITYMAP_VALID_BITS);
>                                 PageSetLSN(BufferGetPage(vmbuffer), lsn);
>                         }
>
>                         if (BufferIsValid(vmbuffer))
>                                 UnlockReleaseBuffer(vmbuffer);
>                 }
>
>                 if (has_vm_old)
>                 {
>                         Buffer          vmbuffer = InvalidBuffer;
>
>                         Assert(old_cleared);
>
>                         if (XLogReadBufferForRedo(record, HEAP_UPDATE_BLKREF_VM_OLD, &vmbuffer) ==
>                                 BLK_NEEDS_REDO)
>                         {
>                                 visibilitymap_clear_locked(reln, oldblk, vmbuffer,
>                                                                                    VISIBILITYMAP_VALID_BITS);
>                                 PageSetLSN(BufferGetPage(vmbuffer), lsn);
>                         }
>
>                         if (BufferIsValid(vmbuffer))
>                                 UnlockReleaseBuffer(vmbuffer);
>                 }
>
>                 if (!has_vm_new && !has_vm_old)
>                 {

I tried this ordering. Personally, I find my current approach clearer:
the different cases are self-contained. If the VM bits for two heap
pages are set on the same VM page, you can understand what is
registered without reading across multiple if statements.

I also prefer to have the same structure in replay as in normal
operation. And in normal operation, because of the page lock ordering
concerns, the single VM page vs two VM page cases are even more
different.

> Would it make sense to have an assert like
>
> Assert(has_vm_new || !has_vm_old);  /* should never just have old VM buffer */

It is okay to have just the old VM buffer when only the VM bits for
the old page are set (the two heap pages map to two different VM
pages, but the new page is not all-visible).

> > +bool
> > +visibilitymap_clear_locked(Relation rel, BlockNumber heapBlk, Buffer vmbuf, uint8 flags)
>
> Given this is a new API, I wonder if we shouldn't already start using the new
> signature. If there's any extensions using it (I think there is at least one),
> this would just require them to have an ifdef for <17, 17-18, and >= 19.

So in 19, I just changed the signature of visibilitymap_clear() a bit
and did not add a visibilitymap_clear_locked(). This was doable
because I felt comfortable changing 19 to take a RelFileLocator
instead of a Relation, since visibilitymap_set() already does that on
19.

I am wondering whether, in 19, we should just set the page LSN at the
call sites rather than have visibilitymap_clear() take an LSN. You had
a separate visibilitymap_clear_redo() function that took an LSN. I
guess I could make a separate function that just calls
visibilitymap_clear() and then does PageSetLSN(), but that seemed kind
of silly. What do you think?

Separately, in 18 I say visibilitymap_clear() is there for backwards
compatibility, but then I use it in vacuumlazy.c, where we clear the
VM without WAL-logging when it is corrupted. Should I avoid using it
there because it is supposed to be only for backwards compatibility?
Or does it not matter...

> > +my ($post_del_vis, $post_del_frz) = get_vm_summary($primary, 'vm_delete_test');
> > +cmp_ok($post_del_vis, '<', $pre_del_vis,
> > + "DELETE test: all-visible cleared after delete");
> > +cmp_ok($post_del_frz, '<', $pre_del_frz,
> > + "DELETE test: all-frozen cleared after delete");
> > +
> > +my ($incr2, $incr2_name) = take_incr_backup($full2);
> > +validate_restored_vm($full2, $full2_name, $incr2, $incr2_name,
> > + 'vm_delete_test', 'DELETE test');
>
> Hm. Since these are looking at different tables, can't we do that with one
> backup + incremental backup + combine + validate cycle for all of them?

I've done this.

> Also seems like the modifications, tests and messages for all of them could be
> moved to a helper?

Also did this.

> I assume you verified that the tests catch the problem without the fix?

Yes, though the update test was taking the lock first and thus not
exercising exactly the case I intended. I think it is fixed now.

The Cluster.pm change that pipes the pg_combinebackup output to a
temp file and then prints it to the regress log if a test fails still
needs review. I had some AI assistance writing it; I'm fairly sure it
behaves as I intended, but I don't know whether it is idiomatic Perl
or matches how we usually do things.

> > @@ -499,6 +500,21 @@ verify_heapam(PG_FUNCTION_ARGS)
> >       ctx.blkno = BufferGetBlockNumber(ctx.buffer);
> >       ctx.page = BufferGetPage(ctx.buffer);
> >
> > +     /*
> > +      * It is corruption if the page-level PD_ALL_VISIBLE flag is clear and
> > +      * the visibility map bits corresponding to this heap page are set.
> > +      */
> > +     vmbits = visibilitymap_get_status(ctx.rel, ctx.blkno, &vmbuffer);
>
> Maybe it's worth adding a comment explaining that visibilitymap_get_status()
> won't fail, because the VM code uses ZERO_ON_ERROR and
> visibilitymap_get_status() accepts there no being underlying page?
>
> That'd basically address why this isn't meaningfully harming this goal:
>  * This code goes to some trouble to avoid crashing the server even if the
>  * table pages are badly corrupted, but it's probably not perfect.

Well, I wouldn't necessarily expect visibilitymap_get_status() to fail
just because PD_ALL_VISIBLE is clear on the heap page, and
ZERO_ON_ERROR isn't why that doesn't happen. But I think your point is
that we should explain to the reader that visibilitymap_get_status()
itself won't assert or ERROR out when called on either a corrupted
heap page or a corrupted VM page. Or is that not what you are saying?

- Melanie

