Re: Why clearing the VM doesn't require registering vm buffer in wal record - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: Why clearing the VM doesn't require registering vm buffer in wal record |
| Date | |
| Msg-id | 66mqpfyti3qhfttcsv6r2lbvqqd32rrmpn6i47ovrsnvguts46@gou54xc5jld3 |
| In response to | Re: Why clearing the VM doesn't require registering vm buffer in wal record (Melanie Plageman <melanieplageman@gmail.com>) |
| List | pgsql-hackers |
Hi,
On 2026-04-30 17:44:28 -0400, Melanie Plageman wrote:
> On Thu, Mar 5, 2026 at 4:01 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2026-03-05 15:38:24 -0500, Andres Freund wrote:
> >
> > > There's explicit code for ignoring the FSM, but I don't see the same for the
> > > VM. And that makes sense: VM changes are mostly WAL logged, just not
> > > completely / generically (i.e. this complaint), whereas FSM changes are not
> > > WAL logged at all.
> >
> > Unfortunately I can confirm that incremental backups end up with an outdated
> > VM.
> >
> > An unfortunate kicker: It looks like verify_heapam() doesn't even notice :(.
>
> Attached is a patch set to fix the issue based largely on the work you
> started on your branch. I attached the version targeting master/19
> which is prefixed with v1_PGMASTER and the version targeting 18,
> prefixed v1_PG18. The pg 18 changes aren't a straight cherry-pick to
> 17 (the earliest I'll backpatch because that was when incremental
> backup was introduced) because the redo functions live in a different
> file in 18 than in 17, but I want to avoid discussing three different
> versions of this patch set on this thread.
Makes sense.
I think it's worth explicitly bringing up that this means that we leave an
arguable corruption case in the earlier branches.
I think it's the right call: The changes are non-trivial and there are
plenty of inter-branch differences. And I can't come up with realistic scenarios
where it'd be a problem without something like incremental backup (as
pg_rewind just always copies the VM).
> I added a TAP test that does something like Andres' incremental backup
> repro but tests more variations. Because init_from_backup() runs
> pg_combinebackup with the --debug flag, the log output is very verbose
> (i.e. every single copied file is logged). I suggest we turn it off by
> default in tests. I included a patch that does that. It isn't required
> to commit my test, but my test does produce 5000 lines of regression
> log output which seems...not ideal.
>
> The patch set targeting master includes a few more changes aimed at
> catching bugs like this in the future. 0005 adds a check to
> verify_heapam() that PD_ALL_VISIBLE is never clear when the VM is set.
> 0006 stops masking PD_ALL_VISIBLE during WAL consistency checking. And
> 0007 is a version of a patch Andres started which validates that every
> block registered with a WAL record is read during recovery.
I thought I'd start looking at the 18 patches.
> From 8ef81094e39a34d845b1a40cad7de469c8501bee Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 29 Apr 2026 12:53:49 -0400
> Subject: [PATCH v1_PG18 1/3] Fix WAL logging of VM clears
>
> Previously, we failed to register the visibility map buffer when
> emitting WAL after clearing the VM bits for heap pages (recovery code
> read the VM page normally). This meant that we couldn't emit FPIs of VM
> pages when clearing VM bits. While this would not result in a checksum
> error because the visibility map is read with RBM_ZERO_ON_ERROR, WAL
> summarizer only includes pages which were registered, so a restore from
> an incremental backup would not include the modified VM pages -- leading
> to data corruption.
>
> Fix this by registering the VM buffer in the WAL record when clearing VM
> bits. This also means the VM buffer must be locked throughout the
> duration of the critical section where we make changes to the VM and
> heap page and emit WAL.
> diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
> index 3e86283beb7..d7e4e051fff 100644
> --- a/contrib/pg_surgery/heap_surgery.c
> +++ b/contrib/pg_surgery/heap_surgery.c
> @@ -266,9 +266,10 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
> */
> if (PageIsAllVisible(page))
> {
> + LockBuffer(vmbuf, BUFFER_LOCK_EXCLUSIVE);
> PageClearAllVisible(page);
> - visibilitymap_clear(rel, blkno, vmbuf,
> - VISIBILITYMAP_VALID_BITS);
> + visibilitymap_clear_locked(rel, blkno, vmbuf,
> + VISIBILITYMAP_VALID_BITS);
> did_modify_vm = true;
> }
> }
> @@ -331,7 +332,10 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
>
> UnlockReleaseBuffer(buf);
>
> - if (vmbuf != InvalidBuffer)
> +
> + if (did_modify_vm)
> + UnlockReleaseBuffer(vmbuf);
> + else if (BufferIsValid(vmbuf))
> ReleaseBuffer(vmbuf);
>
> /* Update the current_start_ptr before moving to the next page. */
Why did you touch heap_surgery in 18? That shouldn't be necessary, right?
> @@ -2135,10 +2136,12 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
> if (PageIsAllVisible(BufferGetPage(buffer)))
> {
> all_visible_cleared = true;
> +
> + LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
> PageClearAllVisible(BufferGetPage(buffer));
> - visibilitymap_clear(relation,
> - ItemPointerGetBlockNumber(&(heaptup->t_self)),
> - vmbuffer, VISIBILITYMAP_VALID_BITS);
> + visibilitymap_clear_locked(relation,
> + ItemPointerGetBlockNumber(&(heaptup->t_self)),
> + vmbuffer, VISIBILITYMAP_VALID_BITS);
> }
>
> /*
It probably doesn't matter, but it seems like it'd be ever so slightly more
correct to first do visibilitymap_clear_locked() and then
PageClearAllVisible(), given
a) the rule that there should never be a VM bit set when !PD_ALL_VISIBLE
b) VM reads happening without page locks
I don't think it can ever lead to a wrong result, but still seems better to do
it the "right" way round.
> @@ -2228,13 +2231,22 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
> /* filtering by origin on a row level is much more efficient */
> XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
>
> + if (all_visible_cleared)
> + XLogRegisterBuffer(1, vmbuffer, 0);
> +
For update you created HEAP_UPDATE_BLKREF_* to distinguish the different block
references. I think that was a good call to make the code at least somewhat
understandable. Once we do it for heap_update(), I think we should just do it
for heap_{insert,multi_insert,delete} too.
> @@ -2503,10 +2515,11 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
> if (PageIsAllVisible(page) && !(options & HEAP_INSERT_FROZEN))
> {
> all_visible_cleared = true;
> + LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
> PageClearAllVisible(page);
> - visibilitymap_clear(relation,
> - BufferGetBlockNumber(buffer),
> - vmbuffer, VISIBILITYMAP_VALID_BITS);
> + visibilitymap_clear_locked(relation,
> + BufferGetBlockNumber(buffer),
> + vmbuffer, VISIBILITYMAP_VALID_BITS);
> }
Every time I see one of these new lock acquisitions I get worried about lock
ordering issues leading to deadlocks. Wrongly here, because
visibilitymap_clear() would also have acquired the lock.
Am I blind, or have we not documented the lock ordering rules for heap vs VM
anywhere?
> @@ -3133,13 +3154,23 @@ l1:
> /* filtering by origin on a row level is much more efficient */
> XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
>
> + if (all_visible_cleared)
> + XLogRegisterBuffer(1, vmbuffer, 0);
> +
> recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
On the topic of complaining about things you have no responsibility for
whatsoever: It's so annoying that git picks up goto labels as diff "subjects"
(the "l1:" above) ...
> + /*
> + * Clear PD_ALL_VISIBLE flags and reset visibility map bits for any heap
> + * pages that were all-visible. If there are two heap pages, we may need
> + * to clear VM bits for both.
> + */
> + if (PageIsAllVisible(page) &&
> + newbuf != buffer && PageIsAllVisible(newpage) &&
> + vmbuffer_new == vmbuffer)
> {
> + /*
> + * This is the more complicated case: both the new and old heap pages
> + * are all-visible and both their VM bits are on the same page of the
> + * VM, so we register a single VM buffer as HEAP_UPDATE_BLKREF_VM_NEW
> + * in the WAL record. We must be careful to only lock and register one
> + * buffer, even though we modify it twice -- once for each heap
> + * block's VM bits.
> + */
> - if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
> + else
> {
> - all_visible_cleared_new = true;
> - PageClearAllVisible(BufferGetPage(newbuf));
> - visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
> - vmbuffer_new, VISIBILITYMAP_VALID_BITS);
> + /*
> + * In all the remaining cases, we will clear at most one heap block's
> + * VM bits per VM page, so we don't have to worry about locking or
> + * registering the same buffer twice.
> + */
> + if (PageIsAllVisible(page))
> + {
> + PageClearAllVisible(page);
> + LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
> + visibilitymap_clear_locked(relation, block,
> + vmbuffer, VISIBILITYMAP_VALID_BITS);
> + modified_vmbuffer = true;
> + all_visible_cleared = true;
> + }
> + if (newbuf != buffer && PageIsAllVisible(newpage))
> + {
> + PageClearAllVisible(newpage);
> + LockBuffer(vmbuffer_new, BUFFER_LOCK_EXCLUSIVE);
> + visibilitymap_clear_locked(relation, BufferGetBlockNumber(newbuf),
> + vmbuffer_new, VISIBILITYMAP_VALID_BITS);
> + modified_vmbuffer_new = true;
> + all_visible_cleared_new = true;
> + }
I think here we actually have a lock ordering issue:
Assume we have four all-visible heap pages: blocks HA1 and HA2, covered by
VM1, and HB1 and HB2, covered by VM2. You could get into the unlucky
situation that backend B1 is updating a tuple from HA1 and placing it on
HB1, while backend B2 is updating a tuple on HB2 and moving it to HA2.
In that case you could have the following lock acquisition order:
B1: lock HA1
B1: lock HB1
B2: lock HA2
B2: lock HB2
(hio.c will make sure to acquire the locks in ascending page order)
B1: lock VM1
B2: lock VM2
B1: lock VM2 <waits>
B2: lock VM1 <undetected deadlock>
It might not be possible to reach this, due to the newtupsize > pagefree case
clearing the VM bit. But I think you could get unlucky and have it get set
again. For that I think the target page would have to be empty, otherwise we'd
not have set it as all-visible with a pin, but that seems possible.
Coming back later: The newtupsize > pagefree case doesn't even help, because
inexplicably we are just unsetting all frozen, even though we're about to
update the page! WHYYY!?!
So I think we need to be careful and acquire the exclusive locks on the two VM
pages in ascending block number order. That'd turn the above into:
B1: lock VM1
B2: lock VM1 <waits>
B1: lock VM2
B1: xloginsert etc
B1: release locks
B2: lock VM1 <completes>
B2: lock VM2
We just need to do the lock acquisition that way, not the setting and clearing
of the bits. So it's perhaps not going to be *too* ugly.
> @@ -363,9 +363,24 @@ heap_xlog_delete(XLogReaderState *record)
> Relation reln = CreateFakeRelcacheEntry(target_locator);
> Buffer vmbuffer = InvalidBuffer;
>
> - visibilitymap_pin(reln, blkno, &vmbuffer);
> - visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS);
> - ReleaseBuffer(vmbuffer);
> + if (XLogRecHasBlockRef(record, 1))
> + {
> + if (XLogReadBufferForRedo(record, 1, &vmbuffer) == BLK_NEEDS_REDO)
> + {
> + visibilitymap_clear_locked(reln,
> + blkno, vmbuffer,
> + VISIBILITYMAP_VALID_BITS);
> + PageSetLSN(BufferGetPage(vmbuffer), lsn);
> + }
> + if (BufferIsValid(vmbuffer))
> + UnlockReleaseBuffer(vmbuffer);
> + }
> + else
> + {
> + visibilitymap_pin(reln, blkno, &vmbuffer);
> + visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS);
> + ReleaseBuffer(vmbuffer);
> + }
> FreeFakeRelcacheEntry(reln);
> }
I think we need a comment explaining why we need this if
(XLogRecHasBlockRef()) dance. Otherwise whoever has to look at this for some
bug in the future will be rather confused.
Perhaps just have one comment, with the other places referring to it? Might
also be sane to leave a breadcrumb on the WAL logging side, that it used to
look different in the same major version.
Wonder if we could package this into some kind of helper, there are a bunch of
copies of this now. But there are enough small differences that it's not
trivial.
>
> @@ -691,6 +742,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
> Buffer obuffer,
> nbuffer;
> Page page;
> + bool new_cleared,
> + old_cleared;
> OffsetNumber offnum;
> ItemId lp = NULL;
> HeapTupleData oldtup;
> @@ -728,14 +781,107 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
There's this comment nearby:
/*
* In normal operation, it is important to lock the two pages in
* page-number order, to avoid possible deadlocks against other update
* operations going the other way. However, during WAL replay there can
* be no other update happening, so we don't need to worry about that. But
* we *do* need to worry that we don't expose an inconsistent state to Hot
* Standby queries --- so the original page can't be unlocked before we've
* added the new tuple to the new page.
*/
I think it's not wrong, but man, that seems like a fragile assumption.
> * The visibility map may need to be fixed even if the heap page is
> * already up-to-date.
> */
> - if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
> + new_cleared = (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED) != 0;
> + old_cleared = (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED) != 0;
> + if (new_cleared || old_cleared)
> {
> Relation reln = CreateFakeRelcacheEntry(rlocator);
> - Buffer vmbuffer = InvalidBuffer;
> + bool has_vm_old = XLogRecHasBlockRef(record, HEAP_UPDATE_BLKREF_VM_OLD);
> + bool has_vm_new = XLogRecHasBlockRef(record, HEAP_UPDATE_BLKREF_VM_NEW);
> +
> + if (has_vm_new || has_vm_old)
> + {
> + /*
> + * If both the old and new heap pages were all-visible and their
> + * VM bits are on the same VM page, that single VM page is
> + * registered as HEAP_UPDATE_BLKREF_VM_NEW. Clear both heap
> + * blocks' VM bits from the single provided VM buffer.
> + */
> + bool same_vm_page = new_cleared && old_cleared &&
> + !has_vm_old && has_vm_new;
> +
> + if (same_vm_page)
> + {
> + Buffer vmbuffer = InvalidBuffer;
> +
> + if (XLogReadBufferForRedo(record, HEAP_UPDATE_BLKREF_VM_NEW, &vmbuffer) ==
> + BLK_NEEDS_REDO)
> + {
> + visibilitymap_clear_locked(reln, oldblk, vmbuffer,
> + VISIBILITYMAP_VALID_BITS);
> + visibilitymap_clear_locked(reln, newblk, vmbuffer,
> + VISIBILITYMAP_VALID_BITS);
> + PageSetLSN(BufferGetPage(vmbuffer), lsn);
> + }
> +
> + if (BufferIsValid(vmbuffer))
> + UnlockReleaseBuffer(vmbuffer);
> + }
> + else
> + {
> + /*
> + * We can be sure that we need to clear at most one heap
> + * page's VM bits in each registered VM buffer.
> + */
> + if (new_cleared)
> + {
> + Buffer vmbuffer = InvalidBuffer;
> +
> + if (XLogReadBufferForRedo(record, HEAP_UPDATE_BLKREF_VM_NEW, &vmbuffer) ==
> + BLK_NEEDS_REDO)
> + {
> + visibilitymap_clear_locked(reln, newblk, vmbuffer,
> + VISIBILITYMAP_VALID_BITS);
> + PageSetLSN(BufferGetPage(vmbuffer), lsn);
> + }
> +
> + if (BufferIsValid(vmbuffer))
> + UnlockReleaseBuffer(vmbuffer);
> + }
> + if (old_cleared)
> + {
> + Buffer vmbuffer = InvalidBuffer;
> +
> + if (XLogReadBufferForRedo(record, HEAP_UPDATE_BLKREF_VM_OLD, &vmbuffer) ==
> + BLK_NEEDS_REDO)
> + {
> + visibilitymap_clear_locked(reln, oldblk, vmbuffer,
> + VISIBILITYMAP_VALID_BITS);
> + PageSetLSN(BufferGetPage(vmbuffer), lsn);
> + }
> +
> + if (BufferIsValid(vmbuffer))
> + UnlockReleaseBuffer(vmbuffer);
> + }
> + }
Hm. Could this be simplified to something like:
if (has_vm_new)
{
Buffer vmbuffer = InvalidBuffer;
if (XLogReadBufferForRedo(record, HEAP_UPDATE_BLKREF_VM_NEW, &vmbuffer) ==
BLK_NEEDS_REDO)
{
if (new_cleared)
visibilitymap_clear_locked(reln, newblk, vmbuffer,
VISIBILITYMAP_VALID_BITS);
if (old_cleared && !has_vm_old)
visibilitymap_clear_locked(reln, oldblk, vmbuffer,
VISIBILITYMAP_VALID_BITS);
PageSetLSN(BufferGetPage(vmbuffer), lsn);
}
if (BufferIsValid(vmbuffer))
UnlockReleaseBuffer(vmbuffer);
}
if (has_vm_old)
{
Buffer vmbuffer = InvalidBuffer;
Assert(old_cleared);
if (XLogReadBufferForRedo(record, HEAP_UPDATE_BLKREF_VM_OLD, &vmbuffer) ==
BLK_NEEDS_REDO)
{
visibilitymap_clear_locked(reln, oldblk, vmbuffer,
VISIBILITYMAP_VALID_BITS);
PageSetLSN(BufferGetPage(vmbuffer), lsn);
}
if (BufferIsValid(vmbuffer))
UnlockReleaseBuffer(vmbuffer);
}
if (!has_vm_new && !has_vm_old)
{
...
Would it make sense to have an assert like
Assert(has_vm_new || !has_vm_old); /* should never just have old VM buffer */
> +bool
> +visibilitymap_clear_locked(Relation rel, BlockNumber heapBlk, Buffer vmbuf, uint8 flags)
Given this is a new API, I wonder if we shouldn't already start using the new
signature. If there are any extensions using it (I think there is at least one),
this would just require them to have an ifdef for <17, 17-18, and >= 19.
> +# Set up primary with WAL summarization enabled.
> +my $primary = PostgreSQL::Test::Cluster->new('primary');
> +$primary->init(has_archiving => 1, allows_streaming => 1);
> +$primary->append_conf('postgresql.conf', <<EOF);
> +summarize_wal = on
> +autovacuum = off
> +EOF
A heredoc with a closing paren is correct Perl? You learn something every
day.
Not particularly pretty imo, but whatever, it's Perl.
> +##
> +## Test 1: INSERT clears all-visible and all-frozen bits
> +##
> +
> +$primary->safe_psql('postgres', q{VACUUM (FREEZE) vm_insert_test});
> +my ($pre_ins_vis, $pre_ins_frz) = get_vm_summary($primary, 'vm_insert_test');
> +cmp_ok($pre_ins_vis, '>', 0,
> + "INSERT test: pages are all-visible after vacuum");
> +cmp_ok($pre_ins_frz, '>', 0,
> + "INSERT test: pages are all-frozen after vacuum");
> +
> +my ($full1, $full1_name) = take_full_backup();
> +
> +$primary->safe_psql('postgres',
> + q{INSERT INTO vm_insert_test VALUES (1)});
> +
> +my ($post_ins_vis, $post_ins_frz) = get_vm_summary($primary, 'vm_insert_test');
> +cmp_ok($post_ins_vis, '<', $pre_ins_vis,
> + "INSERT test: all-visible cleared after insert");
> +cmp_ok($post_ins_frz, '<', $pre_ins_frz,
> + "INSERT test: all-frozen cleared after insert");
> +
> +my ($incr1, $incr1_name) = take_incr_backup($full1);
> +validate_restored_vm($full1, $full1_name, $incr1, $incr1_name,
> + 'vm_insert_test', 'INSERT test');
> +
> +##
> +## Test 2: DELETE clears all-visible and all-frozen bits
> +##
> +
> +$primary->safe_psql('postgres', q{VACUUM (FREEZE) vm_delete_test});
> +my ($pre_del_vis, $pre_del_frz) = get_vm_summary($primary, 'vm_delete_test');
> +cmp_ok($pre_del_vis, '>', 0,
> + "DELETE test: pages are all-visible after vacuum");
> +cmp_ok($pre_del_frz, '>', 0,
> + "DELETE test: pages are all-frozen after vacuum");
> +
> +my ($full2, $full2_name) = take_full_backup();
> +
> +$primary->safe_psql('postgres',
> + q{DELETE FROM vm_delete_test WHERE id = 1});
> +
> +my ($post_del_vis, $post_del_frz) = get_vm_summary($primary, 'vm_delete_test');
> +cmp_ok($post_del_vis, '<', $pre_del_vis,
> + "DELETE test: all-visible cleared after delete");
> +cmp_ok($post_del_frz, '<', $pre_del_frz,
> + "DELETE test: all-frozen cleared after delete");
> +
> +my ($incr2, $incr2_name) = take_incr_backup($full2);
> +validate_restored_vm($full2, $full2_name, $incr2, $incr2_name,
> + 'vm_delete_test', 'DELETE test');
Hm. Since these are looking at different tables, can't we do that with one
backup + incremental backup + combine + validate cycle for all of them?
Also seems like the modifications, tests and messages for all of them could be
moved to a helper?
I assume you verified that the tests catch the problem without the fix?
> From bb4c03a887f8ec5395ce37c09279a0786712b06c Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 29 Apr 2026 10:20:20 -0400
> Subject: [PATCH v1_PGMASTER 5/7] Add VM corruption check to verify_heapam()
>
> pg_check_visible() can catch cases where tuples are not visible to all
> and the VM is set, but it doesn't specifically check page hints. This
> doesn't check, for example, an INSERT failing to clear PD_ALL_VISIBLE.
>
> This also helps users who have access to amcheck but not pg_visibility
> to diagnose corruption.
>
> Author: Melanie Plageman <melanieplageman@gmail.com>
> ---
> contrib/amcheck/verify_heapam.c | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
> index 20ff58aa782..0e023871566 100644
> --- a/contrib/amcheck/verify_heapam.c
> +++ b/contrib/amcheck/verify_heapam.c
> @@ -481,6 +481,7 @@ verify_heapam(PG_FUNCTION_ARGS)
>
> while ((ctx.buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
> {
> + uint8 vmbits;
> OffsetNumber maxoff;
> OffsetNumber predecessor[MaxOffsetNumber];
> OffsetNumber successor[MaxOffsetNumber];
> @@ -499,6 +500,21 @@ verify_heapam(PG_FUNCTION_ARGS)
> ctx.blkno = BufferGetBlockNumber(ctx.buffer);
> ctx.page = BufferGetPage(ctx.buffer);
>
> + /*
> + * It is corruption if the page-level PD_ALL_VISIBLE flag is clear and
> + * the visibility map bits corresponding to this heap page are set.
> + */
> + vmbits = visibilitymap_get_status(ctx.rel, ctx.blkno, &vmbuffer);
Maybe it's worth adding a comment explaining that visibilitymap_get_status()
won't fail, because the VM code uses ZERO_ON_ERROR and
visibilitymap_get_status() accepts there not being an underlying page?
That'd basically address why this isn't meaningfully harming this goal:
* This code goes to some trouble to avoid crashing the server even if the
* table pages are badly corrupted, but it's probably not perfect.
> + if (!PageIsAllVisible(ctx.page) &&
> + (vmbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
> + {
> + ctx.offnum = InvalidOffsetNumber;
> + ctx.attnum = -1;
> + report_corruption(&ctx,
> + psprintf("page is not marked all-visible in page header but visibility map bit is
set"));
I'd just check for both bits, not just all visible. Not that it's likely
you'd have just ALL_FROZEN set, but ...
> From 2be0dee6aa408752200ef9adfa9bf6a94716b4b2 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 29 Apr 2026 10:33:46 -0400
> Subject: [PATCH v1_PGMASTER 6/7] Stop masking PD_ALL_VISIBLE when checking WAL
> consistency
>
> Setting and clearing PD_ALL_VISIBLE is WAL-logged, so there is no need
> to mask it for the purposes of WAL consistency checking.
>
> There is one minor exception, which is that when fixing VM corruption,
> we clear PD_ALL_VISIBLE on the heap page without specifically emitting
> WAL. This is a situation where corruption has already happened, however,
> so it is okay if WAL consistency checking flags this as well.
>
> Stop masking PD_ALL_VISIBLE, as doing so can only conceal real bugs.
>
> Author: Melanie Plageman <melanieplageman@gmail.com>
> ---
> src/backend/access/common/bufmask.c | 6 ------
> 1 file changed, 6 deletions(-)
>
> diff --git a/src/backend/access/common/bufmask.c b/src/backend/access/common/bufmask.c
> index 5f63d04c9cb..93e34ea38b8 100644
> --- a/src/backend/access/common/bufmask.c
> +++ b/src/backend/access/common/bufmask.c
> @@ -53,12 +53,6 @@ mask_page_hint_bits(Page page)
> /* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints. */
> PageClearFull(page);
> PageClearHasFreeLinePointers(page);
> -
> - /*
> - * PD_ALL_VISIBLE is masked during WAL consistency checking. XXX: It is
> - * worth investigating if we could stop doing this.
> - */
> - PageClearAllVisible(page);
I'm still baffled we did this. A real hear-no-evil-see-no-evil approach.
> diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
> index 8849610db00..63747aec566 100644
> --- a/src/backend/access/transam/xlogreader.c
> +++ b/src/backend/access/transam/xlogreader.c
> @@ -1794,6 +1794,10 @@ DecodeXLogRecord(XLogReaderState *state,
>
> blk = &decoded->blocks[block_id];
> blk->in_use = true;
> +#ifdef USE_ASSERT_CHECKING
> + blk->used_read = false;
> +#endif
> +
> blk->apply_image = false;
>
> COPY_HEADER_FIELD(&fork_flags, sizeof(uint8));
> diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
> index c236e2b7969..4e1d50f642d 100644
> --- a/src/backend/access/transam/xlogrecovery.c
> +++ b/src/backend/access/transam/xlogrecovery.c
> @@ -1962,6 +1962,15 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
> /* Now apply the WAL record itself */
> GetRmgr(record->xl_rmid).rm_redo(xlogreader);
>
> +#ifdef USE_ASSERT_CHECKING
> + for (int block_id = 0; block_id <= xlogreader->record->max_block_id; block_id++)
> + {
> + DecodedBkpBlock *blk = &xlogreader->record->blocks[block_id];
> +
> + Assert(!blk->in_use || blk->used_read);
> + }
> +#endif
Maybe deserves a comment like:
Verify that all block references were used during replay. This helps detect
bugs in redo routines. Every referenced block needs to be replayed with
XLogReadBufferForRedoExtended(), possibly via a helper, to ensure that we
replay FPIs, extend the relation if necessary, and apply other similar
protections.
Greetings,
Andres Freund