Re: Why clearing the VM doesn't require registering vm buffer in wal record - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Why clearing the VM doesn't require registering vm buffer in wal record
Date
Msg-id c9dfba94-4f8e-4cb3-9068-6aec8fcd9a2f@vondra.me
Whole thread Raw
In response to Re: Why clearing the VM doesn't require registering vm buffer in wal record  (Andres Freund <andres@anarazel.de>)
Responses Re: Why clearing the VM doesn't require registering vm buffer in wal record
List pgsql-hackers
On 3/12/26 18:05, Andres Freund wrote:
> Hi,
> 
> On 2026-03-06 11:49:20 -0500, Robert Haas wrote:
>> On Thu, Mar 5, 2026 at 7:43 PM Andres Freund <andres@anarazel.de> wrote:
>>> Which perhaps also should have emitted an FPI when clearing a bit? But I'm
>>> unsure that that was required at the time. OTOH, it did seem to generate an
>>> FPI for setting a VM bit, so ...
>>
>> After booting up the part of my brain that worked on this back in
>> 2011, I think I reasoned that clearing a bit was idempotent and didn't
>> depend on the prior page state, so that we would be adequate protected
>> without an FPI. I had to add a WAL record to *set* the VM bit, because
>> that wasn't logged at all, but the actual of clearing the VM bit
>> piggybacked on the heap change that necessitated doing so, and having
>> that record also emit an FPI for the VM page didn't seem to me to add
>> anything.
>>
>> So I think technically it's the addition of checksums that turns this
>> into a real bug, because now torn pages are a checksum-failure hazard.
>> However, that would have been extremely hard to notice given that I
>> seem never to have actually documented that reasoning in a comment
>> anywhere.
> 
> And then it got a lot worse with incremental backup support...
> 
> 
>> I don't think we should mess around with trying to make the behavior
>> here conditional on wal_level, checksums, etc. We should just do it
>> all the time to keep the logic simple.
> 
> Agreed.
> 
> 
> I've been working on a fix for this (still pretty raw though).  What a mess.
> 
> 
> As part of validating that I added error checks that disallow reads in the
> startup process for anything other than the FSM fork.  Which triggered, due to
> visibilitymap_prepare_truncate() reading a VM page.
> 
> Which in turn made me wonder: Is it actually OK for
> visibilitymap_prepare_truncate() to do separate WAL logging from
> XLOG_SMGR_TRUNCATE?
> 
> I suspect not, consider this scenario:
> 
> 1) primary does XLOG_FPI for the VM page
> 2) checkpoint
> 3) primary does XLOG_SMGR_TRUNCATE
> 4) standby replays XLOG_FPI
> 5) primary performs a restartpoint
> 6) primary replays XLOG_SMGR_TRUNCATE, as part of that it again does
>    visibilitymap_prepare_truncate, which dirties the VM page
> 7) standby crashes while writing out the VM page
> 8) standby does recovery and now finds a torn VM page, triggering a checksum
>    failure
> 
> Perhaps that's kinda ok, because vm_readbuf() uses ZERO_ON_ERROR, so we'd just
> wipe out that region of the VM?
> 

Shouldn't it be possible to test this using injection points?

> 
> Separately, is it actually sufficient that visibilitymap_prepare_truncate()
> only WAL logs the changes if XLogHintBitIsNeeded()?  I'm going back and forth
> in my head about danger scenarios involving PITR or crashes inbetween
> prepare_truncate() and the actual truncation.
> 
> 
> ISTM that we should not do visibilitymap_prepare_truncate() during recovery
> and always WAL log the prepare_truncate() on the primary.  Does that sound
> sane?
> 

I'm considering writing a stress test to try to shake out issues in
this. That strategy was pretty successful for the online checksums
patch, where we had a hunch it might not be quite correct- but it was
hard to come up with a reproducer. The stress test failures were very
useful in that they gave us proof of issues and some examples.

But the crucial part is an ability to verify correctness. With the
checksums it's easy enough - just verify checksums / look for checksum
failures in the server log. But what would that be here?

regards

-- 
Tomas Vondra




pgsql-hackers by date:

Previous
From: Ranier Vilela
Date:
Subject: Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
Next
From: Alvaro Herrera
Date:
Subject: Re: Adding REPACK [concurrently]