Re: Why clearing the VM doesn't require registering vm buffer in wal record - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: Why clearing the VM doesn't require registering vm buffer in wal record |
| Date | |
| Msg-id | a80a8ad6-ffbf-4c07-8365-7460685d1fea@vondra.me |
| In response to | Re: Why clearing the VM doesn't require registering vm buffer in wal record (Andres Freund <andres@anarazel.de>) |
| Responses | Re: Why clearing the VM doesn't require registering vm buffer in wal record |
| List | pgsql-hackers |
On 3/12/26 22:26, Andres Freund wrote:
> Hi,
>
> On 2026-03-12 20:27:32 +0100, Tomas Vondra wrote:
>>> I've been working on a fix for this (still pretty raw though). What a
>>> mess.
>>>
>>>
>>> As part of validating that, I added error checks that disallow reads
>>> in the startup process for anything other than the FSM fork. Which
>>> triggered, due to visibilitymap_prepare_truncate() reading a VM page.
>>>
>>> Which in turn made me wonder: Is it actually OK for
>>> visibilitymap_prepare_truncate() to do separate WAL logging from
>>> XLOG_SMGR_TRUNCATE?
>>>
>>> I suspect not; consider this scenario:
>>>
>>> 1) primary does XLOG_FPI for the VM page
>>> 2) checkpoint
>>> 3) primary does XLOG_SMGR_TRUNCATE
>>> 4) standby replays XLOG_FPI
>>> 5) standby performs a restartpoint
>>> 6) standby replays XLOG_SMGR_TRUNCATE, as part of that it again does
>>>    visibilitymap_prepare_truncate(), which dirties the VM page
>>> 7) standby crashes while writing out the VM page
>>> 8) standby does recovery and now finds a torn VM page, triggering a
>>>    checksum failure
>>>
>>> Perhaps that's kinda OK, because vm_readbuf() uses ZERO_ON_ERROR, so
>>> we'd just wipe out that region of the VM?
>>>
>>
>> Shouldn't it be possible to test this using injection points?
>
> Not trivially, I think? For one, you're not normally going to get a
> torn page, even if you kill the server at the right moment, so you're
> rarely going to be able to reproduce checksum errors without further
> infrastructure. But also, since the failure behaviour is going to be to
> clear the VM, you'd not really see bad consequences, I think?
>
>
> I think we really need some logic during WAL replay that does more
> verification of pages that are read in (i.e. do not have an FPI
> applied) by the startup process.
>
> - Direct uses of XLogReadBufferExtended() should only be allowed with
>   ZERO_ON_ERROR, as it has no provisions for applying FPIs.
>
> - Unless FPWs are disabled or ZERO_ON_ERROR is used:
>
>   - the read-in page must have an LSN newer than the last REDO LSN,
>     otherwise there should have been an FPI
>
>   - the page may not have an LSN from the future (i.e. BLK_DONE should
>     never happen with FPWs)
>
>   - all FPIs in the record must have been applied
>
>
> A slightly different, but related, thing we are missing is
> infrastructure to detect pages that have been dirtied without
> associated WAL logging. I wonder if we could use a bit in the buffer
> descriptor to track whether a page has been dirtied by anything other
> than a hint bit write. Or maybe we could detect it when unlocking the
> page.
>
>
>>> Separately, is it actually sufficient that
>>> visibilitymap_prepare_truncate() only WAL logs the changes if
>>> XLogHintBitIsNeeded()? I'm going back and forth in my head about
>>> danger scenarios involving PITR or crashes in between
>>> prepare_truncate() and the actual truncation.
>>>
>>>
>>> ISTM that we should not do visibilitymap_prepare_truncate() during
>>> recovery, and always WAL log the prepare_truncate() on the primary.
>>> Does that sound sane?
>>>
>>
>> I'm considering writing a stress test to try to shake out issues in
>> this.
>
> We could certainly use more of that.
>
>
>> That strategy was pretty successful for the online checksums patch,
>> where we had a hunch it might not be quite correct - but it was hard
>> to come up with a reproducer. The stress test failures were very
>> useful in that they gave us proof of issues and some examples.
>
>
>> But the crucial part is the ability to verify correctness.
>
> Unfortunately, I don't think we have particularly good infrastructure
> for detecting problems right now :(
>
>
> The verification ideas I posted above would, I think, help to detect
> some issues, but I don't think they'd catch more complicated things.
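The LSN-window invariant behind those checks could be made concrete roughly like this (a standalone sketch; the function name and signature are hypothetical, not PostgreSQL's actual API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for PostgreSQL's XLogRecPtr. */
typedef uint64_t XLogRecPtr;

/*
 * Sketch of the proposed replay-time sanity check: when the startup
 * process reads a page without applying an FPI, and full-page writes
 * are enabled, the page's LSN must fall in (redo LSN, replay LSN].
 * Below the redo LSN, an FPI should have covered the page; above the
 * current replay LSN, the page is "from the future" (torn or stale).
 */
static bool
replayed_page_lsn_is_sane(XLogRecPtr page_lsn,
                          XLogRecPtr redo_lsn,
                          XLogRecPtr replay_lsn,
                          bool fpw_enabled,
                          bool zero_on_error)
{
    if (!fpw_enabled || zero_on_error)
        return true;            /* nothing we can verify */

    if (page_lsn <= redo_lsn)
        return false;           /* an FPI should have been applied */

    if (page_lsn > replay_lsn)
        return false;           /* LSN from the future: BLK_DONE case */

    return true;
}
```

In the real startup process this would presumably run after the redo machinery has decided no FPI restore is needed for the block; the sketch only captures the LSN-window part of the proposed verification.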
>
>
> We have wal_consistency_checking, but as it runs just after the redo
> routine, it doesn't catch problems like not including things in the
> WAL record, or covering related actions in two WAL records that could
> be separated by a checkpoint. We really should have a tool that
> compares a primary and a replica after doing recovery, using the
> masking infrastructure from wal_consistency_checking.
>

Silly idea - wouldn't it be possible to detect this particular issue
solely based on the WAL? We could even have an off-line tool that reads
the WAL and checks that we have all FPIs for all the changes.

>
>> With the checksums it's easy enough - just verify checksums / look
>> for checksum failures in the server log. But what would that be here?
>
> Unfortunately that's not even something you can really rely on; it's
> entirely possible to see checksum errors for the FSM without it being
> a bug, as it's not WAL logged.
>

I did not mean to imply that any checksum failure is necessarily an
outright bug. But I think this kind of issue being "normal" is a hint
that maybe not WAL-logging the FSM is not such a good design choice
anymore.

regards

--
Tomas Vondra