Re: Changing the state of data checksums in a running cluster - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: Changing the state of data checksums in a running cluster |
| Date | |
| Msg-id | 775fddbc-4dec-413d-89bb-1a8186858ee6@vondra.me |
| In response to | Re: Changing the state of data checksums in a running cluster (Daniel Gustafsson <daniel@yesql.se>) |
| List | pgsql-hackers |
On 2/5/26 21:01, Daniel Gustafsson wrote:
>> On 3 Feb 2026, at 17:58, Daniel Gustafsson <daniel@yesql.se> wrote:
>
>> I've implemented this in the attached patch and it seems very promising.
>
> Turns out I was a bit quick on the send button, and having been focused on the
> failing test_checksums suite I missed running the other suites. (I also
> managed to off-by-one-year name it 202503..) Thanks to some quick off-list
> feedback, here is an updated version which fixes the failing tests and has
> quite a number of smaller touchups.
>
> Like the previous version, I am not sure I have the CHECKPOINT_xxx logic right
> and would love some eyes on it.
>

Thanks for the updated patch. Indeed, I see occasional failures in the 006 test (I suppose 007 would fail too, but I haven't tried).

It's been ages since I looked at this patch, and I've managed to forget most of it. I'm also still learning how this new approach works, so I don't have a good idea of the root cause yet. But let me describe what I see and ask some questions - maybe that'll help us figure out what might be wrong.

Here's what the failing run did:

[22:11:00.610](0.364s) ok 2 - ensure checksums are set to off
[22:11:00.616](0.006s) # LSN before enabling: 0/4F0BCD78
[22:11:00.629](0.013s) ok 3 - ensure data checksums are transitioned to inprogress-on
[22:11:09.594](8.965s) ok 4 - ensure data checksums are transitioned to on
[22:11:09.600](0.005s) # LSN after enabling: 0/B1CBD290
[22:11:09.675](0.075s) ok 5 - ensure data pages can be read back on primary
### Stopping node "main" using mode fast
# Running: pg_ctl --pgdata /home/user/work/postgres/src/test/modules/test_checksums/tmp_check/t_006_pgbench_single_main_data/pgdata --mode fast stop
waiting for server to shut down.... done
Latest checkpoint location: 0/B28CA0D8
Latest checkpoint's REDO location: 0/B28CA0D8
[22:11:11.517](0.486s) ok 8 - ensure checksums are set to on
[22:11:11.525](0.008s) # LSN before disabling: 0/B3B23C88
[22:11:11.539](0.013s) ok 9 - ensure data checksums are transitioned to off
[22:11:11.545](0.006s) # LSN after disabling: 0/B3B3F430
[22:11:13.596](2.052s) ok 10 - ensure data pages can be read back on primary
### Stopping node "main" using mode immediate
# Running: pg_ctl --pgdata /home/user/work/postgres/src/test/modules/test_checksums/tmp_check/t_006_pgbench_single_main_data/pgdata --mode immediate stop
waiting for server to shut down.... done
Latest checkpoint location: 0/B28CA158
Latest checkpoint's REDO location: 0/B28CA158
### Starting node "main"
### Stopping node "main" using mode fast
[22:11:14.130](0.428s) not ok 12 - no checksum validation errors in primary log (during WAL recovery)
[22:11:14.130](0.000s)
[22:11:14.130](0.000s) # Failed test 'no checksum validation errors in primary log (during WAL recovery)'
# at t/006_pgbench_single.pl line 180.

So it:

1) starts with checksums=off
2) flips them on by LSN 0/B1CBD290
3) restarts the instance with 'fast' mode
4) flips checksums to 'off'
5) restarts the instance with 'immediate' mode (=> crash)
6) finds checksum failures in the server log (after redo)

The failures look like this (all seem to be from the VM, and the VM page LSNs are somewhat interesting too - I'll get to that shortly):

2026-02-05 22:11:13.724 CET startup[286330] LOG: redo starts at 0/B28CA1D8
2026-02-05 22:11:13.729 CET startup[286330] LOG: page verification failed, calculated checksum 17285 but expected 50010 (page LSN 0/A1CF17F0)
2026-02-05 22:11:13.729 CET startup[286330] CONTEXT: WAL redo at 0/B295E930 for Heap/LOCK: ...
2026-02-05 22:11:13.729 CET startup[286330] LOG: invalid page in block 4 of relation "base/5/16409_vm"; zeroing out page
2026-02-05 22:11:13.732 CET startup[286330] LOG: page verification failed, calculated checksum 49687 but expected 51187 (page LSN 0/B29F5A20)
2026-02-05 22:11:13.732 CET startup[286330] CONTEXT: WAL redo at 0/B299E5F8 for Heap/UPDATE: ...
2026-02-05 22:11:13.732 CET startup[286330] LOG: invalid page in block 42 of relation "base/5/16409_fsm"; zeroing out page
...

During the test, the cluster went through these checkpoints / checksum state changes:

lsn: 0/4E00E600, desc: CHECKPOINT_REDO wal_level replica; checksums off
lsn: 0/4F0C8B10, prev 0/4F0C6DD8, desc: CHECKSUMS inprogress-on
lsn: 0/52010FA8, desc: CHECKPOINT_REDO wal_level replica; checksums inprogress-on
lsn: 0/56015400, desc: CHECKPOINT_REDO wal_level replica; checksums inprogress-on
...
lsn: 0/AA005E00, desc: CHECKPOINT_REDO wal_level replica; checksums inprogress-on
lsn: 0/AE004988, desc: CHECKPOINT_REDO wal_level replica; checksums inprogress-on
lsn: 0/B1C7DFC8, prev 0/B1C7DFA0, desc: CHECKSUMS on
lsn: 0/B200BE38, desc: CHECKPOINT_REDO wal_level replica; checksums on
lsn: 0/B28CA0D8, desc: CHECKPOINT_SHUTDOWN redo 0/B28CA0D8; ... checksums on; shutdown
lsn: 0/B28CA158, desc: CHECKPOINT_SHUTDOWN redo 0/B28CA158; ... checksums on; shutdown
lsn: 0/B3B2F780, prev 0/B3B2F730, desc: CHECKSUMS inprogress-off
lsn: 0/B3B2F858, prev 0/B3B2F810, desc: CHECKSUMS off
... crash / immediate restart ... (after 0/B3B3F430)
lsn: 0/B47DED08, desc: CHECKPOINT_SHUTDOWN redo 0/B47DED08; ... checksums off; shutdown
lsn: 0/B5A7A600, prev 0/B5A788C8, desc: CHECKSUMS inprogress-on

The redo starts right after the checkpoint at 0/B28CA158. The failure happens when trying to apply an FPW record to a block in 1663/5/16409, but the failure is actually in the associated VM, where block 4 has the wrong checksum. How come?

Grepping the WAL, it seems the last time that VM block was changed is at LSN 0/A1CF17F0 (which does match the failure log message):

lsn: 0/A1CEF7A0, ... desc: FPI , blkref #0: rel 1663/5/16409 fork vm blk 4 FPW

And AFAICS there are no other WAL records touching this particular VM page (not even the XLOG_HEAP2_VISIBLE records). Judging by the CHECKSUMS messages, LSN 0/A1CEF7A0 is after the cluster switched to the inprogress-on state, so at that point the checksum should have been correct, right? So how come the checksum is not correct when applying the heap page FPW at LSN 0/B295E930? Either the page got broken after enabling checksums in some way, or it's the redo that got confused about the "current" checksum state.

It's a bit unfortunate we only detect invalid checksums when the failures appear in the server log. I wonder if we could run pg_checksums after the cluster gets shut down in 'fast' mode (we can't do that with immediate shutdowns, of course). Didn't we do that in the past, actually? I suppose there was a reason why we removed it, but maybe we could still run it at least in some cases? (A rough sketch of what that could look like in the TAP test is below.)

Anyway, let's assume the checksum was correct after enabling checksums. That means it had to be broken by the redo, somehow? But why would that be? The XLOG_CHECKSUMS records say we switched to "inprogress-off" at 0/B3B2F780 and to "off" at 0/B3B2F858, and only then do we stop updating the checksums, right? We "crash" shortly after LSN 0/B3B3F430 (because that's where the immediate restart is, roughly), and there's a read-write pgbench running in the background during all this. It might have modified the VM, but why wouldn't that set the page LSN?
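For the pg_checksums idea, something along these lines might work in the TAP test. This is only a sketch, not tested against the patch; it assumes the test keeps its cluster in a PostgreSQL::Test::Cluster object called $node, which may not match the actual variable name used in t/006_pgbench_single.pl:

```perl
# Hypothetical addition: after a *fast* shutdown, while the control file
# still says data checksums are on, verify all on-disk pages offline
# before restarting.  pg_checksums refuses to run after an immediate
# shutdown (the cluster state is not "shut down") or while checksums are
# disabled, so this only fits the "checksums on" phase of the test.
use PostgreSQL::Test::Utils;    # provides command_ok()

$node->stop('fast');

command_ok(
	[ 'pg_checksums', '--check', '--pgdata', $node->data_dir ],
	'offline checksum verification after fast shutdown');

$node->start;
```

That would at least turn bad on-disk checksums into a test failure at the shutdown point, instead of only when redo happens to touch the affected page later.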
The second checksum failure is a bit more interesting - the VM page has LSN 0/B29F5A20, which is *after* the redo location (and that checkpoint has set "checksums on"). And the inprogress-off does not appear until LSN 0/B3B2F780. So how come the VM page has an invalid checksum?

I've only ever seen checksum failures on VM pages, never on any other object. I have a feeling we're missing some important detail about how visibility maps work ...

regards

--
Tomas Vondra