Re: Changing the state of data checksums in a running cluster - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Changing the state of data checksums in a running cluster
Date:
Msg-id: 3e67160c-3676-4419-b635-1fdb80dc128e@vondra.me
In response to: Re: Changing the state of data checksums in a running cluster (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers
On 8/29/25 16:38, Tomas Vondra wrote:
> On 8/29/25 16:26, Tomas Vondra wrote:
>> ...
>>
>> I've seen these failures after changing checksums in both directions,
>> both after enabling and disabling. But I've only ever seen this after
>> an immediate shutdown, never after a fast shutdown. (It's interesting
>> that pg_checksums failed only after fast shutdowns ...)
>>
>
> Of course, right after I send a message, it fails after a fast
> shutdown, contradicting my observation ...
>
>> Could it be that the redo happens to start from an older position,
>> but using the new checksum version?
>>
>
> ... but it also provided more data supporting this hypothesis. I added
> logging of pg_current_wal_lsn() before / after changing checksums on
> the primary, and I see this:
>
> 1) LSN before: 14/2B0F26A8
> 2) enable checksums
> 3) LSN after: 14/EE335D60
> 4) standby waits for 14/F4E786E8 (higher, likely thanks to pgbench)
> 5) standby restarts with -m fast
> 6) redo starts at 14/230043B0, which is *before* enabling checksums
>
> I guess this is the root cause. A bit more detailed log attached.
>

I kept stress-testing this over the weekend, and I think I found two
issues causing the checksum failures, both on a single node and on a
standby:

1) no checkpoint in the "disable" path

In the "enable" path, a checkpoint is enforced before flipping the
state from "inprogress-on" to "on". It's hidden in ProcessAllDatabases,
but it's there. But the "off" path does not do that, probably on the
assumption that we'll always see the writes in WAL order, i.e. that
we'll see the XLOG_CHECKSUMS record setting checksums=off before seeing
any writes without checksums.

And in the happy path this works fine - the standby is happy, etc. But
what about after a crash / immediate shutdown? Consider a sequence like
this:

a) we have checksums=on
b) write to page P, updating the checksum
c) start disabling checksums
d) progress to inprogress-off
e) progress to off
f) write to page P, without updating the checksum
g) the write to P gets evicted (small shared buffers, ...)
h) crash / immediate shutdown

Recovery starts from an LSN before (a), so we believe checksums=on. We
try to redo the write to P, which starts by reading the page from disk
to check the page LSN. We still think checksums=on, and to read the LSN
we need to verify the checksum. But the page was modified without the
checksum, and evicted. Kabooom!

This is not that hard to trigger by hand. Add a long sleep at the end
of SetDataChecksumsOff, start a pgbench on a scale larger than shared
buffers, and call pg_disable_data_checksums(). Once it gets stuck on
the sleep, give it more time to dirty and evict some pages, then
kill -9. On recovery you should get the same checksum failures. FWIW
I've only ever seen failures on fsm/vm forks, which matches what I see
in the TAP tests. But isn't that a bit strange?

I think the "disable" path needs a checkpoint between the
inprogress-off and off states, same as the "enable" path.
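To make that concrete, roughly something like this - a sketch only, not
the actual patch code; XLogChecksums(), the barrier type and the exact
shape of SetDataChecksumsOff are simplified stand-ins, the
RequestCheckpoint() in the middle is the important part:

    /*
     * Sketch only - XLogChecksums() and PROCSIGNAL_BARRIER_CHECKSUM_OFF
     * are stand-ins for whatever the patch does in SetDataChecksumsOff.
     */
    static void
    SetDataChecksumsOff(void)
    {
        /* phase 1: stop verifying checksums (inprogress-off) */
        XLogChecksums(PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION);
        WaitForProcSignalBarrier(
            EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_OFF));

        /*
         * Force a checkpoint before the final transition, so that crash
         * recovery can never start from an LSN where it still believes
         * checksums=on while replaying writes made without checksums.
         */
        RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT |
                          CHECKPOINT_IMMEDIATE);

        /* phase 2: now it's safe to stop writing checksums entirely */
        XLogChecksums(0);
    }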
2) no restart point on the standby

The standby has a similar issue, I think. Even if the primary creates
all the necessary checkpoints, the standby may not create restart
points for them. If you look into xlog_redo, it only "remembers" the
checkpoint position, it does not trigger a restart point. That only
happens in XLogPageRead, based on the distance from the previous one.
So a failure very similar to the one on the primary is possible, even
with the extra checkpoint fixing (1).

The primary flips checksums in either direction, generating
checkpoints, but the standby does not create the restart points. It
does apply the WAL, though, and some of the pages without checksums get
evicted. And then the standby fails, starts redo from some position far
back, and runs into the same checksum failure when trying to check the
page LSN.

I think the standby needs some logic to force restart point creation
when the checksum flag changes.
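Along these lines, perhaps - again just a sketch of the idea, not the
code from my branch; the checksum_state_changed flag and the exact
placement in xlog_redo are hypothetical:

    /* sketch only, mirroring the "static flag" approach */
    static bool checksum_state_changed = false;

    /* in xlog_redo(), when replaying XLOG_CHECKSUMS */
    checksum_state_changed = true;

    /*
     * in xlog_redo(), XLOG_CHECKPOINT_ONLINE / XLOG_CHECKPOINT_SHUTDOWN,
     * right after RecoveryRestartPoint()
     */
    if (checksum_state_changed)
    {
        /*
         * During recovery the checkpointer turns this request into a
         * restartpoint (CreateRestartPoint), so a restarted standby
         * cannot begin redo from a position before the checksum state
         * change.
         */
        RequestCheckpoint(CHECKPOINT_IMMEDIATE);
        checksum_state_changed = false;
    }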
I have an experimental WIP branch at:

https://github.com/tvondra/postgres/tree/online-checksums-tap-tweaks

It fixes the TAP issues reported earlier (and a couple more), and it
does a bunch of additional tweaks:

a) Adds a lot of debug messages that helped me figure this out. This is
probably way too much; especially the controlfile updates can be very
noisy on a standby.

b) Adds a simpler TAP test, testing just a single node (should be
easier to understand than the failures on a standby).

c) Adds explicit checkpoints, to fix (1). It probably adds too many
checkpoints, though? AFAICS a checkpoint after the "inprogress" phase
should be enough, the checkpoint after the "on/off" phase can go away.

d) Forces creating a restart point on the first checkpoint after an
XLOG_CHECKSUMS record. It's done in a bit silly way, using a static
flag. Maybe there's a more elegant approach, say by comparing the
checksum value in ControlFile to the received checkpoint?

e) Randomizes a couple more GUC values. This needs more thought, it was
done blindly, before I better understood how the failures happen (it
requires buffers getting evicted, not hitting max_wal_size, ...). There
are more params worth randomizing (e.g. the "fast" flag).

Anyway, with (c) and (d) applied, the checksum failures go away. It may
not be 100% right (e.g. we could do with fewer checkpoints), but it
seems to be the right direction. I don't have time to clean up the
branch more, I've already spent too much time looking at LSNs advancing
in weird ways :-( Hopefully it's good enough to show what needs to be
fixed, etc. If there's a new version, I'm happy to rerun the tests on
my machines, ofc.

However, there still are more bugs. Attached is a log from a crash
after hitting the assert in AbsorbChecksumsOffBarrier:

    Assert((LocalDataChecksumVersion != PG_DATA_CHECKSUM_VERSION) &&
           (LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION ||
            LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION));

This happened while flipping checksums to 'off', but the backend
already thinks checksums are 'off': LocalDataChecksumVersion == 0.

I think this implies some bug in setting up LocalDataChecksumVersion
after connecting, because this is for a query checking the checksum
state, executed by the TAP test (in a new connection, right?). I
haven't looked into this more, but how come the "off" direction does
not need to check InitialDataChecksumTransition?

I think the TAP test has turned out to be very useful so far. While
investigating this, I thought about a couple more tweaks to make it
detect additional issues (on top of the randomization):

- Right now the shutdowns/restarts happen only in very limited places.
The checksums flip from on to off or off to on, and then a restart
happens. AFAICS it never happens in the "inprogress" phases, right?

- The pgbench clients connect once, so there are almost no new
connections while flipping checksums. Maybe some of the pgbenches
should run with "-C", to open new connections. It was pretty lucky the
TAP query hit the assert; this would make it more likely.

regards

-- 
Tomas Vondra