Re: Changing the state of data checksums in a running cluster - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Changing the state of data checksums in a running cluster
Msg-id: f528413c-477a-4ec3-a0df-e22a80ffbe41@vondra.me
In response to: Re: Changing the state of data checksums in a running cluster (Daniel Gustafsson <daniel@yesql.se>)
Responses: Re: Changing the state of data checksums in a running cluster
List: pgsql-hackers
On 3/14/25 15:06, Daniel Gustafsson wrote:
>> On 14 Mar 2025, at 14:38, Daniel Gustafsson <daniel@yesql.se> wrote:
>>> On 14 Mar 2025, at 13:20, Tomas Vondra <tomas@vondra.me> wrote:
>
>>> This is "ephemeral" in the sense that setting the value to "on" again
>>> would be harmless, and indeed a non-assert build will run just fine.
>>
>> As mentioned off-list, being able to loosen the restriction for the first
>> barrier seen seem like a good way to keep this assertion. Removing it is of
>> course the alternative solution, as it's not causing any issues, but given how
>> handy it's been to find actual issues it would be good to be able to keep it.
>>
>>> i.e. to first register into procsignal, and then read the new value.
>>> AFAICS this guarantees we won't lose any checksum version updates. It
>>> does mean we still can get a barrier for a value we've already seen, but
>>> I think we should simply ignore this for the very first update.
>>
>> Calling functions with sideeffects in setting state seems like a bad idea
>> before ProcSignalInit has run, that's thinko on my part in this patch. Your
>> solution of reordering seems like the right way to handle this.
>
> 0006 in the attached version is what I have used when testing the above, along
> with an update to the copyright year which I had missed doing earlier. It also
> contains the fix in LocalProcessControlFile which I had in my local tree, I
> think we need something like that at least.
>

Thanks, here's an updated patch version - I squashed all the earlier parts,
but I kept your changes and my adjustments separate, for clarity. A couple
of comments:

1) I don't think the comment before InitialDataChecksumTransition was
entirely accurate, because it said we can see a duplicate state only for
the "on" state. But AFAICS we can see duplicate values for any state,
except that we only have an assert for "on", so we don't notice the other
cases. I wonder if we could strengthen this a bit by adding asserts for
the other states too.

2) I admit it's rather subjective, but I didn't like how you did the
assert in AbsorbChecksumsOnBarrier. But looking at it now in the diff,
maybe it was more readable that way ...

3) I renamed InitLocalControldata() to InitLocalDataChecksumVersion().
The old name was entirely misleading - the function now initializes the
flag in XLogCtl and has nothing to do with the control file.

4) I realized AuxiliaryProcessMainCommon() may be a better place to
initialize the checksum flag for non-backend processes. In fact, doing it
in postmaster_child_launch() had the same race condition, because it
happened before ProcSignalInit(). (See the sketch below.)
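To make the ordering argument concrete, here is a small standalone model of
the race (plain C with pthreads, not the patch or PostgreSQL code; all names
are invented for the illustration). The "subscription" flag stands in for
ProcSignalInit() having run, and the shared value for the authoritative
checksum version in shared memory: subscribing first means an update can at
worst be delivered twice (once via the initial read, once via the barrier),
but never lost - which is exactly why the very first absorbed barrier has to
tolerate a value the process has already seen.

/*
 * Standalone model of the init-order race (NOT PostgreSQL code; all names
 * are made up).  "subscribed" plays the role of ProcSignalInit() having
 * run, "shared_version" the role of the authoritative checksum version in
 * shared memory, and "pending_update" a delivered barrier.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int	shared_version = 1;		/* authoritative value in "shared memory" */
static bool subscribed = false;		/* has this process registered for barriers? */
static int	pending_update = -1;	/* barrier payload, -1 = none delivered */

/* the "other process": changes the value and notifies subscribers only */
static void *
updater(void *arg)
{
	(void) arg;
	pthread_mutex_lock(&lock);
	shared_version = 2;
	if (subscribed)
		pending_update = shared_version;	/* barrier gets delivered */
	/* if we are not subscribed yet, this update is silently missed */
	pthread_mutex_unlock(&lock);
	return NULL;
}

int
main(void)
{
	pthread_t	t;
	int			local_version;

	/* step 1: subscribe to barriers FIRST (the ProcSignalInit analogue) */
	pthread_mutex_lock(&lock);
	subscribed = true;
	pthread_mutex_unlock(&lock);

	pthread_create(&t, NULL, updater, NULL);

	/* step 2: only then read the current shared value */
	pthread_mutex_lock(&lock);
	local_version = shared_version;
	pthread_mutex_unlock(&lock);

	pthread_join(t, NULL);

	/* absorb the pending barrier; it may just repeat what we already read */
	pthread_mutex_lock(&lock);
	if (pending_update != -1)
		local_version = pending_update;
	pthread_mutex_unlock(&lock);

	/*
	 * With this ordering the result is always 2: updates can repeat, but
	 * never get lost.  Reversing steps 1 and 2 could leave us stuck at 1.
	 */
	printf("local version = %d\n", local_version);
	return 0;
}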
I'm sure there's cleanup possible in a bunch of places, but the really bad
thing is I realized the handling on a standby is not quite correct. I don't
know what exactly is happening, there are too many moving parts, but here's
what I see ...

Every now and then, after restarting the standby, it logs a bunch of page
verification failures. Stuff like this:

WARNING: page verification failed, calculated checksum 9856 but expected 0
CONTEXT: WAL redo at 0/3447BA8 for Heap2/VISIBLE: snapshotConflictHorizon: 0, flags: 0x03; blkref #0: rel 1663/16384/16401, fork 2, blk 0 FPW; blkref #1: rel 1663/16384/16401, blk 0
WARNING: page verification failed, LSN 0/CF54C10

This is after an immediate shutdown, but I've seen similar failures for fast
shutdowns too (the root causes may be different / may need a different fix,
not sure).

The instance restarts, and the "startup" process starts recovery:

LOG: redo starts at 0/2000028

This matches the LSN from the very first start of the standby - there were
no restart points since then, apparently. And since then the primary did
this with the checksums (per pg_waldump):

lsn: 0/0ECCFC48, prev 0/0ECCFBA0, desc: CHECKSUMS inprogress-off
lsn: 0/0ECD0168, prev 0/0ECD0128, desc: CHECKSUMS off

The instance already saw both records before the immediate shutdown (per the
logging in patch 0004), but after the restart the instance goes back to
having checksums enabled again:

data_checksum_version = 1

Which is correct, because it starts at 0/2000028, which is before either of
the XLOG_CHECKSUMS records. But then at 0/3447BA8 (which is *before* either
of the checksum changes) it tries to read a page from disk, and hits a
checksum error. That page is from the future (per the page LSN logged by
patch 0004), but it's still before both XLOG_CHECKSUMS records. So how come
the page has pd_checksum 0?

I'd have understood if the page came "broken" from the primary, but I've not
seen a single page verification failure on that side (and it's subject to
the same fast/immediate restarts, etc.).

I wonder if this might be related to how we enforce checkpoints only when
setting the checksums to "on" on the primary. Maybe that's safe on the
primary but not on a standby?

FWIW I've seen similar issues for "fast" shutdowns too - at least the
symptoms are similar, but the mechanism might be a bit different. In
particular, I suspect there's some sort of thinko in updating the
data_checksum_version in the control file, but I can't put my finger on it
yet.

Another thing I noticed is this comment in CreateRestartPoint(), before one
of the early exits:

/*
 * If the last checkpoint record we've replayed is already our last
 * restartpoint, we can't perform a new restart point. We still update
 * minRecoveryPoint in that case, so that if this is a shutdown restart
 * point, we won't start up earlier than before. That's not strictly
 * necessary, but when hot standby is enabled, it would be rather weird
 * if the database opened up for read-only connections at a
 * point-in-time before the last shutdown. Such time travel is still
 * possible in case of immediate shutdown, though.
 * ...

I wonder if this "time travel backwards" might be an issue here too, because
it might mean we end up picking the wrong data_checksum_version from the
control file. In any case, if this happens, we don't get to the
ControlFile->data_checksum_version update a bit further down. And there's
another condition that can skip that.

I'll continue investigating this next week, but at this point I'm quite
confused and would be grateful for any insights ...

regards

--
Tomas Vondra
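The suspected control-file problem can be illustrated with a toy standalone
model (plain C, not PostgreSQL code; every name here is invented, and this
only sketches the hypothesis, not the actual CreateRestartPoint() logic): if
the restart-point routine exits early, before the point where the current
checksum state would be copied into the control file, then after an
immediate shutdown the startup process reads a stale value.

/*
 * Toy model of the suspected control-file problem (NOT PostgreSQL code;
 * names are invented).  If the restart-point routine bails out early, the
 * replayed checksum state is never copied into the control file, so after
 * an immediate shutdown the next startup reads a stale value.
 */
#include <stdbool.h>
#include <stdio.h>

struct control_file
{
	unsigned	last_restartpoint_redo; /* redo LSN of the last restart point */
	unsigned	data_checksum_version;	/* what the next startup will believe */
};

/* what survives a crash / immediate shutdown */
static struct control_file control = {100, 1};

/* in-memory state: redo has already replayed "checksums off" */
static unsigned replayed_checksum_version = 0;

static bool
create_restart_point(unsigned last_replayed_checkpoint_redo)
{
	/*
	 * Early exit: the last replayed checkpoint is already our last restart
	 * point, so there is nothing new to do.  Note we return before the
	 * data_checksum_version copy below ever happens.
	 */
	if (last_replayed_checkpoint_redo <= control.last_restartpoint_redo)
		return false;

	control.last_restartpoint_redo = last_replayed_checkpoint_redo;
	control.data_checksum_version = replayed_checksum_version;
	return true;
}

int
main(void)
{
	/* no new checkpoint replayed since the last restart point -> early exit */
	create_restart_point(100);

	/* immediate shutdown: only the control file survives, memory is gone */
	printf("control file: data_checksum_version = %u (memory said %u)\n",
		   control.data_checksum_version, replayed_checksum_version);
	return 0;
}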