Re: Changing the state of data checksums in a running cluster - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Changing the state of data checksums in a running cluster
Date:
Msg-id: ab76670d-3e2c-4af4-9e29-ffaec54a370d@vondra.me
In response to: Re: Changing the state of data checksums in a running cluster (Tomas Vondra <tomas@vondra.me>)
Responses: Re: Changing the state of data checksums in a running cluster
List: pgsql-hackers
On 8/27/25 14:42, Tomas Vondra wrote:
> On 8/27/25 14:39, Tomas Vondra wrote:
>> ...
>>
>> And this happened on Friday:
>>
>> commit c13070a27b63d9ce4850d88a63bf889a6fde26f0
>> Author: Alexander Korotkov <akorotkov@postgresql.org>
>> Date:   Fri Aug 22 18:44:39 2025 +0300
>>
>>     Revert "Get rid of WALBufMappingLock"
>>
>>     This reverts commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683.
>>     It appears that conditional variables are not suitable for use
>>     inside critical sections. If WaitLatch()/WaitEventSetWaitBlock()
>>     face postmaster death, they exit, releasing all locks instead of
>>     PANIC. In certain situations, this leads to data corruption.
>>
>> ...
>>
>> I think it's very likely the checksums were broken by this. After all,
>> that linked thread has subject "VM corruption on standby" and I've only
>> ever seen checksum failures on standby on the _vm fork.
>>
>
> Forgot to mention - I did try with c13070a27b reverted, and with that I
> can reproduce the checksum failures again (using the fixed TAP test).
>
> It's not definitive proof, but it's a hint that c13070a27b63 was
> causing the checksum failures.
>

Unfortunately, it seems I spoke too soon :-( I decided to test this on
multiple machines overnight, and it still fails on the slower ones.

Attached is a patch addressing a couple more issues, to make the TAP
test work well enough. (Attached as .txt, to not confuse cfbot.)

- The pgbench started by IPC::Run::start() needs to be finished, to
  release resources. Otherwise it leaks file descriptors (and leaves a
  bunch of "defunct" pgbench processes), which may be a problem with an
  increased number of iterations.

- AFAICS pgbench can't use stdin/stdout/stderr, otherwise the pipes get
  broken when the command fails (after restart). I just used /dev/null
  instead; the stdout/stderr was not used anyway (or was it?).

- I commented out the pg_checksums call, because of the issues mentioned
  earlier. (I was trying to make it work by remembering the state, but
  occasionally it seems not to make it into the control file before
  shutdown.)

I increased the number of iterations to 1000+ and ran it on three
machines:

- ryzen (new machine from ~2024)
- xeon (old slow machine from ~2016)
- rpi5 (very slow machine)

I haven't seen a single failure on ryzen, after ~3000 iterations. But
both xeon and rpi5 show a number of failures. Xeon has about 35 reports
of 'Failed test', rpi5 about 10.

My guess is it's something about timing. It works on the "fast" ryzen,
but fails on xeon, which is ~3-4x slower, and on rpi5, which is even
slower.

The other reason why it seems unrelated to the reverted commit is that
it's not just about visibility maps (which is what got corrupted there).
I see checksum failures on both the VM and the FSM. I think I was misled
by forgetting about the FSM cases, and by the fact that I saw no
failures on the ryzen after the revert. But clearly, other machines
still have issues.

Another interesting fact is that the checksum failures happen both on
the primary and the standby; it's not just a standby issue. But again,
this seems to be machine-dependent. On the rpi5 I've only seen standby
issues. The xeon sees failures on both primary and standby (roughly
1:1).

There are more weird things.
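(As an aside, the pgbench handling described in the TAP-test fixes above
could be sketched roughly like this. This is a hypothetical shell
illustration, not the actual Perl/IPC::Run test code; a plain `sleep`
stands in for pgbench.)

```shell
# Point the child's stdio at /dev/null so a server restart breaking the
# pipes cannot affect it, and always reap the child afterwards so no
# file descriptors or "defunct" processes leak across iterations.
sleep 1 >/dev/null 2>&1 &     # in the real test: pgbench -n -T ... &
LOAD_PID=$!

# ... the test would restart the server here ...

wait "$LOAD_PID" || true      # reap the child; tolerate a non-zero exit
echo "load generator reaped"
```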
If I grep for page verification failures, I see this (a more detailed
log attached):

-----------
# 2025-08-28 22:33:28.195 CEST startup[177466] LOG: page verification failed, calculated checksum 25350 but expected 44559
# 2025-08-28 22:33:28.197 CEST startup[177466] LOG: page verification failed, calculated checksum 25350 but expected 44559
# 2025-08-28 22:33:28.199 CEST startup[177466] LOG: page verification failed, calculated checksum 59909 but expected 53920
# 2025-08-28 22:33:28.201 CEST startup[177466] LOG: page verification failed, calculated checksum 59909 but expected 53920
# 2025-08-28 22:33:28.206 CEST startup[177466] LOG: page verification failed, calculated checksum 59909 but expected 53920
# 2025-08-28 22:33:28.207 CEST startup[177466] LOG: page verification failed, calculated checksum 25350 but expected 44559
-----------

This is right after a single restart, while doing the recovery. The
weird thing is, this is all for just two FSM pages!

-----------
LOG: invalid page in block 2 of relation "base/5/16410_fsm"; zeroing out page
LOG: invalid page in block 2 of relation "base/5/16408_fsm"; zeroing out page
-----------

And the calculated/expected checksums repeat! It's just different WAL
records hitting the same page, complaining about the same issue, after
claiming the page was zeroed out. Isn't that weird? How come the page
doesn't "get" the correct checksum after the first redo?

I've seen these failures after changing checksums in both directions,
both after enabling and disabling. But I've only ever seen this after an
immediate shutdown, never after a fast shutdown. (It's interesting that
pg_checksums failed only after fast shutdowns ...)

Could it be that the redo happens to start from an older position, but
using the new checksum version?

regards

--
Tomas Vondra
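(As a rough model of why the same calculated/expected pair can repeat:
the sketch below is a simplified illustration only, NOT PostgreSQL's
real pg_checksum_page, which uses an FNV-1a-derived hash. The shape is
the same, though: the checksum covers the page image with the checksum
field zeroed, mixed with the block number, and the "expected" value is
whatever was stored when the page was last written.)

```python
import zlib

PAGE_SIZE = 8192
CKSUM_OFF = 8  # pd_checksum follows the 8-byte pd_lsn in the page header

def page_checksum(page: bytearray, blkno: int) -> int:
    # Compute over the image with the checksum field zeroed, then mix in
    # the block number (simplified stand-in for the real algorithm).
    img = bytes(page[:CKSUM_OFF]) + b"\x00\x00" + bytes(page[CKSUM_OFF + 2:])
    return (zlib.crc32(img) ^ blkno) & 0xFFFF

def verify(page: bytearray, blkno: int) -> bool:
    stored = int.from_bytes(page[CKSUM_OFF:CKSUM_OFF + 2], "little")
    return page_checksum(page, blkno) == stored

page = bytearray(PAGE_SIZE)
page[CKSUM_OFF:CKSUM_OFF + 2] = page_checksum(page, 2).to_bytes(2, "little")
assert verify(page, 2)

# If redo replays against a page image whose stored checksum no longer
# matches its contents, the recomputed and stored values disagree -- and
# they disagree identically on every retry, which matches the repeated
# calculated/expected pairs in the log above.
page[100] ^= 0xFF
assert not verify(page, 2)
print("mismatch is deterministic:",
      page_checksum(page, 2) == page_checksum(page, 2))
```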