Re: Changing the state of data checksums in a running cluster - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Changing the state of data checksums in a running cluster
Date
Msg-id dfe57980-f594-46c5-af39-852ff30d34fa@vondra.me
In response to Re: Changing the state of data checksums in a running cluster  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
On 8/27/25 14:39, Tomas Vondra wrote:
> ...
>
> And this happened on Friday:
> 
> commit c13070a27b63d9ce4850d88a63bf889a6fde26f0
> Author: Alexander Korotkov <akorotkov@postgresql.org>
> Date:   Fri Aug 22 18:44:39 2025 +0300
> 
>     Revert "Get rid of WALBufMappingLock"
> 
>     This reverts commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683.
>     It appears that conditional variables are not suitable for use
>     inside critical sections.  If WaitLatch()/WaitEventSetWaitBlock()
>     face postmaster death, they exit, releasing all locks instead of
>     PANIC.  In certain situations, this leads to data corruption.
> 
>     ...
> 
> I think it's very likely the checksums were broken by this. After all,
> that linked thread has subject "VM corruption on standby" and I've only
> ever seen checksum failures on standby on the _vm fork.
> 

Forgot to mention - I did try with c13070a27b reverted, and with that I
can reproduce the checksum failures again (using the fixed TAP test).

It's not definitive proof, but it's a hint that the commit reverted by
c13070a27b63 (bc22dc0e0d, "Get rid of WALBufMappingLock") was causing
the checksum failures.


regards

-- 
Tomas Vondra



