On 3/13/26 11:41, Ishan joshi wrote:
> Hi Team,
>
> I found an issue with a PG v16.9 Patroni setup where replication to our
> standby node and to our disaster recovery (DR) site broke with the error
> below. It looks like WAL corruption, which later became part of the
> archive file.
>
>
> CONTEXT: WAL redo at 184F3/F248B6F0 for Heap/LOCK: xmax: 2818115117,
> off:35, infobits: [LOCK_ONLY, EXCL_LOCK], flags: 0x00; blkref #0: rel
> 1663/33195/410203483, blk 25329"
> PANIC: WAL contains references to invalid pages"
> CONTEXT: WAL redo at 184F3/F248B6F0 for Heap/LOCK: xmax: 2818115117,
> off:35, infobits: [LOCK_ONLY, EXCL_LOCK], flags: 0x00; blkref #0:
> rel1663/33195/410203483, blk 25329"
> WARNING: page 25329 of relation base/33195/410203483 does not exist"
> INFO: no action. I am (pg-patroni-node1-0), a secondary, and following a
> leader (pg-patroni-node2-0)"
> [61]LOG: terminating any other active server processes"
> [61]LOG: startup process (PID 72) was terminated by signal 6: Aborted"
> [61]LOG: shutting down due to startup process failure"
> [61]LOG: database system is shut down"
> INFO: establishing a new patroni heartbeat connection to postgres"
> INFO: Lock owner: pg-patroni-node2-0; I am pg-patroni-node1-0"
> WARNING: Retry got exception: connection problems"
> WARNING: Failed to determine PostgreSQL state from the connection,
> fallingback to cached role"
> INFO: Error communicating with PostgreSQL. Will try again later"
> WARNING: Postgresql is not running."
>
>
> The primary DB was not impacted, but replication to the standby node and
> the DR site broke. I tried to reinit from the latest pgBackRest backup
> plus archived WAL, but it fails with the same error once the corrupt
> WAL/archive file is applied. I had to reinit with pg_basebackup on a
> 40TB database, which took about 45 hours.
>
> As I understand it, the transaction created a table, performed DML, and
> then dropped the table (or the transaction was rolled back), which caused
> a race condition in WAL file creation, and applying that WAL then failed
> on the standby/DR site.
>
It's hard to say what caused this, but it might be interesting to look
at the WAL using pg_waldump. First at the WAL segment containing the
record triggering the failure, and then also at WAL segments before that
containing references to relation 1663/33195/410203483 (and especially
page 25329).
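
As a rough sketch of where to start looking: the segment file holding the
failing record can be worked out from the LSN in the error message. The
helper names below are hypothetical, and the timeline (1), the default 16MB
segment size, and the WAL directory are all assumptions; substitute the
values from your cluster. The --relation and --block filters are available
in pg_waldump from PG 15 on.

```python
def wal_segment_name(lsn, timeline=1, seg_size=16 * 1024 * 1024):
    """Return the WAL segment file name that contains the given LSN.

    timeline and seg_size are assumptions (timeline 1, 16MB segments).
    """
    hi, lo = (int(part, 16) for part in lsn.split("/"))
    pos = (hi << 32) | lo                    # absolute byte position in WAL
    segno = pos // seg_size                  # global segment number
    segs_per_id = 0x100000000 // seg_size    # segments per 4GB "log id"
    return "%08X%08X%08X" % (timeline, segno // segs_per_id, segno % segs_per_id)


def waldump_args(lsn, relation, block, wal_dir, timeline=1):
    """Build a pg_waldump command line filtered to one relation and block.

    wal_dir is a placeholder for wherever the archived segments are restored.
    """
    return [
        "pg_waldump",
        "--path", wal_dir,
        "--relation", relation,   # tablespace/database/relfilenode
        "--block", str(block),
        wal_segment_name(lsn, timeline),
    ]


if __name__ == "__main__":
    args = waldump_args("184F3/F248B6F0", "1663/33195/410203483",
                        25329, "/path/to/wal")
    print(" ".join(args))
```

To look at earlier history of the page, the same filters can be run against
the preceding segments as well.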

It is interesting that this succeeded on the primary but failed on the
standby. Is there anything special about the relation
1663/33195/410203483? Do you know if it's a regular / temporary table,
etc?

regards

--
Tomas Vondra