Re: Replication to standby broke with WAL file corruption - Mailing list pgsql-general

From: Tomas Vondra
Subject: Re: Replication to standby broke with WAL file corruption
Msg-id: eb716c9a-983c-4ba3-9ceb-7d60d1825e4f@vondra.me
In response to: Replication to standby broke with WAL file corruption (Ishan joshi <ishanjoshi@live.com>)
Responses: Re: Replication to standby broke with WAL file corruption
List: pgsql-general

On 3/13/26 11:41, Ishan joshi wrote:
> Hi Team,
> 
> I found an issue with a PG v16.9 Patroni setup where replication to our
> standby node and to the disaster recovery site broke with the error
> below. It looks like WAL corruption that later became part of an
> archived file.
> 
> 
> CONTEXT:  WAL redo at 184F3/F248B6F0 for Heap/LOCK: xmax: 2818115117,
> off:35, infobits: [LOCK_ONLY, EXCL_LOCK], flags: 0x00; blkref #0: rel
> 1663/33195/410203483, blk 25329"
> PANIC:  WAL contains references to invalid pages"
> CONTEXT:  WAL redo at 184F3/F248B6F0 for Heap/LOCK: xmax: 2818115117,
> off:35, infobits: [LOCK_ONLY, EXCL_LOCK], flags: 0x00; blkref #0: rel
> 1663/33195/410203483, blk 25329"
> WARNING:  page 25329 of relation base/33195/410203483 does not exist"
> INFO: no action. I am (pg-patroni-node1-0), a secondary, and following a
> leader (pg-patroni-node2-0)"
> [61]LOG:  terminating any other active server processes"
> [61]LOG:  startup process (PID 72) was terminated by signal 6: Aborted"
> [61]LOG:  shutting down due to startup process failure"
> [61]LOG:  database system is shut down"
> INFO: establishing a new patroni heartbeat connection to postgres"
> INFO: Lock owner: pg-patroni-node2-0; I am pg-patroni-node1-0"
> WARNING: Retry got exception: connection problems"
> WARNING: Failed to determine PostgreSQL state from the connection,
> falling back to cached role"
> INFO: Error communicating with PostgreSQL. Will try again later"
> WARNING: Postgresql is not running."
> 
> 
> The primary database was not impacted, but replication to the standby
> node and the DR site broke. I tried to reinit from the latest backup plus
> archived WAL from pgBackRest, but it failed with the same error as soon
> as the corrupt WAL/archive file was applied. I had to reinit with
> pg_basebackup on a 40TB database, which took about 45 hours.
> 
> As I understand it, a transaction created a table, performed DML, and
> then dropped the table, or the transaction was rolled back. That caused a
> race condition during WAL generation, and applying the same WAL on the
> standby/DR site failed.
> 

It's hard to say what caused this, but it might be interesting to look
at the WAL using pg_waldump. First at the WAL segment containing the
record triggering the failure, and then also at WAL segments before that
containing references to relation 1663/33195/410203483 (and especially
page 25329).
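
For illustration only, here is a rough Python sketch of how one might
narrow that search down. It assumes a pg_waldump from PostgreSQL 15 or
later, which (as far as I know) accepts --relation and --block filters.
The helper names, the WAL directory and the segment name are all
hypothetical placeholders, and the regular expression is only a guess at
the "rel .../.../..., blk N" layout seen in the log above, so adjust it
to whatever your pg_waldump actually prints.

```python
import re
import shlex

# Matches block references such as "rel 1663/33195/410203483, blk 25329".
# The comma and a "fork <name>" part are treated as optional, because the
# layout may differ between server log context lines and pg_waldump output
# (an assumption, not a documented format).
BLKREF = re.compile(
    r"rel\s*(\d+)/(\d+)/(\d+),?\s*(?:fork\s+\w+,?\s*)?blk\s+(\d+)"
)


def references(line, relation, block=None):
    """Return True if the line mentions the relation (and block, if given)."""
    for m in BLKREF.finditer(line):
        rel = "/".join(m.group(1, 2, 3))
        if rel == relation and (block is None or int(m.group(4)) == block):
            return True
    return False


def waldump_command(wal_dir, start_segment, relation, block=None):
    """Build a pg_waldump command line restricted to one relation.

    wal_dir and start_segment are placeholders supplied by the caller;
    --block is only added when a block number is given, since it has to
    be combined with --relation.
    """
    cmd = ["pg_waldump", "--path", wal_dir, "--relation", relation]
    if block is not None:
        cmd += ["--block", str(block)]
    cmd.append(start_segment)
    return shlex.join(cmd)
```

The generated command can be run against the failing segment first and
then against earlier segments, while references() can be used to filter
already saved pg_waldump output or server logs for the same relation and
page.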

It is interesting that this succeeded on the primary but failed on the
standby.

Is there anything special about the relation 1663/33195/410203483? Do
you know if it's a regular / temporary table, etc?


regards

-- 
Tomas Vondra



