standby fails with out-of-order XID insertion upon start - Mailing list pgsql-bugs

From Radoslav Nedyalkov
Subject standby fails with out-of-order XID insertion upon start
Date
Msg-id CANhtRibYPVfH65phBtVFQGL1FvYHEk30G_0rNLbpLy_Qa4K8Zg@mail.gmail.com
Whole thread Raw
List pgsql-bugs
Hi Team,

We were building a standby (pg11.11) as log-shipping and then switched to stream.
However after changing the recovery.conf and restarting the server it failed with 

2021-03-31 14:23:04 UTC [84999] 0: [27-1] user=,db= host=,app=[] FATAL:  out-of-order XID insertion in KnownAssignedXids
2021-03-31 14:23:04 UTC [84999] 0: [28-1] user=,db= host=,app=[] CONTEXT:  WAL redo at A8EA2/3A900150 for Standby/RUNNING_XACTS: nextXid 55152875 latestCompletedXid 55152874 oldestRunningXid 4260572256; 13 xacts: 53056024 53575669 41821378 53575647 41706387 53575629 4260572256 51340860 54917157 44843675 44370470 27311547 55152872

More details about the setup:
Basebackup was build from another standby (pg11.11) Amazon Linux 5.4.95-42.163.amzn2.x86_64
Wal logs and streaming were from master (pg11.8) Centos7 3.10.0-1127.10.1.el7.x86_64
Standby was hot_standby=on. It was restarted when it was already operational and up-to-date 
with the master.
The basebackup took 10 hours to build and then another 7 to catch up with the wal apply.
Master was running some long running transactions. One of these - 4260572256  was lasting before and after and I guess a wraparound happened during the build.(excuse my hint here).

After the failure we started the standby with hot_standby = off for a minute and it applied the "offending" wal. We put it  'on' and it failed again with the same error. We left it running 'off' for a couple of hours in a hope out-of-order txs will be gone. And it worked. We switched it 'on' and it became a fully functional standby which we promoted as master and made a couple of tests.

Attaching a more detailed log. 
Here is the relevant section from the wal file.

rmgr: Btree       len (rec/tot):     53/  5525, tx:   55152872, lsn: A8EA2/3A8FEA00, prev A8EA2/3A8FE118, desc: INSERT_LEAF off 171, blkref #0: rel 1663/16401
/30619182 blk 16810 FPW
rmgr: Btree       len (rec/tot):     64/    64, tx:   55152872, lsn: A8EA2/3A8FFF98, prev A8EA2/3A8FEA00, desc: INSERT_LEAF off 198, blkref #0: rel 1663/16401
/30619190 blk 1627541
rmgr: Standby     len (rec/tot):    345/   345, tx:          0, lsn: A8EA2/3A8FFFD8, prev A8EA2/3A8FFF98, desc: LOCK xid 53575669 db 16401 rel 46272403 xid 41
821378 db 16401 rel 145209998 xid 53575669 db 16401 rel 46272400 xid 4260572256 db 16401 rel 144945162 xid 4260572256 db 16401 rel 144943160 xid 41821378 db 1
6401 rel 145212760 xid 27311547 db 16401 rel 145174122 xid 53575669 db 16401 rel 85876526 xid 53575647 db 16401 rel 145251397 xid 53575629 db 16401 rel 335265
72 xid 53575669 db 16401 rel 46272405 xid 53575629 db 16401 rel 145251380 xid 4260572256 db 16401 rel 144943131 xid 51340860 db 16401 rel 145230700 xid 426057
2256 db 16401 rel 144943165 xid 41821378 db 16401 rel 145212765 xid 41821378 db 16401 rel 145209993 xid 54917157 db 16401 rel 145276215 xid 41821378 db 16401
rel 145209944 xid 53575629 db 16401 rel 145251385 xid 54917157 db 16401 rel 145276220 xid 4260572256 db 16401 rel 144945157 xid 41706387 db 16401 rel 14520963
9 xid 53575647 db 16401 rel 89115947 xid 53575647 db 16401 rel 145251392 xid 53056024 db 16401 rel 145235217
rmgr: Standby     len (rec/tot):    102/   102, tx:          0, lsn: A8EA2/3A900150, prev A8EA2/3A8FFFD8, desc: RUNNING_XACTS nextXid 55152875 latestCompletedXid 55152874 oldestRunningXid 4260572256; 13 xacts: 53056024 53575669 41821378 53575647 41706387 53575629 4260572256 51340860 54917157 44843675 44370470 27311547 55152872
rmgr: XLOG        len (rec/tot):    106/   106, tx:          0, lsn: A8EA2/3A9001B8, prev A8EA2/3A900150, desc: CHECKPOINT_ONLINE redo A8EA2/30DFAEA8; tli 7; prev tli 7; fpw true; xid 10:55138257; oid 145279495; multi 259; offset 660; oldest xid 3783095690 in DB 16401; oldest multi 1 in DB 16401; oldest/newest commit timestamp xid: 0/0; oldest running xid 4260572256; online
rmgr: Heap        len (rec/tot):    130/   130, tx:   55152872, lsn: A8EA2/3A900228, prev A8EA2/3A9001B8, desc: INSERT off 47, blkref #0: rel 1663/16401/88935494 blk 15803338

Thanks and regards,
Radoslav
Attachment

pgsql-bugs by date:

Previous
From: "egashira.yusuke@fujitsu.com"
Date:
Subject: RE: BUG #16922: In cascading replication, a standby server aborted when an upstream standby server promoted
Next
From: "Boris P. Korzun"
Date:
Subject: Re: Inconsistent behavior of pg_dump/pg_restore on DEFAULT PRIVILEGES