out-of-order XID insertion in KnownAssignedXids - Mailing list pgsql-hackers

From Konstantin Knizhnik
Subject out-of-order XID insertion in KnownAssignedXids
Date
Msg-id 0c96b653-4696-d4b4-6b5d-78143175d113@postgrespro.ru
Responses Re: out-of-order XID insertion in KnownAssignedXids  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
Hi hackers,

It looks like there is a bug in how the XIDs of running transactions are
logged when prepared transactions are involved.
One of our customers got the error "FATAL: out-of-order XID insertion in
KnownAssignedXids" while trying to restore a backup.
The WAL contains the following record:

rmgr: Standby     len (rec/tot):     98/    98, tx: 0, lsn: 1418/A9A76C90, prev 1418/A9A76C48, desc: RUNNING_XACTS nextXid 2004495309 latestCompletedXid 2004495307 oldestRunningXid 2004495290; 3 xacts: 2004495290 2004495308 2004495308

As you can see, XID 2004495308 appears twice, which triggers the error
in KnownAssignedXidsAdd:

     if (head > tail &&
         TransactionIdFollowsOrEquals(KnownAssignedXids[head - 1], from_xid))
     {
         KnownAssignedXidsDisplay(LOG);
         elog(ERROR, "out-of-order XID insertion in KnownAssignedXids");
     }
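
To make the failure mode concrete, here is a minimal standalone sketch of the
same ordering check. It is a simplified model, not the PostgreSQL code: it
assumes the XIDs arrive already sorted and are appended one at a time, and the
names xid_follows_or_equals and first_out_of_order are hypothetical helpers
introduced only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Assumption: XIDs below 3 are special (non-normal) values. */
#define FIRST_NORMAL_XID ((TransactionId) 3)

/*
 * Hypothetical stand-in for TransactionIdFollowsOrEquals: normal XIDs are
 * compared modulo 2^32, special XIDs are compared as plain integers.
 */
static bool
xid_follows_or_equals(TransactionId a, TransactionId b)
{
    if (a < FIRST_NORMAL_XID || b < FIRST_NORMAL_XID)
        return a >= b;
    return (int32_t) (a - b) >= 0;
}

/*
 * Append the XIDs one by one and apply the ordering check to each new entry.
 * Returns the index of the first entry that would raise the error, or -1 if
 * every entry strictly follows its predecessor.
 */
static int
first_out_of_order(const TransactionId *xids, int n)
{
    for (int i = 1; i < n; i++)
    {
        /* Same shape as the check above: previous entry >= new entry. */
        if (xid_follows_or_equals(xids[i - 1], xids[i]))
            return i;
    }
    return -1;
}
```

For the three XIDs in the record above (2004495290, 2004495308, 2004495308),
first_out_of_order returns 2: the second copy of 2004495308 equals the entry
before it, so the "follows or equals" test fires even though nothing is
actually out of order.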

The probability of hitting this error is very small, but it is quite easy
to reproduce: set a breakpoint in a debugger just after the call to
MarkAsPrepared in twophase.c and then prepare any transaction.
MarkAsPrepared adds the GXACT to the proc array, and at that moment there
are two entries in the procarray with the same XID:

(gdb) p procArray->numProcs
$2 = 4
(gdb) p allPgXact[procArray->pgprocnos[0]]
$4 = {xid = 513976717, xmin = 0, vacuumFlags = 0 '\000', overflowed = 0 '\000', delayChkpt = 1 '\001', nxids = 0 '\000',
   used = 0 '\000', parent = 0x0}
(gdb) p allPgXact[procArray->pgprocnos[1]]
$5 = {xid = 0, xmin = 0, vacuumFlags = 0 '\000', overflowed = 0 '\000', delayChkpt = 0 '\000', nxids = 0 '\000',
   used = 0 '\000', parent = 0x0}
(gdb) p allPgXact[procArray->pgprocnos[2]]
$6 = {xid = 0, xmin = 0, vacuumFlags = 0 '\000', overflowed = 0 '\000', delayChkpt = 0 '\000', nxids = 0 '\000',
   used = 0 '\000', parent = 0x0}
(gdb) p allPgXact[procArray->pgprocnos[3]]
$7 = {xid = 513976717, xmin = 0, vacuumFlags = 0 '\000', overflowed = 0 '\000', delayChkpt = 0 '\000', nxids = 0 '\000',
   used = 0 '\000', parent = 0x0}

Then just wait a while until the running-transactions snapshot is logged
(in the backtrace below, by the background writer):

(gdb) bt
#0  0x00000000007f3dab in GetRunningTransactionData () at procarray.c:2240
#1  0x00000000007fab22 in LogStandbySnapshot () at standby.c:943
#2  0x000000000077cde8 in BackgroundWriterMain () at bgwriter.c:331
#3  0x00000000005377f3 in AuxiliaryProcessMain (argc=2, argv=0x7ffe00aa00e0) at bootstrap.c:446
#4  0x000000000078e07e in StartChildProcess (type=BgWriterProcess) at postmaster.c:5323
#5  0x000000000078b6f0 in reaper (postgres_signal_arg=17) at postmaster.c:2948
#6  <signal handler called>
#7  0x00007f1356d665b3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
#8  0x0000000000789931 in ServerLoop () at postmaster.c:1765
#9  0x000000000078906c in PostmasterMain (argc=3, argv=0x1902640) at postmaster.c:1406
#10 0x00000000006d0e4f in main (argc=3, argv=0x1902640) at main.c:228

The generated RUNNING_XACTS record now contains duplicate XIDs.
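
The way the duplicate ends up in the snapshot can be modelled with a few lines
of C. This is only a sketch under stated assumptions: ProcEntry and
collect_running_xids are hypothetical names, and the model assumes that the
snapshot simply copies every valid (non-zero) XID it finds while scanning the
proc array, without checking whether it has already seen that XID.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical, stripped-down proc array slot: only the XID matters here. */
typedef struct ProcEntry
{
    TransactionId xid;          /* 0 means "no transaction assigned" */
} ProcEntry;

/*
 * Copy every valid XID from the proc array into out[] (which must have room
 * for nprocs entries). Returns the number of XIDs copied. No deduplication
 * is done, so two slots with the same XID yield two entries.
 */
static size_t
collect_running_xids(const ProcEntry *procs, size_t nprocs, TransactionId *out)
{
    size_t count = 0;

    for (size_t i = 0; i < nprocs; i++)
    {
        if (procs[i].xid == 0)
            continue;           /* skip slots without an XID */
        out[count++] = procs[i].xid;
    }
    return count;
}
```

With the four slots from the gdb session above (513976717, 0, 0, 513976717),
this returns two entries, both 513976717, which is the shape of the
RUNNING_XACTS record that later fails on replay.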

I would like to ask the community's opinion about the best way to fix this
problem.
Should we avoid storing duplicate XIDs in the procarray (by invalidating
the XID in the original PGXACT),
or should we remove or relax the duplicate check in KnownAssignedXidsAdd
(for example, by just ignoring duplicates)?
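
As a rough illustration of the second option, ignoring duplicates could look
like the sketch below. It is not a patch: dedup_sorted_xids is a hypothetical
helper, and it assumes the XID array has already been sorted, so that equal
XIDs are adjacent.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Hypothetical helper: collapse adjacent duplicates in a sorted XID array,
 * in place. Returns the new number of entries.
 */
static size_t
dedup_sorted_xids(TransactionId *xids, size_t n)
{
    size_t out;

    if (n == 0)
        return 0;

    out = 1;                    /* the first entry is always kept */
    for (size_t i = 1; i < n; i++)
    {
        /* Keep an entry only if it differs from the last one kept. */
        if (xids[i] != xids[out - 1])
            xids[out++] = xids[i];
    }
    return out;
}
```

Applied to the XIDs from the record above, this leaves 2004495290 and
2004495308, and the strict ordering check in KnownAssignedXidsAdd could stay
unchanged. The alternative of invalidating the XID in the original PGXACT
would instead prevent the duplicate from being logged in the first place.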




