Hi,
There's a bug in ProcArrayApplyRecoveryInfo, introduced by 8431e296ea,
which may cause failures when starting a replica, making it unusable.
The commit message for 8431e296ea is not very clear about what exactly
is being done and why, but the root cause is that at while processing
RUNNING_XACTS, the XIDs are sorted like this:
/*
* Sort the array so that we can add them safely into
* KnownAssignedXids.
*/
qsort(xids, nxids, sizeof(TransactionId), xidComparator);
where "safely" likely means "not violating the ordering expected by
KnownAssignedXidsAdd". Unfortunately, xidComparator compares the values
as plain uint32 values, while KnownAssignedXidsAdd actually calls
TransactionIdFollowsOrEquals() and compares the logical XIDs :-(
Triggering this is pretty simple - all you need is two transactions with
XIDs before/after the 4B limit, and then (re)start a replica. The
replica refuses to start with a message like this:
LOG: 9 KnownAssignedXids (num=4 tail=0 head=4) [0]=32705 [1]=32706
[2]=32707 [3]=32708
CONTEXT: WAL redo at 0/6000120 for Standby/RUNNING_XACTS: nextXid
32715 latestCompletedXid 32714 oldestRunningXid
4294967001; 8 xacts: 32708 32707 32706 32705 4294967009
4294967008 4294967007 4294967006
FATAL: out-of-order XID insertion in KnownAssignedXids
Clearly, we add the 4 "younger" XIDs first (because that's what the XID
comparator does), but then KnownAssignedXidsAdd thinks there's some sort
of corruption because logically 4294967006 is older.
This does not affect replicas in STANDBY_SNAPSHOT_READY state, because
in that case ProcArrayApplyRecoveryInfo ignores RUNNING_XACTS messages.
The probability of hitting this in practice is proportional to how long
you leave transactions running. The system where we observed this leaves
transactions with XIDs open for days, and the age may be ~40M.
Intuitivelly, that's ~40M/4B (=1%) probability that at any given time
there are transactions with contradicting ordering. So most restarts
worked fine, until one that happened at just the "right" time.
This likely explains why we never got any reports about this - most
systems probably don't leave transactions running for this long, so the
probability is much lower. And replica restarts are generally not that
common events either.
Attached patch is fixing this by just sorting the XIDs logically. The
xidComparator is meant for places that can't do logical ordering. But
these XIDs come from RUNNING_XACTS, so they actually come from the same
wraparound epoch (so sorting logically seems perfectly fine).
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company