Thread: Bug in ProcArrayApplyRecoveryInfo for snapshots crossing 4B, breaking replicas

Bug in ProcArrayApplyRecoveryInfo for snapshots crossing 4B, breaking replicas

From
Tomas Vondra
Date:
Hi,

There's a bug in ProcArrayApplyRecoveryInfo, introduced by 8431e296ea, 
which may cause failures when starting a replica, making it unusable. 
The commit message for 8431e296ea is not very clear about what exactly 
is being done and why, but the root cause is that at while processing 
RUNNING_XACTS, the XIDs are sorted like this:

     /*
      * Sort the array so that we can add them safely into
      * KnownAssignedXids.
      */
     qsort(xids, nxids, sizeof(TransactionId), xidComparator);

where "safely" likely means "not violating the ordering expected by 
KnownAssignedXidsAdd". Unfortunately, xidComparator compares the values 
as plain uint32 values, while KnownAssignedXidsAdd actually calls 
TransactionIdFollowsOrEquals() and compares the logical XIDs :-(

Triggering this is pretty simple - all you need is two transactions with 
XIDs before/after the 4B limit, and then (re)start a replica. The 
replica refuses to start with a message like this:

     LOG:  9 KnownAssignedXids (num=4 tail=0 head=4) [0]=32705 [1]=32706
           [2]=32707 [3]=32708
     CONTEXT:  WAL redo at 0/6000120 for Standby/RUNNING_XACTS: nextXid
               32715 latestCompletedXid 32714 oldestRunningXid
               4294967001; 8 xacts: 32708 32707 32706 32705 4294967009
               4294967008 4294967007 4294967006
     FATAL:  out-of-order XID insertion in KnownAssignedXids

Clearly, we add the 4 "younger" XIDs first (because that's what the XID 
comparator does), but then KnownAssignedXidsAdd thinks there's some sort 
of corruption because logically 4294967006 is older.

This does not affect replicas in STANDBY_SNAPSHOT_READY state, because 
in that case ProcArrayApplyRecoveryInfo ignores RUNNING_XACTS messages.


The probability of hitting this in practice is proportional to how long 
you leave transactions running. The system where we observed this leaves 
transactions with XIDs open for days, and the age may be ~40M. 
Intuitivelly, that's ~40M/4B (=1%) probability that at any given time 
there are transactions with contradicting ordering. So most restarts 
worked fine, until one that happened at just the "right" time.

This likely explains why we never got any reports about this - most 
systems probably don't leave transactions running for this long, so the 
probability is much lower. And replica restarts are generally not that 
common events either.

Attached patch is fixing this by just sorting the XIDs logically. The 
xidComparator is meant for places that can't do logical ordering. But 
these XIDs come from RUNNING_XACTS, so they actually come from the same 
wraparound epoch (so sorting logically seems perfectly fine).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: Bug in ProcArrayApplyRecoveryInfo for snapshots crossing 4B, breaking replicas

From
"Bossart, Nathan"
Date:
On 1/22/22, 4:43 PM, "Tomas Vondra" <tomas.vondra@enterprisedb.com> wrote:
> There's a bug in ProcArrayApplyRecoveryInfo, introduced by 8431e296ea,
> which may cause failures when starting a replica, making it unusable.
> The commit message for 8431e296ea is not very clear about what exactly
> is being done and why, but the root cause is that at while processing
> RUNNING_XACTS, the XIDs are sorted like this:
>
>      /*
>       * Sort the array so that we can add them safely into
>       * KnownAssignedXids.
>       */
>      qsort(xids, nxids, sizeof(TransactionId), xidComparator);
>
> where "safely" likely means "not violating the ordering expected by
> KnownAssignedXidsAdd". Unfortunately, xidComparator compares the values
> as plain uint32 values, while KnownAssignedXidsAdd actually calls
> TransactionIdFollowsOrEquals() and compares the logical XIDs :-(

Wow, nice find.

> This likely explains why we never got any reports about this - most
> systems probably don't leave transactions running for this long, so the
> probability is much lower. And replica restarts are generally not that
> common events either.

I'm aware of one report with the same message [0], but I haven't read
closely enough to determine whether it is the same issue.  It looks
like that particular report was attributed to backup_label being
removed.

> Attached patch is fixing this by just sorting the XIDs logically. The
> xidComparator is meant for places that can't do logical ordering. But
> these XIDs come from RUNNING_XACTS, so they actually come from the same
> wraparound epoch (so sorting logically seems perfectly fine).

The patch looks reasonable to me.

Nathan

[0] https://postgr.es/m/1476795473014.15979.2188%40webmail4


On 1/24/22 22:28, Bossart, Nathan wrote:
> On 1/22/22, 4:43 PM, "Tomas Vondra" <tomas.vondra@enterprisedb.com> wrote:
>> There's a bug in ProcArrayApplyRecoveryInfo, introduced by 8431e296ea,
>> which may cause failures when starting a replica, making it unusable.
>> The commit message for 8431e296ea is not very clear about what exactly
>> is being done and why, but the root cause is that at while processing
>> RUNNING_XACTS, the XIDs are sorted like this:
>>
>>       /*
>>        * Sort the array so that we can add them safely into
>>        * KnownAssignedXids.
>>        */
>>       qsort(xids, nxids, sizeof(TransactionId), xidComparator);
>>
>> where "safely" likely means "not violating the ordering expected by
>> KnownAssignedXidsAdd". Unfortunately, xidComparator compares the values
>> as plain uint32 values, while KnownAssignedXidsAdd actually calls
>> TransactionIdFollowsOrEquals() and compares the logical XIDs :-(
> 
> Wow, nice find.
> 
>> This likely explains why we never got any reports about this - most
>> systems probably don't leave transactions running for this long, so the
>> probability is much lower. And replica restarts are generally not that
>> common events either.
> 
> I'm aware of one report with the same message [0], but I haven't read
> closely enough to determine whether it is the same issue.  It looks
> like that particular report was attributed to backup_label being
> removed.
> 

Yeah, I saw that thread too, and I don't think it's the same issue. As 
you say, it seems to be caused by the backup_label shenanigans, and 
there's also the RUNNING_XACTS message:

Sep 20 15:00:27 ... CONTEXT: xlog redo Standby/RUNNING_XACTS: nextXid 
38585 latestCompletedXid 38571 oldestRunningXid 38572; 14 xacts: 38573 
38575 38579 38578 38574 38581 38580 38576 38577 38572 38582 38584 38583 
38583

The XIDs don't cross the 4B boundary at all, so this seems unrelated.


>> Attached patch is fixing this by just sorting the XIDs logically. The
>> xidComparator is meant for places that can't do logical ordering. But
>> these XIDs come from RUNNING_XACTS, so they actually come from the same
>> wraparound epoch (so sorting logically seems perfectly fine).
> 
> The patch looks reasonable to me.
> 

Thanks!


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Bug in ProcArrayApplyRecoveryInfo for snapshots crossing 4B, breaking replicas

From
Michael Paquier
Date:
On Mon, Jan 24, 2022 at 10:45:48PM +0100, Tomas Vondra wrote:
> On 1/24/22 22:28, Bossart, Nathan wrote:
>>> Attached patch is fixing this by just sorting the XIDs logically. The
>>> xidComparator is meant for places that can't do logical ordering. But
>>> these XIDs come from RUNNING_XACTS, so they actually come from the same
>>> wraparound epoch (so sorting logically seems perfectly fine).
>>
>> The patch looks reasonable to me.
>
> Thanks!

Could it be possible to add a TAP test?  One idea would be to rely on
pg_resetwal -x and -e close to the 4B limit to set up a node before
stressing the scenario of this bug, so that would be rather cheap.
--
Michael

Attachment

On 1/25/22 04:25, Michael Paquier wrote:
> On Mon, Jan 24, 2022 at 10:45:48PM +0100, Tomas Vondra wrote:
>> On 1/24/22 22:28, Bossart, Nathan wrote:
>>>> Attached patch is fixing this by just sorting the XIDs logically. The
>>>> xidComparator is meant for places that can't do logical ordering. But
>>>> these XIDs come from RUNNING_XACTS, so they actually come from the same
>>>> wraparound epoch (so sorting logically seems perfectly fine).
>>>
>>> The patch looks reasonable to me.
>>
>> Thanks!
> 
> Could it be possible to add a TAP test?  One idea would be to rely on
> pg_resetwal -x and -e close to the 4B limit to set up a node before
> stressing the scenario of this bug, so that would be rather cheap.

I actually tried doing that, but I was not very happy with the result. 
The test has to call pg_resetwal, but then it also has to fake pg_xact 
data and so on, which seemed a bit ugly so did not include the test in 
the patch.

But maybe there's a better way to do this, so here it is. I've kept it 
separately, so that it's possible to apply it without the fix, to verify 
it actually triggers the issue.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: Bug in ProcArrayApplyRecoveryInfo for snapshots crossing 4B, breaking replicas

From
Michael Paquier
Date:
On Wed, Jan 26, 2022 at 07:31:00PM +0100, Tomas Vondra wrote:
> I actually tried doing that, but I was not very happy with the result. The
> test has to call pg_resetwal, but then it also has to fake pg_xact data and
> so on, which seemed a bit ugly so did not include the test in the patch.

Indeed, the dependency to /dev/zero is not good either.  The patch
logic looks good to me.
--
Michael

Attachment
On 1/26/22 23:54, Michael Paquier wrote:
> On Wed, Jan 26, 2022 at 07:31:00PM +0100, Tomas Vondra wrote:
>> I actually tried doing that, but I was not very happy with the result. The
>> test has to call pg_resetwal, but then it also has to fake pg_xact data and
>> so on, which seemed a bit ugly so did not include the test in the patch.
> 
> Indeed, the dependency to /dev/zero is not good either.  The patch
> logic looks good to me.

OK, I've pushed the patch. We may consider adding a TAP test later, if 
we find a reasonably clean approach.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company