Re: Bug in ProcArrayApplyRecoveryInfo for snapshots crossing 4B, breaking replicas - Mailing list pgsql-hackers

From	Bossart, Nathan
Subject	Re: Bug in ProcArrayApplyRecoveryInfo for snapshots crossing 4B, breaking replicas
Date	January 24, 2022 21:28:43
Msg-id	A890C641-2858-40EA-8B9A-D50EAA2EBD44@amazon.com
In response to	Bug in ProcArrayApplyRecoveryInfo for snapshots crossing 4B, breaking replicas (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses	Re: Bug in ProcArrayApplyRecoveryInfo for snapshots crossing 4B, breaking replicas
List	pgsql-hackers

On 1/22/22, 4:43 PM, "Tomas Vondra" <tomas.vondra@enterprisedb.com> wrote:
> There's a bug in ProcArrayApplyRecoveryInfo, introduced by 8431e296ea,
> which may cause failures when starting a replica, making it unusable.
> The commit message for 8431e296ea is not very clear about what exactly
> is being done and why, but the root cause is that while processing
> RUNNING_XACTS, the XIDs are sorted like this:
>
>      /*
>       * Sort the array so that we can add them safely into
>       * KnownAssignedXids.
>       */
>      qsort(xids, nxids, sizeof(TransactionId), xidComparator);
>
> where "safely" likely means "not violating the ordering expected by
> KnownAssignedXidsAdd". Unfortunately, xidComparator compares the values
> as plain uint32 values, while KnownAssignedXidsAdd actually calls
> TransactionIdFollowsOrEquals() and compares the logical XIDs :-(

Wow, nice find.
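
For anyone following along, here is a minimal standalone sketch of the
mismatch described above. It is not PostgreSQL source: the function
names are hypothetical, and the logical comparison only mirrors the
modulo-2^32 arithmetic used for normal XIDs, assuming the usual
two's-complement conversion to int32_t.

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/* Raw ordering: compares the values as plain uint32 (hypothetical helper). */
int
raw_xid_cmp(TransactionId a, TransactionId b)
{
    if (a > b)
        return 1;
    if (a < b)
        return -1;
    return 0;
}

/*
 * Logical ordering: "a precedes b" modulo 2^32 (hypothetical helper).
 * Only meaningful when the two XIDs are less than 2^31 apart.
 */
int
logical_xid_precedes(TransactionId a, TransactionId b)
{
    int32_t diff = (int32_t) (a - b);

    return diff < 0;
}
```

With an XID just below the 4B boundary (say 4294967290) and one just
after wraparound (say 5), the raw comparison puts 5 first, while the
logical comparison says 4294967290 precedes 5. An array sorted with the
raw comparator can therefore be out of order from the point of view of
code that checks logical ordering.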

> This likely explains why we never got any reports about this - most
> systems probably don't leave transactions running for this long, so the
> probability is much lower. And replica restarts are generally not that
> common events either.

I'm aware of one report with the same message [0], but I haven't read
closely enough to determine whether it is the same issue.  It looks
like that particular report was attributed to backup_label being
removed.

> Attached patch is fixing this by just sorting the XIDs logically. The
> xidComparator is meant for places that can't do logical ordering. But
> these XIDs come from RUNNING_XACTS, so they actually come from the same
> wraparound epoch (so sorting logically seems perfectly fine).

The patch looks reasonable to me.
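
To illustrate what "sorting the XIDs logically" can look like, here is
a rough sketch of a qsort() comparator that orders XIDs modulo 2^32.
This is not copied from the attached patch; the names
xid_logical_cmp and sort_xids_logically are hypothetical, and the
sketch assumes all XIDs in the array are normal XIDs less than 2^31
apart (otherwise the ordering is not well defined).

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* qsort() comparator ordering XIDs logically, i.e. modulo 2^32. */
int
xid_logical_cmp(const void *p1, const void *p2)
{
    TransactionId a = *(const TransactionId *) p1;
    TransactionId b = *(const TransactionId *) p2;
    int32_t     diff = (int32_t) (a - b);

    if (diff < 0)
        return -1;
    if (diff > 0)
        return 1;
    return 0;
}

/* Sort an array of XIDs in place using the logical ordering. */
void
sort_xids_logically(TransactionId *xids, size_t nxids)
{
    qsort(xids, nxids, sizeof(TransactionId), xid_logical_cmp);
}
```

For the input {5, 4294967290, 3, 4294967295} this yields
{4294967290, 4294967295, 3, 5}, so the XIDs from before the wraparound
come first, which matches the order a logical comparison expects.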

Nathan

[0] https://postgr.es/m/1476795473014.15979.2188%40webmail4
