Thread: Assertion failure with epoch when replaying standby records for 2PC

Assertion failure with epoch when replaying standby records for 2PC

From
Michael Paquier
Date:
Hi all,

rorqual has failed today with a very interesting failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-01-17%2005%3A06%3A31

This has caused an assertion failure for a 2PC transaction when
replaying one of the tests from the main regression suite:
2024-01-17 05:08:23.143 UTC [3242608] DETAIL:  Last completed transaction was at log time 2024-01-17
05:08:22.920244+00.
TRAP: failed Assert("epoch > 0"), File: "../pgsql/src/backend/access/transam/twophase.c", Line: 969, PID: 3242610
postgres: standby_1: startup recovering 00000001000000000000000C(ExceptionalCondition+0x83)[0x55746c7838c1]
postgres: standby_1: startup recovering 00000001000000000000000C(+0x194f0e)[0x55746c371f0e]
postgres: standby_1: startup recovering 00000001000000000000000C(StandbyTransactionIdIsPrepared+0x29)[0x55746c373120]
postgres: standby_1: startup recovering 00000001000000000000000C(StandbyReleaseOldLocks+0x3f)[0x55746c621357]
postgres: standby_1: startup recovering 00000001000000000000000C(ProcArrayApplyRecoveryInfo+0x50)[0x55746c61bbb5]
postgres: standby_1: startup recovering 00000001000000000000000C(standby_redo+0xe1)[0x55746c621490]
postgres: standby_1: startup recovering 00000001000000000000000C(PerformWalRecovery+0xa5e)[0x55746c392404]
postgres: standby_1: startup recovering 00000001000000000000000C(StartupXLOG+0x3ac)[0x55746c3862b8]
postgres: standby_1: startup recovering 00000001000000000000000C(StartupProcessMain+0xd9)[0x55746c5a60f6]
postgres: standby_1: startup recovering 00000001000000000000000C(AuxiliaryProcessMain+0x172)[0x55746c59bbdd]
postgres: standby_1: startup recovering 00000001000000000000000C(+0x3c4235)[0x55746c5a1235]
postgres: standby_1: startup recovering 00000001000000000000000C(PostmasterMain+0x1401)[0x55746c5a5a10]
postgres: standby_1: startup recovering 00000001000000000000000C(main+0x835)[0x55746c4e90ce]
/lib/x86_64-linux-gnu/libc.so.6(+0x276ca)[0x7f67bbb846ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f67bbb84785]
postgres: standby_1: startup recovering 00000001000000000000000C(_start+0x21)[0x55746c2b61d1]

This refers to the following in twophase.c with
AdjustToFullTransactionId():
    nextXid = XidFromFullTransactionId(nextFullXid);
    epoch = EpochFromFullTransactionId(nextFullXid);

    if (unlikely(xid > nextXid))
    {
        /* Wraparound occurred, must be from a prev epoch. */
        Assert(epoch > 0);
        epoch--;
    }

This would mean that we've found a way to get a negative epoch, which
should not be possible.

Alexander, you have added this code in 5a1dfde8334b when switching the
2PC file names to use FullTransactionIds.  Could you check please?
--
Michael

Attachment

Re: Assertion failure with epoch when replaying standby records for 2PC

From
Alexander Korotkov
Date:
Hi, Michael!

On Wed, Jan 17, 2024 at 7:47 AM Michael Paquier <michael@paquier.xyz> wrote:
> rorqual has failed today with a very interesting failure:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-01-17%2005%3A06%3A31
>
> This has caused an assertion failure for a 2PC transaction when
> replaying one of the tests from the main regression suite:
> 2024-01-17 05:08:23.143 UTC [3242608] DETAIL:  Last completed transaction was at log time 2024-01-17
05:08:22.920244+00.
> TRAP: failed Assert("epoch > 0"), File: "../pgsql/src/backend/access/transam/twophase.c", Line: 969, PID: 3242610
> postgres: standby_1: startup recovering 00000001000000000000000C(ExceptionalCondition+0x83)[0x55746c7838c1]
> postgres: standby_1: startup recovering 00000001000000000000000C(+0x194f0e)[0x55746c371f0e]
> postgres: standby_1: startup recovering 00000001000000000000000C(StandbyTransactionIdIsPrepared+0x29)[0x55746c373120]
> postgres: standby_1: startup recovering 00000001000000000000000C(StandbyReleaseOldLocks+0x3f)[0x55746c621357]
> postgres: standby_1: startup recovering 00000001000000000000000C(ProcArrayApplyRecoveryInfo+0x50)[0x55746c61bbb5]
> postgres: standby_1: startup recovering 00000001000000000000000C(standby_redo+0xe1)[0x55746c621490]
> postgres: standby_1: startup recovering 00000001000000000000000C(PerformWalRecovery+0xa5e)[0x55746c392404]
> postgres: standby_1: startup recovering 00000001000000000000000C(StartupXLOG+0x3ac)[0x55746c3862b8]
> postgres: standby_1: startup recovering 00000001000000000000000C(StartupProcessMain+0xd9)[0x55746c5a60f6]
> postgres: standby_1: startup recovering 00000001000000000000000C(AuxiliaryProcessMain+0x172)[0x55746c59bbdd]
> postgres: standby_1: startup recovering 00000001000000000000000C(+0x3c4235)[0x55746c5a1235]
> postgres: standby_1: startup recovering 00000001000000000000000C(PostmasterMain+0x1401)[0x55746c5a5a10]
> postgres: standby_1: startup recovering 00000001000000000000000C(main+0x835)[0x55746c4e90ce]
> /lib/x86_64-linux-gnu/libc.so.6(+0x276ca)[0x7f67bbb846ca]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f67bbb84785]
> postgres: standby_1: startup recovering 00000001000000000000000C(_start+0x21)[0x55746c2b61d1]
>
> This refers to the following in twophase.c with
> AdjustToFullTransactionId():
>     nextXid = XidFromFullTransactionId(nextFullXid);
>     epoch = EpochFromFullTransactionId(nextFullXid);
>
>     if (unlikely(xid > nextXid))
>     {
>         /* Wraparound occurred, must be from a prev epoch. */
>         Assert(epoch > 0);
>         epoch--;
>     }
>
> This would mean that we've found a way to get a negative epoch, which
> should not be possible.
>
> Alexander, you have added this code in 5a1dfde8334b when switching the
> 2PC file names to use FullTransactionIds.  Could you check please?

Thank you for reporting!  I'm going to look at this in the next couple of days.

------
Regards,
Alexander Korotkov



Re: Assertion failure with epoch when replaying standby records for 2PC

From
Alexander Korotkov
Date:
On Wed, Jan 17, 2024 at 11:08 PM Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> On Wed, Jan 17, 2024 at 7:47 AM Michael Paquier <michael@paquier.xyz> wrote:
> > rorqual has failed today with a very interesting failure:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-01-17%2005%3A06%3A31
> >
> > This has caused an assertion failure for a 2PC transaction when
> > replaying one of the tests from the main regression suite:
> > 2024-01-17 05:08:23.143 UTC [3242608] DETAIL:  Last completed transaction was at log time 2024-01-17
05:08:22.920244+00.
> > TRAP: failed Assert("epoch > 0"), File: "../pgsql/src/backend/access/transam/twophase.c", Line: 969, PID: 3242610
> > postgres: standby_1: startup recovering 00000001000000000000000C(ExceptionalCondition+0x83)[0x55746c7838c1]
> > postgres: standby_1: startup recovering 00000001000000000000000C(+0x194f0e)[0x55746c371f0e]
> > postgres: standby_1: startup recovering
00000001000000000000000C(StandbyTransactionIdIsPrepared+0x29)[0x55746c373120]
> > postgres: standby_1: startup recovering 00000001000000000000000C(StandbyReleaseOldLocks+0x3f)[0x55746c621357]
> > postgres: standby_1: startup recovering 00000001000000000000000C(ProcArrayApplyRecoveryInfo+0x50)[0x55746c61bbb5]
> > postgres: standby_1: startup recovering 00000001000000000000000C(standby_redo+0xe1)[0x55746c621490]
> > postgres: standby_1: startup recovering 00000001000000000000000C(PerformWalRecovery+0xa5e)[0x55746c392404]
> > postgres: standby_1: startup recovering 00000001000000000000000C(StartupXLOG+0x3ac)[0x55746c3862b8]
> > postgres: standby_1: startup recovering 00000001000000000000000C(StartupProcessMain+0xd9)[0x55746c5a60f6]
> > postgres: standby_1: startup recovering 00000001000000000000000C(AuxiliaryProcessMain+0x172)[0x55746c59bbdd]
> > postgres: standby_1: startup recovering 00000001000000000000000C(+0x3c4235)[0x55746c5a1235]
> > postgres: standby_1: startup recovering 00000001000000000000000C(PostmasterMain+0x1401)[0x55746c5a5a10]
> > postgres: standby_1: startup recovering 00000001000000000000000C(main+0x835)[0x55746c4e90ce]
> > /lib/x86_64-linux-gnu/libc.so.6(+0x276ca)[0x7f67bbb846ca]
> > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f67bbb84785]
> > postgres: standby_1: startup recovering 00000001000000000000000C(_start+0x21)[0x55746c2b61d1]
> >
> > This refers to the following in twophase.c with
> > AdjustToFullTransactionId():
> >     nextXid = XidFromFullTransactionId(nextFullXid);
> >     epoch = EpochFromFullTransactionId(nextFullXid);
> >
> >     if (unlikely(xid > nextXid))
> >     {
> >         /* Wraparound occurred, must be from a prev epoch. */
> >         Assert(epoch > 0);
> >         epoch--;
> >     }
> >
> > This would mean that we've found a way to get a negative epoch, which
> > should not be possible.
> >
> > Alexander, you have added this code in 5a1dfde8334b when switching the
> > 2PC file names to use FullTransactionIds.  Could you check please?
>
> Thank you for reporting!  I'm going to look at this in the next couple of days.

Oh, that is a forgotten piece I've already discovered.
https://www.postgresql.org/message-id/CAPpHfdv%3DVahovNqJHBqr0ejHvx%3DeDuGYySC48Wcvp%2BGDxYLCJg%40mail.gmail.com
I'm going to do some additional checks and push.

------
Regards,
Alexander Korotkov