Thread: Assertion failure with epoch when replaying standby records for 2PC
Hi all, rorqual has failed today with a very interesting failure: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-01-17%2005%3A06%3A31 This has caused an assertion failure for a 2PC transaction when replaying one of the tests from the main regression suite: 2024-01-17 05:08:23.143 UTC [3242608] DETAIL: Last completed transaction was at log time 2024-01-17 05:08:22.920244+00. TRAP: failed Assert("epoch > 0"), File: "../pgsql/src/backend/access/transam/twophase.c", Line: 969, PID: 3242610 postgres: standby_1: startup recovering 00000001000000000000000C(ExceptionalCondition+0x83)[0x55746c7838c1] postgres: standby_1: startup recovering 00000001000000000000000C(+0x194f0e)[0x55746c371f0e] postgres: standby_1: startup recovering 00000001000000000000000C(StandbyTransactionIdIsPrepared+0x29)[0x55746c373120] postgres: standby_1: startup recovering 00000001000000000000000C(StandbyReleaseOldLocks+0x3f)[0x55746c621357] postgres: standby_1: startup recovering 00000001000000000000000C(ProcArrayApplyRecoveryInfo+0x50)[0x55746c61bbb5] postgres: standby_1: startup recovering 00000001000000000000000C(standby_redo+0xe1)[0x55746c621490] postgres: standby_1: startup recovering 00000001000000000000000C(PerformWalRecovery+0xa5e)[0x55746c392404] postgres: standby_1: startup recovering 00000001000000000000000C(StartupXLOG+0x3ac)[0x55746c3862b8] postgres: standby_1: startup recovering 00000001000000000000000C(StartupProcessMain+0xd9)[0x55746c5a60f6] postgres: standby_1: startup recovering 00000001000000000000000C(AuxiliaryProcessMain+0x172)[0x55746c59bbdd] postgres: standby_1: startup recovering 00000001000000000000000C(+0x3c4235)[0x55746c5a1235] postgres: standby_1: startup recovering 00000001000000000000000C(PostmasterMain+0x1401)[0x55746c5a5a10] postgres: standby_1: startup recovering 00000001000000000000000C(main+0x835)[0x55746c4e90ce] /lib/x86_64-linux-gnu/libc.so.6(+0x276ca)[0x7f67bbb846ca] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f67bbb84785] postgres: standby_1: startup recovering 00000001000000000000000C(_start+0x21)[0x55746c2b61d1] This refers to the following in twophase.c with AdjustToFullTransactionId(): nextXid = XidFromFullTransactionId(nextFullXid); epoch = EpochFromFullTransactionId(nextFullXid); if (unlikely(xid > nextXid)) { /* Wraparound occurred, must be from a prev epoch. */ Assert(epoch > 0); epoch--; } This would mean that we've found a way to get a negative epoch, which should not be possible. Alexander, you have added this code in 5a1dfde8334b when switching the 2PC file names to use FullTransactionIds. Could you check please? -- Michael
Attachment
Re: Assertion failure with epoch when replaying standby records for 2PC
From
Alexander Korotkov
Date:
Hi, Michael! On Wed, Jan 17, 2024 at 7:47 AM Michael Paquier <michael@paquier.xyz> wrote: > rorqual has failed today with a very interesting failure: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-01-17%2005%3A06%3A31 > > This has caused an assertion failure for a 2PC transaction when > replaying one of the tests from the main regression suite: > 2024-01-17 05:08:23.143 UTC [3242608] DETAIL: Last completed transaction was at log time 2024-01-17 05:08:22.920244+00. > TRAP: failed Assert("epoch > 0"), File: "../pgsql/src/backend/access/transam/twophase.c", Line: 969, PID: 3242610 > postgres: standby_1: startup recovering 00000001000000000000000C(ExceptionalCondition+0x83)[0x55746c7838c1] > postgres: standby_1: startup recovering 00000001000000000000000C(+0x194f0e)[0x55746c371f0e] > postgres: standby_1: startup recovering 00000001000000000000000C(StandbyTransactionIdIsPrepared+0x29)[0x55746c373120] > postgres: standby_1: startup recovering 00000001000000000000000C(StandbyReleaseOldLocks+0x3f)[0x55746c621357] > postgres: standby_1: startup recovering 00000001000000000000000C(ProcArrayApplyRecoveryInfo+0x50)[0x55746c61bbb5] > postgres: standby_1: startup recovering 00000001000000000000000C(standby_redo+0xe1)[0x55746c621490] > postgres: standby_1: startup recovering 00000001000000000000000C(PerformWalRecovery+0xa5e)[0x55746c392404] > postgres: standby_1: startup recovering 00000001000000000000000C(StartupXLOG+0x3ac)[0x55746c3862b8] > postgres: standby_1: startup recovering 00000001000000000000000C(StartupProcessMain+0xd9)[0x55746c5a60f6] > postgres: standby_1: startup recovering 00000001000000000000000C(AuxiliaryProcessMain+0x172)[0x55746c59bbdd] > postgres: standby_1: startup recovering 00000001000000000000000C(+0x3c4235)[0x55746c5a1235] > postgres: standby_1: startup recovering 00000001000000000000000C(PostmasterMain+0x1401)[0x55746c5a5a10] > postgres: standby_1: startup recovering 00000001000000000000000C(main+0x835)[0x55746c4e90ce] > /lib/x86_64-linux-gnu/libc.so.6(+0x276ca)[0x7f67bbb846ca] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f67bbb84785] > postgres: standby_1: startup recovering 00000001000000000000000C(_start+0x21)[0x55746c2b61d1] > > This refers to the following in twophase.c with > AdjustToFullTransactionId(): > nextXid = XidFromFullTransactionId(nextFullXid); > epoch = EpochFromFullTransactionId(nextFullXid); > > if (unlikely(xid > nextXid)) > { > /* Wraparound occurred, must be from a prev epoch. */ > Assert(epoch > 0); > epoch--; > } > > This would mean that we've found a way to get a negative epoch, which > should not be possible. > > Alexander, you have added this code in 5a1dfde8334b when switching the > 2PC file names to use FullTransactionIds. Could you check please? Thank you for reporting! I'm going to look at this in the next couple of days. ------ Regards, Alexander Korotkov
Re: Assertion failure with epoch when replaying standby records for 2PC
From
Alexander Korotkov
Date:
On Wed, Jan 17, 2024 at 11:08 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > On Wed, Jan 17, 2024 at 7:47 AM Michael Paquier <michael@paquier.xyz> wrote: > > rorqual has failed today with a very interesting failure: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-01-17%2005%3A06%3A31 > > > > This has caused an assertion failure for a 2PC transaction when > > replaying one of the tests from the main regression suite: > > 2024-01-17 05:08:23.143 UTC [3242608] DETAIL: Last completed transaction was at log time 2024-01-17 05:08:22.920244+00. > > TRAP: failed Assert("epoch > 0"), File: "../pgsql/src/backend/access/transam/twophase.c", Line: 969, PID: 3242610 > > postgres: standby_1: startup recovering 00000001000000000000000C(ExceptionalCondition+0x83)[0x55746c7838c1] > > postgres: standby_1: startup recovering 00000001000000000000000C(+0x194f0e)[0x55746c371f0e] > > postgres: standby_1: startup recovering 00000001000000000000000C(StandbyTransactionIdIsPrepared+0x29)[0x55746c373120] > > postgres: standby_1: startup recovering 00000001000000000000000C(StandbyReleaseOldLocks+0x3f)[0x55746c621357] > > postgres: standby_1: startup recovering 00000001000000000000000C(ProcArrayApplyRecoveryInfo+0x50)[0x55746c61bbb5] > > postgres: standby_1: startup recovering 00000001000000000000000C(standby_redo+0xe1)[0x55746c621490] > > postgres: standby_1: startup recovering 00000001000000000000000C(PerformWalRecovery+0xa5e)[0x55746c392404] > > postgres: standby_1: startup recovering 00000001000000000000000C(StartupXLOG+0x3ac)[0x55746c3862b8] > > postgres: standby_1: startup recovering 00000001000000000000000C(StartupProcessMain+0xd9)[0x55746c5a60f6] > > postgres: standby_1: startup recovering 00000001000000000000000C(AuxiliaryProcessMain+0x172)[0x55746c59bbdd] > > postgres: standby_1: startup recovering 00000001000000000000000C(+0x3c4235)[0x55746c5a1235] > > postgres: standby_1: startup recovering 00000001000000000000000C(PostmasterMain+0x1401)[0x55746c5a5a10] > > postgres: standby_1: startup recovering 00000001000000000000000C(main+0x835)[0x55746c4e90ce] > > /lib/x86_64-linux-gnu/libc.so.6(+0x276ca)[0x7f67bbb846ca] > > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f67bbb84785] > > postgres: standby_1: startup recovering 00000001000000000000000C(_start+0x21)[0x55746c2b61d1] > > > > This refers to the following in twophase.c with > > AdjustToFullTransactionId(): > > nextXid = XidFromFullTransactionId(nextFullXid); > > epoch = EpochFromFullTransactionId(nextFullXid); > > > > if (unlikely(xid > nextXid)) > > { > > /* Wraparound occurred, must be from a prev epoch. */ > > Assert(epoch > 0); > > epoch--; > > } > > > > This would mean that we've found a way to get a negative epoch, which > > should not be possible. > > > > Alexander, you have added this code in 5a1dfde8334b when switching the > > 2PC file names to use FullTransactionIds. Could you check please? > > Thank you for reporting! I'm going to look at this in the next couple of days. Oh, that is a forgotten piece I've already discovered. https://www.postgresql.org/message-id/CAPpHfdv%3DVahovNqJHBqr0ejHvx%3DeDuGYySC48Wcvp%2BGDxYLCJg%40mail.gmail.com I'm going to do some additional checks and push. ------ Regards, Alexander Korotkov