[BUG] FailedAssertion in SnapBuildPurgeOlderTxn - Mailing list pgsql-hackers

From Maxim Orlov
Subject [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
Date
Msg-id CACG=ezZoz_KG+Ryh9MrU_g5e0HiVoHocEvqFF=NRrhrwKmEQJQ@mail.gmail.com
Whole thread Raw
Responses Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
List pgsql-hackers
Hi!


PREAMBLE

For a last couple of months, I stumbled into a problem while running tests on ARM (Debain, aarch64) and some more wired platforms
for my 64–bit XIDs patch set. Test contrib/test_decoding (catalog_change_snapshot) rarely failed with the next message:

TRAP: FailedAssertion("TransactionIdIsNormal(InitialRunningXacts[0]) && TransactionIdIsNormal(builder->xmin)", File: "snapbuild.c"

I have plenty of failing on ARM, couple on x86 and none (if memory serves) on x86–64.

At first, my thought was to blame my 64–bit XID patch for what, but this is not the case. This error persist from PG15 to PG10
without any patch applied. Though hard to reproduce.


PROBLEM

After some investigation, I think, the problem is in the snapbuild.c (commit 272248a0c1b1, see [0]). We do allocate InitialRunningXacts
array in the context of builder->context, but for the time when we call SnapBuildPurgeOlderTxn this context may be already free'd. Thus,
InitialRunningXacts array become array of 2139062143 (i.e. 0x7F7F7F7F) values. This is not caused buildfarm to fail due to that code:

if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
                                 builder->xmin))
    return;

Since the cluster is initialised with XID way less than 0x7F7F7F7F, we get to return here, but the problem is still existing.
I've attached the patch based on branch REL_15_STABLE to reproduce the problem on x86-64.

On my patch set of 64–bit XID's this problem manifested since I do init cluster with XID far beyond 32–bit bound.

Alternatively, I did try to use my patch [1] to init cluster with first transaction 2139062143 (i.e. 0x7F7F7F7F).
Then put pg_sleep call just like in the attached patch:
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -968,6 +968,8 @@ SnapBuildPurgeOlderTxn(SnapBuild *builder)
        if (NInitialRunningXacts == 0)
                return;

+       pg_usleep(1000000L * 2L);
+
        /* bound check if there is at least one transaction to remove */
        if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
                                                                         builder->xmin))

Run installcheck-force for many times for a test_decoding/catalog_change_snapshot's and got a segfault.


CONCLUSION

In snapbuild.c, context allocated array InitialRunningXacts may be free'd, this caused assertion failed (at best) or
may lead to the more serious problems.


P.S.

Simple fix like:
@@ -1377,7 +1379,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
                 * changes. See SnapBuildXidSetCatalogChanges.
                 */
                NInitialRunningXacts = nxacts;
-               InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+               InitialRunningXacts = MemoryContextAlloc(TopMemoryContext, sz);
                memcpy(InitialRunningXacts, running->xids, sz);
                qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);

seems to solve the described problem, but I'm not in the context of [0] and why array is allocated in builder->context.



--
Best regards,
Maxim Orlov.
Attachment

pgsql-hackers by date:

Previous
From: Amit Langote
Date:
Subject: Re: ExecRTCheckPerms() and many prunable partitions
Next
From: Maxim Orlov
Date:
Subject: Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns