TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) - Mailing list pgsql-hackers

From Tomas Vondra
Subject TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
Date
Msg-id a89b46b6-0239-2fd5-71a9-b19b1f7a7145@enterprisedb.com
Whole thread Raw
Responses Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
List pgsql-hackers
Hi,

I've been running some valgrind tests on rpi4/aarch64, and I get a crash
in test_decoding ddl test in ~50% runs. I don't see the same failure
without valgrind or on 32-bit system (hundreds of runs, no crashes), so
I suspect this is a race condition, and with valgrind the timing changes
in a way to make it more likely.

The crash always happens in the "ddl" test. The backtrace always looks
like this:

    (ExceptionalCondition+0x98)[0x8f6f7c]
    (+0x57a7ec)[0x6827ec]
    (+0x579edc)[0x681edc]
    (ReorderBufferAddNewTupleCids+0x60)[0x686758]
    (SnapBuildProcessNewCid+0x94)[0x68b920]
    (heap2_decode+0x17c)[0x671584]
    (LogicalDecodingProcessRecord+0xbc)[0x670cd0]
    (+0x570f88)[0x678f88]
    (pg_logical_slot_get_changes+0x1c)[0x6790fc]
    (ExecMakeTableFunctionResult+0x29c)[0x4a92c0]
    (+0x3be638)[0x4c6638]
    (+0x3a2c14)[0x4aac14]
    (ExecScan+0x8c)[0x4aaca8]
    (+0x3bea14)[0x4c6a14]
    (+0x39ea60)[0x4a6a60]
    (+0x392378)[0x49a378]
    (+0x39520c)[0x49d20c]
    (standard_ExecutorRun+0x214)[0x49aad8]
    (ExecutorRun+0x64)[0x49a8b8]
    (+0x62f53c)[0x73753c]
    (PortalRun+0x27c)[0x737198]
    (+0x627e78)[0x72fe78]
    (PostgresMain+0x9a0)[0x73512c]
    (+0x547be8)[0x64fbe8]
    (+0x547540)[0x64f540]
    (+0x542d30)[0x64ad30]
    (PostmasterMain+0x1460)[0x64a574]
    (+0x418888)[0x520888]

I'm unable to get a better backtrace from the valgrind-produces core
usign gdb, for some reason.

However, I've modified AssertTXNLsnOrder() - which is where the assert
is checked - to also dump toplevel_by_lsn instead of just triggering the
assert, and the result is always like this:

    WARNING:  ==============================================
    WARNING:  txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0
    WARNING:  txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0
    WARNING:  ==============================================

The LSNs change a bit between the runs, but the failing transactions are
always 848 and 849. Also, both transactions have exactly the same info.

But the very first WAL record for 849 is

    ASSIGNMENT xtop 848: subxacts: 849

so it's strange 849 is in the toplevel_by_lsn list at all, because it
clearly is a subxact of 848.

Furthermore, the WAL is almost exactly the same in both cases. Attached
are two dumps from a failed and successful run (only the part related to
these two xids is included). There are very few differences - there is a
PRUNE in the failed case, and a LOCK / RUNNING_XACTS moved a bit.


Any ideas?

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

pgsql-hackers by date:

Previous
From: Zhihong Yu
Date:
Subject: Re: freeing LDAPMessage in CheckLDAPAuth
Next
From: Amit Kapila
Date:
Subject: Re: pgsql: Add ALTER SUBSCRIPTION ... SKIP.