Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
Date
Msg-id CAA4eK1+NNjJMnvRtee_e7pxSv2vaK2=iT-dbJPG1kNN+ELm7Nw@mail.gmail.com
Whole thread Raw
In response to TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
List pgsql-hackers
On Sun, Sep 4, 2022 at 4:34 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I've been running some valgrind tests on rpi4/aarch64, and I get a crash
> in test_decoding ddl test in ~50% runs. I don't see the same failure
> without valgrind or on 32-bit system (hundreds of runs, no crashes), so
> I suspect this is a race condition, and with valgrind the timing changes
> in a way to make it more likely.
>
> The crash always happens in the "ddl" test. The backtrace always looks
> like this:
>
>     (ExceptionalCondition+0x98)[0x8f6f7c]
>     (+0x57a7ec)[0x6827ec]
>     (+0x579edc)[0x681edc]
>     (ReorderBufferAddNewTupleCids+0x60)[0x686758]
>     (SnapBuildProcessNewCid+0x94)[0x68b920]
>     (heap2_decode+0x17c)[0x671584]
>     (LogicalDecodingProcessRecord+0xbc)[0x670cd0]
>     (+0x570f88)[0x678f88]
>     (pg_logical_slot_get_changes+0x1c)[0x6790fc]
>     (ExecMakeTableFunctionResult+0x29c)[0x4a92c0]
>     (+0x3be638)[0x4c6638]
>     (+0x3a2c14)[0x4aac14]
>     (ExecScan+0x8c)[0x4aaca8]
>     (+0x3bea14)[0x4c6a14]
>     (+0x39ea60)[0x4a6a60]
>     (+0x392378)[0x49a378]
>     (+0x39520c)[0x49d20c]
>     (standard_ExecutorRun+0x214)[0x49aad8]
>     (ExecutorRun+0x64)[0x49a8b8]
>     (+0x62f53c)[0x73753c]
>     (PortalRun+0x27c)[0x737198]
>     (+0x627e78)[0x72fe78]
>     (PostgresMain+0x9a0)[0x73512c]
>     (+0x547be8)[0x64fbe8]
>     (+0x547540)[0x64f540]
>     (+0x542d30)[0x64ad30]
>     (PostmasterMain+0x1460)[0x64a574]
>     (+0x418888)[0x520888]
>
> I'm unable to get a better backtrace from the valgrind-produces core
> usign gdb, for some reason.
>
> However, I've modified AssertTXNLsnOrder() - which is where the assert
> is checked - to also dump toplevel_by_lsn instead of just triggering the
> assert, and the result is always like this:
>
>     WARNING:  ==============================================
>     WARNING:  txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0
>     WARNING:  txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0
>     WARNING:  ==============================================
>
> The LSNs change a bit between the runs, but the failing transactions are
> always 848 and 849. Also, both transactions have exactly the same info.
>
> But the very first WAL record for 849 is
>
>     ASSIGNMENT xtop 848: subxacts: 849
>
> so it's strange 849 is in the toplevel_by_lsn list at all, because it
> clearly is a subxact of 848.
>

There is no guarantee that toplevel_by_lsn won't have subxact. As per
my understanding, the problem I reported in the email [1] is the same
and we have seen this in BF failures as well. I posted a way to
reproduce it in that email. It seems this is possible if the decoding
gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
subtransaction) after XLOG_RUNNING_XACTS.

[1] - https://www.postgresql.org/message-id/CAA4eK1LK1nxOTL32OP%3DejhPoBsUP4Bvwb3Ly%3DfethyJ-KbaXyw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: pgsql: Add ALTER SUBSCRIPTION ... SKIP.
Next
From: Tomas Vondra
Date:
Subject: Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)