Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) |
Date | |
Msg-id | b3d74bba-743f-27fe-d098-841c8eb64562@enterprisedb.com Whole thread Raw |
In response to | Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
|
List | pgsql-hackers |
On 9/4/22 13:49, Amit Kapila wrote: > On Sun, Sep 4, 2022 at 4:34 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> I've been running some valgrind tests on rpi4/aarch64, and I get a crash >> in test_decoding ddl test in ~50% runs. I don't see the same failure >> without valgrind or on 32-bit system (hundreds of runs, no crashes), so >> I suspect this is a race condition, and with valgrind the timing changes >> in a way to make it more likely. >> >> The crash always happens in the "ddl" test. The backtrace always looks >> like this: >> >> (ExceptionalCondition+0x98)[0x8f6f7c] >> (+0x57a7ec)[0x6827ec] >> (+0x579edc)[0x681edc] >> (ReorderBufferAddNewTupleCids+0x60)[0x686758] >> (SnapBuildProcessNewCid+0x94)[0x68b920] >> (heap2_decode+0x17c)[0x671584] >> (LogicalDecodingProcessRecord+0xbc)[0x670cd0] >> (+0x570f88)[0x678f88] >> (pg_logical_slot_get_changes+0x1c)[0x6790fc] >> (ExecMakeTableFunctionResult+0x29c)[0x4a92c0] >> (+0x3be638)[0x4c6638] >> (+0x3a2c14)[0x4aac14] >> (ExecScan+0x8c)[0x4aaca8] >> (+0x3bea14)[0x4c6a14] >> (+0x39ea60)[0x4a6a60] >> (+0x392378)[0x49a378] >> (+0x39520c)[0x49d20c] >> (standard_ExecutorRun+0x214)[0x49aad8] >> (ExecutorRun+0x64)[0x49a8b8] >> (+0x62f53c)[0x73753c] >> (PortalRun+0x27c)[0x737198] >> (+0x627e78)[0x72fe78] >> (PostgresMain+0x9a0)[0x73512c] >> (+0x547be8)[0x64fbe8] >> (+0x547540)[0x64f540] >> (+0x542d30)[0x64ad30] >> (PostmasterMain+0x1460)[0x64a574] >> (+0x418888)[0x520888] >> >> I'm unable to get a better backtrace from the valgrind-produces core >> usign gdb, for some reason. >> >> However, I've modified AssertTXNLsnOrder() - which is where the assert >> is checked - to also dump toplevel_by_lsn instead of just triggering the >> assert, and the result is always like this: >> >> WARNING: ============================================== >> WARNING: txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0 >> WARNING: txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0 >> WARNING: ============================================== >> >> The LSNs change a bit between the runs, but the failing transactions are >> always 848 and 849. Also, both transactions have exactly the same info. >> >> But the very first WAL record for 849 is >> >> ASSIGNMENT xtop 848: subxacts: 849 >> >> so it's strange 849 is in the toplevel_by_lsn list at all, because it >> clearly is a subxact of 848. >> > > There is no guarantee that toplevel_by_lsn won't have subxact. I don't think that's quite true - toplevel_by_lsn should not contain any *known* subxacts. Yes, we may initially add a subxact to the list, but as soon as we get assignment record, it should be removed. See what ReorderBufferAssignChild does. And in this case the ASSIGNMENT is the first WAL record we get for 849 (in fact, isn't that guaranteed since 7259736a6e?), so we know from the very beginning 849 is a subxact. > As per > my understanding, the problem I reported in the email [1] is the same > and we have seen this in BF failures as well. I posted a way to > reproduce it in that email. It seems this is possible if the decoding > gets XLOG_HEAP2_NEW_CID as the first record (belonging to a > subtransaction) after XLOG_RUNNING_XACTS. > Interesting. That's certainly true for WAL in the crashing case: rmgr: Standby len (rec/tot): 58/ 58, tx: 0, lsn: 0/01CDCD70, prev 0/01CDCD10, desc: RUNNING_XACTS nextXid 850 latestCompletedXid 847 oldestRunningXid 848; 1 xacts: 848 rmgr: Heap2 len (rec/tot): 60/ 60, tx: 849, lsn: 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel 1663/16384/1249; tid 58/38; cmin: 1, cmax: 14, combo: 6 regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: