
From Tomas Vondra
Subject Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
Date
Msg-id 7950f8c6-3fa9-ad32-b040-7024d8eb1bea@enterprisedb.com
In response to Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
List pgsql-hackers
On 9/4/22 14:24, Tomas Vondra wrote:
> 
> 
> On 9/4/22 13:49, Amit Kapila wrote:
>> On Sun, Sep 4, 2022 at 4:34 PM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> I've been running some valgrind tests on rpi4/aarch64, and I get a crash
>>> in the test_decoding ddl test in ~50% of runs. I don't see the same
>>> failure without valgrind or on a 32-bit system (hundreds of runs, no
>>> crashes), so I suspect this is a race condition, and valgrind changes
>>> the timing in a way that makes it more likely.
>>>
>>> The crash always happens in the "ddl" test. The backtrace always looks
>>> like this:
>>>
>>>     (ExceptionalCondition+0x98)[0x8f6f7c]
>>>     (+0x57a7ec)[0x6827ec]
>>>     (+0x579edc)[0x681edc]
>>>     (ReorderBufferAddNewTupleCids+0x60)[0x686758]
>>>     (SnapBuildProcessNewCid+0x94)[0x68b920]
>>>     (heap2_decode+0x17c)[0x671584]
>>>     (LogicalDecodingProcessRecord+0xbc)[0x670cd0]
>>>     (+0x570f88)[0x678f88]
>>>     (pg_logical_slot_get_changes+0x1c)[0x6790fc]
>>>     (ExecMakeTableFunctionResult+0x29c)[0x4a92c0]
>>>     (+0x3be638)[0x4c6638]
>>>     (+0x3a2c14)[0x4aac14]
>>>     (ExecScan+0x8c)[0x4aaca8]
>>>     (+0x3bea14)[0x4c6a14]
>>>     (+0x39ea60)[0x4a6a60]
>>>     (+0x392378)[0x49a378]
>>>     (+0x39520c)[0x49d20c]
>>>     (standard_ExecutorRun+0x214)[0x49aad8]
>>>     (ExecutorRun+0x64)[0x49a8b8]
>>>     (+0x62f53c)[0x73753c]
>>>     (PortalRun+0x27c)[0x737198]
>>>     (+0x627e78)[0x72fe78]
>>>     (PostgresMain+0x9a0)[0x73512c]
>>>     (+0x547be8)[0x64fbe8]
>>>     (+0x547540)[0x64f540]
>>>     (+0x542d30)[0x64ad30]
>>>     (PostmasterMain+0x1460)[0x64a574]
>>>     (+0x418888)[0x520888]
>>>
>>> I'm unable to get a better backtrace from the valgrind-produced core
>>> using gdb, for some reason.
>>>
>>> However, I've modified AssertTXNLsnOrder() - which is where the assert
>>> is checked - to also dump toplevel_by_lsn instead of just triggering the
>>> assert, and the result is always like this:
>>>
>>>     WARNING:  ==============================================
>>>     WARNING:  txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0
>>>     WARNING:  txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0
>>>     WARNING:  ==============================================
>>>
>>> The LSNs change a bit between the runs, but the failing transactions are
>>> always 848 and 849. Also, both transactions have exactly the same info.
>>>
>>> But the very first WAL record for 849 is
>>>
>>>     ASSIGNMENT xtop 848: subxacts: 849
>>>
>>> so it's strange 849 is in the toplevel_by_lsn list at all, because it
>>> clearly is a subxact of 848.
>>>
>>
>> There is no guarantee that toplevel_by_lsn won't have a subxact.
> 
> I don't think that's quite true - toplevel_by_lsn should not contain any
> *known* subxacts. Yes, we may initially add a subxact to the list, but
> as soon as we get the assignment record, it should be removed. See what
> ReorderBufferAssignChild does.
> 
> And in this case the ASSIGNMENT is the first WAL record we get for 849
> (in fact, isn't that guaranteed since 7259736a6e?), so we know from the
> very beginning that 849 is a subxact.
> 
> 
>> As per
>> my understanding, the problem I reported in the email [1] is the same
>> and we have seen this in BF failures as well. I posted a way to
>> reproduce it in that email. It seems this is possible if the decoding
>> gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
>> subtransaction) after XLOG_RUNNING_XACTS.
>>
> 
> Interesting. That's certainly true for WAL in the crashing case:
> 
> rmgr: Standby     len (rec/tot):     58/    58, tx:          0, lsn:
> 0/01CDCD70, prev 0/01CDCD10, desc: RUNNING_XACTS nextXid 850
> latestCompletedXid 847 oldestRunningXid 848; 1 xacts: 848
> rmgr: Heap2       len (rec/tot):     60/    60, tx:        849, lsn:
> 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel 1663/16384/1249; tid
> 58/38; cmin: 1, cmax: 14, combo: 6
> 

I investigated using the pgdata from the crashed run (I can provide it,
if you have an rpi4 or some other aarch64 machine), and the reason is
pretty simple - the restart_lsn for the slot is 0/1CDCD70, which is long
after the subxact assignment, so decoding never sees the assignment and
we add both xids as toplevel transactions.
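
To spell out why that trips the assert: with no assignment decoded, each
of the two xids gets created as a top-level ReorderBufferTXN from the
same record, and the list ordering is then re-checked. Roughly (this is
paraphrased from reorderbuffer.c from memory, so the details may be off):

    /* in ReorderBufferTXNByXid(), when the xid is not known yet */
    if (create && !found)
    {
        txn = ReorderBufferGetTXN(rb);
        txn->xid = xid;
        txn->first_lsn = lsn;       /* LSN of the record being decoded */

        if (create_as_top)
        {
            /* both 848 and 849 end up here, with lsn = 0/1CDCDB0 */
            dlist_push_tail(&rb->toplevel_by_lsn, &txn->node);
            AssertTXNLsnOrder(rb);
        }
    }

    /* ... and AssertTXNLsnOrder() wants strictly increasing first_lsn: */
    if (prev_first_lsn != InvalidXLogRecPtr)
        Assert(prev_first_lsn < cur_txn->first_lsn);   /* the failing TRAP */

so the second entry, created with the same first_lsn, is what fires it.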

That seems broken - if we skip the assignment like this, doesn't that
break spill-to-disk and/or streaming? IIRC that's exactly why we had to
start logging assignments immediately with wal_level=logical.

Or maybe we're not handling the restart_lsn properly, and we should have
ignored those records - both xacts started long before the restart_lsn,
so we're not seeing the whole transaction anyway.

However, when processing the NEW_CID record:

tx:        849, lsn: 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel
1663/16384/1249; tid 58/38; cmin: 1, cmax: 14, combo: 6

we ultimately do this in SnapBuildProcessNewCid:

#1  0x0000005566cccdb4 in ReorderBufferAddNewTupleCids (rb=0x559dd64218,
xid=848, lsn=30264752, locator=..., tid=..., cmin=1, cmax=14,
combocid=6) at reorderbuffer.c:3218
#2  0x0000005566cd1f7c in SnapBuildProcessNewCid (builder=0x559dd6a248,
xid=849, lsn=30264752, xlrec=0x559dd6e1e0) at snapbuild.c:818

so in fact we *know* 849 is a subxact of 848, but we don't call
ReorderBufferAssignChild in this case. We can't even do the assignment
easily here, because we create the subxact first, and the crash happens
right when we attempt to create the toplevel one, so we never even get a
chance to do the assignment:

1) process the NEW_CID record, logged for 849 (the subxact), which
creates its ReorderBufferTXN first
2) process the CIDs in the WAL record, which carry the toplevel xid 848,
and creating that second ReorderBufferTXN is what trips the assert
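
To make the ordering explicit, SnapBuildProcessNewCid() does roughly the
following (paraphrased from snapbuild.c; the exact field and parameter
names are from memory and may not match):

    void
    SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
                           XLogRecPtr lsn, xl_heap_new_cid *xlrec)
    {
        /*
         * (1) creates the ReorderBufferTXN for the record's xid (849,
         * the subxact) - as a toplevel one, since no assignment is known
         */
        ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);

        /*
         * (2) adds the CIDs under xlrec->top_xid (848), which creates a
         * second "toplevel" entry with the same first_lsn - and that is
         * where AssertTXNLsnOrder() blows up
         */
        ReorderBufferAddNewTupleCids(builder->reorder, xlrec->top_xid, lsn,
                                     xlrec->target_locator, xlrec->target_tid,
                                     xlrec->cmin, xlrec->cmax,
                                     xlrec->combocid);

        /* ... followed by the command-id bookkeeping ... */
    }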


So IMHO we need to figure out what to do for WAL records that reference
(and thus create) both the toplevel xact and the subxact - either we
need to skip them, or rethink how we create the ReorderBufferTXN structs.
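
Just to illustrate the second option (not a patch, and I haven't tried
it), the shape would be to establish the relationship before either
ReorderBufferTXN exists, e.g. in SnapBuildProcessNewCid():

    /*
     * Illustrative only: the record itself tells us xid is a subxact of
     * top_xid, so record that assignment before creating anything, so
     * the subxact never ends up on toplevel_by_lsn on its own.
     */
    if (TransactionIdIsValid(xlrec->top_xid) && xlrec->top_xid != xid)
        ReorderBufferAssignChild(builder->reorder, xlrec->top_xid,
                                 xid, lsn);

    /* then proceed as today: SetCatalogChanges + AddNewTupleCids */

Whether that would actually be safe here, given the restart_lsn question
above, I'm not sure.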


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


