Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
Date
Msg-id ffe91d51-50d5-dc3d-1e51-40bac132101b@enterprisedb.com
Whole thread Raw
In response to Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers

On 9/5/22 08:35, Amit Kapila wrote:
> On Sun, Sep 4, 2022 at 11:10 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 9/4/22 16:08, Tomas Vondra wrote:
>>> ...
>>>
>>> so in fact we *know* 849 is a subxact of 848, but we don't call
>>> ReorderBufferAssignChild in this case. In fact we can't even do the
>>> assignment easily in this case, because we create the subxact first, so
>>> that the crash happens right when we attempt to create the toplevel one,
>>> and we never even get a chance to do the assignment:
>>>
>>> 1) process the NEW_CID record, logged for 849 (subxact)
>>> 2) process CIDs in the WAL record, which has topleve_xid 848
>>>
>>>
>>> So IMHO we need to figure out what to do for WAL records that create
>>> both the toplevel and subxact - either we need to skip them, or rethink
>>> how we create the ReorderBufferTXN structs.
>>>
>>
>> This fixes the crash for me, by adding a ReorderBufferAssignChild call
>> to SnapBuildProcessNewCid, and tweaking ReorderBufferAssignChild to
>> ensure we don't try to create the top xact before updating the subxact
>> and removing it from the toplevel_by_lsn list.
>>
>> Essentially, what's happening is this:
>>
>> 1) We read the NEW_CID record, which is logged with XID 849, i.e. the
>> subxact. But we don't know it's a subxact, so we create it as a
>> top-level xact with the LSN.
>>
>> 2) We start processing contents of the NEW_CID, which however has info
>> that 849 is subxact of 848, calls ReorderBufferAddNewTupleCids which
>> promptly does ReorderBufferTXNByXid() with the top-level XID, which
>> creates it with the same LSN, and crashes because of the assert.
>>
>> I'm not sure what's the right/proper way to fix this ...
>>
>> The problem is ReorderBufferAssignChild was coded in a way that did not
>> expect the subxact to be created first (as a top-level xact).
>>
> 
> I think there was a previously hard-coded way to detect that and we
> have removed it in commit
> (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e3ff789acfb2754cd7b5e87f6f4463fd08e35996).
> I think it is possible that subtransaction gets logged without
> previous top-level txn record as shown in the commit shared.
> 

Well, yes and no.

This wouldn't detect the issue, because the assert happens in the first
ReorderBufferTXNByXid(), so it's still crash (in assert-enabled build,
at least).

Maybe removing the assumption was the wrong thing, and we should have
changed the code so that we don't violate it? That's kinda what my "fix"
does, in a way.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Column Filtering in Logical Replication
Next
From: Andrey Lepikhov
Date:
Subject: Re: [POC] Allow flattening of subquery with a link to upper query