On Sun, Sep 4, 2022 at 11:10 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 9/4/22 16:08, Tomas Vondra wrote:
> > ...
> >
> > so in fact we *know* 849 is a subxact of 848, but we don't call
> > ReorderBufferAssignChild in this case. In fact we can't even do the
> > assignment easily in this case, because we create the subxact first, so
> > that the crash happens right when we attempt to create the toplevel one,
> > and we never even get a chance to do the assignment:
> >
> > 1) process the NEW_CID record, logged for 849 (subxact)
> > 2) process CIDs in the WAL record, which has topleve_xid 848
> >
> >
> > So IMHO we need to figure out what to do for WAL records that create
> > both the toplevel and subxact - either we need to skip them, or rethink
> > how we create the ReorderBufferTXN structs.
> >
>
> This fixes the crash for me, by adding a ReorderBufferAssignChild call
> to SnapBuildProcessNewCid, and tweaking ReorderBufferAssignChild to
> ensure we don't try to create the top xact before updating the subxact
> and removing it from the toplevel_by_lsn list.
>
> Essentially, what's happening is this:
>
> 1) We read the NEW_CID record, which is logged with XID 849, i.e. the
> subxact. But we don't know it's a subxact, so we create it as a
> top-level xact with the LSN.
>
> 2) We start processing contents of the NEW_CID, which however has info
> that 849 is subxact of 848, calls ReorderBufferAddNewTupleCids which
> promptly does ReorderBufferTXNByXid() with the top-level XID, which
> creates it with the same LSN, and crashes because of the assert.
>
> I'm not sure what's the right/proper way to fix this ...
>
> The problem is ReorderBufferAssignChild was coded in a way that did not
> expect the subxact to be created first (as a top-level xact).
>
I think there was a previously hard-coded way to detect that and we
have removed it in commit
(https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e3ff789acfb2754cd7b5e87f6f4463fd08e35996).
I think it is possible that subtransaction gets logged without
previous top-level txn record as shown in the commit shared.
--
With Regards,
Amit Kapila.