Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
Date
Msg-id CAA4eK1Kro0rej=ZXMhcdjs+aYsZvNywu3-cqdRUtyAp4zqpVWw@mail.gmail.com
Whole thread Raw
In response to Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On Wed, Jun 7, 2023 at 6:18 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 6/6/23 17:42, Tomas Vondra wrote:
> >
>
> In investigated this a bit more, and the problem actually seems to be
> more like this:
>
> 1) we create a new logical replication slot
>
> 2) while building the initial snapshot, we start with current insert
> location, and then process records
>
> 3) for RUNNING_XACTS we call SnapBuildProcessRunningXacts, which calls
> SnapBuildFindSnapshot
>
> 4) SnapBuildFindSnapshot does this:
>
>     else if (!builder->building_full_snapshot &&
>              SnapBuildRestore(builder, lsn))
>     {
>         /* there won't be any state to cleanup */
>         return false;
>     }
>
> 5) because create_logical_replication_slot and get_changes both call
> CreateInitDecodingContext with needs_full_snapshot=false, we end up
> calling SnapBuildRestore()
>
> 6) once in a while this likely hits a snapshot created by a concurrent
> session (for another logical slot) with SNAPBUILD_CONSISTENT state
>

I think this analysis doesn't seem to match what you mentioned in the
previous email which is as follows:
> > However, we skip the assignment, because the log for this call of
> > get_changes says this:
> >
> >    LOG:  logical decoding found consistent point at 0/6462E5D8
> >
> > so we fail to realize the 26662 is a subxact.

This is because the above LOG is printed when
"running->oldestRunningXid == running->nextXid" not when we restore
the snapshot. Am, I missing something?

>
> I don't know what's the correct fix for this. Maybe we should set
> needs_full_snapshot=true in create_logical_replication_slot when
> creating the initial snapshot? Maybe we should use true even in
> pg_logical_slot_get_changes_guts? This seems to fix the crashes ...
>

I don't think that is advisable because setting "needs_full_snapshot"
to true for decoding means the snapshot will start tracking
non-catalog committed xacts as well which is costly and is not
required for this case.

> That'll prevent reading the serialized snapshots like this, but how
> could that ever work? It seems pretty much guaranteed to ignore any
> assignments that happened right before the snapshot?
>

This part needs some analysis/thoughts. BTW, do you mean that it skips
the assignment (a) because the assignment record is before we reach a
consistent point, or (b) because we start reading WAL after the
assignment, or (c) something else? If you intend to say (a) then can
you please point me to the code you are referring to for the same?

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Dilip Kumar
Date:
Subject: Re: Let's make PostgreSQL multi-threaded
Next
From: Jaime Casanova
Date:
Subject: is pg_log_standby_snapshot() really needed?