Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
Date
Msg-id 07301efe-2f6f-145c-dbfd-20991c89f586@enterprisedb.com
Whole thread Raw
In response to RE: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)  ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>)
List pgsql-hackers

On 6/14/23 15:39, Zhijie Hou (Fujitsu) wrote:
> On Wednesday, June 14, 2023 5:05 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>> ...
>>
>> Also, can you try if we still stream the partial transaction with
>> create_logical_replication_slot building a full snapshot?
> 
> Yes, It can fix this problem because force create_logical_replication_slot
> build a full snapshot can avoid restoring the snapshot. But I am not sure if
> this is the best fix for this for the same reason(it's costly) mentioned by
> Amit[1].
> 

Costly compared to the current behavior? Sure, but considering the
current behavior is incorrect/broken, that seems rather irrelevant.
Incorrect behavior can be infinitely faster.

I doubt it's significantly costlier than just setting the "full
snapshot" flag when building the initial snapshot - sure, it will take
more time than now, but that's kinda the whole point. It seems to me the
problem is exactly that it *doesn't* wait long enough.

I may be misunderstanding the solution you proposed, but this:

    One idea to fix the partial change stream problem would be that we
    record all the running transaction's xid when restoring the snapshot
    in SnapBuildFindSnapshot(), and in the following decoding, we skip
    decoding changes for the recorded transaction

sounds pretty close to what building a correct snapshot actually does.

But maybe I'm wrong - ultimately, the way to compare those approaches
seems to be to prototype this proposal, and do some tests.

There's also the question of back branches, and it seems way simpler to
just flip a flag and disable broken optimization than doing fairly
complex stuff to save it.

I'd also point out that (a) this only affects the initial snapshot, not
every time we start the decoding context, and (b) the slots created from
walsender already do that with (unless when copy_data=false).

So if needs_full_snapshot=true fixes the issue, I'd just do that as the
first step - in master and backpatches. And then if we want to salvage
the optimization, we can try fixing it in master.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: "Tristan Partin"
Date:
Subject: Re: Use COPY for populating all pgbench tables
Next
From: "Tristan Partin"
Date:
Subject: Re: Use COPY for populating all pgbench tables