RE: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) - Mailing list pgsql-hackers
From | Zhijie Hou (Fujitsu) |
---|---|
Subject | RE: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) |
Date | |
Msg-id | OS0PR01MB57162DAF28227A05581957519455A@OS0PR01MB5716.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
RE: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
|
List | pgsql-hackers |
On Tuesday, June 13, 2023 12:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jun 7, 2023 at 6:02 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > > > Well, I think the issue is pretty clear - we end up with an initial > > snapshot that's in between the ASSIGNMENT and NEW_CID, and because > > NEW_CID has both xact and subxact XID it fails because we add two TXNs > > with the same LSN, not realizing one of them is subxact. > > > > That's obviously wrong, although somewhat benign in production because > > it only fails because of hitting an assert. > > > > Doesn't this indicate that we can end up decoding a partial transaction when > we restore a snapshot? Won't that be a problem even for production? Yes, I think it can cause the problem that only partial changes of a transaction are streamed. I tried to reproduce this and here are the steps. Note, to make sure the test won't be affected by other running_xact WALs, I changed LOG_SNAPSHOT_INTERVAL_MS to a bigger number. session 1: ----- create table test(a int); SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_1', 'test_decoding'); ----- session 2: ----- - Start a transaction BEGIN; INSERT INTO test VALUES(1); ----- session 3: ----- - Create another slot isolation_slot_2, it should choose a restart_point which is - after the changes that happened in session 2. Note, to let the current slot - restore another snapshot, we need to use gdb to block the current backend at - SnapBuildFindSnapshot(), the backend should have logged the running_xacts WAL - before reaching SnapBuildFindSnapshot. SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_2', 'test_decoding'); ----- session 1: ----- - Since there is a running_xacts which session 3 logged, the current backend will - serialize the snapshot when decoding the running_xacts WAL, and the snapshot - can be used by other slots (e.g. isolation_slot_2) SELECT data FROM pg_logical_slot_get_changes('isolation_slot_1', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); ----- session 2: ----- - Insert some different data and commit the transaction. INSERT INTO test VALUES(2); INSERT INTO test VALUES(3); INSERT INTO test VALUES(4); COMMIT ----- session 3: ----- - Release the process and try to stream the changes, since the restart point is - at the middle of the transaction, it will stream partial changes of the - transaction which was committed in session 2: SELECT data FROM pg_logical_slot_get_changes('isolation_slot_2', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0'); ----- Results (partial streamed changes): postgres=# SELECT data FROM pg_logical_slot_get_changes('isolation_slot_2', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids','0'); data ----------------------------------------- BEGIN table public.test: INSERT: a[integer]:2 table public.test: INSERT: a[integer]:3 table public.test: INSERT: a[integer]:4 COMMIT (5 rows) > > > Regular builds are likely to > > just ignore it, although I haven't checked if the COMMIT cleanup (I > > wonder if we remove the subxact from the toplevel list on commit). > > > > I think the problem is we just grab an existing snapshot, before all > > running xacts complete. Maybe we should fix that, and leave the > > needs_full_snapshot alone. > > > > It is not clear what exactly you have in mind to fix this because if there is no > running xact, we don't even need to restore the snapshot because of a prior > check "if (running->oldestRunningXid == > running->nextXid)". I think the main problem is that we started > decoding immediately from the point where we restored a snapshot as at that > point we could have some partial running xacts. Best Regards, Hou zj
pgsql-hackers by date: