Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) |
Date | |
Msg-id | bedc421a-3cdd-ca5a-7b9e-bc6da989f500@enterprisedb.com Whole thread Raw |
In response to | Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) (Alexander Lakhin <exclusion@gmail.com>) |
Responses |
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
|
List | pgsql-hackers |
On 6/6/23 14:00, Alexander Lakhin wrote: > Hello Tomas, > > 06.06.2023 12:56, Tomas Vondra wrote: >> On 6/6/23 11:00, Alexander Lakhin wrote: >>> Hello, >>> ...> With the debug logging added inside AssertTXNLsnOrder() I see: >>> ctx->snapshot_builder->start_decoding_at: 209807224, >>> ctx->reader->EndRecPtr: 210043072, >>> SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0 >>> and inside the loop: >>> cur_txn->first_lsn: 209792872 >>> cur_txn->first_lsn: 209975744 >>> cur_txn->first_lsn: 210043008 >>> cur_txn->first_lsn: 210043008 >>> and it triggers the Assert. >>> >> So what's the prev_first_lsn value for these first_lsn values? How does >> it change over time? Did you try looking at the pg_waldump for these >> positions? > > With more logging I've got (for another run): > ReorderBufferTXNByXid| xid: 3397, lsn: c1fbc80 > > ctx->snapshot_builder->start_decoding_at: c1f2cc0, > ctx->reader->EndRecPtr: c1fbcc0, > SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0 > prev_first_lsn: 0, cur_txn->first_lsn: c1fbc80 > prev_first_lsn: c1fbc80, cur_txn->first_lsn: c1fbc80 > TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn") ... > > waldump for 00000001000000000000000C shows: > grep c1fbc80: > rmgr: Heap2 len (rec/tot): 60/ 60, tx: 3398, lsn: > 0/0C1FBC80, prev 0/0C1FBC50, desc: NEW_CID rel: 1663/18763/19987, tid: > 0/1, cmin: 1, cmax: 4294967295, combo: 4294967295 > rmgr: Heap len (rec/tot): 59/ 59, tx: 3398, lsn: > 0/0C1FBCC0, prev 0/0C1FBC80, desc: INSERT+INIT off: 1, flags: 0x08, > blkref #0: rel 1663/18763/19987 blk 0 > > grep '( 3397| 3398)' I've been able to reproduce this, after messing with the script a little bit (I had to skip the test_decoding regression tests, because that was complaining about slots already existing etc). Anyway, AssertTXNLsnOrder sees these two transactions (before aborting): 26662 0/6462E6F0 (first 0/0) 26661 0/6462E6F0 (first 0/6462E6F0) where 26661 is the top xact, 26662 is a subxact of 26661. This is clearly a problem, because we really should not have subxact in this list once the assignment gets applied. And the relevant WAL looks like this: --------------------------------------------------------------------- 26662, lsn: 0/645EDAA0, prev 0/645EDA60, desc: ASSIGNMENT xtop 26661: subxacts: 26662 26662, lsn: 0/645EDAD0, prev 0/645EDAA0, desc: INSERT+INIT off: 1, flags: 0x08, blkref #0: rel 1663/125447/126835 blk 0 ... 0, lsn: 0/6462E5D8, prev 0/6462E2A0, desc: RUNNING_XACTS nextXid 26673 latestCompletedXid 26672 oldestRunningXid 26661; 3 xacts: 26667 26661 26664; 3 subxacts: 26668 26662 26665 ... 26662, lsn: 0/6462E6F0, prev 0/6462E678, desc: NEW_CID rel: 1663/125447/126841, tid: 0/1, cmin: 1, cmax: 4294967295, combo: 4294967295 26662, lsn: 0/6462E730, prev 0/6462E6F0, desc: INSERT+INIT off: 1, flags: 0x08, blkref #0: rel 1663/125447/126841 blk 0 26661, lsn: 0/6462E770, prev 0/6462E730, desc: COMMIT 2023-06-06 16:41:24.442870 CEST; subxacts: 26662 --------------------------------------------------------------------- so the assignment is the *first* thing that happens for these xacts. However, we skip the assignment, because the log for this call of get_changes says this: LOG: logical decoding found consistent point at 0/6462E5D8 so we fail to realize the 26662 is a subxact. Then when processing the NEW_CID, SnapBuildProcessNewCid chimes in and does this: ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn); ReorderBufferAddNewTupleCids(builder->reorder, xlrec->top_xid, lsn, xlrec->target_locator, xlrec->target_tid, xlrec->cmin, xlrec->cmax, xlrec->combocid); and ReorderBufferAddNewTupleCids() proceeds to enter an entry for the passed XID (which is xlrec->top_xid, 26661), but with LSN of the WAL record. But ReorderBufferXidSetCatalogChanges() already did the same thing for the subxact 26662, as it has no idea it's a subxact (due to the skipped assignment). I haven't figured out what exactly is happening / what it should be doing instead. But it seems wrong to skip the assignment - I wonder if SnapBuildProcessRunningXacts might be doing that too eagerly. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: