RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder() - Mailing list pgsql-bugs
From | Hayato Kuroda (Fujitsu) |
---|---|
Subject | RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder() |
Date | |
Msg-id | TYCPR01MB1207717063D701F597EF98A0CF5272@TYCPR01MB12077.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder() ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>) |
Responses |
Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()
RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder() |
List | pgsql-bugs |
Dear Haiyang, Alexander, I analyzed the second failure reported in [1]. The failure happened on the all Supported branches. Attached patches fixes two failures [1] [2] on PG12-PG15, and only the failure [1] on PG16-HEAD. Below part describes the second failure. The issue would be occurred when: 1) Logical decoding starts from the middle of a sub-transaction, and 2) NEW_CID record is initially decoded in the sub-transaction, and 3) An arbitrary changes are decoded in the sub-transaction. ## Stuck trace Just in case, below is a stuck trace I got. ``` #0 0x00007f85e64af387 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55 #1 0x00007f85e64b0a78 in __GI_abort () at abort.c:90 #2 0x0000000000b082d3 in ExceptionalCondition (conditionName=0xcc83e7 "prev_txn_is_empty", errorType=0xcc8184 "FailedAssertion", fileName=0xcc81d0 "reorderbuffer.c", lineNumber=932) at assert.c:69 #3 0x00000000008e023d in AssertTXNLsnOrder (rb=0x1a69e00) at reorderbuffer.c:932 #4 0x00000000008e3cf5 in ReorderBufferSetBaseSnapshot (rb=0x1a69e00, xid=757, lsn=24749472, snap=0x1a660e8) at reorderbuffer.c:3140 #5 0x00000000008e910a in SnapBuildProcessChange (builder=0x1a65de0, xid=757, lsn=24749472) at snapbuild.c:799 #6 0x00000000008d2740 in DecodeHeapOp (ctx=0x1a53d60, buf=0x7ffe24280460) at decode.c:521 #7 0x00000000008d1df7 in LogicalDecodingProcessRecord (ctx=0x1a53d60, record=0x1a54160) at decode.c:154 #8 0x00000000008d9766 in pg_logical_slot_get_changes_guts (fcinfo=0x1a41f20, confirm=true, binary=false) at logicalfuncs.c:296 #9 0x00000000008d989b in pg_logical_slot_get_changes (fcinfo=0x1a41f20) at logicalfuncs.c:365 ... ``` ## Reproducer Each attached patches contained the reproducer. As I said above, logical decoding is started from the middle of the sub-transaction initial decoded record must be NEW_CID. Below shows the wal records from the decoding start point. ``` $ pg_waldump tmp_check_iso/data/pg_wal/000000010000000000000001 -s 0/179A4E8 rmgr: ..., tx: 0, lsn: 0/0179A4E8, prev 0/0179A4A8, desc: CHECKPOINT_ONLINE... rmgr: ..., tx: 757, lsn: 0/0179A560, prev 0/0179A4E8, desc: NEW_CID ... rmgr: ..., tx: 757, lsn: 0/0179A5A0, prev 0/0179A560, desc: INSERT+INIT... rmgr: ..., tx: 756, lsn: 0/0179A5E0, prev 0/0179A5A0, desc: COMMIT 2024-03-07 13:54:56.243746 UTC; subxacts: 757 rmgr: ..., tx: 0, lsn: 0/0179A618, prev 0/0179A5E0, desc: RUNNING_XACTS ... ``` ## Analysis SnapBuildProcessNewCid() generates a two transactions (xid = 757 and 756) based on the same wal record, so both entries have same first_lsn in ReorderBufferTXN. However, they are not associated as top-sub relationship so that they are pushed to toplevel_by_lsn list. Since AssertTXNLsnOrder() does not assumes the case which two entries have same first_lsn, it leads an assertion failure. Note that the sub-transaction is generated earlier than top one and sub is not an empty, an initial fix was not sufficient. ### Detailed analysis I added a debug variable in AssertTXNLsnOrder to preserve the previous entry in toplevel_by_lsn loop, and I found that two ReorderBufferTNXes have same first_lsn. Also, according to above, 757 is a sub-transaction of 756. ``` (gdb) p *prev $1 = {txn_flags = 1, xid = 757, toplevel_xid = 0, gid = 0x0, first_lsn = 24749408, ...nentries = 1, ..., ntuplecids = 0,...} (gdb) p *cur_txn $2 = {txn_flags = 0, xid = 756, toplevel_xid = 0, gid = 0x0, first_lsn = 24749408, ...nentries = 0, ..., ntuplecids = 1,...} ``` Based on above and some debug outputs, I considered a scenario. Below flow showed a case when the sub-transaction (xid = 757) was decoded. ``` DecodeHeap2Op(info = XLOG_HEAP2_NEW_CID) SnapBuildProcessNewCid() ReorderBufferXidSetCatalogChanges(xid, lsn) ReorderBufferTXNByXid(xid, lsn) -> A ReorderBufferTXN for subtransaciton (757) was generated. The first_lsn was the head of NEW_CID record. ReorderBufferAddNewCommandId(xid, lsn) ReorderBufferQueueChange(xid, lsn) -> A ReorderBufferChange was queued to the subtransaction (757) ... ReorderBufferAddNewTupleCids(xlrec->top_xid, lsn) ReorderBufferTXNByXid() -> A ReorderBufferTXN for top-transaciton (756) was generated. The first_lsn was the head of NEW_CID record. ... DecodeHeapOp(info = XLOG_HEAP_INSERT) SnapBuildProcessChange(xid) ReorderBufferSetBaseSnapshot(xid, lsn) AssertTXNLsnOrder() -> A subtransaction was found and it was not an empty transaction. -> Next entry was a top-transaction. The previous entry was not empty, and they had same first_lsn. It caused an Assertion failure. ``` ## How to fix I think the straightforward fix is to associate them to top-sub relationship, and attached patch did it. Thought? [1]: https://www.postgresql.org/message-id/TYCPR01MB12077573479C5A2471BDE8065F5232%40TYCPR01MB12077.jpnprd01.prod.outlook.com [2]: https://www.postgresql.org/message-id/18369-ad61699bf91c5bc0%40postgresql.org Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
Attachment
pgsql-bugs by date: