RE: test_decoding assertion failure for the loss of top-sub transaction relationship - Mailing list pgsql-hackers

From kuroda.hayato@fujitsu.com
Subject RE: test_decoding assertion failure for the loss of top-sub transaction relationship
Date
Msg-id TYAPR01MB58660803BCAA7849C8584AA4F57E9@TYAPR01MB5866.jpnprd01.prod.outlook.com
Whole thread Raw
In response to RE: test_decoding assertion failure for the loss of top-sub transaction relationship  ("kuroda.hayato@fujitsu.com" <kuroda.hayato@fujitsu.com>)
Responses RE: test_decoding assertion failure for the loss of top-sub transaction relationship
Re: test_decoding assertion failure for the loss of top-sub transaction relationship
List pgsql-hackers
Dear hackers,

> I agreed both that DEBUG2 messages are still useful but we should not
> change the condition for output. So I prefer the idea suggested by Amit.

PSA newer patch, which contains the fix and test.

> > I have not verified but I think we need to backpatch this till 14
> > because prior to that in DecodeCommit, we use to set the top-level txn
> > as having catalog changes based on the if there are invalidation
> > messages in the commit record. So, in the current scenario shared by
> > Osumi-San, before SnapBuildCommitTxn(), the top-level txn will be
> > marked as having catalog changes.
> 
> I and Osumi-san are now investigating that, so please wait further reports and
> patches.

We investigated it about older versions, and in some versions *another stack-trace* has been found.


About PG10-13, indeed, the failure was not occurred.
In these versions transactions are regarded as
that have catalog changes when the commit record has XACT_XINFO_HAS_INVALS flag.
This flag will be set if the transaction has invalidation messages.

When sub transaction changes system catalogs and user commits,
all invalidation messages allocated in sub transaction will be transferred to top transaction.
Therefore both transactions will be marked as containing catalog changes.


About PG14 and 15, however, another stack-trace has been found.
While executing the same workload, we got followings at the same SQL statement;

```
(gdb) backtrace
#0  0x00007fa78c6dc387 in raise () from /lib64/libc.so.6
#1  0x00007fa78c6dda78 in abort () from /lib64/libc.so.6
#2  0x0000000000b16680 in ExceptionalCondition (conditionName=0xcd3ab0 "txn->ninvalidations == 0", errorType=0xcd3284
"FailedAssertion",
 
    fileName=0xcd32d0 "reorderbuffer.c", lineNumber=2936) at assert.c:69
#3  0x00000000008e9e70 in ReorderBufferForget (rb=0x12b5b10, xid=735, lsn=24125384) at reorderbuffer.c:2936
#4  0x00000000008d9493 in DecodeCommit (ctx=0x12a2d20, buf=0x7ffe08b236b0, parsed=0x7ffe08b23510, xid=734,
two_phase=false)at decode.c:733
 
#5  0x00000000008d8962 in DecodeXactOp (ctx=0x12a2d20, buf=0x7ffe08b236b0) at decode.c:279
#6  0x00000000008d85e2 in LogicalDecodingProcessRecord (ctx=0x12a2d20, record=0x12a30e0) at decode.c:142
#7  0x00000000008dfef2 in pg_logical_slot_get_changes_guts (fcinfo=0x129acb0, confirm=true, binary=false) at
logicalfuncs.c:296
#8  0x00000000008e002f in pg_logical_slot_get_changes (fcinfo=0x129acb0) at logicalfuncs.c:365
...
(gdb) frame 4
#4  0x00000000008d9493 in DecodeCommit (ctx=0x14cfd20, buf=0x7ffc638b0ca0, parsed=0x7ffc638b0b00, xid=734,
two_phase=false)at decode.c:733
 
733                             ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
(gdb) list
728              */
729             if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
730             {
731                     for (i = 0; i < parsed->nsubxacts; i++)
732                     {
733                             ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
734                     }
735                     ReorderBufferForget(ctx->reorder, xid, buf->origptr);
736 
737                     return;
(gdb) frame 3
#3  0x00000000008e9e70 in ReorderBufferForget (rb=0x14e2b10, xid=735, lsn=24125152) at reorderbuffer.c:2936
2936                    Assert(txn->ninvalidations == 0);
(gdb) list
2931             */
2932            if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
2933                    ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
2934                                                                                       txn->invalidations);
2935            else
2936                    Assert(txn->ninvalidations == 0);
2937 
2938            /* remove potential on-disk data, and deallocate */
2939            ReorderBufferCleanupTXN(rb, txn);
2940    }
(gdb) print *txn
$1 = {txn_flags = 3, xid = 735, toplevel_xid = 734, gid = 0x0, first_lsn = 24113488, final_lsn = 24125152, end_lsn = 0,
toptxn= 0x14ecb98, 
 
  restart_decoding_lsn = 24113304, origin_id = 0, origin_lsn = 0, commit_time = 0, base_snapshot = 0x0,
base_snapshot_lsn= 0, 
 
  base_snapshot_node = {prev = 0x14ecc00, next = 0x14e2b28}, snapshot_now = 0x0, command_id = 4294967295, nentries = 5,
nentries_mem= 5, 
 
  changes = {head = {prev = 0x14eecf8, next = 0x14eeb18}}, tuplecids = {head = {prev = 0x14ecb10, next = 0x14ecb10}},
ntuplecids= 0, 
 
  tuplecid_hash = 0x0, toast_hash = 0x0, subtxns = {head = {prev = 0x14ecb38, next = 0x14ecb38}}, nsubtxns = 0,
ninvalidations= 3, 
 
  invalidations = 0x14e2d28, node = {prev = 0x14ecc68, next = 0x14ecc68}, size = 452, total_size = 452,
concurrent_abort= false, 
 
  output_plugin_private = 0x0}
```

In these versions DecodeCommit() said OK. However, we have met another failure
because the ReorderBufferTXN of the sub transaction had invalidation messages but it did not have base_snapshot.

I thought that this failure was occurred the only the base_snapshot of the sub transaction is transferred via
ReorderBufferTransferSnapToParent()
when a transaction is assigned as child, but its invalidation messages are not.

I was not sure what's the proper way to fix it.
The solution I've thought at first was transporting all invalidations from sub to top like
ReorderBufferTransferSnapToParent(),
but I do not know its side effect. Moreover, how do we deal with ReorderBufferChange?
Should we transfer them too? If so, how about the ordering of changes?
Alternative solustion was just remove the assertion, but was it OK?

How do you think? I want to hear your comments and suggestions...

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

pgsql-hackers by date:

Previous
From: Antonin Houska
Date:
Subject: Return value of PathNameOpenFile()
Next
From: Zhang Mingli
Date:
Subject: Re: Remove dead macro exec_subplan_get_plan