Thread: test_decoding assertion failure for the loss of top-sub transaction relationship

test_decoding assertion failure for the loss of top-sub transaction relationship

From
"osumi.takamichi@fujitsu.com"
Date:
Hi, hackers


I've met an assertion failure of logical decoding with below scenario on HEAD.

---
<preparation>
create table tab1 (val integer);
select 'init' from  pg_create_logical_replication_slot('regression_slot', 'test_decoding');

<session1>
begin;
savepoint sp1;
insert into tab1 values (1);

<session2>
checkpoint; -- for RUNNING_XACT
select data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts',
'1');

<session1>
truncate tab1; -- for NEW_CID
commit;
begin;
insert into tab1 values (3);

<session2>
checkpoint;
select data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts',
'1');

<session1>
commit;

<session2>

select data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts',
'1');
---


Here, it's not a must but is advisable to make LOG_SNAPSHOT_INTERVAL_MS bigger so that
we can issue RUNNING_XACT according to our checkpoint commands explicitly.

In the above scenario, the first checkpoint generates RUNNING_XACT after the wal record
(for ReorderBufferAssignChild) that associates sub transaction with its top transaction.
This means that once we restart from RUNNING_XACT, we lose the association between top
transaction and sub transaction and then we can't mark the top transaction as catalog
modifying transaction by decoding NEW_CID (written after RUNNING_XACT), if the
sub transaction changes the catalog.

Therefore, this leads to the failure for the assert that can check
the consistency that when one sub transaction modifies the catalog,
its top transaction should be marked so as well.

I feel we need to remember the relationship between top transaction and sub transaction
in the serialized snapshot even before changing catalog at decoding RUNNING_XACT,
so that we can keep track of the association after the restart. What do you think ?


The stack call of this failure and related information is below.

(gdb) bt
#0  0x00007f2632588387 in raise () from /lib64/libc.so.6
#1  0x00007f2632589a78 in abort () from /lib64/libc.so.6
#2  0x0000000000b3eba1 in ExceptionalCondition (conditionName=0xd137e0 "!needs_snapshot || needs_timetravel",
    errorType=0xd130c5 "FailedAssertion", fileName=0xd130b9 "snapbuild.c", lineNumber=1116) at assert.c:69
#3  0x0000000000911257 in SnapBuildCommitTxn (builder=0x23f0638, lsn=22386632, xid=728, nsubxacts=1,
    subxacts=0x2bfcc88, xinfo=79) at snapbuild.c:1116
#4  0x00000000008fa420 in DecodeCommit (ctx=0x23e0108, buf=0x7fff4a1f9220, parsed=0x7fff4a1f9020, xid=728,
    two_phase=false) at decode.c:630
#5  0x00000000008f9953 in xact_decode (ctx=0x23e0108, buf=0x7fff4a1f9220) at decode.c:216
#6  0x00000000008f967d in LogicalDecodingProcessRecord (ctx=0x23e0108, record=0x23e04a0) at decode.c:119
#7  0x0000000000900b63 in pg_logical_slot_get_changes_guts (fcinfo=0x23d80a8, confirm=true, binary=false)
    at logicalfuncs.c:271
#8  0x0000000000900ca0 in pg_logical_slot_get_changes (fcinfo=0x23d80a8) at logicalfuncs.c:338
...
(gdb) frame 3
#3  0x0000000000911257 in SnapBuildCommitTxn (builder=0x23f0638, lsn=22386632, xid=728, nsubxacts=1,
    subxacts=0x2bfcc88, xinfo=79) at snapbuild.c:1116
1116            Assert(!needs_snapshot || needs_timetravel);
(gdb) list
1111            {
1112                    /* record that we cannot export a general snapshot anymore */
1113                    builder->committed.includes_all_transactions = false;
1114            }
1115
1116            Assert(!needs_snapshot || needs_timetravel);
1117
1118            /*
1119             * Adjust xmax of the snapshot builder, we only do that for committed,
1120             * catalog modifying, transactions, everything else isn't interesting for



Best Regards,
    Takamichi Osumi




RE: test_decoding assertion failure for the loss of top-sub transaction relationship

From
"kuroda.hayato@fujitsu.com"
Date:
Hi Hackers,

> Therefore, this leads to the failure for the assert that can check
> the consistency that when one sub transaction modifies the catalog,
> its top transaction should be marked so as well.
>
> I feel we need to remember the relationship between top transaction and sub
> transaction
> in the serialized snapshot even before changing catalog at decoding
> RUNNING_XACT,
> so that we can keep track of the association after the restart. What do you think ?

PSA patch that fixes the failure.
This adds pairs of sub-top transactions to the SnapBuild, and it will be serialized and restored.
The pair will be checked when we mark the ReorderBufferTXN as RBTXN_HAS_CATALOG_CHANGES.

Thanks to off-list discussion with Osumi-san.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: test_decoding assertion failure for the loss of top-sub transaction relationship

From
Kyotaro Horiguchi
Date:
Good catch, and thanks for the patch!

At Fri, 2 Sep 2022 01:08:04 +0000, "kuroda.hayato@fujitsu.com" <kuroda.hayato@fujitsu.com> wrote in 
> PSA patch that fixes the failure.
> This adds pairs of sub-top transactions to the SnapBuild, and it will be serialized and restored.
> The pair will be checked when we mark the ReorderBufferTXN as RBTXN_HAS_CATALOG_CHANGES.

A commit record has all subtransaction ids and SnapBuildCommitTxn()
already checks if every one has catalog changes before checking the
top transaction's catalog changes. So, no need to record top-sub
transaction relationship to serialized snapshots. If any of the
subtransactions has catalog changes, the commit contains catalog
changes.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Fri, Sep 2, 2022 at 6:38 AM kuroda.hayato@fujitsu.com
<kuroda.hayato@fujitsu.com> wrote:
>
> Hi Hackers,
>
> > Therefore, this leads to the failure for the assert that can check
> > the consistency that when one sub transaction modifies the catalog,
> > its top transaction should be marked so as well.
> >
> > I feel we need to remember the relationship between top transaction and sub
> > transaction
> > in the serialized snapshot even before changing catalog at decoding
> > RUNNING_XACT,
> > so that we can keep track of the association after the restart. What do you think ?
>
> PSA patch that fixes the failure.
> This adds pairs of sub-top transactions to the SnapBuild, and it will be serialized and restored.
> The pair will be checked when we mark the ReorderBufferTXN as RBTXN_HAS_CATALOG_CHANGES.

It seems that SnapBuildCommitTxn() is already taking care of adding
the top transaction to the committed transaction if any subtransaction
has the catalog changes, it has just missed setting the flag so I
think just setting the flag like this should be sufficient no?

diff --git a/src/backend/replication/logical/snapbuild.c
b/src/backend/replication/logical/snapbuild.c
index 1ff2c12..ee3f695 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1086,6 +1086,7 @@ SnapBuildCommitTxn(SnapBuild *builder,
XLogRecPtr lsn, TransactionId xid,
        else if (sub_needs_timetravel)
        {
                /* track toplevel txn as well, subxact alone isn't meaningful */
+               needs_timetravel = true;
                SnapBuildAddCommittedTxn(builder, xid);
        }
        else if (needs_timetravel)

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: test_decoding assertion failure for the loss of top-sub transaction relationship

From
Kyotaro Horiguchi
Date:
At Fri, 2 Sep 2022 10:59:56 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in 
> On Fri, Sep 2, 2022 at 6:38 AM kuroda.hayato@fujitsu.com
> <kuroda.hayato@fujitsu.com> wrote:
> >
> > Hi Hackers,
> >
> > > Therefore, this leads to the failure for the assert that can check
> > > the consistency that when one sub transaction modifies the catalog,
> > > its top transaction should be marked so as well.
> > >
> > > I feel we need to remember the relationship between top transaction and sub
> > > transaction
> > > in the serialized snapshot even before changing catalog at decoding
> > > RUNNING_XACT,
> > > so that we can keep track of the association after the restart. What do you think ?
> >
> > PSA patch that fixes the failure.
> > This adds pairs of sub-top transactions to the SnapBuild, and it will be serialized and restored.
> > The pair will be checked when we mark the ReorderBufferTXN as RBTXN_HAS_CATALOG_CHANGES.
> 
> It seems that SnapBuildCommitTxn() is already taking care of adding
> the top transaction to the committed transaction if any subtransaction
> has the catalog changes, it has just missed setting the flag so I
> think just setting the flag like this should be sufficient no?

Oops! That's right.

> diff --git a/src/backend/replication/logical/snapbuild.c
> b/src/backend/replication/logical/snapbuild.c
> index 1ff2c12..ee3f695 100644
> --- a/src/backend/replication/logical/snapbuild.c
> +++ b/src/backend/replication/logical/snapbuild.c
> @@ -1086,6 +1086,7 @@ SnapBuildCommitTxn(SnapBuild *builder,
> XLogRecPtr lsn, TransactionId xid,
>         else if (sub_needs_timetravel)
>         {
>                 /* track toplevel txn as well, subxact alone isn't meaningful */
> +               needs_timetravel = true;
>                 SnapBuildAddCommittedTxn(builder, xid);
>         }
>         else if (needs_timetravel)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: test_decoding assertion failure for the loss of top-sub transaction relationship

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Horiguchi-san, Dilip,

Thank you for replying!

> > It seems that SnapBuildCommitTxn() is already taking care of adding
> > the top transaction to the committed transaction if any subtransaction
> > has the catalog changes, it has just missed setting the flag so I
> > think just setting the flag like this should be sufficient no?
>
> Oops! That's right.

Basically I agreed, but I was not sure the message "found top level transaction..."
should be output or not. It may be useful even if one of sub transactions contains the change.

How about following?

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index bf72ad45ec..a630522907 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1086,8 +1086,17 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
                }
        }

-       /* if top-level modified catalog, it'll need a snapshot */
-       if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo))
+       /*
+        * if top-level or one of sub modified catalog, it'll need a snapshot.
+        *
+        * Normally the second check is not needed because the relation between
+        * top-sub transactions is tracked on the ReorderBuffer layer, and the top
+        * transaction is marked as containing catalog changes if its children are.
+        * But in some cases the relation may be missed, in which case only the sub
+        * transaction may be marked as containing catalog changes.
+        */
+       if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo)
+               || sub_needs_timetravel)
        {
                elog(DEBUG2, "found top level transaction %u, with catalog changes",
                         xid);
@@ -1095,11 +1104,6 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
                needs_timetravel = true;
                SnapBuildAddCommittedTxn(builder, xid);
        }
-       else if (sub_needs_timetravel)
-       {
-               /* track toplevel txn as well, subxact alone isn't meaningful */
-               SnapBuildAddCommittedTxn(builder, xid);
-       }
        else if (needs_timetravel)
        {
                elog(DEBUG2, "forced transaction %u to do timetravel", xid);

Best Regards,
Hayato Kuroda
FUJITSU LIMITED




On Fri, Sep 2, 2022 at 11:25 AM kuroda.hayato@fujitsu.com
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Horiguchi-san, Dilip,
>
> Thank you for replying!
>
> > > It seems that SnapBuildCommitTxn() is already taking care of adding
> > > the top transaction to the committed transaction if any subtransaction
> > > has the catalog changes, it has just missed setting the flag so I
> > > think just setting the flag like this should be sufficient no?
> >
> > Oops! That's right.
>
> Basically I agreed, but I was not sure the message "found top level transaction..."
> should be output or not. It may be useful even if one of sub transactions contains the change.
>
> How about following?
>
> diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
> index bf72ad45ec..a630522907 100644
> --- a/src/backend/replication/logical/snapbuild.c
> +++ b/src/backend/replication/logical/snapbuild.c
> @@ -1086,8 +1086,17 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
>                 }
>         }
>
> -       /* if top-level modified catalog, it'll need a snapshot */
> -       if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo))
> +       /*
> +        * if top-level or one of sub modified catalog, it'll need a snapshot.
> +        *
> +        * Normally the second check is not needed because the relation between
> +        * top-sub transactions is tracked on the ReorderBuffer layer, and the top
> +        * transaction is marked as containing catalog changes if its children are.
> +        * But in some cases the relation may be missed, in which case only the sub
> +        * transaction may be marked as containing catalog changes.
> +        */
> +       if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo)
> +               || sub_needs_timetravel)
>         {
>                 elog(DEBUG2, "found top level transaction %u, with catalog changes",
>                          xid);
> @@ -1095,11 +1104,6 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
>                 needs_timetravel = true;
>                 SnapBuildAddCommittedTxn(builder, xid);
>         }
> -       else if (sub_needs_timetravel)
> -       {
> -               /* track toplevel txn as well, subxact alone isn't meaningful */
> -               SnapBuildAddCommittedTxn(builder, xid);
> -       }
>         else if (needs_timetravel)
>         {
>                 elog(DEBUG2, "forced transaction %u to do timetravel", xid);

Yeah, I am fine with this as well.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: test_decoding assertion failure for the loss of top-sub transaction relationship

From
Kyotaro Horiguchi
Date:
At Fri, 2 Sep 2022 11:27:23 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in 
> On Fri, Sep 2, 2022 at 11:25 AM kuroda.hayato@fujitsu.com
> <kuroda.hayato@fujitsu.com> wrote:
> > How about following?
> >
> > diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
> > index bf72ad45ec..a630522907 100644
> > --- a/src/backend/replication/logical/snapbuild.c
> > +++ b/src/backend/replication/logical/snapbuild.c
> > @@ -1086,8 +1086,17 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
> >                 }
> >         }
> >
> > -       /* if top-level modified catalog, it'll need a snapshot */
> > -       if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo))
> > +       /*
> > +        * if top-level or one of sub modified catalog, it'll need a snapshot.
> > +        *
> > +        * Normally the second check is not needed because the relation between
> > +        * top-sub transactions is tracked on the ReorderBuffer layer, and the top
> > +        * transaction is marked as containing catalog changes if its children are.
> > +        * But in some cases the relation may be missed, in which case only the sub
> > +        * transaction may be marked as containing catalog changes.
> > +        */
> > +       if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo)
> > +               || sub_needs_timetravel)
> >         {
> >                 elog(DEBUG2, "found top level transaction %u, with catalog changes",
> >                          xid);
> > @@ -1095,11 +1104,6 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
> >                 needs_timetravel = true;
> >                 SnapBuildAddCommittedTxn(builder, xid);
> >         }
> > -       else if (sub_needs_timetravel)
> > -       {
> > -               /* track toplevel txn as well, subxact alone isn't meaningful */
> > -               SnapBuildAddCommittedTxn(builder, xid);
> > -       }
> >         else if (needs_timetravel)
> >         {
> >                 elog(DEBUG2, "forced transaction %u to do timetravel", xid);
> 
> Yeah, I am fine with this as well.

I'm basically fine, too. But this is a bug that needs back-patching
back to 10. This change changes the condition for the DEBUG2 message.
So we need to add an awkward if() condition for the DEBUG2 message.
Looking that the messages have different debug-level, I doubt there
have been a chance they are useful.  If we remove the two DEBUGx
messages, I'm fine with the change.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Fri, Sep 2, 2022 at 12:25 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Fri, 2 Sep 2022 11:27:23 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in
> > On Fri, Sep 2, 2022 at 11:25 AM kuroda.hayato@fujitsu.com
> > <kuroda.hayato@fujitsu.com> wrote:
> > > How about following?
> > >
> > > diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
> > > index bf72ad45ec..a630522907 100644
> > > --- a/src/backend/replication/logical/snapbuild.c
> > > +++ b/src/backend/replication/logical/snapbuild.c
> > > @@ -1086,8 +1086,17 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
> > >                 }
> > >         }
> > >
> > > -       /* if top-level modified catalog, it'll need a snapshot */
> > > -       if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo))
> > > +       /*
> > > +        * if top-level or one of sub modified catalog, it'll need a snapshot.
> > > +        *
> > > +        * Normally the second check is not needed because the relation between
> > > +        * top-sub transactions is tracked on the ReorderBuffer layer, and the top
> > > +        * transaction is marked as containing catalog changes if its children are.
> > > +        * But in some cases the relation may be missed, in which case only the sub
> > > +        * transaction may be marked as containing catalog changes.
> > > +        */
> > > +       if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo)
> > > +               || sub_needs_timetravel)
> > >         {
> > >                 elog(DEBUG2, "found top level transaction %u, with catalog changes",
> > >                          xid);
> > > @@ -1095,11 +1104,6 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
> > >                 needs_timetravel = true;
> > >                 SnapBuildAddCommittedTxn(builder, xid);
> > >         }
> > > -       else if (sub_needs_timetravel)
> > > -       {
> > > -               /* track toplevel txn as well, subxact alone isn't meaningful */
> > > -               SnapBuildAddCommittedTxn(builder, xid);
> > > -       }
> > >         else if (needs_timetravel)
> > >         {
> > >                 elog(DEBUG2, "forced transaction %u to do timetravel", xid);
> >
> > Yeah, I am fine with this as well.
>
> I'm basically fine, too. But this is a bug that needs back-patching
> back to 10.
>

I have not verified but I think we need to backpatch this till 14
because prior to that in DecodeCommit, we use to set the top-level txn
as having catalog changes based on the if there are invalidation
messages in the commit record. So, in the current scenario shared by
Osumi-San, before SnapBuildCommitTxn(), the top-level txn will be
marked as having catalog changes.

> This change changes the condition for the DEBUG2 message.
> So we need to add an awkward if() condition for the DEBUG2 message.
> Looking that the messages have different debug-level, I doubt there
> have been a chance they are useful.  If we remove the two DEBUGx
> messages, I'm fine with the change.
>

I think these DEBUG2 messages could be useful, so instead of removing
these, I suggest we should follow Dilip's proposed fix and maybe add a
new DEBUG2 message on the lines of (("forced transaction %u to do
timetravel due to one of its subtransaction", xid) in the else if
(sub_needs_timetravel) condition if we think that will be useful too
but I am fine leaving the addition of new DEBUG2 message.

-- 
With Regards,
Amit Kapila.



RE: test_decoding assertion failure for the loss of top-sub transaction relationship

From
"kuroda.hayato@fujitsu.com"
Date:
> > I'm basically fine, too. But this is a bug that needs back-patching
> > back to 10.
> >
> 
> I have not verified but I think we need to backpatch this till 14
> because prior to that in DecodeCommit, we use to set the top-level txn
> as having catalog changes based on the if there are invalidation
> messages in the commit record. So, in the current scenario shared by
> Osumi-San, before SnapBuildCommitTxn(), the top-level txn will be
> marked as having catalog changes.

I and Osumi-san are now investigating that, so please wait further reports and patches.

> > This change changes the condition for the DEBUG2 message.
> > So we need to add an awkward if() condition for the DEBUG2 message.
> > Looking that the messages have different debug-level, I doubt there
> > have been a chance they are useful.  If we remove the two DEBUGx
> > messages, I'm fine with the change.
> >
> 
> I think these DEBUG2 messages could be useful, so instead of removing
> these, I suggest we should follow Dilip's proposed fix and maybe add a
> new DEBUG2 message on the lines of (("forced transaction %u to do
> timetravel due to one of its subtransaction", xid) in the else if
> (sub_needs_timetravel) condition if we think that will be useful too
> but I am fine leaving the addition of new DEBUG2 message.

I agreed both that DEBUG2 messages are still useful but we should not
change the condition for output. So I prefer the idea suggested by Amit.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


On Fri, Sep 2, 2022 at 6:26 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
>
> I've met an assertion failure of logical decoding with below scenario on HEAD.
>
> ---
> <preparation>
> create table tab1 (val integer);
> select 'init' from  pg_create_logical_replication_slot('regression_slot', 'test_decoding');
>
> <session1>
> begin;
> savepoint sp1;
> insert into tab1 values (1);
>
> <session2>
> checkpoint; -- for RUNNING_XACT
> select data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts',
'1');
>
> <session1>
> truncate tab1; -- for NEW_CID
> commit;
> begin;
> insert into tab1 values (3);
>

BTW, if I just change the truncate statement to "Analyze tab1" in your
entire test then I am getting a different assertion failure:

postgres.exe!ExceptionalCondition(const char * conditionName, const
char * errorType, const char * fileName, int lineNumber) Line 70 C
postgres.exe!AssertTXNLsnOrder(ReorderBuffer * rb) Line 902 C
postgres.exe!ReorderBufferTXNByXid(ReorderBuffer * rb, unsigned int
xid, bool create, bool * is_new, unsigned __int64 lsn, bool
create_as_top) Line 681 C
postgres.exe!ReorderBufferAddNewTupleCids(ReorderBuffer * rb, unsigned
int xid, unsigned __int64 lsn, RelFileLocator locator, ItemPointerData
tid, unsigned int cmin, unsigned int cmax, unsigned int combocid) Line
3188 C
postgres.exe!SnapBuildProcessNewCid(SnapBuild * builder, unsigned int
xid, unsigned __int64 lsn, xl_heap_new_cid * xlrec) Line 823 C
postgres.exe!heap2_decode(LogicalDecodingContext * ctx,
XLogRecordBuffer * buf) Line 408 C
postgres.exe!LogicalDecodingProcessRecord(LogicalDecodingContext *
ctx, XLogReaderState * record) Line 119 C
postgres.exe!pg_logical_slot_get_changes_guts(FunctionCallInfoBaseData
* fcinfo, bool confirm, bool binary) Line 274 C
postgres.exe!pg_logical_slot_get_changes(FunctionCallInfoBaseData *
fcinfo) Line 339 C

This is matching with call stack we see intermittently in the BF
[1][2]. The difference with your scenario is that the Truncate
statement generates an additional WAL XLOG_STANDBY_LOCK prior to
XLOG_HEAP2_NEW_CID.

I think we can fix this in the below ways:
a. Assert(prev_first_lsn <= cur_txn->first_lsn); -- Explain in
comments that it is possible when subtransaction and transaction are
not previously logged as it happened in this scenario
b. track txn of prev_first_lsn (say as prev_txn) and check if
prev_txn's toptxn is the same as cur_txn or cur_txn's toptxn is the
same as the prev_txn then perform assert mentioned in (a) else, keep
the current Assert.

It seems (b) will be more robust.

Thoughts?

Note: I added Sawada-San as sometime back we had an offlist discussion
on this intermittent BF failure but we were not able to reach the
exact test which can show this failure.

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-08-20%2002%3A45%3A34
[2] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09

-- 
With Regards,
Amit Kapila.



RE: test_decoding assertion failure for the loss of top-sub transaction relationship

From
"kuroda.hayato@fujitsu.com"
Date:
Dear hackers,

> I agreed both that DEBUG2 messages are still useful but we should not
> change the condition for output. So I prefer the idea suggested by Amit.

PSA newer patch, which contains the fix and test.

> > I have not verified but I think we need to backpatch this till 14
> > because prior to that in DecodeCommit, we use to set the top-level txn
> > as having catalog changes based on the if there are invalidation
> > messages in the commit record. So, in the current scenario shared by
> > Osumi-San, before SnapBuildCommitTxn(), the top-level txn will be
> > marked as having catalog changes.
> 
> I and Osumi-san are now investigating that, so please wait further reports and
> patches.

We investigated it about older versions, and in some versions *another stack-trace* has been found.


About PG10-13, indeed, the failure was not occurred.
In these versions transactions are regarded as
that have catalog changes when the commit record has XACT_XINFO_HAS_INVALS flag.
This flag will be set if the transaction has invalidation messages.

When sub transaction changes system catalogs and user commits,
all invalidation messages allocated in sub transaction will be transferred to top transaction.
Therefore both transactions will be marked as containing catalog changes.


About PG14 and 15, however, another stack-trace has been found.
While executing the same workload, we got followings at the same SQL statement;

```
(gdb) backtrace
#0  0x00007fa78c6dc387 in raise () from /lib64/libc.so.6
#1  0x00007fa78c6dda78 in abort () from /lib64/libc.so.6
#2  0x0000000000b16680 in ExceptionalCondition (conditionName=0xcd3ab0 "txn->ninvalidations == 0", errorType=0xcd3284
"FailedAssertion",
 
    fileName=0xcd32d0 "reorderbuffer.c", lineNumber=2936) at assert.c:69
#3  0x00000000008e9e70 in ReorderBufferForget (rb=0x12b5b10, xid=735, lsn=24125384) at reorderbuffer.c:2936
#4  0x00000000008d9493 in DecodeCommit (ctx=0x12a2d20, buf=0x7ffe08b236b0, parsed=0x7ffe08b23510, xid=734,
two_phase=false)at decode.c:733
 
#5  0x00000000008d8962 in DecodeXactOp (ctx=0x12a2d20, buf=0x7ffe08b236b0) at decode.c:279
#6  0x00000000008d85e2 in LogicalDecodingProcessRecord (ctx=0x12a2d20, record=0x12a30e0) at decode.c:142
#7  0x00000000008dfef2 in pg_logical_slot_get_changes_guts (fcinfo=0x129acb0, confirm=true, binary=false) at
logicalfuncs.c:296
#8  0x00000000008e002f in pg_logical_slot_get_changes (fcinfo=0x129acb0) at logicalfuncs.c:365
...
(gdb) frame 4
#4  0x00000000008d9493 in DecodeCommit (ctx=0x14cfd20, buf=0x7ffc638b0ca0, parsed=0x7ffc638b0b00, xid=734,
two_phase=false)at decode.c:733
 
733                             ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
(gdb) list
728              */
729             if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
730             {
731                     for (i = 0; i < parsed->nsubxacts; i++)
732                     {
733                             ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
734                     }
735                     ReorderBufferForget(ctx->reorder, xid, buf->origptr);
736 
737                     return;
(gdb) frame 3
#3  0x00000000008e9e70 in ReorderBufferForget (rb=0x14e2b10, xid=735, lsn=24125152) at reorderbuffer.c:2936
2936                    Assert(txn->ninvalidations == 0);
(gdb) list
2931             */
2932            if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
2933                    ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
2934                                                                                       txn->invalidations);
2935            else
2936                    Assert(txn->ninvalidations == 0);
2937 
2938            /* remove potential on-disk data, and deallocate */
2939            ReorderBufferCleanupTXN(rb, txn);
2940    }
(gdb) print *txn
$1 = {txn_flags = 3, xid = 735, toplevel_xid = 734, gid = 0x0, first_lsn = 24113488, final_lsn = 24125152, end_lsn = 0,
toptxn= 0x14ecb98, 
 
  restart_decoding_lsn = 24113304, origin_id = 0, origin_lsn = 0, commit_time = 0, base_snapshot = 0x0,
base_snapshot_lsn= 0, 
 
  base_snapshot_node = {prev = 0x14ecc00, next = 0x14e2b28}, snapshot_now = 0x0, command_id = 4294967295, nentries = 5,
nentries_mem= 5, 
 
  changes = {head = {prev = 0x14eecf8, next = 0x14eeb18}}, tuplecids = {head = {prev = 0x14ecb10, next = 0x14ecb10}},
ntuplecids= 0, 
 
  tuplecid_hash = 0x0, toast_hash = 0x0, subtxns = {head = {prev = 0x14ecb38, next = 0x14ecb38}}, nsubtxns = 0,
ninvalidations= 3, 
 
  invalidations = 0x14e2d28, node = {prev = 0x14ecc68, next = 0x14ecc68}, size = 452, total_size = 452,
concurrent_abort= false, 
 
  output_plugin_private = 0x0}
```

In these versions DecodeCommit() said OK. However, we have met another failure
because the ReorderBufferTXN of the sub transaction had invalidation messages but it did not have base_snapshot.

I thought that this failure was occurred the only the base_snapshot of the sub transaction is transferred via
ReorderBufferTransferSnapToParent()
when a transaction is assigned as child, but its invalidation messages are not.

I was not sure what's the proper way to fix it.
The solution I've thought at first was transporting all invalidations from sub to top like
ReorderBufferTransferSnapToParent(),
but I do not know its side effect. Moreover, how do we deal with ReorderBufferChange?
Should we transfer them too? If so, how about the ordering of changes?
Alternative solustion was just remove the assertion, but was it OK?

How do you think? I want to hear your comments and suggestions...

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

RE: test_decoding assertion failure for the loss of top-sub transaction relationship

From
"kuroda.hayato@fujitsu.com"
Date:
> I was not sure what's the proper way to fix it.
> The solution I've thought at first was transporting all invalidations from sub to top
> like ReorderBufferTransferSnapToParent(),
> but I do not know its side effect. Moreover, how do we deal with
> ReorderBufferChange?
> Should we transfer them too? If so, how about the ordering of changes?
> Alternative solustion was just remove the assertion, but was it OK?

PSA the PoC patch for discussion. In this patch only invalidation messages are transported,
changes hold by subtxn are ignored. This can be passed the reported workload.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment
On Tue, Sep 6, 2022 at 1:17 PM kuroda.hayato@fujitsu.com
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear hackers,
>
> > I agreed both that DEBUG2 messages are still useful but we should not
> > change the condition for output. So I prefer the idea suggested by Amit.
>
> PSA newer patch, which contains the fix and test.
>
> > > I have not verified but I think we need to backpatch this till 14
> > > because prior to that in DecodeCommit, we use to set the top-level txn
> > > as having catalog changes based on the if there are invalidation
> > > messages in the commit record. So, in the current scenario shared by
> > > Osumi-San, before SnapBuildCommitTxn(), the top-level txn will be
> > > marked as having catalog changes.
> >
> > I and Osumi-san are now investigating that, so please wait further reports and
> > patches.
>
> We investigated it about older versions, and in some versions *another stack-trace* has been found.
>
>
> About PG10-13, indeed, the failure was not occurred.
> In these versions transactions are regarded as
> that have catalog changes when the commit record has XACT_XINFO_HAS_INVALS flag.
> This flag will be set if the transaction has invalidation messages.
>
> When sub transaction changes system catalogs and user commits,
> all invalidation messages allocated in sub transaction will be transferred to top transaction.
> Therefore both transactions will be marked as containing catalog changes.
>
>
> About PG14 and 15, however, another stack-trace has been found.
> While executing the same workload, we got followings at the same SQL statement;
>

Did you get this new assertion failure after you applied the patch for
the first failure? Because otherwise, how can you reach it with the
same test case?

About patch:
else if (sub_needs_timetravel)
  {
- /* track toplevel txn as well, subxact alone isn't meaningful */
+ elog(DEBUG2, "forced transaction %u to do timetravel due to one of
its subtransaction",
+ xid);
+ needs_timetravel = true;
  SnapBuildAddCommittedTxn(builder, xid);

Why did you remove the above comment? I think it still makes sense to retain it.

-- 
With Regards,
Amit Kapila.



RE: test_decoding assertion failure for the loss of top-sub transaction relationship

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Amit,

Thanks for giving comments!

> Did you get this new assertion failure after you applied the patch for
> the first failure? Because otherwise, how can you reach it with the
> same test case?

The first failure is occurred only in the HEAD, so I did not applied the first patch
to REL14 and REL15.
This difference is caused because the commit [Fix catalog lookup...] in REL15(272248a) and older is different
from the HEAD one.
In order versions SnapBuildXidSetCatalogChanges() has been added. In the function
a transaction will be marked as containing catalog changes if the transaction is in InitialRunningXacts,
and after that the relation between sub-top transactions is assigned based on the parsed->subxact.
The marking avoids the first failure, but the assignment triggers new failure.


> About patch:
> else if (sub_needs_timetravel)
>   {
> - /* track toplevel txn as well, subxact alone isn't meaningful */
> + elog(DEBUG2, "forced transaction %u to do timetravel due to one of
> its subtransaction",
> + xid);
> + needs_timetravel = true;
>   SnapBuildAddCommittedTxn(builder, xid);
> 
> Why did you remove the above comment? I think it still makes sense to retain it.

Fixed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: test_decoding assertion failure for the loss of top-sub transaction relationship

From
Masahiko Sawada
Date:
On Wed, Sep 7, 2022 at 11:06 AM kuroda.hayato@fujitsu.com
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Amit,
>
> Thanks for giving comments!
>
> > Did you get this new assertion failure after you applied the patch for
> > the first failure? Because otherwise, how can you reach it with the
> > same test case?
>
> The first failure is occurred only in the HEAD, so I did not applied the first patch
> to REL14 and REL15.
> This difference is caused because the commit [Fix catalog lookup...] in REL15(272248a) and older is different
> from the HEAD one.
> In order versions SnapBuildXidSetCatalogChanges() has been added. In the function
> a transaction will be marked as containing catalog changes if the transaction is in InitialRunningXacts,
> and after that the relation between sub-top transactions is assigned based on the parsed->subxact.
> The marking avoids the first failure, but the assignment triggers new failure.
>
>
> > About patch:
> > else if (sub_needs_timetravel)
> >   {
> > - /* track toplevel txn as well, subxact alone isn't meaningful */
> > + elog(DEBUG2, "forced transaction %u to do timetravel due to one of
> > its subtransaction",
> > + xid);
> > + needs_timetravel = true;
> >   SnapBuildAddCommittedTxn(builder, xid);
> >
> > Why did you remove the above comment? I think it still makes sense to retain it.
>
> Fixed.

Here are some review comments for v2 patch:

+# Test that we can force the top transaction to do timetravel when one of sub
+# transactions needs that. This is necessary when we restart decoding
from RUNNING_XACT
+# without the wal to associate subtransaction to its top transaction.

I don't think the second sentence is necessary.

---
The last decoding
+# starts from the first checkpoint and NEW_CID of "s0_truncate"
doesn't mark the top
+# transaction as catalog modifying transaction. In this scenario, the
enforcement sets
+# needs_timetravel to true even if the top transaction is regarded as
that it does not
+# have catalog changes and thus the decoding works without a
contradition that one
+# subtransaction needed timetravel while its top transaction didn't.

I don't understand the last sentence, probably it's a long sentence.

How about the following description?

# Test that we can handle the case where only subtransaction is marked
as containing
# catalog changes. The last decoding starts from NEW_CID generated by
"s0_truncate" and
# marks only the subtransaction as containing catalog changes but we
don't create the
# association between top-level transaction and subtransaction yet.
When decoding the
# commit record of the top-level transaction, we must force the
top-level transaction
# to do timetravel since one of its subtransactions is marked as
containing catalog changes.

---
+ elog(DEBUG2, "forced transaction %u to do timetravel due to one of
its subtransaction",
+ xid);
+ needs_timetravel = true;

I think "one of its subtransaction" should be "one of its subtransactions".

Regards,

--
Masahiko Sawada



RE: test_decoding assertion failure for the loss of top-sub transaction relationship

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Sawada-san,

Thank you for reviewing HEAD patch! PSA v3 patch.

> +# Test that we can force the top transaction to do timetravel when one of sub
> +# transactions needs that. This is necessary when we restart decoding
> from RUNNING_XACT
> +# without the wal to associate subtransaction to its top transaction.
>
> I don't think the second sentence is necessary.
>
> ---
> The last decoding
> +# starts from the first checkpoint and NEW_CID of "s0_truncate"
> doesn't mark the top
> +# transaction as catalog modifying transaction. In this scenario, the
> enforcement sets
> +# needs_timetravel to true even if the top transaction is regarded as
> that it does not
> +# have catalog changes and thus the decoding works without a
> contradition that one
> +# subtransaction needed timetravel while its top transaction didn't.
>
> I don't understand the last sentence, probably it's a long sentence.
>
> How about the following description?
>
> # Test that we can handle the case where only subtransaction is marked
> as containing
> # catalog changes. The last decoding starts from NEW_CID generated by
> "s0_truncate" and
> # marks only the subtransaction as containing catalog changes but we
> don't create the
> # association between top-level transaction and subtransaction yet.
> When decoding the
> # commit record of the top-level transaction, we must force the
> top-level transaction
> # to do timetravel since one of its subtransactions is marked as
> containing catalog changes.

Seems good, I replaced all of comments to yours.

> + elog(DEBUG2, "forced transaction %u to do timetravel due to one of
> its subtransaction",
> + xid);
> + needs_timetravel = true;
>
> I think "one of its subtransaction" should be "one of its subtransactions".

Fixed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: test_decoding assertion failure for the loss of top-sub transaction relationship

From
Masahiko Sawada
Date:
On Wed, Sep 7, 2022 at 11:06 AM kuroda.hayato@fujitsu.com
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Amit,
>
> Thanks for giving comments!
>
> > Did you get this new assertion failure after you applied the patch for
> > the first failure? Because otherwise, how can you reach it with the
> > same test case?
>
> The first failure is occurred only in the HEAD, so I did not applied the first patch
> to REL14 and REL15.
> This difference is caused because the commit [Fix catalog lookup...] in REL15(272248a) and older is different
> from the HEAD one.
> In order versions SnapBuildXidSetCatalogChanges() has been added. In the function
> a transaction will be marked as containing catalog changes if the transaction is in InitialRunningXacts,
> and after that the relation between sub-top transactions is assigned based on the parsed->subxact.
> The marking avoids the first failure, but the assignment triggers new failure.
>

FYI, as I just replied to the related thread[1], the assertion failure
in REL14 and REL15 can be fixed by the patch proposed there. So I'd
like to see how the discussion goes. Regardless of this proposed fix,
the patch proposed by Kuroda-san is required for HEAD, REL14, and
REL15, in order to fix the assertion failure in SnapBuildCommitTxn().

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoA1gV9pfu8hoXpTQBWH8uEMRg_F_MKM%2BU3Sr0HnyH4AUQ%40mail.gmail.com

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



RE: test_decoding assertion failure for the loss of top-sub transaction relationship

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Sawada-san,

> FYI, as I just replied to the related thread[1], the assertion failure
> in REL14 and REL15 can be fixed by the patch proposed there. So I'd
> like to see how the discussion goes. Regardless of this proposed fix,
> the patch proposed by Kuroda-san is required for HEAD, REL14, and
> REL15, in order to fix the assertion failure in SnapBuildCommitTxn().

I understood that my patches for REL14 and REL15 might be not needed.
I will check the thread later. Thanks!

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: test_decoding assertion failure for the loss of top-sub transaction relationship

From
Masahiko Sawada
Date:
On Wed, Oct 12, 2022 at 3:35 PM kuroda.hayato@fujitsu.com
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > FYI, as I just replied to the related thread[1], the assertion failure
> > in REL14 and REL15 can be fixed by the patch proposed there. So I'd
> > like to see how the discussion goes. Regardless of this proposed fix,
> > the patch proposed by Kuroda-san is required for HEAD, REL14, and
> > REL15, in order to fix the assertion failure in SnapBuildCommitTxn().
>
> I understood that my patches for REL14 and REL15 might be not needed.

No, sorry for confusing you. I meant that even if we agreed with the
patch I proposed there, your patch is still required to fix the issue.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com