Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
Date
Msg-id 20211008.165055.1621145185927268721.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
At Thu, 7 Oct 2021 13:20:14 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in 
> Another idea to fix this problem would be that before calling
> SnapBuildCommitTxn() we create transaction entries in ReorderBuffer
> for (sub)transactions whose COMMIT record has XACT_XINFO_HAS_INVALS,
> and then mark all of them as catalog-changed by calling
> ReorderBufferXidSetCatalogChanges(). I've attached a PoC patch for
> this idea. What the patch does is essentially the same as what the
> proposed patch does. But the patch doesn't modify the
> SnapBuildCommitTxn(). And we remember the list of last running
> transactions in reorder buffer and the list is periodically purged
> during decoding RUNNING_XACTS records, eventually making it empty.

I came up with the third way.  SnapBuildCommitTxn already properly
handles the case where a ReorderBufferTXN with
RBTXN_HAS_CATALOG_CHANGES.  So this issue can be resolved by create
such ReorderBufferTXNs in SnapBuildProcessRunningXacts.

One problem with this is that change creates the case where multiple
ReorderBufferTXNs share the same first_lsn.  I haven't come up with a
clean idea to avoid relaxing the restriction of AssertTXNLsnOrder..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 46e66608cf..503116764f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -887,9 +887,14 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
         if (cur_txn->end_lsn != InvalidXLogRecPtr)
             Assert(cur_txn->first_lsn <= cur_txn->end_lsn);
 
-        /* Current initial LSN must be strictly higher than previous */
+        /*
+         * Current initial LSN must be strictly higher than previous. except
+         * this transaction is created by XLOG_RUNNING_XACTS.  If one
+         * XLOG_RUNNING_XACTS creates multiple transactions, they share the
+         * same LSN. See SnapBuildProcessRunningXacts.
+         */
         if (prev_first_lsn != InvalidXLogRecPtr)
-            Assert(prev_first_lsn < cur_txn->first_lsn);
+            Assert(prev_first_lsn <= cur_txn->first_lsn);
 
         /* known-as-subtxn txns must not be listed */
         Assert(!rbtxn_is_known_subxact(cur_txn));
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a5333349a8..58859112dc 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1097,6 +1097,20 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
      */
     if (builder->state < SNAPBUILD_CONSISTENT)
     {
+        /*
+         * At the time we passed the first XLOG_RUNNING_XACTS record, the
+         * transactions notified by the record may have updated
+         * catalogs. Register the transactions with marking them as having
+         * caused catalog changes.  The worst misbehavior here is some spurious
+         * invalidation at decoding start.
+         */
+        if (builder->state == SNAPBUILD_START)
+        {
+            for (int i = 0 ; i < running->xcnt + running->subxcnt ; i++)
+                ReorderBufferXidSetCatalogChanges(builder->reorder,
+                                                  running->xids[i], lsn);
+        }
+
         /* returns false if there's no point in performing cleanup just yet */
         if (!SnapBuildFindSnapshot(builder, lsn, running))
             return;

pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Re: Improve the HINT message of the ALTER command for postgres_fdw
Next
From: Michael Paquier
Date:
Subject: Re: More business with $Test::Builder::Level in the TAP tests