Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
Date
Msg-id CAA4eK1Jg+u=ouXbC5EVLVnQ4sUhrR1P+p0LQSjPFmMP4jNYmuw@mail.gmail.com
Whole thread Raw
In response to Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
List pgsql-hackers
On Wed, Jul 6, 2022 at 7:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I'll post a new version patch in the next email with replying to other comments.
>

Okay, thanks for working on this. Few comments/suggestions on
poc_remember_last_running_xacts_v2 patch:

1.
+ReorderBufferSetLastRunningXactsCatalogChanges(ReorderBuffer *rb,
TransactionId xid,
+    uint32 xinfo, int subxcnt,
+    TransactionId *subxacts, XLogRecPtr lsn)
+{
...
...
+
+ test = bsearch(&xid, rb->last_running_xacts, rb->n_last_running_xacts,
+    sizeof(TransactionId), xidComparator);
+
+ if (test == NULL)
+ {
+ for (int i = 0; i < subxcnt; i++)
+ {
+ test = bsearch(&subxacts[i], rb->last_running_xacts, rb->n_last_running_xacts,
+    sizeof(TransactionId), xidComparator);
...

Is there ever a possibility that the top transaction id is not in the
running_xacts list but one of its subxids is present? If yes, it is
not very obvious at least to me so adding a comment here could be
useful. If not, then why do we need this additional check for each of
the sub-transaction ids?

2.
@@ -627,6 +647,15 @@ DecodeCommit(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf,
  commit_time = parsed->origin_timestamp;
  }

+ /*
+ * Set the last running xacts as containing catalog change if necessary.
+ * This must be done before SnapBuildCommitTxn() so that we include catalog
+ * change transactions to the historic snapshot.
+ */
+ ReorderBufferSetLastRunningXactsCatalogChanges(ctx->reorder, xid,
parsed->xinfo,
+    parsed->nsubxacts, parsed->subxacts,
+    buf->origptr);
+
  SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
     parsed->nsubxacts, parsed->subxacts);

As mentioned previously as well, marking it before SnapBuildCommitTxn
has one disadvantage, we sometimes do this work even if the snapshot
state is SNAPBUILD_START or SNAPBUILD_BUILDING_SNAPSHOT in which case
SnapBuildCommitTxn wouldn't do anything. Can we instead check whether
the particular txn has invalidations and is present in the
last_running_xacts list along with the check
ReorderBufferXidHasCatalogChanges? I think that has the additional
advantage that we don't need this additional marking if the xact is
already marked as containing catalog changes.

3.
1.
+ /*
+ * We rely on HEAP2_NEW_CID records and XACT_INVALIDATIONS to know
+ * if the transaction has changed the catalog, and that information
+ * is not serialized to SnapBuilder.  Therefore, if the logical
+ * decoding decodes the commit record of the transaction that actually
+ * has done catalog changes without these records, we miss to add
+ * the xid to the snapshot so up creating the wrong snapshot.

The part of the sentence "... snapshot so up creating the wrong
snapshot." is not clear. In this comment, at one place you have used
two spaces after a full stop, and at another place, there is one
space. I think let's follow nearby code practice to use a single space
before a new sentence.

4.
+void
+ReorderBufferProcessLastRunningXacts(ReorderBuffer *rb,
xl_running_xacts *running)
+{
+ /* Quick exit if there is no longer last running xacts */
+ if (likely(rb->n_last_running_xacts == 0))
+ return;
+
+ /* First call, build the last running xact list */
+ if (rb->n_last_running_xacts == -1)
+ {
+ int nxacts = running->subxcnt + running->xcnt;
+ Size sz = sizeof(TransactionId) * nxacts;;
+
+ rb->last_running_xacts = MemoryContextAlloc(rb->context, sz);
+ memcpy(rb->last_running_xacts, running->xids, sz);
+ qsort(rb->last_running_xacts, nxacts, sizeof(TransactionId), xidComparator);
+
+ rb->n_last_running_xacts = nxacts;
+
+ return;
+ }

a. Can we add the function header comments for this function?
b. We seem to be tracking the running_xact information for the first
running_xact record after start/restart. The name last_running_xacts
doesn't sound appropriate for that, how about initial_running_xacts?

5.
+ /*
+ * Purge xids in the last running xacts list if we can do that for at least
+ * one xid.
+ */
+ if (NormalTransactionIdPrecedes(rb->last_running_xacts[0],
+ running->oldestRunningXid))

I think it would be a good idea to add a few lines here explaining why
it is safe to purge. IIUC, it is because the commit for those xacts
would have already been processed and we don't need such a xid
anymore.

6. As per the discussion above in this thread having
XACT_XINFO_HAS_INVALS in the commit record doesn't indicate that the
xact has catalog changes, so can we add somewhere in comments that for
such a case we can't distinguish whether the txn has catalog change
but we still mark the txn has catalog changes? Can you please share
one example for this case?

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: [PATCH] Fix pg_upgrade test from v10
Next
From: Peter Eisentraut
Date:
Subject: Re: Unify DLSUFFIX on Darwin