Thread: RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

Hi


On Tuesday, March 16, 2021 1:35 AM Oh, Mike <minsoo@amazon.com> wrote:
> We noticed that the logical replication could fail when the
> Standby::RUNNING_XACT record is generated in the middle of a catalog
> modifying transaction and if the logical decoding has to restart from the
> RUNNING_XACT
> WAL entry.
...
> Proposed solution:
> If we’re decoding a catalog modifying commit record, then check whether
> it’s part of the RUNNING_XACT xid’s processed @ the restart_lsn. If so,
> then add its xid & subxacts in the committed txns list in the logical decoding
> snapshot.
> 
> Please refer to the attachment for the proposed patch.


Let me share some review comments for the patch.

(1) last_running declaration

Isn't it better to add static for this variable,
because we don't use this in other places ?

@@ -85,6 +86,9 @@ static bool DecodeTXNNeedSkip(LogicalDecodingContext *ctx,
                                                          XLogRecordBuffer *buf, Oid dbId,
                                                          RepOriginId origin_id);

+/* record previous restart_lsn running xacts */
+xl_running_xacts *last_running = NULL;


(2) DecodeStandbyOp's memory free

I'm not sure when
we pass this condition with already allocated last_running,
but do you need to free it's xid array here as well,
if last_running isn't null ?
Otherwise, we'll miss the chance after this.

+                               /* record restart_lsn running xacts */
+                               if (MyReplicationSlot && (buf->origptr == MyReplicationSlot->data.restart_lsn))
+                               {
+                                       if (last_running)
+                                               free(last_running);
+
+                                       last_running = NULL;

(3) suggestion of small readability improvement

We calculate the same size twice here and DecodeCommit.
I suggest you declare a variable that stores the computed result of size,
which might shorten those codes.

+                                       /*
+                                        * xl_running_xacts contains a xids Flexible Array
+                                        * and its size is subxcnt + xcnt.
+                                        * Take that into account while allocating
+                                        * the memory for last_running.
+                                        */
+                                       last_running = (xl_running_xacts *) malloc(sizeof(xl_running_xacts)
+
       + sizeof(TransactionId )
 
+
       * (running->subxcnt + running->xcnt));
 
+                                       memcpy(last_running, running, sizeof(xl_running_xacts)
+                                                                                                                +
(sizeof(TransactionId)
+                                                                                                                *
(running->subxcnt+ running->xcnt)));
 

Best Regards,
    Takamichi Osumi


On Friday, September 24, 2021 5:03 PM I wrote:
> On Tuesday, March 16, 2021 1:35 AM Oh, Mike <minsoo@amazon.com> wrote:
> > We noticed that the logical replication could fail when the
> > Standby::RUNNING_XACT record is generated in the middle of a catalog
> > modifying transaction and if the logical decoding has to restart from
> > the RUNNING_XACT WAL entry.
> ...
> > Proposed solution:
> > If we’re decoding a catalog modifying commit record, then check
> > whether it’s part of the RUNNING_XACT xid’s processed @ the
> > restart_lsn. If so, then add its xid & subxacts in the committed txns
> > list in the logical decoding snapshot.
> >
> > Please refer to the attachment for the proposed patch.
> 
> 
> Let me share some review comments for the patch.
....
> (3) suggestion of small readability improvement
> 
> We calculate the same size twice here and DecodeCommit.
> I suggest you declare a variable that stores the computed result of size, which
> might shorten those codes.
> 
> +                                       /*
> +                                        * xl_running_xacts contains a xids
> Flexible Array
> +                                        * and its size is subxcnt + xcnt.
> +                                        * Take that into account while
> allocating
> +                                        * the memory for last_running.
> +                                        */
> +                                       last_running = (xl_running_xacts *)
> malloc(sizeof(xl_running_xacts)
> +
> + sizeof(TransactionId )
> +
> * (running->subxcnt + running->xcnt));
> +                                       memcpy(last_running, running,
> sizeof(xl_running_xacts)
> +
> + (sizeof(TransactionId)
> +
> + * (running->subxcnt + running->xcnt)));
Let me add one more basic review comment in DecodeStandbyOp().

Why do you call raw malloc directly ?
You don't have the basic check whether the return value is
NULL or not and intended to call palloc here instead ?


Best Regards,
    Takamichi Osumi