Thread: RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
From
"osumi.takamichi@fujitsu.com"
Date:
Hi On Tuesday, March 16, 2021 1:35 AM Oh, Mike <minsoo@amazon.com> wrote: > We noticed that the logical replication could fail when the > Standby::RUNNING_XACT record is generated in the middle of a catalog > modifying transaction and if the logical decoding has to restart from the > RUNNING_XACT > WAL entry. ... > Proposed solution: > If we’re decoding a catalog modifying commit record, then check whether > it’s part of the RUNNING_XACT xid’s processed @ the restart_lsn. If so, > then add its xid & subxacts in the committed txns list in the logical decoding > snapshot. > > Please refer to the attachment for the proposed patch. Let me share some review comments for the patch. (1) last_running declaration Isn't it better to add static for this variable, because we don't use this in other places ? @@ -85,6 +86,9 @@ static bool DecodeTXNNeedSkip(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, Oid dbId, RepOriginId origin_id); +/* record previous restart_lsn running xacts */ +xl_running_xacts *last_running = NULL; (2) DecodeStandbyOp's memory free I'm not sure when we pass this condition with already allocated last_running, but do you need to free it's xid array here as well, if last_running isn't null ? Otherwise, we'll miss the chance after this. + /* record restart_lsn running xacts */ + if (MyReplicationSlot && (buf->origptr == MyReplicationSlot->data.restart_lsn)) + { + if (last_running) + free(last_running); + + last_running = NULL; (3) suggestion of small readability improvement We calculate the same size twice here and DecodeCommit. I suggest you declare a variable that stores the computed result of size, which might shorten those codes. + /* + * xl_running_xacts contains a xids Flexible Array + * and its size is subxcnt + xcnt. + * Take that into account while allocating + * the memory for last_running. + */ + last_running = (xl_running_xacts *) malloc(sizeof(xl_running_xacts) + + sizeof(TransactionId ) + * (running->subxcnt + running->xcnt)); + memcpy(last_running, running, sizeof(xl_running_xacts) + + (sizeof(TransactionId) + * (running->subxcnt+ running->xcnt))); Best Regards, Takamichi Osumi
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
From
"osumi.takamichi@fujitsu.com"
Date:
On Friday, September 24, 2021 5:03 PM I wrote: > On Tuesday, March 16, 2021 1:35 AM Oh, Mike <minsoo@amazon.com> wrote: > > We noticed that the logical replication could fail when the > > Standby::RUNNING_XACT record is generated in the middle of a catalog > > modifying transaction and if the logical decoding has to restart from > > the RUNNING_XACT WAL entry. > ... > > Proposed solution: > > If we’re decoding a catalog modifying commit record, then check > > whether it’s part of the RUNNING_XACT xid’s processed @ the > > restart_lsn. If so, then add its xid & subxacts in the committed txns > > list in the logical decoding snapshot. > > > > Please refer to the attachment for the proposed patch. > > > Let me share some review comments for the patch. .... > (3) suggestion of small readability improvement > > We calculate the same size twice here and DecodeCommit. > I suggest you declare a variable that stores the computed result of size, which > might shorten those codes. > > + /* > + * xl_running_xacts contains a xids > Flexible Array > + * and its size is subxcnt + xcnt. > + * Take that into account while > allocating > + * the memory for last_running. > + */ > + last_running = (xl_running_xacts *) > malloc(sizeof(xl_running_xacts) > + > + sizeof(TransactionId ) > + > * (running->subxcnt + running->xcnt)); > + memcpy(last_running, running, > sizeof(xl_running_xacts) > + > + (sizeof(TransactionId) > + > + * (running->subxcnt + running->xcnt))); Let me add one more basic review comment in DecodeStandbyOp(). Why do you call raw malloc directly ? You don't have the basic check whether the return value is NULL or not and intended to call palloc here instead ? Best Regards, Takamichi Osumi