RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns - Mailing list pgsql-hackers

From osumi.takamichi@fujitsu.com
Subject RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
Date
Msg-id OSBPR01MB4888FAB98EAA23CA3FEE1CE7EDB29@OSBPR01MB4888.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Thursday, October 7, 2021 1:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Regarding the patch details, I have two comments:
> 
> ---
> + if ((parsed->xinfo & XACT_XINFO_HAS_INVALS) && last_running) {
> +     /* make last_running->xids bsearch()able */
> +     qsort(last_running->xids,
> +              last_running->subxcnt + last_running->xcnt,
> +              sizeof(TransactionId), xidComparator);
> 
> The patch does qsort() every time when the commit message has
> XACT_XINFO_HAS_INVALS. IIUC the xids we need to remember is the only
> xids that are recorded in the first replayed XLOG_RUNNING_XACTS, right? If so,
> we need to do qsort() once, can remove xid from the array once it gets
> committed, and then can eventually make last_running empty so that we can
> skip even TransactionIdInArray().
> 
> ---
> Since last_running is allocated by malloc() and it isn't freed even after finishing
> logical decoding.
> 
> 
> Another idea to fix this problem would be that before calling
> SnapBuildCommitTxn() we create transaction entries in ReorderBuffer for
> (sub)transactions whose COMMIT record has XACT_XINFO_HAS_INVALS,
> and then mark all of them as catalog-changed by calling
> ReorderBufferXidSetCatalogChanges(). I've attached a PoC patch for this idea.
> What the patch does is essentially the same as what the proposed patch does.
> But the patch doesn't modify the SnapBuildCommitTxn(). And we remember
> the list of last running transactions in reorder buffer and the list is periodically
> purged during decoding RUNNING_XACTS records, eventually making it
> empty.
Thanks for the patch.

Conducted a quick check of the POC.

Test of check-world PASSED with your patch and head.
Also, the original scenario described in [1] looks fine
with your revised patch and LOG_SNAPSHOT_INTERVAL_MS expansion in the procedure.

The last command in the provided steps showed below.

postgres=# select * from pg_logical_slot_get_changes('bdt_slot', null, null);
    lsn    | xid |                  data                  
-----------+-----+----------------------------------------
 0/1560020 | 710 | BEGIN 710
 0/1560020 | 710 | table public.bdt: INSERT: a[integer]:1
 0/1560140 | 710 | COMMIT 710


Minor comments for DecodeStandbyOp changes I noticed instantly
(1) minor suggestion of your comment.


+                                * has done catalog changes without these records, we miss to add
+                                * the xid to the snapshot so up creating the wrong snapshot. To

"miss to add" would be miss adding or fail to add.
And "up creating" is natural in this sentence ?

(2) a full-width space between "it'" and "s" in the next sentence.

+                                * mark an xid that actually has not done that but it’s not a



[1] - https://www.postgresql.org/message-id/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com

Best Regards,
    Takamichi Osumi



pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: storing an explicit nonce
Next
From: Michael Paquier
Date:
Subject: Re: Add jsonlog log_destination for JSON server logs