Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns - Mailing list pgsql-hackers

From: Masahiko Sawada
Subject: Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
Msg-id: CAD21AoDFSYfg6=K=t2R_uq=r6dqLhTFHzMCAOcOC0ySWuG8dOA@mail.gmail.com
In response to: Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers

On Tue, Jun 14, 2022 at 3:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 13, 2022 at 8:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Jun 7, 2022 at 9:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, May 30, 2022 at 11:13 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Wed, May 25, 2022 at 12:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > >
> > > >
> > > > poc_add_regression_tests.patch adds regression tests for this bug. The
> > > > regression tests are required for both HEAD and back-patching but I've
> > > > separated this patch for testing the above two patches easily.
> > > >
> >
> > Thank you for the comments.
> >
> > >
> > > Few comments on the test case patch:
> > > ===============================
> > > 1.
> > > +# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
> > > +# only its COMMIT record, because it starts from the RUNNING_XACT record emitted
> > > +# during the first checkpoint execution.  This transaction must be marked as
> > > +# catalog-changes while decoding the COMMIT record and the decoding of the INSERT
> > > +# record must read the pg_class with the correct historic snapshot.
> > > +permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
> > >
> > > Will this test always work? What if we get an additional running_xact
> > > record between steps "s0_commit" and "s0_begin" that is logged via
> > > bgwriter? You can mimic that by adding an additional checkpoint
> > > between those two steps. If we do that, the test will pass even
> > > without the patch because I think the last decoding will start
> > > decoding from this new running_xact record.
> >
> > Right. Depending on the timing, the test could pass even without the
> > fix, but it never fails because of timing. I think we would need to
> > somehow stop bgwriter to make the test case stable, but that seems
> > unrealistic.
> >
>
> Agreed. In my local testing for this case, I used to increase
> LOG_SNAPSHOT_INTERVAL_MS to avoid such a situation, but I understand
> that is not practical in a test.
>
> > Do you have any
> > better ideas?
> >
>
> No, I don't have any better ideas. I think it is better to add some
> information related to this in the comments because it may help to
> improve the test in the future if we come up with a better idea.

I don't have any better ideas for making it stable either, so I agree.
I've attached an updated version of the patch that adds the regression
tests.
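
For reference, the variation described above (an extra running_xact
record logged between "s0_commit" and "s0_begin") can be mimicked by
adding one "s1_checkpoint" step to the quoted permutation. The step
names are taken from the quoted test. This is only a sketch of the
timing issue and is not part of the attached patch. As noted above, it
is expected to pass even without the fix, because the last decoding
would start from the new running_xact record:

```
# Same as the quoted permutation, with an additional "s1_checkpoint"
# between the first "s0_commit" and the second "s0_begin".
permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_checkpoint" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
```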

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/
