Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
Date
Msg-id CAA4eK1LBtv6ayE+TvCcPmC-xse=DVg=SmbyQD1nv_AaqcpUJEg@mail.gmail.com
Whole thread Raw
In response to Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
List pgsql-hackers
On Fri, Jul 29, 2022 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >
> > Yeah, your description makes sense to me. I've also considered how to
> > hit this path but I guess it is never hit. Thinking of it in another
> > way, first of all, at least 2 catalog modifying transactions have to
> > be running while writing a xl_running_xacts. The serialized snapshot
> > that is written when we decode the first xl_running_xact has two
> > transactions. Then, one of them is committed before the second
> > xl_running_xacts. The second serialized snapshot has only one
> > transaction. Then, the transaction is also committed after that. Now,
> > in order to execute the path, we need to start decoding from the first
> > serialized snapshot. However, if we start from there, we cannot decode
> > the full contents of the transaction that was committed later.
> >
>
> I think then we should change this code in the master branch patch
> with an additional comment on the lines of: "Either all the xacts got
> purged or none. It is only possible to partially remove the xids from
> this array if one or more of the xids are still running but not all.
> That can happen if we start decoding from a point (LSN where the
> snapshot state became consistent) where all the xacts in this were
> running and then at least one of those got committed and a few are
> still running. We will never start from such a point because we won't
> move the slot's restart_lsn past the point where the oldest running
> transaction's restart_decoding_lsn is."
>

Unfortunately, this theory doesn't turn out to be true. While
investigating the latest buildfarm failure [1], I see that it is
possible that only part of the xacts in the restored catalog modifying
xacts list needs to be purged. See the attached where I have
demonstrated it via a reproducible test. It seems the point we were
missing was that to start from a point where two or more catalog
modifying were serialized, it requires another open transaction before
both get committed, and then we need the checkpoint or other way to
force running_xacts record in-between the commit of initial two
catalog modifying xacts. There could possibly be other ways as well
but the theory above wasn't correct.

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2022-08-25%2004%3A15%3A34


--
With Regards,
Amit Kapila.

Attachment

pgsql-hackers by date:

Previous
From: John Naylor
Date:
Subject: Re: windows cfbot failing: my_perl
Next
From: John Naylor
Date:
Subject: Re: use ARM intrinsics in pg_lfind32() where available