Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
Date
Msg-id CAD21AoAVTB48YHV5XJJLBsiRkeTqk-BrjnXasnBshPiq_WdUbQ@mail.gmail.com
Whole thread Raw
In response to Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
List pgsql-hackers
On Thu, Jul 28, 2022 at 8:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 28, 2022 at 3:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jul 26, 2022 at 1:22 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > Okay, I've attached an updated patch that does the above idea. Could
> > > you please do the performance tests again to see if the idea can help
> > > reduce the overhead, Shi yu?
> > >
> >
> > While reviewing the patch for HEAD, I have changed a few comments. See
> > attached, if you agree with these changes then include them in the
> > next version.
> >
>
> I have another comment on this patch:
> SnapBuildPurgeOlderTxn()
> {
> ...
> + if (surviving_xids > 0)
> + memmove(builder->catchange.xip, &(builder->catchange.xip[off]),
> + surviving_xids * sizeof(TransactionId))
> ...
>
> For this code to hit, we must have a situation where one or more of
> the xacts in this array must be still running. And, if that is true,
> we would not have started from the restart point where the
> corresponding snapshot (that contains the still running xacts) has
> been serialized because we advance the restart point to not before the
> oldest running xacts restart_decoding_lsn. This may not be easy to
> understand so let me take an example to explain. Say we have two
> transactions t1 and t2, and both have made catalog changes. We want a
> situation where one of those gets purged and the other remains in
> builder->catchange.xip array. I have tried variants of the below
> sequence to see if I can get into the required situation but am not
> able to make it.
>
> Session-1
> Checkpoint -1;
> T1
> DDL
>
> Session-2
> T2
> DDL
>
> Session-3
> Checkpoint-2;
> pg_logical_slot_get_changes()
>  -- Here when we serialize the snapshot corresponding to
> CHECKPOINT-2's running_xact record, we will serialize both t1 and t2
> as catalog-changing xacts.
>
> Session-1
> T1
> Commit;
>
> Checkpoint;
> pg_logical_slot_get_changes()
>  -- Here we will restore from Checkpoint-1's serialized snapshot and
> won't be able to move restart_point to Checkpoint-2 because T2 is
> still open.
>
> Now, as per my understanding, it is only possible to move
> restart_point to Checkpoint-2 if T2 gets committed/rolled-back in
> which case we will never have that in surviving_xids array after the
> purge.
>
> It is possible I am missing something here. Do let me know your thoughts.

Yeah, your description makes sense to me. I've also considered how to
hit this path but I guess it is never hit. Thinking of it in another
way, first of all, at least 2 catalog modifying transactions have to
be running while writing a xl_running_xacts. The serialized snapshot
that is written when we decode the first xl_running_xact has two
transactions. Then, one of them is committed before the second
xl_running_xacts. The second serialized snapshot has only one
transaction. Then, the transaction is also committed after that. Now,
in order to execute the path, we need to start decoding from the first
serialized snapshot. However, if we start from there, we cannot decode
the full contents of the transaction that was committed later.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



pgsql-hackers by date:

Previous
From: Jacob Champion
Date:
Subject: [Commitfest 2022-07] Patch Triage: Needs Review, Part 2
Next
From: Justin Pryzby
Date:
Subject: Re: Cirrus CI (Windows help wanted)