On Wed, May 14, 2025 at 9:16 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 13, 2025 at 4:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 13, 2025 at 3:48 PM shveta malik <shveta.malik@gmail.com> wrote:
> > >
> > > Hi All,
> > >
> > > It is a spin-off thread from earlier discussions at [1] and [2].
> > >
> > > While analyzing the slot-sync BF failure as stated in [1], it was
> > > observed that there are chances that confirmed_flush_lsn may move
> > > backward depending on the feedback messages received from the
> > > downstream system. It was suspected that the backward movement of
> > > confirmed_flush_lsn may result in data duplication issues. Earlier we
> > > were able to successfully reproduce the issue with two_phase enabled
> > > subscriptions (see[2]). Now on further analysing, it seems possible
> > > that data duplication issues may happen without two-phase as well.
> >
> > Thanks for the detailed explanation. Before we focus on patching the
> > symptoms, I’d like to explore whether the issue can be addressed on
> > the subscriber side. Specifically, have we analyzed if there’s a way
> > to prevent the subscriber from moving the LSN backward in the first
> > place? That might lead to a cleaner and more robust solution overall.
> >
>
> The subscriber doesn't move the LSN backwards, it only shares the
> information with the publisher, which is the latest value of remote
> LSN tracked by the origin. Now, as explained in email [1], the
> subscriber doesn't persistently store/advance the LSN, for which it
> doesn't have to do anything like DDLs, or any other non-published
> DMLs. However, subscribers need to send confirmation of such LSNs for
> synchronous replication. This is commented in the code as well, see
> comments in CreateDecodingContext (It might seem like we should error
> out in this case, but it's pretty common for a client to acknowledge a
> LSN it doesn't have to do anything for ...). As mentioned in email[1],
> persisting the LSN information that the subscriber doesn't have to do
> anything with could be a noticeable performance overhead.

Thanks for your response.

What I meant wasn’t that the subscriber is moving the confirmed LSN
backward, nor was I suggesting we fix it by persisting the LSN on the
subscriber side. My question was: does the fact that the subscriber
sends an LSN older than one it has already sent indicate a bug on the
subscriber side? And if so, should the logic be fixed there?

I understand this might not be feasible, and it may not even be a bug
on the subscriber side; it could be an intentional part of the design.
But my question was whether we’ve already considered and ruled out
that possibility.
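
To make the question concrete, here is a minimal sketch of what a
subscriber-side guard could look like. It is purely illustrative:
FeedbackState, clamp_flush_lsn and the plain uint64_t LSN type are
hypothetical stand-ins, not the actual apply worker code.

```c
#include <stdint.h>

/* Stand-in for PostgreSQL's LSN type (assumption for this sketch). */
typedef uint64_t XLogRecPtr;

/* Hypothetical per-session state, kept in memory only. */
typedef struct FeedbackState
{
    XLogRecPtr last_sent_flush; /* highest flush LSN reported so far */
} FeedbackState;

/*
 * Return the flush LSN to report to the publisher, never going below
 * what this session has already reported.
 */
static XLogRecPtr
clamp_flush_lsn(FeedbackState *st, XLogRecPtr candidate)
{
    if (candidate < st->last_sent_flush)
        candidate = st->last_sent_flush;
    st->last_sent_flush = candidate;
    return candidate;
}
```

A guard like this only holds within one session: the state is not
persisted, so a freshly started process begins again from zero and
would happily report an older LSN.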

That said, I’m planning to dig deeper into the full sequence of steps
to understand exactly how this behavior is occurring. Hopefully, from
there, I might get a better idea of why the subscriber is doing that.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com