Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers

From shveta malik
Subject Re: Conflict detection for update_deleted in logical replication
Date
Msg-id CAJpy0uBAKjU67jA31+d=B2gQK2ngiao8qBOXym+eG-g2_G1Hwg@mail.gmail.com
In response to Re: Conflict detection for update_deleted in logical replication  (Dilip Kumar <dilipbalaut@gmail.com>)
List pgsql-hackers
On Thu, Jul 17, 2025 at 9:56 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Jul 11, 2025 at 4:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jul 10, 2025 at 6:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Jul 9, 2025 at 9:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > >
> > > > > I think that even with retain_conflict_info = off, there is probably a
> > > > > point at which the subscriber can no longer keep up with the
> > > > > publisher. For example, if with retain_conflict_info = off we can
> > > > > withstand 100 clients running at the same time, then the fact that
> > > > > this performance degradation occurred with 15 clients explains that
> > > > > performance degradation is much more likely to occur because of
> > > > > retain_conflict_info = on.
> > > > >
> > > > > Test cases 3 and 4 are typical cases where this feature is used since
> > > > > the conflicts actually happen on the subscriber, so I think it's
> > > > > important to look at the performance in these cases. The worst case
> > > > > scenario for this feature is that when this feature is turned on, the
> > > > > subscriber cannot keep up even with a small load, and with
> > > > > max_conflict_retention_duration we enter a loop of slot invalidation
> > > > > and re-creation, which means that conflicts cannot be detected
> > > > > reliably.
> > > > >
> > > >
> > > > As per the above observations, it is less of a regression of this
> > > > feature but more of a lack of parallel apply or some kind of pre-fetch
> > > > for apply, as is recently proposed [1]. I feel there are use cases, as
> > > > explained above, for which this feature would work without any
> > > > downside, but due to a lack of some sort of parallel apply, we may not
> > > > be able to use it without any downside for cases where the contention
> > > > is only on a smaller set of tables. We have not tried it, but in
> > > > cases where contention is on a smaller set of tables, if users
> > > > distribute the workload among different pub-sub pairs by using row
> > > > filters, we may see less regression there as well. We can try that
> > > > too.
> > >
> > > While I understand that there are some possible solutions we have
> > > today to reduce the contention, I'm not really sure these are really
> > > practical solutions as it increases the operational costs instead.
> > >
> >
> > I assume by operational costs you mean defining the replication
> > definitions such that workload is distributed among multiple apply
> > workers via subscriptions either by row_filters, or by defining
> > separate pub-sub pairs of a set of tables, right? If so, I agree with
> > you but I can't think of a better alternative. Even without this
> > feature as well, we know in such cases the replication lag could be
> > large, as is evident in a recent thread [1] and some offlist feedback
> > from people using native logical replication. As per a POC in that
> > thread [1], by parallelizing apply or using some prefetch, we could
> > reduce the lag, but we need to wait for that work to mature to see its
> > actual effect.
> >
> > The path I see with this work is to clearly document the cases
> > (configurations) where this feature could be used without much
> > downside, and to keep the default value of the subscription option
> > that enables it as false (which is already the case with the patch).
> > Do you see any better alternative for moving forward?
>
> I was just thinking about what are the most practical use cases where
> a user would need multiple active writer nodes. Most applications
> typically function well with a single active writer node. While it's
> beneficial to have multiple nodes capable of writing for immediate
> failover (e.g., if the current writer goes down), or they select a
> primary writer via consensus algorithms like Raft/Paxos, I rarely
> encounter use cases where users require multiple active writer nodes
> for scaling write workloads.

Thank you for the feedback. In the scenario with a single writer node
and a subscriber with RCI enabled, we have not observed any
regression. Please refer to the test report at [1], specifically test
cases 1 and 2, which involve a single writer node. Next, we can test a
scenario with multiple (2-3) writer nodes publishing changes and a
subscriber node subscribing to those writers with RCI enabled, which
would also serve as a good use case for the conflict detection we are
targeting with RCI.
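As a rough sketch, such a multi-writer test could be set up as below.
The option name retain_conflict_info follows the patch discussed in
this thread, and the host names and table are illustrative
assumptions:

```sql
-- On each writer node (node1, node2): publish the shared table
CREATE PUBLICATION pub_orders FOR TABLE orders;

-- On the subscriber: one subscription per writer, with conflict
-- info retention enabled so update_deleted can be detected
CREATE SUBSCRIPTION sub_node1
    CONNECTION 'host=node1 dbname=postgres'
    PUBLICATION pub_orders
    WITH (retain_conflict_info = on);

CREATE SUBSCRIPTION sub_node2
    CONNECTION 'host=node2 dbname=postgres'
    PUBLICATION pub_orders
    WITH (retain_conflict_info = on);
```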

>
> One common use case for multiple active writer nodes is in
> geographically distributed systems. Here, a dedicated writer in each
> zone can significantly reduce write latency by sending writes to the
> nearest zone.
>
> In a multi-zone replication setup with an active writer in each zone
> and data replicated across all zones, performance can be impacted by
> factors like network latency. However, if such configurations are
> implemented wisely and subscriptions are managed effectively, this
> performance impact can be minimized.
>
> IMHO, the same principle applies to this case when
> ‘retain_conflict_info’ is set to ON. If this setting is enabled, it
> should only be used where absolutely essential. Additionally, the user
> or DBA must carefully consider other factors. For instance, if they
> use a single subscriber in each zone and subscribe to everything
> across all zones, performance will significantly degrade. However, if
> managed properly by subscribing only to data relevant to each zone and
> using multiple subscribers for parallel apply of different
> tables/partitions to reduce delay, it should work fine.
>

Strongly agree with this. We tested scenarios involving multiple
subscribers, each subscribing to exclusive data, as well as
publishers using row filters. In both cases, no regressions were
observed. Please refer to the test results at [2] and [3].
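For reference, a minimal sketch of the row-filter approach (the table,
filter expressions, and connection strings are illustrative
assumptions): each publication carries a disjoint slice of the table,
so each subscription gets its own apply worker and the write workload
is spread across them:

```sql
-- On the publisher: split one table across two publications
-- using row filters (PostgreSQL 15+ syntax)
CREATE PUBLICATION pub_even FOR TABLE orders WHERE (id % 2 = 0);
CREATE PUBLICATION pub_odd  FOR TABLE orders WHERE (id % 2 = 1);

-- On the subscriber: one subscription (hence one apply worker)
-- per publication
CREATE SUBSCRIPTION sub_even
    CONNECTION 'host=pub dbname=postgres' PUBLICATION pub_even;
CREATE SUBSCRIPTION sub_odd
    CONNECTION 'host=pub dbname=postgres' PUBLICATION pub_odd;
```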

[1]: https://www.postgresql.org/message-id/OSCPR01MB1496663AED8EEC566074DFBC9F54CA%40OSCPR01MB14966.jpnprd01.prod.outlook.com

[2]: row filter -
https://www.postgresql.org/message-id/OSCPR01MB149660DD40A9D7C18E2E11C97F548A%40OSCPR01MB14966.jpnprd01.prod.outlook.com

[3]: Multiple subscriptions -
https://www.postgresql.org/message-id/CABdArM5kvA7mPLLwy6XEDkHi0MNs1RidvAcYmm2uVd95U%3DyzwQ%40mail.gmail.com

thanks
Shveta


