Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Conflict detection for update_deleted in logical replication
Date
Msg-id CAA4eK1Lqxrh_Gj4SmYJ-QagXBPNeeoQU6mxZ4RXj8FTaK6L55w@mail.gmail.com
In response to Re: Conflict detection for update_deleted in logical replication  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Conflict detection for update_deleted in logical replication
List pgsql-hackers
On Tue, Jul 8, 2025 at 12:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jul 7, 2025 at 12:03 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
>
> I think these performance regressions occur because at some point the
> subscriber can no longer keep up with the changes occurring on the
> publisher. This is because the publisher runs multiple transactions
> simultaneously, while the subscriber applies them with one apply
> worker. When retain_conflict_info = on, the performance of the apply
> worker deteriorates because it retains dead tuples, and as a result it
> gradually cannot keep up with the publisher, the table bloats, and the
> TPS of pgbench executed on the subscriber is also affected. This
> happened when only 40 clients (or 15 clients according to the results
> of test 4?) were running simultaneously.
>

I think the primary reason here is the speed of one apply worker vs.
15 or 40 clients working on the publisher, with all the data being
replicated. We don't see regression at 3 clients, which suggests the
apply worker is able to keep up with that much workload. Now, we have
checked that if the workload is slightly different, such that a few
clients (say 1-3) work on the same set of tables and we create a
separate pub-sub pair for each such set of clients (for example, with
3 clients working on tables t1 and t2 and another 3 clients working on
tables t3 and t4, we can have 2 pub-sub pairs: one for tables t1 and
t2, and another for t3 and t4), then the regression after enabling
retain_conflict_info is almost negligible. Additionally, for very
large transactions that can be parallelized, we shouldn't see any
regression because those can be applied in parallel.
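The pub-sub-pair setup described above can be sketched as follows (a
sketch only; the table names, object names, and connection string are
hypothetical):

```sql
-- On the publisher: one publication per disjoint set of tables.
CREATE PUBLICATION pub_t1_t2 FOR TABLE t1, t2;
CREATE PUBLICATION pub_t3_t4 FOR TABLE t3, t4;

-- On the subscriber: a dedicated subscription (and hence a dedicated
-- apply worker) for each publication, so the two table sets are
-- applied in parallel by separate workers.
CREATE SUBSCRIPTION sub_t1_t2
    CONNECTION 'host=pub_host dbname=postgres'  -- hypothetical
    PUBLICATION pub_t1_t2;
CREATE SUBSCRIPTION sub_t3_t4
    CONNECTION 'host=pub_host dbname=postgres'
    PUBLICATION pub_t3_t4;
```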

> I think that even with retain_conflict_info = off, there is probably a
> point at which the subscriber can no longer keep up with the
> publisher. For example, if with retain_conflict_info = off we can
> withstand 100 clients running at the same time, then the fact that
> this performance degradation occurred with 15 clients shows that the
> degradation is much more likely to occur because of
> retain_conflict_info = on.
>
> Test cases 3 and 4 are typical cases where this feature is used since
> the conflicts actually happen on the subscriber, so I think it's
> important to look at the performance in these cases. The worst case
> scenario for this feature is that when this feature is turned on, the
> subscriber cannot keep up even with a small load, and with
> max_conflict_retention_duration we enter a loop of slot invalidation
> and re-creation, which means that conflicts cannot be detected
> reliably.
>

As per the above observations, this is less a regression of this
feature and more a lack of parallel apply or some kind of pre-fetch
for apply, as was recently proposed [1]. I feel there are use cases,
as explained above, for which this feature would work without any
downside, but due to the lack of some sort of parallel apply, we may
not be able to use it without downside in cases where the contention
is on a smaller set of tables. We have not tried it, but even when
contention is on a smaller set of tables, if users distribute the
workload among different pub-sub pairs by using row filters, we may
see less regression there as well. We can try that too.
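The row-filter idea might look like this (a sketch; the table name and
the modulus-based split are assumptions, and row filters require
PostgreSQL 15 or later):

```sql
-- On the publisher: split one contended table across two publications
-- using row filters, so each subscription's apply worker handles a
-- disjoint slice of the rows.
CREATE PUBLICATION pub_even FOR TABLE accounts WHERE (id % 2 = 0);
CREATE PUBLICATION pub_odd  FOR TABLE accounts WHERE (id % 2 = 1);
```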

>
> > I think the hot standby feedback also has a similar impact on the performance
> > of the primary, which is done to prevent the early removal of data necessary
> > for the standby, ensuring that it remains accessible when needed.
>
> Right. I think it's likely to happen if there is a long running
> read-only query on the standby. But does it happen also when there are
> only short read-only transactions on the standbys?
>

IIUC, the regression happens simply by increasing the value of
recovery_min_apply_delay. See case 5 in email [2]. This shows that we
can see some regression in physical replication as well when there is
a delay in replication.
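For reference, such a delay is introduced with a standby-side setting
like the following (the value shown is illustrative, not the one used
in that test):

```
# postgresql.conf on the standby
recovery_min_apply_delay = '5min'
```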

[1] - https://www.postgresql.org/message-id/7b60e4e1-de40-4956-8135-cb1dc2be62e9%40garret.ru
[2] - https://www.postgresql.org/message-id/CABdArM4OEwmh_31dQ8_F__VmHwk2ag_M%3DYDD4H%2ByYQBG%2BbHGzg%40mail.gmail.com

--
With Regards,
Amit Kapila.


