On Wed, Aug 20, 2025 at 11:08 AM Zane Duffield <duffieldzane@gmail.com> wrote:
>>
>> > On Monday, August 18, 2025 4:12 PM Zane Duffield
>> > <duffieldzane@gmail.com> wrote:
>> > > On Mon, Aug 11, 2025 at 9:28 PM Zhijie Hou (Fujitsu)
>> > > <houzj.fnst@fujitsu.com> wrote:
>
>
> Yes, I think it is the cause of the lag (every peak lines up directly with a restart of the apply workers), but I'm
> not sure how it relates to the complete stall shown in confirmed_flush_lsn_lag_graph_2025_08_09.png (attached again).
>
>>
>> > This might be due to a SIGINT triggered by a lock_timeout or statement_timeout,
>> > although it's a bit weird that there are no timeout messages present in the logs.
>> > If my assumption is correct, the behavior is understandable: the parallel apply
>> > worker waits for the leader to send more data for the streamed transaction by
>> > acquiring and waiting on a lock. However, the leader might be occupied with
>> > other transactions, preventing it from sending additional data, which could
>> > potentially lead to a lock timeout.
>> >
>> > To confirm this, could you please provide the values you have set for
>> > lock_timeout, statement_timeout (on subscriber), and
>> > logical_decoding_work_mem (on publisher) ?
>
>
> lock_timeout = 30s
> statement_timeout = 4h
> logical_decoding_work_mem = 64MB
>
>>
>> >
>> > Additionally, for testing purposes, is it possible to disable these timeouts (by
>> > setting the lock_timeout and statement_timeout GUCs to their default values)
>> > in your testing environment to assess whether the lag still persists? This
>> > approach can help us determine whether the timeouts are causing the lag.
>
>
> This was a good question. See the attached confirmed_flush_lsn_lag_graph_2025_08_19.png.
> After setting lock_timeout to zero, the periodic peaks of lag were eliminated, and the restarts of the apply workers
> in the log were eliminated as well.
>
So, this was the reason. As explained by Hou-San in his previous
response, such a lock_timeout can lead to the parallel apply worker
exiting while it waits for more data from the leader. I think you need
to either set lock_timeout to 0 or set it to a higher value, similar
to what you set for statement_timeout.
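
For reference, a minimal sketch of how that could be done on the
subscriber ('sub_owner' below is just a placeholder for the role that
owns the subscription; pick whichever scope suits your setup):

  -- disable lock_timeout cluster-wide on the subscriber
  ALTER SYSTEM SET lock_timeout = 0;
  SELECT pg_reload_conf();

  -- or, alternatively, scope the change to the subscription owner's
  -- role (the apply workers connect as that role, so they should pick
  -- up the role-level setting when they start)
  ALTER ROLE sub_owner SET lock_timeout = 0;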
>
> One other thing I wonder is whether autovacuum on the subscriber has anything to do with the lock timeouts. I'm not
> sure whether this could explain the perpetually-restarting apply workers that we witnessed on 2025-08-09, though.
>
No, as per my understanding, it is because the parallel apply worker
is exiting due to the lock_timeout set in the test. Ideally, the patch
proposed by Kuroda-San should show in the LOGs that the parallel
worker is exiting due to lock_timeout. Can you try that once?
--
With Regards,
Amit Kapila.