Re: Exit walsender before confirming remote flush in logical replication - Mailing list pgsql-hackers

From Fujii Masao
Subject Re: Exit walsender before confirming remote flush in logical replication
Msg-id CAHGQGwELRshB7z4PdkON1AGXvFu88s4vbF61TX=Tn-2_c4_pYg@mail.gmail.com
In response to Re: Exit walsender before confirming remote flush in logical replication  (Andrey Silitskiy <a.silitskiy@postgrespro.ru>)
List pgsql-hackers
On Mon, Mar 30, 2026 at 12:14 PM Andrey Silitskiy
<a.silitskiy@postgrespro.ru> wrote:
>
> On Mar 29, 2026 Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>  > One possible reason why the hang may happen is that
>  > WalSndWaitForWal() is missing a WalSndCheckShutdownTimeout() call.
>
> On Mar 25, 2026 Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>  > I tested wal_sender_shutdown_timeout under several configurations and
>  > encountered a case where the primary shutdown got stuck, ...
>
> Thanks for your help in finding the issue!
>
> I reproduced the problem. In this configuration, it turned out that the
> walsender was not terminated by wal_sender_shutdown_timeout in
> WalSndWaitForWal(), but only when the physical slot was checked for the
> inactive flag, which caused shutdown to hang.

Regarding the issue I reported, Vitaly's analysis upthread seems correct to me.
If WalSndComputeSleeptime() is called before WalSndCheckShutdownTimeout(), then
shutdown_request_timestamp is still 0, so wal_sender_shutdown_timeout is not
taken into account even though shutdown has already been requested
(i.e., got_STOPPING || got_SIGUSR2 is true).

In that case, if wal_sender_timeout is large, the computed sleep time can also
be large, and the walsender may wait in WalSndWait() longer than intended.

To fix this, walsender should call WalSndCheckShutdownTimeout() first so that
shutdown_request_timestamp is set before computing the sleep time. The v7 patch
already does this, which looks good to me. The comments on
WalSndCheckShutdownTimeout() might need an update, though.
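
To make the ordering issue concrete, here is a minimal standalone sketch. It
is not the patch code: WalSndModel, the model_* functions, and the two
sleep_* helpers are hypothetical stand-ins, timestamps are plain milliseconds
rather than TimestampTz, and the way the shutdown deadline caps the sleep time
is my assumption about the intended behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t TimestampMs;	/* simplified stand-in for TimestampTz */

/* Hypothetical model of the walsender state discussed above. */
typedef struct WalSndModel
{
	bool		shutdown_requested;	/* models got_STOPPING || got_SIGUSR2 */
	TimestampMs shutdown_request_timestamp;	/* 0 means "not recorded yet" */
	TimestampMs last_reply_timestamp;
	int			wal_sender_timeout_ms;
	int			wal_sender_shutdown_timeout_ms;
} WalSndModel;

/*
 * Models WalSndCheckShutdownTimeout(): record the time at which the shutdown
 * request was first noticed, and report whether the shutdown timeout expired.
 */
static bool
model_check_shutdown_timeout(WalSndModel *ws, TimestampMs now)
{
	if (!ws->shutdown_requested || ws->wal_sender_shutdown_timeout_ms <= 0)
		return false;

	if (ws->shutdown_request_timestamp == 0)
	{
		ws->shutdown_request_timestamp = now;
		return false;
	}

	return now - ws->shutdown_request_timestamp >=
		ws->wal_sender_shutdown_timeout_ms;
}

/*
 * Models WalSndComputeSleeptime(): the shutdown deadline is only considered
 * when shutdown_request_timestamp has already been set (assumed behavior).
 */
static long
model_compute_sleeptime(const WalSndModel *ws, TimestampMs now)
{
	long		sleeptime = 10000;	/* arbitrary default for this sketch */

	assert(now > 0);			/* 0 is reserved for "unset" */

	if (ws->wal_sender_timeout_ms > 0)
	{
		TimestampMs remaining = ws->last_reply_timestamp +
			ws->wal_sender_timeout_ms - now;

		sleeptime = (long) (remaining > 0 ? remaining : 0);
	}

	if (ws->shutdown_request_timestamp != 0 &&
		ws->wal_sender_shutdown_timeout_ms > 0)
	{
		TimestampMs remaining = ws->shutdown_request_timestamp +
			ws->wal_sender_shutdown_timeout_ms - now;

		if (remaining < 0)
			remaining = 0;
		if (remaining < sleeptime)
			sleeptime = (long) remaining;
	}

	return sleeptime;
}

/* Problematic order: compute the sleep time before the shutdown check. */
static long
sleep_compute_first(WalSndModel ws, TimestampMs now)
{
	return model_compute_sleeptime(&ws, now);
}

/* Order used by the fix: run the shutdown check first. */
static long
sleep_check_first(WalSndModel ws, TimestampMs now)
{
	(void) model_check_shutdown_timeout(&ws, now);
	return model_compute_sleeptime(&ws, now);
}
```

In this model, with shutdown already requested, wal_sender_timeout at 60000 ms
and the shutdown timeout at 2000 ms, sleep_compute_first() yields the full
60000 ms while sleep_check_first() yields 2000 ms, which is the difference
described above.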

Regards,

--
Fujii Masao
