Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion? - Mailing list pgsql-hackers

From Fujii Masao
Subject Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
Msg-id CAHGQGwG3L=ppus6D6+RXxfZEdFgoAstJnbau=UU9WJZWdAoRoA@mail.gmail.com
In response to Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?  (Nisha Moond <nisha.moond412@gmail.com>)
List pgsql-hackers
On Wed, Mar 25, 2026 at 1:51 AM Nisha Moond <nisha.moond412@gmail.com> wrote:
> Thank you, Fujii-san, for sharing the steps. I am now able to
> reproduce the behavior where promotion gets stuck because the slot
> sync worker remains in a wait loop.

Thanks for the test!


> As an experiment, I tried setting tcp_user_timeout to 7000 / 15000
> (using slightly higher values for debugging). With this setting, the
> TCP stack terminates the connection if data sent to the primary
> remains unacknowledged beyond the configured timeout (e.g., due to a
> network drop). In such cases the slot sync worker exits instead of
> waiting indefinitely. With an appropriately tuned timeout, this could
> help avoid the promotion issue by ensuring the worker does not remain
> stuck when the connection to the primary is lost.

Yes, TCP timeout settings such as tcp_user_timeout, keepalives,
and net.ipv4.tcp_retries2 can help in this situation. However,
they involve a trade-off: very small timeouts reduce failover time
but increase the risk of falsely detecting a network failure, while
larger timeouts (e.g., 10s) avoid such false positives but can
delay failover by that amount.

Because of this, I think it's better to address the issue without
relying on such TCP timeout parameters.

Also, tcp_user_timeout is not available on platforms that don't
support TCP_USER_TIMEOUT (e.g., Windows).

Regards,

--
Fujii Masao


