Re: Exit walsender before confirming remote flush in logical replication - Mailing list pgsql-hackers
| From | Chao Li |
|---|---|
| Subject | Re: Exit walsender before confirming remote flush in logical replication |
| Date | |
| Msg-id | 43F849DC-885F-4D60-94F6-78B5D645E24D@gmail.com |
| In response to | Re: Exit walsender before confirming remote flush in logical replication (Fujii Masao <masao.fujii@gmail.com>) |
| Responses | Re: Exit walsender before confirming remote flush in logical replication |
| List | pgsql-hackers |
> On Apr 7, 2026, at 13:39, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Tue, Apr 7, 2026 at 12:32 AM Andres Freund <andres@anarazel.de> wrote:
>> Failed on CI just now:
>>
>> https://cirrus-ci.com/task/6745359004729344?logs=test_world#L410
>> https://api.cirrus-ci.com/v1/artifact/task/6745359004729344/testrun/build/testrun/subscription/038_walsnd_shutdown_timeout/log/regress_log_038_walsnd_shutdown_timeout
>>
>> [14:58:26.146](0.066s) ok 3 - have walreceiver pid 13796
>> ### Stopping node "publisher" using mode fast
>> # Running: pg_ctl --pgdata /home/postgres/postgres/build/testrun/subscription/038_walsnd_shutdown_timeout/data/t_038_walsnd_shutdown_timeout_publisher_data/pgdata --modefast stop
>> waiting for server to shut down........................................................................................................................... failed
>> pg_ctl: server does not shut down
>> # pg_ctl stop failed: 256
>> # Postmaster PID for node "publisher" is 3679
>> [15:00:38.178](132.032s) Bail out! pg_ctl stop failed
>
> Thanks for reporting this!
>
> From the CI results [1], the failure in 038_walsnd_shutdown_timeout.pl appears
> to occur intermittently on FreeBSD. The failing case tests that, when both
> physical and logical replication are in use with slotsync enabled and both are
> stalled (walreceiver on the standby and the logical apply worker on
> the subscriber are blocked), shutting down the primary completes due to
> wal_sender_shutdown_timeout.
>
> On FreeBSD, however, it seems that after the shutdown request, the physical
> walsender can occasionally keep running, preventing shutdown from completing.
> As a result, pg_ctl stop times out and the test fails.
>
> I’ll investigate the cause. If it takes time to identify, I may temporarily
> disable just this test case so it doesn’t block other development and testing,
> then re-enable it once the issue is fixed.
>
> Regards,
>
> [1]
> https://cirrus-ci.com/build/5134823678803968
> https://cirrus-ci.com/build/5735329598013440
> https://cirrus-ci.com/build/5917696627310592
> https://cirrus-ci.com/build/5742460250357760
>
> --
> Fujii Masao
>
>

Some of my CF entries have failed on this test case as well, so I looked into the problem. Here is a finding for your reference.

With a8f45dee917, wal_sender_shutdown_timeout is only enforced while the walsender keeps returning to WalSndCheckShutdownTimeout() in the main loops, but there is a path that enters WalSndDone():

```
	/*
	 * When SIGUSR2 arrives, we send any outstanding logs up to the
	 * shutdown checkpoint record (i.e., the latest record), wait for
	 * them to be replicated to the standby, and exit. This may be a
	 * normal termination at shutdown, or a promotion, the walsender
	 * is not sure which.
	 */
	if (got_SIGUSR2)
		WalSndDone(send_data);
```

Once inside WalSndDone(), the walsender may call pq_flush(), which can block and leave it stuck:

```
	if (WalSndCaughtUp && sentPtr == replicatedPtr &&
		!pq_is_send_pending())
	{
		QueryCompletion qc;

		/* Inform the standby that XLOG streaming is done */
		SetQueryCompletion(&qc, CMDTAG_COPY, 0);
		EndCommand(&qc, DestRemote, false);
		pq_flush();

		proc_exit(0);
```

Once it is stuck there, it never gets back to WalSndCheckShutdownTimeout(), so the new GUC timeout cannot rescue it.

WalSndDoneImmediate() uses pq_flush_if_writable() instead, and its comment describes exactly this risk of getting stuck:

```
	/*
	 * Note that the output buffer may be full during the forced shutdown
	 * of walsender. If pq_flush() is called at that time, the walsender
	 * process will be stuck. Therefore, call pq_flush_if_writable()
	 * instead. Successful reception of the done message with the
	 * walsender forced into a shutdown is not guaranteed.
	 */
	pq_flush_if_writable();
```

So, maybe WalSndDone() should switch to pq_flush_if_writable() as well?

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
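
To illustrate the difference between a blocking flush and a "flush if writable" call, here is a minimal standalone sketch using plain POSIX sockets. It is not PostgreSQL code: OutBuf, flush_if_writable() and make_stalled_socket() are hypothetical names invented for this example, and the real pq_flush_if_writable() works on the backend's own send buffer. The sketch only shows the general idea that a non-blocking send returns control to the caller when the peer has stopped reading, so a timeout check in the caller's loop can still run.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical pending-output buffer; not PostgreSQL's actual structure. */
typedef struct OutBuf
{
	const char *data;
	size_t		len;
	size_t		sent;
} OutBuf;

/*
 * Send as much pending data as possible without ever blocking.
 * Returns 0 if everything was sent, 1 if data is still pending because the
 * socket would block, and -1 on a hard error.
 */
static int
flush_if_writable(int fd, OutBuf *buf)
{
	assert(buf->sent <= buf->len);

	while (buf->sent < buf->len)
	{
		ssize_t		n = send(fd, buf->data + buf->sent,
							 buf->len - buf->sent, MSG_DONTWAIT);

		if (n > 0)
			buf->sent += (size_t) n;
		else if (n < 0 && errno == EINTR)
			continue;			/* interrupted, just retry */
		else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
			return 1;			/* peer is not draining; give up for now */
		else
			return -1;
	}
	return 0;
}

/*
 * Create a socket pair whose peer never reads, and fill the kernel buffer
 * of fds[0] so that any further send on it would block.
 * Returns 0 on success, -1 on error.
 */
static int
make_stalled_socket(int fds[2])
{
	static const char junk[4096];
	size_t		chunks[] = {sizeof(junk), 1};

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
		return -1;

	/* Fill with large chunks first, then single bytes, until full. */
	for (int i = 0; i < 2; i++)
	{
		for (;;)
		{
			ssize_t		n = send(fds[0], junk, chunks[i], MSG_DONTWAIT);

			if (n >= 0 || errno == EINTR)
				continue;
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				break;
			return -1;
		}
	}
	return 0;
}
```

On a socket prepared by make_stalled_socket(), flush_if_writable() returns 1 right away and leaves the message unsent, whereas a blocking send() on the same socket would wait until the peer reads. That matches the trade-off stated in the WalSndDoneImmediate() comment: the caller does not get stuck, but delivery of the final message is not guaranteed.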