Re: Exit walsender before confirming remote flush in logical replication - Mailing list pgsql-hackers

From Chao Li
Subject Re: Exit walsender before confirming remote flush in logical replication
Date
Msg-id 43F849DC-885F-4D60-94F6-78B5D645E24D@gmail.com
In response to Re: Exit walsender before confirming remote flush in logical replication  (Fujii Masao <masao.fujii@gmail.com>)
Responses Re: Exit walsender before confirming remote flush in logical replication
List pgsql-hackers

> On Apr 7, 2026, at 13:39, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Tue, Apr 7, 2026 at 12:32 AM Andres Freund <andres@anarazel.de> wrote:
>> Failed on CI just now:
>>
>> https://cirrus-ci.com/task/6745359004729344?logs=test_world#L410
>>
>> https://api.cirrus-ci.com/v1/artifact/task/6745359004729344/testrun/build/testrun/subscription/038_walsnd_shutdown_timeout/log/regress_log_038_walsnd_shutdown_timeout
>>
>> [14:58:26.146](0.066s) ok 3 - have walreceiver pid 13796
>> ### Stopping node "publisher" using mode fast
>> # Running: pg_ctl --pgdata /home/postgres/postgres/build/testrun/subscription/038_walsnd_shutdown_timeout/data/t_038_walsnd_shutdown_timeout_publisher_data/pgdata --mode fast stop
>> waiting for server to shut down........................................................................................................................... failed
>> pg_ctl: server does not shut down
>> # pg_ctl stop failed: 256
>> # Postmaster PID for node "publisher" is 3679
>> [15:00:38.178](132.032s) Bail out!  pg_ctl stop failed
>
> Thanks for reporting this!
>
> From the CI results [1], the failure in 038_walsnd_shutdown_timeout.pl appears
> to occur intermittently on FreeBSD. The failing case tests that, when both
> physical and logical replication are in use with slotsync enabled and both are
> stalled (walreceiver on the standby and the logical apply worker on
> the subscriber are blocked), shutting down the primary completes due to
> wal_sender_shutdown_timeout.
>
> On FreeBSD, however, it seems that after the shutdown request, the physical
> walsender can occasionally keep running, preventing shutdown from completing.
> As a result, pg_ctl stop times out and the test fails.
>
> I’ll investigate the cause. If it takes time to identify, I may temporarily
> disable just this test case so it doesn’t block other development and testing,
> then re-enable it once the issue is fixed.
>
> Regards,
>
> [1]
> https://cirrus-ci.com/build/5134823678803968
> https://cirrus-ci.com/build/5735329598013440
> https://cirrus-ci.com/build/5917696627310592
> https://cirrus-ci.com/build/5742460250357760
>
> --
> Fujii Masao
>
>

Some of my CF entries have also failed on this test case, so I looked into the problem. Here is a finding for your
reference.

With a8f45dee917, wal_sender_shutdown_timeout is only enforced while the walsender keeps returning to
WalSndCheckShutdownTimeout() in the main loops, but there is a path that enters WalSndDone():
```
            /*
             * When SIGUSR2 arrives, we send any outstanding logs up to the
             * shutdown checkpoint record (i.e., the latest record), wait for
             * them to be replicated to the standby, and exit. This may be a
             * normal termination at shutdown, or a promotion, the walsender
             * is not sure which.
             */
            if (got_SIGUSR2)
                WalSndDone(send_data);
```

Once in WalSndDone(), the walsender may call pq_flush() and get stuck:
```
    if (WalSndCaughtUp && sentPtr == replicatedPtr &&
        !pq_is_send_pending())
    {
        QueryCompletion qc;

        /* Inform the standby that XLOG streaming is done */
        SetQueryCompletion(&qc, CMDTAG_COPY, 0);
        EndCommand(&qc, DestRemote, false);
        pq_flush();

        proc_exit(0);
```

And once stuck, it will never get back to WalSndCheckShutdownTimeout(), so the new GUC timeout cannot rescue it.

WalSndDoneImmediate() uses pq_flush_if_writable() instead, and its comment describes exactly this possible hang:
```
        /*
         * Note that the output buffer may be full during the forced shutdown
         * of walsender. If pq_flush() is called at that time, the walsender
         * process will be stuck. Therefore, call pq_flush_if_writable()
         * instead. Successful reception of the done message with the
         * walsender forced into a shutdown is not guaranteed.
         */
        pq_flush_if_writable();
```

So maybe WalSndDone() should switch to pq_flush_if_writable() as well?
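
To illustrate the difference outside PostgreSQL, here is a minimal standalone sketch. It is not walsender code: flush_if_writable() and demo_stalled_peer() are hypothetical names, and a Unix socketpair whose peer never reads stands in for a stalled standby. It only shows that a non-blocking flush attempt returns at once when the send buffer is full, where a blocking flush would wait indefinitely.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a "flush if writable" call (NOT PostgreSQL code):
 * attempt one write on a non-blocking socket and give up instead of waiting.
 * Returns bytes written, 0 if the socket is not writable now, -1 on error.
 */
static ssize_t
flush_if_writable(int fd, const char *buf, size_t len)
{
    ssize_t     n = write(fd, buf, len);

    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;
    return n;
}

/*
 * Simulate a stalled receiver: the peer end never reads, so the sender's
 * buffer fills up. Returns the result of the final flush attempt
 * (0 means "would have blocked"), or -1 on setup failure.
 */
static int
demo_stalled_peer(void)
{
    int         sv[2];
    int         flags;
    char        chunk[4096];
    ssize_t     n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    /* Make the sending end non-blocking so writes fail instead of waiting. */
    flags = fcntl(sv[0], F_GETFL, 0);
    if (flags < 0 || fcntl(sv[0], F_SETFL, flags | O_NONBLOCK) != 0)
    {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    memset(chunk, 'x', sizeof(chunk));

    /* Fill the buffer, first in chunks and then byte by byte, until nothing fits. */
    while (write(sv[0], chunk, sizeof(chunk)) > 0)
        ;
    while (write(sv[0], chunk, 1) > 0)
        ;

    /* A blocking flush would wait here; this attempt returns immediately. */
    n = flush_if_writable(sv[0], "done", 4);

    close(sv[0]);
    close(sv[1]);
    return (int) n;
}
```

Calling demo_stalled_peer() from a small main() should return 0, meaning the final message was not delivered but the caller was not blocked either. That is the same trade-off the WalSndDoneImmediate() comment accepts: delivery of the done message is not guaranteed.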

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/






