Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18 - Mailing list pgsql-bugs

From Noah Misch
Subject Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18
Msg-id 20251027041241.ab.nmisch@google.com
In response to Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18  (Xuneng Zhou <xunengzhou@gmail.com>)
List pgsql-bugs
On Fri, Oct 24, 2025 at 02:20:39PM +0800, Xuneng Zhou wrote:
> Thanks for reporting this issue!

Yes, thank you for the detailed report.

> On Fri, Oct 24, 2025 at 2:43 AM PG Bug reporting form <noreply@postgresql.org> wrote:
> > 1. Is terminating the walreceiver process in this scenario (end of WAL on a
> > diverged timeline) the expected behavior in 14.18 or later release?
> 
> I think this is not the expected behavior. It is likely a bug
> introduced in commit 3635a0a.

Right.

> > 2. Is it expected that pg_stat_wal_receiver.status = 'streaming' may not
> > accurately reflect streaming health in this case?

Yes.  Even before the regression from commit 3635a0a, status='streaming' was
not a reliable indicator of health.  walreceiver sets that status during early
startup, before it has attempted a connection.

Perhaps a better health check for your application would be one of these:
- If you have multiple replicas with quorum, track per-replica lag.  If a
  replica isn't advancing, it has a problem regardless of walreceiver status.
- If you have just one replica, issue a commit from the primary and see if it
  hangs.
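
As a rough sketch of the per-replica lag idea, assuming the monitoring tool
samples LSN text such as the output of pg_current_wal_lsn() on the primary and
pg_last_wal_replay_lsn() on each replica.  The helper names parse_lsn() and
replica_stalled() are hypothetical, not part of PostgreSQL:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse LSN text of the form "hi/lo" (two hexadecimal numbers) into a
 * 64-bit position.  Returns false if the text is malformed.
 */
static bool
parse_lsn(const char *text, uint64_t *lsn)
{
    uint32_t    hi;
    uint32_t    lo;
    char        trailing;

    assert(text != NULL && lsn != NULL);

    /* Exactly two fields must match; a third means trailing garbage. */
    if (sscanf(text, "%" SCNx32 "/%" SCNx32 "%c", &hi, &lo, &trailing) != 2)
        return false;

    *lsn = ((uint64_t) hi << 32) | lo;
    return true;
}

/*
 * Hypothetical health rule: the replica is stalled if its replay position
 * did not advance between two samples while the primary is ahead of it.
 */
static bool
replica_stalled(uint64_t primary_lsn, uint64_t replay_prev,
                uint64_t replay_now)
{
    return replay_now <= replay_prev && primary_lsn > replay_now;
}
```

The sampling interval and any alert threshold are left to the application;
an idle primary produces no lag, so the rule only fires when the primary has
WAL that the replica has not replayed.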

(Long-term, in master only, perhaps we should introduce another status like
'connecting'.  Perhaps enact the connecting->streaming status transition just
before tendering the first byte of streamed WAL to the startup process.
Alternatively, enact that transition when the startup process accepts the
first streamed byte.  Then your application's health check would get what it
wants.)

> IMO, the expected behavior is for the walreceiver to remain in
> WALRCV_WAITING state with status 'waiting', clearly indicating that
> replication has stalled due to timeline issues.

Right.  While status='streaming' doesn't mean healthy in general, seeing a lot
of 'streaming' is still not expected for this particular scenario.

> --- a/src/backend/access/transam/xlogrecovery.c
> +++ b/src/backend/access/transam/xlogrecovery.c
> @@ -3687,8 +3687,18 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
>                       * Before we leave XLOG_FROM_STREAM state, make sure that
>                       * walreceiver is not active, so that it won't overwrite
>                       * WAL that we restore from archive.
> +                     *
> +                     * If walreceiver is actively streaming (or attempting to
> +                     * connect), we must shut it down. However, if it's already
> +                     * in WAITING state (e.g., due to timeline divergence), we
> +                     * only need to reset the install flag to allow archive
> +                     * restoration, while keeping the process alive for
> +                     * monitoring visibility.
>                       */
> -                    XLogShutdownWalRcv();
> +                    if (WalRcvStreaming())
> +                        XLogShutdownWalRcv();
> +                    else
> +                        ResetInstallXLogFileSegmentActive();

I think this will do the right thing for all reachable cases.  In
WALRCV_WAITING, the walreceiver won't install segments until the startup
process moves the walreceiver to WALRCV_RESTARTING.  There's no reason to take
the further step of terminating the walreceiver.  The discussion before commit
3635a0a wasn't convinced status='waiting' was reachable at this location, but
this thread shows it is reachable.

WalRcvStreaming() return values map to WalRcvState values as follows:

  true: STREAMING STARTING RESTARTING
  false: STOPPED WAITING STOPPING
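
The same mapping as a minimal standalone model.  This is not the PostgreSQL
source: the Model* names are made up for illustration, and the model covers
only the state-to-boolean mapping listed above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the walreceiver states named above. */
typedef enum ModelWalRcvState
{
    MODEL_STOPPED,
    MODEL_STARTING,
    MODEL_STREAMING,
    MODEL_WAITING,
    MODEL_RESTARTING,
    MODEL_STOPPING
} ModelWalRcvState;

/* Returns the value the table above lists for each state. */
static bool
model_walrcv_streaming(ModelWalRcvState state)
{
    switch (state)
    {
        case MODEL_STREAMING:
        case MODEL_STARTING:
        case MODEL_RESTARTING:
            return true;
        case MODEL_STOPPED:
        case MODEL_WAITING:
        case MODEL_STOPPING:
            return false;
    }

    /* Not reached for valid enum values. */
    assert(!"unknown walreceiver state");
    return false;
}
```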

This change would be wrong if WALRCV_STOPPING were a reachable state here.
That state is the startup process asking walreceiver to stop.  walreceiver may
then still be installing segments, so this location would want to call
XLogShutdownWalRcv() to wait for WALRCV_STOPPED.  That said, WALRCV_STOPPING
is a transient state while the startup process is in ShutdownWalRcv().  Hence,
I expect STOPPING never appears here, and there's no bug.  An assertion may be
in order.
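
A sketch of how the patched decision plus such an assertion could be modeled
in isolation.  It is not the patch itself: the Sketch* names are hypothetical,
and real code would call XLogShutdownWalRcv() or
ResetInstallXLogFileSegmentActive() rather than return an action.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the walreceiver states discussed here. */
typedef enum SketchWalRcvState
{
    SKETCH_STOPPED,
    SKETCH_STARTING,
    SKETCH_STREAMING,
    SKETCH_WAITING,
    SKETCH_RESTARTING,
    SKETCH_STOPPING
} SketchWalRcvState;

/* What the startup process does before leaving XLOG_FROM_STREAM. */
typedef enum SketchAction
{
    SKETCH_SHUTDOWN_WALRCV,     /* models XLogShutdownWalRcv() */
    SKETCH_RESET_INSTALL_FLAG   /* models ResetInstallXLogFileSegmentActive() */
} SketchAction;

static SketchAction
sketch_leave_stream_state(SketchWalRcvState state)
{
    /*
     * Assumption from the reasoning above: STOPPING is transient inside
     * ShutdownWalRcv(), so it should never be observed at this point.
     */
    assert(state != SKETCH_STOPPING);

    switch (state)
    {
        case SKETCH_STREAMING:
        case SKETCH_STARTING:
        case SKETCH_RESTARTING:
            /* walreceiver may still install segments; stop it. */
            return SKETCH_SHUTDOWN_WALRCV;
        default:
            /* STOPPED or WAITING: keep the process, clear the flag. */
            return SKETCH_RESET_INSTALL_FLAG;
    }
}
```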

Can you add a TAP test for this case?  Since it was wrong in v15+ for more
than three years and wrong in v14 for five months before this report, we
clearly had a blind spot.

postgr.es/m/YyACvP++zgDphlcm@paquier.xyz discusses a
"standby.signal+primary_conninfo" case.  How will this patch interact with
that case?

Thanks,
nm


