Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18 - Mailing list pgsql-bugs
| From | Noah Misch |
|---|---|
| Subject | Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18 |
| Date | |
| Msg-id | 20251027041241.ab.nmisch@google.com |
| In response to | Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18 (Xuneng Zhou <xunengzhou@gmail.com>) |
| List | pgsql-bugs |
On Fri, Oct 24, 2025 at 02:20:39PM +0800, Xuneng Zhou wrote:
> Thanks for reporting this issue!

Yes, thank you for the detailed report.

> On Fri, Oct 24, 2025 at 2:43 AM PG Bug reporting form <noreply@postgresql.org> wrote:
> > 1. Is terminating the walreceiver process in this scenario (end of WAL on a
> > diverged timeline) the expected behavior in 14.18 or later release?
>
> I think this is not the expected behavior. It is likely a bug
> introduced in commit 3635a0a.

Right.

> > 2. Is it expected that pg_stat_wal_receiver.status = 'streaming' may not
> > accurately reflect streaming health in this case?

Yes. Even before the regression from commit 3635a0a, status='streaming' was not a reliable indicator of health. walreceiver sets that status during early startup, before it has attempted a connection. Perhaps a better health check for your application would be one of these:

- If you have multiple replicas with quorum, track per-replica lag. If a replica isn't advancing, it has a problem regardless of walreceiver status.
- If you have just one replica, issue a commit from the primary and see if it hangs.

(Long-term, in master only, perhaps we should introduce another status like 'connecting'. Perhaps enact the connecting->streaming status transition just before tendering the first byte of streamed WAL to the startup process. Alternatively, enact that transition when the startup process accepts the first streamed byte. Then your application's health check would get what it wants.)

> IMO, the expected behavior is for the walreceiver to remain in
> WALRCV_WAITING state with status 'waiting', clearly indicating that
> replication has stalled due to timeline issues.

Right. While status='streaming' doesn't mean healthy in general, seeing a lot of 'streaming' is still not expected for this particular scenario.

> --- a/src/backend/access/transam/xlogrecovery.c
> +++ b/src/backend/access/transam/xlogrecovery.c
> @@ -3687,8 +3687,18 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
>       * Before we leave XLOG_FROM_STREAM state, make sure that
>       * walreceiver is not active, so that it won't overwrite
>       * WAL that we restore from archive.
> +     *
> +     * If walreceiver is actively streaming (or attempting to
> +     * connect), we must shut it down. However, if it's already
> +     * in WAITING state (e.g., due to timeline divergence), we
> +     * only need to reset the install flag to allow archive
> +     * restoration, while keeping the process alive for
> +     * monitoring visibility.
>       */
> -    XLogShutdownWalRcv();
> +    if (WalRcvStreaming())
> +        XLogShutdownWalRcv();
> +    else
> +        ResetInstallXLogFileSegmentActive();

I think this will do the right thing for all reachable cases. In WALRCV_WAITING, the walreceiver won't install segments until the startup process moves the walreceiver to WALRCV_RESTARTING. There's no reason to take the further step of terminating the walreceiver. The discussion before commit 3635a0a wasn't convinced status='waiting' was reachable at this location, but this thread shows it is reachable.

WalRcvStreaming() return values map to WalRcvState values as follows:

  true:  STREAMING STARTING RESTARTING
  false: STOPPED WAITING STOPPING

This change would be wrong if WALRCV_STOPPING were a reachable state here. That state is the startup process asking walreceiver to stop. walreceiver may then still be installing segments, so this location would want to call XLogShutdownWalRcv() to wait for WALRCV_STOPPED. That said, WALRCV_STOPPING is a transient state while the startup process is in ShutdownWalRcv(). Hence, I expect STOPPING never appears here, and there's no bug. An assertion may be in order.

Can you add a TAP test for this case? Since it was wrong in v15+ for >3y and wrong in v14 for 5mon before this report, clearly we had a blind spot.
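
To make the state mapping and the suggested assertion concrete, here is a rough standalone sketch. It is not the backend code: the enum is a simplified stand-in for WalRcvState, and model_walrcv_streaming(), model_leave_stream_action() and the ACTION_* names are hypothetical, invented only for illustration. The real WalRcvStreaming() reads shared memory and does more than this; the sketch models only the mapping listed above.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the backend's WalRcvState (assumption, not the real header). */
typedef enum
{
    WALRCV_STOPPED,
    WALRCV_STARTING,
    WALRCV_STREAMING,
    WALRCV_WAITING,
    WALRCV_RESTARTING,
    WALRCV_STOPPING
} WalRcvState;

/* Hypothetical labels for the two branches of the proposed patch. */
typedef enum
{
    ACTION_SHUTDOWN_WALRCV,     /* stands for XLogShutdownWalRcv() */
    ACTION_RESET_INSTALL_FLAG   /* stands for ResetInstallXLogFileSegmentActive() */
} LeaveStreamAction;

/* Models only the true/false mapping described in the message. */
static bool
model_walrcv_streaming(WalRcvState state)
{
    switch (state)
    {
        case WALRCV_STREAMING:
        case WALRCV_STARTING:
        case WALRCV_RESTARTING:
            return true;
        case WALRCV_STOPPED:
        case WALRCV_WAITING:
        case WALRCV_STOPPING:
            return false;
    }
    return false;               /* not reached for valid enum values */
}

/* Which branch the patched code would take when leaving XLOG_FROM_STREAM. */
static LeaveStreamAction
model_leave_stream_action(WalRcvState state)
{
    /*
     * STOPPING is expected to be unreachable at this location; if it were
     * reachable, skipping the shutdown would be wrong because walreceiver
     * might still be installing segments.
     */
    assert(state != WALRCV_STOPPING);

    return model_walrcv_streaming(state)
        ? ACTION_SHUTDOWN_WALRCV
        : ACTION_RESET_INSTALL_FLAG;
}
```

In this model, WAITING and STOPPED take the reset-only branch, while STREAMING, STARTING and RESTARTING still shut the walreceiver down. Whether an Assert of this shape belongs in the actual patch is a separate question.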

postgr.es/m/YyACvP++zgDphlcm@paquier.xyz discusses a "standby.signal+primary_conninfo" case. How will this patch interact with that case?

Thanks,
nm