Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18 - Mailing list pgsql-bugs
| From | Xuneng Zhou |
|---|---|
| Subject | Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18 |
| Date | |
| Msg-id | CABPTF7XKZV2d4eWhH0ybfv5J0hHyzPxPDmebzQXsSt-8XvO1sA@mail.gmail.com |
| In response to | BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18 (PG Bug reporting form <noreply@postgresql.org>) |
| Responses | Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18; Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18 |
| List | pgsql-bugs |
Hi,

Thanks for reporting this issue!

On Fri, Oct 24, 2025 at 2:43 AM PG Bug reporting form <noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      19093
> Logged by:          Ryan bird
> Email address:      ryanzxg@gmail.com
> PostgreSQL version: 14.18
> Operating system:   Linux (CentOS 8)
> Description:
>
> Behavioral change in walreceiver termination between PostgreSQL 14.17 and
> 14.18 affects standby health detection
>
> We observed a behavior change in PostgreSQL 14.18 that affects how the
> walreceiver process behaves when a standby reaches the end of WAL on a
> diverged timeline. This change impacts our monitoring service, which relies
> on pg_stat_wal_receiver.status to detect unhealthy replication.
>
> Test Scenario
> PostgreSQL cluster with two instances:
> - Postgres_1 (Primary)
> - Postgres_2 (Standby)
> - Synchronous replication
>
> Test Steps:
> 1. Restart the device hosting Postgres_1 (this takes time).
> 2. Failover occurs: Postgres_2 is promoted to primary.
> 3. Postgres_1 (old primary) comes back online and starts as standby.
> 4. Postgres_1 (old primary): streaming replication begins from the new
>    primary Postgres_2, but fails due to timeline divergence.
>
> We observed a behavioral difference between PostgreSQL 14.17 and 14.18:
> - In 14.17, when the standby reaches the end of the WAL on the old timeline
>   (before the new primary was promoted), the walreceiver process does not
>   terminate.
> - In 14.18, under the same condition, the walreceiver process is terminated.
>
> This leads to different results when querying:
>
>     SELECT status FROM pg_stat_wal_receiver;
>
> Observed Behavior:
> - PostgreSQL 14.17:
>   The status is often 'waiting'. After reaching the end of the timeline,
>   there is a 5-second delay (wal_retrieve_retry_interval) before the next
>   attempt. During this interval, the status remains 'waiting'. Our
>   monitoring service detects this as an unhealthy standby and triggers
>   recovery using pg_rewind or pg_basebackup.
> - PostgreSQL 14.18:
>   The status is often 'streaming'. The walreceiver process is repeatedly
>   terminated and restarted without the 5-second delay seen in 14.17. As a
>   result, our monitoring service sees the standby as healthy ('streaming'),
>   even though it is unable to stream WAL correctly. This causes the cluster
>   to become unwritable due to synchronous replication.
>
> Log Comparison:
> We’ve attached excerpts from both versions showing the difference in
> walreceiver behavior when reaching the end of WAL on a diverged timeline.
>
> PostgreSQL 14.17:
> Oct 20 10:35:52 S2 postgres LOG: replication terminated by primary server
> Oct 20 10:35:52 S2 postgres DETAIL: End of WAL reached on timeline 7 at 1/9A7C1020.
> Oct 20 10:35:52 S2 postgres DEBUG: sending write 1/9A7C1020 flush 1/9A806380 apply 1/9A806380
> Oct 20 10:35:52 S2 postgres DEBUG: walreceiver ended streaming and awaits new instructions
> Oct 20 10:35:52 S2 postgres LOG: new timeline 9 forked off current database system timeline 7 before current recovery point 1/9A806380
> Oct 20 10:35:57 S2 postgres DEBUG: skipping restartpoint, already performed at 0/0
> Oct 20 10:35:57 S2 postgres DEBUG: switched WAL source from stream to archive after failure
> Oct 20 10:35:57 S2 postgres DEBUG: invalid record length at 1/9A806380: wanted 24, got 0
> Oct 20 10:35:57 S2 postgres DEBUG: switched WAL source from archive to stream after failure
> Oct 20 10:35:57 S2 postgres LOG: restarted WAL streaming at 1/9A000000 on timeline 7
>
> PostgreSQL 14.18:
> Oct 21 06:17:08 S2 postgres[90033]: [476-1] :[29287] LOG: replication terminated by primary server
> Oct 21 06:17:08 S2 postgres[90033]: [476-2] :[29287] DETAIL: End of WAL reached on timeline 5 at 0/69970A80.
> Oct 21 06:17:08 S2 postgres[90033]: [477-1] :[29287] DEBUG: sending write 0/69970A80 flush 0/69970AF8 apply 0/69970AF8
> Oct 21 06:17:08 S2 postgres[90033]: [478-1] :[29287] DEBUG: walreceiver ended streaming and awaits new instructions
> Oct 21 06:17:08 S2 postgres[90033]: [479-1] :[29287] FATAL: terminating walreceiver process due to administrator command
> Oct 21 06:17:08 S2 postgres[90033]: [480-1] :[29287] DEBUG: shmem_exit(1): 1 before_shmem_exit callbacks to make
> Oct 21 06:17:08 S2 postgres[90033]: [481-1] :[29287] DEBUG: shmem_exit(1): 7 on_shmem_exit callbacks to make
> Oct 21 06:17:08 S2 postgres[85600]: [319-1] :[27280] LOG: new timeline 7 forked off current database system timeline 5 before current recovery point 0/69970AF8
> Oct 21 06:17:08 S2 postgres[85600]: [320-1] :[27280] DEBUG: switched WAL source from stream to archive after failure
> Oct 21 06:17:08 S2 postgres[85600]: [321-1] :[27280] DEBUG: invalid record length at 0/69970AF8: wanted 24, got 0
> Oct 21 06:17:08 S2 postgres[85600]: [322-1] :[27280] DEBUG: switched WAL source from archive to stream after failure
> Oct 21 06:17:08 S2 postgres[90033]: [482-1] :[29287] DEBUG: proc_exit(1): 2 callbacks to make
> Oct 21 06:17:08 S2 postgres[90033]: [483-1] :[29287] DEBUG: exit(1)
> Oct 21 06:17:08 S2 postgres[90033]: [484-1] :[29287] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
> Oct 21 06:17:08 S2 postgres[90033]: [485-1] :[29287] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
> Oct 21 06:17:08 S2 postgres[90033]: [486-1] :[29287] DEBUG: proc_exit(-1): 0 callbacks to make
> Oct 21 06:17:08 S2 postgres[85555]: [322-1] :[27235] DEBUG: reaping dead processes
> Oct 21 06:17:08 S2 postgres[90837]: [323-1] :[29637] DEBUG: find_in_dynamic_libpath: trying "/usr/lib/postgresql14/lib64/libpqwalreceiver"
> Oct 21 06:17:08 S2 postgres[90837]: [324-1] :[29637] DEBUG: find_in_dynamic_libpath: trying "/usr/lib/postgresql14/lib64/libpqwalreceiver.so"
>
> Root Cause Analysis:
> We traced the change to the following commit:
> Reset InstallXLogFileSegmentActive after walreceiver self-initiated exit.
> https://github.com/postgres/postgres/commit/3635a0a35aafd3bfa80b7a809bc6e91ccd36606a
> This commit was introduced to fix a regression reported by Noah Misch
> (commit 1f95181), which caused spurious failures in archive recovery.
> In PostgreSQL 14.17, the shutdown logic was conditional:
>
>     if (WalRcvStreaming())
>         XLogShutdownWalRcv();
>
> In PostgreSQL 14.18, the condition was removed:
>
>     XLogShutdownWalRcv();
>
> As a result, the walreceiver process is terminated even when it is in a
> waiting state. In 14.17, the condition would prevent termination in this
> scenario.
>
> The intention was not to terminate the walreceiver process, but rather to
> reset the InstallXLogFileSegmentActive flag, which was previously only
> cleared inside XLogShutdownWalRcv(). As noted in the commit discussion:
>
>   The problem is that whenever we are going for streaming we always set
>   XLogCtl->InstallXLogFileSegmentActive to true, but while switching
>   from streaming to archive we do not always reset it so it hits
>   assertion in some cases. Basically we reset it inside
>   XLogShutdownWalRcv() but while switching from the streaming mode we
>   only call it conditionally if WalRcvStreaming(). But it is very much
>   possible that even before we call WalRcvStreaming() the walreceiver
>   might have set walrcv->walRcvState to WALRCV_STOPPED. So now
>   WalRcvStreaming() will return false. So I agree now we do not want to
>   really shut down the walreceiver but who will reset the flag?
>
> This change inadvertently causes walreceiver to terminate in scenarios where
> it previously remained alive, altering the observable behavior of streaming
> replication.

Thanks for your analysis here.

> Questions
>
> 1. Is terminating the walreceiver process in this scenario (end of WAL on a
> diverged timeline) the expected behavior in 14.18 or later releases?

I think this is not the expected behavior. It is likely a bug introduced in
commit 3635a0a.

> 2. Is it expected that pg_stat_wal_receiver.status = 'streaming' may not
> accurately reflect streaming health in this case?

IMO, the expected behavior is for the walreceiver to remain in the
WALRCV_WAITING state with status 'waiting', clearly indicating that
replication has stalled due to timeline issues.

> 3. Would it be safer to decouple the flag-reset logic from
> XLogShutdownWalRcv() to avoid unintended process termination?

+1 for decoupling here. Please see the attached patch; I've tested it on
HEAD, and the walreceiver process is no longer terminated.

Best,
Xuneng
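P.S. To make the decoupling idea concrete for the archives, here is a minimal
sketch of what the call site that currently does the unconditional shutdown
(presumably in WaitForWALToBecomeAvailable() in xlog.c) could look like. This
is not the attached patch; the lock name and exact placement are assumptions
based on how XLogShutdownWalRcv() clears the flag on HEAD, and the REL_14
branch may differ:

    /*
     * Reset InstallXLogFileSegmentActive when switching from streaming to
     * the archive/pg_wal source, without forcing the walreceiver to exit.
     * (Assumption: the flag lives in XLogCtl and is protected by
     * ControlFileLock, as in XLogShutdownWalRcv() on HEAD.)
     */
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    XLogCtl->InstallXLogFileSegmentActive = false;
    LWLockRelease(ControlFileLock);

    /*
     * Shut the walreceiver down only if it is actually streaming, as 14.17
     * did.  A walreceiver that has already reached WALRCV_WAITING is left
     * alone, so pg_stat_wal_receiver.status stays 'waiting' during the
     * wal_retrieve_retry_interval pause instead of flapping back to
     * 'streaming' through repeated restarts.
     */
    if (WalRcvStreaming())
        XLogShutdownWalRcv();

That would keep the fix for the archive-recovery assertion failure (the flag
is always cleared) while restoring the pre-14.18 behavior that monitoring
based on pg_stat_wal_receiver.status relies on.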