BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18 - Mailing list pgsql-bugs
| From | PG Bug reporting form |
|---|---|
| Subject | BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18 |
| Date | |
| Msg-id | 19093-c4fff49a608f82a0@postgresql.org |
| Responses | Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18 |
| List | pgsql-bugs |
The following bug has been logged on the website:

Bug reference:      19093
Logged by:          Ryan bird
Email address:      ryanzxg@gmail.com
PostgreSQL version: 14.18
Operating system:   Linux (CentOS 8)

Description:

Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18 affects standby health detection.

We observed a behavior change in PostgreSQL 14.18 in how the walreceiver process behaves when a standby reaches the end of WAL on a diverged timeline. This change impacts our monitoring service, which relies on pg_stat_wal_receiver.status to detect unhealthy replication.

Test Scenario:

PostgreSQL cluster with two instances:
- Postgres_1 (Primary)
- Postgres_2 (Standby)
- Synchronous replication

Test Steps:
1. Restart the device hosting Postgres_1 (this takes time).
2. Failover occurs: Postgres_2 is promoted to primary.
3. Postgres_1 (old primary) comes back online and starts as a standby.
4. Postgres_1 (old primary): streaming replication begins from the new primary Postgres_2, but fails due to timeline divergence.

We observed a behavioral difference between PostgreSQL 14.17 and 14.18:
- In 14.17, when the standby reaches the end of the WAL on the old timeline (before the new primary was promoted), the walreceiver process does not terminate.
- In 14.18, under the same condition, the walreceiver process is terminated.

This leads to different results when querying:

```sql
SELECT status FROM pg_stat_wal_receiver;
```

Observed Behavior:

- PostgreSQL 14.17: The status is often 'waiting'. After reaching the end of the timeline, there is a 5-second delay (wal_retrieve_retry_interval) before the next attempt. During this interval, the status remains 'waiting'. Our monitoring service detects this as an unhealthy standby and triggers recovery using pg_rewind or pg_basebackup.
- PostgreSQL 14.18: The status is often 'streaming'. The walreceiver process is repeatedly terminated and restarted without the 5-second delay seen in 14.17. As a result, our monitoring service sees the standby as healthy ('streaming'), even though it is unable to stream WAL correctly. This causes the cluster to become unwritable due to synchronous replication. (A query sketch illustrating this follows the list.)
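For discussion, here is a sketch of the kind of richer snapshot a health check could take from the same view; the column list is taken from pg_stat_wal_receiver as documented for PostgreSQL 14, and it is illustrative rather than the exact query our monitoring runs:

```sql
-- Illustrative snapshot of walreceiver state beyond the status column alone.
-- In the diverged-timeline case described above, status can read 'streaming'
-- on 14.18 even though flushed_lsn stops advancing past the divergence point.
SELECT pid,             -- changes every time the walreceiver is restarted
       status,          -- 'streaming', 'waiting', ...
       received_tli,    -- timeline of the last WAL received and flushed
       flushed_lsn,     -- last WAL location received and flushed to disk
       latest_end_lsn,  -- last location reported back to the sending server
       latest_end_time  -- time of that last report
FROM pg_stat_wal_receiver;
```

Even with these extra columns, a single snapshot cannot tell "streaming but stuck at a diverged timeline" apart from healthy streaming, which is part of what we would like guidance on.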
Log Comparison:

We've attached excerpts from both versions showing the difference in walreceiver behavior when reaching the end of WAL on a diverged timeline.

PostgreSQL 14.17:

```
Oct 20 10:35:52 S2 postgres LOG: replication terminated by primary server
Oct 20 10:35:52 S2 postgres DETAIL: End of WAL reached on timeline 7 at 1/9A7C1020.
Oct 20 10:35:52 S2 postgres DEBUG: sending write 1/9A7C1020 flush 1/9A806380 apply 1/9A806380
Oct 20 10:35:52 S2 postgres DEBUG: walreceiver ended streaming and awaits new instructions
Oct 20 10:35:52 S2 postgres LOG: new timeline 9 forked off current database system timeline 7 before current recovery point 1/9A806380
Oct 20 10:35:57 S2 postgres DEBUG: skipping restartpoint, already performed at 0/0
Oct 20 10:35:57 S2 postgres DEBUG: switched WAL source from stream to archive after failure
Oct 20 10:35:57 S2 postgres DEBUG: invalid record length at 1/9A806380: wanted 24, got 0
Oct 20 10:35:57 S2 postgres DEBUG: switched WAL source from archive to stream after failure
Oct 20 10:35:57 S2 postgres LOG: restarted WAL streaming at 1/9A000000 on timeline 7
```

PostgreSQL 14.18:

```
Oct 21 06:17:08 S2 postgres[90033]: [476-1] :[29287] LOG: replication terminated by primary server
Oct 21 06:17:08 S2 postgres[90033]: [476-2] :[29287] DETAIL: End of WAL reached on timeline 5 at 0/69970A80.
Oct 21 06:17:08 S2 postgres[90033]: [477-1] :[29287] DEBUG: sending write 0/69970A80 flush 0/69970AF8 apply 0/69970AF8
Oct 21 06:17:08 S2 postgres[90033]: [478-1] :[29287] DEBUG: walreceiver ended streaming and awaits new instructions
Oct 21 06:17:08 S2 postgres[90033]: [479-1] :[29287] FATAL: terminating walreceiver process due to administrator command
Oct 21 06:17:08 S2 postgres[90033]: [480-1] :[29287] DEBUG: shmem_exit(1): 1 before_shmem_exit callbacks to make
Oct 21 06:17:08 S2 postgres[90033]: [481-1] :[29287] DEBUG: shmem_exit(1): 7 on_shmem_exit callbacks to make
Oct 21 06:17:08 S2 postgres[85600]: [319-1] :[27280] LOG: new timeline 7 forked off current database system timeline 5 before current recovery point 0/69970AF8
Oct 21 06:17:08 S2 postgres[85600]: [320-1] :[27280] DEBUG: switched WAL source from stream to archive after failure
Oct 21 06:17:08 S2 postgres[85600]: [321-1] :[27280] DEBUG: invalid record length at 0/69970AF8: wanted 24, got 0
Oct 21 06:17:08 S2 postgres[85600]: [322-1] :[27280] DEBUG: switched WAL source from archive to stream after failure
Oct 21 06:17:08 S2 postgres[90033]: [482-1] :[29287] DEBUG: proc_exit(1): 2 callbacks to make
Oct 21 06:17:08 S2 postgres[90033]: [483-1] :[29287] DEBUG: exit(1)
Oct 21 06:17:08 S2 postgres[90033]: [484-1] :[29287] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
Oct 21 06:17:08 S2 postgres[90033]: [485-1] :[29287] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
Oct 21 06:17:08 S2 postgres[90033]: [486-1] :[29287] DEBUG: proc_exit(-1): 0 callbacks to make
Oct 21 06:17:08 S2 postgres[85555]: [322-1] :[27235] DEBUG: reaping dead processes
Oct 21 06:17:08 S2 postgres[90837]: [323-1] :[29637] DEBUG: find_in_dynamic_libpath: trying "/usr/lib/postgresql14/lib64/libpqwalreceiver"
Oct 21 06:17:08 S2 postgres[90837]: [324-1] :[29637] DEBUG: find_in_dynamic_libpath: trying "/usr/lib/postgresql14/lib64/libpqwalreceiver.so"
```

Root Cause Analysis:

We traced the change to the following commit:

Reset InstallXLogFileSegmentActive after walreceiver self-initiated exit.
https://github.com/postgres/postgres/commit/3635a0a35aafd3bfa80b7a809bc6e91ccd36606a

This commit was introduced to fix a regression reported by Noah Misch (commit 1f95181), which caused spurious failures in archive recovery.

In PostgreSQL 14.17, the shutdown logic was conditional:

```c
if (WalRcvStreaming())
    XLogShutdownWalRcv();
```

In PostgreSQL 14.18, the condition was removed:

```c
XLogShutdownWalRcv();
```

As a result, the walreceiver process is terminated even when it is in a waiting state; in 14.17, the condition would prevent termination in this scenario. The intention was not to terminate the walreceiver process, but rather to reset the InstallXLogFileSegmentActive flag, which was previously only cleared inside XLogShutdownWalRcv(). As noted in the commit discussion:

> The problem is that whenever we are going for streaming we always set
> XLogCtl->InstallXLogFileSegmentActive to true, but while switching from
> streaming to archive we do not always reset it so it hits assertion in
> some cases. Basically we reset it inside XLogShutdownWalRcv() but while
> switching from the streaming mode we only call it conditionally if
> WalRcvStreaming(). But it is very much possible that even before we call
> WalRcvStreaming() the walreceiver might have set walrcv->walRcvState to
> WALRCV_STOPPED. So now WalRcvStreaming() will return false. So I agree
> now we do not want to really shut down the walreceiver but who will
> reset the flag?
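As an illustration of the effect (nothing in the analysis above depends on it), the restart churn can also be watched from SQL, since pg_stat_wal_receiver.pid changes on every walreceiver restart; for example, in a psql session on the standby:

```sql
-- Illustrative only: repeat the snapshot once per second from psql.
-- On 14.17 the pid stays stable and status shows 'waiting' during the
-- 5-second wal_retrieve_retry_interval; on 14.18 the pid keeps changing
-- while status mostly reads 'streaming'.
SELECT pid, status, received_tli, flushed_lsn
FROM pg_stat_wal_receiver
\watch 1
```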
This change inadvertently causes the walreceiver to terminate in scenarios where it previously remained alive, altering the observable behavior of streaming replication.

Questions:

1. Is terminating the walreceiver process in this scenario (end of WAL on a diverged timeline) the expected behavior in 14.18 and later releases?
2. Is it expected that pg_stat_wal_receiver.status = 'streaming' may not accurately reflect streaming health in this case?
3. Would it be safer to decouple the flag-reset logic from XLogShutdownWalRcv() to avoid unintended process termination?

This change impacts monitoring logic and cluster availability under synchronous replication. We'd appreciate clarification and guidance on best practices for detecting unhealthy standby nodes in this context; for reference, a sketch of one possible progress-based check follows below.
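The sketch below samples the last received LSN twice, a few seconds apart, and only reports the standby as making progress if the LSN advanced; the 5-second pause and the reliance on pg_last_wal_receive_lsn() are illustrative assumptions, not a tested recommendation.

```sql
-- Illustrative progress check: trust forward movement of the receive LSN
-- rather than pg_stat_wal_receiver.status alone. A false or NULL result
-- would be treated as unhealthy. Run on the standby.
WITH first_sample AS MATERIALIZED (
    SELECT pg_last_wal_receive_lsn() AS lsn_before
),
wait AS MATERIALIZED (
    SELECT pg_sleep(5) FROM first_sample   -- arbitrary 5-second pause
)
SELECT f.lsn_before,
       pg_last_wal_receive_lsn()                AS lsn_after,
       pg_last_wal_receive_lsn() > f.lsn_before AS receive_lsn_advanced
FROM first_sample AS f, wait;
```

In the scenario described above, we would expect the receive LSN to plateau at the end of the old timeline, so a check along these lines should flag the standby as unhealthy even while status reads 'streaming'.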