BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18

From: PG Bug reporting form
Subject: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18
Date:
Msg-id: 19093-c4fff49a608f82a0@postgresql.org
Responses: Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18
List: pgsql-bugs
The following bug has been logged on the website:

Bug reference:      19093
Logged by:          Ryan bird
Email address:      ryanzxg@gmail.com
PostgreSQL version: 14.18
Operating system:   Linux (CentOS 8)
Description:

Behavioral change in walreceiver termination between PostgreSQL 14.17 and
14.18 affects standby health detection

We observed a behavioral change in PostgreSQL 14.18 in how the walreceiver
process behaves when a standby reaches the end of WAL on a diverged timeline.
This change affects our monitoring service, which relies on
pg_stat_wal_receiver.status to detect unhealthy replication.
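For reference, here is a simplified sketch of the health check our monitoring
service performs (the real check lives in our internal tooling; the query below
is only an approximation of it):

-- The standby is considered healthy only if a walreceiver row exists and
-- reports status = 'streaming'; any other result triggers recovery handling.
SELECT COALESCE(
           (SELECT status = 'streaming' FROM pg_stat_wal_receiver),
           false) AS standby_healthy;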

Test Scenario:
PostgreSQL cluster with two instances:
- Postgres_1 (Primary)
- Postgres_2 (Standby)
- Synchronous replication

Test Steps:
1. Restart the machine hosting Postgres_1 (this takes time).
2. Failover occurs: Postgres_2 is promoted to primary.
3. Postgres_1 (the old primary) comes back online and starts as a standby.
4. On Postgres_1 (the old primary), streaming replication begins from the new
primary Postgres_2 but fails due to timeline divergence.

We observed a behavioral difference between PostgreSQL 14.17 and 14.18:
- In 14.17, when the standby reaches the end of the WAL on the old timeline
(before the new primary was promoted), the walreceiver process does not
terminate.
- In 14.18, under the same condition, the walreceiver process is terminated.
This leads to different results when querying: SELECT status FROM
pg_stat_wal_receiver;
Observed Behavior:
- PostgreSQL 14.17:
The status is often 'waiting'. After reaching the end of the timeline, there
is a 5-second delay (wal_retrieve_retry_interval) before the next attempt, and
during this interval the status remains 'waiting'.
Our monitoring service detects this as an unhealthy standby and triggers
recovery using pg_rewind or pg_basebackup.
- PostgreSQL 14.18:
The status is often 'streaming'. The walreceiver process is repeatedly
terminated and restarted, without the 5-second delay seen in 14.17.
As a result, our monitoring service sees the standby as healthy (streaming),
even though it is unable to stream WAL correctly. This causes the cluster to
become unwritable due to synchronous replication.
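Concretely, the same query gives opposite health verdicts on the two versions
in this scenario (the status values below are the ones we observed):

-- 14.17: returns 'waiting' during the retry interval
--        -> flagged unhealthy, recovery is triggered
-- 14.18: returns 'streaming' most of the time
--        -> treated as healthy, writes stall under synchronous replication
SELECT status FROM pg_stat_wal_receiver;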

Log Comparison:
Below are excerpts from both versions showing the difference in walreceiver
behavior when reaching the end of WAL on a diverged timeline.
PostgreSQL 14.17:
Oct 20 10:35:52 S2 postgres LOG:  replication terminated by primary server
Oct 20 10:35:52 S2 postgres DETAIL:  End of WAL reached on timeline 7 at
1/9A7C1020.
Oct 20 10:35:52 S2 postgres DEBUG:  sending write 1/9A7C1020 flush
1/9A806380 apply 1/9A806380
Oct 20 10:35:52 S2 postgres DEBUG:  walreceiver ended streaming and awaits
new instructions
Oct 20 10:35:52 S2 postgres LOG:  new timeline 9 forked off current database
system timeline 7 before current recovery point 1/9A806380
Oct 20 10:35:57 S2 postgres DEBUG:  skipping restartpoint, already performed
at 0/0
Oct 20 10:35:57 S2 postgres DEBUG:  switched WAL source from stream to
archive after failure
Oct 20 10:35:57 S2 postgres DEBUG:  invalid record length at 1/9A806380:
wanted 24, got 0
Oct 20 10:35:57 S2 postgres DEBUG:  switched WAL source from archive to
stream after failure
Oct 20 10:35:57 S2 postgres LOG:  restarted WAL streaming at 1/9A000000 on
timeline 7

PostgreSQL 14.18:
Oct 21 06:17:08 S2 postgres[90033]: [476-1] :[29287] LOG:  replication
terminated by primary server
Oct 21 06:17:08 S2 postgres[90033]: [476-2] :[29287] DETAIL:  End of WAL
reached on timeline 5 at 0/69970A80.
Oct 21 06:17:08 S2 postgres[90033]: [477-1] :[29287] DEBUG:  sending write
0/69970A80 flush 0/69970AF8 apply 0/69970AF8
Oct 21 06:17:08 S2 postgres[90033]: [478-1] :[29287] DEBUG:  walreceiver
ended streaming and awaits new instructions
Oct 21 06:17:08 S2 postgres[90033]: [479-1] :[29287] FATAL:  terminating
walreceiver process due to administrator command
Oct 21 06:17:08 S2 postgres[90033]: [480-1] :[29287] DEBUG:  shmem_exit(1):
1 before_shmem_exit callbacks to make
Oct 21 06:17:08 S2 postgres[90033]: [481-1] :[29287] DEBUG:  shmem_exit(1):
7 on_shmem_exit callbacks to make
Oct 21 06:17:08 S2 postgres[85600]: [319-1] :[27280] LOG:  new timeline 7
forked off current database system timeline 5 before current recovery point
0/69970AF8
Oct 21 06:17:08 S2 postgres[85600]: [320-1] :[27280] DEBUG:  switched WAL
source from stream to archive after failure
Oct 21 06:17:08 S2 postgres[85600]: [321-1] :[27280] DEBUG:  invalid record
length at 0/69970AF8: wanted 24, got 0
Oct 21 06:17:08 S2 postgres[85600]: [322-1] :[27280] DEBUG:  switched WAL
source from archive to stream after failure
Oct 21 06:17:08 S2 postgres[90033]: [482-1] :[29287] DEBUG:  proc_exit(1): 2
callbacks to make
Oct 21 06:17:08 S2 postgres[90033]: [483-1] :[29287] DEBUG:  exit(1)
Oct 21 06:17:08 S2 postgres[90033]: [484-1] :[29287] DEBUG:  shmem_exit(-1):
0 before_shmem_exit callbacks to make
Oct 21 06:17:08 S2 postgres[90033]: [485-1] :[29287] DEBUG:  shmem_exit(-1):
0 on_shmem_exit callbacks to make
Oct 21 06:17:08 S2 postgres[90033]: [486-1] :[29287] DEBUG:  proc_exit(-1):
0 callbacks to make
Oct 21 06:17:08 S2 postgres[85555]: [322-1] :[27235] DEBUG:  reaping dead
processes
Oct 21 06:17:08 S2 postgres[90837]: [323-1] :[29637] DEBUG:
find_in_dynamic_libpath: trying
"/usr/lib/postgresql14/lib64/libpqwalreceiver"
Oct 21 06:17:08 S2 postgres[90837]: [324-1] :[29637] DEBUG:
find_in_dynamic_libpath: trying
"/usr/lib/postgresql14/lib64/libpqwalreceiver.so"

Root Cause Analysis:
We traced the change to the following commit:
Reset InstallXLogFileSegmentActive after walreceiver self-initiated exit.
https://github.com/postgres/postgres/commit/3635a0a35aafd3bfa80b7a809bc6e91ccd36606a
This commit was introduced to fix a regression reported by Noah Misch
(commit 1f95181), which caused spurious failures in archive recovery.
In PostgreSQL 14.17, the shutdown logic was conditional:
if (WalRcvStreaming())
    XLogShutdownWalRcv();
In PostgreSQL 14.18, the condition was removed:
XLogShutdownWalRcv();
As a result, the walreceiver process is terminated even when it is in a
waiting state; in 14.17, the WalRcvStreaming() condition prevented termination
in this scenario.
The intention was not to terminate the walreceiver process, but rather to
reset the InstallXLogFileSegmentActive flag, which was previously only
cleared inside XLogShutdownWalRcv().
As noted in the commit discussion:
The problem is that whenever we are going for streaming we always set
XLogCtl->InstallXLogFileSegmentActive to true, but while switching
from streaming to archive we do not always reset it so it hits
assertion in some cases. Basically we reset it inside
XLogShutdownWalRcv() but while switching from the streaming mode we
only call it conditionally if WalRcvStreaming(). But it is very much
possible that even before we call WalRcvStreaming() the walreceiver
might have set walrcv->walRcvState to WALRCV_STOPPED. So now
WalRcvStreaming() will return false. So I agree now we do not want to
really shut down the walreceiver but who will reset the flag

This change inadvertently causes walreceiver to terminate in scenarios where
it previously remained alive, altering the observable behavior of streaming
replication.

Questions
1. Is terminating the walreceiver process in this scenario (end of WAL on a
diverged timeline) the expected behavior in 14.18 and later releases?
2. Is it expected that pg_stat_wal_receiver.status = 'streaming' may not
accurately reflect streaming health in this case?
3. Would it be safer to decouple flag-reset logic from XLogShutdownWalRcv()
to avoid unintended process termination?

This change impacts monitoring logic and cluster availability under
synchronous replication. We’d appreciate clarification and guidance on best
practices for detecting unhealthy standby nodes in this context.
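
For context, the workaround we are currently evaluating is to stop relying on
the status column alone and to sample replication progress twice instead. The
column choice and the 10-second interval below are our own assumptions, not
something the documentation prescribes:

-- Sample 1, taken by the monitoring agent and stored:
SELECT status, flushed_lsn, received_tli FROM pg_stat_wal_receiver;

-- Sample 2, taken roughly 10 seconds later:
SELECT status, flushed_lsn, received_tli FROM pg_stat_wal_receiver;

-- If status is 'streaming' in both samples but flushed_lsn has not advanced,
-- we would treat the standby as unhealthy despite the 'streaming' status.

Feedback on whether such an LSN-progress check is a reliable way to detect this
condition would be very welcome.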

