
From Xuneng Zhou
Subject Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18
Msg-id CABPTF7XKZV2d4eWhH0ybfv5J0hHyzPxPDmebzQXsSt-8XvO1sA@mail.gmail.com
In response to BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18  (PG Bug reporting form <noreply@postgresql.org>)
List pgsql-bugs
Hi,

Thanks for reporting this issue!

On Fri, Oct 24, 2025 at 2:43 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      19093
> Logged by:          Ryan bird
> Email address:      ryanzxg@gmail.com
> PostgreSQL version: 14.18
> Operating system:   Linux (CentOS 8)
> Description:
>
> Behavioral change in walreceiver termination between PostgreSQL 14.17 and
> 14.18 affects standby health detection
>
> We observed a change in PostgreSQL 14.18 in how the walreceiver process
> behaves when a standby reaches the end of WAL on a diverged timeline.
> This change impacts our monitoring service, which relies on
> pg_stat_wal_receiver.status to detect unhealthy replication.
>
> Test Scenario
> PostgreSQL cluster with two instances:
> - Postgres_1 (Primary)
> - Postgres_2 (Standby)
> - Synchronous replication
>
> Test Steps:
> 1. Restart the device hosting Postgres_1 (this takes time).
> 2. Failover occurs: Postgres_2 is promoted to primary.
> 3. Postgres_1 (old primary) comes back online and starts as a standby.
> 4. Postgres_1 (old primary): streaming replication begins from the new
> primary Postgres_2 but fails due to timeline divergence.
>
> We observed a behavioral difference between PostgreSQL 14.17 and 14.18:
> - In 14.17, when the standby reaches the end of the WAL on the old timeline
> (before the new primary was promoted), the walreceiver process does not
> terminate.
> - In 14.18, under the same condition, the walreceiver process is terminated.
> This leads to different results when querying: SELECT status FROM
> pg_stat_wal_receiver;
> Observed Behavior:
> - PostgreSQL 14.17:
> The status is often waiting. After reaching the end of the timeline, there
> is a 5-second delay (wal_retrieve_retry_interval) before the next attempt.
> During this interval, the status remains waiting.
> Our monitoring service detects this as an unhealthy standby and triggers
> recovery using pg_rewind or pg_basebackup.
> - PostgreSQL 14.18:
> The status is often streaming. The walreceiver process is repeatedly
> terminated and restarted without the 5-second delay seen in 14.17.
> As a result, our monitoring service sees the standby as healthy (streaming),
> even though it is unable to stream WAL correctly. This causes the cluster to
> become unwritable due to synchronous replication.
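
If I understand correctly, the health check here boils down to polling
pg_stat_wal_receiver from the monitoring side. A minimal libpq sketch of
such a check is below; the connection string, program shape, and error
handling are illustrative, not taken from your setup:

    /* build with: cc wr_check.c -lpq */
    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        /* connection string is illustrative; point it at the standby */
        PGconn     *conn = PQconnectdb("host=standby_host dbname=postgres");
        PGresult   *res;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            PQfinish(conn);
            return 1;
        }

        res = PQexec(conn, "SELECT status FROM pg_stat_wal_receiver");

        /* zero rows means no walreceiver process is running at all */
        if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
            printf("walreceiver status: %s\n", PQgetvalue(res, 0, 0));
        else
            printf("walreceiver not running\n");

        PQclear(res);
        PQfinish(conn);
        return 0;
    }
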
>
> Log Comparison:
> We’ve attached excerpts from both versions showing the difference in
> walreceiver behavior when reaching the end of WAL on a diverged timeline.
> PostgreSQL 14.17:
> Oct 20 10:35:52 S2 postgres LOG:  replication terminated by primary server
> Oct 20 10:35:52 S2 postgres DETAIL:  End of WAL reached on timeline 7 at
> 1/9A7C1020.
> Oct 20 10:35:52 S2 postgres DEBUG:  sending write 1/9A7C1020 flush
> 1/9A806380 apply 1/9A806380
> Oct 20 10:35:52 S2 postgres DEBUG:  walreceiver ended streaming and awaits
> new instructions
> Oct 20 10:35:52 S2 postgres LOG:  new timeline 9 forked off current database
> system timeline 7 before current recovery point 1/9A806380
> Oct 20 10:35:57 S2 postgres DEBUG:  skipping restartpoint, already performed
> at 0/0
> Oct 20 10:35:57 S2 postgres DEBUG:  switched WAL source from stream to
> archive after failure
> Oct 20 10:35:57 S2 postgres DEBUG:  invalid record length at 1/9A806380:
> wanted 24, got 0
> Oct 20 10:35:57 S2 postgres DEBUG:  switched WAL source from archive to
> stream after failure
> Oct 20 10:35:57 S2 postgres LOG:  restarted WAL streaming at 1/9A000000 on
> timeline 7
>
> PostgreSQL 14.18:
> Oct 21 06:17:08 S2 postgres[90033]: [476-1] :[29287] LOG:  replication
> terminated by primary server
> Oct 21 06:17:08 S2 postgres[90033]: [476-2] :[29287] DETAIL:  End of WAL
> reached on timeline 5 at 0/69970A80.
> Oct 21 06:17:08 S2 postgres[90033]: [477-1] :[29287] DEBUG:  sending write
> 0/69970A80 flush 0/69970AF8 apply 0/69970AF8
> Oct 21 06:17:08 S2 postgres[90033]: [478-1] :[29287] DEBUG:  walreceiver
> ended streaming and awaits new instructions
> Oct 21 06:17:08 S2 postgres[90033]: [479-1] :[29287] FATAL:  terminating
> walreceiver process due to administrator command
> Oct 21 06:17:08 S2 postgres[90033]: [480-1] :[29287] DEBUG:  shmem_exit(1):
> 1 before_shmem_exit callbacks to make
> Oct 21 06:17:08 S2 postgres[90033]: [481-1] :[29287] DEBUG:  shmem_exit(1):
> 7 on_shmem_exit callbacks to make
> Oct 21 06:17:08 S2 postgres[85600]: [319-1] :[27280] LOG:  new timeline 7
> forked off current database system timeline 5 before current recovery point
> 0/69970AF8
> Oct 21 06:17:08 S2 postgres[85600]: [320-1] :[27280] DEBUG:  switched WAL
> source from stream to archive after failure
> Oct 21 06:17:08 S2 postgres[85600]: [321-1] :[27280] DEBUG:  invalid record
> length at 0/69970AF8: wanted 24, got 0
> Oct 21 06:17:08 S2 postgres[85600]: [322-1] :[27280] DEBUG:  switched WAL
> source from archive to stream after failure
> Oct 21 06:17:08 S2 postgres[90033]: [482-1] :[29287] DEBUG:  proc_exit(1): 2
> callbacks to make
> Oct 21 06:17:08 S2 postgres[90033]: [483-1] :[29287] DEBUG:  exit(1)
> Oct 21 06:17:08 S2 postgres[90033]: [484-1] :[29287] DEBUG:  shmem_exit(-1):
> 0 before_shmem_exit callbacks to make
> Oct 21 06:17:08 S2 postgres[90033]: [485-1] :[29287] DEBUG:  shmem_exit(-1):
> 0 on_shmem_exit callbacks to make
> Oct 21 06:17:08 S2 postgres[90033]: [486-1] :[29287] DEBUG:  proc_exit(-1):
> 0 callbacks to make
> Oct 21 06:17:08 S2 postgres[85555]: [322-1] :[27235] DEBUG:  reaping dead
> processes
> Oct 21 06:17:08 S2 postgres[90837]: [323-1] :[29637] DEBUG:
> find_in_dynamic_libpath: trying
> "/usr/lib/postgresql14/lib64/libpqwalreceiver"
> Oct 21 06:17:08 S2 postgres[90837]: [324-1] :[29637] DEBUG:
> find_in_dynamic_libpath: trying
> "/usr/lib/postgresql14/lib64/libpqwalreceiver.so"
>
> Root Cause Analysis:
> We traced the change to the following commit:
> Reset InstallXLogFileSegmentActive after walreceiver self-initiated exit.
> https://github.com/postgres/postgres/commit/3635a0a35aafd3bfa80b7a809bc6e91ccd36606a
> This commit was introduced to fix a regression reported by Noah Misch
> (commit 1f95181), which caused spurious failures in archive recovery.
> In PostgreSQL 14.17, the shutdown logic was conditional:
>
>     if (WalRcvStreaming())
>         XLogShutdownWalRcv();
>
> In PostgreSQL 14.18, the condition was removed:
>
>     XLogShutdownWalRcv();
>
> As a result, the walreceiver process is terminated even when it is in a
> waiting state. In 14.17, the condition would prevent termination in this
> scenario.
> The intention was not to terminate the walreceiver process, but rather to
> reset the InstallXLogFileSegmentActive flag, which was previously only
> cleared inside XLogShutdownWalRcv().
> As noted in the commit discussion:
> The problem is that whenever we are going for streaming we always set
> XLogCtl->InstallXLogFileSegmentActive to true, but while switching
> from streaming to archive we do not always reset it so it hits
> assertion in some cases. Basically we reset it inside
> XLogShutdownWalRcv() but while switching from the streaming mode we
> only call it conditionally if WalRcvStreaming(). But it is very much
> possible that even before we call WalRcvStreaming() the walreceiver
> might have set walrcv->walRcvState to WALRCV_STOPPED. So now
> WalRcvStreaming() will return false. So I agree now we do not want to
> really shut down the walreceiver but who will reset the flag
>
> This change inadvertently causes walreceiver to terminate in scenarios where
> it previously remained alive, altering the observable behavior of streaming
> replication.

Thanks for your analysis here.

> Questions
> 1. Is terminating the walreceiver process in this scenario (end of WAL on a
> diverged timeline) the expected behavior in 14.18 and later releases?

I think this is not the expected behavior. It is likely a bug
introduced in commit 3635a0a.

> 2. Is it expected that pg_stat_wal_receiver.status = 'streaming' may not
> accurately reflect streaming health in this case?

IMO, the expected behavior is for the walreceiver to remain in
WALRCV_WAITING state with status 'waiting', clearly indicating that
replication has stalled due to timeline issues.
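
For reference, these are the possible walreceiver states, from
src/include/replication/walreceiver.h (comments paraphrased):

    typedef enum
    {
        WALRCV_STOPPED,     /* stopped and must not start up again */
        WALRCV_STARTING,    /* launcher has requested a start */
        WALRCV_STREAMING,   /* walreceiver is streaming */
        WALRCV_WAITING,     /* stopped streaming, waiting for orders */
        WALRCV_STOPPING,    /* requested to stop, but still running */
        WALRCV_RESTARTING   /* asked to restart streaming */
    } WalRcvState;

pg_stat_wal_receiver.status reports these states in lowercase, so
'waiting' is the signal a monitoring service should see while the
startup process decides what to do next.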

> 3. Would it be safer to decouple flag-reset logic from XLogShutdownWalRcv()
> to avoid unintended process termination?

+1 for decoupling here. Please see the attached patch; I've tested it
on HEAD, and the walreceiver process is no longer terminated in this
scenario.
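
To illustrate the idea (this is a sketch, not the attached patch
itself): in the WAL source switch we can keep the shutdown conditional
but reset the flag unconditionally. The locking below assumes
XLogCtl->InstallXLogFileSegmentActive is still protected by
ControlFileLock, as on HEAD:

    /*
     * Shut down the walreceiver only if it is actively streaming, but
     * always clear InstallXLogFileSegmentActive so the assertion that
     * commit 3635a0a fixed still cannot fire.
     */
    if (WalRcvStreaming())
        XLogShutdownWalRcv();   /* also resets the flag internally */
    else
    {
        /* walreceiver stopped or merely waiting: reset only the flag */
        LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
        XLogCtl->InstallXLogFileSegmentActive = false;
        LWLockRelease(ControlFileLock);
    }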

Best,
Xuneng

