Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18 - Mailing list pgsql-bugs

From: Xuneng Zhou
Subject: Re: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18
Date:
Msg-id: CABPTF7V1=f1n+uS82c11haGJsX65=_Sm-Gs9fKyGZRBSKQkf-w@mail.gmail.com
In response to: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18 (PG Bug reporting form <noreply@postgresql.org>)
List: pgsql-bugs

Hi xingguo,

On Fri, Oct 24, 2025 at 4:33 PM xingguo <ryanzxg@gmail.com> wrote:
>
> Hi Xuneng,
>
> Thank you very much for your reply.
> I’d like to ask about the fix plan for this issue. Will there be any releases after 14.19, and if so, is this issue
> expected to be fixed in the 14 branch? Also, will it be addressed in version 16?

It seems we may need to investigate this further. The patch I attached
earlier appears similar to the first version of Dilip’s patch in [1],
but the community ultimately favored the current unconditional shutdown
approach.
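
For context, here is the shape of the change being discussed, paraphrased
from the report quoted below rather than copied verbatim from the source
tree:

    /* 14.17: shut down the walreceiver only if it is still streaming */
    if (WalRcvStreaming())
        XLogShutdownWalRcv();

    /*
     * 14.18 (commit 3635a0a): unconditional shutdown, so a walreceiver
     * that is merely waiting for new instructions gets terminated as well.
     */
    XLogShutdownWalRcv();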

If this turns out to be a genuine bug and the proposed fix is correct,
it may need to be back-patched down to the 14 branch, where 14.18
introduced the change. In that case, since Heikki split xlog.c into
xlog.c and xlogrecovery.c in commit 70e8186, we would likely need to
prepare at least two versions of the patch.
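
On the decoupling idea itself (question 3 in the report quoted below),
a minimal sketch of what it could look like; note that
ResetInstallXLogFileSegmentActive() is a hypothetical helper name used
here for illustration, not the function from the attached patch:

    /*
     * Sketch only: keep the conditional shutdown, but always clear the
     * InstallXLogFileSegmentActive flag, since today it is only reset
     * inside XLogShutdownWalRcv().
     */
    if (WalRcvStreaming())
        XLogShutdownWalRcv();                  /* stops walreceiver and clears the flag */
    else
        ResetInstallXLogFileSegmentActive();   /* hypothetical: clear the flag only */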

[1]
https://www.postgresql.org/message-id/flat/CAFiTN-sE3ry%3DycMPVtC%2BDjw4Fd7gbUGVv_qqw6qfzp%3DJLvqT3g%40mail.gmail.com

> Best regards,
> Xingguo
> On Fri, Oct 24, 2025 at 14:20, Xuneng Zhou <xunengzhou@gmail.com> wrote:
>>
>> Hi,
>>
>> Thanks for reporting this issue!
>>
>> On Fri, Oct 24, 2025 at 2:43 AM PG Bug reporting form
>> <noreply@postgresql.org> wrote:
>> >
>> > The following bug has been logged on the website:
>> >
>> > Bug reference:      19093
>> > Logged by:          Ryan bird
>> > Email address:      ryanzxg@gmail.com
>> > PostgreSQL version: 14.18
>> > Operating system:   Linux (CentOS 8)
>> > Description:
>> >
>> > Behavioral change in walreceiver termination between PostgreSQL 14.17 and
>> > 14.18 affects standby health detection
>> >
>> > We observed a behavior change in PostgreSQL 14.18 that affects how the
>> > walreceiver process behaves when a standby reaches the end of WAL on a
>> > diverged timeline.
>> > This change impacts our monitoring service, which relies on
>> > pg_stat_wal_receiver.status to detect unhealthy replication.
>> >
>> > Test Scenario
>> > PostgreSQL cluster with two instances:
>> > - Postgres_1 (Primary)
>> > - Postgres_2 (Standby)
>> > - Synchronous replication
>> >
>> > Test Steps:
>> > 1. Restart the device hosting Postgres_1 (this takes time).
>> > 2. Failover occurs: Postgres_2 is promoted to primary.
>> > 3. Postgres_1 (Old primary) comes back online and starts as standby.
>> > 4. Postgres_1 (Old primary): Streaming replication begins from the new
>> > primary Postgres_2, but fails due to timeline divergence.
>> >
>> > We observed a behavioral difference between PostgreSQL 14.17 and 14.18:
>> > - In 14.17, when the standby reaches the end of the WAL on the old timeline
>> > (before the new primary was promoted), the walreceiver process does not
>> > terminate.
>> > - In 14.18, under the same condition, the walreceiver process is terminated.
>> > This leads to different results when querying: SELECT status FROM
>> > pg_stat_wal_receiver;
>> > Observed Behavior:
>> > - PostgreSQL 14.17:
>> > The status is often waiting. After reaching the end of the timeline, there
>> > is a 5-second delay (wal_retrieve_retry_interval) before the next attempt.
>> > During this interval, the status remains waiting.
>> > Our monitoring service detects this as an unhealthy standby and triggers
>> > recovery using pg_rewind or pg_basebackup.
>> > - PostgreSQL 14.18:
>> > The status is often streaming. The walreceiver process is repeatedly
>> > terminated and restarted without the 5-second delay seen in 14.17.
>> > As a result, our monitoring service sees the standby as healthy (streaming),
>> > even though it is unable to stream WAL correctly. This causes the cluster to
>> > become unwritable due to synchronous replication.
>> >
>> > Log Comparison:
>> > We’ve attached excerpts from both versions showing the difference in
>> > walreceiver behavior when reaching the end of WAL on a diverged timeline.
>> > PostgreSQL 14.17:
>> > Oct 20 10:35:52 S2 postgres LOG:  replication terminated by primary server
>> > Oct 20 10:35:52 S2 postgres DETAIL:  End of WAL reached on timeline 7 at
>> > 1/9A7C1020.
>> > Oct 20 10:35:52 S2 postgres DEBUG:  sending write 1/9A7C1020 flush
>> > 1/9A806380 apply 1/9A806380
>> > Oct 20 10:35:52 S2 postgres DEBUG:  walreceiver ended streaming and awaits
>> > new instructions
>> > Oct 20 10:35:52 S2 postgres LOG:  new timeline 9 forked off current database
>> > system timeline 7 before current recovery point 1/9A806380
>> > Oct 20 10:35:57 S2 postgres DEBUG:  skipping restartpoint, already performed
>> > at 0/0
>> > Oct 20 10:35:57 S2 postgres DEBUG:  switched WAL source from stream to
>> > archive after failure
>> > Oct 20 10:35:57 S2 postgres DEBUG:  invalid record length at 1/9A806380:
>> > wanted 24, got 0
>> > Oct 20 10:35:57 S2 postgres DEBUG:  switched WAL source from archive to
>> > stream after failure
>> > Oct 20 10:35:57 S2 postgres LOG:  restarted WAL streaming at 1/9A000000 on
>> > timeline 7
>> >
>> > PostgreSQL 14.18:
>> > Oct 21 06:17:08 S2 postgres[90033]: [476-1] :[29287] LOG:  replication
>> > terminated by primary server
>> > Oct 21 06:17:08 S2 postgres[90033]: [476-2] :[29287] DETAIL:  End of WAL
>> > reached on timeline 5 at 0/69970A80.
>> > Oct 21 06:17:08 S2 postgres[90033]: [477-1] :[29287] DEBUG:  sending write
>> > 0/69970A80 flush 0/69970AF8 apply 0/69970AF8
>> > Oct 21 06:17:08 S2 postgres[90033]: [478-1] :[29287] DEBUG:  walreceiver
>> > ended streaming and awaits new instructions
>> > Oct 21 06:17:08 S2 postgres[90033]: [479-1] :[29287] FATAL:  terminating
>> > walreceiver process due to administrator command
>> > Oct 21 06:17:08 S2 postgres[90033]: [480-1] :[29287] DEBUG:  shmem_exit(1):
>> > 1 before_shmem_exit callbacks to make
>> > Oct 21 06:17:08 S2 postgres[90033]: [481-1] :[29287] DEBUG:  shmem_exit(1):
>> > 7 on_shmem_exit callbacks to make
>> > Oct 21 06:17:08 S2 postgres[85600]: [319-1] :[27280] LOG:  new timeline 7
>> > forked off current database system timeline 5 before current recovery point
>> > 0/69970AF8
>> > Oct 21 06:17:08 S2 postgres[85600]: [320-1] :[27280] DEBUG:  switched WAL
>> > source from stream to archive after failure
>> > Oct 21 06:17:08 S2 postgres[85600]: [321-1] :[27280] DEBUG:  invalid record
>> > length at 0/69970AF8: wanted 24, got 0
>> > Oct 21 06:17:08 S2 postgres[85600]: [322-1] :[27280] DEBUG:  switched WAL
>> > source from archive to stream after failure
>> > Oct 21 06:17:08 S2 postgres[90033]: [482-1] :[29287] DEBUG:  proc_exit(1): 2
>> > callbacks to make
>> > Oct 21 06:17:08 S2 postgres[90033]: [483-1] :[29287] DEBUG:  exit(1)
>> > Oct 21 06:17:08 S2 postgres[90033]: [484-1] :[29287] DEBUG:  shmem_exit(-1):
>> > 0 before_shmem_exit callbacks to make
>> > Oct 21 06:17:08 S2 postgres[90033]: [485-1] :[29287] DEBUG:  shmem_exit(-1):
>> > 0 on_shmem_exit callbacks to make
>> > Oct 21 06:17:08 S2 postgres[90033]: [486-1] :[29287] DEBUG:  proc_exit(-1):
>> > 0 callbacks to make
>> > Oct 21 06:17:08 S2 postgres[85555]: [322-1] :[27235] DEBUG:  reaping dead
>> > processes
>> > Oct 21 06:17:08 S2 postgres[90837]: [323-1] :[29637] DEBUG:
>> > find_in_dynamic_libpath: trying
>> > "/usr/lib/postgresql14/lib64/libpqwalreceiver"
>> > Oct 21 06:17:08 S2 postgres[90837]: [324-1] :[29637] DEBUG:
>> > find_in_dynamic_libpath: trying
>> > "/usr/lib/postgresql14/lib64/libpqwalreceiver.so"
>> >
>> > Root Cause Analysis:
>> > We traced the change to the following commit:
>> > Reset InstallXLogFileSegmentActive after walreceiver self-initiated exit.
>> > https://github.com/postgres/postgres/commit/3635a0a35aafd3bfa80b7a809bc6e91ccd36606a
>> > This commit was introduced to fix a regression reported by Noah Misch
>> > (commit 1f95181), which caused spurious failures in archive recovery.
>> > In PostgreSQL 14.17, the shutdown logic was conditional:
>> > if (WalRcvStreaming())
>> >     XLogShutdownWalRcv();
>> > In PostgreSQL 14.18, the condition was removed:
>> > XLogShutdownWalRcv();
>> > As a result, the walreceiver process is terminated even when it is in a
>> > waiting state. In 14.17, the condition would prevent termination in this
>> > scenario.
>> > The intention was not to terminate the walreceiver process, but rather to
>> > reset the InstallXLogFileSegmentActive flag, which was previously only
>> > cleared inside XLogShutdownWalRcv().
>> > As noted in the commit discussion:
>> > The problem is that whenever we are going for streaming we always set
>> > XLogCtl->InstallXLogFileSegmentActive to true, but while switching
>> > from streaming to archive we do not always reset it so it hits
>> > assertion in some cases. Basically we reset it inside
>> > XLogShutdownWalRcv() but while switching from the streaming mode we
>> > only call it conditionally if WalRcvStreaming(). But it is very much
>> > possible that even before we call WalRcvStreaming() the walreceiver
>> > might have set walrcv->walRcvState to WALRCV_STOPPED. So now
>> > WalRcvStreaming() will return false. So I agree now we do not want to
>> > really shut down the walreceiver but who will reset the flag
>> >
>> > This change inadvertently causes walreceiver to terminate in scenarios where
>> > it previously remained alive, altering the observable behavior of streaming
>> > replication.
>>
>> Thanks for your analysis here.
>>
>> > Questions
>> > 1. Is terminating the walreceiver process in this scenario (end of WAL on a
>> > diverged timeline) the expected behavior in 14.18 or later releases?
>>
>> I think this is not the expected behavior. It is likely a bug
>> introduced in commit 3635a0a.
>>
>> > 2. Is it expected that pg_stat_wal_receiver.status = 'streaming' may not
>> > accurately reflect streaming health in this case?
>>
>> IMO, the expected behavior is for the walreceiver to remain in
>> WALRCV_WAITING state with status 'waiting', clearly indicating that
>> replication has stalled due to timeline issues.
>>
>> > 3. Would it be safer to decouple flag-reset logic from XLogShutdownWalRcv()
>> > to avoid unintended process termination?
>>
>> +1 for decoupling here. Please see the attached patch — I’ve tested it
>> on HEAD, and the walreceiver process is not being terminated.
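
For completeness, here is the race that originally motivated removing the
condition, as described in the quoted discussion above (a simplified
sketch, not verbatim source):

    /*
     * If the walreceiver has already set walRcvState to WALRCV_STOPPED by
     * the time the startup process gets here, WalRcvStreaming() returns
     * false, the shutdown call is skipped, and InstallXLogFileSegmentActive
     * is never cleared, which can later trip an assertion during archive
     * recovery.
     */
    if (WalRcvStreaming())
        XLogShutdownWalRcv();   /* the flag is only reset inside this call */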


