Re: Standby server with cascade logical replication could not be properly stopped under load - Mailing list pgsql-bugs

From	shveta malik
Subject	Re: Standby server with cascade logical replication could not be properly stopped under load
Date	May 22, 2025 13:33:14
Msg-id	CAJpy0uA9kw9WLWETPg+upwzBu145vMB7sPZwBuR_vDrbazeuag@mail.gmail.com
List	pgsql-bugs

On Thu, May 22, 2025 at 7:51 AM Alexey Makhmutov
<a.makhmutov@postgrespro.ru> wrote:
>
> Assuming following configuration with three connected servers A->B->C: A
> (primary), B (physical standby) and C (logical replica connected to B).
> If server A is under load and B is applying incoming WAL records while
> also streaming data via logical replication to C, then an attempt to stop
> server B in 'fast' mode may be unsuccessful. In this case the server will
> remain in PM_SHUTDOWN state indefinitely with all 'walsender' processes
> running in an infinite busy loop (consuming a CPU core each). To get
> server out of this state it's required either to stop B using
> 'immediate' mode or stop server C (which will cause 'walsender'
> processes on server B to exit). This issue is reproducible on latest
> 'master', as well as on current PG 16/17 branches.
>
> Attached is a test scenario to reproduce the issue: 'test_scenario.zip'.
> This archive contains shell scripts to create
> the required environment (all three servers) and then to execute required
> steps to get server into incorrect state. First, edit the 'test_env.sh'
> file and specify path to PG binaries in PG_PATH variable and optionally
> set of ports used by test instances in 'pg_port' array. Then execute the
> 'test_prepare.sh' script, which will create, configure and start all
> three PG instances. Servers then could be started and stopped using
> corresponding start and stop scripts. To reproduce the issue, ensure
> that all three servers are running and execute the 'test_execute.sh'
> script. This script starts 'pgbench' instance in background for 30
> seconds to create load on server A, waits for 20 seconds and then tries to
> stop the B instance using default 'fast' mode. Expected behavior is
> normal shutdown for B, while observed behavior is different: shutdown
> attempt fails and each remaining 'walsender' process consumes entire CPU
> core. To get out of this state one could use 'stop-C.sh' script to stop
> the server C, as it will complete shutdown process of B instance as well.
>
> My understanding is that this issue seems to be caused by the logic in
> 'GetStandbyFlushRecPtr' function, which returns current flush point for
> received WAL data. This position is used in 'XLogSendLogical' to
> calculate whether current walsender is in 'caught up' state (i.e.
> whether we send all the available data to downstream instance). During
> shutdown walsenders are allowed to continue their work until they are in
> 'caught up' state, while 'postmaster' is waiting for their completion.
> Currently 'GetStandbyFlushRecPtr' returns position of last stored
> record, rather than last applied record. This is correct for physical
> replication as we can send data to downstream instance without applying
> it to local system. However, for logical replication this seems to be
> incorrect, as we could not decode data until it's applied on current
> instance. So, if current stored WAL position differs from applied
> position while server is being stopped, then
> 'WalSndLoop'/'XLogSendLogical'/'XLogReadRecord' methods will spin in a
> busy loop, waiting for applied position to advance. The recovery process
> is already stopped at this moment, so this will be an infinite loop.
> Probably either 'GetStandbyFlushRecPtr' or
> 'WalSndLoop'/'XLogSendLogical' logic need to be adjusted to take into
> consideration such a case with logical replication.
>
> Attached is also a patch, which aims to fix this issue:
> 0001-Use-only-replayed-position-as-target-flush-point-for.patch. It
> tries to modify behavior of 'GetStandbyFlushRecPtr' function to
> ensure that it returns only applied position for logical replication.
> This function could be also invoked from slot synchronization routines
> and in this case it retains current behavior by returning last stored
> position.
>
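
The hang described in the quoted analysis can be illustrated with a small
model. This is a sketch in Python, not PostgreSQL code: the function name,
the integer LSN values and the iteration cap are all assumptions made for
illustration. It assumes that a logical walsender can only decode WAL up to
the replayed position, and that recovery has already stopped, so that
position no longer advances.

```python
from typing import Optional


def iterations_until_caught_up(target_lsn: int, replay_lsn: int, sent_lsn: int,
                               max_iterations: int = 1000) -> Optional[int]:
    """Hypothetical model of a logical walsender during shutdown.

    Returns how many loop iterations pass before the sender is 'caught up'
    (sent_lsn >= target_lsn), or None if it is still looping after
    max_iterations. Recovery is assumed stopped, so replay_lsn is fixed.
    """
    for iteration in range(1, max_iterations + 1):
        # Assumption: only WAL that has been replayed can be decoded and sent.
        sent_lsn = max(sent_lsn, min(target_lsn, replay_lsn))
        if sent_lsn >= target_lsn:
            return iteration
    return None


# Target is the flushed position (500) but only 300 has been replayed:
# the model never reaches 'caught up'.
print(iterations_until_caught_up(target_lsn=500, replay_lsn=300, sent_lsn=100))

# Flushed and replayed positions coincide: caught up on the first pass.
print(iterations_until_caught_up(target_lsn=300, replay_lsn=300, sent_lsn=100))
```

In the first call the model gives up only because of the artificial
iteration cap; a real walsender has no such cap, which would match the
reported busy loop.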

The problem stated in 'logical-walsender' on 'physical standby' looks
genuine. I agree with the analysis for slot-sync as well. Slot-sync
does not need the fix as it deals only with flush-position and does
not care about replay-position. Since the problem area falls under
'Allow logical decoding on standbys', I am adding Bertrand for further
comments on this fix.
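
To make the distinction between the two callers concrete, here is a rough
Python sketch of the behavior the patch description asks for. It is not the
patch itself: the 'Caller' enum, the function name and the plain integer
LSNs are hypothetical, and the real 'GetStandbyFlushRecPtr' is C code with
more details (such as timelines) that are left out here.

```python
from enum import Enum


class Caller(Enum):
    # Hypothetical labels for the two kinds of callers discussed in the thread.
    LOGICAL_WALSENDER = "logical walsender"
    SLOT_SYNC = "slot synchronization"


def standby_target_lsn(flush_lsn: int, replay_lsn: int, caller: Caller) -> int:
    """Pick the position a caller should treat as its target on a standby.

    Assumption: a logical walsender cannot decode past the replayed position,
    so it gets replay_lsn; slot synchronization only cares about what has
    been stored, so it keeps getting flush_lsn.
    """
    if caller is Caller.LOGICAL_WALSENDER:
        return replay_lsn
    return flush_lsn


# Standby has flushed up to 500 but replayed only up to 300.
print(standby_target_lsn(500, 300, Caller.LOGICAL_WALSENDER))  # 300
print(standby_target_lsn(500, 300, Caller.SLOT_SYNC))          # 500
```

With the replayed position as its target, a logical walsender can reach the
'caught up' state even when recovery has stopped behind the flush position.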

thanks
Shveta
