Re: Standby server with cascade logical replication could not be properly stopped under load - Mailing list pgsql-bugs
From | shveta malik |
---|---|
Subject | Re: Standby server with cascade logical replication could not be properly stopped under load |
Date | |
Msg-id | CAJpy0uA9kw9WLWETPg+upwzBu145vMB7sPZwBuR_vDrbazeuag@mail.gmail.com |
List | pgsql-bugs |
On Thu, May 22, 2025 at 7:51 AM Alexey Makhmutov <a.makhmutov@postgrespro.ru> wrote:
>
> Assuming the following configuration with three connected servers A->B->C: A
> (primary), B (physical standby) and C (logical replica connected to B).
> If server A is under load and B is applying incoming WAL records while
> also streaming data via logical replication to C, then an attempt to stop
> server B in 'fast' mode may be unsuccessful. In this case the server will
> remain in PM_SHUTDOWN state indefinitely with all 'walsender' processes
> running in an infinite busy loop (consuming a CPU core each). To get the
> server out of this state it's required either to stop B using
> 'immediate' mode or to stop server C (which will cause the 'walsender'
> processes on server B to exit). This issue is reproducible on latest
> 'master', as well as on current PG 16/17 branches.
>
> Attached is a test scenario to reproduce the issue: 'test_scenario.zip'.
> This archive contains shell scripts to create
> the required environment (all three servers) and then to execute the required
> steps to get the server into the incorrect state. First, edit the 'test_env.sh'
> file and specify the path to PG binaries in the PG_PATH variable and optionally
> the set of ports used by test instances in the 'pg_port' array. Then execute the
> 'test_prepare.sh' script, which will create, configure and start all
> three PG instances. Servers can then be started and stopped using the
> corresponding start and stop scripts. To reproduce the issue, ensure
> that all three servers are running and execute the 'test_execute.sh'
> script. This script starts a 'pgbench' instance in the background for 30
> seconds to create load on server A, waits for 20 seconds and then tries to
> stop the B instance using the default 'fast' mode. Expected behavior is
> normal shutdown for B, while observed behavior is different: the shutdown
> attempt fails and each remaining 'walsender' process consumes an entire CPU
> core. To get out of this state one could use the 'stop-C.sh' script to stop
> server C, as it will complete the shutdown process of the B instance as well.
>
> My understanding is that this issue seems to be caused by the logic in the
> 'GetStandbyFlushRecPtr' function, which returns the current flush point for
> received WAL data. This position is used in 'XLogSendLogical' to
> calculate whether the current walsender is in 'caught up' state (i.e.
> whether we have sent all the available data to the downstream instance). During
> shutdown walsenders are allowed to continue their work until they are in
> 'caught up' state, while 'postmaster' is waiting for their completion.
> Currently 'GetStandbyFlushRecPtr' returns the position of the last stored
> record, rather than the last applied record. This is correct for physical
> replication as we can send data to the downstream instance without applying
> it to the local system. However, for logical replication this seems to be
> incorrect, as we cannot decode data until it's applied on the current
> instance. So, if the current stored WAL position differs from the applied
> position while the server is being stopped, then the
> 'WalSndLoop'/'XLogSendLogical'/'XLogReadRecord' functions will spin in a
> busy loop, waiting for the applied position to advance. The recovery process
> is already stopped at this moment, so this will be an infinite loop.
> Probably either 'GetStandbyFlushRecPtr' or the
> 'WalSndLoop'/'XLogSendLogical' logic needs to be adjusted to take
> such a case with logical replication into consideration.
>
> Attached is also a patch, which aims to fix this issue:
> 0001-Use-only-replayed-position-as-target-flush-point-for.patch. It
> tries to modify the behavior of the 'GetStandbyFlushRecPtr' function to
> ensure that it returns only the applied position for logical replication.
> This function could also be invoked from slot synchronization routines
> and in this case it retains the current behavior by returning the last stored
> position.
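
For readers following the analysis above, here is a minimal standalone sketch of the described condition. It is a toy model, not PostgreSQL source: 'WalPos', 'StandbyState', 'target_flush_current', 'target_flush_patched' and 'iterations_until_caught_up' are hypothetical names made up for illustration, and the loop is bounded only so that the example terminates. It assumes, as the report states, that a logical walsender can only decode WAL that has already been replayed and that recovery has stopped by the time of shutdown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a 64-bit WAL position (hypothetical type). */
typedef uint64_t WalPos;

/* Hypothetical model of a standby's WAL state; not a PostgreSQL struct. */
typedef struct StandbyState
{
    WalPos flushed;   /* last WAL position received and stored */
    WalPos replayed;  /* last WAL position applied by recovery */
} StandbyState;

/* Behavior described in the report: the target is the stored position. */
static WalPos
target_flush_current(const StandbyState *s)
{
    return s->flushed;
}

/*
 * Idea of the proposed patch as described in the report: logical
 * walsenders use the replayed position, other callers (for example
 * slot synchronization) keep using the stored position.
 */
static WalPos
target_flush_patched(const StandbyState *s, bool for_logical_decoding)
{
    return for_logical_decoding ? s->replayed : s->flushed;
}

/*
 * Toy model of a logical walsender during shutdown. Recovery has stopped,
 * so s->replayed never advances. Returns the number of iterations needed
 * to reach 'target', or -1 if max_iterations passes without catching up
 * (the stand-in for the infinite busy loop).
 */
static int
iterations_until_caught_up(const StandbyState *s, WalPos sent,
                           WalPos target, int max_iterations)
{
    assert(s->replayed <= s->flushed);

    for (int i = 0; i < max_iterations; i++)
    {
        /* Only WAL that has been replayed can be decoded and sent. */
        if (sent < s->replayed)
            sent = s->replayed;

        if (sent >= target)
            return i + 1;   /* caught up, the walsender may exit */
    }
    return -1;              /* never caught up */
}
```

With flushed = 2000 and replayed = 1500, calling iterations_until_caught_up() with the target from target_flush_current() never reports caught up, while the target from target_flush_patched(&s, true) is reached on the first iteration. Whether the real fix belongs in 'GetStandbyFlushRecPtr' or in the walsender loop is left open in the report; the sketch only shows why the two targets behave differently.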

The problem stated in 'logical-walsender' on 'physical standby' looks
genuine. I agree with the analysis for slot-sync as well. Slot-sync
does not need the fix as it deals only with flush-position and does
not care about replay-position.

Since the problem area falls under 'Allow logical decoding on
standbys', I am adding Bertrand for further comments on this fix.

thanks
Shveta