On 3/4/24 09:35, Rintaro.Ikeda@nttdata.com wrote:
> Hi,
>
> I am correcting my previous bug report [1] to give a more accurate
> description. The report demonstrated an undetected deadlock between
> a client backend and the startup process on a standby server. (The
> title of the previous report, "Undetected deadlock between primary
> and standby processes", was wrong; it should have been "Undetected
> deadlock between client backend and startup process on a standby
> server".)
>
> After the procedure described in my bug report [1], a recovery
> conflict arises because the tablespace that the startup process
> tries to drop is still in use by a client backend process on the
> standby. The pg_stat_activity output (shown below) implies a
> deadlock: the client backend waits for an AccessExclusiveLock to
> be released, while the startup process waits for recovery conflict
> resolution in order to drop the tablespace. The deadlock is not
> resolved even after deadlock_timeout passes.
>
> (Standby server)
> postgres=# select datid, datname, wait_event_type, wait_event, query, backend_type from pg_stat_activity ;
>  datid | datname  | wait_event_type | wait_event                 | query            | backend_type
> -------+----------+-----------------+----------------------------+------------------+----------------
>      5 | postgres | Lock            | relation                   | SELECT * FROM t; | client backend
>        |          | IPC             | RecoveryConflictTablespace |                  | startup
>
> This deadlock is similar to a previously identified and patched
> issue [2], which also involved an undetected deadlock between a
> backend process and recovery on a standby server. I think the
> deadlock described in this report should be detected and resolved.
>
Thanks for the report.
So what are the steps to reproduce this? The previous message did all
kinds of stuff on the primary and then got stuck on pg_switch_wal() on
the primary, but this updated report seems to do stuff on the standby
and hit the lockup there.
It seems similar in the sense that it's about interaction between
recovery and a regular backend, but unfortunately
ResolveRecoveryConflictWithVirtualXIDs does not wait for a lock, it just
checks if the XID is still running, so it's invisible to the deadlock
detector :-(
But it's still checked against max_standby_streaming_delay, which should
resolve the deadlock (unless set to -1 to allow infinite delays) at some
point, right?
Also, I'm not very familiar with ResolveRecoveryConflictWithVirtualXIDs,
but it seems it's doing a busy wait. I wonder if that's a good idea, but
it's independent of this bug report.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company