Home > mailing lists

Deadlock between backend and recovery may not be detected - Mailing list pgsql-hackers

From	Fujii Masao
Subject	Deadlock between backend and recovery may not be detected
Date	December 16, 2020 12:49:07
Msg-id	4041d6b6-cf24-a120-36fa-1294220f8243@oss.nttdata.com Whole thread Raw
Responses	Re: Deadlock between backend and recovery may not be detected
List	pgsql-hackers

Tree view

Hi,

While reviewing the patch proposed at [1], I found that there is the case
where deadlock that recovery conflict on lock is involved in may not be
detected. This deadlock can happen between backends and the startup
process, in the standby server. Please see the following procedure to
reproduce the deadlock.

#1. Set up streaming replication.

#2. Set max_standby_streaming_delay to -1 in the standby.

#3. Create two tables in the primary.

     [PRIMARY: SESSION1]
     CREATE TABLE t1 ();
     CREATE TABLE t2 ();

#4. Start transaction and access to the table t1.

     [STANDBY: SESSION2]
     BEGIN;
     SELECT * FROM t1;

#5. Start transaction and lock table t2 in access exclusive mode,
     in the primary. Also execute pg_switch_wal() to transfer WAL record
     for access exclusive lock to the standby.

     [PRIMARY: SESSION1]
     BEGIN;
     LOCK TABLE t2 IN ACCESS EXCLUSIVE MODE;
     SELECT pg_switch_wal();

#6. Access to the table t2 within the transaction that started at #4,
     in the standby.

     [STANDBY: SESSION2]
     SELECT * FROM t2;

#7. Lock table t1 in access exclusive mode within the transaction that
     started in #5, in the primary. Also execute pg_switch_wal() to transfer
     WAL record for access exclusive lock to the standby.

     [PRIMARY: SESSION1]
     LOCK TABLE t1 IN ACCESS EXCLUSIVE MODE;
     SELECT pg_switch_wal();

After doing this procedure, you can see the startup process and backend
wait for the table lock each other, i.e., deadlock. But this deadlock remains
even after deadlock_timeout passes.

This seems a bug to me.

> * Deadlocks involving the Startup process and an ordinary backend process
> * will be detected by the deadlock detector within the ordinary backend.

The cause of this issue seems that ResolveRecoveryConflictWithLock() that
the startup process calls when recovery conflict on lock happens doesn't
take care of deadlock case at all. You can see this fact by reading the above
source code comment for ResolveRecoveryConflictWithLock().

To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
timer in ResolveRecoveryConflictWithLock() so that the startup process can
send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
the backend should check whether the deadlock actually happens or not.
Attached is the POC patch implimenting this.

Thought?

Regards,

[1] https://postgr.es/m/9a60178c-a853-1440-2cdc-c3af916cff59@amazon.com

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment

recovery_conflict_lock_deadlock_v1.patch

pgsql-hackers by date:

From: Heikki Linnakangas
Date: 16 December 2020, 12:17:58
Subject: Perform COPY FROM encoding conversions in larger chunks

From: Bharath Rupireddy
Date: 16 December 2020, 12:50:27
Subject: Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

Deadlock between backend and recovery may not be detected - Mailing list pgsql-hackers

Attachment

Previous

Next