Deadlock between backend and recovery may not be detected - Mailing list pgsql-hackers

From Fujii Masao
Subject Deadlock between backend and recovery may not be detected
Date
Msg-id 4041d6b6-cf24-a120-36fa-1294220f8243@oss.nttdata.com
Whole thread Raw
Responses Re: Deadlock between backend and recovery may not be detected
List pgsql-hackers
Hi,

While reviewing the patch proposed at [1], I found that there is the case
where deadlock that recovery conflict on lock is involved in may not be
detected. This deadlock can happen between backends and the startup
process, in the standby server. Please see the following procedure to
reproduce the deadlock.

#1. Set up streaming replication.

#2. Set max_standby_streaming_delay to -1 in the standby.

#3. Create two tables in the primary.

     [PRIMARY: SESSION1]
     CREATE TABLE t1 ();
     CREATE TABLE t2 ();

#4. Start transaction and access to the table t1.

     [STANDBY: SESSION2]
     BEGIN;
     SELECT * FROM t1;

#5. Start transaction and lock table t2 in access exclusive mode,
     in the primary. Also execute pg_switch_wal() to transfer WAL record
     for access exclusive lock to the standby.

     [PRIMARY: SESSION1]
     BEGIN;
     LOCK TABLE t2 IN ACCESS EXCLUSIVE MODE;
     SELECT pg_switch_wal();

#6. Access to the table t2 within the transaction that started at #4,
     in the standby.

     [STANDBY: SESSION2]
     SELECT * FROM t2;

#7. Lock table t1 in access exclusive mode within the transaction that
     started in #5, in the primary. Also execute pg_switch_wal() to transfer
     WAL record for access exclusive lock to the standby.

     [PRIMARY: SESSION1]
     LOCK TABLE t1 IN ACCESS EXCLUSIVE MODE;
     SELECT pg_switch_wal();

After doing this procedure, you can see the startup process and backend
wait for the table lock each other, i.e., deadlock. But this deadlock remains
even after deadlock_timeout passes.

This seems a bug to me.

> * Deadlocks involving the Startup process and an ordinary backend process
> * will be detected by the deadlock detector within the ordinary backend.

The cause of this issue seems that ResolveRecoveryConflictWithLock() that
the startup process calls when recovery conflict on lock happens doesn't
take care of deadlock case at all. You can see this fact by reading the above
source code comment for ResolveRecoveryConflictWithLock().

To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
timer in ResolveRecoveryConflictWithLock() so that the startup process can
send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
the backend should check whether the deadlock actually happens or not.
Attached is the POC patch implimenting this.

Thought?

Regards,

[1] https://postgr.es/m/9a60178c-a853-1440-2cdc-c3af916cff59@amazon.com

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment

pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Perform COPY FROM encoding conversions in larger chunks
Next
From: Bharath Rupireddy
Date:
Subject: Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit