Re: 024_add_drop_pub.pl might fail due to deadlock - Mailing list pgsql-hackers

From Ajin Cherian
Subject Re: 024_add_drop_pub.pl might fail due to deadlock
Date
Msg-id CAFPTHDZTDWv0wpccxm73koixvQcorPrPSHg=d+sLb23xXS7qEQ@mail.gmail.com
Whole thread Raw
In response to RE: 024_add_drop_pub.pl might fail due to deadlock  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
List pgsql-hackers
On Wed, Jul 23, 2025 at 4:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 21, 2025 at 6:59 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > Yes, this is correct. I have also verified this in my testing that
> > when there is a second subscription, a deadlock can still happen with
> > my previous patch. I have now modified the patch in tablesync worker
> > to take locks on both SubscriptionRelationId and
> > SubscriptionRelRelationId prior to taking lock on
> > ReplicationOriginRelationId. I have also found that there is a similar
> > problem in DropSubscription() where lock on SubscriptionRelRelationId
> > is not taken before dropping the tracking origin. I've also modified
> > the signature of UpdateSubscriptionRelState to take a bool
> > "lock_needed" which if false, the SubscriptionRelationId is not
> > locked. I've made a new version of the patch on PG_15.
> >
>
> Review comments:
> ================
> 1.
> + if (!rel)
> + rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
>
> Why did you acquire AccessExclusiveLock here when the current code has
> RowExclusiveLock? It should be RowExclusiveLock.
>

Yes, you are correct. I have replaced it with RowExclusiveLock.

> 2.
> + *
> + * Lock SubscriptionRelationId with AccessShareLock and
> + * take AccessExclusiveLock on SubscriptionRelRelationId to
> + * prevent any deadlocks with user concurrently performing
> + * refresh on the subscription.
>   */
>
> Try to tell in the comments that we want to keep the locking order
> same as DDL commands to prevent deadlocks.
>

Modified.

> 3.
> + * Also close any tables prior to the commit.
>   */
> + if (rel)
> + {
> + table_close(rel, AccessExclusiveLock);
> + rel = NULL;
>
> We don't need to explicitly release lock on table_close, it will be
> done at transaction end, so use NoLock here as we do in current HEAD
> code.
>

Done.

> 4.
> DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
>  {
> - Relation rel;
> + Relation rel, sub_rel;
>   ObjectAddress myself;
>   HeapTuple tup;
>   Oid subid;
> @@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt,
> bool isTopLevel)
>   * Note that the state can't change because we have already stopped both
>   * the apply and tablesync workers and they can't restart because of
>   * exclusive lock on the subscription.
> + *
> + * Lock pg_subscription_rel with AccessExclusiveLock to prevent any
> + * deadlock with apply workers of other subscriptions trying
> + * to drop tracking origin.
>   */
> + sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
>
> I don't think we need AccessExclusiveLock on SubscriptionRelRelationId
> in DropSubscription. Kindly test again after fixing the first comment
> above.
> --

Yes, it was failing because I was taking AccessExclusiveLock in the
apply worker, and that was causing the deadlock in my testing. I
tested this worked.



On Wed, Jul 23, 2025 at 7:12 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Ajin,
>
> Thanks for the patch. Firstly let me confirm my understanding. While altering the
> subscription, locks are acquired with below ordering:
>
> Order   target                                          level
> 1               pg_subscription                         row exclusive
> 2               pg_subscription, my tuple       access exclusive
> 3               pg_subscription_rel                     access exclusive
> 4               pg_replication_origin           excluive
>
> In contrast, apply worker works like:
>
> Order   target                                          level
> 1               pg_replication_origin           excluive
> 2               pg_subscription, my tuple       access share
> 3               pg_subscrition_rel                      row exclusive
>
> Thus a backend may wait at step 4, and apply worker can stuck at step 2 or 3.
>

Yes, that is correct.


> Below are my comments:
>
> ```
> @@ -340,7 +341,6 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
>                  * is dropped. So passing missing_ok = false.
>                  */
>                 ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, syncslotname, false);
> -
> ```
> This change is not needed.

Removed.

>
> ```
> +                               if (!rel)
> +                                       rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
> +
> ```
>
> I feel it is too strong, isn't it enough to use row exclusive as initially used?
>

Yes, agree. Fixed.

> ```
>  UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
> -                                                  XLogRecPtr sublsn)
> +                                                  XLogRecPtr sublsn, bool lock_needed)
> ```
>
> I feel the name `lock_needed` is bit misleading, because the function still opens
> the pg_subscription_rel catalog with row exclusive. How about "lock_shared_object"?
>

I have modified it to not take lock while table_open as well and
changed the name of the variable to already_locked.

> ```
> @@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
>          * Note that the state can't change because we have already stopped both
>          * the apply and tablesync workers and they can't restart because of
>          * exclusive lock on the subscription.
> +        *
> +        * Lock pg_subscription_rel with AccessExclusiveLock to prevent any
> +        * deadlock with apply workers of other subscriptions trying
> +        * to drop tracking origin.
>          */
> +       sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
> ```
>
> Hmm. Per my understanding, DropSubscription acquires below locks till it reaches
> replorigin_drop_by_name().
>
> Order   target                                          level
> 1               pg_subscription                         access exclusive
> 2               pg_subscription, my tuple       access exclusive
> 3               pg_replication_origin           excluive
>
> IIUC we must preserve the ordering, not the target of locks.

I have removed this change as this does not now conflict with the apply worker.

Two patches attached. One for HEAD and the other for PG_15.

regards,
Ajin Cherian
Fujitsu Australia

Attachment

pgsql-hackers by date:

Previous
From: "Daniel Verite"
Date:
Subject: Re: Collation and primary keys
Next
From: Srinath Reddy Sadipiralla
Date:
Subject: is there any specific use of setting transaction_read_only GUC externally?