Re: 024_add_drop_pub.pl might fail due to deadlock - Mailing list pgsql-hackers
From | Ajin Cherian |
---|---|
Subject | Re: 024_add_drop_pub.pl might fail due to deadlock |
Date | |
Msg-id | CAFPTHDZTDWv0wpccxm73koixvQcorPrPSHg=d+sLb23xXS7qEQ@mail.gmail.com Whole thread Raw |
In response to | RE: 024_add_drop_pub.pl might fail due to deadlock ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>) |
List | pgsql-hackers |
On Wed, Jul 23, 2025 at 4:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 21, 2025 at 6:59 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > Yes, this is correct. I have also verified this in my testing that > > when there is a second subscription, a deadlock can still happen with > > my previous patch. I have now modified the patch in tablesync worker > > to take locks on both SubscriptionRelationId and > > SubscriptionRelRelationId prior to taking lock on > > ReplicationOriginRelationId. I have also found that there is a similar > > problem in DropSubscription() where lock on SubscriptionRelRelationId > > is not taken before dropping the tracking origin. I've also modified > > the signature of UpdateSubscriptionRelState to take a bool > > "lock_needed" which if false, the SubscriptionRelationId is not > > locked. I've made a new version of the patch on PG_15. > > > > Review comments: > ================ > 1. > + if (!rel) > + rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock); > > Why did you acquire AccessExclusiveLock here when the current code has > RowExclusiveLock? It should be RowExclusiveLock. > Yes, you are correct. I have replaced it with RowExclusiveLock. > 2. > + * > + * Lock SubscriptionRelationId with AccessShareLock and > + * take AccessExclusiveLock on SubscriptionRelRelationId to > + * prevent any deadlocks with user concurrently performing > + * refresh on the subscription. > */ > > Try to tell in the comments that we want to keep the locking order > same as DDL commands to prevent deadlocks. > Modified. > 3. > + * Also close any tables prior to the commit. > */ > + if (rel) > + { > + table_close(rel, AccessExclusiveLock); > + rel = NULL; > > We don't need to explicitly release lock on table_close, it will be > done at transaction end, so use NoLock here as we do in current HEAD > code. > Done. > 4. > DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel) > { > - Relation rel; > + Relation rel, sub_rel; > ObjectAddress myself; > HeapTuple tup; > Oid subid; > @@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt, > bool isTopLevel) > * Note that the state can't change because we have already stopped both > * the apply and tablesync workers and they can't restart because of > * exclusive lock on the subscription. > + * > + * Lock pg_subscription_rel with AccessExclusiveLock to prevent any > + * deadlock with apply workers of other subscriptions trying > + * to drop tracking origin. > */ > + sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock); > > I don't think we need AccessExclusiveLock on SubscriptionRelRelationId > in DropSubscription. Kindly test again after fixing the first comment > above. > -- Yes, it was failing because I was taking AccessExclusiveLock in the apply worker, and that was causing the deadlock in my testing. I tested this worked. On Wed, Jul 23, 2025 at 7:12 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Ajin, > > Thanks for the patch. Firstly let me confirm my understanding. While altering the > subscription, locks are acquired with below ordering: > > Order target level > 1 pg_subscription row exclusive > 2 pg_subscription, my tuple access exclusive > 3 pg_subscription_rel access exclusive > 4 pg_replication_origin excluive > > In contrast, apply worker works like: > > Order target level > 1 pg_replication_origin excluive > 2 pg_subscription, my tuple access share > 3 pg_subscrition_rel row exclusive > > Thus a backend may wait at step 4, and apply worker can stuck at step 2 or 3. > Yes, that is correct. > Below are my comments: > > ``` > @@ -340,7 +341,6 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn) > * is dropped. So passing missing_ok = false. > */ > ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, syncslotname, false); > - > ``` > This change is not needed. Removed. > > ``` > + if (!rel) > + rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock); > + > ``` > > I feel it is too strong, isn't it enough to use row exclusive as initially used? > Yes, agree. Fixed. > ``` > UpdateSubscriptionRelState(Oid subid, Oid relid, char state, > - XLogRecPtr sublsn) > + XLogRecPtr sublsn, bool lock_needed) > ``` > > I feel the name `lock_needed` is bit misleading, because the function still opens > the pg_subscription_rel catalog with row exclusive. How about "lock_shared_object"? > I have modified it to not take lock while table_open as well and changed the name of the variable to already_locked. > ``` > @@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel) > * Note that the state can't change because we have already stopped both > * the apply and tablesync workers and they can't restart because of > * exclusive lock on the subscription. > + * > + * Lock pg_subscription_rel with AccessExclusiveLock to prevent any > + * deadlock with apply workers of other subscriptions trying > + * to drop tracking origin. > */ > + sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock); > ``` > > Hmm. Per my understanding, DropSubscription acquires below locks till it reaches > replorigin_drop_by_name(). > > Order target level > 1 pg_subscription access exclusive > 2 pg_subscription, my tuple access exclusive > 3 pg_replication_origin excluive > > IIUC we must preserve the ordering, not the target of locks. I have removed this change as this does not now conflict with the apply worker. Two patches attached. One for HEAD and the other for PG_15. regards, Ajin Cherian Fujitsu Australia
Attachment
pgsql-hackers by date: