Re: 024_add_drop_pub.pl might fail due to deadlock - Mailing list pgsql-hackers

From vignesh C
Subject Re: 024_add_drop_pub.pl might fail due to deadlock
Msg-id CALDaNm3PrTkVc2uxMyQTkqw0sg7O6i0EXe1jJo9CzOyW2gFS+Q@mail.gmail.com
In response to Re: 024_add_drop_pub.pl might fail due to deadlock  (vignesh C <vignesh21@gmail.com>)
List pgsql-hackers
On Mon, 14 Jul 2025 at 16:15, vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, 14 Jul 2025 at 15:46, Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Tue, Jul 8, 2025 at 8:41 PM Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > > Patch with fix attached.
> > > I'll continue investigating whether this issue also affects HEAD.
> > >
> >
> > While debugging whether this problem can occur on HEAD, I found that on
> > HEAD it is mostly the tablesync worker that drops the origin,
> > and since the tablesync worker does not attempt to update the
> > SubscriptionRel state in that process, there doesn't seem to be the
> > possibility of a deadlock. But there is a rare situation where the
> > tablesync worker could crash or get an error just prior to dropping
> > the origin, then the origin is dropped in the apply worker (this is
> > explained in the comments in process_syncing_tables_for_sync()). If
> > the origin has to be dropped in the apply worker, then the same
> > deadlock can happen in HEAD code as well. I was able to simulate this
> > by using an injection point to create an error on the tablesync worker
> > and then the similar deadlock happens on HEAD as well. Attaching a
> > patch for fixing this on HEAD as well.
>
> I was able to reproduce the deadlock on HEAD as well using the
> attached patch, which introduces a delay in the tablesync worker
> before dropping the replication origin by adding a sleep of a few
> seconds. During this delay, the apply worker also attempts to drop the
> replication origin. If an ALTER SUBSCRIPTION command is executed
> concurrently, a deadlock frequently occurs:
> 2025-07-14 15:59:53.572 IST [141100] DETAIL:  Process 141100 waits for
> AccessExclusiveLock on object 2 of class 6000 of database 0; blocked
> by process 140974.
> Process 140974 waits for AccessShareLock on object 16396 of class 6100
> of database 0; blocked by process 141100.
> Process 141100: alter subscription sub1 drop publication pub1
> Process 140974: <command string not enabled>
>
> After applying the attached patch, create the logical replication setup
> for a publication pub1 having table t1 and then run the following
> commands in a loop:
> alter subscription sub1 drop publication pub1;
> alter subscription sub1 add publication pub1;
> sleep 4

Attached is the script used to reproduce the issue and the deadlock
logs for the same. Your patch fixes the issue.
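
For readers without the attachment, the loop quoted above could be sketched roughly as follows. This is not the attached script, only a minimal illustration under assumptions: sub1 and pub1 come from the thread, while DRY_RUN, ITERATIONS, PAUSE, run_sql and repro_loop are hypothetical names, and psql is assumed to reach the subscriber through the usual PGHOST/PGPORT/PGDATABASE environment variables. By default it only prints the statements; the deadlock itself would still need the delay patch and the replication setup described above.

```shell
#!/bin/sh
# Hypothetical sketch of the reproduction loop (not the attached script).
# sub1/pub1 are taken from the thread; every other name is an assumption.

SUB=${SUB:-sub1}
PUB=${PUB:-pub1}
ITERATIONS=${ITERATIONS:-3}   # how many drop/add rounds to run
PAUSE=${PAUSE:-4}             # seconds between rounds, as in "sleep 4"
DRY_RUN=${DRY_RUN:-1}         # 1 = print the SQL, anything else = run it

# Print the statement in dry-run mode, otherwise send it to the subscriber.
run_sql() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$1"
    else
        psql -X -d "${PGDATABASE:-postgres}" -c "$1"
    fi
}

# Repeatedly drop and re-add the publication on the subscription.
repro_loop() {
    i=0
    while [ "$i" -lt "$ITERATIONS" ]; do
        run_sql "alter subscription $SUB drop publication $PUB;"
        run_sql "alter subscription $SUB add publication $PUB;"
        # Only wait between rounds when really talking to the server.
        [ "$DRY_RUN" = "1" ] || sleep "$PAUSE"
        i=$((i + 1))
    done
}

repro_loop
```

Running it with DRY_RUN=0 against the subscriber would execute the statements for real; whether the deadlock shows up then depends on the timing introduced by the delay patch.
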
A couple of comments:
1) This change is not required:
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/usercontext.h"
+#include "utils/injection_point.h"

2) This can happen not only in the error case but also in normal cases
where the tablesync worker is slow, as shown in the reproduction
script. We can update the commit message accordingly:
In most situations the tablesync worker will drop the corresponding
origin before it finishes executing, but if an error causes the
tablesync worker to fail just prior to dropping the origin, the apply
worker will later find the origin and drop it.

Regards,
Vignesh
