Re: 024_add_drop_pub.pl might fail due to deadlock - Mailing list pgsql-hackers

From vignesh C
Subject Re: 024_add_drop_pub.pl might fail due to deadlock
Msg-id CALDaNm14FkrASB8jj27k6MSgrDpOJSZpVv=y=BHvhAoz5B7rNw@mail.gmail.com
In response to Re: 024_add_drop_pub.pl might fail due to deadlock  (Ajin Cherian <itsajin@gmail.com>)
List pgsql-hackers
On Mon, 14 Jul 2025 at 15:46, Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Tue, Jul 8, 2025 at 8:41 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > Patch with fix attached.
> > I'll continue investigating whether this issue also affects HEAD.
> >
>
> While debugging whether this problem can occur on HEAD, I found that on
> HEAD it is mostly the tablesync worker that drops the origin, and
> since the tablesync worker does not attempt to update the
> SubscriptionRel state in that process, there doesn't seem to be the
> possibility of a deadlock. But there is a rare situation where the
> tablesync worker could crash or hit an error just before dropping
> the origin, in which case the origin is dropped by the apply worker (this is
> explained in the comments in process_syncing_tables_for_sync()). If
> the origin has to be dropped by the apply worker, then the same
> deadlock can happen in HEAD code as well. I was able to simulate this
> by using an injection point to raise an error in the tablesync worker,
> and a similar deadlock then happens on HEAD as well. Attaching a
> patch that fixes this on HEAD as well.

I was able to reproduce the deadlock on HEAD as well using the
attached patch, which adds a sleep of a few seconds in the tablesync
worker just before it drops the replication origin. During this delay,
the apply worker also attempts to drop the replication origin. If an
ALTER SUBSCRIPTION command is executed concurrently, a deadlock
frequently occurs:
2025-07-14 15:59:53.572 IST [141100] DETAIL:  Process 141100 waits for AccessExclusiveLock on object 2 of class 6000 of database 0; blocked by process 140974.
Process 140974 waits for AccessShareLock on object 16396 of class 6100 of database 0; blocked by process 141100.
Process 141100: alter subscription sub1 drop publication pub1
Process 140974: <command string not enabled>

After applying the attached patch, set up logical replication with a
publication pub1 containing table t1, and then run the following
commands on the subscriber in a loop:
while true; do
    # connect to the subscriber (connection options omitted)
    psql -c "alter subscription sub1 drop publication pub1;"
    psql -c "alter subscription sub1 add publication pub1;"
    sleep 4
done

Regards,
Vignesh

