Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION
Date
Msg-id CAA4eK1Kj6THmgXZ98d8rxAJa+V1KGS79KgqVve5i0Oyn+=tbMw@mail.gmail.com
Whole thread Raw
In response to Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION
List pgsql-hackers
On Tue, Nov 21, 2023 at 6:56 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 11/21/23 14:16, Amit Kapila wrote:
> > On Tue, Nov 21, 2023 at 5:17 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >
> > It seems there is some inconsistency in what you have written for
> > client backends/tablesync worker vs. apply worker. The above text
> > seems to be saying that the client backend and table sync worker are
> > waiting on a "subscription row in pg_subscription" and the apply
> > worker is operating on "pg_subscription_rel". So, if that is true then
> > they shouldn't get stuck.
> >
> > I think here client backend and tablesync worker seems to be blocked
> > for a lock on pg_subscription_rel.
> >
>
> Not really, they are all locking the subscription. All the locks are on
> classid=6100, which is pg_subscription:
>
>   test=# select 6100::regclass;
>       regclass
>   -----------------
>    pg_subscription
>   (1 row)
>
> The thing is, the tablesync workers call UpdateSubscriptionRelState,
> which locks the pg_subscription catalog at the very beginning:
>
>    LockSharedObject(SubscriptionRelationId, ...);
>
> So that's the issue. I haven't explored why it's done this way, and
> there's no comment explaining locking the subscriptions is needed ...
>

I think it prevents concurrent drop of rel during the REFRESH operation.

> >> The tablesync workers can't proceed because their lock request is stuck
> >> behind the AccessExclusiveLock request.
> >>
> >> And the apply worker can't proceed, because it's waiting for status
> >> update from the tablesync workers.
> >>
> >
> > This part is not clear to me because
> > wait_for_relation_state_change()->GetSubscriptionRelState() seems to
> > be releasing the lock while closing the relation. Am, I missing
> > something?
> >
>
> I think you're missing the fact that GetSubscriptionRelState() acquires
> and releases the lock on pg_subscription_rel, but that's not the lock
> causing the issue. The problem is the lock on the pg_subscription row.
>

Okay. IIUC, what's going on here is that the apply worker acquires
AccessShareLock on pg_subscription to update rel state for one of the
tables say tbl-1, and then for another table say tbl-2, it started
waiting for a state change via wait_for_relation_state_change(). I
think here the fix is to commit the transaction before we go for a
wait. I guess we need something along the lines of what is proposed in
[1] though we have solved the problem in that thread in some other
way..

[1] - https://www.postgresql.org/message-id/1412708.1674417574%40sss.pgh.pa.us

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Давыдов Виталий
Date:
Subject: Re: How to accurately determine when a relation should use local buffers?
Next
From: Tomas Vondra
Date:
Subject: Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION