Thread: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Tomas Vondra
Date:
Hi,

I decided to do some stress-testing of the built-in logical replication,
as part of the sequence decoding work. And I soon ran into an undetected
deadlock related to ALTER SUBSCRIPTION ... REFRESH PUBLICATION :-(

The attached bash script triggers that in a couple of seconds for me.
The script looks complicated, but most of the code is just waiting for
sync to complete, catchup, and that sort of thing.

What the script does is pretty simple:

1) initialize two clusters, set them as publisher/subscriber pair

2) create some number of tables, add them to publication and wait for
   the sync to complete

3) start two pgbench runs in the background, modifying the publication
   (one removes+adds all tables in a single transaction, one does that
    with transaction per table)

4) run refresh.sh which does ALTER SUBSCRIPTION ... REFRESH PUBLICATION
   in a loop (now that I think about it, this could be another pgbench
   script, but well ...)

5) some consistency checks, but the lockup happens earlier so this does
   not really matter

After a small number of refresh cycles (for me it's usually a couple
dozen), we end up with a couple stuck locks (I shortened the backend
type string a bit, for formatting reasons):

  test=# select a.pid, classid, objid, backend_type, query
           from pg_locks l join pg_stat_activity a on (a.pid = l.pid)
          where not granted;

     pid   | classid | objid | backend_type     |         query
  ---------+---------+-------+------------------+----------------------
   2691941 |    6100 | 16785 | client backend   | ALTER SUBSCRIPTION s
                                                  REFRESH PUBLICATION
   2691837 |    6100 | 16785 | tablesync worker |
   2691936 |    6100 | 16785 | tablesync worker |
  (3 rows)

All these backends wait for 6100/16785, which is the subscription row in
pg_subscription. The tablesync workers are requesting AccessShareLock,
while the client backend asks for AccessExclusiveLock.

The entry is currently locked by:

  test=# select a.pid, mode, backend_type from pg_locks l
           join pg_stat_activity a on (a.pid = l.pid)
          where classid=6100 and objid=16785 and granted;

     pid   |      mode       |           backend_type
  ---------+-----------------+----------------------------------
   2690477 | AccessShareLock | logical replication apply worker
  (1 row)

But the apply worker is not waiting for any locks, so what's going on?

Well, the problem is that the apply worker is waiting for notification
from the tablesync workers that the relation is synced, which happens by
updating the pg_subscription_rel row. And that wait happens in
wait_for_relation_state_change(), which simply checks the row in a loop,
sleeping via WaitLatch() between checks.

Unfortunately, the tablesync workers can't update the row because the
client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION
sneaked in and is waiting for an AccessExclusiveLock. So the tablesync
workers are stuck behind it in the lock queue and can't proceed.

The client backend can't proceed, because it's waiting for a lock held
by the apply worker.

The tablesync workers can't proceed because their lock request is stuck
behind the AccessExclusiveLock request.

And the apply worker can't proceed, because it's waiting for status
update from the tablesync workers.

And the deadlock is undetected, because the apply worker is not waiting
on a lock, but sleeping on a latch :-(
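
For illustration, the wait loop has roughly this shape (a simplified
sketch, not the exact tablesync.c code; relid and expected_state are the
function's arguments, and the real loop also checks whether the sync
worker is still alive):

  /* simplified sketch of wait_for_relation_state_change() */
  for (;;)
  {
      XLogRecPtr  statelsn;
      char        state;

      CHECK_FOR_INTERRUPTS();

      /* re-read the pg_subscription_rel row for this relation */
      state = GetSubscriptionRelState(MyLogicalRepWorker->subid,
                                      relid, &statelsn);

      if (state == expected_state)
          return true;

      /*
       * Not there yet, so sleep on the latch and retry. This wait is
       * invisible to the deadlock detector, because it's not a
       * heavyweight lock wait.
       */
      (void) WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                       1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
      ResetLatch(MyLatch);
  }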


I don't know what the right solution is here. I wonder if the apply
worker might release the lock before waiting for the update; that would
solve this whole issue.

Alternatively, ALTER SUBSCRIPTION might wait for the lock only for a
limited amount of time and then retry (but then it'd be susceptible to
starvation, of course).

Or maybe there's a way to make this work in a way that would be visible
to the deadlock detector? That'd mean we occasionally get processes
killed to resolve a deadlock, but that's still better than processes
stuck indefinitely ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Amit Kapila
Date:
On Tue, Nov 21, 2023 at 5:17 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I decided to do some stress-testing of the built-in logical replication,
> as part of the sequence decoding work. And I soon ran into an undetected
> deadlock related to ALTER SUBSCRIPTION ... REFRESH PUBLICATION :-(
>
> The attached bash scripts triggers that in a couple seconds for me. The
> script looks complicated, but most of the code is waiting for sync to
> complete, catchup, and that sort of thing.
>
> What the script does is pretty simple:
>
> 1) initialize two clusters, set them as publisher/subscriber pair
>
> 2) create some number of tables, add them to publication and wait for
>    the sync to complete
>
> 3) start two pgbench runs in the background, modifying the publication
>    (one removes+adds all tables in a single transaction, one does that
>     with transaction per table)
>
> 4) run refresh.sh which does ALTER PUBLICATION ... REFRESH PUBLICATION
>    in a loop (now that I think about it, could be another pgbench
>    script, but well ...)
>
> 5) some consistency checks, but the lockup happens earlier so this does
>    not really matter
>
> After a small number of refresh cycles (for me it's usually a couple
> dozen), we end up with a couple stuck locks (I shortened the backend
> type string a bit, for formatting reasons):
>
>   test=# select a.pid, classid, objid, backend_type, query
>            from pg_locks l join pg_stat_activity a on (a.pid = l.pid)
>           where not granted;
>
>      pid   | classid | objid | backend_type     |         query
>   ---------+---------+-------+------------------+----------------------
>    2691941 |    6100 | 16785 | client backend   | ALTER SUBSCRIPTION s
>                                                   REFRESH PUBLICATION
>    2691837 |    6100 | 16785 | tablesync worker |
>    2691936 |    6100 | 16785 | tablesync worker |
>   (3 rows)
>
> All these backends wait for 6100/16785, which is the subscription row in
> pg_subscription. The tablesync workers are requesting AccessShareLock,
> the client backend however asks for AccessExclusiveLock.
>
> The entry is currently locked by:
>
>   test=# select a.pid, mode, backend_type from pg_locks l
>            join pg_stat_activity a on (a.pid = l.pid)
>           where classid=6100 and objid=16785 and granted;
>
>      pid   |      mode       |           backend_type
>   ---------+-----------------+----------------------------------
>    2690477 | AccessShareLock | logical replication apply worker
>   (1 row)
>
> But the apply worker is not waiting for any locks, so what's going on?
>
> Well, the problem is the apply worker is waiting for notification from
> the tablesync workers the relation is synced, which happens through
> updating the pg_subscription_rel row. And that wait happens in
> wait_for_relation_state_change, which simply checks the row in a loop,
> with a sleep by WaitLatch().
>
> Unfortunately, the tablesync workers can't update the row because the
> client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION
> sneaked in, and waits for an AccessExclusiveLock. So the tablesync
> workers are stuck in the queue and can't proceed.
>
> The client backend can't proceed, because it's waiting for a lock held
> by the apply worker.
>

It seems there is some inconsistency in what you have written for
client backends/tablesync worker vs. apply worker. The above text
seems to be saying that the client backend and table sync worker are
waiting on a "subscription row in pg_subscription" and the apply
worker is operating on "pg_subscription_rel". So, if that is true then
they shouldn't get stuck.

I think the client backend and tablesync worker here seem to be blocked
on a lock on pg_subscription_rel.

> The tablesync workers can't proceed because their lock request is stuck
> behind the AccessExclusiveLock request.
>
> And the apply worker can't proceed, because it's waiting for status
> update from the tablesync workers.
>

This part is not clear to me because
wait_for_relation_state_change()->GetSubscriptionRelState() seems to
be releasing the lock while closing the relation. Am I missing
something?

--
With Regards,
Amit Kapila.



Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Tomas Vondra
Date:

On 11/21/23 14:16, Amit Kapila wrote:
> On Tue, Nov 21, 2023 at 5:17 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> I decided to do some stress-testing of the built-in logical replication,
>> as part of the sequence decoding work. And I soon ran into an undetected
>> deadlock related to ALTER SUBSCRIPTION ... REFRESH PUBLICATION :-(
>>
>> The attached bash scripts triggers that in a couple seconds for me. The
>> script looks complicated, but most of the code is waiting for sync to
>> complete, catchup, and that sort of thing.
>>
>> What the script does is pretty simple:
>>
>> 1) initialize two clusters, set them as publisher/subscriber pair
>>
>> 2) create some number of tables, add them to publication and wait for
>>    the sync to complete
>>
>> 3) start two pgbench runs in the background, modifying the publication
>>    (one removes+adds all tables in a single transaction, one does that
>>     with transaction per table)
>>
>> 4) run refresh.sh which does ALTER PUBLICATION ... REFRESH PUBLICATION
>>    in a loop (now that I think about it, could be another pgbench
>>    script, but well ...)
>>
>> 5) some consistency checks, but the lockup happens earlier so this does
>>    not really matter
>>
>> After a small number of refresh cycles (for me it's usually a couple
>> dozen), we end up with a couple stuck locks (I shortened the backend
>> type string a bit, for formatting reasons):
>>
>>   test=# select a.pid, classid, objid, backend_type, query
>>            from pg_locks l join pg_stat_activity a on (a.pid = l.pid)
>>           where not granted;
>>
>>      pid   | classid | objid | backend_type     |         query
>>   ---------+---------+-------+------------------+----------------------
>>    2691941 |    6100 | 16785 | client backend   | ALTER SUBSCRIPTION s
>>                                                   REFRESH PUBLICATION
>>    2691837 |    6100 | 16785 | tablesync worker |
>>    2691936 |    6100 | 16785 | tablesync worker |
>>   (3 rows)
>>
>> All these backends wait for 6100/16785, which is the subscription row in
>> pg_subscription. The tablesync workers are requesting AccessShareLock,
>> the client backend however asks for AccessExclusiveLock.
>>
>> The entry is currently locked by:
>>
>>   test=# select a.pid, mode, backend_type from pg_locks l
>>            join pg_stat_activity a on (a.pid = l.pid)
>>           where classid=6100 and objid=16785 and granted;
>>
>>      pid   |      mode       |           backend_type
>>   ---------+-----------------+----------------------------------
>>    2690477 | AccessShareLock | logical replication apply worker
>>   (1 row)
>>
>> But the apply worker is not waiting for any locks, so what's going on?
>>
>> Well, the problem is the apply worker is waiting for notification from
>> the tablesync workers the relation is synced, which happens through
>> updating the pg_subscription_rel row. And that wait happens in
>> wait_for_relation_state_change, which simply checks the row in a loop,
>> with a sleep by WaitLatch().
>>
>> Unfortunately, the tablesync workers can't update the row because the
>> client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION
>> sneaked in, and waits for an AccessExclusiveLock. So the tablesync
>> workers are stuck in the queue and can't proceed.
>>
>> The client backend can't proceed, because it's waiting for a lock held
>> by the apply worker.
>>
> 
> It seems there is some inconsistency in what you have written for
> client backends/tablesync worker vs. apply worker. The above text
> seems to be saying that the client backend and table sync worker are
> waiting on a "subscription row in pg_subscription" and the apply
> worker is operating on "pg_subscription_rel". So, if that is true then
> they shouldn't get stuck.
> 
> I think here client backend and tablesync worker seems to be blocked
> for a lock on pg_subscription_rel.
> 

Not really, they are all locking the subscription. All the locks are on
classid=6100, which is pg_subscription:

  test=# select 6100::regclass;
      regclass
  -----------------
   pg_subscription
  (1 row)

The thing is, the tablesync workers call UpdateSubscriptionRelState,
which locks the subscription object (in pg_subscription) right at the
beginning:

   LockSharedObject(SubscriptionRelationId, ...);

So that's the issue. I haven't explored why it's done this way, and
there's no comment explaining why locking the subscription is needed ...
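
Roughly, the shape of that function is (a simplified sketch, leaving out
the actual catalog-update details):

  void
  UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
                             XLogRecPtr sublsn)
  {
      Relation    rel;

      /* lock the subscription object -- this is what everyone queues on */
      LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);

      rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);

      /* ... look up and update the pg_subscription_rel row ... */

      table_close(rel, NoLock);
  }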

>> The tablesync workers can't proceed because their lock request is stuck
>> behind the AccessExclusiveLock request.
>>
>> And the apply worker can't proceed, because it's waiting for status
>> update from the tablesync workers.
>>
> 
> This part is not clear to me because
> wait_for_relation_state_change()->GetSubscriptionRelState() seems to
> be releasing the lock while closing the relation. Am, I missing
> something?
> 

I think you're missing the fact that GetSubscriptionRelState() acquires
and releases the lock on pg_subscription_rel, but that's not the lock
causing the issue. The problem is the lock on the pg_subscription row.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Amit Kapila
Date:
On Tue, Nov 21, 2023 at 6:56 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 11/21/23 14:16, Amit Kapila wrote:
> > On Tue, Nov 21, 2023 at 5:17 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >
> > It seems there is some inconsistency in what you have written for
> > client backends/tablesync worker vs. apply worker. The above text
> > seems to be saying that the client backend and table sync worker are
> > waiting on a "subscription row in pg_subscription" and the apply
> > worker is operating on "pg_subscription_rel". So, if that is true then
> > they shouldn't get stuck.
> >
> > I think here client backend and tablesync worker seems to be blocked
> > for a lock on pg_subscription_rel.
> >
>
> Not really, they are all locking the subscription. All the locks are on
> classid=6100, which is pg_subscription:
>
>   test=# select 6100::regclass;
>       regclass
>   -----------------
>    pg_subscription
>   (1 row)
>
> The thing is, the tablesync workers call UpdateSubscriptionRelState,
> which locks the pg_subscription catalog at the very beginning:
>
>    LockSharedObject(SubscriptionRelationId, ...);
>
> So that's the issue. I haven't explored why it's done this way, and
> there's no comment explaining locking the subscriptions is needed ...
>

I think it prevents concurrent drop of rel during the REFRESH operation.

> >> The tablesync workers can't proceed because their lock request is stuck
> >> behind the AccessExclusiveLock request.
> >>
> >> And the apply worker can't proceed, because it's waiting for status
> >> update from the tablesync workers.
> >>
> >
> > This part is not clear to me because
> > wait_for_relation_state_change()->GetSubscriptionRelState() seems to
> > be releasing the lock while closing the relation. Am, I missing
> > something?
> >
>
> I think you're missing the fact that GetSubscriptionRelState() acquires
> and releases the lock on pg_subscription_rel, but that's not the lock
> causing the issue. The problem is the lock on the pg_subscription row.
>

Okay. IIUC, what's going on here is that the apply worker acquires
AccessShareLock on pg_subscription to update the rel state for one of
the tables, say tbl-1, and then, for another table, say tbl-2, it
starts waiting for a state change via wait_for_relation_state_change().
I think the fix here is to commit the transaction before we go into the
wait. I guess we need something along the lines of what is proposed in
[1], though we solved the problem in that thread in some other way.

[1] - https://www.postgresql.org/message-id/1412708.1674417574%40sss.pgh.pa.us
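
To illustrate the idea, something along these lines in
process_syncing_tables_for_apply(), just before the wait (a rough
sketch, not an actual patch):

  /*
   * Commit the open transaction, releasing the AccessShareLock on the
   * pg_subscription row, so the tablesync worker's lock request isn't
   * stuck behind a queued AccessExclusiveLock while we sleep.
   */
  if (started_tx)
  {
      CommitTransactionCommand();
      started_tx = false;
  }

  /* need a transaction for the catalog reads inside the wait */
  StartTransactionCommand();
  started_tx = true;

  wait_for_relation_state_change(rstate->relid, SUBREL_STATE_SYNCDONE);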

--
With Regards,
Amit Kapila.



Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Tomas Vondra
Date:
On 11/22/23 11:38, Amit Kapila wrote:
> On Tue, Nov 21, 2023 at 6:56 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 11/21/23 14:16, Amit Kapila wrote:
>>> On Tue, Nov 21, 2023 at 5:17 PM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>
>>> It seems there is some inconsistency in what you have written for
>>> client backends/tablesync worker vs. apply worker. The above text
>>> seems to be saying that the client backend and table sync worker are
>>> waiting on a "subscription row in pg_subscription" and the apply
>>> worker is operating on "pg_subscription_rel". So, if that is true then
>>> they shouldn't get stuck.
>>>
>>> I think here client backend and tablesync worker seems to be blocked
>>> for a lock on pg_subscription_rel.
>>>
>>
>> Not really, they are all locking the subscription. All the locks are on
>> classid=6100, which is pg_subscription:
>>
>>   test=# select 6100::regclass;
>>       regclass
>>   -----------------
>>    pg_subscription
>>   (1 row)
>>
>> The thing is, the tablesync workers call UpdateSubscriptionRelState,
>> which locks the pg_subscription catalog at the very beginning:
>>
>>    LockSharedObject(SubscriptionRelationId, ...);
>>
>> So that's the issue. I haven't explored why it's done this way, and
>> there's no comment explaining locking the subscriptions is needed ...
>>
> 
> I think it prevents concurrent drop of rel during the REFRESH operation.
> 

Yes. Or maybe some other concurrent DDL on the relations included in the
subscription.

>>>> The tablesync workers can't proceed because their lock request is stuck
>>>> behind the AccessExclusiveLock request.
>>>>
>>>> And the apply worker can't proceed, because it's waiting for status
>>>> update from the tablesync workers.
>>>>
>>>
>>> This part is not clear to me because
>>> wait_for_relation_state_change()->GetSubscriptionRelState() seems to
>>> be releasing the lock while closing the relation. Am, I missing
>>> something?
>>>
>>
>> I think you're missing the fact that GetSubscriptionRelState() acquires
>> and releases the lock on pg_subscription_rel, but that's not the lock
>> causing the issue. The problem is the lock on the pg_subscription row.
>>
> 
> Okay. IIUC, what's going on here is that the apply worker acquires
> AccessShareLock on pg_subscription to update rel state for one of the
> tables say tbl-1, and then for another table say tbl-2, it started
> waiting for a state change via wait_for_relation_state_change(). I
> think here the fix is to commit the transaction before we go for a
> wait. I guess we need something along the lines of what is proposed in
> [1] though we have solved the problem in that thread in some other
> way..
> 

Possibly. I haven't checked if the commit might have some unexpected
consequences, but I can confirm I can no longer reproduce the deadlock
with the patch applied.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Amit Kapila
Date:
On Wed, Nov 22, 2023 at 4:51 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 11/22/23 11:38, Amit Kapila wrote:
> >
> > Okay. IIUC, what's going on here is that the apply worker acquires
> > AccessShareLock on pg_subscription to update rel state for one of the
> > tables say tbl-1, and then for another table say tbl-2, it started
> > waiting for a state change via wait_for_relation_state_change(). I
> > think here the fix is to commit the transaction before we go for a
> > wait. I guess we need something along the lines of what is proposed in
> > [1] though we have solved the problem in that thread in some other
> > way..
> >
>
> Possibly. I haven't checked if the commit might have some unexpected
> consequences, but I can confirm I can no longer reproduce the deadlock
> with the patch applied.
>

Thanks for the verification. Offhand, I don't see any problem with
doing a commit at that place but will try to think some more about it.
I think we may want to call pgstat_report_stat(false) after commit to
avoid a long delay in stats.
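
IOW, roughly (just a sketch):

  if (started_tx)
  {
      CommitTransactionCommand();
      pgstat_report_stat(false);    /* flush pending stats right away */
      started_tx = false;
  }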

I haven't verified but I think this will be a problem in back-branches as well.

--
With Regards,
Amit Kapila.



Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Tomas Vondra
Date:
On 11/23/23 10:24, Amit Kapila wrote:
> On Wed, Nov 22, 2023 at 4:51 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 11/22/23 11:38, Amit Kapila wrote:
>>>
>>> Okay. IIUC, what's going on here is that the apply worker acquires
>>> AccessShareLock on pg_subscription to update rel state for one of the
>>> tables say tbl-1, and then for another table say tbl-2, it started
>>> waiting for a state change via wait_for_relation_state_change(). I
>>> think here the fix is to commit the transaction before we go for a
>>> wait. I guess we need something along the lines of what is proposed in
>>> [1] though we have solved the problem in that thread in some other
>>> way..
>>>
>>
>> Possibly. I haven't checked if the commit might have some unexpected
>> consequences, but I can confirm I can no longer reproduce the deadlock
>> with the patch applied.
>>
> 
> Thanks for the verification. Offhand, I don't see any problem with
> doing a commit at that place but will try to think some more about it.
> I think we may want to call pgstat_report_stat(false) after commit to
> avoid a long delay in stats.
> 

Makes sense.

> I haven't verified but I think this will be a problem in
> back-branches as well.
> 

Yes. I haven't tried but I don't see why backbranches wouldn't have the
same issue.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Shlok Kyal
Date:
Hi,

I tried to reproduce the issue and was able to reproduce it with the
scripts shared by Tomas.
I tested it on each version from PG 17 down to PG 11, and the issue is
reproducible on all of them.

Next I will try to test with the patch in the thread shared by Amit.

Thanks,
Shlok Kumar Kyal



Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Shlok Kyal
Date:
Hi,

> I tried to reproduce the issue and was able to reproduce it with
> scripts shared by Tomas.
> I tried testing it from PG17 to PG 11. This issue is reproducible for
> each version.
>
> Next I would try to test with the patch in the thread shared by Amit.

I have created a v1 patch to resolve the issue and have tested it on
HEAD down to PG 12.
The same patch applies to all the versions. The changes are similar to
the ones posted in the thread
https://www.postgresql.org/message-id/1412708.1674417574%40sss.pgh.pa.us

Thanks and Regards,
Shlok Kyal

Attachment

Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Amit Kapila
Date:
On Fri, Nov 24, 2023 at 5:05 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
>
> > I tried to reproduce the issue and was able to reproduce it with
> > scripts shared by Tomas.
> > I tried testing it from PG17 to PG 11. This issue is reproducible for
> > each version.
> >
> > Next I would try to test with the patch in the thread shared by Amit.
>
> I have created the v1 patch to resolve the issue. Have tested the
> patch on HEAD to PG12.
> The same patch applies to all the versions. The changes are similar to
> the one posted in the thread
> https://www.postgresql.org/message-id/1412708.1674417574%40sss.pgh.pa.us
>

(it's quite likely we hold lock on
+ * pg_replication_origin, which the sync worker will need
+ * to update).

This part of the comment is stale and doesn't hold true. You need to
update the reason based on the latest problem discovered in this
thread. I think you can compare the timing of the regression tests in
subscription, with and without the patch, to show there is no
regression. And probably run some tests with a large number of tables
to sync, with very little data.

--
With Regards,
Amit Kapila.



Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Shlok Kyal
Date:
Hi,

> thread. I think you can compare the timing of regression tests in
> subscription, with and without the patch to show there is no
> regression. And probably some tests with a large number of tables for
> sync with very little data.

I have tested the regression test timings for subscription with and
without the patch. I also timed the sync of a subscription with the
publisher for 100 and 1000 tables respectively.
I have attached the test script; the results of the timing tests are as
follows:

Time taken for test to run in Linux VM

  Summary               | Subscription | 100 tables in pub | 1000 tables in pub
                        | Test (sec)   | and sub (sec)     | and sub (sec)
  ----------------------+--------------+-------------------+--------------------
  Without patch Release | 95.564       | 7.877             | 58.919
  With patch Release    | 96.513       | 6.533             | 45.807

Time taken for test to run in another Linux VM

  Summary               | Subscription | 100 tables in pub | 1000 tables in pub
                        | Test (sec)   | and sub (sec)     | and sub (sec)
  ----------------------+--------------+-------------------+--------------------
  Without patch Release | 109.8145     | 6.4675            | 83.001
  With patch Release    | 113.162      | 7.947             | 87.113

Time taken for test to run in Performance Machine Linux

  Summary               | Subscription | 100 tables in pub | 1000 tables in pub
                        | Test (sec)   | and sub (sec)     | and sub (sec)
  ----------------------+--------------+-------------------+--------------------
  Without patch Release | 115.871      | 6.656             | 81.157
  With patch Release    | 115.922      | 6.7305            | 81.1525

thoughts?

Thanks and Regards,
Shlok Kyal

Attachment

Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Amit Kapila
Date:
On Sat, Dec 2, 2023 at 9:52 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
>
> > thread. I think you can compare the timing of regression tests in
> > subscription, with and without the patch to show there is no
> > regression. And probably some tests with a large number of tables for
> > sync with very little data.
>
> I have tested the regression test timings for subscription with and
> without patch. I also did the timing test for sync of subscription
> with the publisher for 100 and 1000 tables respectively.
> I have attached the test script and results of the timing test are as follows:
>
> Time taken for test to run in Linux VM
>
>   Summary               | Subscription | 100 tables in pub | 1000 tables in pub
>                         | Test (sec)   | and sub (sec)     | and sub (sec)
>   ----------------------+--------------+-------------------+--------------------
>   Without patch Release | 95.564       | 7.877             | 58.919
>   With patch Release    | 96.513       | 6.533             | 45.807
>
> Time taken for test to run in another Linux VM
>
>   Summary               | Subscription | 100 tables in pub | 1000 tables in pub
>                         | Test (sec)   | and sub (sec)     | and sub (sec)
>   ----------------------+--------------+-------------------+--------------------
>   Without patch Release | 109.8145     | 6.4675            | 83.001
>   With patch Release    | 113.162      | 7.947             | 87.113
>

So, on some machines, it may increase the test timing, although not by
much. I think the reason is probably that the work is now done in
multiple transactions for multiple relations. I am wondering: instead
of committing and starting a new transaction before
wait_for_relation_state_change(), what if we do it inside that
function, just before we decide to wait? It is quite possible that in
many cases we don't need any wait at all.

--
With Regards,
Amit Kapila.



Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Tomas Vondra
Date:

On 12/4/23 12:37, Amit Kapila wrote:
> On Sat, Dec 2, 2023 at 9:52 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
>>
>>> thread. I think you can compare the timing of regression tests in
>>> subscription, with and without the patch to show there is no
>>> regression. And probably some tests with a large number of tables for
>>> sync with very little data.
>>
>> I have tested the regression test timings for subscription with and
>> without patch. I also did the timing test for sync of subscription
>> with the publisher for 100 and 1000 tables respectively.
>> I have attached the test script and results of the timing test are as follows:
>>
>> Time taken for test to run in Linux VM
>>
>>   Summary               | Subscription | 100 tables in pub | 1000 tables in pub
>>                         | Test (sec)   | and sub (sec)     | and sub (sec)
>>   ----------------------+--------------+-------------------+--------------------
>>   Without patch Release | 95.564       | 7.877             | 58.919
>>   With patch Release    | 96.513       | 6.533             | 45.807
>>
>> Time taken for test to run in another Linux VM
>>
>>   Summary               | Subscription | 100 tables in pub | 1000 tables in pub
>>                         | Test (sec)   | and sub (sec)     | and sub (sec)
>>   ----------------------+--------------+-------------------+--------------------
>>   Without patch Release | 109.8145     | 6.4675            | 83.001
>>   With patch Release    | 113.162      | 7.947             | 87.113
>>
> 
> So, on some machines, it may increase the test timing although not too
> much. I think the reason is probably doing the work in multiple
> transactions for multiple relations. I am wondering that instead of
> committing and starting a new transaction before
> wait_for_relation_state_change(), what if we do it inside that
> function just before we decide to wait? It is quite possible that in
> many cases we don't need any wait at all.
> 

I'm not sure what you mean by "do it". What should the function do?

As for the test results, I strongly suspect the differences are caused
simply by random timing variations, or something like that. And I
don't understand what "Performance Machine Linux" is, considering those
timings are slower than on the other two machines.

Also, even if it was a bit slower, does it really matter? I mean, the
current code is wrong and can get stuck indefinitely if it happens to
hit the deadlock. And it's a one-time action; I don't think it's very
sensitive in terms of performance.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Amit Kapila
Date:
On Mon, Dec 4, 2023 at 5:30 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 12/4/23 12:37, Amit Kapila wrote:
> > On Sat, Dec 2, 2023 at 9:52 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
> >>
> >>> thread. I think you can compare the timing of regression tests in
> >>> subscription, with and without the patch to show there is no
> >>> regression. And probably some tests with a large number of tables for
> >>> sync with very little data.
> >>
> >> I have tested the regression test timings for subscription with and
> >> without patch. I also did the timing test for sync of subscription
> >> with the publisher for 100 and 1000 tables respectively.
> >> I have attached the test script and results of the timing test are as follows:
> >>
> >> Time taken for test to run in Linux VM
> >>
> >>   Summary               | Subscription | 100 tables in pub | 1000 tables in pub
> >>                         | Test (sec)   | and sub (sec)     | and sub (sec)
> >>   ----------------------+--------------+-------------------+--------------------
> >>   Without patch Release | 95.564       | 7.877             | 58.919
> >>   With patch Release    | 96.513       | 6.533             | 45.807
> >>
> >> Time taken for test to run in another Linux VM
> >>
> >>   Summary               | Subscription | 100 tables in pub | 1000 tables in pub
> >>                         | Test (sec)   | and sub (sec)     | and sub (sec)
> >>   ----------------------+--------------+-------------------+--------------------
> >>   Without patch Release | 109.8145     | 6.4675            | 83.001
> >>   With patch Release    | 113.162      | 7.947             | 87.113
> >>
> >
> > So, on some machines, it may increase the test timing although not too
> > much. I think the reason is probably doing the work in multiple
> > transactions for multiple relations. I am wondering that instead of
> > committing and starting a new transaction before
> > wait_for_relation_state_change(), what if we do it inside that
> > function just before we decide to wait? It is quite possible that in
> > many cases we don't need any wait at all.
> >
>
> I'm not sure what you mean by "do it". What should the function do?
>

I mean to commit the open transaction at the below place in
wait_for_relation_state_change()

wait_for_relation_state_change()
{
...
-- commit the xact
WaitLatch();
...
}

Then start a new one after the wait is over. This is just to test
whether it improves the difference in regression test timing.
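
IOW, something like this (again just a sketch; the started_tx
bookkeeping in the caller would need to adapt):

  /* inside wait_for_relation_state_change(), just before sleeping */
  if (IsTransactionState())
  {
      CommitTransactionCommand();
      pgstat_report_stat(false);
  }

  (void) WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                   1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
  ResetLatch(MyLatch);

  /* open a fresh transaction before re-reading the catalog */
  StartTransactionCommand();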

> As for the test results, I very much doubt the differences are not
> caused simply by random timing variations, or something like that. And I
> don't understand what "Performance Machine Linux" is, considering those
> timings are slower than the other two machines.
>
> Also, even if it was a bit slower, does it really matter? I mean, the
> current code is wrong, can lead to infinite duration if it happens to
> hit the deadlock. And it's a one-time action, I don't think it's a very
> sensitive in terms of performance.
>

Yeah, I see that point, but I'm trying to evaluate if we can avoid an
increase in regression test timing, at least for HEAD. The tests are
done in release mode, so I suspect there could be a slightly bigger gap
in debug mode (we can check that though), which may hit developers who
run the regressions quite often in their development environments. Now,
if there is no easy way to avoid the increase in regression test
timing, we still have to fix the problem, so we have to take the hit of
some increase in time.

--
With Regards,
Amit Kapila.



Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Shlok Kyal
Date:
Hi,

> As for the test results, I very much doubt the differences are not
> caused simply by random timing variations, or something like that. And I
> don't understand what "Performance Machine Linux" is, considering those
> timings are slower than the other two machines.

The machine has a total memory of 755.536 GB, 120 CPUs, and runs the
RHEL 7 operating system.
Please find the detailed info about the performance machine attached.

Thanks and Regards,
Shlok Kyal

Attachment

Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Tomas Vondra
Date:
On 12/5/23 08:14, Shlok Kyal wrote:
> Hi,
> 
>> As for the test results, I very much doubt the differences are not
>> caused simply by random timing variations, or something like that. And I
>> don't understand what "Performance Machine Linux" is, considering those
>> timings are slower than the other two machines.
> 
> The machine has Total Memory of 755.536 GB, 120 CPUs and RHEL 7 Operating System
> Also find the detailed info of the performance machine attached.
> 

Thanks for the info. I don't think the tests really benefit from this
many resources; I would be rather surprised if they got faster beyond 8
cores or so. The CPU frequency likely matters much more, which probably
explains why this machine was the slowest.

Also, I wonder how much the results vary between the runs. I suppose you
only did a single run for each, right?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Shlok Kyal
Date:
On Tue, 5 Dec 2023 at 17:18, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 12/5/23 08:14, Shlok Kyal wrote:
> > Hi,
> >
> >> As for the test results, I very much doubt the differences are not
> >> caused simply by random timing variations, or something like that. And I
> >> don't understand what "Performance Machine Linux" is, considering those
> >> timings are slower than the other two machines.
> >
> > The machine has Total Memory of 755.536 GB, 120 CPUs and RHEL 7 Operating System
> > Also find the detailed info of the performance machine attached.
> >
>
> Thanks for the info. I don't think the tests really benefit from this
> much resources, I would be rather surprised if it was faster beyond 8
> cores or so. The CPU frequency likely matters much more. Which probably
> explains why this machine was the slowest.
>
> Also, I wonder how much the results vary between the runs. I suppose you
> only did s single run for each, right?

I did 10 runs for each of the cases and reported the median result in
my previous email.
I have documented the results of the runs and have attached them here.

Thanks and Regards,
Shlok Kyal

Attachment

Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Shlok Kyal
Date:
Hi,

> I mean to commit the open transaction at the below place in
> wait_for_relation_state_change()
>
> wait_for_relation_state_change()
> {
> ...
> -- commit the xact
> WaitLatch();
> ...
> }
>
> Then start after the wait is over. This is just to test whether it
> improves the difference in regression test timing.

I tried the above approach and observed that its performance is nearly
the same as that of the previous approach.

For Linux VM:

  Summary            | Subscription | 100 tables in pub | 1000 tables in pub
                     | Test (sec)   | and sub (sec)     | and sub (sec)
  -------------------+--------------+-------------------+--------------------
  old patch          | 107.4545     | 6.911             | 77.918
  alternate approach | 108.3985     | 6.9835            | 78.111

For Performance Machine:

  Summary            | Subscription | 100 tables in pub | 1000 tables in pub
                     | Test (sec)   | and sub (sec)     | and sub (sec)
  -------------------+--------------+-------------------+--------------------
  old patch          | 115.922      | 6.7305            | 81.1525
  alternate approach | 115.8215     | 6.7685            | 81.2335

I have attached the patch for this approach as 'alternate_approach.patch'.

Since the performance is the same, I think the previous approach is
better, as this alternate approach calls CommitTransactionCommand()
and StartTransactionCommand() inside a for loop.

I also fixed the comment in the previous approach and attached it here
as 'v2-0001-Deadlock-when-apply-worker-tablesync-worker-and-c.patch'.

Thanks and Regards


Shlok Kyal

Attachment

Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Amit Kapila
Date:
On Thu, Dec 7, 2023 at 11:21 AM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
>
> > I mean to commit the open transaction at the below place in
> > wait_for_relation_state_change()
> >
> > wait_for_relation_state_change()
> > {
> > ...
> > -- commit the xact
> > WaitLatch();
> > ...
> > }
> >
> > Then start after the wait is over. This is just to test whether it
> > improves the difference in regression test timing.
>
> I tried the above approach and observed that the performance of this
> approach is nearly same as the previous approach.
>

Then let's go with the original patch only. BTW, it took almost the
same time (105 wallclock secs) in my environment (CentOS VM) to run
tests in src/test/subscription both with and without the patch. I took
a median of five runs. I have slightly adjusted the comments and
commit message in the attached. If you are fine with this, we can
commit and backpatch this.

--
With Regards,
Amit Kapila.

Attachment

Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Shlok Kyal
Date:
Hi,

> Then let's go with the original patch only. BTW, it took almost the
> same time (105 wallclock secs) in my environment (CentOs VM) to run
> tests in src/test/subscription both with and without the patch. I took
> a median of five runs. I have slightly adjusted the comments and
> commit message in the attached. If you are fine with this, we can
> commit and backpatch this.

I have tested the patch on all the branches from PG 17 down to PG 12.
The same patch applies cleanly and resolves the issue on all of them.
I ran all the tests and they passed on each branch.

I also reviewed the patch and it looks good to me.

Thanks and Regards,
Shlok Kyal



Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Amit Kapila
Date:
On Fri, Dec 8, 2023 at 7:16 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
>
> > Then let's go with the original patch only. BTW, it took almost the
> > same time (105 wallclock secs) in my environment (CentOs VM) to run
> > tests in src/test/subscription both with and without the patch. I took
> > a median of five runs. I have slightly adjusted the comments and
> > commit message in the attached. If you are fine with this, we can
> > commit and backpatch this.
>
> I have tested the patch for all the branches from PG 17 to PG 12.
> The same patch applies cleanly on all branches. Also, the same patch
> resolves the issue on all the branches.
> I ran all the tests and all the tests passed on each branch.
>
> I also reviewed the patch and it looks good to me.
>

Thanks, I could also reproduce the issue on back branches (tried till
12), and the fix works. I'll push this on Monday.

--
With Regards,
Amit Kapila.



Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Amit Kapila
Date:
On Sat, Dec 9, 2023 at 12:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Thanks, I could also reproduce the issue on back branches (tried till
> 12), and the fix works. I'll push this on Monday.
>

Peter sent one minor suggestion off-list (to write the check
differently for easier understanding), which I addressed, and then
pushed the patch.

--
With Regards,
Amit Kapila.



Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From
Tomas Vondra
Date:
On 12/11/23 07:12, Amit Kapila wrote:
> On Sat, Dec 9, 2023 at 12:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Thanks, I could also reproduce the issue on back branches (tried till
>> 12), and the fix works. I'll push this on Monday.
>>
> 
> Peter sent one minor suggestion (to write the check differently for
> easier understanding) offlist which I addressed and pushed the patch.
> 

Thanks for taking care of fixing this!


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company