Thread: CF Bot failure in wait_for_subscription_sync()

CF Bot failure in wait_for_subscription_sync()

From
Bharath Rupireddy
Date:
Hi,

I have seen 2 patches registered in CF failing on Linux - Debian
Bullseye in wait_for_subscription_sync(). It seems like the tables
aren't being synced. I have not done any further analysis. I'm not
sure if this issue is being discussed elsewhere.

# Postmaster PID for node "twoways" is 50208
Waiting for all subscriptions in "twoways" to synchronize data
[14:12:43.092](198.391s) # poll_query_until timed out executing this query:
# SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN
('r', 's');
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
timed out waiting for subscriber to synchronize data at t/100_bugs.pl line 147.


https://api.cirrus-ci.com/v1/artifact/task/6618623857917952/log/src/test/subscription/tmp_check/log/regress_log_100_bugs
https://cirrus-ci.com/task/6618623857917952
https://cirrus-ci.com/task/5764058174455808

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



RE: CF Bot failure in wait_for_subscription_sync()

From
"houzj.fnst@fujitsu.com"
Date:
On Tuesday, October 18, 2022 2:16 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> 
> Hi,
> 
> I have seen 2 patches registered in CF failing on Linux - Debian Bullseye in
> wait_for_subscription_sync(). It seems like the tables aren't being synced. I
> have not done any further analysis. I'm not sure if this issue is being discussed
> elsewhere.
> 
> # Postmaster PID for node "twoways" is 50208 Waiting for all subscriptions in
> "twoways" to synchronize data
> [14:12:43.092](198.391s) # poll_query_until timed out executing this query:
> # SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r',
> 's'); # expecting this output:
> # t
> # last actual query output:
> # f
> # with stderr:
> timed out waiting for subscriber to synchronize data at t/100_bugs.pl line 147.
> 
> https://api.cirrus-ci.com/v1/artifact/task/6618623857917952/log/src/test/sub
> scription/tmp_check/log/regress_log_100_bugs
> https://cirrus-ci.com/task/6618623857917952
> https://cirrus-ci.com/task/5764058174455808

Thanks for reporting this. I am not sure about the root cause but just share
some initial analysis here.

This testcase waits for table sync to finish for both table "t" and table "t2".
But from the log, I can only see the log[1] related to the table sync of table
"t". So it seems that the table sync worker for table "t2" was never started
due to some reason. I tried it locally but have not reproduced this yet.

[1]---
2022-10-17 10:16:37.216 UTC [48051][logical replication worker] LOG:  logical replication table synchronization worker
forsubscription "testsub", table "t" has finished
 
---

Best regards,
Hou zj

Re: CF Bot failure in wait_for_subscription_sync()

From
Amit Kapila
Date:
On Tue, Oct 18, 2022 at 2:57 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, October 18, 2022 2:16 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > Hi,
> >
> > I have seen 2 patches registered in CF failing on Linux - Debian Bullseye in
> > wait_for_subscription_sync(). It seems like the tables aren't being synced. I
> > have not done any further analysis. I'm not sure if this issue is being discussed
> > elsewhere.
> >
> > # Postmaster PID for node "twoways" is 50208 Waiting for all subscriptions in
> > "twoways" to synchronize data
> > [14:12:43.092](198.391s) # poll_query_until timed out executing this query:
> > # SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r',
> > 's'); # expecting this output:
> > # t
> > # last actual query output:
> > # f
> > # with stderr:
> > timed out waiting for subscriber to synchronize data at t/100_bugs.pl line 147.
> >
> > https://api.cirrus-ci.com/v1/artifact/task/6618623857917952/log/src/test/sub
> > scription/tmp_check/log/regress_log_100_bugs
> > https://cirrus-ci.com/task/6618623857917952
> > https://cirrus-ci.com/task/5764058174455808
>
> Thanks for reporting this. I am not sure about the root cause but just share
> some initial analysis here.
>
> This testcase waits for table sync to finish for both table "t" and table "t2".
> But from the log, I can only see the log[1] related to the table sync of table
> "t". So it seems that the table sync worker for table "t2" was never started
> due to some reason.
>

Yeah, the reason is not clear to me either. Let me state my
understanding of the issue. IIUC, the following test in 100_bugs.pl
has failed:
$node_twoways->safe_psql('d1', 'ALTER PUBLICATION testpub ADD TABLE t2');
$node_twoways->safe_psql('d2',
'ALTER SUBSCRIPTION testsub REFRESH PUBLICATION');
...
...
$node_twoways->wait_for_subscription_sync($node_twoways, 'testsub', 'd2');

After the REFRESH operation, the new table should have been added to
pg_subscription_rel. This is also visible from LOGS because one of
tables has synced and for table another worker is not started. Now,
ideally, after the new entry is created in pg_subscription_rel, the
apply worker should have got an invalidation message and refreshed the
'table_states_not_ready' list which should have let it start the new
table sync worker for table 't2' but that is not happening here.

-- 
With Regards,
Amit Kapila.