Thread: Tables getting stuck at 's' state during logical replication

Tables getting stuck at 's' state during logical replication

From
Padmavathi G
Date:
Hello,
I am trying out logical replication for upgrading postgres instance from version 11 to 15.x. In the process, I noticed that some tables get stuck in the 's' state during logical replication and they do not move to the 'r' state. I tried to drop the subscription and create a new subscriber, but the same thing repeated. Also, each time different tables get stuck, it is not like the same tables get stuck every time. This also happens to tables that do not have frequent writes to them. Then I tried to drop the tables which got stuck from the original publication and created a new publication and new subscribers just with the tables which got stuck in the previous subscription. Now, in this new subscription, some tables go to 'r' state (even though they did not in the older subscription) and some tables remain in 's' state. If I continue doing this process for 5-6 times I am able to cover all of the tables. Since these tables get stuck at 's' state, the replication origins are not deleted automatically when I delete the table from the publication and I had to manually delete the replication origin. The logs contain "table synchronization worker for <table_name> has finished even for table in 's' state. Can somebody please help with this issue as I do not want to do manual intervention of creating a new publication and subscription for tables that get stuck at 's'.

Configuration
max_replication_slots : 75 in publisher and subscriber
max_worker_processes: 60 (in subscriber)
max_logical_replication_workers: 55 (in subscriber)

Thanks, 
Padmavathi

Re: Tables getting stuck at 's' state during logical replication

From
Amit Kapila
Date:
On Fri, May 5, 2023 at 3:04 PM Padmavathi G <padma9.9.1999@gmail.com> wrote:
>
> Hello,
> I am trying out logical replication for upgrading postgres instance from version 11 to 15.x. In the process, I
noticedthat some tables get stuck in the 's' state during logical replication and they do not move to the 'r' state. I
triedto drop the subscription and create a new subscriber, but the same thing repeated. Also, each time different
tablesget stuck, it is not like the same tables get stuck every time. 
>

This is strange. BTW, we don't save slots after the upgrade, so the
subscriptions in the upgraded node won't be valid. We have some
discussion on this topic in threads [1][2]. So, I think after the
upgrade one needs to anyway re-create the subscriptions. Can you share
your exact steps for the upgrade and what is the state before the
upgrade? Is it possible to share some minimal test case to show the
exact problem you are facing?

[1] - https://www.postgresql.org/message-id/20230217075433.u5mjly4d5cr4hcfe%40jrouhaud
[2] -
https://www.postgresql.org/message-id/TYAPR01MB58664C81887B3AF2EB6B16E3F5939%40TYAPR01MB5866.jpnprd01.prod.outlook.com


--
With Regards,
Amit Kapila.



Re: Tables getting stuck at 's' state during logical replication

From
Padmavathi G
Date:
Some background on the setup on which I am trying to carry out the upgrade:

We have a pod in a kubernetes cluster which contains the postgres 11 image. We are following the logical replication process for upgrade

Steps followed for logical replication:

1. Created a new pod in the same kubernetes cluster with the latest postgres 15 image
2. Created a publication  (say publication 1) in the old pod including all tables in a database
3. Created a subscription (say subscription 1) in the new pod for the above mentioned publication
4. When monitoring the subscription via pg_subscription_rel in the subscriber, I noticed that out of 45 tables 20 were in the 'r' state and 25 were in 's' state and they remained in the same state for almost 2 days, there was no improvement in the state. But the logs showed that the tables which had 's' state also had "synchronization workers for <table_name> finished".
5. Then I removed the tables which got stuck in the 's' state from publication 1 and created a new publication (publication 2) with only these tables which got stuck and created a new subscription (subscription 2) for this publication in the subscriber.
6. Now on monitoring subscription 2 via pg_subscription_rel I noticed that out of 25, now 12 were in 'r' state and 13 again got stuck in 's' state. Repeated this process of dropping tables which got stuck from publication and created a new publisher and subscriber and finally I was able to bring all tables to sync in this way. But still the tables were present in replication origin.
7. On executing pg_replication_origins command, I saw that every subscription had one origin and every table which got stuck in each publication had one origin with roname pg_<subid>_<relid>. Eventhough they were stuck, these replication origins were not removed.


On Fri, May 5, 2023 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, May 5, 2023 at 3:04 PM Padmavathi G <padma9.9.1999@gmail.com> wrote:
>
> Hello,
> I am trying out logical replication for upgrading postgres instance from version 11 to 15.x. In the process, I noticed that some tables get stuck in the 's' state during logical replication and they do not move to the 'r' state. I tried to drop the subscription and create a new subscriber, but the same thing repeated. Also, each time different tables get stuck, it is not like the same tables get stuck every time.
>

This is strange. BTW, we don't save slots after the upgrade, so the
subscriptions in the upgraded node won't be valid. We have some
discussion on this topic in threads [1][2]. So, I think after the
upgrade one needs to anyway re-create the subscriptions. Can you share
your exact steps for the upgrade and what is the state before the
upgrade? Is it possible to share some minimal test case to show the
exact problem you are facing?

[1] - https://www.postgresql.org/message-id/20230217075433.u5mjly4d5cr4hcfe%40jrouhaud
[2] - https://www.postgresql.org/message-id/TYAPR01MB58664C81887B3AF2EB6B16E3F5939%40TYAPR01MB5866.jpnprd01.prod.outlook.com


--
With Regards,
Amit Kapila.

Re: Tables getting stuck at 's' state during logical replication

From
Amit Kapila
Date:
On Fri, May 5, 2023 at 7:27 PM Padmavathi G <padma9.9.1999@gmail.com> wrote:
>
> Some background on the setup on which I am trying to carry out the upgrade:
>
> We have a pod in a kubernetes cluster which contains the postgres 11 image. We are following the logical replication
processfor upgrade 
>
> Steps followed for logical replication:
>
> 1. Created a new pod in the same kubernetes cluster with the latest postgres 15 image
> 2. Created a publication  (say publication 1) in the old pod including all tables in a database
> 3. Created a subscription (say subscription 1) in the new pod for the above mentioned publication
> 4. When monitoring the subscription via pg_subscription_rel in the subscriber, I noticed that out of 45 tables 20
werein the 'r' state and 25 were in 's' state and they remained in the same state for almost 2 days, there was no
improvementin the state. But the logs showed that the tables which had 's' state also had "synchronization workers for
<table_name>finished". 
>

I think the problem happened in this step where some of the tables
remained in 's' state. It is not clear why this could happen because
apply worker should eventually update these relations state to 'r'. If
this is reproducible, we can try two things to further investigate the
issue: (a) Disable and Enable the subscription once and see if that
helps; (b) try increasing the LOG level to DEBUG2 and see if we get
any useful LOG message.

--
With Regards,
Amit Kapila.



Re: Tables getting stuck at 's' state during logical replication

From
Bharath Rupireddy
Date:
On Tue, May 9, 2023 at 4:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 5, 2023 at 7:27 PM Padmavathi G <padma9.9.1999@gmail.com> wrote:
> >
> > Some background on the setup on which I am trying to carry out the upgrade:
> >
> > We have a pod in a kubernetes cluster which contains the postgres 11 image. We are following the logical
replicationprocess for upgrade 
> >
> > Steps followed for logical replication:
> >
> > 1. Created a new pod in the same kubernetes cluster with the latest postgres 15 image
> > 2. Created a publication  (say publication 1) in the old pod including all tables in a database
> > 3. Created a subscription (say subscription 1) in the new pod for the above mentioned publication
> > 4. When monitoring the subscription via pg_subscription_rel in the subscriber, I noticed that out of 45 tables 20
werein the 'r' state and 25 were in 's' state and they remained in the same state for almost 2 days, there was no
improvementin the state. But the logs showed that the tables which had 's' state also had "synchronization workers for
<table_name>finished". 
> >
>
> I think the problem happened in this step where some of the tables
> remained in 's' state. It is not clear why this could happen because
> apply worker should eventually update these relations state to 'r'. If
> this is reproducible, we can try two things to further investigate the
> issue: (a) Disable and Enable the subscription once and see if that
> helps; (b) try increasing the LOG level to DEBUG2 and see if we get
> any useful LOG message.

It looks like the respective apply workers for the tables whose state
was 's' are not able to get to UpdateSubscriptionRelState with
SUBREL_STATE_READY. In addition to what Amit suggested above, is it
possible for you to reproduce this problem on upstream (reproducing on
HEAD cool otherwise PG 15 enough) code? If yes, you can add custom
debug messages to see what's happening.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com