Thread: pg_createsubscriber: drop pre-existing subscriptions from the converted node

pg_createsubscriber: drop pre-existing subscriptions from the converted node

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Hackers,

This is a follow-up thread for pg_createsubscriber [1]. I started a new thread
since there is no activity around here.

## Problem

Assuming that there is a cascading replication like below:

node A --(logical replication)--> node B --(streaming replication)--> node C

In this case, subscriptions exist even on node C, but it does not try to connect
to node A because the logical replication launcher/worker won't be launched.
After the conversion, node C becomes a subscriber for node B, and the subscription
toward node A remains. Therefore, another worker that tries to connect to node A
will be launched, raising an ERROR [2]. This failure may occur even during the
conversion.

## Solution

The easiest solution is to drop pre-existing subscriptions from the converted node.
To avoid establishing connections during the conversion, slot_name is set to NONE
on the primary first, then drop on the standby. The setting will be restored on the
primary node.
Attached patch implements the idea. Test script is also included, but not sure it should
be on the HEAD

BTW, I found that LogicalRepInfo.oid won't be used. If needed, I can create
another patch to remove the attribute.

How do you think?

[1]: https://www.postgresql.org/message-id/CAA4eK1J22UEfrqx222h5j9DQ7nxGrTbAa_BC%2B%3DmQXdXs-RCsew%40mail.gmail.com
[2]:
https://www.postgresql.org/message-id/CANhcyEWvimA1-f6hSrA%3D9qkfR5SonFb56b36M%2B%2BvT%3DLiFj%3D76g%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/


Attachment
On Fri, 21 Jun 2024 at 16:51, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Hackers,
>
> This is a follow-up thread for pg_createsubscriber [1]. I started a new thread
> since there is no activity around here.
>
> ## Problem
>
> Assuming that there is a cascading replication like below:
>
> node A --(logical replication)--> node B --(streaming replication)--> node C
>
> In this case, subscriptions exist even on node C, but it does not try to connect
> to node A because the logical replication launcher/worker won't be launched.
> After the conversion, node C becomes a subscriber for node B, and the subscription
> toward node A remains. Therefore, another worker that tries to connect to node A
> will be launched, raising an ERROR [2]. This failure may occur even during the
> conversion.
>
> ## Solution
>
> The easiest solution is to drop pre-existing subscriptions from the converted node.
> To avoid establishing connections during the conversion, slot_name is set to NONE
> on the primary first, then drop on the standby. The setting will be restored on the
> primary node.
> Attached patch implements the idea. Test script is also included, but not sure it should
> be on the HEAD

Few comments:
1) Should we do this only for the enabled subscription, otherwise the
disabled subscriptions will be enabled after running
pg_createsubscriber:
+obtain_and_disable_pre_existing_subscriptions(struct LogicalRepInfo *dbinfo)
+{
+       PQExpBuffer query = createPQExpBuffer();
+
+       for (int i = 0; i < num_dbs; i++)
+       {
+               PGconn     *conn;
+               PGresult   *res;
+               int                     ntups;
+
+               /* Connect to publisher */
+               conn = connect_database(dbinfo[i].pubconninfo, true);
+
+               appendPQExpBuffer(query,
+                                                 "SELECT s.subname,
s.subslotname FROM pg_catalog.pg_subscription s "
+                                                 "INNER JOIN
pg_catalog.pg_database d ON (s.subdbid = d.oid) "
+                                                 "WHERE d.datname = '%s'",
+                                                 dbinfo[i].dbname);
+

2) disconnect_database not called here, should the connection be disconnected:
+drop_pre_existing_subscriptions(struct LogicalRepInfo *dbinfo)
+{
+       PQExpBuffer query = createPQExpBuffer();
+
+       for (int i = 0; i < num_dbs; i++)
+       {
+               PGconn                           *conn;
+               struct LogicalRepInfo info = dbinfo[i];
+
+               /* Connect to subscriber */
+               conn = connect_database(info.subconninfo, false);
+
+               for (int j = 0; j < info.num_subscriptions; j++)
+               {
+                       appendPQExpBuffer(query,
+                                                         "DROP
SUBSCRIPTION %s;", info.pre_subnames[j]);
+                       PQexec(conn, query->data);
+                       resetPQExpBuffer(query);
+               }
+       }

3) Similarly here too:
+static void
+enable_subscirptions_on_publisher(struct LogicalRepInfo *dbinfo)
+{
+       PQExpBuffer query = createPQExpBuffer();
+
+       for (int i = 0; i < num_dbs; i++)
+       {
+               PGconn                           *conn;
+               struct LogicalRepInfo info = dbinfo[i];
+
+               /* Connect to publisher */
+               conn = connect_database(info.pubconninfo, false);

4) them should be then here:
+                       /* ...and them enable the subscription */
+                       appendPQExpBuffer(query,
+                                                         "ALTER
SUBSCRIPTION %s ENABLE",
+                                                         info.pre_subnames[j]);
+                       PQclear(PQexec(conn, query->data));
+                       resetPQExpBuffer(query);


> BTW, I found that LogicalRepInfo.oid won't be used. If needed, I can create
> another patch to remove the attribute.

I was able to compile without this, I think this can be removed.

Regards,
Vignesh



On Fri, Jun 21, 2024 at 4:51 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> This is a follow-up thread for pg_createsubscriber [1]. I started a new thread
> since there is no activity around here.
>
> ## Problem
>
> Assuming that there is a cascading replication like below:
>
> node A --(logical replication)--> node B --(streaming replication)--> node C
>
> In this case, subscriptions exist even on node C, but it does not try to connect
> to node A because the logical replication launcher/worker won't be launched.
> After the conversion, node C becomes a subscriber for node B, and the subscription
> toward node A remains. Therefore, another worker that tries to connect to node A
> will be launched, raising an ERROR [2]. This failure may occur even during the
> conversion.
>
> ## Solution
>
> The easiest solution is to drop pre-existing subscriptions from the converted node.
> To avoid establishing connections during the conversion, slot_name is set to NONE
> on the primary first, then drop on the standby. The setting will be restored on the
> primary node.
>

It seems disabling subscriptions on the primary can make the primary
stop functioning for some duration of time. I feel we need some
solution where after converting to subscriber, we disable and drop
pre-existing subscriptions. One idea could be that we use the list of
new subscriptions created by the tool such that any subscription not
existing in that list will be dropped.

Shouldn't this be an open item for PG17?

--
With Regards,
Amit Kapila.



RE: pg_createsubscriber: drop pre-existing subscriptions from the converted node

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Amit, Vingesh,

Thanks for giving comments!

> It seems disabling subscriptions on the primary can make the primary
> stop functioning for some duration of time. I feel we need some
> solution where after converting to subscriber, we disable and drop
> pre-existing subscriptions. One idea could be that we use the list of
> new subscriptions created by the tool such that any subscription not
> existing in that list will be dropped.

Previously I avoided coding like yours, because there is a room that converted
node can connect to another publisher. But per off-list discussion, we can skip
it by setting max_logical_replication_workers = 0. I refactored with the approach.
Note that the GUC is checked at verification phase, so an attribute is added to
start_standby_server() to select the workload.

Most of comments by Vignesh were invalidated due to the code change, but I hoped
I checked your comments were not reproduced. Also, 0001 was created to remove an
unused attribute.

> Shouldn't this be an open item for PG17?

Added this thread to wikipage.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 

Attachment
On Thu, Jun 27, 2024 at 11:47 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> > It seems disabling subscriptions on the primary can make the primary
> > stop functioning for some duration of time. I feel we need some
> > solution where after converting to subscriber, we disable and drop
> > pre-existing subscriptions. One idea could be that we use the list of
> > new subscriptions created by the tool such that any subscription not
> > existing in that list will be dropped.
>
> Previously I avoided coding like yours, because there is a room that converted
> node can connect to another publisher. But per off-list discussion, we can skip
> it by setting max_logical_replication_workers = 0. I refactored with the approach.
> Note that the GUC is checked at verification phase, so an attribute is added to
> start_standby_server() to select the workload.
>

Thanks, this is a better approach. I have changed a few comments and
made some other cosmetic changes. See attached.

Euler, Peter E., and others, do you have any comments/suggestions?

BTW, why have you created a separate test file for this test? I think
we should add a new test to one of the existing tests in
040_pg_createsubscriber. You can create a dummy subscription on node_p
and do a test similar to what we are doing in "# Create failover slot
to test its removal".

--
With Regards,
Amit Kapila.

Attachment