Re: pg_upgrade and logical replication - Mailing list pgsql-hackers

From vignesh C
Subject Re: pg_upgrade and logical replication
Date
Msg-id CALDaNm2ssmSFs4bjpfxbkfUbPE=xFSGqxFoip87kF259FG=X2g@mail.gmail.com
Whole thread Raw
In response to Re: pg_upgrade and logical replication  (Peter Smith <smithpb2250@gmail.com>)
Responses Re: pg_upgrade and logical replication
List pgsql-hackers
On Thu, 16 Nov 2023 at 07:45, Peter Smith <smithpb2250@gmail.com> wrote:
>
> Here are some review comments for patch v14-0001
>
> ======
> src/backend/utils/adt/pg_upgrade_support.c
>
> 1. binary_upgrade_replorigin_advance
>
> + /* lock to prevent the replication origin from vanishing */
> + LockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
> + originid = replorigin_by_name(originname, false);
>
> Use uppercase for the lock comment.

Modified

> ======
> src/bin/pg_upgrade/check.c
>
> 2. check_for_subscription_state
>
> > > + prep_status("Checking for subscription state");
> > > +
> > > + snprintf(output_path, sizeof(output_path), "%s/%s",
> > > + log_opts.basedir,
> > > + "subscription_state.txt");
> > >
> > > I felt this filename ought to be more like
> > > 'subscriptions_with_bad_state.txt' because the current name looks like
> > > a normal logfile with nothing to indicate that it is only for the
> > > states of the "bad" subscriptions.
> >
> > I  have kept the file name intentionally shorted as we noticed that
> > when the upgrade of the publisher patch used a longer name there were
> > some buildfarm failures because of longer names.
>
> OK, but how about some other short meaningful name like 'subs_invalid.txt'?
>
> I also thought "state" in the original name was misleading because
> this file contains not only subscriptions with bad 'state' but also
> subscriptions with missing 'origin'.

Modified

> ~~~
>
> 3. check_new_cluster_logical_replication_slots
>
>   int nslots_on_old;
>   int nslots_on_new;
> + int nsubs_on_old = old_cluster.subscription_count;
>
> I felt it might be better to make both these quantities 'unsigned' to
> make it more obvious that there are no special meanings for negative
> numbers.

I have used int itself as all others also use int like in case of
logical slots. I tried making the changes, but the code was not
consistent, so used int like that is used for others.


> ~~~
>
> 4. check_new_cluster_logical_replication_slots
>
> nslots_on_old = count_old_cluster_logical_slots();
>
> ~
>
> IMO the 'nsubs_on_old' should be coded the same as above. AFAICT, this
> is the only code where you are interested in the number of
> subscribers, and furthermore, it seems you only care about that count
> in the *old* cluster. This means the current implementation of
> get_subscription_count() seems more generic than it needs to be and
> that results in more unnecessary patch code. (I will repeat this same
> review comment in the other relevant places).
>
> SUGGESTION
> nslots_on_old = count_old_cluster_logical_slots();
> nsubs_on_old = count_old_cluster_subscriptions();

Modified to keep it similar to logical slot implementation.

> ~~~
>
> 5.
> + /*
> + * Quick return if there are no logical slots and subscriptions to be
> + * migrated.
> + */
> + if (nslots_on_old == 0 && nsubs_on_old == 0)
>   return;
>
> /and subscriptions/and no subscriptions/

Modified

> ~~~
>
> 6.
> - if (nslots_on_old > max_replication_slots)
> + if (nslots_on_old && nslots_on_old > max_replication_slots)
>   pg_fatal("max_replication_slots (%d) must be greater than or equal
> to the number of "
>   "logical replication slots (%d) on the old cluster",
>   max_replication_slots, nslots_on_old);
>
> Neither nslots_on_old nor max_replication_slots can be < 0, so I don't
> see why the additional check is needed here.
> AFAICT "if (nslots_on_old > max_replication_slots)" acheives the same
> thing that you want.

This part of code is changed now

> ~~~
>
> 7.
> + if (nsubs_on_old && nsubs_on_old > max_replication_slots)
> + pg_fatal("max_replication_slots (%d) must be greater than or equal
> to the number of "
> + "subscriptions (%d) on the old cluster",
> + max_replication_slots, nsubs_on_old);
>
> Neither nsubs_on_old nor max_replication_slots can be < 0, so I don't
> see why the additional check is needed here.
> AFAICT "if (nsubs_on_old > max_replication_slots)" achieves the same
> thing that you want.

This part of code is changed now

> ======
> src/bin/pg_upgrade/info.c
>
> 8. get_db_rel_and_slot_infos
>
> + if (cluster == &old_cluster)
> + get_subscription_count(cluster);
> +
>
> I felt this is unnecessary because you only want to know the
> nsubs_on_old in one place and then only for the old cluster, so
> calling this to set a generic attribute for the cluster is overkill.

We need to do this here because when we do the validation of new
cluster the old cluster will not be running. I have made the flow
similar to logical slots now.

> ~~~
>
> 9.
> +/*
> + * Get the number of subscriptions in the old cluster.
> + */
> +static void
> +get_subscription_count(ClusterInfo *cluster)
> +{
> + PGconn    *conn;
> + PGresult   *res;
> +
> + if (GET_MAJOR_VERSION(cluster->major_version) < 1700)
> + return;
> +
> + conn = connectToServer(cluster, "template1");
> + res = executeQueryOrDie(conn,
> +   "SELECT oid FROM pg_catalog.pg_subscription");
> +
> + cluster->subscription_count = PQntuples(res);
> +
> + PQclear(res);
> + PQfinish(conn);
> +}
>
> 9a.
> Currently, this is needed only for the old_cluster (like the function
> comment implies), so the parameter is not required.
>
> Also, AFAICT this number is only needed in one place
> (check_new_cluster_logical_replication_slots) so IMO it would be
> better to make lots of changes to simplify this code:
> - change the function name to be like the other one. e.g.
> count_old_cluster_subscriptions()
> - function to return unsigned
>
> SUGGESTION (something like this...)
>
> unsigned
> count_old_cluster_subscriptions(void)
> {
>   unsigned nsubs = 0;
>
>   if (GET_MAJOR_VERSION(cluster->major_version) >= 1700)
>   {
>     PGconn *conn = connectToServer(&old_cluster, "template1");
>     PGresult *res = executeQueryOrDie(conn,
>                             "SELECT oid FROM pg_catalog.pg_subscription");
>     nsubs = PQntuples(res);
>     PQclear(res);
>     PQfinish(conn);
>   }
>
>   return nsubs;
> }

 This function is not needed anymore, making the logic similar to logical slots.

> ~
>
> 9b.
> This function is returning 0 (aka not assigning
> cluster->subscription_count) for clusters before PG17. IIUC this is
> effectively the same behaviour as count_old_cluster_logical_slots()
> but probably it needs to be mentioned more in this function comment
> why it is like this.

This function is not needed anymore, making the logic similar to logical slots.

> ======
> src/bin/pg_upgrade/pg_upgrade.h
>
> 10.
>   const char *tablespace_suffix; /* directory specification */
> + int subscription_count; /* number of subscriptions */
>  } ClusterInfo;
>
> I felt this is not needed because you only need to know the
> nsubs_on_old in one place, so you can just call the counting function
> from there. Making this a generic attribute for the cluster seems
> overkill.

We need to do this here because when we do the validation of a new
cluster the old cluster will not be running. I have made the flow
similar to logical slots now.

> ======
> src/bin/pg_upgrade/t/004_subscription.pl
>
> 11. TEST: Check that pg_upgrade is successful when the table is in init state.
>
> +$synced_query =
> +  "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = 'i'";
> +$old_sub1->poll_query_until('postgres', $synced_query)
> +  or die "Timed out while waiting for subscriber to synchronize data";
>
> But it doesn't get to "synchronize data", so should that message say
> more like "Timed out while waiting for the table to reach INIT state"

Modified

> ~
>
> 12.
> +command_ok(
> + [
> + 'pg_upgrade', '--no-sync',        '-d', $old_sub1->data_dir,
> + '-D',         $new_sub1->data_dir, '-b', $bindir,
> + '-B',         $bindir,            '-s', $new_sub1->host,
> + '-p',         $old_sub1->port,     '-P', $new_sub1->port,
> + $mode,
> + ],
> + 'run of pg_upgrade --check for old instance when the subscription
> tables are in ready state'
> +);
>
> Should that message say "init state" instead of "ready state"?

Modified

> ~~~
>
> 13. TEST: when the subscription's replication origin does not exist.
>
> +$old_sub2->safe_psql('postgres',
> + "ALTER SUBSCRIPTION regress_sub2 disable");
>
> /disable/DISABLE/

Modified

> ~~~
>
> 14.
> +my $subid = $old_sub2->safe_psql('postgres',
> + "SELECT oid FROM pg_subscription WHERE subname = 'regress_sub2'");
> +my $reporigin = 'pg_'.qq($subid);
> +$old_sub2->safe_psql('postgres',
> + "SELECT pg_replication_origin_drop('$reporigin')"
> +);
>
> Maybe this part needs a comment to say the reason why the origin does
> not exist -- it's because you found and explicitly dropped it.

Modified

The attached v15 version patch has the changes for the same.

Regards,
Vignesh

Attachment

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: long-standing data loss bug in initial sync of logical replication
Next
From: vignesh C
Date:
Subject: Re: pg_upgrade and logical replication