Re: [PoC] pg_upgrade: allow to upgrade publisher node - Mailing list pgsql-hackers

From Peter Smith
Subject Re: [PoC] pg_upgrade: allow to upgrade publisher node
Date
Msg-id CAHut+Pt7Ri1r2Q7BQ3qwtGtDYFjvBOU8tUOUF-E41-jjxYSFTA@mail.gmail.com
Whole thread Raw
In response to RE: [PoC] pg_upgrade: allow to upgrade publisher node  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Responses RE: [PoC] pg_upgrade: allow to upgrade publisher node
List pgsql-hackers
Here are some review comments for v22-0002

======
Commit Message

1.
This commit allows nodes with logical replication slots to be upgraded. While
reading information from the old cluster, a list of logical replication slots is
newly extracted. At the later part of upgrading, pg_upgrade revisits the list
and restores slots by using the pg_create_logical_replication_slots() on the new
clushter.

~

1a
/is newly extracted/is fetched/

~

1b.
/using the pg_create_logical_replication_slots()/executing
pg_create_logical_replication_slots()/

~

1c.
/clushter/cluster/

~~~

2.
Note that it must be done after the final pg_resetwal command during the upgrade
because pg_resetwal will remove WALs that are required by the slots. Due to the
restriction, the timing of restoring replication slots is different from other
objects.

~

2a.
/it must/slot restoration/

~

2b.
/the restriction/this restriction/

======
doc/src/sgml/ref/pgupgrade.sgml

3.
+    <para>
+     <application>pg_upgrade</application> attempts to migrate logical
+     replication slots. This helps avoid the need for manually defining the
+     same replication slot on the new publisher.
+    </para>

/same replication slot/same replication slots/

~~~

4.
+    <para>
+     Before you start upgrading the publisher cluster, ensure that the
+     subscription is temporarily disabled, by executing
+     <link linkend="sql-altersubscription"><command>ALTER
SUBSCRIPTION ... DISABLE</command></link>.
+     After the upgrade is complete, execute the
+     <command>ALTER SUBSCRIPTION ... CONNECTION</command> command to update the
+     connection string, and then re-enable the subscription.
+    </para>

On the rendered page, it looks a bit strange that DISABLE has a link
but COMMENTION does not have a link.

~~~

5.
+    <para>
+     There are some prerequisites for <application>pg_upgrade</application> to
+     be able to upgrade the replication slots. If these are not met an error
+     will be reported.
+    </para>
+
+    <itemizedlist>

+1 to use all the itemizedlist changes that Amit suggested [1] in his
attachment.

======
src/bin/pg_upgrade/check.c

6.
+static void check_for_logical_replication_slots(ClusterInfo *new_cluster);

IMO the arg name should not shadow a global with the same name. See
other review comment for this function signature.

~~~

7.
+ /* Extract a list of logical replication slots */
+ get_logical_slot_infos(&old_cluster, live_check);

But 'live_check' is never used?

~~~

8. check_for_logical_replication_slots
+
+/*
+ * Verify the parameter settings necessary for creating logical replication
+ * slots.
+ */
+static void
+check_for_logical_replication_slots(ClusterInfo *new_cluster)

IMO the arg name should not shadow a global with the same name. If
this is never going to be called with any param other than
&new_cluster then probably it is better not even to pass have that
argument at all. Just refer to the global new_cluster inside the
function.

You can't say that 'check_for_new_tablespace_dir' does it already so
it must be OK -- I think that the existing function has the same issue
and it also ought to be fixed to avoid shadowing!

~~~

9. check_for_logical_replication_slots

+ /* logical replication slots can be migrated since PG17. */
+ if (GET_MAJOR_VERSION(new_cluster->major_version) <= 1600)
+ return;

IMO the code matches the comment better if you say < 1700 instead of <= 1600.

======
src/bin/pg_upgrade/function.c

10. get_loadable_libraries
  /*
- * Fetch all libraries containing non-built-in C functions in this DB.
+ * Fetch all libraries containing non-built-in C functions or referred
+ * by logical replication slots in this DB.
  */
  ress[dbnum] = executeQueryOrDie(conn,
~

/referred by/referred to by/

======
src/bin/pg_upgrade/info.c

11.
+/*
+ * get_logical_slot_infos()
+ *
+ * Higher level routine to generate LogicalSlotInfoArr for all databases.
+ */
+void
+get_logical_slot_infos(ClusterInfo *cluster, bool live_check)
+{
+ int dbnum;
+ int slot_count = 0;
+
+ if (cluster == &old_cluster)
+ pg_log(PG_VERBOSE, "\nsource databases:");
+ else
+ pg_log(PG_VERBOSE, "\ntarget databases:");
+
+ for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
+ {
+ DbInfo    *pDbInfo = &cluster->dbarr.dbs[dbnum];
+
+ get_logical_slot_infos_per_db(cluster, pDbInfo);
+ slot_count += pDbInfo->slot_arr.nslots;
+
+ if (log_opts.verbose)
+ {
+ pg_log(PG_VERBOSE, "Database: \"%s\"", pDbInfo->db_name);
+ print_slot_infos(&pDbInfo->slot_arr);
+ }
+ }
+}
+

11a.
Now the variable 'slot_count' is no longer being returned so it seems redundant.

~

11b.
What is the 'live_check' parameter for? Nobody is using it.

~~~

12. count_logical_slots

+int
+count_logical_slots(ClusterInfo *cluster)
+{
+ int dbnum;
+ int slotnum = 0;
+
+ for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
+ slotnum += cluster->dbarr.dbs[dbnum].slot_arr.nslots;
+
+ return slotnum;
+}

IMO this variable should be called something like 'slot_count'. This
is the same review comment also made in a previous review. (See [2]
comment#12).

~~~

13. print_slot_infos

+
+static void
+print_slot_infos(LogicalSlotInfoArr *slot_arr)
+{
+ int slotnum;
+
+ for (slotnum = 0; slotnum < slot_arr->nslots; slotnum++)
+ pg_log(PG_VERBOSE, "slotname: \"%s\", plugin: \"%s\", two_phase: %d",
+    slot_arr->slots[slotnum].slotname,
+    slot_arr->slots[slotnum].plugin,
+    slot_arr->slots[slotnum].two_phase);
+}

It might be nicer to introduce a variable, instead of all those array
dereferences:

LogicalSlotInfo *slot_info = &slot_arr->slots[slotnum];

~~~

14.
+ for (slotnum = 0; slotnum < slot_arr->nslots; slotnum++)
+ {
+ /*
+ * Constructs query for creating logical replication slots.
+ *
+ * XXX: For simplification, pg_create_logical_replication_slot() is
+ * used. Is it sufficient?
+ */
+ appendPQExpBuffer(query, "SELECT
pg_catalog.pg_create_logical_replication_slot(");
+ appendStringLiteralConn(query, slot_arr->slots[slotnum].slotname,
+ conn);
+ appendPQExpBuffer(query, ", ");
+ appendStringLiteralConn(query, slot_arr->slots[slotnum].plugin,
+ conn);
+ appendPQExpBuffer(query, ", false, %s);",
+   slot_arr->slots[slotnum].two_phase ? "true" : "false");
+
+ PQclear(executeQueryOrDie(conn, "%s", query->data));
+
+ resetPQExpBuffer(query);
+ }
+
+ PQfinish(conn);
+
+ destroyPQExpBuffer(query);
+ }
+
+ end_progress_output();
+ check_ok();

14a
Similar to the previous comment (#13). It might be nicer to introduce
a variable, instead of all those array dereferences:

LogicalSlotInfo *slot_info = &slot_arr->slots[slotnum];
~

14b.
It was not clear to me why this command is not being built using
executeQueryOrDie directly instead of using the query buffer. Is there
some reason?

~

14c.
I think it would be cleaner to have a separate res variable like you
used elsewhere:
res = executeQueryOrDie(...)

instead of doing PQclear(executeQueryOrDie(conn, "%s", query->data));
in one line

======
src/bin/pg_upgrade/pg_upgrade.

15.
+void get_logical_slot_infos(ClusterInfo *cluster, bool live_check);

I didn't see a reason for that 'live_check' parameter.

======
.../pg_upgrade/t/003_logical_replication_slots.pl

16.
IMO this would be much easier to read if there were BIG comments
between the actual TEST parts

For example

# ------------------------------
# TEST: Confirm pg_upgrade fails is new node wal_level is not 'logical'
<preparation>
<test>
<cleanup>

# ------------------------------
# TEST: Confirm pg_upgrade fails max_replication_slots on new node is too low
<preparation>
<test>
<cleanup>

# ------------------------------
# TEST: Successful upgrade
<preparation>
<test>
<cleanup>

~~~

17.
+# Cause a failure at the start of pg_upgrade because wal_level is replica
+command_fails(
+ [
+ 'pg_upgrade', '--no-sync',
+ '-d',         $old_publisher->data_dir,
+ '-D',         $new_publisher->data_dir,
+ '-b',         $bindir,
+ '-B',         $bindir,
+ '-s',         $new_publisher->host,
+ '-p',         $old_publisher->port,
+ '-P',         $new_publisher->port,
+ $mode,
+ ],
+ 'run of pg_upgrade of old node with wrong wal_level');
+ok( -d $new_publisher->data_dir . "/pg_upgrade_output.d",
+ "pg_upgrade_output.d/ not removed after pg_upgrade failure");

The message is ambiguous

BEFORE
'run of pg_upgrade of old node with wrong wal_level'

SUGGESTION
'run of pg_upgrade where the new node has the wrong wal_level'

~~~

18.
+# Create an unnecessary slot on old node
+$old_publisher->start;
+$old_publisher->safe_psql(
+ 'postgres', qq[
+ SELECT pg_create_logical_replication_slot('test_slot2',
'test_decoding', false, true);
+]);
+
+$old_publisher->stop;
+
+# Preparations for the subsequent test. max_replication_slots is set to
+# smaller than existing slots on old node
+$new_publisher->append_conf('postgresql.conf', "wal_level = 'logical'");
+$new_publisher->append_conf('postgresql.conf', "max_replication_slots = 1");


IMO the comment is misleading. It is not an "unnecessary slot", it is
just a 2nd slot. And this is all part of the preparation for the next
test so it should be under the other comment.

For example SUGGESTION changes like this:

# Preparations for the subsequent test.
# 1. Create an unnecessary slot on the old node
$old_publisher->start;
$old_publisher->safe_psql(
'postgres', qq[
SELECT pg_create_logical_replication_slot('test_slot2',
'test_decoding', false, true);
]);
$old_publisher->stop;
# 2. max_replication_slots is set to smaller than the number of slots
(2) present on the old node
$new_publisher->append_conf('postgresql.conf', "max_replication_slots = 1");
# 3. new node wal_level is set correctly
$new_publisher->append_conf('postgresql.conf', "wal_level = 'logical'");

~~~

19.
+# Remove an unnecessary slot and consume WAL records
+$old_publisher->start;
+$old_publisher->safe_psql(
+ 'postgres', qq[
+ SELECT pg_drop_replication_slot('test_slot2');
+ SELECT count(*) FROM pg_logical_slot_get_changes('test_slot1', NULL, NULL)
+]);
+$old_publisher->stop;
+

This comment should say more like:

# Preparations for the subsequent test.

~~~

20.
+# Actual run, pg_upgrade_output.d is removed at the end

This comment should mention that "successful upgrade is expected"
because all the other prerequisites are now satisfied.

~~~

21.
+$new_publisher->start;
+my $result = $new_publisher->safe_psql('postgres',
+ "SELECT slot_name, two_phase FROM pg_replication_slots");
+is($result, qq(test_slot1|t), 'check the slot exists on new node');

Should there be a matching new_pulisher->stop;?

------
[1] https://www.postgresql.org/message-id/CAA4eK1%2BdT2g8gmerguNd_TA%3DXMnm00nLzuEJ_Sddw6Pj-bvKVQ%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/TYAPR01MB586604802ABE42E11866762FF51BA%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia



pgsql-hackers by date:

Previous
From: Gurjeet Singh
Date:
Subject: Re: Doc limitation update proposal: include out-of-line OID usage per TOAST-ed columns
Next
From: Michael Paquier
Date:
Subject: Re: Remove distprep