Home > mailing lists

Re: [PoC] pg_upgrade: allow to upgrade publisher node - Mailing list pgsql-hackers

From	Bharath Rupireddy
Subject	Re: [PoC] pg_upgrade: allow to upgrade publisher node
Date	September 25, 2023 07:36:16
Msg-id	CALj2ACXp+LXioY_=9mboEbLD--4c4nnpJCZ+j4fckBdSOQhENA@mail.gmail.com Whole thread Raw
In response to	RE: [PoC] pg_upgrade: allow to upgrade publisher node ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Responses	RE: [PoC] pg_upgrade: allow to upgrade publisher node RE: [PoC] pg_upgrade: allow to upgrade publisher node
List	pgsql-hackers

Tree view

On Sat, Sep 23, 2023 at 10:18 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Again, thank you for reviewing! Here is a new version patch.

Here are some more comments/thoughts on the v44 patch:

1.
+# pg_upgrade will fail because the slot still has unconsumed WAL records
+command_fails(
+    [

Add a test case to hit fprintf(script, "The slot \"%s\" is invalid\n",
file as well?

2.
+    'run of pg_upgrade where the new cluster has insufficient
max_replication_slots');
+ok( -d $new_publisher->data_dir . "/pg_upgrade_output.d",
+    "pg_upgrade_output.d/ not removed after pg_upgrade failure");

+    'run of pg_upgrade where the new cluster has the wrong wal_level');
+ok( -d $new_publisher->data_dir . "/pg_upgrade_output.d",
+    "pg_upgrade_output.d/ not removed after pg_upgrade failure");

+    'run of pg_upgrade of old cluster with idle replication slots');
+ok( -d $new_publisher->data_dir . "/pg_upgrade_output.d",
+    "pg_upgrade_output.d/ not removed after pg_upgrade failure");

How do these tests recognize the failures are the intended ones? I
mean, for instance when pg_upgrade fails for unused replication
slots/unconsumed WAL records, then just looking at the presence of
pg_upgrade_output.d might not be sufficient, no? Using
command_fails_like instead of command_fails and looking at the
contents of invalid_logical_relication_slots.txt might help make these
tests more focused.

3.
+        pg_log(PG_REPORT, "fatal");
+        pg_fatal("Your installation contains invalid logical
replication slots.\n"
+                 "These slots can't be copied, so this cluster cannot
be upgraded.\n"
+                 "Consider removing such slots or consuming the
pending WAL if any,\n"
+                 "and then restart the upgrade.\n"
+                 "A list of invalid logical replication slots is in
the file:\n"
+                 "    %s", output_path);

It's not just the invalid logical replication slots, but also the
slots with unconsumed WALs which aren't invalid and can be upgraded if
ensured the WAL is consumed. So, a better wording would be:
        pg_fatal("Your installation contains logical replication slots
that cannot be upgraded.\n"
                 "List of all such logical replication slots is in the file:\n"
                 "These slots can't be copied, so this cluster cannot
be upgraded.\n"
                 "Consider removing invalid slots and/or consuming the
pending WAL if any,\n"
                 "and then restart the upgrade.\n"
                 "    %s", output_path);

4.
+        /*
+         * There is a possibility that following records may be generated
+         * during the upgrade.
+         */
+        is_valid = is_xlog_record_type(rmid, info, RM_XLOG_ID,
XLOG_CHECKPOINT_SHUTDOWN) ||
+            is_xlog_record_type(rmid, info, RM_XLOG_ID,
XLOG_CHECKPOINT_ONLINE) ||
+            is_xlog_record_type(rmid, info, RM_XLOG_ID, XLOG_SWITCH) ||
+            is_xlog_record_type(rmid, info, RM_XLOG_ID, XLOG_FPI_FOR_HINT) ||
+            is_xlog_record_type(rmid, info, RM_XLOG_ID,
XLOG_PARAMETER_CHANGE) ||
+            is_xlog_record_type(rmid, info, RM_STANDBY_ID,
XLOG_RUNNING_XACTS) ||
+            is_xlog_record_type(rmid, info, RM_HEAP2_ID, XLOG_HEAP2_PRUNE);

What if we missed to capture the WAL records that may be generated
during upgrade?

What happens if a custom WAL resource manager generates table/index AM
WAL records during upgrade?

What happens if new WAL records are added that may be generated during
the upgrade? Isn't keeping this code extensible and in sync with
future changes a problem? Or we'd better say that any custom WAL
records are found after the slot's confirmed flush LSN, then the slot
isn't upgraded?

5. In continuation to the above comment:

Why can't this logic be something like - if there's any WAL record
seen after a slot's confirmed flush LSN is of type generated by WAL
resource manager having the rm_decode function defined, then the slot
can't be upgraded.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

pgsql-hackers by date:

From: vignesh C
Date: 25 September 2023, 07:35:03
Subject: Re: Invalidate the subscription worker in cases where a user loses their superuser status

From: Michael Paquier
Date: 25 September 2023, 07:50:05
Subject: Re: GUC for temporarily disabling event triggers

Re: [PoC] pg_upgrade: allow to upgrade publisher node - Mailing list pgsql-hackers

Previous

Next