Re: [PoC] pg_upgrade: allow to upgrade publisher node - Mailing list pgsql-hackers
From | Bharath Rupireddy |
---|---|
Subject | Re: [PoC] pg_upgrade: allow to upgrade publisher node |
Date | |
Msg-id | CALj2ACXp+LXioY_=9mboEbLD--4c4nnpJCZ+j4fckBdSOQhENA@mail.gmail.com Whole thread Raw |
In response to | RE: [PoC] pg_upgrade: allow to upgrade publisher node ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>) |
Responses |
RE: [PoC] pg_upgrade: allow to upgrade publisher node
RE: [PoC] pg_upgrade: allow to upgrade publisher node |
List | pgsql-hackers |
On Sat, Sep 23, 2023 at 10:18 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Again, thank you for reviewing! Here is a new version patch. Here are some more comments/thoughts on the v44 patch: 1. +# pg_upgrade will fail because the slot still has unconsumed WAL records +command_fails( + [ Add a test case to hit fprintf(script, "The slot \"%s\" is invalid\n", file as well? 2. + 'run of pg_upgrade where the new cluster has insufficient max_replication_slots'); +ok( -d $new_publisher->data_dir . "/pg_upgrade_output.d", + "pg_upgrade_output.d/ not removed after pg_upgrade failure"); + 'run of pg_upgrade where the new cluster has the wrong wal_level'); +ok( -d $new_publisher->data_dir . "/pg_upgrade_output.d", + "pg_upgrade_output.d/ not removed after pg_upgrade failure"); + 'run of pg_upgrade of old cluster with idle replication slots'); +ok( -d $new_publisher->data_dir . "/pg_upgrade_output.d", + "pg_upgrade_output.d/ not removed after pg_upgrade failure"); How do these tests recognize the failures are the intended ones? I mean, for instance when pg_upgrade fails for unused replication slots/unconsumed WAL records, then just looking at the presence of pg_upgrade_output.d might not be sufficient, no? Using command_fails_like instead of command_fails and looking at the contents of invalid_logical_relication_slots.txt might help make these tests more focused. 3. + pg_log(PG_REPORT, "fatal"); + pg_fatal("Your installation contains invalid logical replication slots.\n" + "These slots can't be copied, so this cluster cannot be upgraded.\n" + "Consider removing such slots or consuming the pending WAL if any,\n" + "and then restart the upgrade.\n" + "A list of invalid logical replication slots is in the file:\n" + " %s", output_path); It's not just the invalid logical replication slots, but also the slots with unconsumed WALs which aren't invalid and can be upgraded if ensured the WAL is consumed. So, a better wording would be: pg_fatal("Your installation contains logical replication slots that cannot be upgraded.\n" "List of all such logical replication slots is in the file:\n" "These slots can't be copied, so this cluster cannot be upgraded.\n" "Consider removing invalid slots and/or consuming the pending WAL if any,\n" "and then restart the upgrade.\n" " %s", output_path); 4. + /* + * There is a possibility that following records may be generated + * during the upgrade. + */ + is_valid = is_xlog_record_type(rmid, info, RM_XLOG_ID, XLOG_CHECKPOINT_SHUTDOWN) || + is_xlog_record_type(rmid, info, RM_XLOG_ID, XLOG_CHECKPOINT_ONLINE) || + is_xlog_record_type(rmid, info, RM_XLOG_ID, XLOG_SWITCH) || + is_xlog_record_type(rmid, info, RM_XLOG_ID, XLOG_FPI_FOR_HINT) || + is_xlog_record_type(rmid, info, RM_XLOG_ID, XLOG_PARAMETER_CHANGE) || + is_xlog_record_type(rmid, info, RM_STANDBY_ID, XLOG_RUNNING_XACTS) || + is_xlog_record_type(rmid, info, RM_HEAP2_ID, XLOG_HEAP2_PRUNE); What if we missed to capture the WAL records that may be generated during upgrade? What happens if a custom WAL resource manager generates table/index AM WAL records during upgrade? What happens if new WAL records are added that may be generated during the upgrade? Isn't keeping this code extensible and in sync with future changes a problem? Or we'd better say that any custom WAL records are found after the slot's confirmed flush LSN, then the slot isn't upgraded? 5. In continuation to the above comment: Why can't this logic be something like - if there's any WAL record seen after a slot's confirmed flush LSN is of type generated by WAL resource manager having the rm_decode function defined, then the slot can't be upgraded. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: