Re: [PoC] pg_upgrade: allow to upgrade publisher node - Mailing list pgsql-hackers
From | Peter Smith |
---|---|
Subject | Re: [PoC] pg_upgrade: allow to upgrade publisher node |
Date | |
Msg-id | CAHut+PuR=Os=kQ=JLqvUqqVGei24V4B=9h+s19rnozh-uf=T4Q@mail.gmail.com Whole thread Raw |
In response to | RE: [PoC] pg_upgrade: allow to upgrade publisher node ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>) |
Responses |
RE: [PoC] pg_upgrade: allow to upgrade publisher node
RE: [PoC] pg_upgrade: allow to upgrade publisher node |
List | pgsql-hackers |
Hi Kuroda-san, here are my review comments for v34-0002 There is likely to be some overlap because others have modified and/or commented on some of the same points as me, and v35 was already posted before this review. I'll leave it to you to sort out any clashes and ignore them where appropriate. ====== 1. GENERAL -- Cluster Terminology This is not really a problem of your patch, but during message review, I noticed the terms old/new cluster VERSUS source/target cluster and both were used many times: For example. ".*new clusmter --> 44 occurences ".*old cluster --> 21 occurences ".*source cluster --> 6 occurences ".*target cluster --> 12 occurences Perhaps there should be a new thread/patch to use consistent terms. Thoughts? ~~~ 2. GENERAL - Error message cases Just FYI, there are many inconsistent capitalising in these patch messages, but then the same is also true for the HEAD code. It's a bit messy, but generally, I think your capitalisation was aligned with what I saw in HEAD, so I didn't comment anywhere about it. ====== src/backend/replication/slot.c 3. InvalidatePossiblyObsoleteSlot + /* + * Raise an ERROR if the logical replication slot is invalidating. It + * would not happen because max_slot_wal_keep_size is set to -1 during + * the upgrade, but it stays safe. + */ + if (*invalidated && SlotIsLogical(s) && IsBinaryUpgrade) + elog(ERROR, "Replication slots must not be invalidated during the upgrade."); 3a. That comment didn't seem good. I think you mean like in the suggestion below. SUGGESTION It should not be possible for logical replication slots to be invalidated because max_slot_wal_keep_size is set to -1 during the upgrade. The following is just for sanity-checking. ~ 3b. I wasn't sure if 'max_slot_wal_keep_size' GUC is accessible in this scope, but if it is available then maybe Assert(max_slot_wal_keep_size_mb == -1); should also be included in this sanity check. ====== src/bin/pg_upgrade/check.c 4. check_new_cluster_logical_replication_slots + conn = connectToServer(&new_cluster, "template1"); + + prep_status("Checking for logical replication slots"); There is some inconsistency with all the subsequent pg_fatals within this function -- some of them mention "New cluster" but most of them do not. Meanwhile, Kuroda-san showed me sample output like: Checking for presence of required libraries ok Checking database user is the install user ok Checking for prepared transactions ok Checking for new cluster tablespace directories ok Checking for logical replication slots New cluster must not have logical replication slots but found 1 slot. Failure, exiting So, I felt the log message title ("Checking...") should be changed to include the words "new cluster" just like the log preceding it: "Checking for logical replication slots" ==> "Checking for new cluster logical replication slots" Now all the subsequent pg_fatals clearly are for "new cluster" ~ 5. check_new_cluster_logical_replication_slots + if (nslots_on_new) + pg_fatal(ngettext("New cluster must not have logical replication slots but found %d slot.", + "New cluster must not have logical replication slots but found %d slots.", + nslots_on_new), + nslots_on_new); 5a. TBH, I didn't see why you go to unnecessary trouble to have a plural message here. The message could just be like: "New cluster must have 0 logical replication slots but found %d." ~ 5b. However, now (from the previous review comment #4) if "New cluster" is already explicit in the log, the pg_fatal message can become just: "New cluster must have ..." ==> "Expected 0 logical replication slots but found %d." ~~~ 6. check_old_cluster_for_valid_slots + if (script) + { + fclose(script); + + pg_log(PG_REPORT, "fatal"); + pg_fatal("The source cluster contains one or more problematic logical replication slots.\n" + "The needed workaround depends on the problem.\n" + "1) If the problem is \"The slot is unusable,\" You can drop such replication slots.\n" + "2) If the problem is \"The slot has not consumed WALs yet,\" you can consume all remaining WALs.\n" + "Then, you can restart the upgrade.\n" + "A list of problematic logical replication slots is in the file:\n" + " %s", output_path); + } This needs fixing but I saw it has been updated in v35, so I'll check it there later. ====== src/bin/pg_upgrade/info.c 7. get_db_rel_and_slot_infos void get_db_rel_and_slot_infos(ClusterInfo *cluster) { int dbnum; if (cluster->dbarr.dbs != NULL) free_db_and_rel_infos(&cluster->dbarr); ~ Judging from the HEAD code this function was intended to be reentrant -- e.g. it does cleanup code free_db_and_rel_infos in case there was something there from before. IIUC there is no such cleanup for the slot_arr. I forget why this was removed. Sure, you might be able to survive the memory leaks, but choosing NOT to clean up the slot_arr seems to contradict the intention of HEAD calling free_db_and_rel_infos. ~~~ 8. get_db_infos I noticed the pg_malloc0 is reverted in this function. - dbinfos = (DbInfo *) pg_malloc(sizeof(DbInfo) * ntups); + dbinfos = (DbInfo *) pg_malloc0(sizeof(DbInfo) * ntups); IMO it is better to do pg_malloc0 here. Sure, everything probably works OK for the current code, but it seems unnecessarily risky to assume that functions will forever be called in a specific order. AFAICT if someone (e.g. for debugging) calls count_old_cluster_logical_slots() or calls print_slot_infos() then the behaviour is undefined because slot_arr.nslots remains uninitialized. ~~~ 9. get_old_cluster_logical_slot_infos + i_slotname = PQfnumber(res, "slot_name"); + i_plugin = PQfnumber(res, "plugin"); + i_twophase = PQfnumber(res, "two_phase"); + i_caught_up = PQfnumber(res, "caught_up"); + i_invalid = PQfnumber(res, "conflicting"); IMO SQL should be using an alias for this column, so you can say: i_invalid = PQfnumber(res, "invalid") which seems better than switching the wording in code. ====== src/bin/pg_upgrade/pg_upgrade.h 10. LogicalSlotInfo +typedef struct +{ + char *slotname; /* slot name */ + char *plugin; /* plugin */ + bool two_phase; /* can the slot decode 2PC? */ + bool caught_up; /* Is confirmed_flush_lsn the same as latest + * checkpoint LSN? */ + bool invalid; /* Is the slot usable? */ +} LogicalSlotInfo; ~ + bool invalid; /* Is the slot usable? */ This field name and comment have opposite meanings. Invalid means NOT usable. SUGGESTION /* If true, the slot is unusable. */ ====== src/bin/pg_upgrade/server.c 11. start_postmaster * we only modify the new cluster, so only use it there. If there is a * crash, the new cluster has to be recreated anyway. fsync=off is a big * win on ext4. + * + * Also, the max_slot_wal_keep_size is set to -1 to prevent the WAL removal + * required by logical slots. The setting could avoid the invalidation of + * slots during the upgrade. */ ~ IMO this comment "to prevent the WAL removal required by logical slots" is ambiguous about how it could be interpreted. Needs rearranging for clarity. ~ 12. start_postmaster (cluster == &new_cluster) ? - " -c synchronous_commit=off -c fsync=off -c full_page_writes=off" : "", + " -c synchronous_commit=off -c fsync=off -c full_page_writes=off -c max_slot_wal_keep_size=-1 " : + " -c max_slot_wal_keep_size=-1", Instead of putting the same option on both sides of the ternary, I was wondering if it might be better to hardwire the max_slot_wal_keep_size just 1 time in the format string? ====== .../pg_upgrade/t/003_logical_replication_slots.pl 13. # Remove the remained slot /remained/remaining/ ------ Kind Regards, Peter Smith. Fujitsu Australia
pgsql-hackers by date: