RE: [PoC] pg_upgrade: allow to upgrade publisher node - Mailing list pgsql-hackers
From | Hayato Kuroda (Fujitsu) |
---|---|
Subject | RE: [PoC] pg_upgrade: allow to upgrade publisher node |
Date | |
Msg-id | TYAPR01MB58668021BB233D129B466122F51CA@TYAPR01MB5866.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | Re: [PoC] pg_upgrade: allow to upgrade publisher node (Peter Smith <smithpb2250@gmail.com>) |
List | pgsql-hackers |
Dear Peter, Thanks for giving comments! New version will be available in the upcoming post. > 1. check_for_lost_slots + /* logical slots can be migrated since PG17. */ + if (GET_MAJOR_VERSION(cluster->major_version) <= 1600) + return; 1a Maybe the comment should start uppercase for consistency with others. > Seems right, but I revisit check_and_dump_old_cluster() again and found that some version-specific checks are done outside the checking function. So I started to follow the style and then the part is moved to check_and_dump_old_cluster(). Also, version checking for new cluster is also moved to check_new_cluster(). Is it OK for you? > 1b. IMO if you check < 1700 instead of <= 1600 it will be a better match with the comment. > Per suggestion from Amit, I used < 1700. Some other changes in 0002 were reverted. > 2. check_for_lost_slots + for (i = 0; i < ntups; i++) + { + pg_log(PG_WARNING, + "\nWARNING: logical replication slot \"%s\" is in 'lost' state.", + PQgetvalue(res, i, i_slotname)); + } + + The braces {} are not needed anymore > Fixed. > 3. check_for_confirmed_flush_lsn + /* logical slots can be migrated since PG17. */ + if (GET_MAJOR_VERSION(cluster->major_version) <= 1600) + return; 3a. Maybe the comment should start uppercase for consistency with others. > Per reply for comment 1, the part was no longer needed. > 3b. IMO if you check < 1700 instead of <= 1600 it will be a better match with the comment. > Per suggestion from Amit, I used < 1700. > 4. check_for_confirmed_flush_lsn + for (i = 0; i < ntups; i++) + { + pg_log(PG_WARNING, + "\nWARNING: logical replication slot \"%s\" has not consumed WALs yet", + PQgetvalue(res, i, i_slotname)); + } + The braces {} are not needed anymore > Fixed. > 5. get_control_data + /* + * Gather latest checkpoint location if the cluster is newer or + * equal to 17. This is used for upgrading logical replication + * slots. + */ + if (GET_MAJOR_VERSION(cluster->major_version) >= 17) 5a. /newer or equal to 17/PG17 or later/ > Fixed. > 5b. >= 17 should be >= 1700 > Per suggestion from Amit, I used < 1700. > 6. get_control_data + { + char *slash = NULL; + uint64 upper_lsn, lower_lsn; + + p = strchr(p, ':'); + + if (p == NULL || strlen(p) <= 1) + pg_fatal("%d: controldata retrieval problem", __LINE__); + + p++; /* remove ':' char */ + + p = strpbrk(p, "01234567890ABCDEF"); + + /* + * Upper and lower part of LSN must be read separately + * because it is reported as %X/%X format. + */ + upper_lsn = strtoul(p, &slash, 16); + lower_lsn = strtoul(++slash, NULL, 16); + + /* And combine them */ + cluster->controldata.chkpnt_latest = + (upper_lsn << 32) | lower_lsn; + } Should 'upper_lsn' and 'lower_lsn' be declared as uint32? That seems a better mirror for LSN_FORMAT_ARGS. > Changed the definition to uint32, and a cast was added. > 7. get_logical_slot_infos + + /* + * Do additional checks if slots are found on the old node. If something is + * found on the new node, a subsequent function + * check_new_cluster_is_empty() would report the name of slots and raise a + * fatal error. + */ + if (cluster == &old_cluster && slot_count) + { + check_for_lost_slots(cluster); + + if (!live_check) + check_for_confirmed_flush_lsn(cluster); + } It somehow doesn't feel right for these extra checks to be jammed into this function, just because you conveniently havethe slot_count available. On the NEW cluster side, there was extra checking in the check_new_cluster() function. For consistency, I think this OLD cluster checking should be done in the check_and_dump_old_cluster() function -- see the"Check for various failure cases" comment -- IMO this new fragment belongs there with the other checks. > All the checks were moved to check_and_dump_old_cluster(), and adds a check for its major version. > 8. bool date_is_int; bool float8_pass_by_value; uint32 data_checksum_version; + + XLogRecPtr chkpnt_latest; } ControlData; I don't think the new field is particularly different from all the others that it needs a blank line separator. > I removed the blank. Actually I wondered where the attribute should be, but kept at last. > 9. # Initialize old node my $old_publisher = PostgreSQL::Test::Cluster->new('old_publisher'); $old_publisher->init(allows_streaming => 'logical'); -$old_publisher->start; # Initialize new node my $new_publisher = PostgreSQL::Test::Cluster->new('new_publisher'); $new_publisher->init(allows_streaming => 'replica'); -my $bindir = $new_publisher->config_data('--bindir'); +# Initialize subscriber node +my $subscriber = PostgreSQL::Test::Cluster->new('subscriber'); +$subscriber->init(allows_streaming => 'logical'); -$old_publisher->stop; +my $bindir = $new_publisher->config_data('--bindir'); ~ Are those removal of the old_publisher start/stop changes that actually should be done in the 0002 patch? > Yes, It should be removed from 0002. > 10. $old_publisher->safe_psql( 'postgres', qq[ SELECT pg_create_logical_replication_slot('test_slot2', 'test_decoding', false, true); + SELECT count(*) FROM pg_logical_slot_get_changes('test_slot1', NULL, NULL); ]); ~ What is the purpose of the added SELECT? It doesn't seem covered by the comment. > The SELECT statement is needed to trigger the failure caused by the insufficient max_replication_slots. Checking on new cluster is started after old servers are verified, so if the step is omitted, another error is reported: ``` Checking confirmed_flush_lsn for logical replication slots WARNING: logical replication slot "test_slot1" has not consumed WALs yet One or more logical replication slots still have unconsumed WAL records. ``` I added a comment about it. > 11. # Remove an unnecessary slot and generate WALs. These records would not be # consumed before doing pg_upgrade, so that the upcoming test would fail. $old_publisher->start; $old_publisher->safe_psql( 'postgres', qq[ SELECT pg_drop_replication_slot('test_slot2'); CREATE TABLE tbl AS SELECT generate_series(1, 10) AS a; ]); $old_publisher->stop; Minor rewording of comment sentence. SUGGESTION Because these WAL records do not get consumed it will cause the upcoming pg_upgrade test to fail. > Added. > 12. # Cause a failure at the start of pg_upgrade because the slot still have # unconsumed WAL records ~ /still have/still has/ > Fixed. Best Regards, Hayato Kuroda FUJITSU LIMITED
pgsql-hackers by date: