RE: [PoC] pg_upgrade: allow to upgrade publisher node - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject RE: [PoC] pg_upgrade: allow to upgrade publisher node
Date
Msg-id TYAPR01MB58668021BB233D129B466122F51CA@TYAPR01MB5866.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: [PoC] pg_upgrade: allow to upgrade publisher node  (Peter Smith <smithpb2250@gmail.com>)
List pgsql-hackers
Dear Peter,

Thanks for giving comments! New version will be available
in the upcoming post.

>
1. check_for_lost_slots

+ /* logical slots can be migrated since PG17. */
+ if (GET_MAJOR_VERSION(cluster->major_version) <= 1600)
+ return;

1a
Maybe the comment should start uppercase for consistency with others.
>

Seems right, but I revisit check_and_dump_old_cluster() again and found that
some version-specific checks are done outside the checking function.
So I started to follow the style and then the part is moved to
check_and_dump_old_cluster(). Also, version checking for new cluster is also
moved to check_new_cluster(). Is it OK for you?

>
1b.
IMO if you check < 1700 instead of <= 1600 it will be a better match with the comment.
>

Per suggestion from Amit, I used < 1700. Some other changes in 0002 were reverted.

>
2. check_for_lost_slots
+ for (i = 0; i < ntups; i++)
+ {
+ pg_log(PG_WARNING,
+   "\nWARNING: logical replication slot \"%s\" is in 'lost' state.",
+   PQgetvalue(res, i, i_slotname));
+ }
+
+

The braces {} are not needed anymore
>

Fixed. 

>
3. check_for_confirmed_flush_lsn

+ /* logical slots can be migrated since PG17. */
+ if (GET_MAJOR_VERSION(cluster->major_version) <= 1600)
+ return;

3a.
Maybe the comment should start uppercase for consistency with others.
>

Per reply for comment 1, the part was no longer needed.

>
3b.
IMO if you check < 1700 instead of <= 1600 it will be a better match with the comment.
>

Per suggestion from Amit, I used < 1700.

>
4. check_for_confirmed_flush_lsn
+ for (i = 0; i < ntups; i++)
+ {
+ pg_log(PG_WARNING,
+ "\nWARNING: logical replication slot \"%s\" has not consumed WALs yet",
+ PQgetvalue(res, i, i_slotname));
+ }
+

The braces {} are not needed anymore
>

Fixed.

>
5. get_control_data
+ /*
+ * Gather latest checkpoint location if the cluster is newer or
+ * equal to 17. This is used for upgrading logical replication
+ * slots.
+ */
+ if (GET_MAJOR_VERSION(cluster->major_version) >= 17)

5a.
/newer or equal to 17/PG17 or later/
>

Fixed.

>
5b.
>= 17 should be >= 1700
>

Per suggestion from Amit, I used < 1700.

>
6. get_control_data
+ {
+ char *slash = NULL;
+ uint64 upper_lsn, lower_lsn;
+
+ p = strchr(p, ':');
+
+ if (p == NULL || strlen(p) <= 1)
+ pg_fatal("%d: controldata retrieval problem", __LINE__);
+
+ p++; /* remove ':' char */
+
+ p = strpbrk(p, "01234567890ABCDEF");
+
+ /*
+ * Upper and lower part of LSN must be read separately
+ * because it is reported as %X/%X format.
+ */
+ upper_lsn = strtoul(p, &slash, 16);
+ lower_lsn = strtoul(++slash, NULL, 16);
+
+ /* And combine them */
+ cluster->controldata.chkpnt_latest =
+ (upper_lsn << 32) | lower_lsn;
+ }

Should 'upper_lsn' and 'lower_lsn' be declared as uint32? That seems a better mirror for LSN_FORMAT_ARGS.
>

Changed the definition to uint32, and a cast was added.

>
7. get_logical_slot_infos
+
+ /*
+ * Do additional checks if slots are found on the old node. If something is
+ * found on the new node, a subsequent function
+ * check_new_cluster_is_empty() would report the name of slots and raise a
+ * fatal error.
+ */
+ if (cluster == &old_cluster && slot_count)
+ {
+ check_for_lost_slots(cluster);
+
+ if (!live_check)
+ check_for_confirmed_flush_lsn(cluster);
+ }

It somehow doesn't feel right for these extra checks to be jammed into this function, just because you conveniently
havethe slot_count available.
 

On the NEW cluster side, there was extra checking in the check_new_cluster() function.

For consistency, I think this OLD cluster checking should be done in the check_and_dump_old_cluster() function -- see
the"Check for various failure cases" comment -- IMO this new fragment belongs there with the other checks.
 
>

All the checks were moved to check_and_dump_old_cluster(), and adds a check for its major version.

>
8.
  bool date_is_int;
  bool float8_pass_by_value;
  uint32 data_checksum_version;
+
+ XLogRecPtr chkpnt_latest;
 } ControlData;

I don't think the new field is particularly different from all the others that it needs a blank line separator.
 >

I removed the blank. Actually I wondered where the attribute should be, but kept at last.

>
9.
 # Initialize old node
 my $old_publisher = PostgreSQL::Test::Cluster->new('old_publisher');
 $old_publisher->init(allows_streaming => 'logical');
-$old_publisher->start;
 
 # Initialize new node
 my $new_publisher = PostgreSQL::Test::Cluster->new('new_publisher');
 $new_publisher->init(allows_streaming => 'replica');
 
-my $bindir = $new_publisher->config_data('--bindir');
+# Initialize subscriber node
+my $subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$subscriber->init(allows_streaming => 'logical');
 
-$old_publisher->stop;
+my $bindir = $new_publisher->config_data('--bindir');

~

Are those removal of the old_publisher start/stop changes that actually should be done in the 0002 patch?
>

Yes, It should be removed from 0002.

>
10.
 $old_publisher->safe_psql(
  'postgres', qq[
  SELECT pg_create_logical_replication_slot('test_slot2', 'test_decoding', false, true);
+ SELECT count(*) FROM pg_logical_slot_get_changes('test_slot1', NULL, NULL);
 ]);
 
~

What is the purpose of the added SELECT? It doesn't seem covered by the comment.
>

The SELECT statement is needed to trigger the failure caused by the insufficient
max_replication_slots. Checking on new cluster is started after old servers are
verified, so if the step is omitted, another error is reported:

```
Checking confirmed_flush_lsn for logical replication slots  
WARNING: logical replication slot "test_slot1" has not consumed WALs yet

One or more logical replication slots still have unconsumed WAL records.
```

I added a comment about it.

>
11.
# Remove an unnecessary slot and generate WALs. These records would not be
# consumed before doing pg_upgrade, so that the upcoming test would fail.
$old_publisher->start;
$old_publisher->safe_psql(
'postgres', qq[
SELECT pg_drop_replication_slot('test_slot2');
CREATE TABLE tbl AS SELECT generate_series(1, 10) AS a;
]);
$old_publisher->stop;

Minor rewording of comment sentence.

SUGGESTION
Because these WAL records do not get consumed it will cause the upcoming pg_upgrade test to fail.
>

Added.


>
12.
# Cause a failure at the start of pg_upgrade because the slot still have
# unconsumed WAL records

~

/still have/still has/
>

Fixed.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


pgsql-hackers by date:

Previous
From: "Zhijie Hou (Fujitsu)"
Date:
Subject: RE: subscription/015_stream sometimes breaks
Next
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: [PoC] pg_upgrade: allow to upgrade publisher node