RE: [PoC] pg_upgrade: allow to upgrade publisher node - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject RE: [PoC] pg_upgrade: allow to upgrade publisher node
Date
Msg-id TYAPR01MB5866A112A644DDF3D2E0D3E4F5E3A@TYAPR01MB5866.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: [PoC] pg_upgrade: allow to upgrade publisher node  (Peter Smith <smithpb2250@gmail.com>)
List pgsql-hackers
Dear Peter,

Thanks for reviewing! New patch could be available in [1].

> 1. check_for_lost_slots
> 
> +
> +/*
> + * Verify that all logical replication slots are usable.
> + */
> +void
> +check_for_lost_slots(ClusterInfo *cluster)
> 
> 1a.
> AFAIK we don't ever need to call this also for 'new_cluster'. So the
> function should have no parameter and just access 'old_cluster'
> directly.

Actually I have asked in previous post, and I understood you like the style.
Fixed. Also, get_logical_slot_infos() and count_logical_slots() are also called
only for old_cluster, then removed the argument.

> 1b.
> Can't this be a static function now?

Yeah, changed to static.

> 2.
> + for (i = 0; i < ntups; i++)
> + pg_log(PG_WARNING,
> +    "\nWARNING: logical replication slot \"%s\" is in 'lost' state.",
> +    PQgetvalue(res, i, i_slotname));
> 
> Is it correct that this message also includes the word "WARNING"?
> Other PG_WARNING messages don't do that.

create_script_for_old_cluster_deletion() has the word and I followed that:

```
pg_log(PG_WARNING,
       "\nWARNING:  new data directory should not be inside the old data directory, i.e. %s", old_cluster_pgdata);
```

> 3. check_for_confirmed_flush_lsn
> 
> +/*
> + * Verify that all logical replication slots consumed all WALs, except a
> + * CHECKPOINT_SHUTDOWN record.
> + */
> +static void
> +check_for_confirmed_flush_lsn(ClusterInfo *cluster)
> 
> AFAIK we don't ever need to call this also for 'new_cluster'. So the
> function should have no parameter and just access 'old_cluster'
> directly.

Removed.

> 4.
> + for (i = 0; i < ntups; i++)
> + pg_log(PG_WARNING,
> + "\nWARNING: logical replication slot \"%s\" has not consumed WALs yet",
> + PQgetvalue(res, i, i_slotname));
> 
> Is it correct that this message also includes the word "WARNING"?
> Other PG_WARNING messages don't do that.

See above reply, create_script_for_old_cluster_deletion() has that.

> src/bin/pg_upgrade/controldata.c
> 
> 5. get_control_data
> 
> + else if ((p = strstr(bufin, "Latest checkpoint location:")) != NULL)
> + {
> + /*
> + * Gather the latest checkpoint location if the cluster is PG17
> + * or later. This is used for upgrading logical replication
> + * slots.
> + */
> + if (GET_MAJOR_VERSION(cluster->major_version) >= 1700)
> 
> But we are not "gathering" anything. It's just one LSN. I think this
> ought to just say "Read the latest..."

Changed.

> 6.
> + /*
> + * The upper and lower part of LSN must be read separately
> + * because it is reported in %X/%X format.
> + */
> 
> /reported/stored as/

Changed.

> src/bin/pg_upgrade/pg_upgrade.h
> 
> 7.
> +void check_for_lost_slots(ClusterInfo *cluster);\
> 
> Why is this needed here? Can't this be a static function?

Removed.

> .../t/003_logical_replication_slots.pl
> 
> 8.
> +# 2. Consume WAL records to avoid another type of upgrade failure. It will be
> +# tested in subsequent cases.
> +$old_publisher->safe_psql('postgres',
> + "SELECT count(*) FROM pg_logical_slot_get_changes('test_slot1', NULL,
> NULL);"
> +);
> 
> I wondered if that step really needed. Why will there be WAL records to consume?
> 
> IIUC we haven't published anything yet.

The primal reason was described in [2], the reply for comment 10.
After creating 'test_slot1', another 'test_slot2' is also created, and the
function generates the RUNNING_XLOG record. The backtrace is as follows:

pg_create_logical_replication_slot
create_logical_replication_slot
CreateInitDecodingContext
ReplicationSlotReserveWal
LogStandbySnapshot
LogCurrentRunningXacts
XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);

check_for_confirmed_flush_lsn() detects the record and raises FATAL error before
checking GUC on new cluster.

> 9.
> +# ------------------------------
> +# TEST: Successful upgrade
> +
> +# Preparations for the subsequent test:
> +# 1. Remove the remained slot
> +$old_publisher->start;
> +$old_publisher->safe_psql('postgres',
> + "SELECT * FROM pg_drop_replication_slot('test_slot1');"
> +);
> 
> Should removal of the slot be done as part of the cleanup of the
> previous test, instead of preparing for this one?

Moved to cleanup part.

> 10.
> # 3. Disable the subscription once
> $subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION sub DISABLE");
> $old_publisher->stop;
> 
> 10a.
> What do you mean by "once"?

I added the word because the subscription would be enabled again.
But after considering more, I thought "Temporarily" seems better. Fixed.

> 10b.
> That old_publisher->stop; seems strangely placed. Why is it here?

We must shut down the cluster before doing pg_upgrade. Isn't it same as line 124?

```
# 2. Generate extra WAL records. Because these WAL records do not get consumed
#     it will cause the upcoming pg_upgrade test to fail.
$old_publisher->safe_psql('postgres',
    "CREATE TABLE tbl AS SELECT generate_series(1, 10) AS a;"
);
$old_publisher->stop;
```

> 11.
> # Check that the slot 'test_slot1' has migrated to the new cluster
> $new_publisher->start;
> my $result = $new_publisher->safe_psql('postgres',
> "SELECT slot_name, two_phase FROM pg_replication_slots");
> is($result, qq(sub|t), 'check the slot exists on new cluster');
> 
> ~
> 
> That comment now seems wrong. That slot was previously removed, right?

Yeah, it should be 'sub'. Changed.

> 12.
> # Update the connection
> my $new_connstr = $new_publisher->connstr . ' dbname=postgres';
> $subscriber->safe_psql('postgres',
> "ALTER SUBSCRIPTION sub CONNECTION '$new_connstr'");
> $subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION sub ENABLE");
> 
> ~
> 
> Maybe better to combine both SQL.

Combined. 

[1]:
https://www.postgresql.org/message-id/TYAPR01MB5866D7677BAE6F66839570FCF5E3A%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[2]:
https://www.postgresql.org/message-id/TYAPR01MB58668021BB233D129B466122F51CA%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



pgsql-hackers by date:

Previous
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: [PoC] pg_upgrade: allow to upgrade publisher node
Next
From: "Tristan Partin"
Date:
Subject: Re: psql --no-connect option