RE: [PoC] pg_upgrade: allow to upgrade publisher node - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: [PoC] pg_upgrade: allow to upgrade publisher node
Date
Msg-id OS0PR01MB571640E1B58741979A5E586594F7A@OS0PR01MB5716.jpnprd01.prod.outlook.com
Whole thread Raw
In response to RE: [PoC] pg_upgrade: allow to upgrade publisher node  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Responses Re: [PoC] pg_upgrade: allow to upgrade publisher node
List pgsql-hackers
On Wednesday, September 13, 2023 9:52 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> 
> Dear Amit,
> 
> Thank you for reviewing! Before making a patch I can reply the important point.
> 
> > 1. One thing to note is that if user checks whether the old cluster is
> > upgradable with --check option and then try to upgrade, that will also
> > fail. Because during the --check run there would at least one
> > additional shutdown checkpoint WAL and then in the next run the slots
> > position won't match. Note, I am saying this in context of using
> > --check option with not-running old cluster. Won't that be surprising
> > to users? One possibility is that we document such a behaviour and
> > other is that we go back to WAL reading design where we can ignore
> > known WAL records like shutdown checkpoint, XLOG_RUNNING_XACTS, etc.
> 
> Good catch, we have never considered the case that --check is executed for
> stopped cluster. You are right, the old cluster is turned on/off during the
> check and it generates SHUTDOWN_CHECKPOINT. This leads that
> confirmed_flush is
> behind the latest checkpoint lsn.
> 
> Here are other approaches we came up with:
> 
> 1. adds WARNING message when the --check is executed and slots are
> checked.
>    We can say like:
> 
> ```
> ...
> Checking for valid logical replication slots
> WARNING: this check generated WALs
> Next pg_uprade would fail.
> Please ensure again that all WALs are replicated.
> ...
> ```
> 
> 
> 2. adds hint message in the FATAL error when the confirmed_flush is not same
> as
>    the latest checkpoint:
> 
> ```
> ...
> Checking for valid logical replication slots                  fatal
> 
> Your installation contains invalid logical replication slots.
> These slots can't be copied, so this cluster cannot be upgraded.
> Consider removing such slots or consuming the pending WAL if any,
> and then restart the upgrade.
> If you did pg_upgrade --check before this run, it may be a cause.
> Please start clusters and confirm again that all changes are
> replicated.
> A list of invalid logical replication slots is in the file:
> ```
> 
> 3. requests users to do pg_upgrade --check on backup database, if old cluster
>    has logical slots. Basically they save a whole of cluster before doing
> pg_uprade,
>    so it may be acceptable. This is not a modification of codes.
> 

Here are some more ideas about the issue for reference.

1) Extending the controlfile.

We can dd a new field (e.g. non_upgrade_checkPoint) to record the last check point
ptr happened in non-upgrade mode. The new field won't be updated due to
"pg_upgrade --check", so pg_upgrade can use this LSN to compare with the slot's
confirmed_flush_lsn.

Pros: User can smoothly upgrade the cluster even if they run "pg_upgrade
--check" in advance.

Cons: Not sure if this is a enough reason to introduce new field in
controlfile.

-----------

2) Advance the slot's confirmed_flush_lsn in pg_upgrade if the check passes.

Introducing an upgrade support SQL function
(binary_upgrade_advance_logical_slot_lsn()) to set a
flag(catch_confirmed_lsn_up) on server side. On server side, when trying to
flush the slot in shutdown checkpoint(CheckPointReplicationSlots), we update
the slot's confirmed_flush_lsn to the lsn of the current checkpoint if
catch_confirmed_lsn_up is set.

Pros: User can smoothly upgrade the cluster even if they run "pg_upgrade
--check" in advance.

Cons: Although we have some examples for using functions
(binary_upgrade_set_next_pg_enum_oid ...) to set some variables during upgrade
, but not sure if it's a standard behavior to change the slot's lsn during
upgrade.

-----------

3) Introduce a new pg_upgrade option(e.g. skip_slot_check), and suggest if user
   already did the upgrade check for stopped server, they can use this option
   when trying to upgrade later.

Pros: Can save some efforts for user to advance each slot's lsn.

Cons: I didn't see similar options in pg_upgrade, might need some agreement.


Best Regards,
Hou zj

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: BufferUsage counters' values have changed
Next
From: Dilip Kumar
Date:
Subject: Re: [PoC] pg_upgrade: allow to upgrade publisher node