Re: [PoC] pg_upgrade: allow to upgrade publisher node - Mailing list pgsql-hackers

From Julien Rouhaud
Subject Re: [PoC] pg_upgrade: allow to upgrade publisher node
Date
Msg-id 20230407024823.3j2s4doslsjemvis@jrouhaud
Whole thread Raw
In response to [PoC] pg_upgrade: allow to upgrade publisher node  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Responses RE: [PoC] pg_upgrade: allow to upgrade publisher node
List pgsql-hackers
Hi,

On Tue, Apr 04, 2023 at 07:00:01AM +0000, Hayato Kuroda (Fujitsu) wrote:
> Dear hackers,
> (CC: Amit and Julien)

(thanks for the Cc)

> This is a fork thread of Julien's thread, which allows to upgrade subscribers
> without losing changes [1].
>
> I briefly implemented a prototype for allowing to upgrade publisher node.
> IIUC the key lack was that replication slots used for logical replication could
> not be copied to new node by pg_upgrade command, so this patch allows that.
> This feature can be used when '--include-replication-slot' is specified. Also,
> I added a small test for the typical case. It may be helpful to understand.
>
> Pg_upgrade internally executes pg_dump for dumping a database object from the old.
> This feature follows this, adds a new option '--slot-only' to pg_dump command.
> When specified, it extracts needed info from old node and generate an SQL file
> that executes pg_create_logical_replication_slot().
>
> The notable deference from pre-existing is that restoring slots are done at the
> different time. Currently pg_upgrade works with following steps:
>
> ...
> 1. dump schema from old nodes
> 2. do pg_resetwal several times to new node
> 3. restore schema to new node
> 4. do pg_resetwal again to new node
> ...
>
> The probem is that if we create replication slots at step 3, the restart_lsn and
> confirmed_flush_lsn are set to current_wal_insert_lsn at that time, whereas
> pg_resetwal discards the WAL file. Such slots cannot extracting changes.
> To handle the issue the resotring is seprarated into two phases. At the first phase
> restoring is done at step 3, excepts replicatin slots. At the second phase
> replication slots are restored at step 5, after doing pg_resetwal.
>
> Before upgrading a publisher node, all the changes gerenated on publisher must
> be sent and applied on subscirber. This is because restart_lsn and confirmed_flush_lsn
> of copied replication slots is same as current_wal_insert_lsn. New node resets
> the information which WALs are really applied on subscriber and restart.
> Basically it is not problematic because before shutting donw the publisher, its
> walsender processes confirm all data is replicated. See WalSndDone() and related code.

As I mentioned in my original thread, I'm not very familiar with that code, but
I'm a bit worried about "all the changes generated on publisher must be send
and applied".  Is that a hard requirement for the feature to work reliably?  If
yes, how does this work if some subscriber node isn't connected when the
publisher node is stopped?  I guess you could add a check in pg_upgrade to make
sure that all logical slot are indeed caught up and fail if that's not the case
rather than assuming that a clean shutdown implies it.  It would be good to
cover that in the TAP test, and also cover some corner cases, like any new row
added on the publisher node after the pg_upgrade but before the subscriber is
reconnected is also replicated as expected.
>
> Currently physical slots are ignored because this is out-of-scope for me.
> I did not any analysis about it.

Agreed, but then shouldn't the option be named "--logical-slots-only" or
something like that, same for all internal function names?



pgsql-hackers by date:

Previous
From: Julien Rouhaud
Date:
Subject: Re: pg_upgrade and logical replication
Next
From: "houzj.fnst@fujitsu.com"
Date:
Subject: RE: Support logical replication of DDLs