Dear hackers,
(CC: Amit and Julien)
This is a fork thread of Julien's thread, which allows to upgrade subscribers
without losing changes [1].
I briefly implemented a prototype for allowing to upgrade publisher node.
IIUC the key lack was that replication slots used for logical replication could
not be copied to new node by pg_upgrade command, so this patch allows that.
This feature can be used when '--include-replication-slot' is specified. Also,
I added a small test for the typical case. It may be helpful to understand.
Pg_upgrade internally executes pg_dump for dumping a database object from the old.
This feature follows this, adds a new option '--slot-only' to pg_dump command.
When specified, it extracts needed info from old node and generate an SQL file
that executes pg_create_logical_replication_slot().
The notable deference from pre-existing is that restoring slots are done at the
different time. Currently pg_upgrade works with following steps:
...
1. dump schema from old nodes
2. do pg_resetwal several times to new node
3. restore schema to new node
4. do pg_resetwal again to new node
...
The probem is that if we create replication slots at step 3, the restart_lsn and
confirmed_flush_lsn are set to current_wal_insert_lsn at that time, whereas
pg_resetwal discards the WAL file. Such slots cannot extracting changes.
To handle the issue the resotring is seprarated into two phases. At the first phase
restoring is done at step 3, excepts replicatin slots. At the second phase
replication slots are restored at step 5, after doing pg_resetwal.
Before upgrading a publisher node, all the changes gerenated on publisher must
be sent and applied on subscirber. This is because restart_lsn and confirmed_flush_lsn
of copied replication slots is same as current_wal_insert_lsn. New node resets
the information which WALs are really applied on subscriber and restart.
Basically it is not problematic because before shutting donw the publisher, its
walsender processes confirm all data is replicated. See WalSndDone() and related code.
Currently physical slots are ignored because this is out-of-scope for me.
I did not any analysis about it.
[1]: https://www.postgresql.org/message-id/flat/20230217075433.u5mjly4d5cr4hcfe%40jrouhaud
Best Regards,
Hayato Kuroda
FUJITSU LIMITED