We maintain c.50 logical replicas. Typically the producer version is 12 or 13,
and the subscriber version is 14. We intend to upgrade the subscribers to 15
using pg_upgrade. However, we ran into an unexpected problem with that
approach. I couldn't find much being mentioned about it on the web, so I'm
sending this message for a sanity check that what we are seeing is expected,
and, that the proposed solution make sense.
The problem we see is that after running pg_upgrade on the subscriber, details
about the subscription are lost; specifically, pg_subscription_rel is empty, and
the associated replication origin has no LSN information in
pg_replication_origin_status.
I found a thread on the hackers list that seems to call out this problem:
https://www.postgresql.org/message-id/flat/20230217075433.u5mjly4d5cr4hcfe%40jrouhaud
It is a very long thread. TLDR; Julien describes a similar problem to what we
have (although in our case, it is a "pure" logical replica with no consideration
about switching between physical replication, or needing to cut over, etc), and
a fix has gone in to PG17 that sorts this problem.
However, we can't go to 17 yet, so need a solution for 15 and 16. We are doing
the following sequence of events:
(1) Shutdown subscriber;
(2) Startup subscriber with max_logical_replication_workers set to 0;
(3) Copy the data in pg_replication_origin_status to a table;
(4) Shutdown subscriber;
(5) Run pg_upgrade;
(6) Startup subscriber with max_logical_replication_workers set to 0;
(7) Apply the LSNs recorded in (3) to the appropriate origin using
pg_replication_origin_advance;
(8) ENABLE, REFRESH PUBLICATION WITH (COPY_DATA=FALSE), & DISABLE the
subscription;
(9) Shutdown subscriber;
(10) Startup subscriber normally;
(11) ENABLE subscription.
Step 7 requires a mapping from old origin name to new origin name as the names
are derived from subscription OIDs. Steps 2 and 6 are required to ensure no
activity occurs that could advance LSNs.
Unlike in the mentioned thread, we know we won't have to deal with tables that
are partially replicated i.e we know all tables will be 'r'.
We have done some testing with this and it seems to work. We run a repeated
INSERT on the producer several times a second, and notice that after step 11,
we have all expected rows arrive on the subscriber without gaps. However, I
thought I'd ping this list for a sanity check that what we are doing makes
sense. And, if so, perhaps it'll be useful to other people in a similar
situation.
Comments / suggestions welcome.
-Joe