pg_upgrade and logical replication - Mailing list pgsql-hackers
From | Julien Rouhaud |
---|---|
Subject | pg_upgrade and logical replication |
Date | |
Msg-id | 20230217075433.u5mjly4d5cr4hcfe@jrouhaud Whole thread Raw |
Responses |
Re: pg_upgrade and logical replication
|
List | pgsql-hackers |
Hi, I was working on testing a major upgrade scenario using a mix of physical and logical replication when I faced some unexpected problem leading to missing rows. Note that my motivation is to rely on physical replication / physical backup to avoid recreating a node from scratch using logical replication, as the initial sync with logical replication is much more costly and impacting compared to pg_basebackup / restoring a physical backup, but the same problem exist if you just pg_upgrade a node that has subscriptions. The problem is that pg_upgrade creates the subscriptions on the newly upgraded node using "WITH (connect = false)", which seems expected as you obviously don't want to try to connect to the publisher at that point. But then once the newly upgraded node is restarted and ready to replace the previous one, unless I'm missing something there's absolutely no possibility to use the created subscriptions without losing some data from the publisher. The reason is that the subscription doesn't have a local list of relation to process until you refresh the subscription, but you can't refresh the subscription without enabling it (and you can't enable it in a transaction), which means that you have to let the logical worker start, consume and ignore all changes that happened on the publisher side until the refresh happens. An easy workaround that I tried is to allow something like ALTER SUBSCRIPTION ... ENABLE WITH (refresh = true, copy_data = false) so that the refresh internally happens before the apply worker is started and you just keep consuming the delta, which works on naive scenario. One concern I have with this approach is that the default values for both "refresh" and "copy_data" for all other subcommands is "true, but we would probably need a different default value in that exact scenario (as we know we already have the data). I think that it would otherwise be safe in my very specific scenario, assuming that you created the slot beforehand and moved the slot's LSN at the promotion point, as even if you add non-empty tables to the publication you will only need the delta whether those were initially empty or not given your initial physical replica state. Any other scenario would make this new option dangerous, if not entirely useless, but not more than any of the current commands that lead to refreshing a subscription and have the same options I guess. All in all, currently the only way to somewhat safely resume logical replication after a pg_upgrade is to drop all the subscriptions that were transferred during pg_upgrade on all databases and recreate them (using the existing slots on the publisher side obviously), allowing the initial connection. But this approach only works in the exact scenario I mentioned (physical to logical replication, or at least a case where *all* the tables where logically replicated prior to the pg_ugprade), otherwise you have to recreate the follower node from scratch using logical repication. Is that indeed the current behavior, or did I miss something? Is this "resume logical replication on pg_upgraded node" something we want to support better? I was thinking that we could add a new pg_dump mode (maybe only usable during pg_upgrade) that also restores the pg_subscription_rel content in each subscription or something like that. If not, should pg_upgrade keep preserving the subscriptions as it doesn't seem safe to use them, or at least document the hazards (I didn't find anything about it in the documentation)?
pgsql-hackers by date: