Re: pg_upgrade and logical replication - Mailing list pgsql-hackers
From | Michael Paquier |
---|---|
Subject | Re: pg_upgrade and logical replication |
Date | |
Msg-id | ZUyB5oa-jfF2xrZr@paquier.xyz Whole thread Raw |
In response to | Re: pg_upgrade and logical replication (Peter Smith <smithpb2250@gmail.com>) |
Responses |
Re: pg_upgrade and logical replication
|
List | pgsql-hackers |
On Thu, Nov 09, 2023 at 01:14:05PM +1100, Peter Smith wrote: > Looks like v12 accidentally forgot to update this to the modified > function name 'binary_upgrade_add_sub_rel_state' This v12 is overall cleaner than its predecessors. Nice to see. +my $result = $publisher->safe_psql('postgres', "SELECT count(*) FROM t1"); +is($result, qq(1), "check initial t1 table data on publisher"); +$result = $publisher->safe_psql('postgres', "SELECT count(*) FROM t2"); +is($result, qq(1), "check initial t1 table data on publisher"); +$result = $old_sub->safe_psql('postgres', "SELECT count(*) FROM t1"); +is($result, qq(1), "check initial t1 table data on the old subscriber"); +$result = $old_sub->safe_psql('postgres', "SELECT count(*) FROM t2"); I'd argue that t1 and t2 should have less generic names. t1 is used to check that the upgrade process works, while t2 is added to the publication after upgrading the subscriber. Say something like tab_upgraded or tab_not_upgraded? +my $synced_query = + "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r');"; Perhaps it would be safer to use a query that checks the number of relations in 'r' state? This query would return true if pg_subscription_rel has no tuples. +# Table will be in 'd' (data is being copied) state as table sync will fail +# because of primary key constraint error. +my $started_query = + "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = 'd';"; Relying on a pkey error to enforce an incorrect state is a good trick. Nice. +command_fails( + [ + 'pg_upgrade', '--no-sync', '-d', $old_sub->data_dir, + '-D', $new_sub->data_dir, '-b', $bindir, + '-B', $bindir, '-s', $new_sub->host, + '-p', $old_sub->port, '-P', $new_sub->port, + $mode, '--check', + ], + 'run of pg_upgrade --check for old instance with relation in \'d\' datasync(invalid) state'); +rmtree($new_sub->data_dir . "/pg_upgrade_output.d"); Okay by me to not stop the cluster for the --check to shave a few cycles. It's a bit sad that we don't cross-check the contents of subscription_state.txt before removing pg_upgrade_output.d. Finding the file is easy even if the subdir where it is included is not a constant name. Then it is possible to apply a regexp with the contents consumed by a slurp_file(). +my $remote_lsn = $old_sub->safe_psql('postgres', + "SELECT remote_lsn FROM pg_replication_origin_status"); Perhaps you've not noticed, but this would be 0/0 most of the time. However the intention is to check after a valid LSN to make sure that the origin is set, no? I am wondering whether this should use a bit more data than just one tuple, say at least two transaction, one of them with a multi-value INSERT? +# ------------------------------------------------------ +# Check that pg_upgrade is successful when all tables are in ready state. +# ------------------------------------------------------ This comment is a bit inconsistent with the state that are accepted, but why not, at least that's predictible. + * relation to pg_subscripion_rel table. This will be used only in Typo: s/pg_subscripion_rel/pg_subscription_rel/. This needs some word-smithing to explain the reasons why a state is not needed: + /* + * The subscription relation should be in either i (initialize), + * r (ready) or s (synchronized) state as either the replication slot + * is not created or the replication slot is already dropped and the + * required WAL files will be present in the publisher. The other + * states are not ok as the worker has dependency on the replication + * slot/origin in these case: A slot not created yet refers to the 'i' state, while 'r' and 's' refer to a slot created previously but already dropped, right? Shouldn't this comment tell that rather than mixing the assumptions? + * a) SUBREL_STATE_DATASYNC: In this case, the table sync worker will + * try to drop the replication slot but as the replication slots will + * be created with old subscription id in the publisher and the + * upgraded subscriber will not be able to clean the slots in this + * case. Proposal: A relation upgraded while in this state would retain a replication slot, which could not be dropped by the sync worker spawned after the upgrade because the subscription ID tracked by the publisher does not match anymore. Note: actually, this would be OK if we are able to keep the OIDs of the subscribers consistent across upgrades? I'm OK to not do nothing about that in this patch, to keep it simpler. Just asking in passing. + * b) SUBREL_STATE_FINISHEDCOPY: In this case, the tablesync worker will + * expect the origin to be already existing as the origin is created + * with an old subscription id, tablesync worker will not be able to + * find the origin in this case. Proposal: A tablesync worker spawned to work on a relation upgraded while in this state would expect an origin ID with the OID of the subscription used before the upgrade, causing it to fail. + "A list of problem subscriptions is in the file:\n" Sounds a bit strange, perhaps use an extra "the", as of "the problem subscriptions"? Could it be worth mentioning in the docs that one could also DISABLE the subscriptions before running the upgrade? + The replication origin entry corresponding to each of the subscriptions + should exist in the old cluster. This can be found by checking + <link linkend="catalog-pg-subscription">pg_subscription</link> and + <link linkend="catalog-pg-replication-origin">pg_replication_origin</link> + system tables. Hmm. No need to mention pg_replication_origin_status? If I may ask, how did you check that the given relation states were OK or not OK? Did you hardcode some wait points in tablesync.c up to where a state is updated in pg_subscription_rel, then shutdown the cluster before the upgrade to maintain the catalog in this state? Finally, after the upgrade, you've cross-checked the dependencies on the slots and origins to see that the spawned sync workers turned crazy because of the inconsistencies. Right? -- Michael
Attachment
pgsql-hackers by date: