Re: pg_upgrade and logical replication - Mailing list pgsql-hackers

From vignesh C
Subject Re: pg_upgrade and logical replication
Date
Msg-id CALDaNm21B2o9gEvfC7XS9W+NsXVFHupNPHFMc0_2oij=Jd5qtA@mail.gmail.com
In response to Re: pg_upgrade and logical replication  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On Mon, 13 Nov 2023 at 13:52, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Nov 10, 2023 at 07:26:18PM +0530, vignesh C wrote:
> > I did testing along the same lines that you mentioned. Apart from that,
> > I also reviewed the design where the old subscription id was used: as in
> > the case of table sync workers, the tablesync worker will do replication
> > using the replication slot and replication origin created with the old
> > subscription id. I also checked the impact of remote_lsn's.
> > For example: in the SUBREL_STATE_DATASYNC state we will try to drop the
> > replication slot once the worker is started, but since the slot was
> > created with the old subscription id, we will not be able to drop it,
> > creating a leak. A similar problem exists with
> > SUBREL_STATE_FINISHEDCOPY, where we will not be able to drop the origin
> > created with the old sub id.
>
> Yeah, I was playing a bit with these states and I can confirm that
> leaving around a DATASYNC relation in pg_subscription_rel during
> the upgrade would leave a slot on the publisher of the old cluster,
> which is no good.  It would be an option to explore later what could
> be improved, but I'm also looking forward to hearing from the users
> first, as what you have here may be enough for the basic purposes we
> are trying to cover.  Similarly, FINISHEDCOPY is not OK: I was able
> to get an origin lying around after an upgrade.
>
> Anyway, after a closer lookup, I think that your conclusions regarding
> the states that are allowed in the patch during the upgrade have some
> flaws.
>
> First, are you sure that SYNCDONE is OK to keep?  This catalog state
> is set in process_syncing_tables_for_sync(), and just after the code
> opens a transaction to clean up the tablesync slot, followed by a
> second transaction to clean up the origin.  However, imagine that
> there is a failure in dropping the slot, the origin, or just in
> transaction processing, cannot we finish in a state where the relation
> is marked as SYNCDONE in the catalog but still has an origin and/or a
> tablesync slot lying around?  Assuming that SYNCDONE is an OK state
> seems incorrect to me.  I am pretty sure that injecting an error in a
> code path after the slot is created would equally lead to an
> inconsistency.

There are a couple of things happening here: a) In the first part we
take care of setting the subscription relation to SYNCDONE and
dropping the replication slot at the publisher node. The relation
state is set to SYNCDONE only if dropping the replication slot
succeeds; if the drop fails, the relation state will still be
FINISHEDCOPY. So a failure while dropping the replication slot is not
an issue, because the relation will remain in the FINISHEDCOPY state,
and that state is not allowed for upgrade. Once the state is SYNCDONE,
the tablesync slot will no longer be present. b) In the second part we
drop the replication origin. Even if dropping the replication origin
fails for some reason, there will be no problem, because we do not
copy the tablesync replication origin to the new cluster while
upgrading. Since the tablesync replication origin is not copied to the
new cluster, there will be no replication origin leaks.
So I feel these issues do not arise in the SYNCDONE state.
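The ordering argument in (a) can be sketched like this. This is a
hypothetical Python model, not the actual C code in tablesync.c; the
function and state names here are illustrative only. The point is that
the slot drop and the catalog state change commit together, so a
failed drop leaves the relation in FINISHEDCOPY, which the upgrade
check rejects:

```python
# Illustrative state codes, mirroring pg_subscription_rel.srsubstate:
FINISHEDCOPY = "f"
SYNCDONE = "s"

def finish_sync(state, drop_slot):
    """Advance a relation from FINISHEDCOPY to SYNCDONE, mimicking the
    transaction in process_syncing_tables_for_sync(): the tablesync slot
    is dropped BEFORE the state is advanced, and a failure rolls both
    back (hypothetical model, not PostgreSQL source)."""
    assert state == FINISHEDCOPY
    try:
        drop_slot()          # drop the tablesync slot on the publisher
    except Exception:
        return FINISHEDCOPY  # "transaction" rolled back: state unchanged
    return SYNCDONE          # slot is gone; now safe to mark SYNCDONE

def failing_drop():
    raise RuntimeError("publisher unreachable")

# If the drop fails, the relation stays in FINISHEDCOPY (upgrade disallowed):
assert finish_sync(FINISHEDCOPY, failing_drop) == FINISHEDCOPY
# If the drop succeeds, reaching SYNCDONE implies the slot no longer exists:
assert finish_sync(FINISHEDCOPY, lambda: None) == SYNCDONE
```

So, under this model, SYNCDONE can only be observed after the slot is
already gone, which is why allowing SYNCDONE during the upgrade does
not leak a tablesync slot.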

Regards,
Vignesh


