Re: Impact of checkpointer during pg_upgrade - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Impact of checkpointer during pg_upgrade
Date
Msg-id CAA4eK1J8YGQCD9Z3665XPWqpQUmrHTNaPXJhmW0VbvGhtfUScg@mail.gmail.com
Whole thread Raw
In response to Re: Impact of checkpointer during pg_upgrade  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: Impact of checkpointer during pg_upgrade
List pgsql-hackers
On Mon, Sep 4, 2023 at 10:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Sep 4, 2023 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sat, Sep 2, 2023 at 6:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Sat, Sep 2, 2023 at 10:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > The other possibilities apart from not allowing an upgrade in such a
> > > > case could be (a) Before starting the old cluster, we fetch the slots
> > > > directly from the disk using some tool like [2] and make the decisions
> > > > based on that state;
> > >
> > > Okay, so IIUC along with dumping the slot data we also need to dump
> > > the latest checkpoint LSN because during upgrade we do check that the
> > > confirmed flush lsn for all the slots should be the same as the latest
> > > checkpoint.  Yeah but I think we could work this out.
> > >
> > We already have the latest checkpoint LSN information from
> > pg_controldata. I think we can use that as the patch proposed in the
> > thread [1] is doing now. Do you have something else in mind?
>
> I think I did not understood the complete proposal.  And what I meant
> is that if we dump the slot before we start the cluster thats fine.
> But then later after starting the old cluster if we run some query
> like we do in check_old_cluster_for_valid_slots() then thats not
> right, because there is gap between the status of the slots what we
> dumped before starting the cluster and what we are checking after the
> cluster, so there is not point of that check right?
>

That's right but if we do read slots from disk, we preserve those in
the memory and use that information instead of querying it again in
check_old_cluster_for_valid_slots().

> > >  (b) During the upgrade, we don't allow WAL to be
> > > > removed if it can invalidate slots; (c) Copy/Migrate the invalid slots
> > > > as well but for that, we need to expose an API to invalidate the
> > > > slots;
> > >
> > >  (d) somehow distinguish the slots that are invalidated during
> > > > an upgrade and then simply copy such slots because anyway we ensure
> > > > that all the WAL required by slot is sent before shutdown.
> > >
> > > Yeah this could also be an option, although we need to think the
> > > mechanism of distinguishing those slots looks clean and fit well with
> > > other architecture.
> > >
> >
> > If we want to do this we probably need to maintain a flag in the slot
> > indicating that it was invalidated during an upgrade and then use the
> > same flag in the upgrade to check the validity of slots. I think such
> > a flag needs to be maintained at the same level as
> > ReplicationSlotInvalidationCause to avoid any inconsistency among
> > those.
>
> I think we can do better, like we can just read the latest
> checkpoint's LSN before starting the old cluster.  And now while
> checking the slot can't we check if the the slot is invalidated then
> their confirmed_flush_lsn >= the latest_checkpoint_lsn we preserved
> before starting the cluster because if so then those slot might have
> got invalidated during the upgrade no?
>

Isn't that possible only if we update confirmend_flush LSN while
invalidating? Otherwise, how the check you are proposing can succeed?

> >
> > > Alternatively can't we just ignore all the invalidated slots and do
> > > not migrate them at all.  Because such scenarios are very rare that
> > > some of the segments are getting dropped just during the upgrade time
> > > and that too from the old cluster so in such cases not migrating the
> > > slots which are invalidated should be fine no?
> > >
> >
> > I also think that such a scenario would be very rare but are you
> > suggesting to ignore all invalidated slots or just the slots that got
> > invalidated during an upgrade? BTW, if we simply ignore invalidated
> > slots then users won't be able to drop corresponding subscriptions
> > after an upgrade. They need to first use the Alter Subscription
> > command to disassociate the slot (by using the command ALTER
> > SUBSCRIPTION ... SET (slot_name = NONE)) and then drop the
> > subscription similar to what we suggest in other cases as described in
> > the Replication Slot Management section in docs [2].
>
> Yeah I think thats not the best thing to do.
>

Right, but OTOH, there is an argument to just document it instead of
having additional complexities in the code as such cases should be
rare.

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: "Lepikhov Andrei"
Date:
Subject: Optimize planner memory consumption for huge arrays
Next
From: Will Mortensen
Date:
Subject: Re: Exposing the lock manager's WaitForLockers() to SQL