Thread: Impact of checkpointer during pg_upgrade

Impact of checkpointer during pg_upgrade

From
Amit Kapila
Date:
During pg_upgrade, we start the server for the old cluster, which can
allow the checkpointer to remove WAL files. It has been noticed
that we do generate certain types of WAL records (e.g.,
XLOG_RUNNING_XACTS, XLOG_CHECKPOINT_ONLINE, and XLOG_FPI_FOR_HINT)
even during pg_upgrade of the old cluster, so the additional WAL
records could let the checkpointer decide that certain WAL segments
can be removed (e.g., say the WAL size crosses max_slot_wal_keep_size)
and invalidate the slots. Currently, I can't see any problem with
this, but for future work where we want to migrate logical slots
during an upgrade [1], we need to decide what to do in such cases. The
initial idea we had was that if the old cluster has some invalid
slots, we won't allow an upgrade unless the user removes such slots or
uses some option like --exclude-slots. However, it is quite possible
that slots get invalidated during pg_upgrade itself because there is
no user activity to advance them. Even though the possibility of this
is low, I think it is worth considering what the behavior should be.

The other possibilities, apart from not allowing an upgrade in such a
case, could be: (a) before starting the old cluster, we fetch the
slots directly from disk using some tool like [2] and make the
decisions based on that state; (b) during the upgrade, we don't allow
WAL to be removed if removing it can invalidate slots; (c)
copy/migrate the invalid slots as well, but for that, we need to
expose an API to invalidate the slots; (d) somehow distinguish the
slots that are invalidated during an upgrade and then simply copy such
slots, because anyway we ensure that all the WAL required by a slot
has been sent before shutdown.

Thoughts?

[1] -
https://www.postgresql.org/message-id/TYAPR01MB58664C81887B3AF2EB6B16E3F5939%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[2] -
https://www.postgresql.org/message-id/flat/CALj2ACW0rV5gWK8A3m6_X62qH%2BVfaq5hznC%3Di0R5Wojt5%2Byhyw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: Impact of checkpointer during pg_upgrade

From
Dilip Kumar
Date:
On Sat, Sep 2, 2023 at 10:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> During pg_upgrade, we start the server for the old cluster which can
> allow the checkpointer to remove the WAL files. It has been noticed
> that we do generate certain types of WAL records (e.g
> XLOG_RUNNING_XACTS, XLOG_CHECKPOINT_ONLINE, and XLOG_FPI_FOR_HINT)
> even during pg_upgrade for old cluster, so additional WAL records
> could let checkpointer decide that certain WAL segments can be removed
> (e.g. say wal size crosses max_slot_wal_keep_size_mb) and invalidate
> the slots. Currently, I can't see any problem with this but for future
> work where we want to migrate logical slots during an upgrade[1], we
> need to decide what to do for such cases. The initial idea we had was
> that if the old cluster has some invalid slots, we won't allow an
> upgrade unless the user removes such slots or uses some option like
> --exclude-slots. It is quite possible that slots got invalidated
> during pg_upgrade due to no user activity. Now, even though the
> possibility of the same is less I think it is worth considering what
> should be the behavior.

Right

> The other possibilities apart from not allowing an upgrade in such a
> case could be (a) Before starting the old cluster, we fetch the slots
> directly from the disk using some tool like [2] and make the decisions
> based on that state;

Okay, so IIUC, along with dumping the slot data we would also need to
dump the latest checkpoint LSN, because during the upgrade we check
that the confirmed_flush LSN of all the slots is the same as the
latest checkpoint location.  Yeah, but I think we could work this out.
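
To make it concrete, the shape of that check would be something like
the sketch below (rough and untested, not the actual patch;
LogicalSlotInfo and check_slots_caught_up are made-up names, LSNs are
shown as plain 64-bit values, and latest_chkpt_lsn stands for the
"Latest checkpoint location" taken from pg_controldata):

/* Rough sketch only; not the actual patch. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct LogicalSlotInfo  /* hypothetical: filled from pg_replication_slots */
{
    const char *slotname;
    uint64_t    confirmed_flush;    /* slot's confirmed_flush_lsn as a 64-bit value */
} LogicalSlotInfo;

/*
 * Fail the upgrade if any logical slot has not confirmed all WAL up to
 * the old cluster's latest (shutdown) checkpoint location.
 */
static void
check_slots_caught_up(const LogicalSlotInfo *slots, int nslots,
                      uint64_t latest_chkpt_lsn)
{
    for (int i = 0; i < nslots; i++)
    {
        if (slots[i].confirmed_flush != latest_chkpt_lsn)
        {
            fprintf(stderr,
                    "logical slot \"%s\" has not consumed WAL up to the shutdown checkpoint\n",
                    slots[i].slotname);
            exit(1);
        }
    }
}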

 (b) During the upgrade, we don't allow WAL to be
> removed if it can invalidate slots; (c) Copy/Migrate the invalid slots
> as well but for that, we need to expose an API to invalidate the
> slots;

 (d) somehow distinguish the slots that are invalidated during
> an upgrade and then simply copy such slots because anyway we ensure
> that all the WAL required by slot is sent before shutdown.

Yeah, this could also be an option, although we need to make sure that
the mechanism for distinguishing those slots looks clean and fits well
with the rest of the architecture.

Alternatively, can't we just ignore all the invalidated slots and not
migrate them at all?  Such scenarios, where some of the segments get
dropped from the old cluster just during the upgrade, are very rare,
so in such cases not migrating the slots which are invalidated should
be fine, no?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Impact of checkpointer during pg_upgrade

From
Amit Kapila
Date:
On Sat, Sep 2, 2023 at 6:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Sep 2, 2023 at 10:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > The other possibilities apart from not allowing an upgrade in such a
> > case could be (a) Before starting the old cluster, we fetch the slots
> > directly from the disk using some tool like [2] and make the decisions
> > based on that state;
>
> Okay, so IIUC along with dumping the slot data we also need to dump
> the latest checkpoint LSN because during upgrade we do check that the
> confirmed flush lsn for all the slots should be the same as the latest
> checkpoint.  Yeah but I think we could work this out.
>

We already have the latest checkpoint LSN information from
pg_controldata. I think we can use that as the patch proposed in the
thread [1] is doing now. Do you have something else in mind?

>  (b) During the upgrade, we don't allow WAL to be
> > removed if it can invalidate slots; (c) Copy/Migrate the invalid slots
> > as well but for that, we need to expose an API to invalidate the
> > slots;
>
>  (d) somehow distinguish the slots that are invalidated during
> > an upgrade and then simply copy such slots because anyway we ensure
> > that all the WAL required by slot is sent before shutdown.
>
> Yeah this could also be an option, although we need to think the
> mechanism of distinguishing those slots looks clean and fit well with
> other architecture.
>

If we want to do this, we probably need to maintain a flag in the slot
indicating that it was invalidated during an upgrade and then use that
flag in the upgrade to check the validity of slots. I think such
a flag needs to be maintained at the same level as
ReplicationSlotInvalidationCause to avoid any inconsistency between
the two.
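
To illustrate, I am thinking of something at the level of the existing
enum in replication/slot.h (a rough sketch only; RS_INVAL_UPGRADE is a
hypothetical name, not a concrete proposal):

/*
 * Rough sketch: the existing invalidation causes plus one hypothetical
 * member, so that an invalidation happening in binary upgrade mode can
 * be told apart from a regular WAL-removal invalidation.
 */
typedef enum ReplicationSlotInvalidationCause
{
    RS_INVAL_NONE,
    RS_INVAL_WAL_REMOVED,       /* required WAL has been removed */
    RS_INVAL_HORIZON,           /* required rows have been removed */
    RS_INVAL_WAL_LEVEL,         /* wal_level insufficient for slot */
    RS_INVAL_UPGRADE            /* hypothetical: invalidated during binary upgrade */
} ReplicationSlotInvalidationCause;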

> Alternatively can't we just ignore all the invalidated slots and do
> not migrate them at all.  Because such scenarios are very rare that
> some of the segments are getting dropped just during the upgrade time
> and that too from the old cluster so in such cases not migrating the
> slots which are invalidated should be fine no?
>

I also think that such a scenario would be very rare, but are you
suggesting we ignore all invalidated slots or just the slots that got
invalidated during an upgrade? BTW, if we simply ignore invalidated
slots then users won't be able to drop the corresponding subscriptions
after an upgrade. They would need to first use ALTER SUBSCRIPTION to
disassociate the slot (ALTER SUBSCRIPTION ... SET (slot_name = NONE))
and then drop the subscription, similar to what we suggest in other
cases, as described in the Replication Slot Management section of the
docs [2]. Also, if users really want to continue that subscription by
syncing the corresponding tables, they can recreate the slots manually
and then continue with replication. So, if we want to do this, we
would just rely on the state of the slots at the time we query for
them in the old cluster, and even if they later get invalidated during
the upgrade, we would just ignore such invalidations, as the required
WAL has already been copied anyway.


[1] -
https://www.postgresql.org/message-id/TYAPR01MB58664C81887B3AF2EB6B16E3F5939%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[2] - https://www.postgresql.org/docs/devel/logical-replication-subscription.html

--
With Regards,
Amit Kapila.



Re: Impact of checkpointer during pg_upgrade

From
Dilip Kumar
Date:
On Mon, Sep 4, 2023 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Sep 2, 2023 at 6:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sat, Sep 2, 2023 at 10:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > The other possibilities apart from not allowing an upgrade in such a
> > > case could be (a) Before starting the old cluster, we fetch the slots
> > > directly from the disk using some tool like [2] and make the decisions
> > > based on that state;
> >
> > Okay, so IIUC along with dumping the slot data we also need to dump
> > the latest checkpoint LSN because during upgrade we do check that the
> > confirmed flush lsn for all the slots should be the same as the latest
> > checkpoint.  Yeah but I think we could work this out.
> >
> We already have the latest checkpoint LSN information from
> pg_controldata. I think we can use that as the patch proposed in the
> thread [1] is doing now. Do you have something else in mind?

I think I did not understand the complete proposal.  What I meant
is that if we dump the slots before we start the cluster, that's fine.
But if, later, after starting the old cluster, we run some query
like we do in check_old_cluster_for_valid_slots(), then that's not
right, because there is a gap between the state of the slots we
dumped before starting the cluster and what we are checking after
starting it, so there is no point in that check, right?

> >  (b) During the upgrade, we don't allow WAL to be
> > > removed if it can invalidate slots; (c) Copy/Migrate the invalid slots
> > > as well but for that, we need to expose an API to invalidate the
> > > slots;
> >
> >  (d) somehow distinguish the slots that are invalidated during
> > > an upgrade and then simply copy such slots because anyway we ensure
> > > that all the WAL required by slot is sent before shutdown.
> >
> > Yeah this could also be an option, although we need to think the
> > mechanism of distinguishing those slots looks clean and fit well with
> > other architecture.
> >
>
> If we want to do this we probably need to maintain a flag in the slot
> indicating that it was invalidated during an upgrade and then use the
> same flag in the upgrade to check the validity of slots. I think such
> a flag needs to be maintained at the same level as
> ReplicationSlotInvalidationCause to avoid any inconsistency among
> those.

I think we can do better: we can just read the latest
checkpoint's LSN before starting the old cluster.  Then, while
checking a slot, if the slot is invalidated, can't we check whether
its confirmed_flush_lsn >= the latest_checkpoint_lsn we preserved
before starting the cluster?  If so, that slot might have
got invalidated during the upgrade, no?
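
Something like this is what I am thinking of (just a sketch of the
idea; SlotState and invalidated_during_upgrade are made-up names, LSNs
are shown as plain 64-bit values, and pre_start_chkpt_lsn is the
shutdown checkpoint LSN captured from pg_controldata before the old
cluster is started):

/* Sketch only; names are placeholders. */
#include <stdbool.h>
#include <stdint.h>

typedef struct SlotState
{
    const char *slotname;
    bool        invalidated;        /* slot is currently marked invalid */
    uint64_t    confirmed_flush;    /* slot's confirmed_flush_lsn */
} SlotState;

/*
 * If an invalidated slot has confirmed WAL up to (or beyond) the shutdown
 * checkpoint taken before the old cluster was started, the invalidation
 * can only have happened after the cluster was started for the upgrade.
 */
static bool
invalidated_during_upgrade(const SlotState *slot, uint64_t pre_start_chkpt_lsn)
{
    return slot->invalidated &&
           slot->confirmed_flush >= pre_start_chkpt_lsn;
}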

>
> > Alternatively can't we just ignore all the invalidated slots and do
> > not migrate them at all.  Because such scenarios are very rare that
> > some of the segments are getting dropped just during the upgrade time
> > and that too from the old cluster so in such cases not migrating the
> > slots which are invalidated should be fine no?
> >
>
> I also think that such a scenario would be very rare but are you
> suggesting to ignore all invalidated slots or just the slots that got
> invalidated during an upgrade? BTW, if we simply ignore invalidated
> slots then users won't be able to drop corresponding subscriptions
> after an upgrade. They need to first use the Alter Subscription
> command to disassociate the slot (by using the command ALTER
> SUBSCRIPTION ... SET (slot_name = NONE)) and then drop the
> subscription similar to what we suggest in other cases as described in
> the Replication Slot Management section in docs [2].

Yeah, I think that's not the best thing to do.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Impact of checkpointer during pg_upgrade

From
Amit Kapila
Date:
On Mon, Sep 4, 2023 at 10:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Sep 4, 2023 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sat, Sep 2, 2023 at 6:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Sat, Sep 2, 2023 at 10:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > The other possibilities apart from not allowing an upgrade in such a
> > > > case could be (a) Before starting the old cluster, we fetch the slots
> > > > directly from the disk using some tool like [2] and make the decisions
> > > > based on that state;
> > >
> > > Okay, so IIUC along with dumping the slot data we also need to dump
> > > the latest checkpoint LSN because during upgrade we do check that the
> > > confirmed flush lsn for all the slots should be the same as the latest
> > > checkpoint.  Yeah but I think we could work this out.
> > >
> > We already have the latest checkpoint LSN information from
> > pg_controldata. I think we can use that as the patch proposed in the
> > thread [1] is doing now. Do you have something else in mind?
>
> I think I did not understood the complete proposal.  And what I meant
> is that if we dump the slot before we start the cluster thats fine.
> But then later after starting the old cluster if we run some query
> like we do in check_old_cluster_for_valid_slots() then thats not
> right, because there is gap between the status of the slots what we
> dumped before starting the cluster and what we are checking after the
> cluster, so there is not point of that check right?
>

That's right, but if we do read the slots from disk, we preserve those
in memory and use that information instead of querying it again in
check_old_cluster_for_valid_slots().
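
In other words, something like a small in-memory snapshot captured
before the server is started (sketch only; the struct and field names
are placeholders):

/* Sketch only; names are placeholders. */
#include <stdbool.h>
#include <stdint.h>

typedef struct LogicalSlotSnapshot
{
    char        slotname[64];       /* NAMEDATALEN-sized slot name */
    uint64_t    confirmed_flush;    /* confirmed_flush_lsn read from disk */
    bool        invalidated;        /* state at the time of the snapshot */
} LogicalSlotSnapshot;

typedef struct LogicalSlotSnapshotArr
{
    int                  nslots;
    LogicalSlotSnapshot *slots;     /* captured before the old cluster starts */
} LogicalSlotSnapshotArr;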

> > >  (b) During the upgrade, we don't allow WAL to be
> > > > removed if it can invalidate slots; (c) Copy/Migrate the invalid slots
> > > > as well but for that, we need to expose an API to invalidate the
> > > > slots;
> > >
> > >  (d) somehow distinguish the slots that are invalidated during
> > > > an upgrade and then simply copy such slots because anyway we ensure
> > > > that all the WAL required by slot is sent before shutdown.
> > >
> > > Yeah this could also be an option, although we need to think the
> > > mechanism of distinguishing those slots looks clean and fit well with
> > > other architecture.
> > >
> >
> > If we want to do this we probably need to maintain a flag in the slot
> > indicating that it was invalidated during an upgrade and then use the
> > same flag in the upgrade to check the validity of slots. I think such
> > a flag needs to be maintained at the same level as
> > ReplicationSlotInvalidationCause to avoid any inconsistency among
> > those.
>
> I think we can do better, like we can just read the latest
> checkpoint's LSN before starting the old cluster.  And now while
> checking the slot can't we check if the the slot is invalidated then
> their confirmed_flush_lsn >= the latest_checkpoint_lsn we preserved
> before starting the cluster because if so then those slot might have
> got invalidated during the upgrade no?
>

Isn't that possible only if we update the confirmed_flush LSN while
invalidating? Otherwise, how can the check you are proposing succeed?

> >
> > > Alternatively can't we just ignore all the invalidated slots and do
> > > not migrate them at all.  Because such scenarios are very rare that
> > > some of the segments are getting dropped just during the upgrade time
> > > and that too from the old cluster so in such cases not migrating the
> > > slots which are invalidated should be fine no?
> > >
> >
> > I also think that such a scenario would be very rare but are you
> > suggesting to ignore all invalidated slots or just the slots that got
> > invalidated during an upgrade? BTW, if we simply ignore invalidated
> > slots then users won't be able to drop corresponding subscriptions
> > after an upgrade. They need to first use the Alter Subscription
> > command to disassociate the slot (by using the command ALTER
> > SUBSCRIPTION ... SET (slot_name = NONE)) and then drop the
> > subscription similar to what we suggest in other cases as described in
> > the Replication Slot Management section in docs [2].
>
> Yeah I think thats not the best thing to do.
>

Right, but OTOH, there is an argument for just documenting it instead
of having additional complexity in the code, as such cases should be
rare.

--
With Regards,
Amit Kapila.



Re: Impact of checkpointer during pg_upgrade

From
Dilip Kumar
Date:
On Mon, Sep 4, 2023 at 11:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 4, 2023 at 10:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Sep 4, 2023 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Sat, Sep 2, 2023 at 6:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Sat, Sep 2, 2023 at 10:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > > The other possibilities apart from not allowing an upgrade in such a
> > > > > case could be (a) Before starting the old cluster, we fetch the slots
> > > > > directly from the disk using some tool like [2] and make the decisions
> > > > > based on that state;
> > > >
> > > > Okay, so IIUC along with dumping the slot data we also need to dump
> > > > the latest checkpoint LSN because during upgrade we do check that the
> > > > confirmed flush lsn for all the slots should be the same as the latest
> > > > checkpoint.  Yeah but I think we could work this out.
> > > >
> > > We already have the latest checkpoint LSN information from
> > > pg_controldata. I think we can use that as the patch proposed in the
> > > thread [1] is doing now. Do you have something else in mind?
> >
> > I think I did not understood the complete proposal.  And what I meant
> > is that if we dump the slot before we start the cluster thats fine.
> > But then later after starting the old cluster if we run some query
> > like we do in check_old_cluster_for_valid_slots() then thats not
> > right, because there is gap between the status of the slots what we
> > dumped before starting the cluster and what we are checking after the
> > cluster, so there is not point of that check right?
> >
>
> That's right but if we do read slots from disk, we preserve those in
> the memory and use that information instead of querying it again in
> check_old_cluster_for_valid_slots().
>
> > > >  (b) During the upgrade, we don't allow WAL to be
> > > > > removed if it can invalidate slots; (c) Copy/Migrate the invalid slots
> > > > > as well but for that, we need to expose an API to invalidate the
> > > > > slots;
> > > >
> > > >  (d) somehow distinguish the slots that are invalidated during
> > > > > an upgrade and then simply copy such slots because anyway we ensure
> > > > > that all the WAL required by slot is sent before shutdown.
> > > >
> > > > Yeah this could also be an option, although we need to think the
> > > > mechanism of distinguishing those slots looks clean and fit well with
> > > > other architecture.
> > > >
> > >
> > > If we want to do this we probably need to maintain a flag in the slot
> > > indicating that it was invalidated during an upgrade and then use the
> > > same flag in the upgrade to check the validity of slots. I think such
> > > a flag needs to be maintained at the same level as
> > > ReplicationSlotInvalidationCause to avoid any inconsistency among
> > > those.
> >
> > I think we can do better, like we can just read the latest
> > checkpoint's LSN before starting the old cluster.  And now while
> > checking the slot can't we check if the the slot is invalidated then
> > their confirmed_flush_lsn >= the latest_checkpoint_lsn we preserved
> > before starting the cluster because if so then those slot might have
> > got invalidated during the upgrade no?
> >
>
> Isn't that possible only if we update confirmend_flush LSN while
> invalidating? Otherwise, how the check you are proposing can succeed?

I am not suggesting we compare the confirmed_flush_lsn to the latest
checkpoint LSN.  Instead, I am suggesting that before starting the
cluster we record the location of the latest checkpoint, which should
be the shutdown checkpoint LSN.  Currently, in [1], we already check
that the confirmed flush LSN should be equal to the latest checkpoint
LSN.  So the only problem is that after we restart the cluster during
the upgrade, we might invalidate some slots which are perfectly fine
to migrate, and we want to identify those slots.  If we know the LSN
of the shutdown checkpoint before the cluster was started, then we can
perform an additional check on all the invalidated slots: if their
confirmed flush LSN >= the shutdown checkpoint LSN we preserved before
restarting the cluster (not the latest checkpoint LSN at the time of
the check), then those slots got invalidated only after we started the
cluster for the upgrade.  Is there any loophole in this theory?  It is
based on the assumption that the confirmed flush LSN does not move
forward for already-invalidated slots, which means a slot which got
invalidated before we shut down for the upgrade will have a confirmed
flush LSN value < the shutdown checkpoint, and slots which got
invalidated during the upgrade will have a confirmed flush LSN at
least equal to the shutdown checkpoint.

[1]
https://www.postgresql.org/message-id/TYAPR01MB5866F7D8ED15BA1E8E4A2AB0F5E4A%40TYAPR01MB5866.jpnprd01.prod.outlook.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Impact of checkpointer during pg_upgrade

From
Dilip Kumar
Date:
On Mon, Sep 4, 2023 at 1:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > > I think we can do better, like we can just read the latest
> > > checkpoint's LSN before starting the old cluster.  And now while
> > > checking the slot can't we check if the the slot is invalidated then
> > > their confirmed_flush_lsn >= the latest_checkpoint_lsn we preserved
> > > before starting the cluster because if so then those slot might have
> > > got invalidated during the upgrade no?
> > >
> >
> > Isn't that possible only if we update confirmend_flush LSN while
> > invalidating? Otherwise, how the check you are proposing can succeed?
>
> I am not suggesting to compare the confirmend_flush_lsn to the latest
> checkpoint LSN instead I am suggesting that before starting the
> cluster we get the location of the latest checkpoint LSN that should
> be the shutdown checkpoint LSN.  So now also in [1] we check that
> confirmed flush lsn should be equal to the latest checkpoint lsn.   So
> the only problem is that after we restart the cluster during the
> upgrade we might invalidate some of the slots which are perfectly fine
> to migrate and we want to identify those slots.  So if we know the the
> LSN of the shutdown checkpoint before the cluster started then we can
> perform a additional checks on all the invalidated slots that their
> confirmed lsn >= shutdown checkpoint lsn we preserved before
> restarting the cluster (not the latest checkpoint lsn) then those
> slots got invalidated only after we started the cluster for upgrade?
> Is there any loophole in this theory?   This theory is based on the
> assumption that the confirmed flush lsn are not moving forward for the
> already invalidated slots that means the slot which got invalidated
> before we shutdown for upgrade will have confirm flush lsn value <
> shutdown checkpoint and the slots which got invalidated during the
> upgrade will have confirm flush lsn at least equal to the shutdown
> checkpoint.

That said, there is a possibility that some of the slots which got
invalidated at an earlier checkpoint could end up with the same LSN as
a slot which got invalidated later, if there is no activity between
those two checkpoints. So if we go with this approach, there is some
risk of migrating slots which were already invalidated even before the
shutdown checkpoint.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Impact of checkpointer during pg_upgrade

From
Amit Kapila
Date:
On Mon, Sep 4, 2023 at 4:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Said that there is a possibility that some of the slots which got
> invalidated even on the previous checkpoint might get the same LSN as
> the slot which got invalidated later if there is no activity between
> these two checkpoints. So if we go with this approach then there is
> some risk of migrating some of the slots which were already
> invalidated even before the shutdown checkpoint.
>

I think even during the shutdown checkpoint, after writing the
shutdown checkpoint WAL record, we can invalidate some slots that in
theory are safe to migrate/copy because all the WAL for those slots
would also have been sent. So, those would be similar to what we
invalidate during the upgrade, no? If so, I think it is better to have
the same behavior for invalidated slots irrespective of when they got
invalidated. We can either give an error for such slots during the
upgrade (which means disallowing the upgrade) or simply ignore such
slots during the upgrade. I would prefer ERROR, but if we want to
ignore such slots, we can probably inform the user in some way about
the ignored slots, so that she can later drop the corresponding
subscriptions or recreate such slots and do the required sync-up to
continue the replication.

--
With Regards,
Amit Kapila.



Re: Impact of checkpointer during pg_upgrade

From
Dilip Kumar
Date:
On Tue, Sep 5, 2023 at 9:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 4, 2023 at 4:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Said that there is a possibility that some of the slots which got
> > invalidated even on the previous checkpoint might get the same LSN as
> > the slot which got invalidated later if there is no activity between
> > these two checkpoints. So if we go with this approach then there is
> > some risk of migrating some of the slots which were already
> > invalidated even before the shutdown checkpoint.
> >
>
> I think even during the shutdown checkpoint, after writing shutdown
> checkpoint WAL, we can invalidate some slots that in theory are safe
> to migrate/copy because all the WAL for those slots would also have
> been sent. So, those would be similar to what we invalidate during the
> upgrade, no?

That's correct.

 If so, I think it is better to have the same behavior for
> invalidated slots irrespective of the time it gets invalidated. We can
> either give an error for such slots during the upgrade (which means
> disallow the upgrade) or simply ignore such slots during the upgrade.
> I would prefer ERROR but if we want to ignore such slots, we can
> probably inform the user in some way about ignored slots, so that she
> can later drop corresponding subscritions or recreate such slots and
> do the required sync-up to continue the replication.

Earlier I was thinking that ERRORing out is better so that the user
can take the necessary action for the invalidated slots and then retry
the upgrade.  But thinking about it again, I could not find the
advantages of this: if we error out, the user still needs to restart
the old cluster and drop the corresponding subscriptions; OTOH, if we
allow the upgrade by ignoring such slots, the user has to take similar
actions on the new cluster.  So what's the advantage of erroring out
over completing the upgrade?  A clear advantage of completing it is
that the user wants to upgrade and that succeeds without reattempting.
If erroring out gave the user some option to salvage those invalidated
slots and somehow migrate them as well by retrying the upgrade, then
it would make sense to error out and let the user take some action on
the old cluster; but if all they have to do is drop the subscription
or recreate the slot in both cases, then letting the upgrade pass is
the better option, at least IMHO.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Impact of checkpointer during pg_upgrade

From
Amit Kapila
Date:
On Tue, Sep 5, 2023 at 10:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Sep 5, 2023 at 9:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Sep 4, 2023 at 4:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > Said that there is a possibility that some of the slots which got
> > > invalidated even on the previous checkpoint might get the same LSN as
> > > the slot which got invalidated later if there is no activity between
> > > these two checkpoints. So if we go with this approach then there is
> > > some risk of migrating some of the slots which were already
> > > invalidated even before the shutdown checkpoint.
> > >
> >
> > I think even during the shutdown checkpoint, after writing shutdown
> > checkpoint WAL, we can invalidate some slots that in theory are safe
> > to migrate/copy because all the WAL for those slots would also have
> > been sent. So, those would be similar to what we invalidate during the
> > upgrade, no?
>
> Thats correct
>
>  If so, I think it is better to have the same behavior for
> > invalidated slots irrespective of the time it gets invalidated. We can
> > either give an error for such slots during the upgrade (which means
> > disallow the upgrade) or simply ignore such slots during the upgrade.
> > I would prefer ERROR but if we want to ignore such slots, we can
> > probably inform the user in some way about ignored slots, so that she
> > can later drop corresponding subscritions or recreate such slots and
> > do the required sync-up to continue the replication.
>
> Earlier I was thinking that ERRORing out is better so that the user
> can take necessary action for the invalidated slots and then retry
> upgrade.  But thinking again I could not find what are the advantages
> of this because if we error out then also users need to restart the
> old cluster again and have to drop the corresponding subscriptions
> OTOH if we allow the upgrade by ignoring the slots then also the user
> has to take similar actions on the new cluster?  So what's the
> advantage of erroring out over upgrading?
>

The advantage is that we avoid the inconvenience caused to users when
DROP SUBSCRIPTION fails because the corresponding slots are not
present: users first need to disassociate the slot from the
subscription and then drop the subscription. Also, I am not sure that
leaving behind some slots doesn't have any other impact; otherwise,
why don't we drop such slots from time to time after they are marked
invalidated during normal operation? If users really want to leave
behind such invalidated slots after the upgrade, we can even think of
providing some option like "exclude_invalid_logical_slots".

--
With Regards,
Amit Kapila.



Re: Impact of checkpointer during pg_upgrade

From
Dilip Kumar
Date:
On Tue, Sep 5, 2023 at 10:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Earlier I was thinking that ERRORing out is better so that the user
> > can take necessary action for the invalidated slots and then retry
> > upgrade.  But thinking again I could not find what are the advantages
> > of this because if we error out then also users need to restart the
> > old cluster again and have to drop the corresponding subscriptions
> > OTOH if we allow the upgrade by ignoring the slots then also the user
> > has to take similar actions on the new cluster?  So what's the
> > advantage of erroring out over upgrading?
> >
>
> The advantage is that we avoid inconvenience caused to users because
> Drop Subscription will be unsuccessful as the corresponding slots are
> not present. So users first need to disassociate slots for the
> subscription and then drop the subscription.

Yeah that's a valid argument for erroring out.

 Also, I am not sure
> leaving behind some slots doesn't have any other impact, otherwise,
> why don't we drop such slots from time to time after they are marked
> invalidated during normal operation?

Okay, I am also not sure of that.

 If users really want to leave
> behind such invalidated slots after upgrade, we can even think of
> providing some option like "exclude_invalid_logical_slots".

+1

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Impact of checkpointer during pg_upgrade

From
Michael Paquier
Date:
(Just dumping what I have in mind while reading the thread.)

On Sat, Sep 02, 2023 at 10:08:51AM +0530, Amit Kapila wrote:
> During pg_upgrade, we start the server for the old cluster which can
> allow the checkpointer to remove the WAL files. It has been noticed
> that we do generate certain types of WAL records (e.g
> XLOG_RUNNING_XACTS, XLOG_CHECKPOINT_ONLINE, and XLOG_FPI_FOR_HINT)
> even during pg_upgrade for old cluster, so additional WAL records
> could let checkpointer decide that certain WAL segments can be removed
> (e.g. say wal size crosses max_slot_wal_keep_size_mb) and invalidate
> the slots. Currently, I can't see any problem with this but for future
> work where we want to migrate logical slots during an upgrade[1], we
> need to decide what to do for such cases. The initial idea we had was
> that if the old cluster has some invalid slots, we won't allow an
> upgrade unless the user removes such slots or uses some option like
> --exclude-slots. It is quite possible that slots got invalidated
> during pg_upgrade due to no user activity. Now, even though the
> possibility of the same is less I think it is worth considering what
> should be the behavior.
>
> The other possibilities apart from not allowing an upgrade in such a
> case could be (a) Before starting the old cluster, we fetch the slots
> directly from the disk using some tool like [2] and make the decisions
> based on that state; (b) During the upgrade, we don't allow WAL to be
> removed if it can invalidate slots; (c) Copy/Migrate the invalid slots
> as well but for that, we need to expose an API to invalidate the
> slots; (d) somehow distinguish the slots that are invalidated during
> an upgrade and then simply copy such slots because anyway we ensure
> that all the WAL required by slot is sent before shutdown.

Checking for any invalid slots at the beginning of the upgrade and
complaining sounds like a good thing to do, because these are not
expected to begin with, no?  That's similar to a pre-check requirement
that the slots should have fed the subscribers with all the data
available up to the shutdown checkpoint when the publisher was stopped
before running pg_upgrade.  So (a) is a good idea to prevent an
upgrade based on a state we don't expect from the start, as something
in check.c, I assume.

On top of that, (b) sounds like a good idea to me anyway, to be more
defensive.  But couldn't we do that just like we do for autovacuum,
and force the GUCs that could remove the slots' WAL to not do their
work instead?  An upgrade is a special state of the cluster, and I'm
not much into painting more checks based on IsBinaryUpgrade to prevent
WAL segments from being removed when we can have full control of what
we want with just the GUCs that force the hand of the slots.  That
just seems simpler to me, and the WAL recycling logic is already
complex, particularly with all the GUCs popping up lately to force
some conditions for WAL recycling and/or removal.

During the upgrade, if some segments are removed and some of the slots
we expect to still be valid once the upgrade is done are marked as
invalid and become unusable, I think that we should just copy these
slots, but also update their state data so that they can still be used
as we expect, as these could be wanted by the subscribers.
That's what you mean by (d), I assume.  Do you think that it would
be possible to force the slots' data on the publishers so that they
use a local LSN based on the new WAL we are resetting to?  At least
that seems more friendly to me, as it limits the set of manipulations
the end user has to do on the slots.  The protection from (b) ought to
be enough, in itself.  (c) overlaps with (a), especially if we want to
be able to read or write some of the slots' data offline.
--
Michael


Re: Impact of checkpointer during pg_upgrade

From
Amit Kapila
Date:
On Thu, Sep 7, 2023 at 12:55 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> (Just dumping what I have in mind while reading the thread.)
>
> On Sat, Sep 02, 2023 at 10:08:51AM +0530, Amit Kapila wrote:
> > During pg_upgrade, we start the server for the old cluster which can
> > allow the checkpointer to remove the WAL files. It has been noticed
> > that we do generate certain types of WAL records (e.g
> > XLOG_RUNNING_XACTS, XLOG_CHECKPOINT_ONLINE, and XLOG_FPI_FOR_HINT)
> > even during pg_upgrade for old cluster, so additional WAL records
> > could let checkpointer decide that certain WAL segments can be removed
> > (e.g. say wal size crosses max_slot_wal_keep_size_mb) and invalidate
> > the slots. Currently, I can't see any problem with this but for future
> > work where we want to migrate logical slots during an upgrade[1], we
> > need to decide what to do for such cases. The initial idea we had was
> > that if the old cluster has some invalid slots, we won't allow an
> > upgrade unless the user removes such slots or uses some option like
> > --exclude-slots. It is quite possible that slots got invalidated
> > during pg_upgrade due to no user activity. Now, even though the
> > possibility of the same is less I think it is worth considering what
> > should be the behavior.
> >
> > The other possibilities apart from not allowing an upgrade in such a
> > case could be (a) Before starting the old cluster, we fetch the slots
> > directly from the disk using some tool like [2] and make the decisions
> > based on that state; (b) During the upgrade, we don't allow WAL to be
> > removed if it can invalidate slots; (c) Copy/Migrate the invalid slots
> > as well but for that, we need to expose an API to invalidate the
> > slots; (d) somehow distinguish the slots that are invalidated during
> > an upgrade and then simply copy such slots because anyway we ensure
> > that all the WAL required by slot is sent before shutdown.
>
> Checking for any invalid slots at the beginning of the upgrade and
> complain sounds like a good thing to do, because these are not
> expected to begin with, no?  That's similar to a pre-check requirement
> that the slots should have fed the subscribers with all the data
> available up to the shutdown checkpoint when the publisher was stopped
> before running pg_upgrade.  So (a) is a good idea to prevent an
> upgrade based on a state we don't expect from the start, as something
> in check.c, I assume.
>
> On top of that, (b) sounds like a good idea to me anyway to be more
> defensive.  But couldn't we do that just like we do for autovacuum and
> force the GUCs that could remove the slot's WAL to not do their work
> instead?
>

I think if we just set max_slot_wal_keep_size to -1, that should be
sufficient to prevent any slots from getting invalidated during the
upgrade. Do you have anything else in mind?
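
IOW, on the pg_upgrade side it would just mean appending one more
override when starting the old server, along the lines of the
simplified sketch below (illustrative only; the real command built in
server.c's start_postmaster() carries more options than shown here):

#include <stdio.h>

/*
 * Simplified sketch (not the actual server.c code): build the pg_ctl
 * command used to start the old cluster, adding an override so the
 * checkpointer never invalidates slots because of WAL size.
 */
static void
build_start_command(char *cmd, size_t len, const char *bindir,
                    const char *logfile, const char *pgdata, int port)
{
    snprintf(cmd, len,
             "\"%s/pg_ctl\" -w -l \"%s\" -D \"%s\" "
             "-o \"-p %d -b -c max_slot_wal_keep_size=-1\" start",
             bindir, logfile, pgdata, port);
}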

  An upgrade is a special state of the cluster, and I'm not
> much into painting more checks based on IsBinaryUpgrade to prevent WAL
> segments to be removed while we can have a full control of what we
> want with just the GUCs that force the hand of the slots.  That just
> seems simpler to me, and the WAL recycling logic is already complex
> particularly with all the GUCs popping lately to force some conditions
> to do the WAL recycling and/or removal.
>
> During the upgrade, if some segments are removed and some of the slots
> we expect to still be valid once the upgrade is done are marked as
> invalid and become unusable, I think that we should just copy these
> slots, but also update their state data so as they can still be used
> with what we expect, as these could be wanted by the subscribers.
> That's what you mean with (d), I assume. Do you think that it would
> be possible to force the slot's data on the publishers so as they use
> a local LSN based on the new WAL we are resetting at?
>

Yes.

>
> At least that
> seems more friendly to me as this limits the set of manipulations to
> do on the slots for the end-user.  The protection from (b) ought to be
> enough, in itself.  (c) overlaps with (a), especially if we want to be
> able to read or write some of the slot's data offline.
>

If we do (b), either via GUCs or an IsBinaryUpgrade check, we don't
need to do any of (a), (c), or (d). I feel that would be a minimal and
sufficient fix to prevent any side impact of the checkpointer on slots
during an upgrade.

--
With Regards,
Amit Kapila.



Re: Impact of checkpointer during pg_upgrade

From
Dilip Kumar
Date:
On Thu, Sep 7, 2023 at 3:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Sep 7, 2023 at 12:55 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > (Just dumping what I have in mind while reading the thread.)
> >
> > On Sat, Sep 02, 2023 at 10:08:51AM +0530, Amit Kapila wrote:
> > > During pg_upgrade, we start the server for the old cluster which can
> > > allow the checkpointer to remove the WAL files. It has been noticed
> > > that we do generate certain types of WAL records (e.g
> > > XLOG_RUNNING_XACTS, XLOG_CHECKPOINT_ONLINE, and XLOG_FPI_FOR_HINT)
> > > even during pg_upgrade for old cluster, so additional WAL records
> > > could let checkpointer decide that certain WAL segments can be removed
> > > (e.g. say wal size crosses max_slot_wal_keep_size_mb) and invalidate
> > > the slots. Currently, I can't see any problem with this but for future
> > > work where we want to migrate logical slots during an upgrade[1], we
> > > need to decide what to do for such cases. The initial idea we had was
> > > that if the old cluster has some invalid slots, we won't allow an
> > > upgrade unless the user removes such slots or uses some option like
> > > --exclude-slots. It is quite possible that slots got invalidated
> > > during pg_upgrade due to no user activity. Now, even though the
> > > possibility of the same is less I think it is worth considering what
> > > should be the behavior.
> > >
> > > The other possibilities apart from not allowing an upgrade in such a
> > > case could be (a) Before starting the old cluster, we fetch the slots
> > > directly from the disk using some tool like [2] and make the decisions
> > > based on that state; (b) During the upgrade, we don't allow WAL to be
> > > removed if it can invalidate slots; (c) Copy/Migrate the invalid slots
> > > as well but for that, we need to expose an API to invalidate the
> > > slots; (d) somehow distinguish the slots that are invalidated during
> > > an upgrade and then simply copy such slots because anyway we ensure
> > > that all the WAL required by slot is sent before shutdown.
> >
> > Checking for any invalid slots at the beginning of the upgrade and
> > complain sounds like a good thing to do, because these are not
> > expected to begin with, no?  That's similar to a pre-check requirement
> > that the slots should have fed the subscribers with all the data
> > available up to the shutdown checkpoint when the publisher was stopped
> > before running pg_upgrade.  So (a) is a good idea to prevent an
> > upgrade based on a state we don't expect from the start, as something
> > in check.c, I assume.
> >
> > On top of that, (b) sounds like a good idea to me anyway to be more
> > defensive.  But couldn't we do that just like we do for autovacuum and
> > force the GUCs that could remove the slot's WAL to not do their work
> > instead?
> >
>
> I think if we just make max_slot_wal_keep_size to -1 that should be
> sufficient to not let any slots get invalidated during upgrade. Do you
> have anything else in mind?

This seems like a good solution to the problem.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Impact of checkpointer during pg_upgrade

From
Michael Paquier
Date:
On Thu, Sep 07, 2023 at 03:33:52PM +0530, Amit Kapila wrote:
> I think if we just make max_slot_wal_keep_size to -1 that should be
> sufficient to not let any slots get invalidated during upgrade. Do you
> have anything else in mind?

Forcing wal_keep_size as well, while we are on it, may be a good
thing.

> If we do (b) either via GUCs or IsBinaryUpgrade check we don't need to
> do any of (a), (b), or (d). I feel that would be a minimal and
> sufficient fix to prevent any side impact of checkpointer on slots
> during an upgrade.

I could get into the addition of a post-upgrade check to make sure
that nothing got invalidated while the upgrade was running, FWIW.
--
Michael


Re: Impact of checkpointer during pg_upgrade

From
Amit Kapila
Date:
On Fri, Sep 8, 2023 at 5:37 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Sep 07, 2023 at 03:33:52PM +0530, Amit Kapila wrote:
> > I think if we just make max_slot_wal_keep_size to -1 that should be
> > sufficient to not let any slots get invalidated during upgrade. Do you
> > have anything else in mind?
>
> Forcing wal_keep_size while on it may be a good thing.
>

I had thought about it but couldn't come up with a reason to force
wal_keep_size for this purpose.

> > If we do (b) either via GUCs or IsBinaryUpgrade check we don't need to
> > do any of (a), (b), or (d). I feel that would be a minimal and
> > sufficient fix to prevent any side impact of checkpointer on slots
> > during an upgrade.
>
> I could get into the addition of a post-upgrade check to make sure
> that nothing got invalidated while the upgrade was running, FWIW.
>

This validation tries to ensure that we don't have any bugs/issues in
our patch. It may be a candidate for an assert, but I feel that even
if we encounter any bug, it is better to fix the bug itself.

--
With Regards,
Amit Kapila.



Re: Impact of checkpointer during pg_upgrade

From
Michael Paquier
Date:
On Fri, Sep 08, 2023 at 08:18:14AM +0530, Amit Kapila wrote:
> This validation tries to ensure that we don't have any bugs/issues in
> our patch. It may be a candidate for assert but I feel even if we
> encounter any bug it is better to fix the bug.

My guess is that an elog-like error is better adapted so that we are
able to detect problems in more cases, but perhaps an assert may be
enough for the buildfarm.  If there is anything in the backend that
causes slots to become invalidated, I agree that any issue causing
that should be fixed, but isn't the point different here?  Having a
check at the end of an upgrade is a means to improve the detection
rate of bugs where slots get invalidated, so it is actually helpful to
have one anyway, no?  I am not sure what your strategy is here: do you
mean to keep a check at the end of pg_upgrade only in the patch, to
validate it?  Or do you mean to add something in pg_upgrade as part of
the feature?  I mean that doing the latter is beneficial for the sake
of any patch committed and as a long-term method to rely on.
--
Michael


RE: Impact of checkpointer during pg_upgrade

From
"Zhijie Hou (Fujitsu)"
Date:
On Friday, September 8, 2023 10:58 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Sep 08, 2023 at 08:18:14AM +0530, Amit Kapila wrote:
> > This validation tries to ensure that we don't have any bugs/issues in
> > our patch. It may be a candidate for assert but I feel even if we
> > encounter any bug it is better to fix the bug.
>
> My guess is that an elog-like error is more adapted so as we are able to detect
> problems in more cases, but perhaps an assert may be enough for the
> buildfarm.  If there is anything in the backend that causes slots to become
> invalidated, I agree that any issue causing that should be fixed, but isn't the
> point different here?  Having a check at the end of an upgrade is a mean to
> improve the detection rate of bugs where slots get invalidated, so it is actually
> helpful to have one anyway?  I am not sure what is your strategy here, do you
> mean to keep a check at the end of pg_upgrade only in the patch to validate it?
> Or do you mean to add something in pg_upgrade as part of the feature?  I
> mean that doing the latter is benefitial for the sake of any patch committed and
> as a long-term method to rely on.

I feel adding a check in pg_upgrade may not detect slot invalidation
100% of the time if we check by querying the old cluster for the slot
info, because the invalidation can happen before the first time we
fetch the info or after the last time we fetch it (e.g., the shutdown
checkpoint could also invalidate slots).

Personally, I think if we really want to add a check, it might be
better to put it on the server side, like reporting an ERROR on the
server side when invalidating a slot (in
InvalidatePossiblyObsoleteSlot) if in upgrade mode.
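
Something along these lines is what I am imagining (untested sketch
only; the helper name, its exact placement in the invalidation path,
and the message text are illustrative):

#include "postgres.h"
#include "miscadmin.h"          /* for IsBinaryUpgrade */

/*
 * Sketch only: a guard that could be called from the slot invalidation
 * path (e.g. from InvalidatePossiblyObsoleteSlot) before a slot is
 * marked invalid.  IsBinaryUpgrade is the existing backend global that
 * is set when the server is started by pg_upgrade ("-b").
 */
static void
forbid_slot_invalidation_during_upgrade(void)
{
    if (IsBinaryUpgrade)
        elog(ERROR,
             "replication slots must not be invalidated during a binary upgrade");
}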

Having said that, I feel it's fine if we don't add this check, as
setting max_slot_wal_keep_size to -1 looks sufficient.


Best Regards,
Hou zj



Re: Impact of checkpointer during pg_upgrade

From
Michael Paquier
Date:
On Fri, Sep 08, 2023 at 03:30:23AM +0000, Zhijie Hou (Fujitsu) wrote:
> I feel adding a check at pg_upgrade may not 100% detect the slot invalidation
> if we check by querying the old cluster to get the slot info, because the
> invalidation can happen before the first time we fetch the info or after the
> last time we fetch the info(e.g. shutdown checkpoint could also invalidate
> slots)
>
> Personally, I think if we really want to add a check, it might be better to put
> it at server side, Like: reporting an ERROR at server side when invalidating
> the slot(InvalidatePossiblyObsoleteSlot) if in upgrade mode.

Yeah, it may be enough to paint one IsBinaryUpgrade check in the
invalidation path of the backend, with an elog(ERROR), as that would
be an unexpected state.

> Having said that I feel it's fine if we don't add this check as setting
> max_slot_wal_keep_size to -1 looks sufficient.

I would do both, FWIW, to stay on the safe side.  And both are
non-invasive.
--
Michael


Re: Impact of checkpointer during pg_upgrade

From
Amit Kapila
Date:
On Fri, Sep 8, 2023 at 9:00 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, September 8, 2023 10:58 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Fri, Sep 08, 2023 at 08:18:14AM +0530, Amit Kapila wrote:
> > > This validation tries to ensure that we don't have any bugs/issues in
> > > our patch. It may be a candidate for assert but I feel even if we
> > > encounter any bug it is better to fix the bug.
> >
> > My guess is that an elog-like error is more adapted so as we are able to detect
> > problems in more cases, but perhaps an assert may be enough for the
> > buildfarm.  If there is anything in the backend that causes slots to become
> > invalidated, I agree that any issue causing that should be fixed, but isn't the
> > point different here?  Having a check at the end of an upgrade is a mean to
> > improve the detection rate of bugs where slots get invalidated, so it is actually
> > helpful to have one anyway?  I am not sure what is your strategy here, do you
> > mean to keep a check at the end of pg_upgrade only in the patch to validate it?
> > Or do you mean to add something in pg_upgrade as part of the feature?
> >

We can do whatever the consensus is, but I feel such an end-check is,
to some extent, only helpful for testing a patch before commit, not
otherwise.

> >  I
> > mean that doing the latter is benefitial for the sake of any patch committed and
> > as a long-term method to rely on.
>

What is your worry here? Are you worried that unknowingly in the
future we could add some other way to invalidate slots during upgrades
that we won't be able to detect?

> I feel adding a check at pg_upgrade may not 100% detect the slot invalidation
> if we check by querying the old cluster to get the slot info, because the
> invalidation can happen before the first time we fetch the info or after the
> last time we fetch the info(e.g. shutdown checkpoint could also invalidate
> slots)
>
> Personally, I think if we really want to add a check, it might be better to put
> it at server side, Like: reporting an ERROR at server side when invalidating
> the slot(InvalidatePossiblyObsoleteSlot) if in upgrade mode.
>

I don't know whether we really need to be extra careful in this case;
the same could be said about other consistency checks on the old
cluster.

--
With Regards,
Amit Kapila.



Re: Impact of checkpointer during pg_upgrade

From
Michael Paquier
Date:
On Fri, Sep 08, 2023 at 09:14:59AM +0530, Amit Kapila wrote:
> On Fri, Sep 8, 2023 at 9:00 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
>>>  I
>>> mean that doing the latter is benefitial for the sake of any patch committed and
>>> as a long-term method to rely on.
>
> What is your worry here? Are you worried that unknowingly in the
> future we could add some other way to invalidate slots during upgrades
> that we won't be able to detect?

Exactly.  A safety belt would not hurt, especially if the belt added
is simple.  The idea of a backend-side elog(ERROR) with
IsBinaryUpgrade in the slot invalidation path is tempting.
--
Michael


Re: Impact of checkpointer during pg_upgrade

From
Amit Kapila
Date:
On Fri, Sep 8, 2023 at 10:10 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Sep 08, 2023 at 09:14:59AM +0530, Amit Kapila wrote:
> > On Fri, Sep 8, 2023 at 9:00 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> >>>  I
> >>> mean that doing the latter is benefitial for the sake of any patch committed and
> >>> as a long-term method to rely on.
> >
> > What is your worry here? Are you worried that unknowingly in the
> > future we could add some other way to invalidate slots during upgrades
> > that we won't be able to detect?
>
> Exactly.  A safety belt would not hurt, especially if the belt added
> is simple.  The idea of a backend side elog(ERROR) with
> isBinaryUpgrade is tempting in the invalidation slot path.
>

I agree with doing something simple. So, to conclude, we agree on two
things in this thread: (a) set max_slot_wal_keep_size to -1 when
starting the postmaster for the old cluster during the upgrade; (b)
have an elog(ERROR) to avoid invalidating slots during the upgrade.

--
With Regards,
Amit Kapila.



Re: Impact of checkpointer during pg_upgrade

From
Michael Paquier
Date:
On Fri, Sep 08, 2023 at 11:59:41AM +0530, Amit Kapila wrote:
> I agree with doing something simple. So, to conclude, we agree on two
> things in this thread (a) Use max_slot_wal_keep_size to -1 to start
> postmaster for the old cluster during the upgrade; (b) Have an
> elog(ERROR) to avoid invalidating slots during the upgrade.

+1.
--
Michael


Re: Impact of checkpointer during pg_upgrade

From
Dilip Kumar
Date:
On Fri, Sep 8, 2023 at 11:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Sep 8, 2023 at 10:10 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Fri, Sep 08, 2023 at 09:14:59AM +0530, Amit Kapila wrote:
> > > On Fri, Sep 8, 2023 at 9:00 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > >>>  I
> > >>> mean that doing the latter is benefitial for the sake of any patch committed and
> > >>> as a long-term method to rely on.
> > >
> > > What is your worry here? Are you worried that unknowingly in the
> > > future we could add some other way to invalidate slots during upgrades
> > > that we won't be able to detect?
> >
> > Exactly.  A safety belt would not hurt, especially if the belt added
> > is simple.  The idea of a backend side elog(ERROR) with
> > isBinaryUpgrade is tempting in the invalidation slot path.
> >
>
> I agree with doing something simple. So, to conclude, we agree on two
> things in this thread (a) Use max_slot_wal_keep_size to -1 to start
> postmaster for the old cluster during the upgrade; (b) Have an
> elog(ERROR) to avoid invalidating slots during the upgrade.

+1

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com