Re: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation - Mailing list pgsql-hackers

From Joao Foltran
Subject Re: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation
Date
Msg-id CAF8B20DoV6p+hzuQC22mSYyX3Xc+Xypf2yN3TqVqduLutruEXg@mail.gmail.com
In response to Re: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
> The slots could be invalidated due to other reasons like
> RS_INVAL_IDLE_TIMEOUT as well.

We could restrict "revalidation" to the invalidation reasons that can
actually be resolved this way, and leave slots invalidated for any
other reason (RS_INVAL_IDLE_TIMEOUT, for example) untouched.
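
A minimal sketch of such a filter, to make the idea concrete. The enum
below is a local stand-in that only mirrors the names of the server's
invalidation causes (the values are illustrative), the helper name is
hypothetical, and treating RS_INVAL_WAL_REMOVED as the only recoverable
cause is my assumption, not something the patch already does:

```c
#include <stdbool.h>

/*
 * Local stand-in for the server's slot invalidation causes.
 * Names mirror PostgreSQL's; the values here are illustrative only.
 */
typedef enum SlotInvalCause
{
    RS_INVAL_NONE,
    RS_INVAL_WAL_REMOVED,
    RS_INVAL_HORIZON,
    RS_INVAL_WAL_LEVEL,
    RS_INVAL_IDLE_TIMEOUT
} SlotInvalCause;

/*
 * Hypothetical helper: returns true only for causes on the allow-list.
 * Assumption: only RS_INVAL_WAL_REMOVED can be resolved from the archive.
 */
static bool
slot_inval_cause_is_revalidatable(SlotInvalCause cause)
{
    switch (cause)
    {
        case RS_INVAL_WAL_REMOVED:
            return true;
        case RS_INVAL_NONE:
        case RS_INVAL_HORIZON:
        case RS_INVAL_WAL_LEVEL:
        case RS_INVAL_IDLE_TIMEOUT:
            return false;
    }

    /* Any cause added later is not revalidated unless listed above. */
    return false;
}
```

Because this is an allow-list, a cause added in the future stays
non-recoverable until someone explicitly opts it in.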

As for recreating vs. not recreating the slots: when you run a large
number of clusters with disk space constraints, this would help
tremendously. There are probably many users who would prefer
self-healing in the situations where it is possible.

Self-healing doesn't mean the invalidation goes unreported. Users can
still check the logs later to see why it happened and prevent it from
happening again.

If making this the default is the concern, could it be a flag on the
slot? Something like "self-healing: true", so that any possible
self-healing operations are enabled only for that slot. Future
self-healing enhancements could then sit behind the same flag, which
would keep them from running when someone prefers error+investigate
over self-heal+investigate.
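
Roughly, the decision could look like the sketch below. Everything in
it is hypothetical: DemoSlot is a simplified stand-in rather than the
server's ReplicationSlot, and the field and function names exist only
for illustration:

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a slot; not PostgreSQL's ReplicationSlot. */
typedef struct DemoSlot
{
    bool self_healing;      /* proposed per-slot opt-in flag */
    bool invalidated;       /* slot is currently marked invalid */
    bool cause_recoverable; /* invalidation cause is on the allow-list */
} DemoSlot;

/*
 * Returns true when the invalidation may be cleared: the slot is invalid,
 * it has opted in, the cause is recoverable, and the walsender has already
 * sent a first record successfully.
 */
static bool
demo_slot_should_revalidate(const DemoSlot *slot, bool first_record_sent)
{
    if (!slot->invalidated)
        return false;           /* nothing to clear */
    if (!slot->self_healing)
        return false;           /* user prefers error+investigate */
    if (!slot->cause_recoverable)
        return false;           /* e.g. an idle timeout */
    return first_record_sent;
}
```

With the flag off, the behavior stays exactly as it is today: the slot
remains invalid and has to be recreated.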

--
Regards,
João Foltran

On Tue, Dec 16, 2025 at 6:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Dec 16, 2025 at 9:54 AM Joao Foltran <joao@foltrandba.com> wrote:
> >
> > Thank you for clarifying this behavior to me! I've tested it and it
> > really doesn't hold back WAL anymore once it has been invalidated, due
> > to the check inside ReplicationSlotsComputeRequiredLSN().
> >
> > You are correct that simply letting the slot be reacquired and
> > continue working would be dangerous, possibly leading to lost WAL.
> > Can we then check whether the standby was able to reconnect and start
> > streaming successfully, and then change the slot's information so that
> > it is considered inside ReplicationSlotsComputeRequiredLSN() again?
> >
> > Example:
> >
> > in XLogSendPhysical(), after we have seen that the first record was sent:
> >
> > // In XLogSendPhysical() after XLogReadRecord() succeeds
> > if (first_record_sent &&
> >     MyReplicationSlot &&
> >     SlotIsPhysical(MyReplicationSlot) &&
> >     MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
> > {
> >     // Clear invalidation - we successfully read WAL
> > }
> >
> > This would clear the invalidation only after we know for sure that it
> > can continue streaming WAL without problems.
> >
>
> The slots could be invalidated due to other reasons like
> RS_INVAL_IDLE_TIMEOUT as well. It doesn't sound like a good idea to clear
> the invalidation flag of the slot because tomorrow we could decide to
> invalidate due to other reasons as well. I think it would be better to
> do the required forensics on invalid slots and re-create the slot if
> we want to retain the required WAL. Why don't you prefer to re-create
> it once the slot is invalidated?
>
> --
> With Regards,
> Amit Kapila.
