Re: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation - Mailing list pgsql-hackers

From Joao Foltran
Subject Re: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation
Msg-id CAF8B20Cvh-pdr37DpN_-n1tjpS8zLQB5JTbPbZzewvww0VOyBA@mail.gmail.com
In response to Re: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation  (Joao Foltran <joao@foltrandba.com>)
List pgsql-hackers
Hello all,

I've made a v2 of this patch, turning it into a patchset with guidance
from Fabrizio Mello.

This patchset includes a new feature that self-heals (auto
revalidates) physical replication slots after they have been
invalidated for two reasons: RS_INVAL_WAL_REMOVED or
RS_INVAL_IDLE_TIMEOUT.

Requiring a user to manually recreate the slot is unnecessary when the
standby server connected to it recovers on its own using
restore_command. It also becomes burdensome when managing a fleet of
clusters, where the scale of the operation calls for handling this
kind of problem automatically.

The patch adds an opt-in mechanism that allows physical slots to be
revalidated in those cases. A new persistent field called
`auto_revalidate` (default false) controls which physical slots are
eligible. When it is enabled, StartReplication issues a WARNING instead
of an ERROR when acquiring an invalidated physical slot, and
PhysicalConfirmReceivedLocation clears the invalidation atomically
with the restart_lsn update upon the first flush ACK. The revalidation
is persisted to disk immediately so it survives a crash.

Only RS_INVAL_WAL_REMOVED and RS_INVAL_IDLE_TIMEOUT are revalidatable,
via an explicit allowlist in SlotCanBeRevalidated(). Future
invalidation reasons must be added there to become eligible.
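
The allowlist idea can be illustrated with a short standalone sketch
in C. This is an assumption about the shape of such a check, not the
actual SlotCanBeRevalidated() from the patch: the function name, its
parameters and the local enum (which only mirrors the RS_INVAL_* names
with placeholder values) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in mirroring the RS_INVAL_* names; values are placeholders. */
typedef enum SketchInvalidationCause
{
    RS_INVAL_NONE,
    RS_INVAL_WAL_REMOVED,
    RS_INVAL_HORIZON,
    RS_INVAL_WAL_LEVEL,
    RS_INVAL_IDLE_TIMEOUT
} SketchInvalidationCause;

/*
 * Hypothetical allowlist check: only physical, opted-in slots with an
 * explicitly listed cause are eligible for revalidation.
 */
static bool
sketch_slot_can_be_revalidated(bool is_physical,
                               bool auto_revalidate,
                               SketchInvalidationCause cause)
{
    if (!is_physical || !auto_revalidate)
        return false;

    /* Every cause is listed, so a new enum value must be classified here. */
    switch (cause)
    {
        case RS_INVAL_WAL_REMOVED:
        case RS_INVAL_IDLE_TIMEOUT:
            return true;
        case RS_INVAL_NONE:
        case RS_INVAL_HORIZON:
        case RS_INVAL_WAL_LEVEL:
            return false;
    }

    /* Unknown causes stay ineligible until someone adds them above. */
    return false;
}
```

Leaving out a default branch is a deliberate choice in this sketch:
compilers that warn on unhandled enum values will then point at this
switch when a new invalidation reason is introduced.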

I appreciate Fabrizio's help reviewing everything and walking me
through my questions.

The series is split into five patches:

0001 - Core infrastructure: SlotCanBeRevalidated helper, SlotIsValid
macro, revalidation logic in walsender.c, SLOT_VERSION bump.
0002 - SQL function: new auto_revalidate parameter on
pg_create_physical_replication_slot(), copy-path propagation via
pg_copy_physical_replication_slot(), regression test.
0003 - View exposure: auto_revalidate column in pg_replication_slots.
0004 - TAP recovery test: six scenarios covering revalidation, WAL
retention, xmin recovery, error preservation for
auto_revalidate=false, slot copy revalidation, and idle_timeout
revalidation (some of these require injection_points).
0005 - Documentation: system-views.sgml and func-admin.sgml.

João Foltran
Linkedin: https://www.linkedin.com/in/joao-foltran-031b9312b

On Thu, Jan 22, 2026 at 4:41 PM Joao Foltran <joao@foltrandba.com> wrote:
>
> Hi Amit!
>
> Unless we have hot_standby_feedback = on, xmin would be null on the
> physical replication slot.
>
> But even with that parameter, once we know the standby has already
> caught up using the archived WAL, the xmin no longer matters, since
> we don't need those rows to be visible anymore.
>
> I've attached a simple patch and test here that revalidates the slot
> after it is lost. It is still missing any filtering besides checking
> if the slot is physical or logical, but we can add filters for
> specific invalidations.
>
> Let me know what you think.
>
> Regards,
> João Foltran
>
> On Wed, Jan 14, 2026 at 8:21 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jan 6, 2026 at 3:26 AM Joao Foltran <joao@foltrandba.com> wrote:
> > >
> > > > The slots could be invalidated due to other reasons like
> > > > RS_INVAL_IDLE_TIMEOUT as well.
> > >
> > > We could just filter which invalidation reasons could be "revalidated"
> > > for only reasons that can be resolved this way.
> > >
> >
> > Can we make the slot valid even if the required WAL is made available
> > afterwards? What about the removed rows due to the slot's xmin?
> >
> > --
> > With Regards,
> > Amit Kapila.
