Re: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation - Mailing list pgsql-hackers

From Joao Foltran
Subject Re: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation
Date
Msg-id CAF8B20AV2fKQ4SAYDiRjOCBkS1AdnJ=5DRzkm7VuH11A165NRQ@mail.gmail.com
In response to RE: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation  ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>)
Responses Re: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation
List pgsql-hackers
Hi Zhijie,

Thank you for clarifying this behavior! I've tested it, and the slot
really does stop holding back WAL once it has been invalidated, because
of the check inside ReplicationSlotsComputeRequiredLSN().

You are correct that simply letting the slot be reacquired and
continue working would be dangerous, since WAL could be lost.
Could we instead check whether the standby was able to reconnect and
start streaming successfully, and only then update the slot's state so
that it is counted by ReplicationSlotsComputeRequiredLSN() again?

Example:

In XLogSendPhysical(), after we have seen that the first record was sent:

/*
 * In XLogSendPhysical(), once WAL has been read and sent successfully.
 * first_record_sent is a hypothetical flag, not an existing variable.
 */
if (first_record_sent &&
    MyReplicationSlot != NULL &&
    SlotIsPhysical(MyReplicationSlot) &&
    MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
{
    /* Clear invalidation here: we successfully read and sent WAL */
}

This would clear the invalidation only once we know for sure that the
slot can continue streaming WAL without problems.

After the invalidation is cleared, the slot should be able to start
holding back WAL again, right?
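
To make the question concrete, here is a toy model of the behavior
described above. It is only a sketch and not PostgreSQL code: ToySlot,
toy_compute_required_lsn() and toy_clear_invalidation() are made-up
names, and the real ReplicationSlotsComputeRequiredLSN() works on shared
memory under locks. It only illustrates that an invalidated slot is
skipped when the oldest required LSN is computed, and that it would be
counted again if the invalidation were cleared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins; these are NOT the PostgreSQL definitions. */
typedef uint64_t ToyLSN;
#define TOY_INVALID_LSN ((ToyLSN) 0)

typedef struct ToySlot
{
    ToyLSN      restart_lsn;    /* oldest WAL position the slot needs */
    bool        invalidated;    /* true once the slot was invalidated */
} ToySlot;

/*
 * Return the oldest restart_lsn among the slots that still count, or
 * TOY_INVALID_LSN if no slot holds back WAL. Invalidated slots are
 * skipped, so they protect nothing.
 */
ToyLSN
toy_compute_required_lsn(const ToySlot *slots, size_t nslots)
{
    ToyLSN      min_required = TOY_INVALID_LSN;

    assert(slots != NULL || nslots == 0);

    for (size_t i = 0; i < nslots; i++)
    {
        if (slots[i].invalidated)
            continue;           /* invalidated: does not hold back WAL */
        if (slots[i].restart_lsn == TOY_INVALID_LSN)
            continue;           /* slot has no position reserved yet */
        if (min_required == TOY_INVALID_LSN ||
            slots[i].restart_lsn < min_required)
            min_required = slots[i].restart_lsn;
    }

    return min_required;
}

/*
 * Hypothetical step proposed in this mail: once streaming from
 * confirmed_lsn is known to work, make the slot count again.
 */
void
toy_clear_invalidation(ToySlot *slot, ToyLSN confirmed_lsn)
{
    slot->restart_lsn = confirmed_lsn;
    slot->invalidated = false;
}
```

With one invalidated slot at LSN 100 and one healthy slot at LSN 500,
toy_compute_required_lsn() returns 500. After
toy_clear_invalidation(&slot, 300) on the first slot it returns 300, so
that slot would hold back WAL again from the confirmed position.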

Regards,
Joao Foltran

On Tue, Dec 16, 2025 at 12:15 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, December 16, 2025 2:54 AM Joao Foltran <joao@foltrandba.com> wrote:
> > Hi hackers,
> >
> > I'd like to report a regression in PostgreSQL 18 regarding physical replication
> > slot invalidation and propose a fix.
> >
> > It's my first time sending any type of contribution, so please let me know if I
> > made anything incorrectly and I'll fix it ASAP.
> >
> > It's also my first time doing any type of code inside the postgres project, so if
> > the logic or anything I used is incorrect let me know.
> >
> > CCing Amit, since he committed f41d8468 and 8709dcc.
> >
> > ## Problem
> >
> > Commit f41d8468 introduced an ERROR when trying to acquire an invalidated
> > replication slot. While this is correct for logical replication slots (which cannot
> > safely recover after invalidation), it breaks recovery for physical replication
> > slots.
> >
> > Later, commit 8709dcc improved upon this code to prevent a race condition
> > and moved the check to after the slot was already acquired.
> >
> > In PostgreSQL 17 and earlier, when a physical replication slot was invalidated
> > due to max_slot_wal_keep_size, the slot could still be reacquired if the
> > required WAL became available through restore_command or archive
> > recovery in the standby. This is a common operational scenario:
> >
> > - Temporary network issues
> > - Planned maintenance windows
> > - Standbys temporarily falling behind
>
> I think the ability to acquire an invalidated slot represents a
> potentially risky behavior. AFAICS, we do not currently support
> recovering invalidated slots. This implies that once a slot becomes invalidated,
> it does not offer any protection anymore. Even if the restart_lsn or xmin is valid for
> such a slot, WAL and rows can be removed at any time. For further clarification,
> please refer to ReplicationSlotsComputeRequiredLSN(), where we deliberately
> exclude counting the restart_lsn for an invalidated slot.
>
> >
> > After commit f41d8468, physical replication slots cannot be reacquired once
> > invalidated, even when the required WAL is available via archive recovery.
> > The standby remains stuck recovering from archive and cannot resume
> > streaming replication, demanding manual intervention (slot recreation).
> >
>
> I think even if the WAL is temporarily available via archive recovery, since the slot
> cannot protect any further WAL or rows from being removed, it could cause
> problems later.
>
> Best Regards,
> Hou zj
>


