Re: Introduce XID age based replication slot invalidation - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Introduce XID age based replication slot invalidation
Date
Msg-id CAD21AoB4MsTpG5JEkifght_tQ91VHJO_8kpsDCrG-z9qkkmN5g@mail.gmail.com
Whole thread
In response to Re: Introduce XID age based replication slot invalidation  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses Re: Introduce XID age based replication slot invalidation
List pgsql-hackers
On Sun, Mar 29, 2026 at 6:35 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hi,
>
> On Sun, Mar 29, 2026 at 1:16 PM Srinath Reddy Sadipiralla
> <srinath2133@gmail.com> wrote:
> >
> > Hello,
> >
> > Thanks for the v5 patch set, I have reviewed and did initial testing on
> > v5 patch set, and it LGTM, except these
>
> Thank you for reviewing and testing. I appreciate it.
>
> > diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> > index 286f0f46341..c2ff7e464f0 100644
> > --- a/src/backend/replication/slot.c
> > +++ b/src/backend/replication/slot.c
> > @@ -1849,7 +1849,7 @@ ReportSlotInvalidation(ReplicationSlotInvalidationCause cause,
> >                                 else
> >                                 {
> >                                         /* translator: %s is a GUC variable name */
> > -                                       appendStringInfo(&err_detail, _("The slot's xmin %u is %d transactions old,
whichexceeds the configured \"%s\" value of %d."), 
> > +                                       appendStringInfo(&err_detail, _("The slot's catalog_xmin %u is %d
transactionsold, which exceeds the configured \"%s\" value of %d."), 
> >                                                                          catalog_xmin, (int32) (recentXid -
catalog_xmin),"max_slot_xid_age", max_slot_xid_age); 
> >                                 }
>
> Fixed the typo.
>
> > while testing the active slot XID age invalidation (SIGTERM path) , i
> > observed that slot got invalidated , walsender was killed because of
> > SIGTERM , then starts the infinite-retry-cycle problem where
> > walreceiver starts walsender and walsender will try to use an invalidated
> > slot and dies, will think more on this.
>
> I would like to clarify that once a slot is invalidated due to any of
> the reasons (ReplicationSlotInvalidationCause), it becomes unusable;
> the sender will error out if the receiver tries to use it. This is
> consistent with all existing slot invalidation mechanisms.
>
> Please find the attached v6 patches fixing the typo for further review.
>

I've reviewed the v6 patch. Here are some comments.

 bool
 vacuum_get_cutoffs(Relation rel, const VacuumParams params,
-                  struct VacuumCutoffs *cutoffs)
+                  struct VacuumCutoffs *cutoffs,
+                  TransactionId *slot_xmin,
+                  TransactionId *slot_catalog_xmin)

How about storing both slot_xmin and catalog_xmin into VacuumCutoffs?

---
-   if (InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_REMOVED |
RS_INVAL_IDLE_TIMEOUT,
+   possibleInvalidationCauses = RS_INVAL_WAL_REMOVED | RS_INVAL_IDLE_TIMEOUT |
+       RS_INVAL_XID_AGE;
+
+   if (InvalidateObsoleteReplicationSlots(possibleInvalidationCauses,
                                           _logSegNo, InvalidOid,
+                                          InvalidTransactionId,
+                                          max_slot_xid_age > 0 ?
+                                          ReadNextTransactionId() :
                                           InvalidTransactionId))

It's odd to me that we specify RS_INVAL_XID_AGE while passing
InvalidTransactionId. I think we can specify RS_INVAL_XID_AGE along
with a valid recentXId only when we'd like to check the slots based on
their XIDs.

---
+   /* Check if the slot needs to be invalidated due to max_slot_xid_age GUC */
+   if ((possible_causes & RS_INVAL_XID_AGE) && CanInvalidateXidAgedSlot(s))
+   {
+       TransactionId xidLimit;
+
+       Assert(TransactionIdIsValid(recentXid));
+
+       xidLimit = TransactionIdRetreatedBy(recentXid, max_slot_xid_age);
+

I think we can avoid calculating xidLimit for every slot by
calculating it in InvalidatePossiblyObsoleteSlot() and passing it to
DetermineSlotInvalidationCause().

---
  */
 TransactionId
 GetOldestNonRemovableTransactionId(Relation rel)
+{
+   return GetOldestNonRemovableTransactionIdExt(rel, NULL, NULL);
+}
+
+/*
+ * Same as GetOldestNonRemovableTransactionId(), but also returns the
+ * replication slot xmin and catalog_xmin from the same ComputeXidHorizons()
+ * call.  This avoids a separate ProcArrayLock acquisition when the caller
+ * needs both values.
+ */
+TransactionId
+GetOldestNonRemovableTransactionIdExt(Relation rel,
+                                     TransactionId *slot_xmin,
+                                     TransactionId *slot_catalog_xmin)
 {

I understand that the primary reason why the patch introduces another
variant of GetOldestNonRemovableTransactionId() is to avoid extra
ProcArrayLock acquision to get replication slot xmin and catalog_xmin.
While it's not very elegant, I find that it would not be bad because
otherwise autovacuum takes extra ProcArrayLock (in shared mode) for
every table to vacuum. The ProcArrayLock is already known
high-contented lock it would be better to avoid taking it once more.
If others think differently, we can just call
ProcArrayGetReplicationSlotXmin() separately and compare them to the
limit of XID-age based slot invalidation.

Having said that, I personally don't want to add new instructions to
the existing GetOldestNonRemovableTransactionId(). I guess we might
want to make both the existing function and new function call a common
(inline) function that takes ComputeXidHorizonsResult and returns
appropriate transaction id based on the given relation .

---
+   # Do some work to advance xids
+   $node->safe_psql(
+       'postgres', qq[
+       do \$\$
+       begin
+       for i in 1..$nxids loop
+           -- use an exception block so that each iteration eats an XID
+           begin
+           insert into $table_name values (i);
+           exception
+           when division_by_zero then null;
+           end;
+       end loop;
+       end\$\$;
+   ]);

I think it's fater to use pg_current_xact_id() instead.

---
+   else
+   {
+       $node->safe_psql('postgres', "VACUUM");
+   }

We don't need to vacuum all tables here.

---
+# Configure primary with XID age settings. Set autovacuum_naptime high so
+# that the checkpointer (not vacuum) triggers the invalidation.
+my $max_slot_xid_age = 500;
+$primary5->append_conf(
+   'postgresql.conf', qq{
+max_slot_xid_age = $max_slot_xid_age
+autovacuum_naptime = '1h'
+});

I think that it's better to disable autovacuum than setting a large number.

---
+# Testcase end: Invalidate streaming standby's slot due to max_slot_xid_age
+# GUC (via checkpoint).

I think that we can say "physical slot" instead of standby's slot to
avoid confusion as I thought standby's slot is a slot created on the
standby at the first glance.

---
Do we have tests for invalidating slots on the standbys?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Jacob Champion
Date:
Subject: Re: Custom oauth validator options
Next
From: SATYANARAYANA NARLAPURAM
Date:
Subject: Re: POC: Parallel processing of indexes in autovacuum