Re: Introduce XID age based replication slot invalidation - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Introduce XID age based replication slot invalidation
Date
Msg-id CALj2ACVGpVHuqchPPFWdiLDN-PDPCEe=sU43YB7nqafE+VMXaQ@mail.gmail.com
Whole thread Raw
In response to Re: Introduce XID age based replication slot invalidation  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
Hi,

On Mon, Mar 30, 2026 at 5:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've reviewed the v6 patch. Here are some comments.

Thank you for reviewing the patch.

>  bool
>  vacuum_get_cutoffs(Relation rel, const VacuumParams params,
> -                  struct VacuumCutoffs *cutoffs)
> +                  struct VacuumCutoffs *cutoffs,
> +                  TransactionId *slot_xmin,
> +                  TransactionId *slot_catalog_xmin)
>
> How about storing both slot_xmin and catalog_xmin into VacuumCutoffs?

Done.

> ---
> -   if (InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_REMOVED |
> RS_INVAL_IDLE_TIMEOUT,
> +   possibleInvalidationCauses = RS_INVAL_WAL_REMOVED | RS_INVAL_IDLE_TIMEOUT |
> +       RS_INVAL_XID_AGE;
> +
> +   if (InvalidateObsoleteReplicationSlots(possibleInvalidationCauses,
>                                            _logSegNo, InvalidOid,
> +                                          InvalidTransactionId,
> +                                          max_slot_xid_age > 0 ?
> +                                          ReadNextTransactionId() :
>                                            InvalidTransactionId))
>
> It's odd to me that we specify RS_INVAL_XID_AGE while passing
> InvalidTransactionId. I think we can specify RS_INVAL_XID_AGE along
> with a valid recentXId only when we'd like to check the slots based on
> their XIDs.

Done.

> ---
> +   /* Check if the slot needs to be invalidated due to max_slot_xid_age GUC */
> +   if ((possible_causes & RS_INVAL_XID_AGE) && CanInvalidateXidAgedSlot(s))
> +   {
> +       TransactionId xidLimit;
> +
> +       Assert(TransactionIdIsValid(recentXid));
> +
> +       xidLimit = TransactionIdRetreatedBy(recentXid, max_slot_xid_age);
> +
>
> I think we can avoid calculating xidLimit for every slot by
> calculating it in InvalidatePossiblyObsoleteSlot() and passing it to
> DetermineSlotInvalidationCause().

Done.

> ---
>   */
>  TransactionId
>  GetOldestNonRemovableTransactionId(Relation rel)
> +{
> +   return GetOldestNonRemovableTransactionIdExt(rel, NULL, NULL);
> +}
> +
> +/*
> + * Same as GetOldestNonRemovableTransactionId(), but also returns the
> + * replication slot xmin and catalog_xmin from the same ComputeXidHorizons()
> + * call.  This avoids a separate ProcArrayLock acquisition when the caller
> + * needs both values.
> + */
> +TransactionId
> +GetOldestNonRemovableTransactionIdExt(Relation rel,
> +                                     TransactionId *slot_xmin,
> +                                     TransactionId *slot_catalog_xmin)
>  {
>
> I understand that the primary reason why the patch introduces another
> variant of GetOldestNonRemovableTransactionId() is to avoid extra
> ProcArrayLock acquision to get replication slot xmin and catalog_xmin.
> While it's not very elegant, I find that it would not be bad because
> otherwise autovacuum takes extra ProcArrayLock (in shared mode) for
> every table to vacuum. The ProcArrayLock is already known
> high-contented lock it would be better to avoid taking it once more.
> If others think differently, we can just call
> ProcArrayGetReplicationSlotXmin() separately and compare them to the
> limit of XID-age based slot invalidation.

I understand the concerns around the ProcArrayLock and I think a new
function to return the computed slot's xmin and catalog_xmin is good.

> Having said that, I personally don't want to add new instructions to
> the existing GetOldestNonRemovableTransactionId(). I guess we might
> want to make both the existing function and new function call a common
> (inline) function that takes ComputeXidHorizonsResult and returns
> appropriate transaction id based on the given relation .

Done.

> ---
> +   # Do some work to advance xids
> +   $node->safe_psql(
> +       'postgres', qq[
> +       do \$\$
> +       begin
> +       for i in 1..$nxids loop
> +           -- use an exception block so that each iteration eats an XID
> +           begin
> +           insert into $table_name values (i);
> +           exception
> +           when division_by_zero then null;
> +           end;
> +       end loop;
> +       end\$\$;
> +   ]);
>
> I think it's fater to use pg_current_xact_id() instead.

Done. I pulled this from an existing test case in 001_stream_rep.pl.
Used the pg_current_xact_id approach. Testing times stay the same i.e.
9 wallclock secs.

> ---
> +   else
> +   {
> +       $node->safe_psql('postgres', "VACUUM");
> +   }
>
> We don't need to vacuum all tables here.

Fixed.

> ---
> +# Configure primary with XID age settings. Set autovacuum_naptime high so
> +# that the checkpointer (not vacuum) triggers the invalidation.
> +my $max_slot_xid_age = 500;
> +$primary5->append_conf(
> +   'postgresql.conf', qq{
> +max_slot_xid_age = $max_slot_xid_age
> +autovacuum_naptime = '1h'
> +});
>
> I think that it's better to disable autovacuum than setting a large number.

Done.

> ---
> +# Testcase end: Invalidate streaming standby's slot due to max_slot_xid_age
> +# GUC (via checkpoint).
>
> I think that we can say "physical slot" instead of standby's slot to
> avoid confusion as I thought standby's slot is a slot created on the
> standby at the first glance.

Fixed.

> ---
> Do we have tests for invalidating slots on the standbys?

Added a test case for this.

Please find the attached v7 patches for further review. Thank you!

--
Bharath Rupireddy
Amazon Web Services: https://aws.amazon.com

Attachment

pgsql-hackers by date:

Previous
From: Alexander Lakhin
Date:
Subject: Re: [PROPOSAL] Termination of Background Workers for ALTER/DROP DATABASE
Next
From: Álvaro Herrera
Date:
Subject: Re: Improve pgindent's formatting named fields in struct literals and varidic functions