Re: Introduce XID age based replication slot invalidation - Mailing list pgsql-hackers
| From | Bharath Rupireddy |
|---|---|
| Subject | Re: Introduce XID age based replication slot invalidation |
| Date | |
| Msg-id | CALj2ACVGpVHuqchPPFWdiLDN-PDPCEe=sU43YB7nqafE+VMXaQ@mail.gmail.com Whole thread Raw |
| In response to | Re: Introduce XID age based replication slot invalidation (Masahiko Sawada <sawada.mshk@gmail.com>) |
| List | pgsql-hackers |
Hi,
On Mon, Mar 30, 2026 at 5:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've reviewed the v6 patch. Here are some comments.
Thank you for reviewing the patch.
> bool
> vacuum_get_cutoffs(Relation rel, const VacuumParams params,
> - struct VacuumCutoffs *cutoffs)
> + struct VacuumCutoffs *cutoffs,
> + TransactionId *slot_xmin,
> + TransactionId *slot_catalog_xmin)
>
> How about storing both slot_xmin and catalog_xmin into VacuumCutoffs?
Done.
> ---
> - if (InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_REMOVED |
> RS_INVAL_IDLE_TIMEOUT,
> + possibleInvalidationCauses = RS_INVAL_WAL_REMOVED | RS_INVAL_IDLE_TIMEOUT |
> + RS_INVAL_XID_AGE;
> +
> + if (InvalidateObsoleteReplicationSlots(possibleInvalidationCauses,
> _logSegNo, InvalidOid,
> + InvalidTransactionId,
> + max_slot_xid_age > 0 ?
> + ReadNextTransactionId() :
> InvalidTransactionId))
>
> It's odd to me that we specify RS_INVAL_XID_AGE while passing
> InvalidTransactionId. I think we can specify RS_INVAL_XID_AGE along
> with a valid recentXId only when we'd like to check the slots based on
> their XIDs.
Done.
> ---
> + /* Check if the slot needs to be invalidated due to max_slot_xid_age GUC */
> + if ((possible_causes & RS_INVAL_XID_AGE) && CanInvalidateXidAgedSlot(s))
> + {
> + TransactionId xidLimit;
> +
> + Assert(TransactionIdIsValid(recentXid));
> +
> + xidLimit = TransactionIdRetreatedBy(recentXid, max_slot_xid_age);
> +
>
> I think we can avoid calculating xidLimit for every slot by
> calculating it in InvalidatePossiblyObsoleteSlot() and passing it to
> DetermineSlotInvalidationCause().
Done.
> ---
> */
> TransactionId
> GetOldestNonRemovableTransactionId(Relation rel)
> +{
> + return GetOldestNonRemovableTransactionIdExt(rel, NULL, NULL);
> +}
> +
> +/*
> + * Same as GetOldestNonRemovableTransactionId(), but also returns the
> + * replication slot xmin and catalog_xmin from the same ComputeXidHorizons()
> + * call. This avoids a separate ProcArrayLock acquisition when the caller
> + * needs both values.
> + */
> +TransactionId
> +GetOldestNonRemovableTransactionIdExt(Relation rel,
> + TransactionId *slot_xmin,
> + TransactionId *slot_catalog_xmin)
> {
>
> I understand that the primary reason why the patch introduces another
> variant of GetOldestNonRemovableTransactionId() is to avoid extra
> ProcArrayLock acquision to get replication slot xmin and catalog_xmin.
> While it's not very elegant, I find that it would not be bad because
> otherwise autovacuum takes extra ProcArrayLock (in shared mode) for
> every table to vacuum. The ProcArrayLock is already known
> high-contented lock it would be better to avoid taking it once more.
> If others think differently, we can just call
> ProcArrayGetReplicationSlotXmin() separately and compare them to the
> limit of XID-age based slot invalidation.
I understand the concerns around the ProcArrayLock and I think a new
function to return the computed slot's xmin and catalog_xmin is good.
> Having said that, I personally don't want to add new instructions to
> the existing GetOldestNonRemovableTransactionId(). I guess we might
> want to make both the existing function and new function call a common
> (inline) function that takes ComputeXidHorizonsResult and returns
> appropriate transaction id based on the given relation .
Done.
> ---
> + # Do some work to advance xids
> + $node->safe_psql(
> + 'postgres', qq[
> + do \$\$
> + begin
> + for i in 1..$nxids loop
> + -- use an exception block so that each iteration eats an XID
> + begin
> + insert into $table_name values (i);
> + exception
> + when division_by_zero then null;
> + end;
> + end loop;
> + end\$\$;
> + ]);
>
> I think it's fater to use pg_current_xact_id() instead.
Done. I pulled this from an existing test case in 001_stream_rep.pl.
Used the pg_current_xact_id approach. Testing times stay the same i.e.
9 wallclock secs.
> ---
> + else
> + {
> + $node->safe_psql('postgres', "VACUUM");
> + }
>
> We don't need to vacuum all tables here.
Fixed.
> ---
> +# Configure primary with XID age settings. Set autovacuum_naptime high so
> +# that the checkpointer (not vacuum) triggers the invalidation.
> +my $max_slot_xid_age = 500;
> +$primary5->append_conf(
> + 'postgresql.conf', qq{
> +max_slot_xid_age = $max_slot_xid_age
> +autovacuum_naptime = '1h'
> +});
>
> I think that it's better to disable autovacuum than setting a large number.
Done.
> ---
> +# Testcase end: Invalidate streaming standby's slot due to max_slot_xid_age
> +# GUC (via checkpoint).
>
> I think that we can say "physical slot" instead of standby's slot to
> avoid confusion as I thought standby's slot is a slot created on the
> standby at the first glance.
Fixed.
> ---
> Do we have tests for invalidating slots on the standbys?
Added a test case for this.
Please find the attached v7 patches for further review. Thank you!
--
Bharath Rupireddy
Amazon Web Services: https://aws.amazon.com
Attachment
pgsql-hackers by date: