Re: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers
From | Bertrand Drouvot |
---|---|
Subject | Re: Introduce XID age and inactive timeout based replication slot invalidation |
Date | |
Msg-id | ZgU70MjdOfO6l0O0@ip-10-97-1-34.eu-west-3.compute.internal Whole thread Raw |
In response to | Re: Introduce XID age and inactive timeout based replication slot invalidation (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>) |
Responses |
Re: Introduce XID age and inactive timeout based replication slot invalidation
|
List | pgsql-hackers |
Hi, On Wed, Mar 27, 2024 at 09:00:37PM +0530, Bharath Rupireddy wrote: > standby for sync slots. 0002 implementing inactive timeout GUC based > invalidation mechanism. > > Please have a look. Thanks! Regarding 0002: Some testing: T1 === When the slot is invalidated on the primary, then the reason is propagated to the sync slot (if any). That's fine but we are loosing the inactive_since on the standby: Primary: postgres=# select slot_name,inactive_since,conflicting,invalidation_reason from pg_replication_slots where slot_name='lsub29_slot'; slot_name | inactive_since | conflicting | invalidation_reason -------------+-------------------------------+-------------+--------------------- lsub29_slot | 2024-03-28 08:24:51.672528+00 | f | inactive_timeout (1 row) Standby: postgres=# select slot_name,inactive_since,conflicting,invalidation_reason from pg_replication_slots where slot_name='lsub29_slot'; slot_name | inactive_since | conflicting | invalidation_reason -------------+----------------+-------------+--------------------- lsub29_slot | | f | inactive_timeout (1 row) I think in this case it should always reflect the value from the primary (so that one can understand why it is invalidated). T2 === And it is set to a value during promotion: postgres=# select pg_promote(); pg_promote ------------ t (1 row) postgres=# select slot_name,inactive_since,conflicting,invalidation_reason from pg_replication_slots where slot_name='lsub29_slot'; slot_name | inactive_since | conflicting | invalidation_reason -------------+------------------------------+-------------+--------------------- lsub29_slot | 2024-03-28 08:30:11.74505+00 | f | inactive_timeout (1 row) I think when it is invalidated it should always reflect the value from the primary (so that one can understand why it is invalidated). T3 === As far the slot invalidation on the primary: postgres=# SELECT * FROM pg_logical_slot_get_changes('lsub29_slot', NULL, NULL, 'include-xids', '0'); ERROR: cannot acquire invalidated replication slot "lsub29_slot" Can we make the message more consistent with what can be found in CreateDecodingContext() for example? T4 === Also, it looks like querying pg_replication_slots() does not trigger an invalidation: I think it should if the slot is not invalidated yet (and matches the invalidation criteria). Code review: CR1 === + Invalidate replication slots that are inactive for longer than this + amount of time. If this value is specified without units, it is taken s/Invalidate/Invalidates/? Should we mention the relationship with inactive_since? CR2 === + * + * If check_for_invalidation is true, the slot is checked for invalidation + * based on replication_slot_inactive_timeout GUC and an error is raised after making the slot ours. */ void -ReplicationSlotAcquire(const char *name, bool nowait) +ReplicationSlotAcquire(const char *name, bool nowait, + bool check_for_invalidation) s/check_for_invalidation/check_for_timeout_invalidation/? CR3 === + if (slot->inactive_since == 0 || + replication_slot_inactive_timeout == 0) + return false; Better to test replication_slot_inactive_timeout first? (I mean there is no point of testing inactive_since if replication_slot_inactive_timeout == 0) CR4 === + if (slot->inactive_since > 0 && + replication_slot_inactive_timeout > 0) + { Same. So, instead of CR3 === and CR4 ===, I wonder if it wouldn't be better to do something like: if (replication_slot_inactive_timeout == 0) return false; else if (slot->inactive_since > 0) . . . . else return false; That would avoid checking replication_slot_inactive_timeout and inactive_since multiple times. CR5 === + * held to avoid race conditions -- for example the restart_lsn could move + * forward, or the slot could be dropped. Does the restart_lsn example makes sense here? CR6 === +static bool +InvalidateSlotForInactiveTimeout(ReplicationSlot *slot, bool need_locks) +{ InvalidatePossiblyInactiveSlot() maybe? CR7 === + /* Make sure the invalidated state persists across server restart */ + slot->just_dirtied = true; + slot->dirty = true; + SpinLockRelease(&slot->mutex); Maybe we could create a new function say MarkGivenReplicationSlotDirty() with a slot as parameter, that ReplicationSlotMarkDirty could call too? Then maybe we could set slot->data.invalidated = RS_INVAL_INACTIVE_TIMEOUT in InvalidateSlotForInactiveTimeout()? (to avoid multiple SpinLockAcquire/SpinLockRelease). CR8 === + if (persist_state) + { + char path[MAXPGPATH]; + + sprintf(path, "pg_replslot/%s", NameStr(slot->data.name)); + SaveSlotToPath(slot, path, ERROR); + } Maybe we could create a new function say GivenReplicationSlotSave() with a slot as parameter, that ReplicationSlotSave() could call too? CR9 === + if (check_for_invalidation) + { + /* The slot is ours by now */ + Assert(s->active_pid == MyProcPid); + + /* + * Well, the slot is not yet ours really unless we check for the + * invalidation below. + */ + s->active_pid = 0; + if (InvalidateReplicationSlotForInactiveTimeout(s, true, true)) + { + /* + * If the slot has been invalidated, recalculate the resource + * limits. + */ + ReplicationSlotsComputeRequiredXmin(false); + ReplicationSlotsComputeRequiredLSN(); + + /* Might need it for slot clean up on error, so restore it */ + s->active_pid = MyProcPid; + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot acquire invalidated replication slot \"%s\"", + NameStr(MyReplicationSlot->data.name)))); + } + s->active_pid = MyProcPid; Are we not missing some SpinLockAcquire/Release on the slot's mutex here? (the places where we set the active_pid). CR10 === @@ -1628,6 +1674,10 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause, if (SlotIsLogical(s)) invalidation_cause = cause; break; + case RS_INVAL_INACTIVE_TIMEOUT: + if (InvalidateReplicationSlotForInactiveTimeout(s, false, false)) + invalidation_cause = cause; + break; InvalidatePossiblyObsoleteSlot() is not called with such a reason, better to use an Assert here and in the caller too? CR11 === +++ b/src/test/recovery/t/050_invalidate_slots.pl why not using 019_replslot_limit.pl? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: