Re: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers
From | vignesh C |
---|---|
Subject | Re: Introduce XID age and inactive timeout based replication slot invalidation |
Date | |
Msg-id | CALDaNm1F2YrswzM_WM37BYmiZ9Cf60UD_mgtm8HnMHRGA7tx4g@mail.gmail.com Whole thread Raw |
In response to | Re: Introduce XID age and inactive timeout based replication slot invalidation (Masahiko Sawada <sawada.mshk@gmail.com>) |
List | pgsql-hackers |
On Wed, 27 Nov 2024 at 16:25, Nisha Moond <nisha.moond412@gmail.com> wrote: > > On Wed, Nov 27, 2024 at 8:39 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Hi Nisha, > > > > Here are some review comments for the patch v50-0002. > > > > ====== > > src/backend/replication/slot.c > > > > InvalidatePossiblyObsoleteSlot: > > > > 1. > > + if (now && > > + TimestampDifferenceExceeds(s->inactive_since, now, > > + replication_slot_inactive_timeout_sec * 1000)) > > > > Previously this was using an additional call to SlotInactiveTimeoutCheckAllowed: > > > > + if (SlotInactiveTimeoutCheckAllowed(s) && > > + TimestampDifferenceExceeds(s->inactive_since, now, > > + replication_slot_inactive_timeout * 1000)) > > > > Is it OK to skip that call? e.g. can the slot fields possibly change > > between assigning the 'now' and acquiring the mutex? If not, then the > > current code is fine. The only reason for asking is because it is > > slightly suspicious that it was not done this "easy" way in the first > > place. > > > Good catch! While the mutex was being acquired right after the now > assignment, there was a rare chance of another process modifying the > slot in the meantime. So, I reverted the change in v51. To optimize > the SlotInactiveTimeoutCheckAllowed() call, it's sufficient to check > it here instead of during the 'now' assignment. > > Attached v51 patch-set addressing all comments in [1] and [2]. Few comments: 1) replication_slot_inactive_timeout can be mentioned in logical replication config, we could mention something like: Logical replication slot is also affected by replication_slot_inactive_timeout 2.a) Is this change applicable only for inactive timeout or it is applicable to others like wal removed, wal level etc also? If it is applicable to all of them we could move this to the first patch and update the commit message: + * If the slot can be acquired, do so and mark it as invalidated. If + * the slot is already ours, mark it as invalidated. Otherwise, we'll + * signal the owning process below and retry. */ - if (active_pid == 0) + if (active_pid == 0 || + (MyReplicationSlot == s && + active_pid == MyProcPid)) 2.b) Also this MyReplicationSlot and active_pid check can be in same line: + (MyReplicationSlot == s && + active_pid == MyProcPid)) 3) Error detail should start in upper case here similar to how others are done: + case RS_INVAL_INACTIVE_TIMEOUT: + appendStringInfo(&err_detail, _("inactivity exceeded the time limit set by \"%s\"."), + "replication_slot_inactive_timeout"); + break; 4) Since this change is not related to this patch, we can move this to the first patch and update the commit message: --- a/src/backend/replication/logical/slotsync.c +++ b/src/backend/replication/logical/slotsync.c @@ -1508,7 +1508,7 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t startup_data_len) static void update_synced_slots_inactive_since(void) { - TimestampTz now = 0; + TimestampTz now; /* * We need to update inactive_since only when we are promoting standby to @@ -1523,6 +1523,9 @@ update_synced_slots_inactive_since(void) /* The slot sync worker or SQL function mustn't be running by now */ Assert((SlotSyncCtx->pid == InvalidPid) && !SlotSyncCtx->syncing); + /* Use same inactive_since time for all slots */ + now = GetCurrentTimestamp(); 5) Since this change is not related to this patch, we can move this to the first patch. @@ -2250,6 +2350,7 @@ RestoreSlotFromDisk(const char *name) bool restored = false; int readBytes; pg_crc32c checksum; + TimestampTz now; /* no need to lock here, no concurrent access allowed yet */ @@ -2410,6 +2511,9 @@ RestoreSlotFromDisk(const char *name) NameStr(cp.slotdata.name)), errhint("Change \"wal_level\" to be \"replica\" or higher."))); + /* Use same inactive_since time for all slots */ + now = GetCurrentTimestamp(); + /* nothing can be active yet, don't lock anything */ for (i = 0; i < max_replication_slots; i++) { @@ -2440,9 +2544,11 @@ RestoreSlotFromDisk(const char *name) /* * Set the time since the slot has become inactive after loading the * slot from the disk into memory. Whoever acquires the slot i.e. - * makes the slot active will reset it. + * makes the slot active will reset it. Avoid calling + * ReplicationSlotSetInactiveSince() here, as it will not set the time + * for invalid slots. */ - slot->inactive_since = GetCurrentTimestamp(); + slot->inactive_since = now; [1] - https://www.postgresql.org/docs/current/logical-replication-config.html Regards, Vignesh
pgsql-hackers by date: