Re: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers

From vignesh C
Subject Re: Introduce XID age and inactive timeout based replication slot invalidation
Date
Msg-id CALDaNm1F2YrswzM_WM37BYmiZ9Cf60UD_mgtm8HnMHRGA7tx4g@mail.gmail.com
In response to Re: Introduce XID age and inactive timeout based replication slot invalidation  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Wed, 27 Nov 2024 at 16:25, Nisha Moond <nisha.moond412@gmail.com> wrote:
>
> On Wed, Nov 27, 2024 at 8:39 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Hi Nisha,
> >
> > Here are some review comments for the patch v50-0002.
> >
> > ======
> > src/backend/replication/slot.c
> >
> > InvalidatePossiblyObsoleteSlot:
> >
> > 1.
> > + if (now &&
> > + TimestampDifferenceExceeds(s->inactive_since, now,
> > +    replication_slot_inactive_timeout_sec * 1000))
> >
> > Previously this was using an additional call to SlotInactiveTimeoutCheckAllowed:
> >
> > + if (SlotInactiveTimeoutCheckAllowed(s) &&
> > + TimestampDifferenceExceeds(s->inactive_since, now,
> > +    replication_slot_inactive_timeout * 1000))
> >
> > Is it OK to skip that call? e.g. can the slot fields possibly change
> > between assigning the 'now' and acquiring the mutex? If not, then the
> > current code is fine. The only reason for asking is because it is
> > slightly suspicious that it was not done this "easy" way in the first
> > place.
> >
> Good catch! While the mutex was being acquired right after the now
> assignment, there was a rare chance of another process modifying the
> slot in the meantime. So, I reverted the change in v51. To optimize
> the SlotInactiveTimeoutCheckAllowed() call, it's sufficient to check
> it here instead of during the 'now' assignment.
>
> Attached v51 patch-set addressing all comments in [1] and [2].
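To illustrate the race described above, here is a minimal, self-contained sketch (hypothetical DemoSlot type and names, not the actual PostgreSQL structures): the per-slot state is re-checked only after the mutex is acquired, so a concurrent change between the 'now' snapshot and the lock acquisition cannot be missed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a replication slot; not the real struct. */
typedef struct DemoSlot
{
	pthread_mutex_t mutex;
	bool		checkable;		/* stand-in for SlotInactiveTimeoutCheckAllowed() */
	long		inactive_since;	/* arbitrary time units */
} DemoSlot;

/*
 * Safe pattern: the 'now' snapshot may be taken beforehand, but the
 * slot's eligibility is evaluated only under the mutex.  If another
 * process flips 'checkable' between the snapshot and the lock, we
 * still see the up-to-date value here.
 */
static bool
slot_timed_out(DemoSlot *s, long now, long timeout)
{
	bool		timed_out = false;

	pthread_mutex_lock(&s->mutex);
	if (s->checkable && now - s->inactive_since > timeout)
		timed_out = true;
	pthread_mutex_unlock(&s->mutex);

	return timed_out;
}
```

The point is only the ordering: per-slot state that can change concurrently must be read under the lock, while the timestamp snapshot taken beforehand stays valid.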

A few comments:
1) replication_slot_inactive_timeout could be mentioned in the logical
replication configuration documentation; we could add something like:
Logical replication slots are also affected by replication_slot_inactive_timeout.

2.a) Is this change applicable only to the inactive timeout, or does
it also apply to other invalidation causes like WAL removed,
wal_level, etc.? If it applies to all of them, we could move this
change to the first patch and update the commit message:
+                * If the slot can be acquired, do so and mark it as invalidated. If
+                * the slot is already ours, mark it as invalidated. Otherwise, we'll
+                * signal the owning process below and retry.
                 */
-               if (active_pid == 0)
+               if (active_pid == 0 ||
+                       (MyReplicationSlot == s &&
+                        active_pid == MyProcPid))

2.b) Also, the MyReplicationSlot and active_pid checks can be on the same line:
+                       (MyReplicationSlot == s &&
+                        active_pid == MyProcPid))


3) The error detail should start in upper case here, similar to how
the others are done, i.e. "Inactivity exceeded ...":
+                       case RS_INVAL_INACTIVE_TIMEOUT:
+                               appendStringInfo(&err_detail, _("inactivity exceeded the time limit set by \"%s\"."),
+                                                "replication_slot_inactive_timeout");
+                               break;

4) Since this change is not related to this patch, we can move this to
the first patch and update the commit message:
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -1508,7 +1508,7 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t startup_data_len)
 static void
 update_synced_slots_inactive_since(void)
 {
-       TimestampTz now = 0;
+       TimestampTz now;

        /*
         * We need to update inactive_since only when we are promoting standby to
@@ -1523,6 +1523,9 @@ update_synced_slots_inactive_since(void)
        /* The slot sync worker or SQL function mustn't be running by now */
        Assert((SlotSyncCtx->pid == InvalidPid) && !SlotSyncCtx->syncing);

+       /* Use same inactive_since time for all slots */
+       now = GetCurrentTimestamp();

5) Since this change is not related to this patch, we can move this to
the first patch.
@@ -2250,6 +2350,7 @@ RestoreSlotFromDisk(const char *name)
        bool            restored = false;
        int                     readBytes;
        pg_crc32c       checksum;
+       TimestampTz now;

        /* no need to lock here, no concurrent access allowed yet */

@@ -2410,6 +2511,9 @@ RestoreSlotFromDisk(const char *name)
                                                NameStr(cp.slotdata.name)),
                                 errhint("Change \"wal_level\" to be \"replica\" or higher.")));

+       /* Use same inactive_since time for all slots */
+       now = GetCurrentTimestamp();
+
        /* nothing can be active yet, don't lock anything */
        for (i = 0; i < max_replication_slots; i++)
        {
@@ -2440,9 +2544,11 @@ RestoreSlotFromDisk(const char *name)
                /*
                 * Set the time since the slot has become inactive after loading the
                 * slot from the disk into memory. Whoever acquires the slot i.e.
-                * makes the slot active will reset it.
+                * makes the slot active will reset it. Avoid calling
+                * ReplicationSlotSetInactiveSince() here, as it will not set the time
+                * for invalid slots.
                 */
-               slot->inactive_since = GetCurrentTimestamp();
+               slot->inactive_since = now;
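As a side note on the hunks in comments 4 and 5: the idea behind "use same inactive_since time for all slots" is simply to read the clock once before the loop and stamp every slot with that single value. A minimal sketch (hypothetical names, plain C rather than the actual RestoreSlotFromDisk code):

```c
#include <assert.h>

#define DEMO_NSLOTS 4

/* Hypothetical slot array; only the inactive_since field is modeled. */
static long demo_inactive_since[DEMO_NSLOTS];

/*
 * Take the timestamp once, outside the loop, and reuse it, so that all
 * slots restored in one pass carry the identical inactive_since value
 * and the clock is not queried per iteration.
 */
static void
demo_restore_slots(long now)
{
	for (int i = 0; i < DEMO_NSLOTS; i++)
		demo_inactive_since[i] = now;
}
```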

[1] - https://www.postgresql.org/docs/current/logical-replication-config.html

Regards,
Vignesh


