Re: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers

From vignesh C
Subject Re: Introduce XID age and inactive timeout based replication slot invalidation
Date
Msg-id CALDaNm2VQW_gpOJ-QWkEA_h18DN31ELEz2_7QmwWCAg9=Zew4A@mail.gmail.com
Whole thread Raw
In response to Re: Introduce XID age and inactive timeout based replication slot invalidation  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Introduce XID age and inactive timeout based replication slot invalidation
List pgsql-hackers
On Wed, 13 Nov 2024 at 15:00, Nisha Moond <nisha.moond412@gmail.com> wrote:
>
> Please find the v48 patch attached.
>
> On Thu, Sep 19, 2024 at 9:40 AM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > When we promote hot standby with synced logical slots to become new
> > primary, the logical slots are never invalidated with
> > 'inactive_timeout' on new primary.  It seems the check in
> > SlotInactiveTimeoutCheckAllowed() is wrong. We should allow
> > invalidation of slots on primary even if they are marked as 'synced'.
>
> fixed.
>
> > I have raised 4 issues so far on v46, the first 3 are in [1],[2],[3].
> > Once all these are addressed, I can continue reviewing further.
> >
>
> Fixed issues reported in [1], [2].

Few comments:
1) Since we don't change the value of now in
ReplicationSlotSetInactiveSince, the function parameter can be passed
by value:
+/*
+ * Set slot's inactive_since property unless it was previously invalidated.
+ */
+static inline void
+ReplicationSlotSetInactiveSince(ReplicationSlot *s, TimestampTz *now,
+                                                               bool
acquire_lock)
+{
+       if (s->data.invalidated != RS_INVAL_NONE)
+               return;
+
+       if (acquire_lock)
+               SpinLockAcquire(&s->mutex);
+
+       s->inactive_since = *now;

2) Currently it allows a minimum value of less than 1 second like in
milliseconds, I feel we can have some minimum value at least something
like checkpoint_timeout:
diff --git a/src/backend/utils/misc/guc_tables.c
b/src/backend/utils/misc/guc_tables.c
index 8a67f01200..367f510118 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3028,6 +3028,18 @@ struct config_int ConfigureNamesInt[] =
                NULL, NULL, NULL
        },

+       {
+               {"replication_slot_inactive_timeout", PGC_SIGHUP,
REPLICATION_SENDING,
+                       gettext_noop("Sets the amount of time a
replication slot can remain inactive before "
+                                                "it will be invalidated."),
+                       NULL,
+                       GUC_UNIT_S
+               },
+               &replication_slot_inactive_timeout,
+               0, 0, INT_MAX,
+               NULL, NULL, NULL
+       },

3) Since SlotInactiveTimeoutCheckAllowed check is just done above and
the current time has been retrieved can we used "now" variable instead
of SlotInactiveTimeoutCheckAllowed again second time:
@@ -1651,6 +1713,26 @@
InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
                                        if (SlotIsLogical(s))
                                                invalidation_cause = cause;
                                        break;
+                               case RS_INVAL_INACTIVE_TIMEOUT:
+
+                                       /*
+                                        * Check if the slot needs to
be invalidated due to
+                                        *
replication_slot_inactive_timeout GUC.
+                                        */
+                                       if
(SlotInactiveTimeoutCheckAllowed(s) &&
+
TimestampDifferenceExceeds(s->inactive_since, now,
+
                            replication_slot_inactive_timeout * 1000))
+                                       {
+                                               invalidation_cause = cause;
+                                               inactive_since =
s->inactive_since;

4) I'm not sure if this change required by this patch or is it a
general optimization, if it is required for this patch we can detail
the comments:
@@ -2208,6 +2328,7 @@ RestoreSlotFromDisk(const char *name)
        bool            restored = false;
        int                     readBytes;
        pg_crc32c       checksum;
+       TimestampTz now;

        /* no need to lock here, no concurrent access allowed yet */

@@ -2368,6 +2489,9 @@ RestoreSlotFromDisk(const char *name)
                                                NameStr(cp.slotdata.name)),
                                 errhint("Change \"wal_level\" to be
\"replica\" or higher.")));

+       /* Use same inactive_since time for all slots */
+       now = GetCurrentTimestamp();
+
        /* nothing can be active yet, don't lock anything */
        for (i = 0; i < max_replication_slots; i++)
        {
@@ -2400,7 +2524,7 @@ RestoreSlotFromDisk(const char *name)
                 * slot from the disk into memory. Whoever acquires
the slot i.e.
                 * makes the slot active will reset it.
                 */
-               slot->inactive_since = GetCurrentTimestamp();
+               slot->inactive_since = now;

5) Why should the slot invalidation be updated during shutdown,
shouldn't the inactive_since value be intact during shutdown?
-        <literal>NULL</literal> if the slot is currently being used.
-        Note that for slots on the standby that are being synced from a
+        <literal>NULL</literal> if the slot is currently being used. Once the
+        slot is invalidated, this value will remain unchanged until we shutdown
+        the server. Note that for slots on the standby that are being
synced from a

6) New Style of ereport does not need braces around errcode, it can be
changed similarly:
+       if (error_if_invalid &&
+               s->data.invalidated == RS_INVAL_INACTIVE_TIMEOUT)
+       {
+               Assert(s->inactive_since > 0);
+               ereport(ERROR,
+
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                errmsg("can no longer get changes
from replication slot \"%s\"",
+                                               NameStr(s->data.name)),
+                                errdetail("This slot has been
invalidated because it was inactive for longer than the amount of time
specified by \"%s\".",
+
"replication_slot_inactive_timeout")));

Regards,
Vignesh



pgsql-hackers by date:

Previous
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: Fix for pageinspect bug in PG 17
Next
From: Peter Smith
Date:
Subject: Re: Skip collecting decoded changes of already-aborted transactions