Re: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers

From Peter Smith
Subject Re: Introduce XID age and inactive timeout based replication slot invalidation
Date
Msg-id CAHut+Pvx294U-XBB6-BvabesUNxbnuDQmk-VOFm=pbcNWSsHvQ@mail.gmail.com
Whole thread Raw
In response to Re: Introduce XID age and inactive timeout based replication slot invalidation  (vignesh C <vignesh21@gmail.com>)
Responses Re: Introduce XID age and inactive timeout based replication slot invalidation
List pgsql-hackers
Hi Nisha.

Here are some review comments for patch v54-0002.

(I had also checked patch v54-0001, but have no further review
comments for that one).

======
doc/src/sgml/config.sgml

1.
+       <para>
+        Slot invalidation due to idle timeout occurs during checkpoint.
+        If the <varname>checkpoint_timeout</varname> exceeds
+        <varname>idle_replication_slot_timeout</varname>, the slot
+        invalidation will be delayed until the next checkpoint is triggered.
+        To avoid delays, users can force a checkpoint to promptly invalidate
+        inactive slots. The duration of slot inactivity is calculated
using the slot's
+        <link
linkend="view-pg-replication-slots">pg_replication_slots</link>.<structfield>inactive_since</structfield>
+        value.
+       </para>
+

The wording of "If the checkpoint_timeout exceeds
idle_replication_slot_timeout, the slot invalidation will be delayed
until the next checkpoint is triggered." seems slightly misleading,
because AFAIK it is not conditional on the GUC value differences like
that -- i.e. slot invalidation is *always* delayed until the next
checkpoint occurs.

SUGGESTION:
Slot invalidation due to idle timeout occurs during checkpoint.
Because checkpoints happen at checkpoint_timeout intervals, there can
be some lag between when the idle_replication_slot_timeout was
exceeded and when the slot invalidation is triggered at the next
checkpoint. To avoid such lags, users can force...

=======
src/backend/replication/slot.c

2. GENERAL

+/* Invalidate replication slots idle beyond this time; '0' disables it */
+int idle_replication_slot_timeout_ms = 0;

I noticed this patch is using a variety of ways of describing the same thing:
* guc var: Invalidate replication slots idle beyond this time...
* guc_tables: ... the amount of time a replication slot can remain
idle before it will be invalidated.
* docs: means that the slot has remained idle beyond the duration
specified by the idle_replication_slot_timeout parameter
* errmsg: ... slot has been invalidated because inactivity exceeded
the time limit set by ...
* etc..

They are all the same, but they are all worded slightly differently:
* "idle" vs "inactivity" vs ...
* "time" vs "amount of time" vs "duration" vs "time limit" vs ...

There may not be a one-size-fits-all, but still, it might be better to
try to search for all different phrasing and use common wording as
much as possible.

~~~

CheckPointReplicationSlots:

3.
+ * XXX: Slot invalidation due to 'idle_timeout' occurs only for
+ * released slots, based on 'idle_replication_slot_timeout'. Active
+ * slots in use for replication are excluded, preventing accidental
+ * invalidation. Slots where communication between the publisher and
+ * subscriber is down are also excluded, as they are managed by the
+ * 'wal_sender_timeout'.

Maybe a slight rewording like below is better. Maybe not. YMMV.

SUGGESTION:
XXX: Slot invalidation due to 'idle_timeout' applies only to released
slots, and is based on the 'idle_replication_slot_timeout' GUC. Active
slots
currently in use for replication are excluded to prevent accidental
invalidation.  Slots...

======
src/bin/pg_upgrade/server.c

4.
+ /*
+ * Use idle_replication_slot_timeout=0 to prevent slot invalidation due to
+ * inactive_timeout by checkpointer process during upgrade.
+ */
+ if (GET_MAJOR_VERSION(cluster->major_version) >= 1800)
+ appendPQExpBufferStr(&pgoptions, " -c idle_replication_slot_timeout=0");
+

/inactive_timeout/idle_timeout/

======
src/test/recovery/t/043_invalidate_inactive_slots.pl

5.
+# Wait for slot to first become idle and then get invalidated
+sub wait_for_slot_invalidation
+{
+ my ($node, $slot, $offset, $idle_timeout) = @_;
+ my $node_name = $node->name;

AFAICT this 'idle_timeout' parameter is passed units of "seconds", so
it would be better to call it something like 'idle_timeout_s' to make
the units clear.

~~~

6.
+# Trigger slot invalidation and confirm it in the server log
+sub trigger_slot_invalidation
+{
+ my ($node, $slot, $offset, $idle_timeout) = @_;
+ my $node_name = $node->name;
+ my $invalidated = 0;

Ditto above review comment #5 -- better to call it something like
'idle_timeout_s' to make the units clear.

======
Kind Regards,
Peter Smith.
Fujitsu Australia



pgsql-hackers by date:

Previous
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: Parallel heap vacuum
Next
From: Dilip Kumar
Date:
Subject: Re: Skip collecting decoded changes of already-aborted transactions