Re: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers

From Bertrand Drouvot
Subject Re: Introduce XID age and inactive timeout based replication slot invalidation
Date
Msg-id Zg0QgarM9UlRutLw@ip-10-97-1-34.eu-west-3.compute.internal
Whole thread Raw
In response to Re: Introduce XID age and inactive timeout based replication slot invalidation  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses Re: Introduce XID age and inactive timeout based replication slot invalidation
List pgsql-hackers
Hi,

On Wed, Apr 03, 2024 at 11:17:41AM +0530, Bharath Rupireddy wrote:
> On Wed, Apr 3, 2024 at 8:38 AM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > > > Or a simple solution is that the slotsync worker updates
> > > > inactive_since as it does for non-synced slots, and disables
> > > > timeout-based slot invalidation for synced slots.
> >
> > I like this idea better, it takes care of such a case too when the
> > user is relying on sync-function rather than worker and does not want
> > to get the slots invalidated in between 2 sync function calls.
> 
> Please find the attached v31 patches implementing the above idea:

Thanks!

Some comments regarding v31-0002:

=== testing the behavior

T1 ===

> - synced slots don't get invalidated due to inactive timeout because
> such slots not considered active at all as they don't perform logical
> decoding (of course, they will perform in fast_forward mode to fix the
> other data loss issue, but they don't generate changes for them to be
> called as *active* slots)

It behaves as described. OTOH non synced logical slots on the standby and
physical slots on the standby are invalidated which is what is expected.

T2 ===

In case the slot is invalidated on the primary,

primary:

postgres=# select slot_name, inactive_since, invalidation_reason from pg_replication_slots where slot_name = 's1';
 slot_name |        inactive_since         | invalidation_reason
-----------+-------------------------------+---------------------
 s1        | 2024-04-03 06:56:28.075637+00 | inactive_timeout

then on the standby we get:

standby:

postgres=# select slot_name, inactive_since, invalidation_reason from pg_replication_slots where slot_name = 's1';
 slot_name |        inactive_since        | invalidation_reason
-----------+------------------------------+---------------------
 s1        | 2024-04-03 07:06:43.37486+00 | inactive_timeout

shouldn't the slot be dropped/recreated instead of updating inactive_since?

=== code

CR1 ===

+        Invalidates replication slots that are inactive for longer the
+        specified amount of time

s/for longer the/for longer that/?

CR2 ===

+        <literal>true</literal>) as such synced slots don't actually perform
+        logical decoding.

We're switching in fast forward logical due to [1], so I'm not sure that's 100%
accurate here. I'm not sure we need to specify a reason.

CR3 ===

+ errdetail("This slot has been invalidated because it was inactive for more than the time specified by
replication_slot_inactive_timeoutparameter.")));
 

I think we can remove "parameter" (see for example the error message in
validate_remote_info()) and reduce it a bit, something like?

"This slot has been invalidated because it was inactive for more than replication_slot_inactive_timeout"?

CR4 ===

+ appendStringInfoString(&err_detail, _("The slot has been inactive for more than the time specified by
replication_slot_inactive_timeoutparameter."));
 

Same.

CR5 ===

+       /*
+        * This function isn't expected to be called for inactive timeout based
+        * invalidation. A separate function InvalidateInactiveReplicationSlot is
+        * to be used for that.

Do you think it's worth to explain why?

CR6 ===

+       if (replication_slot_inactive_timeout == 0)
+               return false;
+       else if (slot->inactive_since > 0)

"else" is not needed here.

CR7 ===

+               SpinLockAcquire(&slot->mutex);
+
+               /*
+                * Check if the slot needs to be invalidated due to
+                * replication_slot_inactive_timeout GUC. We do this with the spinlock
+                * held to avoid race conditions -- for example the inactive_since
+                * could change, or the slot could be dropped.
+                */
+               now = GetCurrentTimestamp();

We should not call GetCurrentTimestamp() while holding a spinlock.

CR8 ===

+# Testcase start: Invalidate streaming standby's slot as well as logical
+# failover slot on primary due to inactive timeout GUC. Also, check the logical

s/inactive timeout GUC/replication_slot_inactive_timeout/?

CR9 ===

+# Start: Helper functions used for this test file
+# End: Helper functions used for this test file

I think that's the first TAP test with this comment. Not saying we should not but
why did you feel the need to add those?

[1]:
https://www.postgresql.org/message-id/OS0PR01MB5716B3942AE49F3F725ACA92943B2@OS0PR01MB5716.jpnprd01.prod.outlook.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: [HACKERS] make async slave to wait for lsn to be replayed
Next
From: Daniel Gustafsson
Date:
Subject: Re: RFC: Additional Directory for Extensions