Re: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers
From | Bertrand Drouvot |
---|---|
Subject | Re: Introduce XID age and inactive timeout based replication slot invalidation |
Date | |
Msg-id | Zg0QgarM9UlRutLw@ip-10-97-1-34.eu-west-3.compute.internal Whole thread Raw |
In response to | Re: Introduce XID age and inactive timeout based replication slot invalidation (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>) |
Responses |
Re: Introduce XID age and inactive timeout based replication slot invalidation
|
List | pgsql-hackers |
Hi, On Wed, Apr 03, 2024 at 11:17:41AM +0530, Bharath Rupireddy wrote: > On Wed, Apr 3, 2024 at 8:38 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > Or a simple solution is that the slotsync worker updates > > > > inactive_since as it does for non-synced slots, and disables > > > > timeout-based slot invalidation for synced slots. > > > > I like this idea better, it takes care of such a case too when the > > user is relying on sync-function rather than worker and does not want > > to get the slots invalidated in between 2 sync function calls. > > Please find the attached v31 patches implementing the above idea: Thanks! Some comments regarding v31-0002: === testing the behavior T1 === > - synced slots don't get invalidated due to inactive timeout because > such slots not considered active at all as they don't perform logical > decoding (of course, they will perform in fast_forward mode to fix the > other data loss issue, but they don't generate changes for them to be > called as *active* slots) It behaves as described. OTOH non synced logical slots on the standby and physical slots on the standby are invalidated which is what is expected. T2 === In case the slot is invalidated on the primary, primary: postgres=# select slot_name, inactive_since, invalidation_reason from pg_replication_slots where slot_name = 's1'; slot_name | inactive_since | invalidation_reason -----------+-------------------------------+--------------------- s1 | 2024-04-03 06:56:28.075637+00 | inactive_timeout then on the standby we get: standby: postgres=# select slot_name, inactive_since, invalidation_reason from pg_replication_slots where slot_name = 's1'; slot_name | inactive_since | invalidation_reason -----------+------------------------------+--------------------- s1 | 2024-04-03 07:06:43.37486+00 | inactive_timeout shouldn't the slot be dropped/recreated instead of updating inactive_since? === code CR1 === + Invalidates replication slots that are inactive for longer the + specified amount of time s/for longer the/for longer that/? CR2 === + <literal>true</literal>) as such synced slots don't actually perform + logical decoding. We're switching in fast forward logical due to [1], so I'm not sure that's 100% accurate here. I'm not sure we need to specify a reason. CR3 === + errdetail("This slot has been invalidated because it was inactive for more than the time specified by replication_slot_inactive_timeoutparameter."))); I think we can remove "parameter" (see for example the error message in validate_remote_info()) and reduce it a bit, something like? "This slot has been invalidated because it was inactive for more than replication_slot_inactive_timeout"? CR4 === + appendStringInfoString(&err_detail, _("The slot has been inactive for more than the time specified by replication_slot_inactive_timeoutparameter.")); Same. CR5 === + /* + * This function isn't expected to be called for inactive timeout based + * invalidation. A separate function InvalidateInactiveReplicationSlot is + * to be used for that. Do you think it's worth to explain why? CR6 === + if (replication_slot_inactive_timeout == 0) + return false; + else if (slot->inactive_since > 0) "else" is not needed here. CR7 === + SpinLockAcquire(&slot->mutex); + + /* + * Check if the slot needs to be invalidated due to + * replication_slot_inactive_timeout GUC. We do this with the spinlock + * held to avoid race conditions -- for example the inactive_since + * could change, or the slot could be dropped. + */ + now = GetCurrentTimestamp(); We should not call GetCurrentTimestamp() while holding a spinlock. CR8 === +# Testcase start: Invalidate streaming standby's slot as well as logical +# failover slot on primary due to inactive timeout GUC. Also, check the logical s/inactive timeout GUC/replication_slot_inactive_timeout/? CR9 === +# Start: Helper functions used for this test file +# End: Helper functions used for this test file I think that's the first TAP test with this comment. Not saying we should not but why did you feel the need to add those? [1]: https://www.postgresql.org/message-id/OS0PR01MB5716B3942AE49F3F725ACA92943B2@OS0PR01MB5716.jpnprd01.prod.outlook.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: