Re: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers
From | shveta malik |
---|---|
Subject | Re: Introduce XID age and inactive timeout based replication slot invalidation |
Date | |
Msg-id | CAJpy0uBthb-RG=Xg+mM-G6Xy01JLcBsizJuz+xbti08qP_W7AQ@mail.gmail.com Whole thread Raw |
In response to | Re: Introduce XID age and inactive timeout based replication slot invalidation (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>) |
Responses |
Re: Introduce XID age and inactive timeout based replication slot invalidation
|
List | pgsql-hackers |
On Tue, Apr 2, 2024 at 11:58 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Tue, Apr 02, 2024 at 12:07:54PM +0900, Masahiko Sawada wrote: > > On Mon, Apr 1, 2024 at 12:18 PM Bharath Rupireddy > > > > FWIW, coming to this thread late, I think that the inactive_since > > should not be synchronized from the primary. The wall clocks are > > different on the primary and the standby so having the primary's > > timestamp on the standby can confuse users, especially when there is a > > big clock drift. Also, as Amit mentioned, inactive_since seems not to > > be consistent with other parameters we copy. The > > replication_slot_inactive_timeout feature should work on the standby > > independent from the primary, like other slot invalidation mechanisms, > > and it should be based on its own local clock. > > Thanks for sharing your thoughts! So, it looks like that most of us agree to not > sync inactive_since from the primary, I'm fine with that. +1 on not syncing slots from primary. > > If we want to invalidate the synced slots due to the timeout, I think > > we need to define what is "inactive" for synced slots. > > > > Suppose that the slotsync worker updates the local (synced) slot's > > inactive_since whenever releasing the slot, irrespective of the actual > > LSNs (or other slot parameters) having been updated. I think that this > > idea cannot handle a slot that is not acquired on the primary. In this > > case, the remote slot is inactive but the local slot is regarded as > > active. WAL files are piled up on the standby (and on the primary) as > > the slot's LSNs don't move forward. I think we want to regard such a > > slot as "inactive" also on the standby and invalidate it because of > > the timeout. > > I think that makes sense to somehow link inactive_since on the standby to > the actual LSNs (or other slot parameters) being updated or not. > > > > > Now, the other concern is that calling GetCurrentTimestamp() > > > > could be costly when the values for the slot are not going to be > > > > updated but if that happens we can optimize such that before acquiring > > > > the slot we can have some minimal pre-checks to ensure whether we need > > > > to update the slot or not. > > > > If we use such pre-checks, another problem might happen; it cannot > > handle a case where the slot is acquired on the primary but its LSNs > > don't move forward. Imagine a logical replication conflict happened on > > the subscriber, and the logical replication enters the retry loop. In > > this case, the remote slot's inactive_since gets updated for every > > retry, but it looks inactive from the standby since the slot LSNs > > don't change. Therefore, only the local slot could be invalidated due > > to the timeout but probably we don't want to regard such a slot as > > "inactive". > > > > Another idea I came up with is that the slotsync worker updates the > > local slot's inactive_since to the local timestamp only when the > > remote slot might have got inactive. If the remote slot is acquired by > > someone, the local slot's inactive_since is also NULL. If the remote > > slot gets inactive, the slotsync worker sets the local timestamp to > > the local slot's inactive_since. Since the remote slot could be > > acquired and released before the slotsync worker gets the remote slot > > data again, if the remote slot's inactive_since > the local slot's > > inactive_since, the slotsync worker updates the local one. > > Then I think we would need to be careful about time zone comparison. Yes. Also we need to consider the case when a user is relying on pg_sync_replication_slots() and has not enabled slot-sync worker. In such a case if synced slot's inactive_since is derived from inactivity of remote-slot, it might not be that frequently updated (based on when the user actually runs the SQL function) and thus may be misleading. OTOH, if inactivty_since of synced slots represents its own inactivity, then it will give correct info even for the case when the SQL function is run after a long time and slot-sync worker is disabled. > > IOW, we > > detect whether the remote slot was acquired and released since the > > last synchronization, by checking the remote slot's inactive_since. > > This idea seems to handle these cases I mentioned unless I'm missing > > something, but it requires for the slotsync worker to update > > inactive_since in a different way than other parameters. > > > > Or a simple solution is that the slotsync worker updates > > inactive_since as it does for non-synced slots, and disables > > timeout-based slot invalidation for synced slots. I like this idea better, it takes care of such a case too when the user is relying on sync-function rather than worker and does not want to get the slots invalidated in between 2 sync function calls. > Yeah, I think the main question to help us decide is: do we want to invalidate > "inactive" synced slots locally (in addition to synchronizing the invalidation > from the primary)? thanks Shveta
pgsql-hackers by date: