Re: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers
From | Bertrand Drouvot |
---|---|
Subject | Re: Introduce XID age and inactive timeout based replication slot invalidation |
Date | |
Msg-id | ZgulmKzdqxuBuhCU@ip-10-97-1-34.eu-west-3.compute.internal |
In response to | Re: Introduce XID age and inactive timeout based replication slot invalidation (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses |
Re: Introduce XID age and inactive timeout based replication slot invalidation
Re: Introduce XID age and inactive timeout based replication slot invalidation |
List | pgsql-hackers |
Hi,

On Tue, Apr 02, 2024 at 12:07:54PM +0900, Masahiko Sawada wrote:
> On Mon, Apr 1, 2024 at 12:18 PM Bharath Rupireddy
>
> FWIW, coming to this thread late, I think that the inactive_since
> should not be synchronized from the primary. The wall clocks are
> different on the primary and the standby so having the primary's
> timestamp on the standby can confuse users, especially when there is a
> big clock drift. Also, as Amit mentioned, inactive_since seems not to
> be consistent with other parameters we copy. The
> replication_slot_inactive_timeout feature should work on the standby
> independent from the primary, like other slot invalidation mechanisms,
> and it should be based on its own local clock.

Thanks for sharing your thoughts! So, it looks like most of us agree not to
sync inactive_since from the primary; I'm fine with that.

> If we want to invalidate the synced slots due to the timeout, I think
> we need to define what is "inactive" for synced slots.
>
> Suppose that the slotsync worker updates the local (synced) slot's
> inactive_since whenever releasing the slot, irrespective of the actual
> LSNs (or other slot parameters) having been updated. I think that this
> idea cannot handle a slot that is not acquired on the primary. In this
> case, the remote slot is inactive but the local slot is regarded as
> active. WAL files are piled up on the standby (and on the primary) as
> the slot's LSNs don't move forward. I think we want to regard such a
> slot as "inactive" also on the standby and invalidate it because of
> the timeout.

I think it makes sense to somehow link inactive_since on the standby to
the actual LSNs (or other slot parameters) being updated or not.

> > > Now, the other concern is that calling GetCurrentTimestamp()
> > > could be costly when the values for the slot are not going to be
> > > updated but if that happens we can optimize such that before acquiring
> > > the slot we can have some minimal pre-checks to ensure whether we need
> > > to update the slot or not.
>
> If we use such pre-checks, another problem might happen; it cannot
> handle a case where the slot is acquired on the primary but its LSNs
> don't move forward. Imagine a logical replication conflict happened on
> the subscriber, and the logical replication enters the retry loop. In
> this case, the remote slot's inactive_since gets updated for every
> retry, but it looks inactive from the standby since the slot LSNs
> don't change. Therefore, only the local slot could be invalidated due
> to the timeout but probably we don't want to regard such a slot as
> "inactive".
>
> Another idea I came up with is that the slotsync worker updates the
> local slot's inactive_since to the local timestamp only when the
> remote slot might have got inactive. If the remote slot is acquired by
> someone, the local slot's inactive_since is also NULL. If the remote
> slot gets inactive, the slotsync worker sets the local timestamp to
> the local slot's inactive_since. Since the remote slot could be
> acquired and released before the slotsync worker gets the remote slot
> data again, if the remote slot's inactive_since > the local slot's
> inactive_since, the slotsync worker updates the local one.

Then I think we would need to be careful about time zone comparison.

> IOW, we
> detect whether the remote slot was acquired and released since the
> last synchronization, by checking the remote slot's inactive_since.
> This idea seems to handle these cases I mentioned unless I'm missing
> something, but it requires for the slotsync worker to update
> inactive_since in a different way than other parameters.
>
> Or a simple solution is that the slotsync worker updates
> inactive_since as it does for non-synced slots, and disables
> timeout-based slot invalidation for synced slots.

Yeah, I think the main question to help us decide is: do we want to
invalidate "inactive" synced slots locally (in addition to synchronizing
the invalidation from the primary)?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
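
For readers following the thread, the "another idea" quoted above (set the local inactive_since from the local clock only when the remote slot might have become inactive) can be sketched roughly as follows. This is an illustrative Python model, not PostgreSQL code: the function name sync_inactive_since and its arguments are hypothetical, and the real slotsync worker is written in C. None stands in for a NULL inactive_since. Requiring timezone-aware timestamps is one assumed way to address the time zone concern raised in the reply, since aware values compare as absolute instants.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def sync_inactive_since(
    remote_inactive_since: Optional[datetime],
    local_inactive_since: Optional[datetime],
    local_now: datetime,
) -> Optional[datetime]:
    """Return the new local inactive_since for a synced slot (hypothetical helper).

    None models NULL, i.e. the slot is currently acquired.
    """
    # Refuse naive timestamps so the comparison below is between absolute
    # instants rather than wall-clock readings in different time zones.
    for ts in (remote_inactive_since, local_inactive_since, local_now):
        if ts is not None and ts.utcoffset() is None:
            raise ValueError("timestamps must be timezone-aware")

    # Remote slot is acquired by someone: the local slot is not inactive either.
    if remote_inactive_since is None:
        return None

    # Remote slot is inactive. Stamp the local slot with the *local* clock if
    # it was not marked inactive yet, or if the remote value is newer than the
    # local one (the remote slot was acquired and released between two syncs).
    if local_inactive_since is None or remote_inactive_since > local_inactive_since:
        return local_now

    # Nothing changed on the remote side: keep the existing local value so the
    # inactivity period keeps accumulating.
    return local_inactive_since
```

Note that the comparison still mixes a primary-side timestamp with a standby-side one, so this sketch does not protect against clock drift between the two servers; it only removes time zone ambiguity.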