Re: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers

From Bertrand Drouvot
Subject Re: Introduce XID age and inactive timeout based replication slot invalidation
Date
Msg-id ZgulmKzdqxuBuhCU@ip-10-97-1-34.eu-west-3.compute.internal
In response to Re: Introduce XID age and inactive timeout based replication slot invalidation  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Introduce XID age and inactive timeout based replication slot invalidation
Re: Introduce XID age and inactive timeout based replication slot invalidation
List pgsql-hackers
Hi,

On Tue, Apr 02, 2024 at 12:07:54PM +0900, Masahiko Sawada wrote:
> On Mon, Apr 1, 2024 at 12:18 PM Bharath Rupireddy
> 
> FWIW, coming to this thread late, I think that the inactive_since
> should not be synchronized from the primary. The wall clocks are
> different on the primary and the standby so having the primary's
> timestamp on the standby can confuse users, especially when there is a
> big clock drift. Also, as Amit mentioned, inactive_since seems not to
> be consistent with other parameters we copy. The
> replication_slot_inactive_timeout feature should work on the standby
> independent from the primary, like other slot invalidation mechanisms,
> and it should be based on its own local clock.

Thanks for sharing your thoughts! So it looks like most of us agree not to
sync inactive_since from the primary; I'm fine with that.

> If we want to invalidate the synced slots due to the timeout, I think
> we need to define what is "inactive" for synced slots.
> 
> Suppose that the slotsync worker updates the local (synced) slot's
> inactive_since whenever releasing the slot, irrespective of the actual
> LSNs (or other slot parameters) having been updated. I think that this
> idea cannot handle a slot that is not acquired on the primary. In this
> case, the remote slot is inactive but the local slot is regarded as
> active.  WAL files are piled up on the standby (and on the primary) as
> the slot's LSNs don't move forward. I think we want to regard such a
> slot as "inactive" also on the standby and invalidate it because of
> the timeout.

I think it makes sense to somehow link inactive_since on the standby to
whether the actual LSNs (or other slot parameters) are being updated or not.
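
To make that a bit more concrete, here is a rough sketch (Python, not
PostgreSQL code) of what "linking inactive_since to the LSNs" could mean on
the standby. Everything in it is an assumption for illustration only: the
SyncedSlot class, apply_sync() and timed_out() are hypothetical names, and
only restart_lsn and confirmed_flush_lsn are compared.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class SyncedSlot:
    # Hypothetical, simplified view of a synced slot on the standby.
    restart_lsn: int
    confirmed_flush_lsn: int
    inactive_since: datetime  # taken from the standby's own clock


def apply_sync(slot: SyncedSlot, remote_restart_lsn: int,
               remote_confirmed_flush_lsn: int, local_now: datetime) -> bool:
    """Copy the remote LSNs and refresh inactive_since only if they moved."""
    moved = (remote_restart_lsn, remote_confirmed_flush_lsn) != (
        slot.restart_lsn, slot.confirmed_flush_lsn)
    if moved:
        slot.restart_lsn = remote_restart_lsn
        slot.confirmed_flush_lsn = remote_confirmed_flush_lsn
        # The slot made progress, so restart the inactivity clock locally.
        slot.inactive_since = local_now
    return moved


def timed_out(slot: SyncedSlot, local_now: datetime,
              timeout: timedelta) -> bool:
    """True when the slot has made no progress for longer than the timeout."""
    return local_now - slot.inactive_since > timeout
```

With this approach a slot whose LSNs do not move would eventually time out
on the standby, even if the slotsync worker keeps acquiring and releasing it.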

> > > Now, the other concern is that calling GetCurrentTimestamp()
> > > could be costly when the values for the slot are not going to be
> > > updated but if that happens we can optimize such that before acquiring
> > > the slot we can have some minimal pre-checks to ensure whether we need
> > > to update the slot or not.
> 
> If we use such pre-checks, another problem might happen; it cannot
> handle a case where the slot is acquired on the primary but its LSNs
> don't move forward. Imagine a logical replication conflict happened on
> the subscriber, and the logical replication enters the retry loop. In
> this case, the remote slot's inactive_since gets updated for every
> retry, but it looks inactive from the standby since the slot LSNs
> don't change. Therefore, only the local slot could be invalidated due
> to the timeout but probably we don't want to regard such a slot as
> "inactive".
> 
> Another idea I came up with is that the slotsync worker updates the
> local slot's inactive_since to the local timestamp only when the
> remote slot might have got inactive. If the remote slot is acquired by
> someone, the local slot's inactive_since is also NULL. If the remote
> slot gets inactive, the slotsync worker sets the local timestamp to
> the local slot's inactive_since. Since the remote slot could be
> acquired and released before the slotsync worker gets the remote slot
> data again, if the remote slot's inactive_since > the local slot's
> inactive_since, the slotsync worker updates the local one.

Then I think we would need to be careful about time zone comparison.
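
For illustration, the rule described above could be sketched as follows
(Python, not PostgreSQL code). sync_inactive_since() is a hypothetical helper,
and the sketch assumes that both timestamps are timezone-aware so that the
comparison is done on absolute instants rather than on wall-clock strings. It
does nothing about clock drift between the two servers.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def sync_inactive_since(
    remote_inactive_since: Optional[datetime],
    local_inactive_since: Optional[datetime],
    local_now: datetime,
) -> Optional[datetime]:
    """Return the new local inactive_since for a synced slot.

    None means "slot is considered active". All datetimes must be
    timezone-aware; Python raises TypeError when an aware value is
    compared with a naive one.
    """
    if remote_inactive_since is None:
        # Remote slot is acquired by someone: local slot is active too.
        return None
    if local_inactive_since is None:
        # Remote slot just became inactive: stamp with the standby's clock.
        return local_now
    if remote_inactive_since > local_inactive_since:
        # Remote slot was acquired and released since the last sync.
        return local_now
    # Nothing happened on the remote slot: keep the existing local value.
    return local_inactive_since
```

Aware datetimes with different UTC offsets compare by instant, so a remote
value reported as 08:00+02:00 and a local value of 06:01+00:00 are ordered
correctly (the remote one is earlier).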

> IOW, we
> detect whether the remote slot was acquired and released since the
> last synchronization, by checking the remote slot's inactive_since.
> This idea seems to handle these cases I mentioned unless I'm missing
> something, but it requires for the slotsync worker to update
> inactive_since in a different way than other parameters.
> 
> Or a simple solution is that the slotsync worker updates
> inactive_since as it does for non-synced slots, and disables
> timeout-based slot invalidation for synced slots.

Yeah, I think the main question to help us decide is: do we want to invalidate
"inactive" synced slots locally (in addition to synchronizing the invalidation
from the primary)? 

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


